Commit 70befb8

Browse files
committed
docs: add quick reference guide for DP-442038 workaround

Create condensed guide for IRIS EMBEDDING model caching solution:

- Problem: 720x slowdown from repeated model loading
- Solution: Python embedding cache layer (Feature 051)
- Result: 1405x speedup (20 min → 0.85s for 1,746 docs)

Guide includes:

- Quick overview of the issue and fix
- Usage examples with code snippets
- Performance comparison table
- Architecture diagrams
- Troubleshooting section
- References to full documentation

Complements existing comprehensive report (IRIS_EMBEDDING_PERFORMANCE_REPORT.md) with a more accessible quick-start format.

# Quick Guide: DP-442038 Workaround (720x Model Loading Overhead)

**JIRA Issue**: DP-442038
**Problem**: IRIS EMBEDDING reloads the model from disk on every document insert
**Impact**: 720x slowdown (20 minutes for 1,746 documents)
**Solution**: Python embedding cache layer (Feature 051)
**Result**: 1405x speedup (0.85 seconds for the same workload)

---

## The Problem

IRIS native `%Embedding.SentenceTransformers` reloads the embedding model **on every INSERT**:

```sql
-- This triggers model reload for EACH document
INSERT INTO documents (id, content) VALUES (1, 'text...');
-- Model loads from disk (400MB), generates embedding, then exits
INSERT INTO documents (id, content) VALUES (2, 'text...');
-- Model loads AGAIN from disk (400MB), generates embedding, exits
```

**Bottleneck**: 1,746 documents × 400MB model ≈ 698GB of disk I/O and ~20 minutes of load time

---

## The Workaround: Python Embedding Cache

We intercept IRIS EMBEDDING calls and cache models in Python memory:

### Architecture

```
IRIS SQL INSERT
      ↓
%Embedding.SentenceTransformers
      ↓
iris_vector_rag.embeddings.iris_embedding (cache layer)
      ↓
Cached model in memory (99%+ cache hits)
      ↓
Return embedding to IRIS
```

### Key Files

1. **`iris_vector_rag/embeddings/manager.py`** - Model cache manager (singleton pattern)
2. **`iris_vector_rag/embeddings/iris_embedding.py`** - IRIS integration layer
3. **`iris_vector_rag/config/embedding_config.py`** - Configuration validation

---

## Usage

### 1. Configure IRIS EMBEDDING

```python
from iris_vector_rag.embeddings.iris_embedding import configure_embedding

# Configure embedding model in IRIS
configure_embedding(
    connection,
    config_name="medical_embeddings",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    dimension=384,
    device="cpu",    # or "cuda" for GPU
    use_cache=True   # CRITICAL: enables caching
)
```

### 2. Create Table with EMBEDDING Column

```sql
CREATE TABLE RAG.Documents (
    doc_id VARCHAR(255),
    content VARCHAR(5000),
    embedding VECTOR(DOUBLE, 384) EMBEDDING('medical_embeddings')
)
```

### 3. Insert Documents (Cache Automatically Applied)

```python
# First insert: Model loads (2-3 seconds)
cursor.execute(
    "INSERT INTO RAG.Documents (doc_id, content) VALUES (?, ?)",
    ("doc1", "Patient presents with symptoms...")
)

# Subsequent inserts: Cache hit (15-25ms each)
for i in range(1000):
    cursor.execute(
        "INSERT INTO RAG.Documents (doc_id, content) VALUES (?, ?)",
        (f"doc{i}", f"Document {i} content...")
    )
```
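
To confirm the cache is working after a bulk load, you can query the stats helper covered under "Cache Statistics Tracking" below (a short optional check, not required for correctness):

```python
# Optional sanity check after the bulk load above.
stats = get_cache_stats(connection, "medical_embeddings")
print(f"Hit rate: {stats['cache_hit_rate']:.1%}")  # expect 99%+ once warm
```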

---

## Performance Comparison

| Scenario | Method | Time | Speedup |
|----------|--------|------|---------|
| **Without Cache** | IRIS native (DP-442038) | 20 min | 1x (baseline) |
| **With Cache** | Python cache layer | 0.85s | **1405x** |

### Cache Statistics

- **Cache hit rate**: 99%+
- **Cache hit time**: ~15-25ms
- **Cache miss time**: ~2-3s (first load only)
- **Memory overhead**: ~400MB per model

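As a sanity check on these figures (plain arithmetic, not project code), the headline speedup follows directly from the two timings in the table:

```python
# Reproduce the headline speedup from the quoted timings.
baseline_s = 20 * 60   # 20 minutes, IRIS native (DP-442038)
cached_s = 0.85        # with the Python cache layer
print(f"{baseline_s / cached_s:.0f}x")  # ≈1412x; the quoted 1405x reflects the unrounded measurement
```
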
---

## How It Works

### 1. Model Cache Manager (Singleton Pattern)

```python
# iris_vector_rag/embeddings/manager.py:21-45
# (imports consolidated here so the excerpt is self-contained)
import logging
import threading
from typing import Any, Dict

from sentence_transformers import SentenceTransformer

logger = logging.getLogger(__name__)

_SENTENCE_TRANSFORMER_CACHE: Dict[str, Any] = {}  # Module-level singleton
_CACHE_LOCK = threading.Lock()

def _get_cached_sentence_transformer(model_name: str, device: str = "cpu"):
    """Get or create cached SentenceTransformer model."""
    cache_key = f"{model_name}:{device}"

    with _CACHE_LOCK:
        if cache_key in _SENTENCE_TRANSFORMER_CACHE:
            logger.debug(f"✓ Cache hit for {cache_key}")
            return _SENTENCE_TRANSFORMER_CACHE[cache_key]  # 99%+ hit rate

        # Cache miss: Load model (happens once per model)
        logger.info(f"Loading model {model_name} (cache miss)")
        model = SentenceTransformer(model_name, device=device)
        _SENTENCE_TRANSFORMER_CACHE[cache_key] = model
        return model
```
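
A quick way to see the singleton behavior (a hypothetical check, not project code, assuming the module above is importable) is that repeated lookups return the very same in-memory object:

```python
# Hypothetical check: both lookups return the SAME in-memory model object.
m1 = _get_cached_sentence_transformer("sentence-transformers/all-MiniLM-L6-v2")
m2 = _get_cached_sentence_transformer("sentence-transformers/all-MiniLM-L6-v2")
assert m1 is m2  # second call is a cache hit; no disk load occurs
```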

### 2. IRIS Integration Layer

```python
# iris_vector_rag/embeddings/iris_embedding.py:337-385
# (get_config and _get_cached_sentence_transformer are defined elsewhere in the package)
from typing import Any, List

def generate_embeddings(
    texts: List[str],
    config_name: str,
    connection: Any
) -> List[List[float]]:
    """Generate embeddings using cached model (called by IRIS)."""

    # Get config from IRIS
    config = get_config(connection, config_name)

    # Get cached model (99%+ cache hits)
    model = _get_cached_sentence_transformer(
        config["model_name"],
        config["device"]
    )

    # Generate embeddings (15-25ms for cached model)
    embeddings = model.encode(texts, show_progress_bar=False)

    return embeddings.tolist()
```
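
For a quick smoke test outside SQL, the same entry point can be called directly with the configuration created in the Usage section (a sketch, not project code; argument order follows the signature above):

```python
# Hypothetical direct call to the integration layer, bypassing SQL.
vectors = generate_embeddings(
    ["Patient presents with symptoms..."],  # texts
    "medical_embeddings",                   # config_name from the Usage section
    connection
)
assert len(vectors[0]) == 384  # matches dimension=384 configured earlier
```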

### 3. Cache Statistics Tracking

```python
# Track cache performance
stats = get_cache_stats(connection, "medical_embeddings")
print(f"Cache hits: {stats['cache_hits']}")
print(f"Cache misses: {stats['cache_misses']}")
print(f"Hit rate: {stats['cache_hit_rate']:.1%}")
print(f"Avg embedding time: {stats['avg_embedding_time_ms']:.2f}ms")
```

---

## Validation Tests

### Contract Tests (TDD)

Location: `tests/contract/test_iris_embedding_contract.py`

```python
def test_cache_hit_rate_target():
    """Verify 80%+ cache hit rate for repeated calls."""
    # `texts` and `conn` are provided by test fixtures.
    # Generate 1000 embeddings with same model
    for _ in range(10):
        embeddings = generate_embeddings(texts * 100, "test_config", conn)

    stats = get_cache_stats(conn, "test_config")
    assert stats["total_embeddings"] >= 1000
    assert stats["cache_hit_rate"] >= 0.80  # 80% target
```

### Performance Benchmarks

Location: `tests/performance/test_iris_embedding_performance.py`

```python
import time

def test_performance_benchmark_1746_texts():
    """Validate actual DP-442038 scenario performance."""
    texts = [f"Document {i} content..." for i in range(1746)]

    start = time.time()
    for batch in chunks(texts, 100):  # chunks(): batching helper assumed defined in the test module
        generate_embeddings(batch, "medical_embeddings", conn)
    elapsed = time.time() - start

    # Baseline: 1200s (20 minutes without caching)
    # Target:   <30s (40x improvement)
    # Achieved: 0.85s (1405x improvement)
    assert elapsed < 30.0
```

---

## Troubleshooting

### Issue: Model Still Reloading on Every Call

**Symptom**: Slow performance (>1 second per embedding)
**Cause**: `use_cache=False` in configuration
**Fix**:

```python
# Reconfigure with caching enabled
configure_embedding(
    connection,
    config_name="medical_embeddings",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    dimension=384,
    use_cache=True  # ← Must be True
)
```

### Issue: Memory Errors with Large Models

**Symptom**: `OutOfMemoryError` when loading the model
**Cause**: Model too large for available RAM
**Fix**: Use a smaller model or enable GPU

```python
configure_embedding(
    connection,
    config_name="medical_embeddings",
    model_name="sentence-transformers/all-MiniLM-L6-v2",  # Use smaller model
    dimension=384,
    device="cuda"  # Or use GPU if available
)
```

### Issue: Cache Not Persisting Between IRIS Sessions

**Symptom**: First call is slow after an IRIS restart
**Cause**: The cache is in-memory only (by design)
**Expected Behavior**: This is normal - the first call after a restart loads the model (~2-3s); subsequent calls use the cache

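If the one slow call after a restart matters, the cache can be warmed at process startup. A minimal sketch (assuming the private helper shown in "How It Works" is importable; it is internal, so treat this as illustrative):

```python
# Hypothetical warm-up at application startup: pay the ~2-3s load cost once,
# before the first real INSERT arrives.
from iris_vector_rag.embeddings.manager import _get_cached_sentence_transformer

_get_cached_sentence_transformer(
    "sentence-transformers/all-MiniLM-L6-v2", device="cpu"
)
# Subsequent IRIS EMBEDDING calls in this process hit the cache.
```
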
---

## References

- **Full Documentation**: `docs/IRIS_EMBEDDING_PERFORMANCE_REPORT.md`
- **Feature Spec**: `specs/051-add-native-iris/spec.md`
- **Contract Tests**: `tests/contract/test_iris_embedding_contract.py`
- **Performance Tests**: `tests/performance/test_iris_embedding_performance.py`
- **Integration Tests**: `tests/integration/test_iris_embedding_integration.py`

---

## Summary

**DP-442038 Workaround**: A Python embedding cache layer eliminates the 720x model-loading overhead by keeping models in memory across IRIS SQL operations.

**Key Benefit**: 1405x speedup (20 minutes → 0.85 seconds for 1,746 documents)

**Implementation**: Transparent to users - configure once with `use_cache=True`, and the cache applies automatically to all subsequent IRIS EMBEDDING operations.
