Commit 70befb8

Browse files
committed
docs: add quick reference guide for DP-442038 workaround

Create condensed guide for IRIS EMBEDDING model caching solution:

- Problem: 720x slowdown from repeated model loading
- Solution: Python embedding cache layer (Feature 051)
- Result: 1405x speedup (20 min → 0.85s for 1,746 docs)

Guide includes:

- Quick overview of the issue and fix
- Usage examples with code snippets
- Performance comparison table
- Architecture diagrams
- Troubleshooting section
- References to full documentation

Complements existing comprehensive report (IRIS_EMBEDDING_PERFORMANCE_REPORT.md) with a more accessible quick-start format.

# Quick Guide: DP-442038 Workaround (720x Model Loading Overhead)

**JIRA Issue**: DP-442038
**Problem**: IRIS EMBEDDING reloads the model from disk on every document insert
**Impact**: 720x slowdown (20 minutes for 1,746 documents)
**Solution**: Python embedding cache layer (Feature 051)
**Result**: 1405x speedup (0.85 seconds for the same workload)

---

## The Problem

IRIS native `%Embedding.SentenceTransformers` reloads the embedding model **on every INSERT**:

```sql
-- This triggers model reload for EACH document
INSERT INTO documents (id, content) VALUES (1, 'text...');
-- Model loads from disk (400MB), generates embedding, then exits
INSERT INTO documents (id, content) VALUES (2, 'text...');
-- Model loads AGAIN from disk (400MB), generates embedding, exits
```

**Bottleneck**: 1,746 documents × 400MB model ≈ 698GB of disk I/O and ~20 minutes of load time

---

## The Workaround: Python Embedding Cache

We intercept IRIS EMBEDDING calls and cache models in Python memory:

### Architecture

```
IRIS SQL INSERT
      ↓
%Embedding.SentenceTransformers
      ↓
iris_vector_rag.embeddings.iris_embedding (cache layer)
      ↓
Cached model in memory (99%+ cache hits)
      ↓
Return embedding to IRIS
```

### Key Files

1. **`iris_vector_rag/embeddings/manager.py`** - Model cache manager (singleton pattern)
2. **`iris_vector_rag/embeddings/iris_embedding.py`** - IRIS integration layer
3. **`iris_vector_rag/config/embedding_config.py`** - Configuration validation

---

## Usage

### 1. Configure IRIS EMBEDDING

```python
from iris_vector_rag.embeddings.iris_embedding import configure_embedding

# Configure embedding model in IRIS
configure_embedding(
    connection,
    config_name="medical_embeddings",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    dimension=384,
    device="cpu",    # or "cuda" for GPU
    use_cache=True   # CRITICAL: enables caching
)
```

### 2. Create Table with EMBEDDING Column

```sql
CREATE TABLE RAG.Documents (
    doc_id VARCHAR(255),
    content VARCHAR(5000),
    embedding VECTOR(DOUBLE, 384) EMBEDDING('medical_embeddings')
)
```

### 3. Insert Documents (Cache Automatically Applied)

```python
# First insert: Model loads (2-3 seconds)
cursor.execute(
    "INSERT INTO RAG.Documents (doc_id, content) VALUES (?, ?)",
    ("doc1", "Patient presents with symptoms...")
)

# Subsequent inserts: Cache hit (15-25ms each)
for i in range(1000):
    cursor.execute(
        "INSERT INTO RAG.Documents (doc_id, content) VALUES (?, ?)",
        (f"doc{i}", f"Document {i} content...")
    )
```
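
To confirm the cache is working after a bulk load, you can query the stats helper covered under "Cache Statistics Tracking" below (a short optional check, not required for correctness):

```python
# Optional sanity check after the bulk load above.
stats = get_cache_stats(connection, "medical_embeddings")
print(f"Hit rate: {stats['cache_hit_rate']:.1%}")  # expect 99%+ once warm
```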

---

## Performance Comparison

| Scenario | Method | Time | Speedup |
|----------|--------|------|---------|
| **Without Cache** | IRIS native (DP-442038) | 20 min | 1x (baseline) |
| **With Cache** | Python cache layer | 0.85s | **1405x** |

### Cache Statistics

- **Cache hit rate**: 99%+
- **Cache hit time**: ~15-25ms
- **Cache miss time**: ~2-3s (first load only)
- **Memory overhead**: ~400MB per model

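As a sanity check on these figures (plain arithmetic, not project code), the headline speedup follows directly from the two timings in the table:

```python
# Reproduce the headline speedup from the quoted timings.
baseline_s = 20 * 60   # 20 minutes, IRIS native (DP-442038)
cached_s = 0.85        # with the Python cache layer
print(f"{baseline_s / cached_s:.0f}x")  # ≈1412x; the quoted 1405x reflects the unrounded measurement
```
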
---

## How It Works

### 1. Model Cache Manager (Singleton Pattern)

```python
# iris_vector_rag/embeddings/manager.py:21-45
# (imports consolidated here so the excerpt is self-contained)
import logging
import threading
from typing import Any, Dict

from sentence_transformers import SentenceTransformer

logger = logging.getLogger(__name__)

_SENTENCE_TRANSFORMER_CACHE: Dict[str, Any] = {}  # Module-level singleton
_CACHE_LOCK = threading.Lock()

def _get_cached_sentence_transformer(model_name: str, device: str = "cpu"):
    """Get or create cached SentenceTransformer model."""
    cache_key = f"{model_name}:{device}"

    with _CACHE_LOCK:
        if cache_key in _SENTENCE_TRANSFORMER_CACHE:
            logger.debug(f"✓ Cache hit for {cache_key}")
            return _SENTENCE_TRANSFORMER_CACHE[cache_key]  # 99%+ hit rate

        # Cache miss: Load model (happens once per model)
        logger.info(f"Loading model {model_name} (cache miss)")
        model = SentenceTransformer(model_name, device=device)
        _SENTENCE_TRANSFORMER_CACHE[cache_key] = model
        return model
```
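
A quick way to see the singleton behavior (a hypothetical check, not project code, assuming the module above is importable) is that repeated lookups return the very same in-memory object:

```python
# Hypothetical check: both lookups return the SAME in-memory model object.
m1 = _get_cached_sentence_transformer("sentence-transformers/all-MiniLM-L6-v2")
m2 = _get_cached_sentence_transformer("sentence-transformers/all-MiniLM-L6-v2")
assert m1 is m2  # second call is a cache hit; no disk load occurs
```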

### 2. IRIS Integration Layer

```python
# iris_vector_rag/embeddings/iris_embedding.py:337-385
# (get_config and _get_cached_sentence_transformer are defined elsewhere in the package)
from typing import Any, List

def generate_embeddings(
    texts: List[str],
    config_name: str,
    connection: Any
) -> List[List[float]]:
    """Generate embeddings using cached model (called by IRIS)."""

    # Get config from IRIS
    config = get_config(connection, config_name)

    # Get cached model (99%+ cache hits)
    model = _get_cached_sentence_transformer(
        config["model_name"],
        config["device"]
    )

    # Generate embeddings (15-25ms for cached model)
    embeddings = model.encode(texts, show_progress_bar=False)

    return embeddings.tolist()
```
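
For a quick smoke test outside SQL, the same entry point can be called directly with the configuration created in the Usage section (a sketch, not project code; argument order follows the signature above):

```python
# Hypothetical direct call to the integration layer, bypassing SQL.
vectors = generate_embeddings(
    ["Patient presents with symptoms..."],  # texts
    "medical_embeddings",                   # config_name from the Usage section
    connection
)
assert len(vectors[0]) == 384  # matches dimension=384 configured earlier
```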

### 3. Cache Statistics Tracking

```python
# Track cache performance
stats = get_cache_stats(connection, "medical_embeddings")
print(f"Cache hits: {stats['cache_hits']}")
print(f"Cache misses: {stats['cache_misses']}")
print(f"Hit rate: {stats['cache_hit_rate']:.1%}")
print(f"Avg embedding time: {stats['avg_embedding_time_ms']:.2f}ms")
```

---

## Validation Tests

### Contract Tests (TDD)

Location: `tests/contract/test_iris_embedding_contract.py`

```python
def test_cache_hit_rate_target():
    """Verify 80%+ cache hit rate for repeated calls."""
    # `texts` and `conn` are provided by test fixtures.
    # Generate 1000 embeddings with same model
    for _ in range(10):
        embeddings = generate_embeddings(texts * 100, "test_config", conn)

    stats = get_cache_stats(conn, "test_config")
    assert stats["total_embeddings"] >= 1000
    assert stats["cache_hit_rate"] >= 0.80  # 80% target
```

### Performance Benchmarks

Location: `tests/performance/test_iris_embedding_performance.py`

```python
import time

def test_performance_benchmark_1746_texts():
    """Validate actual DP-442038 scenario performance."""
    texts = [f"Document {i} content..." for i in range(1746)]

    start = time.time()
    for batch in chunks(texts, 100):  # chunks(): batching helper assumed defined in the test module
        generate_embeddings(batch, "medical_embeddings", conn)
    elapsed = time.time() - start

    # Baseline: 1200s (20 minutes without caching)
    # Target:   <30s (40x improvement)
    # Achieved: 0.85s (1405x improvement)
    assert elapsed < 30.0
```

---

## Troubleshooting

### Issue: Model Still Reloading on Every Call

**Symptom**: Slow performance (>1 second per embedding)
**Cause**: `use_cache=False` in configuration
**Fix**:

```python
# Reconfigure with caching enabled
configure_embedding(
    connection,
    config_name="medical_embeddings",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    dimension=384,
    use_cache=True  # ← Must be True
)
```

### Issue: Memory Errors with Large Models

**Symptom**: `OutOfMemoryError` when loading the model
**Cause**: Model too large for available RAM
**Fix**: Use a smaller model or enable GPU

```python
configure_embedding(
    connection,
    config_name="medical_embeddings",
    model_name="sentence-transformers/all-MiniLM-L6-v2",  # Use smaller model
    dimension=384,
    device="cuda"  # Or use GPU if available
)
```

### Issue: Cache Not Persisting Between IRIS Sessions

**Symptom**: First call is slow after an IRIS restart
**Cause**: The cache is in-memory only (by design)
**Expected Behavior**: This is normal - the first call after a restart loads the model (~2-3s); subsequent calls use the cache

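If the one slow call after a restart matters, the cache can be warmed at process startup. A minimal sketch (assuming the private helper shown in "How It Works" is importable; it is internal, so treat this as illustrative):

```python
# Hypothetical warm-up at application startup: pay the ~2-3s load cost once,
# before the first real INSERT arrives.
from iris_vector_rag.embeddings.manager import _get_cached_sentence_transformer

_get_cached_sentence_transformer(
    "sentence-transformers/all-MiniLM-L6-v2", device="cpu"
)
# Subsequent IRIS EMBEDDING calls in this process hit the cache.
```
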
---

## References

- **Full Documentation**: `docs/IRIS_EMBEDDING_PERFORMANCE_REPORT.md`
- **Feature Spec**: `specs/051-add-native-iris/spec.md`
- **Contract Tests**: `tests/contract/test_iris_embedding_contract.py`
- **Performance Tests**: `tests/performance/test_iris_embedding_performance.py`
- **Integration Tests**: `tests/integration/test_iris_embedding_integration.py`

---

## Summary

**DP-442038 Workaround**: A Python embedding cache layer eliminates the 720x model-loading overhead by keeping models in memory across IRIS SQL operations.

**Key Benefit**: 1405x speedup (20 minutes → 0.85 seconds for 1,746 documents)

**Implementation**: Transparent to users - configure once with `use_cache=True`, and the cache applies automatically to all subsequent IRIS EMBEDDING operations.
