Commit fe74a7d
feat(convo_miner): parallel file processing + batched upserts
Port PR MemPalace#416's parallelization pattern from miner.py to convo_miner.py.
PR MemPalace#416 only batched project mining (mempalace mine <dir>), leaving
conversation mining (mempalace mine <dir> --mode convos) on the slow
per-chunk collection.add() path. Since convo mining is the primary
ingest path for large corpora like ~/.claude/projects/ (~1.5 GB of
Claude Code JSONL), this omission negated most of the speedup potential
from both PR MemPalace#416 (batching) and PR MemPalace#442 (GPU embedding).
Changes:
1. Extract process_convo_file_cpu(): a pure CPU worker that normalizes,
   chunks, detects room, and builds drawer records. Thread-safe by
   construction: no ChromaDB calls, no shared state, all inputs
   passed explicitly. Returns (source_file, room, records, room_counts_delta)
   or None for skipped files.
2. Rewrite mine_convos() ingest loop:
   - Pre-filter pending files with file_already_mined() (sequential,
     matches PR MemPalace#416's pattern)
   - Submit all pending files to ThreadPoolExecutor with
     MAX_WORKERS = min(32, cpu_count()*2) for parallel normalize/chunk
   - Main thread accumulates records into batch_ids/docs/metas lists
     and flushes via collection.upsert() every BATCH_SIZE (128) records
   - try/finally around the executor guarantees final flush_batch()
     runs even if a worker raises, preventing silent loss of up to
     BATCH_SIZE-1 pending drawers
   - Per-worker exceptions are caught and logged instead of aborting
     the whole run (each file is independent)
3. Keep dry_run path sequential: matches miner.py, preserves original
   output formatting (per-file [DRY RUN] lines, room distribution),
   uses the same extracted worker for consistency.
4. Switch collection.add() -> collection.upsert(): idempotent, removes
   the try/except 'already exists' dance, matches miner.py.
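The ingest pattern described in the changes above can be sketched roughly as follows. This is a minimal illustration, not the code from convo_miner.py: InMemoryCollection, cpu_worker(), and mine() are hypothetical stand-ins, the chunking rule (split on blank lines) is an assumption, and only the constants BATCH_SIZE and MAX_WORKERS and the upsert(ids=..., documents=..., metadatas=...) call shape are taken from the message.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

BATCH_SIZE = 128
MAX_WORKERS = min(32, (os.cpu_count() or 1) * 2)


class InMemoryCollection:
    """Hypothetical stand-in for a ChromaDB collection (upsert only)."""

    def __init__(self):
        self.rows = {}
        self.upsert_calls = 0

    def upsert(self, ids, documents, metadatas):
        self.upsert_calls += 1
        for id_, doc, meta in zip(ids, documents, metadatas):
            self.rows[id_] = (doc, meta)  # same id overwrites, so reruns are idempotent


def cpu_worker(name, text):
    """Pure CPU step (assumed chunking rule): no collection access, no shared state."""
    chunks = [c for c in text.split("\n\n") if c.strip()]
    if not chunks:
        return None  # skipped file
    return name, [(f"{name}:{i}", c, {"source_file": name}) for i, c in enumerate(chunks)]


def mine(files, collection):
    """files maps name -> text. Returns the number of records written."""
    ids, docs, metas = [], [], []
    written = 0

    def flush_batch():
        nonlocal written
        if not ids:
            return
        collection.upsert(ids=list(ids), documents=list(docs), metadatas=list(metas))
        written += len(ids)
        ids.clear()
        docs.clear()
        metas.clear()

    try:
        with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
            futures = {pool.submit(cpu_worker, n, t): n for n, t in files.items()}
            for fut in as_completed(futures):
                try:
                    result = fut.result()
                except Exception as exc:
                    # One bad file is logged and skipped; the run continues.
                    print(f"skipping {futures[fut]}: {exc}")
                    continue
                if result is None:
                    continue
                for id_, doc, meta in result[1]:
                    ids.append(id_)
                    docs.append(doc)
                    metas.append(meta)
                    if len(ids) >= BATCH_SIZE:
                        flush_batch()
    finally:
        flush_batch()  # write the final partial batch even if something above raised
    return written
```

Only the main thread touches the collection, so the workers need no locking, and the finally block is what keeps a partial batch from being dropped silently.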
Performance expectations (M-series Mac with MEMPALACE_EMBEDDING_DEVICE=mps):
Before (single-file sequential loop + per-chunk .add()):
502 drawers in 12.3s = 40.7 drawers/s
After (parallel reads + batched upserts):
Expected ~3-5x improvement from batching alone (GPU finally gets
meaningful batch sizes), plus another ~2-3x from parallelizing
the normalize() step on large JSONL files. Combined: ~5-15x.
All 556 tests still pass, including tests/test_convo_miner.py which
exercises the real ChromaDB write path end-to-end.
No changes required to the public API. Callers (cli.py, mcp_server.py)
are unaffected.
1 parent 66c2825 commit fe74a7d
1 file changed
Lines changed: 161 additions & 80 deletions