feat: implement paralleled vectorizing and thread-safe processing in KnowledgeStorePresenter #1111
Conversation
Walkthrough

Adds per-file task queues to serialize vector operations for the same file, and refactors task processing into a bounded concurrent runner with a configurable `maxConcurrency`; replaces single-task tracking with a `runningTasks` map and adds lifecycle/cleanup changes.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Q as Global Queue
    participant R as Runner Pool
    participant RT as runningTasks
    participant T as Task Worker
    Q->>R: spawn up to maxConcurrency runners
    Note right of R: maxConcurrency configurable (default 16)
    par Concurrent runners
        R->>Q: dequeue task -> add to RT
        Q-->>T: task executed (async)
    end
    T->>RT: on finally remove from runningTasks
    RT->>R: trigger processQueue() to continue
```
```mermaid
sequenceDiagram
    participant App as Application
    participant Store as KnowledgeStorePresenter
    participant FQ as Per-File Queue
    participant Vec as VectorP
    App->>Store: addFile(fileId)
    Store->>FQ: enqueueFileTask(fileId, updateFile)
    App->>Store: processFileAsync(fileId)
    Store->>FQ: enqueueFileTask(fileId, insertChunks)
    FQ->>Vec: execute updateFile()
    Vec-->>FQ: complete
    Note over FQ: subsequent tasks for same file run sequentially
    FQ->>Vec: execute insertChunks()
    Vec-->>FQ: complete
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
I haven't tested it sufficiently yet, so I converted this PR into a draft.
Actionable comments posted: 2
🧹 Nitpick comments (4)
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts (2)
12-20: Concurrent tracking via `runningTasks` looks good; consider small robustness tweaks

The switch from a single "current task" to `runningTasks` and `maxConcurrency` is consistent across the status helpers and should give correct aggregate stats and activity checks. Two minor follow-ups:

- `maxConcurrency` is accepted as-is (including 0/negative values), which would stall processing. Consider clamping it to `>= 1` in the constructor or throwing on invalid values.
- The header comment still says "sequential execution"; it is worth updating it to reflect bounded parallelism to avoid confusing future readers.

Also, CI is flagging Prettier; please run the formatter once you settle the implementation.

Also applies to: 61-64, 77-83, 89-91, 94-101, 103-109, 111-117
29-47: Termination semantics rely on cooperative tasks; document the contract

`removeTasks` now correctly:

- Drops matching tasks from the queue.
- Aborts and cleans up any matching entries in `runningTasks` via `terminateTask`.

`destroy` then calls `removeTasks(() => true)`, clears the queue, aborts remaining controllers, and clears `runningTasks`, which is consistent.

One behavioral nuance: `terminateTask` removes a running task from `runningTasks` immediately after aborting its controller, so the concurrency limiter may start new work even if the aborted task's `run` implementation doesn't actually respect `AbortSignal` and keeps running. This is fine if `KnowledgeChunkTask.run` is expected to be cooperative with respect to `signal`, but it's worth documenting that assumption so future callers don't accidentally violate the contract.

Also applies to: 119-130, 132-139
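The cooperative contract flagged above can be sketched as a `run` implementation that checks the signal between units of work. This is a hypothetical illustration, not the project's actual `KnowledgeChunkTask.run`; `embedChunk` is a made-up stand-in for the real vector work:

```typescript
// A cooperative task body: it checks signal.aborted between chunks, so
// aborting the controller stops work at the next chunk boundary instead
// of letting the task run to completion after terminateTask fires.
async function runChunks(chunks: string[], signal: AbortSignal): Promise<number> {
  let processed = 0
  for (const chunk of chunks) {
    if (signal.aborted) {
      return processed // stop early; caller treats this as a termination
    }
    await embedChunk(chunk) // hypothetical stand-in for the real vector work
    processed++
  }
  return processed
}

async function embedChunk(_chunk: string): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, 1))
}
```

A task written this way honors the limiter's assumption; one that ignores `signal` keeps consuming resources even after it has been removed from `runningTasks`.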
src/main/presenter/knowledgePresenter/knowledgeStorePresenter.ts (2)
89-93: Be aware of changed failure semantics around `vectorP.*` operations

All of these call sites now go through `enqueueFileTask`, so until its error handling is adjusted:

- Failures in `updateFile`/`insertFile` during `addFile` will not reject `addFile`; callers may think a file was registered when it wasn't.
- Failures in `insertChunks`, `updateChunkStatus`, `insertVector`, `deleteFile`, `deleteChunksByFile`, `deleteVectorsByFile`, and the file status updates in `onFileFinish`/`handleFileProcessingError` will be logged but treated as success by their callers.

Once `enqueueFileTask` is updated to propagate errors (as suggested earlier), these sites will again behave like the previous direct `await`-based implementation, while still benefiting from per-file serialization.

Also applies to: 127-128, 140-143, 156-158, 213-221, 271-273, 299-300, 317-318, 334-335, 370-371
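For illustration, here is a minimal self-contained demo (hypothetical names, not the presenter's code) of how a `.catch()` placed inside the chain hides a rejection from the awaiting caller:

```typescript
// The awaited promise includes a .catch() that logs and resolves, so the
// caller proceeds as if the operation succeeded even though the inner
// task rejected. This mirrors the swallowed-error hazard described above.
async function swallowedDemo(): Promise<string> {
  const failing = async () => {
    throw new Error('vector insert failed')
  }
  const chained = Promise.resolve()
    .then(failing)
    .catch(() => {
      // logged-and-resolved: the caller below never sees the rejection
    })
  await chained
  return 'looks like success'
}
```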
325-342: Deletion/re-add/destroy flows mostly align; consider a couple of lifecycle refinements

- `deleteFile` now cancels queued/running chunk tasks and deletes the file via the per-file queue, which is consistent with the new model.
- `reAddFile` cleans out chunks and vectors via the per-file queue before calling `addFile`, which is sensible.

Two optional improvements:

- `reAddFile` task cancellation. If `reAddFile` can be called while there are still active tasks for that file, you may want to invoke `this.taskP.cancelTasksByFile(fileId)` here as well, mirroring `deleteFile`.
- `destroy`/`close` and `fileQueueMap`. Both `destroy` and `close` cancel tasks and clear `fileProgressMap` but don't touch `fileQueueMap`. Once you adjust `enqueueFileTask` to clean up entries, that may be sufficient, but if there's a chance of in-flight queued operations at teardown time, explicitly clearing `fileQueueMap` (after ensuring you're OK with abandoning pending DB work) would make the lifecycle semantics clearer.

These are not correctness blockers but are worth considering for long-running processes.

Also applies to: 365-373, 442-465
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/main/presenter/knowledgePresenter/knowledgeStorePresenter.ts (12 hunks)
- src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts (1)
src/shared/types/presenters/legacy.presenters.d.ts (3)
`KnowledgeChunkTask` (1645-1656), `TaskStatusSummary` (1659-1663), `TaskQueueStatus` (1666-1670)
🪛 GitHub Actions: PR Check
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts
[warning] 1-1: Code style issues found by Prettier. Run 'prettier --write' to fix.
🔇 Additional comments (1)
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts (1)
141-170: Queue scheduling and cleanup logic is sound

The reworked `processQueue` loop:

- Correctly pulls from `queue` while `runningTasks.size < maxConcurrency`.
- Tracks each started task in `runningTasks`.
- Uses an async IIFE per task with a `finally` block that always cleans up `controllers` and `runningTasks`, logs completion, and then triggers another `processQueue()` pass.

The split between cooperative cancellation (success path + `controller.signal.aborted` ⇒ `onTerminate`) and a rejected `AbortError` (`DOMException`) in the catch gives clear semantics for termination vs. "real" failure.

Overall this looks correct and should respect the concurrency cap under the assumption that tasks honor the abort signal.
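As a rough sketch of the pattern this comment describes, assuming a simplified task shape (the real `KnowledgeChunkTask` and presenter wiring differ), a bounded pump might look like:

```typescript
// Hypothetical sketch only: names (queue, runningTasks, maxConcurrency,
// processQueue) mirror the review, but the Task type is an assumption,
// not the project's actual interface.
type Task = { id: string; run: (signal: AbortSignal) => Promise<void> }

class BoundedRunner {
  private queue: Task[] = []
  private runningTasks = new Map<string, Task>()
  private controllers = new Map<string, AbortController>()

  constructor(private maxConcurrency = 16) {
    // clamp invalid values, as the review suggests
    if (maxConcurrency < 1) this.maxConcurrency = 1
  }

  enqueue(task: Task): void {
    this.queue.push(task)
    this.processQueue()
  }

  get running(): number {
    return this.runningTasks.size
  }

  private processQueue(): void {
    // start tasks until the concurrency cap is reached
    while (this.queue.length > 0 && this.runningTasks.size < this.maxConcurrency) {
      const task = this.queue.shift()!
      const controller = new AbortController()
      this.runningTasks.set(task.id, task)
      this.controllers.set(task.id, controller)
      // async IIFE per task; finally always cleans up and re-pumps the queue
      void (async () => {
        try {
          await task.run(controller.signal)
        } catch (err) {
          console.error(`task ${task.id} failed:`, err)
        } finally {
          this.controllers.delete(task.id)
          this.runningTasks.delete(task.id)
          this.processQueue()
        }
      })()
    }
  }
}
```

Because the `finally` block both removes the finished task and re-invokes `processQueue`, the pool refills itself whether a task succeeds, fails, or is aborted.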
```ts
// --- New: per-file queue to keep vectorP access thread-safe ---
private fileQueueMap = new Map<string, Promise<void>>()

private async enqueueFileTask(fileId: string, task: () => Promise<void>): Promise<void> {
  const last = this.fileQueueMap.get(fileId) ?? Promise.resolve()
  const next = last.then(task).catch((err) => {
    console.error(`[RAG] Error in queued task for file ${fileId}:`, err)
  })
  this.fileQueueMap.set(fileId, next)
  await next
}
```
`enqueueFileTask` currently swallows errors and can leak queue entries

The per-file queue idea is solid, but the current implementation has two important side effects:

- Errors are swallowed for all callers. `next = last.then(task).catch(...)` logs and then resolves, so any `await this.enqueueFileTask(...)` (e.g., in `addFile`, `deleteFile`, `reAddFile`, `onFileFinish`, etc.) will never see a rejection, even if an underlying `vectorP.*` call fails. This changes behavior versus the previous direct `await this.vectorP.*` and can leave callers thinking an operation succeeded when it did not.
- `fileQueueMap` entries are never cleared. The map only ever grows; once a file (or any ID key) has queued work, its resolved `Promise` stays in `fileQueueMap` indefinitely. Over many files/chunks this becomes a leak.

A simple pattern that keeps per-file sequencing, lets callers observe errors, and still prevents a failed task from blocking subsequent ones is:
```ts
private async enqueueFileTask(fileId: string, task: () => Promise<void>): Promise<void> {
  const prev = this.fileQueueMap.get(fileId) ?? Promise.resolve();
  const next = prev.then(task);
  const wrapped = next
    .catch((err) => {
      console.error(`[RAG] Error in queued task for file ${fileId}:`, err);
    })
    .finally(() => {
      if (this.fileQueueMap.get(fileId) === wrapped) {
        this.fileQueueMap.delete(fileId);
      }
    });
  this.fileQueueMap.set(fileId, wrapped);
  return next; // callers still see rejection if task fails
}
```

This preserves serialization, avoids unbounded map growth, and restores proper error propagation to callers.
🤖 Prompt for AI Agents
In src/main/presenter/knowledgePresenter/knowledgeStorePresenter.ts around lines
27 to 37, the current enqueueFileTask swallows errors and never clears
fileQueueMap entries; change it to get the previous promise with
Promise.resolve(), set next = prev.then(task) (so callers receive the real
rejection), create a wrapped promise that catches/logs errors and in finally
deletes the map entry only if the stored value equals the wrapped promise, store
wrapped in fileQueueMap, and return next (not wrapped) so callers observe errors
while later tasks still run and the map does not leak.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts (1)
132-169: Replace the DOMException check with signal.aborted and use the structured logger instead of console

The code at lines 132-169 uses `error instanceof DOMException` to detect abort errors, which is fragile in Node/Electron environments and may throw a ReferenceError if `DOMException` is undefined. Additionally, it violates the coding guidelines by using `console.log` and `console.error` instead of structured logging.

Changes required:

Line 155: Replace the `instanceof DOMException` check with `controller.signal.aborted` to detect abort errors reliably:

```ts
const isAbortError = controller.signal.aborted || (error as any)?.name === 'AbortError'
if (isAbortError) {
```

Lines 156, 158, 163: Replace all console calls with the structured logger. Import the logger at the top:

```ts
import logger from '../../../shared/logger'
```

Then use:

```ts
logger.info('[RAG TASK] Task %s aborted during execution.', task.id)
logger.error('[RAG TASK] Task %s failed with error:', task.id, { error })
logger.info('[RAG TASK] Task %s finished.', task.id)
```
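A minimal standalone form of the suggested check (logger wiring omitted; this helper is illustrative, not the project's API) could be:

```typescript
// Detect aborts via the controller's own signal plus the error name,
// avoiding `instanceof DOMException`, which may be undefined in some
// Node/Electron contexts.
function isAbortError(error: unknown, signal: AbortSignal): boolean {
  return signal.aborted || (error as { name?: string } | null | undefined)?.name === 'AbortError'
}
```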
🧹 Nitpick comments (3)
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts (3)
1-20: Align class documentation with the new concurrent execution model

The class-level comment still says "sequential execution", but with `runningTasks` and `maxConcurrency` you now allow up to `maxConcurrency` tasks to run in parallel. It would be clearer to update the doc to something like "bounded concurrent execution (default 16 tasks)" so callers understand the behavior and the meaning of `maxConcurrency`.

You might also consider mentioning the concurrency behavior in any higher-level presenter docs that reference this class, so the global scheduler's semantics are explicit.
29-47: Cancellation now covers running tasks; consider semantics for queued task termination callbacks

Extending `removeTasks` to iterate `runningTasks` and call `terminateTask` for matching entries is a good improvement and ensures running tasks are actually aborted and removed.

One open design point: for tasks that are still in the queue (never started), you currently remove them and abort their controller, but you never invoke `onTerminate`. If callers expect `onTerminate` to fire whenever a task is canceled (regardless of whether it had started), you may want to invoke it when filtering queued tasks:

```diff
 this.queue = this.queue.filter((task) => {
   if (filter(task)) {
     console.log(`[RAG TASK] Removing queued task: ${task.id}`)
     this.terminateTask(task.id)
+    task.onTerminate?.()
     return false
   }
   return true
 })
```

If `onTerminate` is intended only for tasks that had actually begun execution, the current behavior is fine; it's just worth double-checking the contract.
119-130: Destroy behavior thoroughly cleans up queue, controllers, and runningTasks

`destroy()` now:

- Cancels all queued and running tasks via `removeTasks(() => true)`, which drives `terminateTask` and aborts controllers.
- Resets `this.queue` to `[]`.
- Aborts any remaining controllers and clears both `controllers` and `runningTasks`.

This provides a robust cleanup point for the presenter. The extra `controllers.forEach((c) => c.abort())` after `removeTasks` is slightly redundant in the current implementation (since `terminateTask` already aborts), but it's harmless and defensive if new code ever adds controllers without going through `removeTasks`.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/main/presenter/knowledgePresenter/knowledgeStorePresenter.ts (12 hunks)
- src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/main/presenter/knowledgePresenter/knowledgeStorePresenter.ts
🧰 Additional context used
📓 Path-based instructions (13)
**/*.{ts,tsx,js,jsx,vue}
📄 CodeRabbit inference engine (CLAUDE.md)
Use English for logs and comments (Chinese text exists in legacy code, but new code should use English)
Files:
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts
**/*.{ts,tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
Enable and maintain strict TypeScript type checking for all files
**/*.{ts,tsx}: Always use try-catch to handle possible errors in TypeScript code
Provide meaningful error messages when catching errors
Log detailed error logs including error details, context, and stack traces
Distinguish and handle different error types (UserError, NetworkError, SystemError, BusinessError) with appropriate handlers in TypeScript
Use structured logging with logger.error(), logger.warn(), logger.info(), logger.debug() methods from logging utilities
Do not suppress errors (avoid empty catch blocks or silently ignoring errors)
Provide user-friendly error messages for user-facing errors in TypeScript components
Implement error retry mechanisms for transient failures in TypeScript
Avoid logging sensitive information (passwords, tokens, PII) in logs
Files:
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts
src/main/presenter/**/*.ts
📄 CodeRabbit inference engine (CLAUDE.md)
Organize core business logic into dedicated Presenter classes, with one presenter per functional domain
Files:
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts
src/main/**/*.ts
📄 CodeRabbit inference engine (CLAUDE.md)
Use EventBus from `src/main/eventbus.ts` for main-to-renderer communication, broadcasting events via `mainWindow.webContents.send()`
src/main/**/*.ts: Use the EventBus pattern for inter-process communication within the main process to decouple modules
Use Electron's built-in APIs for file system and native dialogs instead of Node.js or custom implementations
src/main/**/*.ts: Electron main process code belongs in `src/main/` with presenters in `presenter/` (Window/Tab/Thread/Mcp/Config/LLMProvider) and `eventbus.ts` for app events
Use the Presenter pattern in the main process for UI coordination
Files:
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts
**/*.ts
📄 CodeRabbit inference engine (CLAUDE.md)
Do not include AI co-authoring information (e.g., 'Co-Authored-By: Claude') in git commits
Files:
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts
**/*.{js,ts,jsx,tsx,mjs,cjs}
📄 CodeRabbit inference engine (.cursor/rules/development-setup.mdc)
Write logs and comments in English
Files:
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts
{src/main/presenter/**/*.ts,src/renderer/**/*.ts}
📄 CodeRabbit inference engine (.cursor/rules/electron-best-practices.mdc)
Implement proper inter-process communication (IPC) patterns using Electron's ipcRenderer and ipcMain APIs
Files:
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts
src/**/*
📄 CodeRabbit inference engine (.cursor/rules/project-structure.mdc)
New features should be developed in the `src` directory
Files:
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts
src/main/**/*.{js,ts}
📄 CodeRabbit inference engine (.cursor/rules/project-structure.mdc)
Main process code for Electron should be placed in `src/main`
Files:
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts
src/**/*.{ts,tsx,vue,js,jsx}
📄 CodeRabbit inference engine (AGENTS.md)
Use Prettier with single quotes, no semicolons, and 100 character width
Files:
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts
src/**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (AGENTS.md)
Use OxLint for linting JavaScript and TypeScript files
Files:
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts
src/**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.{ts,tsx}: Use camelCase for variable and function names in TypeScript files
Use PascalCase for type and class names in TypeScript
Use SCREAMING_SNAKE_CASE for constant names
Files:
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts
src/**/*.ts
📄 CodeRabbit inference engine (AGENTS.md)
Use EventBus for inter-process communication events
Files:
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts
🧠 Learnings (5)
📚 Learning: 2025-11-25T05:26:11.297Z
Learnt from: CR
Repo: ThinkInAIXYZ/deepchat PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-25T05:26:11.297Z
Learning: Applies to src/main/presenter/**/*.ts : Organize core business logic into dedicated Presenter classes, with one presenter per functional domain
Applied to files:
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts
📚 Learning: 2025-11-25T05:28:20.500Z
Learnt from: CR
Repo: ThinkInAIXYZ/deepchat PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-25T05:28:20.500Z
Learning: Applies to src/main/**/*.ts : Use the Presenter pattern in the main process for UI coordination
Applied to files:
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts
📚 Learning: 2025-11-25T05:27:12.201Z
Learnt from: CR
Repo: ThinkInAIXYZ/deepchat PR: 0
File: .cursor/rules/llm-agent-loop.mdc:0-0
Timestamp: 2025-11-25T05:27:12.201Z
Learning: Implement separation of concerns where `src/main/presenter/llmProviderPresenter/index.ts` manages the Agent loop and conversation history, while Provider files handle LLM API interactions, Provider-specific request/response formatting, tool definition conversion, and native vs non-native tool call mechanisms
Applied to files:
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts
📚 Learning: 2025-11-25T05:27:12.201Z
Learnt from: CR
Repo: ThinkInAIXYZ/deepchat PR: 0
File: .cursor/rules/llm-agent-loop.mdc:0-0
Timestamp: 2025-11-25T05:27:12.201Z
Learning: Applies to src/main/presenter/llmProviderPresenter/index.ts : In `src/main/presenter/llmProviderPresenter/index.ts` (`startStreamCompletion`), implement the Agent loop that manages the overall conversation flow, including multiple rounds of LLM calls and tool usage, maintaining `conversationMessages` history, calling `provider.coreStream()` on each iteration, and controlling the loop using `needContinueConversation` and `toolCallCount` (compared against `MAX_TOOL_CALLS`)
Applied to files:
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts
📚 Learning: 2025-11-25T05:27:12.201Z
Learnt from: CR
Repo: ThinkInAIXYZ/deepchat PR: 0
File: .cursor/rules/llm-agent-loop.mdc:0-0
Timestamp: 2025-11-25T05:27:12.201Z
Learning: Applies to src/main/presenter/llmProviderPresenter/index.ts : In `src/main/presenter/llmProviderPresenter/index.ts`, listen for standardized events yielded by `coreStream` and handle them accordingly: buffer text content (`currentContent`), handle `tool_call_start/chunk/end` events by collecting tool details and calling `presenter.mcpPresenter.callTool`, send frontend events via `eventBus` with tool call status, format tool results for the next LLM call, and set `needContinueConversation = true`
Applied to files:
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts
🧬 Code graph analysis (1)
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts (2)
src/shared/types/presenters/legacy.presenters.d.ts (3)
`KnowledgeChunkTask` (1645-1656), `TaskStatusSummary` (1659-1663), `TaskQueueStatus` (1666-1670)
scripts/fetch-provider-db.mjs (2)
`controller` (148-148), `error` (14-14)
🔇 Additional comments (1)
src/main/presenter/knowledgePresenter/knowledgeTaskPresenter.ts (1)
60-116: Status and "has active tasks" methods correctly incorporate concurrent runningTasks

The updates to `getTaskStatus`, `hasActiveTasks*`, and `getStatus` look consistent with the new concurrent model:

- `pending` and `queuedTasks` derive from `this.queue.length`.
- `processing` and `runningTasks` derive from `this.runningTasks.size`.
- `byKnowledgeBase` aggregates over both queued and running tasks, keeping per-KB pending/processing in sync with the global counts.
- `hasActiveTasksForKnowledgeBase`/`hasActiveTasksForFile` correctly short-circuit on either queued or running tasks.

Functionally this all lines up with the updated `TaskStatusSummary`/`TaskQueueStatus` expectations; no issues from a concurrency or counting perspective.