The way GetPreExistingChunksIdsAsync is currently implemented:
|
int insertedCount; |
|
do |
|
{ |
|
insertedCount = 0; |
|
|
|
await foreach (var record in _vectorStoreCollection!.GetAsync( |
|
filter: record => (string)record[DocumentIdName]! == document.Identifier, |
|
top: MaxTopCount, |
|
cancellationToken: cancellationToken).ConfigureAwait(false)) |
|
{ |
|
keys.Add(record[KeyName]!); |
|
insertedCount++; |
|
} |
|
} |
|
while (insertedCount == MaxTopCount); |
We have two bugs:
- in case there are more matching records than
MaxTopCount, we are going to keep fetching and adding the same records over and over. We need to provide the Skip parameter:
await foreach (var record in _vectorStoreCollection!.GetAsync(
filter: record => (string)record[DocumentIdName]! == document.Identifier,
top: MaxTopCount,
options: new() { Skip = insertedCount }, // THE FIX
cancellationToken: cancellationToken).ConfigureAwait(false))
- the loop counter should not be reset at every iteration, otherwise we might end up getting an endless loop
We need a test that reproduces the problem. If preparing MaxTopCount-many records is too complex, we can use #if !RELEASE and set the value to low number in order to make testing easier in Debug builds (our CI always runs them)
const int MaxTopCount =
#if RELEASE
1_000;
#else
10; // Use smaller batch size in debug to be able to test the looping logic without needing to insert a lot of records.
#endif
The way
GetPreExistingChunksIdsAsyncis currently implemented:extensions/src/Libraries/Microsoft.Extensions.DataIngestion/Writers/VectorStoreWriter.cs
Lines 175 to 189 in 4d68b44
We have two bugs:
MaxTopCount, we are going to keep fetching and adding the same records over and over. We need to provide theSkipparameter:We need a test that reproduces the problem. If preparing MaxTopCount-many records is too complex, we can use
#if !RELEASEand set the value to low number in order to make testing easier in Debug builds (our CI always runs them)