Skip to content

[MEDI] Bugs in GetPreExistingChunksIdsAsync #7310

@adamsitnik

Description

@adamsitnik

The way GetPreExistingChunksIdsAsync is currently implemented:

int insertedCount;
do
{
insertedCount = 0;
await foreach (var record in _vectorStoreCollection!.GetAsync(
filter: record => (string)record[DocumentIdName]! == document.Identifier,
top: MaxTopCount,
cancellationToken: cancellationToken).ConfigureAwait(false))
{
keys.Add(record[KeyName]!);
insertedCount++;
}
}
while (insertedCount == MaxTopCount);

We have two bugs:

  • in case there are more matching records than MaxTopCount, we are going to keep fetching and adding the same records over and over. We need to provide the Skip parameter:
await foreach (var record in _vectorStoreCollection!.GetAsync(
    filter: record => (string)record[DocumentIdName]! == document.Identifier,
    top: MaxTopCount,
    options: new() { Skip = insertedCount }, // THE FIX
    cancellationToken: cancellationToken).ConfigureAwait(false))
  • the loop counter should not be reset at every iteration, otherwise we might end up getting an endless loop

We need a test that reproduces the problem. If preparing MaxTopCount-many records is too complex, we can use #if !RELEASE and set the value to low number in order to make testing easier in Debug builds (our CI always runs them)

        const int MaxTopCount =
#if RELEASE
            1_000;
#else
            10; // Use smaller batch size in debug to be able to test the looping logic without needing to insert a lot of records.
#endif

Metadata

Metadata

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions