Reindex job reliability improvements #5331
Conversation
- Add internal `MemorySafeBatchSize` (2000) to prevent `OutOfMemoryException` when processing large reindex batches with large FHIR resources
- Split processing into a SQL Server path (surrogate ID batching) and a Cosmos DB path (continuation tokens)
- SQL Server path fetches resources in smaller memory-safe chunks by advancing `StartSurrogateId` after each batch
- Customer-configured `MaximumNumberOfResourcesPerQuery` still controls total job size
- Convert `BulkUpdateSearchParameterIndicesAsync` version conflict from error to warning (`SqlServerFhirDataStore`)
- Add unit test for memory-safe batch processing
…nd retry on OutOfMemoryException
…e SQL timeouts and Cosmos DB request rate limits
…Cosmos DB request rate limits; log version conflicts as errors without failing the job
…and Cosmos DB 429 errors
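The SQL Server path summarized above (fetch in memory-safe chunks, then advance `StartSurrogateId`) could be sketched roughly as follows. This is an illustrative fragment only: `FetchBatchAsync`, `ReindexBatchAsync`, and the exact loop shape are assumptions, not the PR's actual code.

```csharp
// Illustrative sketch - helper names are assumptions, not the PR's code.
private const int MemorySafeBatchSize = 2000; // internal cap introduced by this PR

private async Task ProcessSqlServerRangeAsync(long startSurrogateId, long endSurrogateId, CancellationToken ct)
{
    // MaximumNumberOfResourcesPerQuery still bounds the overall job-level range;
    // this loop only splits that range into smaller, memory-safe fetches.
    long current = startSurrogateId;
    while (current <= endSurrogateId)
    {
        IReadOnlyList<ResourceWrapper> batch =
            await FetchBatchAsync(current, endSurrogateId, MemorySafeBatchSize, ct);
        if (batch.Count == 0)
        {
            break; // range exhausted
        }

        await ReindexBatchAsync(batch, ct);

        // Advance StartSurrogateId past the last processed resource.
        current = batch[batch.Count - 1].ResourceSurrogateId + 1;
    }
}
```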
```csharp
{
    // Version conflicts can occur when resources are updated during reindex.
    // Log warning and continue - conflicting resources will be picked up in the next reindex cycle.
    _logger.LogWarning(ex, "Version conflict during reindex batch update. Some resources were modified during reindex and will be reprocessed in a subsequent cycle.");
```
Can you elaborate a bit? Will we consider these resources as successfully reindexed? My guess is not, and they won't have the hash updated.
Yes, we would consider this successful. At this point, the service already has full knowledge of the current search parameter state. When a record is updated by another operation, it is processed using the latest effective search parameter configuration (for example, if a new search parameter has been added, the corresponding updates to the SearchParameter table are applied automatically, making an explicit reindex unnecessary).
Therefore, by updating only records where History = 0, we can reasonably assume that we are updating only what is required. Any records that are excluded can be assumed to have already been handled appropriately under the latest search parameter state.
```csharp
/// Retry policy for Cosmos DB 429 (TooManyRequests) errors.
/// Uses the RetryAfter hint from Cosmos DB if available, otherwise waits 1-5 seconds.
/// </summary>
private static readonly AsyncPolicy _requestRateRetries = Policy
```
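The snippet is truncated at the policy declaration. A minimal sketch of how such a Polly policy might be completed, assuming the 429 surfaces as a `RequestRateExceededException` carrying a nullable `RetryAfter` `TimeSpan` (both assumptions, not confirmed by the diff shown here):

```csharp
// Hedged sketch of a Polly retry policy matching the doc comment above.
// RequestRateExceededException and its RetryAfter property are assumptions.
private static readonly AsyncPolicy _requestRateRetries = Policy
    .Handle<RequestRateExceededException>()
    .WaitAndRetryAsync(
        retryCount: 3,
        sleepDurationProvider: (attempt, exception, context) =>
            (exception as RequestRateExceededException)?.RetryAfter
                ?? TimeSpan.FromSeconds(Random.Shared.Next(1, 6)), // fallback: 1-5 seconds
        onRetryAsync: (exception, delay, attempt, context) => Task.CompletedTask);
```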
👏👏👏
But also - why only retry 3 times? I think a background job could retry many, many times and still be okay. For reindex, the job could persist through a period of heavy traffic.
Also could be worth seeing if we can use the background retry policy from this class:
I considered moving it there as well, but wanted to keep it simple and isolated to what is needed in Reindex for now.
Regarding the retry count, I chose 3 to remain consistent and as a happy medium between failing fast and retrying indefinitely.
Given your knowledge of customer experience, would it make sense to put these in a configuration class that can be overridden by an environment variable in a production environment?
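A sketch of the configuration-class idea from the comment above; the class, section, and property names are all hypothetical, not part of the PR.

```csharp
// Hypothetical options class - names and defaults are illustrative only.
public class ReindexRetryOptions
{
    public const string SectionName = "Reindex";

    // Could be overridden in production via e.g. Reindex__RequestRateRetryCount=10.
    public int RequestRateRetryCount { get; set; } = 3;
    public int MinRetryDelaySeconds { get; set; } = 1;
    public int MaxRetryDelaySeconds { get; set; } = 5;
}

// ASP.NET Core registration, binding the section (environment variables override
// appsettings.json values through the default configuration providers):
// services.Configure<ReindexRetryOptions>(configuration.GetSection(ReindexRetryOptions.SectionName));
```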
Description
This pull request improves the reliability and resilience of the FHIR reindexing jobs by introducing smarter retry logic for transient database errors and by making the resource fetching process more robust against out-of-memory (OOM) errors. The changes include new retry policies for Cosmos DB rate limiting, batch size adjustments in response to OOM exceptions, and a refactored approach to processing resource batches in both SQL Server and Cosmos DB scenarios.
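The OOM-driven batch size adjustment mentioned in the description could look roughly like this. It is a sketch only: `FetchResourcesAsync` and the single-retry shape are assumptions; only `FallbackBatchSizeOnOOM` and `_effectiveBatchSize` are named by the PR.

```csharp
// Illustrative sketch - FetchResourcesAsync and the retry shape are assumptions.
private async Task<IReadOnlyList<ResourceWrapper>> FetchWithOomFallbackAsync(CancellationToken ct)
{
    try
    {
        return await FetchResourcesAsync(_effectiveBatchSize, ct);
    }
    catch (OutOfMemoryException)
    {
        // Shrink the runtime batch size and retry the fetch with less memory pressure.
        _effectiveBatchSize = Math.Min(_effectiveBatchSize, FallbackBatchSizeOnOOM);
        return await FetchResourcesAsync(_effectiveBatchSize, ct);
    }
}
```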
Resilience and Retry Logic Improvements:
Added a Cosmos DB 429 (TooManyRequests) retry policy that honors the `RetryAfter` hint from exceptions, and combined it with the existing SQL timeout retry policy for search parameter status updates. All updates to search parameter statuses now use this combined policy to handle both SQL and Cosmos DB transient errors. [1] [2] [3] [4]

Out-of-Memory Handling and Batch Size Management:

Introduced a fallback batch size (`FallbackBatchSizeOnOOM`) and an `_effectiveBatchSize` field in `ReindexProcessingJob` to dynamically reduce the number of resources fetched per batch if an `OutOfMemoryException` is encountered. The batch size is now adjustable at runtime to avoid memory pressure. `OutOfMemoryException` is now allowed to propagate so that the caller can handle it by reducing the batch size and retrying. [1] [2] [3]

Refactored Batch Processing Logic:

Refactored `ProcessQueryAsync` to determine whether to use surrogate ID batching (SQL Server) or continuation token pagination (Cosmos DB), and to delegate to specialized methods for each scenario. This approach improves memory safety and ensures that large datasets are handled efficiently without risking OOM errors.

Related issues
Addresses AB#180785 and AB#180786.
Testing
Describe how this change was tested.
FHIR Team Checklist
Semver Change (docs)
Patch|Skip|Feature|Breaking (reason)