
Reindex job reliability improvements#5331

Merged
jestradaMS merged 5 commits into main from users/jestrada/reindexjobupdates-01142026
Jan 14, 2026

Conversation


@jestradaMS jestradaMS commented Jan 14, 2026

Description

This pull request improves the reliability and resilience of the FHIR reindexing jobs by introducing smarter retry logic for transient database errors and by making the resource fetching process more robust against out-of-memory (OOM) errors. The changes include new retry policies for Cosmos DB rate limiting, batch size adjustments in response to OOM exceptions, and a refactored approach to processing resource batches in both SQL Server and Cosmos DB scenarios.

Resilience and Retry Logic Improvements:

  • Added a retry policy for Cosmos DB 429 (TooManyRequests) errors that uses the RetryAfter hint from the exception when available, and combined it with the existing SQL timeout retry policy for search parameter status updates. All updates to search parameter statuses now go through this combined policy, so both SQL and Cosmos DB transient errors are handled.
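The retry behavior described above can be sketched as follows. This is an illustrative Python sketch only: the actual implementation is C# using Polly, and the names here (`TooManyRequestsError`, `compute_backoff`, `with_rate_limit_retries`) are hypothetical. The 1-5 second fallback window and the RetryAfter hint come from the PR; the retry count of 3 matches the value discussed in the review thread below.

```python
import random

class TooManyRequestsError(Exception):
    """Stand-in for a Cosmos DB 429 response; may carry a RetryAfter hint."""
    def __init__(self, retry_after=None):
        super().__init__("429 TooManyRequests")
        self.retry_after = retry_after  # seconds, or None

def compute_backoff(exc, rng=random.random):
    """Honor the server's RetryAfter hint when present,
    otherwise wait a random 1-5 seconds."""
    if isinstance(exc, TooManyRequestsError) and exc.retry_after is not None:
        return exc.retry_after
    return 1 + 4 * rng()

def with_rate_limit_retries(operation, max_retries=3, sleep=lambda s: None):
    """Retry `operation` on 429s up to `max_retries` times,
    sleeping for the computed backoff between attempts."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TooManyRequestsError as exc:
            if attempt == max_retries:
                raise
            sleep(compute_backoff(exc))
```

In the real code this policy is combined with the existing SQL timeout retry policy (Polly's `WrapAsync`), so a single call site covers both classes of transient error.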

Out-of-Memory Handling and Batch Size Management:

  • Introduced a fallback batch size (FallbackBatchSizeOnOOM) and an _effectiveBatchSize field in ReindexProcessingJob to dynamically reduce the number of resources fetched per batch if an OutOfMemoryException is encountered. The batch size is now adjustable at runtime to avoid memory pressure.
  • Updated the resource fetching logic to use the current effective batch size and to propagate OutOfMemoryException so that the caller can handle it by reducing the batch size and retrying.
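The reduce-and-retry shape of this logic can be sketched as below. This is a Python illustration of the C# behavior described above; the fallback value of 100 and the single-retry shape are assumptions, not taken from the PR, and `BatchFetcher` is a hypothetical name (the real field lives in ReindexProcessingJob as `_effectiveBatchSize`).

```python
FALLBACK_BATCH_SIZE_ON_OOM = 100  # hypothetical value; the PR does not state it

class BatchFetcher:
    """Fetches resources with a batch size that shrinks after an OOM."""
    def __init__(self, fetch, configured_batch_size):
        self._fetch = fetch                       # fetch(batch_size) -> resources
        self._effective_batch_size = configured_batch_size

    def fetch_batch(self):
        try:
            return self._fetch(self._effective_batch_size)
        except MemoryError:
            if self._effective_batch_size <= FALLBACK_BATCH_SIZE_ON_OOM:
                raise  # already at the floor; let the job surface the failure
            # Drop to the fallback size and retry at the smaller size.
            # The reduced size sticks for the rest of the job, avoiding
            # repeated memory pressure on subsequent batches.
            self._effective_batch_size = FALLBACK_BATCH_SIZE_ON_OOM
            return self._fetch(self._effective_batch_size)
```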

Refactored Batch Processing Logic:

  • Refactored ProcessQueryAsync to determine whether to use surrogate ID batching (SQL Server) or continuation token pagination (Cosmos DB), and to delegate to specialized methods for each scenario. This approach improves memory safety and ensures that large datasets are handled efficiently without risking OOM errors.
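The two paths above can be sketched as follows. This Python sketch is illustrative of the C# refactoring, not the actual code: the function names and the `dict`-based query are hypothetical, and the chunk size of 2000 mirrors the internal MemorySafeBatchSize mentioned in the semver notes below.

```python
def process_query(query, fetch_by_surrogate_id, fetch_by_token, reindex):
    """Dispatch: SQL Server queries carry a surrogate-ID range,
    Cosmos DB queries paginate with continuation tokens."""
    if query.get("start_surrogate_id") is not None:
        return process_surrogate_id_batches(query, fetch_by_surrogate_id, reindex)
    return process_continuation_tokens(fetch_by_token, reindex)

def process_surrogate_id_batches(query, fetch, reindex, chunk=2000):
    """SQL Server path: advance the start of the range after each
    memory-safe chunk instead of materializing the whole range."""
    start, end, total = query["start_surrogate_id"], query["end_surrogate_id"], 0
    while start <= end:
        # fetch returns (resources, last surrogate ID in this chunk)
        resources, last_id = fetch(start, end, chunk)
        if not resources:
            break
        reindex(resources)
        total += len(resources)
        start = last_id + 1  # advance StartSurrogateId past this chunk
    return total

def process_continuation_tokens(fetch, reindex):
    """Cosmos DB path: follow continuation tokens until exhausted."""
    total, token = 0, None
    while True:
        resources, token = fetch(token)
        reindex(resources)
        total += len(resources)
        if token is None:
            return total
```

The key point is that the SQL path never holds more than one chunk of resources in memory at a time, while the Cosmos path relies on the server-driven page size behind each continuation token.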

Related issues

Addresses AB#180785 and AB#180786.

Testing

Describe how this change was tested.

FHIR Team Checklist

  • Update the title of the PR to be succinct and less than 65 characters
  • Add a milestone to the PR for the sprint in which it is merged (e.g., add S47)
  • Tag the PR with the type of update: Bug, Build, Dependencies, Enhancement, New-Feature or Documentation
  • Tag the PR with Open source, Azure API for FHIR (CosmosDB or common code), or Azure Healthcare APIs (SQL or common code) to specify where this change is intended to be released.
  • Tag the PR with Schema Version backward compatible, Schema Version backward incompatible, or Schema Version unchanged if this adds or updates a SQL script which is/is not backward compatible with the code.
  • When changing or adding behavior, if your code modifies the system design or changes design assumptions, please create and include an ADR.
  • CI is green before merge
  • Review squash-merge requirements

Semver Change (docs)

Patch|Skip|Feature|Breaking (reason)

- Add internal MemorySafeBatchSize (2000) to prevent OutOfMemoryException when processing large reindex batches with large FHIR resources
- Split processing into SQL Server path (surrogate ID batching) and Cosmos DB path (continuation tokens)
- SQL Server path fetches resources in smaller memory-safe chunks by advancing StartSurrogateId after each batch
- Customer-configured MaximumNumberOfResourcesPerQuery still controls total job size
- Convert BulkUpdateSearchParameterIndicesAsync version conflict from error to warning (SqlServerFhirDataStore)
- Add unit test for memory-safe batch processing
@jestradaMS jestradaMS added this to the CY25Q3/2Wk13 milestone Jan 14, 2026
@jestradaMS jestradaMS requested a review from a team as a code owner January 14, 2026 15:33
@jestradaMS jestradaMS added the Bug, Bug-Reliability, Azure API for FHIR, Azure Healthcare APIs, and No-PaaS-breaking-change labels Jan 14, 2026
{
// Version conflicts can occur when resources are updated during reindex.
// Log warning and continue - conflicting resources will be picked up in the next reindex cycle.
_logger.LogWarning(ex, "Version conflict during reindex batch update. Some resources were modified during reindex and will be reprocessed in a subsequent cycle.");
A reviewer (Contributor) commented:
Can you elaborate a bit? Will we consider these resource as successfully reindexed? My guess is not, and they won't have the hash updated.

jestradaMS (Author) replied Jan 14, 2026:

Yes, we would consider this successful. At this point, the service already has full knowledge of the current search parameter state. When a record is updated by another operation, it is processed using the latest effective search parameter configuration (for example, if a new search parameter has been added, the corresponding updates to the SearchParameter table are applied automatically, making an explicit reindex unnecessary).

Therefore, by updating only records where History = 0, we can reasonably assume that we are updating only what is required. Any records that are excluded can be assumed to have already been handled appropriately under the latest search parameter state.

/// <summary>
/// Retry policy for Cosmos DB 429 (TooManyRequests) errors.
/// Uses the RetryAfter hint from Cosmos DB if available, otherwise waits 1-5 seconds.
/// </summary>
private static readonly AsyncPolicy _requestRateRetries = Policy
A reviewer (Contributor) commented:
👏👏👏

But also - why only retry 3 times? I think a background job could retry many, many times and still be okay. For reindex, the job could persist through a period of heavy traffic.

Also could be worth seeing if we can use the background retry policy from this class:

https://github.com/microsoft/fhir-server/blob/main/src/Microsoft.Health.Fhir.CosmosDb/Features/Storage/RetryExceptionPolicyFactory.cs

jestradaMS (Author) replied:

I considered moving it there as well, but wanted to keep it simple and isolated to what Reindex needs for now.

Regarding the retry count, I chose 3 to stay consistent with existing policies and as a happy medium between resilience and failing fast.

Another reviewer (Contributor) commented:

Given your knowledge of customer experience, would it make sense to put these in a configuration class that can be overridden by an environment variable in a production environment?
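The reviewer's suggestion amounts to something like the sketch below: a default retry count that production can override through an environment variable. This is a hypothetical Python illustration (the real service is C# and would use its configuration system); the variable name `REINDEX_RATE_LIMIT_RETRY_COUNT` is invented for the example.

```python
import os

def get_retry_count(default=3):
    """Read the retry count from an environment variable, falling back to
    the compiled-in default when the variable is absent or invalid."""
    raw = os.environ.get("REINDEX_RATE_LIMIT_RETRY_COUNT")
    if raw is None:
        return default
    try:
        return max(0, int(raw))  # clamp negatives to zero retries
    except ValueError:
        return default           # ignore malformed values rather than crash
```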

@jestradaMS jestradaMS enabled auto-merge (squash) January 14, 2026 20:21
@jestradaMS jestradaMS merged commit c8b4412 into main Jan 14, 2026
60 of 62 checks passed
@jestradaMS jestradaMS deleted the users/jestrada/reindexjobupdates-01142026 branch January 14, 2026 20:59