
Reindex job reliability improvements#5331

Merged
jestradaMS merged 5 commits into main from users/jestrada/reindexjobupdates-01142026
Jan 14, 2026

Conversation


@jestradaMS jestradaMS commented Jan 14, 2026

Description

This pull request improves the reliability and resilience of the FHIR reindexing jobs by introducing smarter retry logic for transient database errors and by making the resource fetching process more robust against out-of-memory (OOM) errors. The changes include new retry policies for Cosmos DB rate limiting, batch size adjustments in response to OOM exceptions, and a refactored approach to processing resource batches in both SQL Server and Cosmos DB scenarios.

Resilience and Retry Logic Improvements:

  • Added a retry policy for Cosmos DB 429 (TooManyRequests) errors that uses the RetryAfter hint from the exception when available, and combined it with the existing SQL timeout retry policy for search parameter status updates. All updates to search parameter statuses now go through this combined policy, so both SQL and Cosmos DB transient errors are handled.
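The retry behavior described above can be sketched as follows. This is an illustrative Python sketch only: the actual implementation is C# using Polly, and the names here (`TooManyRequestsError`, `compute_backoff`, `with_rate_limit_retries`) are hypothetical. The 1-5 second fallback window and the RetryAfter hint come from the PR; the retry count of 3 matches the value discussed in the review thread below.

```python
import random

class TooManyRequestsError(Exception):
    """Stand-in for a Cosmos DB 429 response; may carry a RetryAfter hint."""
    def __init__(self, retry_after=None):
        super().__init__("429 TooManyRequests")
        self.retry_after = retry_after  # seconds, or None

def compute_backoff(exc, rng=random.random):
    """Honor the server's RetryAfter hint when present,
    otherwise wait a random 1-5 seconds."""
    if isinstance(exc, TooManyRequestsError) and exc.retry_after is not None:
        return exc.retry_after
    return 1 + 4 * rng()

def with_rate_limit_retries(operation, max_retries=3, sleep=lambda s: None):
    """Retry `operation` on 429s up to `max_retries` times,
    sleeping for the computed backoff between attempts."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TooManyRequestsError as exc:
            if attempt == max_retries:
                raise
            sleep(compute_backoff(exc))
```

In the real code this policy is combined with the existing SQL timeout retry policy (Polly's `WrapAsync`), so a single call site covers both classes of transient error.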

Out-of-Memory Handling and Batch Size Management:

  • Introduced a fallback batch size (FallbackBatchSizeOnOOM) and an _effectiveBatchSize field in ReindexProcessingJob to dynamically reduce the number of resources fetched per batch if an OutOfMemoryException is encountered. The batch size is now adjustable at runtime to avoid memory pressure.
  • Updated the resource fetching logic to use the current effective batch size and to propagate OutOfMemoryException so that the caller can handle it by reducing the batch size and retrying.
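The reduce-and-retry shape of this logic can be sketched as below. This is a Python illustration of the C# behavior described above; the fallback value of 100 and the single-retry shape are assumptions, not taken from the PR, and `BatchFetcher` is a hypothetical name (the real field lives in ReindexProcessingJob as `_effectiveBatchSize`).

```python
FALLBACK_BATCH_SIZE_ON_OOM = 100  # hypothetical value; the PR does not state it

class BatchFetcher:
    """Fetches resources with a batch size that shrinks after an OOM."""
    def __init__(self, fetch, configured_batch_size):
        self._fetch = fetch                       # fetch(batch_size) -> resources
        self._effective_batch_size = configured_batch_size

    def fetch_batch(self):
        try:
            return self._fetch(self._effective_batch_size)
        except MemoryError:
            if self._effective_batch_size <= FALLBACK_BATCH_SIZE_ON_OOM:
                raise  # already at the floor; let the job surface the failure
            # Drop to the fallback size and retry at the smaller size.
            # The reduced size sticks for the rest of the job, avoiding
            # repeated memory pressure on subsequent batches.
            self._effective_batch_size = FALLBACK_BATCH_SIZE_ON_OOM
            return self._fetch(self._effective_batch_size)
```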

Refactored Batch Processing Logic:

  • Refactored ProcessQueryAsync to determine whether to use surrogate ID batching (SQL Server) or continuation token pagination (Cosmos DB), and to delegate to specialized methods for each scenario. This approach improves memory safety and ensures that large datasets are handled efficiently without risking OOM errors.
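The two paths above can be sketched as follows. This Python sketch is illustrative of the C# refactoring, not the actual code: the function names and the `dict`-based query are hypothetical, and the chunk size of 2000 mirrors the internal MemorySafeBatchSize mentioned in the semver notes below.

```python
def process_query(query, fetch_by_surrogate_id, fetch_by_token, reindex):
    """Dispatch: SQL Server queries carry a surrogate-ID range,
    Cosmos DB queries paginate with continuation tokens."""
    if query.get("start_surrogate_id") is not None:
        return process_surrogate_id_batches(query, fetch_by_surrogate_id, reindex)
    return process_continuation_tokens(fetch_by_token, reindex)

def process_surrogate_id_batches(query, fetch, reindex, chunk=2000):
    """SQL Server path: advance the start of the range after each
    memory-safe chunk instead of materializing the whole range."""
    start, end, total = query["start_surrogate_id"], query["end_surrogate_id"], 0
    while start <= end:
        # fetch returns (resources, last surrogate ID in this chunk)
        resources, last_id = fetch(start, end, chunk)
        if not resources:
            break
        reindex(resources)
        total += len(resources)
        start = last_id + 1  # advance StartSurrogateId past this chunk
    return total

def process_continuation_tokens(fetch, reindex):
    """Cosmos DB path: follow continuation tokens until exhausted."""
    total, token = 0, None
    while True:
        resources, token = fetch(token)
        reindex(resources)
        total += len(resources)
        if token is None:
            return total
```

The key point is that the SQL path never holds more than one chunk of resources in memory at a time, while the Cosmos path relies on the server-driven page size behind each continuation token.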

Related issues

Addresses AB#180785 and AB#180786.

Testing

Describe how this change was tested.

FHIR Team Checklist

  • Update the title of the PR to be succinct and less than 65 characters
  • Add a milestone to the PR for the sprint in which it is merged (e.g., add S47)
  • Tag the PR with the type of update: Bug, Build, Dependencies, Enhancement, New-Feature or Documentation
  • Tag the PR with Open source, Azure API for FHIR (CosmosDB or common code), or Azure Healthcare APIs (SQL or common code) to specify where this change is intended to be released.
  • Tag the PR with Schema Version backward compatible, Schema Version backward incompatible, or Schema Version unchanged if this adds or updates a SQL script which is/is not backward compatible with the code.
  • When changing or adding behavior, if your code modifies the system design or changes design assumptions, please create and include an ADR.
  • CI is green before merge
  • Review squash-merge requirements

Semver Change (docs)

Patch|Skip|Feature|Breaking (reason)

- Add internal MemorySafeBatchSize (2000) to prevent OutOfMemoryException when processing large reindex batches with large FHIR resources
- Split processing into SQL Server path (surrogate ID batching) and Cosmos DB path (continuation tokens)
- SQL Server path fetches resources in smaller memory-safe chunks by advancing StartSurrogateId after each batch
- Customer-configured MaximumNumberOfResourcesPerQuery still controls total job size
- Convert BulkUpdateSearchParameterIndicesAsync version conflict from error to warning (SqlServerFhirDataStore)
- Add unit test for memory-safe batch processing
@jestradaMS jestradaMS added this to the CY25Q3/2Wk13 milestone Jan 14, 2026
@jestradaMS jestradaMS requested a review from a team as a code owner January 14, 2026 15:33
@jestradaMS jestradaMS added the Bug, Bug-Reliability, Azure API for FHIR, Azure Healthcare APIs, and No-PaaS-breaking-change labels Jan 14, 2026
{
// Version conflicts can occur when resources are updated during reindex.
// Log warning and continue - conflicting resources will be picked up in the next reindex cycle.
_logger.LogWarning(ex, "Version conflict during reindex batch update. Some resources were modified during reindex and will be reprocessed in a subsequent cycle.");
A reviewer (Contributor) commented:
Can you elaborate a bit? Will we consider these resource as successfully reindexed? My guess is not, and they won't have the hash updated.

jestradaMS (Author) replied Jan 14, 2026:

Yes, we would consider this successful. At this point, the service already has full knowledge of the current search parameter state. When a record is updated by another operation, it is processed using the latest effective search parameter configuration (for example, if a new search parameter has been added, the corresponding updates to the SearchParameter table are applied automatically, making an explicit reindex unnecessary).

Therefore, by updating only records where History = 0, we can reasonably assume that we are updating only what is required. Any records that are excluded can be assumed to have already been handled appropriately under the latest search parameter state.

/// <summary>
/// Retry policy for Cosmos DB 429 (TooManyRequests) errors.
/// Uses the RetryAfter hint from Cosmos DB if available, otherwise waits 1-5 seconds.
/// </summary>
private static readonly AsyncPolicy _requestRateRetries = Policy
A reviewer (Contributor) commented:
👏👏👏

But also - why only retry 3 times? I think a background job could retry many, many times and still be okay. For reindex, the job could persist through a period of heavy traffic.

Also could be worth seeing if we can use the background retry policy from this class:

https://github.com/microsoft/fhir-server/blob/main/src/Microsoft.Health.Fhir.CosmosDb/Features/Storage/RetryExceptionPolicyFactory.cs

jestradaMS (Author) replied:

I considered moving it there as well, but wanted to keep it simple and isolated to what Reindex needs for now.

Regarding the retry count, I chose 3 to stay consistent with existing policies and as a happy medium between resilience and failing fast.

Another reviewer (Contributor) commented:

Given your knowledge of customer experience, would it make sense to put these in a configuration class that can be overridden by an environment variable in a production environment?
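The reviewer's suggestion amounts to something like the sketch below: a default retry count that production can override through an environment variable. This is a hypothetical Python illustration (the real service is C# and would use its configuration system); the variable name `REINDEX_RATE_LIMIT_RETRY_COUNT` is invented for the example.

```python
import os

def get_retry_count(default=3):
    """Read the retry count from an environment variable, falling back to
    the compiled-in default when the variable is absent or invalid."""
    raw = os.environ.get("REINDEX_RATE_LIMIT_RETRY_COUNT")
    if raw is None:
        return default
    try:
        return max(0, int(raw))  # clamp negatives to zero retries
    except ValueError:
        return default           # ignore malformed values rather than crash
```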

@jestradaMS jestradaMS enabled auto-merge (squash) January 14, 2026 20:21
@jestradaMS jestradaMS merged commit c8b4412 into main Jan 14, 2026
60 of 62 checks passed
@jestradaMS jestradaMS deleted the users/jestrada/reindexjobupdates-01142026 branch January 14, 2026 20:59