Skip to content

Fix reindex scalability#5324

Merged
jestradaMS merged 7 commits intomainfrom
users/jestrada/reindex-fixes-20250112
Jan 13, 2026
Merged

Fix reindex scalability#5324
jestradaMS merged 7 commits intomainfrom
users/jestrada/reindex-fixes-20250112

Conversation

@jestradaMS
Copy link
Contributor

@jestradaMS jestradaMS commented Jan 12, 2026

Description

This pull request introduces several improvements to the reindexing job orchestration and SQL Server search logic to enhance scalability and reliability, especially for large datasets. The key changes include batching surrogate ID range queries to prevent timeouts, streaming job creation for better pipelining, and simplifying SQL logic for surrogate ID queries. Additionally, new configuration options are added to control batching behavior.

Reindexing Scalability and Pipelining Improvements:

  • The reindex orchestrator now fetches surrogate ID ranges in batches (controlled by the new NumberOfParallelRecordRanges setting in ReindexJobConfiguration) and streams job creation, allowing workers to begin processing before all ranges are fetched. This reduces timeouts and speeds up large reindex operations. [1] [2] [3]

  • Refactored job creation logic into a new CreateAndEnqueueJobDefinitionsAsync method to support immediate, batched job enqueuing and cleaner code organization. [1] [2]

  • Improved logging for job creation and enqueuing, providing more granular information about batches and resource types. [1] [2]

SQL Server Search Query Simplification and Logging:

  • Consolidated the logic for querying surrogate IDs with and without searchParamHash into a single method, removing the redundant SearchForReindexSurrogateIdsWithoutSearchParamHashAsync and simplifying SQL command construction. [1] [2] [3] [4]

  • Added detailed logging to SQL Server search operations to aid in debugging and monitoring batch sizes and query parameters.

Related issues

Addresses AB#180432.

Testing

Describe how this change was tested.

FHIR Team Checklist

  • Update the title of the PR to be succinct and less than 65 characters
  • Add a milestone to the PR for the sprint that it is merged (i.e. add S47)
  • Tag the PR with the type of update: Bug, Build, Dependencies, Enhancement, New-Feature or Documentation
  • Tag the PR with Open source, Azure API for FHIR (CosmosDB or common code) or Azure Healthcare APIs (SQL or common code) to specify where this change is intended to be released.
  • Tag the PR with Schema Version backward compatible or Schema Version backward incompatible or Schema Version unchanged if this adds or updates Sql script which is/is not backward compatible with the code.
  • When changing or adding behavior, if your code modifies the system design or changes design assumptions, please create and include an ADR.
  • CI is green before merge Build Status
  • Review squash-merge requirements

Semver Change (docs)

Patch|Skip|Feature|Breaking (reason)

- Consolidate SearchForReindexSurrogateIdsBySearchParamHashAsync to handle
  both with and without searchParamHash cases using batched queries
- Remove unbatched SearchForReindexSurrogateIdsWithoutSearchParamHashAsync
  which caused full table scans on large Resource tables (154M+ rows)
- Add NumberOfParallelRecordRanges config property (default 100) to
  ReindexJobConfiguration for controlling batch size
- Refactor ReindexOrchestratorJob.EnqueueQueryProcessingJobsAsync to fetch
  surrogate ID ranges in batches, following the Export job pattern
- Stream and enqueue job definitions immediately after each batch to allow
  workers to start processing sooner
- Add cancellation checks between batches for graceful shutdown
@jestradaMS jestradaMS added this to the CY25Q3/2Wk13 milestone Jan 12, 2026
@jestradaMS jestradaMS requested a review from a team as a code owner January 12, 2026 18:38
@jestradaMS jestradaMS added the Bug Bug bug bug. label Jan 12, 2026
@jestradaMS jestradaMS added Azure Healthcare APIs Label denotes that the issue or PR is relevant to the FHIR service in the Azure Healthcare APIs No-PaaS-breaking-change labels Jan 12, 2026
do
{
// Check for cancellation between batches
if (cancellationToken.IsCancellationRequested || _jobInfo.CancelRequested)

Check warning

Code scanning / CodeQL

Constant condition

Condition is always false because of [access to property CancelRequested](1).

Copilot Autofix

AI about 2 months ago

General fix approach: Remove the constant &&/|| component from the condition so that only dynamic, meaningful checks remain. Here, we retain cancellationToken.IsCancellationRequested and remove _jobInfo.CancelRequested from the two conditions that CodeQL has identified as involving a constant false value.

Concrete best fix for this file:

  • At line 416, simplify:
    • From if (cancellationToken.IsCancellationRequested || _jobInfo.CancelRequested)
    • To if (cancellationToken.IsCancellationRequested)
  • At line 499, simplify:
    • From if (cancellationToken.IsCancellationRequested || _jobInfo.CancelRequested)
    • To if (cancellationToken.IsCancellationRequested)

This keeps the standard cancellation token behavior intact and matches what the analyzer already believes is the effective logic (since _jobInfo.CancelRequested is always false in this context). No new imports, methods, or definitions are required; we are only simplifying existing conditions within EnqueueQueryProcessingJobsAsync.

Suggested changeset 1
src/Microsoft.Health.Fhir.Core/Features/Operations/Reindex/ReindexOrchestratorJob.cs

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/src/Microsoft.Health.Fhir.Core/Features/Operations/Reindex/ReindexOrchestratorJob.cs b/src/Microsoft.Health.Fhir.Core/Features/Operations/Reindex/ReindexOrchestratorJob.cs
--- a/src/Microsoft.Health.Fhir.Core/Features/Operations/Reindex/ReindexOrchestratorJob.cs
+++ b/src/Microsoft.Health.Fhir.Core/Features/Operations/Reindex/ReindexOrchestratorJob.cs
@@ -413,7 +413,7 @@
 
         private async Task<IReadOnlyList<long>> EnqueueQueryProcessingJobsAsync(IList<JobInfo> existingProcessingJobs, CancellationToken cancellationToken)
         {
-            if (cancellationToken.IsCancellationRequested || _jobInfo.CancelRequested)
+            if (cancellationToken.IsCancellationRequested)
             {
                 throw new OperationCanceledException("Reindex operation cancelled by customer.");
             }
@@ -496,7 +496,7 @@
                         do
                         {
                             // Check for cancellation between batches
-                            if (cancellationToken.IsCancellationRequested || _jobInfo.CancelRequested)
+                            if (cancellationToken.IsCancellationRequested)
                             {
                                 throw new OperationCanceledException("Reindex operation cancelled by customer.");
                             }
EOF
@@ -413,7 +413,7 @@

private async Task<IReadOnlyList<long>> EnqueueQueryProcessingJobsAsync(IList<JobInfo> existingProcessingJobs, CancellationToken cancellationToken)
{
if (cancellationToken.IsCancellationRequested || _jobInfo.CancelRequested)
if (cancellationToken.IsCancellationRequested)
{
throw new OperationCanceledException("Reindex operation cancelled by customer.");
}
@@ -496,7 +496,7 @@
do
{
// Check for cancellation between batches
if (cancellationToken.IsCancellationRequested || _jobInfo.CancelRequested)
if (cancellationToken.IsCancellationRequested)
{
throw new OperationCanceledException("Reindex operation cancelled by customer.");
}
Copilot is powered by AI and may make mistakes. Always verify output.
Unable to commit as this autofix suggestion is now outdated
…eIdRanges pattern

- Add SetupGetSurrogateIdRangesMock helper method that uses callback to check startId
- Mocks now return empty list when startId > rangeEnd, simulating batched behavior
- Fixes infinite loop/timeout in tests caused by do-while (ranges.Any()) loop
- Converts all 12 static .Returns() mocks to callback-based returns
- All 23 ReindexOrchestratorJobTests now pass
…s to ensure all resource types are accounted for
@jestradaMS jestradaMS enabled auto-merge (squash) January 12, 2026 22:20
@jestradaMS
Copy link
Contributor Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jestradaMS jestradaMS merged commit 9365e39 into main Jan 13, 2026
60 of 61 checks passed
@jestradaMS jestradaMS deleted the users/jestrada/reindex-fixes-20250112 branch January 13, 2026 00:43
mikaelweave pushed a commit that referenced this pull request Jan 13, 2026
* Fix SQL execution timeout in reindex operations

* Enhance reindex job creation logic to support resuming from existing jobs and prevent duplicates

* fix: Update ReindexOrchestratorJobTests mocks for batched GetSurrogateIdRanges pattern

* fix: Extend timeout for processing jobs in ReindexOrchestratorJobTests to ensure all resource types are accounted for

* fix: Simplify reindex job creation logic by removing existing job checks and enhancing logging

* fix: Rename NumberOfParallelRecordRanges to NumberOfRecordRanges for clarity in reindex configuration per Sergey's feedback

* fix: Remove redundant job cancellation check in EnqueueQueryProcessingJobsAsync method
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Azure Healthcare APIs Label denotes that the issue or PR is relevant to the FHIR service in the Azure Healthcare APIs Bug Bug bug bug. No-PaaS-breaking-change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants