Port partition split recursion fix from hotfix to main #44770
Co-authored-by: simorenoh <[email protected]>
Pull request overview
This PR ports a critical bug fix for infinite recursion during partition split (410 Gone) errors from the hotfix branch to the main branch. The fix prevents a timeout issue caused by recursive 410 retry logic when the SDK attempts to refresh partition key ranges during a partition split.
Changes:
- Added `_internal_pk_range_fetch` flag to mark internal partition key range queries and skip 410 retry logic, breaking the recursion cycle
- Implemented state reset (`_has_started=False`, `_continuation=None`) before 410 retries to ensure fetch loops execute correctly
- Added structured logging for 410 handling with debug/warning/error levels
- Comprehensive unit tests (645 lines) covering state reset, retry logic, recursion prevention, and memory bounds
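
The recursion guard and state reset described above can be illustrated with a minimal, self-contained sketch. This is not the SDK's code: `GoneError`, `ToyExecutionContext`, `MAX_GONE_RETRIES`, `fetch_page`, and `refresh_pk_ranges` are hypothetical names invented for illustration. Only the `_internal_pk_range_fetch` flag and the `_has_started` / `_continuation` fields come from this PR.

```python
class GoneError(Exception):
    """Hypothetical stand-in for a 410 Gone error during a partition split."""


class ToyExecutionContext:
    """Simplified model of a paged fetch with 410 retry (illustration only)."""

    MAX_GONE_RETRIES = 3  # assumed bound, not taken from the SDK

    def __init__(self, fetch_page, refresh_pk_ranges):
        self._fetch_page = fetch_page
        self._refresh_pk_ranges = refresh_pk_ranges
        self._has_started = False
        self._continuation = None

    def fetch_all(self, **kwargs):
        internal = kwargs.get("_internal_pk_range_fetch", False)
        retries = 0
        while True:
            try:
                return self._drain()
            except GoneError:
                # Internal PK range queries never retry on 410; this is
                # what breaks the refresh -> 410 -> refresh cycle.
                if internal or retries >= self.MAX_GONE_RETRIES:
                    raise
                retries += 1
                self._refresh_pk_ranges(_internal_pk_range_fetch=True)
                # Reset state so the fetch loop below runs again from the start.
                self._has_started = False
                self._continuation = None

    def _drain(self):
        items = []
        while self._continuation is not None or not self._has_started:
            page, self._continuation = self._fetch_page(self._continuation)
            self._has_started = True
            items.extend(page)
        return items


calls = {"n": 0}

def fetch_page(continuation):
    calls["n"] += 1
    if calls["n"] == 2:  # simulate a split hitting the second request
        raise GoneError("partition split")
    if continuation is None:
        return [1, 2], "page-2"
    return [3], None

refreshes = []

def refresh_pk_ranges(**kwargs):
    refreshes.append(kwargs)

ctx = ToyExecutionContext(fetch_page, refresh_pk_ranges)
result = ctx.fetch_all()
```

In this toy model the single simulated 410 triggers one refresh, marked as internal, and the query restarts from the first page after the reset.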
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| CHANGELOG.md | Added bug fix entry under 4.15.0b3 |
| azure/cosmos/_execution_context/base_execution_context.py | Added recursion prevention, state reset, and logging for 410 retry logic (sync) |
| azure/cosmos/_execution_context/aio/base_execution_context.py | Added recursion prevention, state reset, and logging for 410 retry logic (async) |
| azure/cosmos/_routing/routing_map_provider.py | Propagated _internal_pk_range_fetch flag to PK range queries and added logging (sync) |
| azure/cosmos/_routing/aio/routing_map_provider.py | Propagated _internal_pk_range_fetch flag to PK range queries and added logging (async) |
| azure/cosmos/_synchronized_request.py | Removed internal flag before HTTP layer (sync) |
| azure/cosmos/aio/_asynchronous_request.py | Removed internal flag before HTTP layer (async) |
| tests/test_partition_split_retry_unit.py | Comprehensive unit tests for state reset, retry behavior, and memory bounds (sync) |
| tests/test_partition_split_retry_unit_async.py | Comprehensive unit tests for state reset, retry behavior, and memory bounds (async) |
| tests/routing/test_routing_map_provider.py | Updated test signature to accept **kwargs |
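
The "Removed internal flag before HTTP layer" rows can be pictured with a small sketch: an internal-only keyword argument is dropped before the remaining options reach the transport. The function names `send_request` and `strict_transport` are hypothetical and do not reflect the actual contents of `_synchronized_request.py`; only the flag name is taken from this PR.

```python
INTERNAL_FLAG = "_internal_pk_range_fetch"


def send_request(transport_send, request, **kwargs):
    """Forward a request, stripping the internal-only marker first."""
    # The flag only steers retry logic; the transport must never see it.
    kwargs.pop(INTERNAL_FLAG, None)
    return transport_send(request, **kwargs)


def strict_transport(request, timeout=None):
    """Toy transport that would raise TypeError on unknown keyword arguments."""
    return (request, timeout)


out = send_request(
    strict_transport,
    "GET /pkranges",
    timeout=5,
    _internal_pk_range_fetch=True,
)
```

Using `pop` with a default keeps the call safe whether or not the flag was set by the caller.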
/azp run python - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).
allenkim0129
left a comment
LGTM.
Has some minor comments
sdk/cosmos/azure-cosmos/azure/cosmos/_execution_context/aio/base_execution_context.py
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py
…t retry logic Co-authored-by: simorenoh <[email protected]>
Just as a heads up, I was blocked by some firewall rules while working on your feedback.

Warning: Firewall rules blocked me from connecting to one or more addresses. I tried to connect to the following addresses, but was blocked by firewall rules:

If you need me to access, download, or install something from one of these locations, you can either:
bambriz
left a comment
Thanks @simorenoh, looks good to me.
/check-enforcer override
Port Cosmos DB Partition Split Recursion Fix from PR #44649
This task involves porting critical bug fixes from PR #44649 that was merged to the hotfix branch. The fix addresses an infinite recursion issue during partition split (410 error) handling.
Plan Checklist:
Summary
Successfully ported all changes from PR #44649 to the main branch and addressed PR review feedback:
Key Changes:
Files Modified (10):
All files are syntactically correct and ready for testing with Cosmos DB emulator.