Additional retries on DatabricksError: Workspace X exceeded the concurrent limit of Y requests (#391)
…urrent limit of Y requests`

Platform doesn't seem to send HTTP 429 correctly with this response, otherwise the error would have been `TimeoutError`. This PR adds retries for error responses with this message.

Fixes databrickslabs/ucx#401

Signed-off-by: Serge Smertin <[email protected]>
Codecov Report: All modified lines are covered by tests ✅
DatabricksError: Workspace X exceeded the concurrent limit of Y requests (SEV1)
Signed-off-by: Serge Smertin <[email protected]>
pietern left a comment:
This does return a 429.
I analyzed the code in the client and the following happens:
- `_perform` calls `_make_nicer_error`
- `_make_nicer_error` calls `_parse_retry_after` if the status code is 429
- `_parse_retry_after` returns `None` if the response doesn't have a `Retry-After` header
So this response likely does have the right status code, but doesn't include the `Retry-After` header and then goes through the `_is_retryable` path.
@pietern yes, but now I want it to go through https://github.com/databricks/databricks-sdk-py/blob/main/databricks/sdk/retries.py#L40-L41. All of the transient errors eventually have to go away (https://github.com/databricks/databricks-sdk-py/blob/main/databricks/sdk/core.py#L1096-L1102), but there will always be cases where something is not mapped correctly. It's a fix for our SEV1 issue: databrickslabs/ucx#401 sometimes happens 6 hours into running a workflow, which can be frustrating.
Why not retry on all 429s? I get the other bits, but for this issue, if we simply retry on all 429s, we fix this particular error, as well as the others that already produce the right status code but don't include the `Retry-After` header.
@pietern we can, sure, preferably after #376 is merged. With #376 it'll be straightforward to add; without #376 it's going to be more hairy than adding a string match 🤷. And it's a fix for our SEV1.
## Changes

Change the HTTP client to always retry on 429 and 503, regardless of whether the `Retry-After` header is present. If absent, default to 1 second, as we do today. Subsumes #391 and #396 while we're still discussing error architecture in the SDK.

## Tests

Unit tests ensure that 429 and 503 work with and without the `Retry-After` header.
#402 ensures that the SDK will retry when the `Retry-After` header is missing, and I'm filing ES tickets internally with teams whose APIs are not compliant, i.e. that respond with other status codes when rate limits are reached. Would that be sufficient? I'd prefer not to add more edge cases to the SDK if possible.
Merge queue setting changed
Thanks for this contribution. We're closing this PR as part of regular housekeeping since it has been inactive for over a year. This is not a reflection on the quality of the work -- if the changes are still relevant, feel free to re-open the PR or open a new one and we'll be happy to review it. |
## Changes

Platform doesn't seem to send the `Retry-After` header correctly with this response, otherwise the error would have been `TimeoutError`. This PR adds retries for error responses with this message.

Fixes databrickslabs/ucx#401

## Tests

- `make test` run locally
- `make fmt` applied