Retry policies don't sleep after operations time out#18548
Retry policies don't sleep after operations time out#18548chlowell merged 3 commits intoAzure:masterfrom
Conversation
| # the authentication policy failed such that the client's request can't | ||
| # succeed--we'll never have a response to it, so propagate the exception | ||
| raise | ||
| except (ServiceRequestTimeoutError, ServiceResponseTimeoutError): |
There was a problem hiding this comment.
Given this scenario, timeout setting is 10 min and default timeout is 5 min.
We should fail on the second TimeoutError but not the first one, right?
There was a problem hiding this comment.
I don't understand what you mean by a first vs. second error. _configure_timeout raises one of these errors when absolute_timeout <= 0. I believe we never want to retry when that's true.
There was a problem hiding this comment.
I meant (ServiceRequestTimeoutError, ServiceResponseTimeoutError) might be raised from transport. In this case, absolute_timeout may still >0 and we do want to retry.
There was a problem hiding this comment.
Nothing else in the library raises these errors today but it's a fair point that a transport or policy could do so. I've removed this except block in favor of checking absolute_timeout in the broader handler.
| raise | ||
| except AzureError as err: | ||
| if self._is_method_retryable(retry_settings, request.http_request): | ||
| if absolute_timeout > 0 and self._is_method_retryable(retry_settings, request.http_request): |
There was a problem hiding this comment.
If absolute_timeout <= 0, we already raised in https://github.com/Azure/azure-sdk-for-python/blob/master/sdk/core/azure-core/azure/core/pipeline/policies/_retry.py#L295?
Not true?
There was a problem hiding this comment.
True, but the problem is that we handle that error here by sleeping and iterating again.
There was a problem hiding this comment.
I might mis-understand, why absolute_timeout can be <= 0 here?
Do you want to move
end_time = time.time()
if absolute_timeout:
absolute_timeout -= (end_time - start_time)here?
There was a problem hiding this comment.
This code is tricky to refactor without breaking something. For now we need to decrement absolute_timeout in the finally block so it's done after every iteration, including ones that end with the continue in the try block. I'm picking low hanging fruit in this PR. I think fixing everything (e.g. #18552) will require more thought and a larger change.
As for how absolute_timeout can be non-positive here, note that nothing breaks the loop when it is. _configure_timeout raises an exception when absolute_timeout <= 0 but this except block handles that exception because it's an AzureError. If the request is retryable, the except block sleeps and we go back to the try block, where _configure_timeout raises again. Repeat until retries are exhausted.
xiangyan99
left a comment
There was a problem hiding this comment.
LGTM.
Please update changelog. :)
Review request for Microsoft.ContainerService to add version 2022-03-02-preview (Azure#18633) * Adds base for updating Microsoft.ContainerService from version preview/2022-02-02-preview to version 2022-03-02-preview * Updates readme * Updates API version in new specs and examples * feat: add `/rotateServiceAccountSigningKeys` API (Azure#18359) * feat: add `/rotateServiceAccountSigningKeys` API * chore: remove duplicated description * feat: add `workloadIdentity` settings to `SecurityProfile` (Azure#18360) * fix: update example name (Azure#18382) * add creationData to mc data (Azure#18414) * add creationData to mc data * fix test * fix format * Add API properties and example JSON for Web App Routing of IngressProfile (Azure#18564) * Add API properties and example JSON for Web App Routing of IngressProfile. * Add a ending period for description to match the style in all other "descriptions" in the same file. * Update readmes for the 2022-03-02-preview dev branch of container service (Azure#18358) * update readme * update sdk readmes * Agentpool alias minor version 2022-03-02-preview (Azure#18381) * Add field currentOrchestratorVersion to support Agentpool alias minor version * Add new exmaple for alias minor version * Remove example for PrettierCheck * Fix typo * Latest patch version supported is 1.22.7 at the moment * Address comments - refine descriptions for fields * feat: add ManagedCluster StorageProfile in 0302preview (Azure#18590) Signed-off-by: Ji An Liu <[email protected]> * Swagger change for ignore-pod-disruption-budget (Azure#18548) * Swagger change for ignore-pod-disruption-budget * Change ignorePodDisruptionBudget to string in example file. * Change ignorePodDisruptionBudget to boolean type. * add effectiveNoProxy for AKS (Azure#18544) * add effectiveNoProxy for AKS * add to correct api * fix lowercase -> uppercase O * spelling * Replace common type definitions with references since 2022-03-02-preview for Microsoft.ContainerService (Azure#18567) * replace Resource * replace SystemData * replace parameters * change track2 to python Co-authored-by: hbc <[email protected]> Co-authored-by: Qingqing <[email protected]> Co-authored-by: Yi Zhang <[email protected]> Co-authored-by: Thalia Wang <[email protected]> Co-authored-by: Ji'an Liu <[email protected]> Co-authored-by: Tong Chen <[email protected]> Co-authored-by: Ace Eldeib <[email protected]>
After an operation times out,
_configure_timeoutraises a subclass of AzureError. Async/RetryPolicy handles AzureError by sleeping and continuing the retry_active loop. The policy won't actually send another request, because_configure_timeoutwill raise at the start of every loop iteration, but it will pointlessly sleep after each iteration until it exhausts its allowed retries.This PR prevents this pointless sleeping by re-raising
_configure_timeoutexceptions. This is safe today because no other policy raises those exceptions. For example, transports raise ServiceRequest/ResponseError when connections time out. However, in principle something downstream from RetryPolicy could raise e.g. ServiceResponseTimeoutError given a request that should be retried; with this change, the request would not be retried. I don't expect that scenario, but I point out the possibility so you can disagree with me as you like 😸