Try to fix async requests getting stuck #466
# backoff_policy, etc., then we can expose broker_transport_options in the top-level celery() wrapper function.
# Going to try this with defaults first.
out_broker_transport_options["region"] = os.environ.get("AWS_REGION", "us-west-2")
out_broker_transport_options["wait_time_seconds"] = 0
just curious: can you explain how this works?
+1, would be good to explain what's going on with these changes, and/or any testing results showing the impact of the changes.
Made this change because I found celery/celery#7283
It seems like the Celery default sets wait_time_seconds to 10, which I think means that when the SQS queue is empty, a request from a Celery worker to SQS waits for up to 10 seconds, so that if a new SQS message arrives in that time, SQS can send it back to the worker (as opposed to returning an empty response right away). So it kind of makes sense that changing this param to 0 would prevent SQS from returning a message (and making that message invisible) for a request whose Celery worker has already died?
I made a test deployment with this change. Yesterday I still sometimes saw stuck requests when I manually killed the endpoint pod, but when I tried to reproduce that today I didn't see any stuck requests after many attempts, so maybe yesterday was just a fluke 😕
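For reference, here is a minimal sketch of how these options might be assembled. The helper name `build_sqs_broker_transport_options` is hypothetical and not part of this PR; the option keys and the region default come from the diff, and the comments reflect my current understanding of long polling rather than confirmed behavior.

```python
import os


def build_sqs_broker_transport_options(wait_time_seconds: int = 0) -> dict:
    """Hypothetical helper: collect SQS transport options for the Celery broker."""
    options = {}
    # Fall back to us-west-2 when AWS_REGION is not set, as in the diff.
    options["region"] = os.environ.get("AWS_REGION", "us-west-2")
    # wait_time_seconds is the SQS long-polling wait. With a nonzero value,
    # a receive request on an empty queue stays open for up to that many
    # seconds. Assumption: setting it to 0 makes each request return
    # immediately, so no request is left waiting after its worker has died.
    options["wait_time_seconds"] = wait_time_seconds
    return options


if __name__ == "__main__":
    print(build_sqs_broker_transport_options())
```

The resulting dict would then be passed as Celery's `broker_transport_options` setting.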
# Going to try this with defaults first.
out_broker_transport_options["region"] = os.environ.get("AWS_REGION", "us-west-2")
out_broker_transport_options["wait_time_seconds"] = 0
out_broker_transport_options["polling_interval"] = 5
Also, does this add any overhead for async requests because we're polling more often? Would the volume of async GET requests scale proportionally with this?
I think this controls how often the Celery workers poll SQS. It shouldn't affect the number of inbound requests to the gateway.
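As a rough back-of-envelope check on polling volume, the sketch below estimates how many SQS receive calls idle workers would make per hour. It assumes a simplified model in which each idle worker repeats one receive call (blocking for up to `wait_time_seconds`) followed by a sleep of `polling_interval` seconds; the function name and the worker count in the example are hypothetical, and real traffic will differ once messages are flowing.

```python
def estimated_idle_polls_per_hour(
    num_workers: int, polling_interval: float, wait_time_seconds: float = 0
) -> float:
    """Rough estimate of SQS receive calls per hour from idle workers.

    Simplified model (an assumption, not measured behavior): each idle
    worker loops over one receive call that waits up to wait_time_seconds,
    then sleeps polling_interval seconds before polling again.
    """
    cycle_seconds = polling_interval + wait_time_seconds
    if cycle_seconds <= 0:
        raise ValueError("polling_interval + wait_time_seconds must be positive")
    return num_workers * 3600 / cycle_seconds


if __name__ == "__main__":
    # Values from the diff (polling_interval=5, wait_time_seconds=0),
    # with a hypothetical fleet of 4 workers.
    print(estimated_idle_polls_per_hour(4, polling_interval=5, wait_time_seconds=0))
```

Under this model the polling volume scales with the number of workers, not with the number of requests hitting the gateway.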
seanshi-scale left a comment
Could you also put the explanation for wait_time_seconds=0 in the code as well? Think it's something that's quite nontrivial and would make it a lot less confusing in the future.
Pull Request Summary
Change SQS broker options for Celery; decrease wait time for SQS long polling
Not strictly necessary, but also changed echo_server for testing
Test Plan and Usage Guide
Test deployment