
Try to fix async requests getting stuck #466

Merged
squeakymouse merged 4 commits into main from katiewu/fix-async-endpoints-stuck
Mar 11, 2024

Conversation

@squeakymouse
Contributor

Pull Request Summary

Change SQS broker options for Celery; decrease wait time for SQS long polling

Not strictly necessary, but also changed echo_server for testing

Test Plan and Usage Guide

Test deployment

@squeakymouse squeakymouse requested a review from a team March 9, 2024 01:03
# backoff_policy, etc., then we can expose broker_transport_options in the top-level celery() wrapper function.
# Going to try this with defaults first.
out_broker_transport_options["region"] = os.environ.get("AWS_REGION", "us-west-2")
out_broker_transport_options["wait_time_seconds"] = 0
Contributor

just curious: can you explain how this works?

Contributor

+1, would be good to explain what's going on with these changes, and/or any testing results showing the impact of the changes.

Contributor Author

Made this change because I found celery/celery#7283

It seems like the Celery default sets wait_time_seconds to 10, which I think means that when the SQS queue is empty, a Celery worker's request to SQS is held open for up to 10 seconds, so that if a new SQS message arrives during that window, SQS can return it on that request instead of returning an empty response. It kind of makes sense that changing this param to 0 prevents an in-flight request from returning a message (and making that message invisible) after the Celery worker that made the request has already died?
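The race described above can be sketched with a toy model (FakeSQSQueue and its receive method are illustrative stand-ins, not real boto3 or Celery APIs): with long polling, a message that arrives while a dead worker's request is still open gets claimed and goes invisible; with wait_time_seconds=0, the empty-queue request returns immediately and claims nothing.

```python
# Hypothetical minimal model of the SQS long-polling race; names are
# illustrative, not real boto3/kombu APIs.
class FakeSQSQueue:
    def __init__(self):
        self.messages = []   # visible messages
        self.invisible = []  # messages claimed by an in-flight receive

    def receive(self, wait_time_seconds, message_arrives_during_wait):
        # With long polling (wait > 0), SQS holds the request open; a message
        # that arrives during the wait is returned and marked invisible,
        # even if the worker that issued the request has since died.
        if self.messages:
            msg = self.messages.pop(0)
            self.invisible.append(msg)
            return msg
        if wait_time_seconds > 0 and message_arrives_during_wait:
            self.invisible.append("late-message")
            return "late-message"
        return None  # short polling: empty response, nothing claimed

# Long polling: a dead worker's open request still claims the message,
# which then sits invisible until the visibility timeout expires.
q = FakeSQSQueue()
print(q.receive(wait_time_seconds=10, message_arrives_during_wait=True))
print(q.invisible)  # ['late-message']

# Short polling (wait_time_seconds=0): the request returns empty right away.
q2 = FakeSQSQueue()
print(q2.receive(wait_time_seconds=0, message_arrives_during_wait=True))
print(q2.invisible)  # []
```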

I made a test deployment with this change; yesterday, I still sometimes saw stuck requests when I'd manually kill the endpoint pod, but I was trying to reproduce that today and didn't see any stuck requests after trying many times, so maybe yesterday was just a fluke 😕

# Going to try this with defaults first.
out_broker_transport_options["region"] = os.environ.get("AWS_REGION", "us-west-2")
out_broker_transport_options["wait_time_seconds"] = 0
out_broker_transport_options["polling_interval"] = 5
Contributor

Also, does this add any additional overhead for async requests because we're polling more often? Would the volume of async GET requests scale proportionally to this?

Contributor

I think this controls how often the Celery workers poll SQS. It shouldn't affect the number of inbound requests to the gateway.
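As a rough sanity check on the overhead question, here's a back-of-the-envelope estimate (the fleet size is hypothetical, and "one ReceiveMessage call per polling_interval per idle worker" is a simplification of the SQS transport's polling loop):

```python
# Estimate of ReceiveMessage call volume from idle workers under this PR's
# settings. Assumes each idle worker issues one SQS poll per polling_interval.
polling_interval = 5  # seconds, from this PR
workers = 10          # hypothetical fleet size

calls_per_worker_per_min = 60 / polling_interval
total_calls_per_min = workers * calls_per_worker_per_min
print(calls_per_worker_per_min)  # 12.0
print(total_calls_per_min)       # 120.0
```

This load comes from the workers polling SQS, independent of any inbound traffic to the gateway.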

Contributor

@seanshi-scale seanshi-scale left a comment

Could you also put the explanation for wait_time_seconds=0 in the code as well? It's quite nontrivial, and a comment would make it a lot less confusing in the future.

@squeakymouse squeakymouse merged commit 4b012f0 into main Mar 11, 2024
@squeakymouse squeakymouse deleted the katiewu/fix-async-endpoints-stuck branch March 11, 2024 21:21
This was referenced Mar 30, 2024