
Try to fix async requests getting stuck #466

Merged
squeakymouse merged 4 commits into main from katiewu/fix-async-endpoints-stuck
Mar 11, 2024

Conversation

@squeakymouse
Contributor

Pull Request Summary

Change SQS broker options for Celery; decrease wait time for SQS long polling

Not strictly necessary, but also changed echo_server for testing

Test Plan and Usage Guide

Test deployment

@squeakymouse squeakymouse requested a review from a team March 9, 2024 01:03
# backoff_policy, etc., then we can expose broker_transport_options in the top-level celery() wrapper function.
# Going to try this with defaults first.
out_broker_transport_options["region"] = os.environ.get("AWS_REGION", "us-west-2")
out_broker_transport_options["wait_time_seconds"] = 0
Contributor

just curious: can you explain how this works?

Contributor

+1, would be good to explain what's going on with these changes, and/or any testing results showing the impact of the changes.

Contributor Author

Made this change because I found celery/celery#7283

It seems like the Celery default sets wait_time_seconds to 10, which I think means that when the SQS queue is empty, a Celery worker's request to SQS is held open for up to 10 seconds, so that if a new SQS message arrives during that window, SQS can return it on that request instead of returning an empty response. It kind of makes sense that changing this param to 0 prevents an in-flight request from returning a message (and making that message invisible) after the Celery worker that made the request has already died?
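The race described above can be sketched with a toy model (FakeSQSQueue and its receive method are illustrative stand-ins, not real boto3 or Celery APIs): with long polling, a message that arrives while a dead worker's request is still open gets claimed and goes invisible; with wait_time_seconds=0, the empty-queue request returns immediately and claims nothing.

```python
# Hypothetical minimal model of the SQS long-polling race; names are
# illustrative, not real boto3/kombu APIs.
class FakeSQSQueue:
    def __init__(self):
        self.messages = []   # visible messages
        self.invisible = []  # messages claimed by an in-flight receive

    def receive(self, wait_time_seconds, message_arrives_during_wait):
        # With long polling (wait > 0), SQS holds the request open; a message
        # that arrives during the wait is returned and marked invisible,
        # even if the worker that issued the request has since died.
        if self.messages:
            msg = self.messages.pop(0)
            self.invisible.append(msg)
            return msg
        if wait_time_seconds > 0 and message_arrives_during_wait:
            self.invisible.append("late-message")
            return "late-message"
        return None  # short polling: empty response, nothing claimed

# Long polling: a dead worker's open request still claims the message,
# which then sits invisible until the visibility timeout expires.
q = FakeSQSQueue()
print(q.receive(wait_time_seconds=10, message_arrives_during_wait=True))
print(q.invisible)  # ['late-message']

# Short polling (wait_time_seconds=0): the request returns empty right away.
q2 = FakeSQSQueue()
print(q2.receive(wait_time_seconds=0, message_arrives_during_wait=True))
print(q2.invisible)  # []
```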

I made a test deployment with this change; yesterday, I still sometimes saw stuck requests when I'd manually kill the endpoint pod, but I was trying to reproduce that today and didn't see any stuck requests after trying many times, so maybe yesterday was just a fluke 😕

# Going to try this with defaults first.
out_broker_transport_options["region"] = os.environ.get("AWS_REGION", "us-west-2")
out_broker_transport_options["wait_time_seconds"] = 0
out_broker_transport_options["polling_interval"] = 5
Contributor

Also, does this add any additional overhead for async requests because we're polling more often? Would the volume of async GET requests scale proportionally to this?

Contributor

I think this controls how often the Celery workers poll SQS. It shouldn't affect the number of inbound requests to the gateway.
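As a rough sanity check on the overhead question, here's a back-of-the-envelope estimate (the fleet size is hypothetical, and "one ReceiveMessage call per polling_interval per idle worker" is a simplification of the SQS transport's polling loop):

```python
# Estimate of ReceiveMessage call volume from idle workers under this PR's
# settings. Assumes each idle worker issues one SQS poll per polling_interval.
polling_interval = 5  # seconds, from this PR
workers = 10          # hypothetical fleet size

calls_per_worker_per_min = 60 / polling_interval
total_calls_per_min = workers * calls_per_worker_per_min
print(calls_per_worker_per_min)  # 12.0
print(total_calls_per_min)       # 120.0
```

This load comes from the workers polling SQS, independent of any inbound traffic to the gateway.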

Contributor

@seanshi-scale seanshi-scale left a comment

Could you also put the explanation for wait_time_seconds=0 in the code as well? It's quite nontrivial, and a comment would make it a lot less confusing in the future.

@squeakymouse squeakymouse merged commit 4b012f0 into main Mar 11, 2024
@squeakymouse squeakymouse deleted the katiewu/fix-async-endpoints-stuck branch March 11, 2024 21:21
This was referenced Mar 30, 2024