fix celery worker profile for s3 access #333
Conversation
```diff
  backend_url = "s3://"
  if aws_role is None:
-     aws_profile = os.getenv("AWS_PROFILE", infra_config().profile_ml_worker)
+     aws_profile = os.getenv("S3_WRITE_AWS_PROFILE", infra_config().profile_ml_worker)
```
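The fallback chain in the diff can be sketched as a small standalone function. This is a minimal sketch, not the repository's actual code: `DEFAULT_ML_WORKER_PROFILE` is a hypothetical stand-in for `infra_config().profile_ml_worker`, and the env-var lookup mirrors the change above (which the discussion below later reverts).

```python
import os
from typing import Optional

# Hypothetical stand-in for infra_config().profile_ml_worker; the real
# value comes from the repository's infra config, not a constant.
DEFAULT_ML_WORKER_PROFILE = "ml-worker"


def resolve_celery_s3_profile(aws_role: Optional[str] = None) -> Optional[str]:
    """Pick the AWS profile for the Celery s3 result backend.

    Mirrors the diff: when no IAM role is supplied, fall back to the
    S3_WRITE_AWS_PROFILE env var, then to the default ml-worker profile.
    """
    if aws_role is not None:
        # An explicit role takes precedence; no named profile is used.
        return None
    return os.getenv("S3_WRITE_AWS_PROFILE", DEFAULT_ML_WORKER_PROFILE)
```

The failure mode described in the summary follows directly from this chain: if the profile that wins the fallback lacks s3 write permissions, task results are never persisted and the write fails silently.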
do we have S3_WRITE_AWS_PROFILE inside the celery forwarder? just want to make sure that this works for (gateway, endpoint builder, celery forwarder, batch job orchestration pod)
I guess this s3_write_aws_profile means a few things: in the gateway I think it has to do with the s3 repository for storing llm fine tune jobs, while in celery workers it determines where completed tasks get written. I think this is fine though, since it's all related to s3.
Guess the perms for this are actually "read, write, maybe list" as opposed to just "write", which is probably fine.
discussed offline. reverting back to profile_ml_worker
seanshi-scale left a comment
Could you test getting async task results from the gateway as well? I think if we know that works, then that plus (endpoint builds successfully, batch job completes successfully with all the task results) should be sufficient for testing. LGTM once that's done.
Follow-up to #327
Pull Request Summary
The celery task queue is instantiated with the `aws_profile` provided in this line. Tasks were completed, but the final state was not written to s3 and failed silently, since the profile used was not the one with s3 write permissions. The real issue is only with the Celery forwarder, which is created with the wrong profile, so the change applies only to its instantiation.
Testing Plan
Created a new deployment in our clusters and launched a batch job. The pods were built correctly and the job completed successfully.