Skip to content

fix celery worker profile for s3 access#333

Merged
saiatmakuri merged 3 commits intomainfrom
saiatmakuri/fix-celery-app-profile-pt2
Oct 19, 2023
Merged

fix celery worker profile for s3 access#333
saiatmakuri merged 3 commits intomainfrom
saiatmakuri/fix-celery-app-profile-pt2

Conversation

@saiatmakuri
Copy link
Copy Markdown
Contributor

@saiatmakuri saiatmakuri commented Oct 19, 2023

Follow-up to #327

Pull Request Summary

The celery task queue is instantiated with aws_profile provided in this line. Tasks were completed but the final state was not written to s3 and silently failed since the profile used was not the one with s3 write permissions. The real issue is only with the Celery forwarder that is created with the wrong profile. Change only for its instantiation.

Testing Plan

Create a new deployment in our clusters and launched a batch job. The pods were built correctly and the job successfully completed.

@saiatmakuri saiatmakuri added the bug Something isn't working label Oct 19, 2023
@saiatmakuri saiatmakuri self-assigned this Oct 19, 2023
backend_url = "s3://"
if aws_role is None:
aws_profile = os.getenv("AWS_PROFILE", infra_config().profile_ml_worker)
aws_profile = os.getenv("S3_WRITE_AWS_PROFILE", infra_config().profile_ml_worker)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have S3_WRITE_AWS_PROFILE inside the celery forwarder? just want to make sure that this works for (gateway, endpoint builder, celery forwarder, batch job orchestration pod)

Copy link
Copy Markdown
Contributor

@seanshi-scale seanshi-scale Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this s3_write_aws_profile means a few things, e.g. in the gateway I think it has to do with some s3 repository for storing llm fine tune jobs, in celery workers it means where completed tasks get written, I think this is fine though since it's all related to s3

guess the perms for this are actually "read, write, maybe list" as opposed to just "write" as well, which probably is fine

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed offline. reverting back to profile_ml_worker

Copy link
Copy Markdown
Contributor

@seanshi-scale seanshi-scale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you test getting async task results from the gateway as well? I think if we know that works then that + (endpoint builds successfully, batch job completes successfully with all the task results) should be sufficient for testing. LGTM once that's done

@saiatmakuri saiatmakuri merged commit fe24d63 into main Oct 19, 2023
@saiatmakuri saiatmakuri deleted the saiatmakuri/fix-celery-app-profile-pt2 branch October 19, 2023 22:59
yixu34 added a commit that referenced this pull request Oct 23, 2023
yixu34 added a commit that referenced this pull request Oct 23, 2023
saiatmakuri added a commit that referenced this pull request Oct 23, 2023
saiatmakuri added a commit that referenced this pull request Oct 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants