fix celery worker profile for s3 access #333
Conversation
```diff
  backend_url = "s3://"
  if aws_role is None:
-     aws_profile = os.getenv("AWS_PROFILE", infra_config().profile_ml_worker)
+     aws_profile = os.getenv("S3_WRITE_AWS_PROFILE", infra_config().profile_ml_worker)
```
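The fallback chain in the diff can be sketched as a small standalone function. This is a minimal sketch, not the repository's actual code: `DEFAULT_ML_WORKER_PROFILE` is a hypothetical stand-in for `infra_config().profile_ml_worker`, and the env-var lookup mirrors the change above (which the discussion below later reverts).

```python
import os
from typing import Optional

# Hypothetical stand-in for infra_config().profile_ml_worker; the real
# value comes from the repository's infra config, not a constant.
DEFAULT_ML_WORKER_PROFILE = "ml-worker"


def resolve_celery_s3_profile(aws_role: Optional[str] = None) -> Optional[str]:
    """Pick the AWS profile for the Celery s3 result backend.

    Mirrors the diff: when no IAM role is supplied, fall back to the
    S3_WRITE_AWS_PROFILE env var, then to the default ml-worker profile.
    """
    if aws_role is not None:
        # An explicit role takes precedence; no named profile is used.
        return None
    return os.getenv("S3_WRITE_AWS_PROFILE", DEFAULT_ML_WORKER_PROFILE)
```

The failure mode described in the summary follows directly from this chain: if the profile that wins the fallback lacks s3 write permissions, task results are never persisted and the write fails silently.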
do we have S3_WRITE_AWS_PROFILE inside the celery forwarder? just want to make sure that this works for (gateway, endpoint builder, celery forwarder, batch job orchestration pod)
I guess this s3_write_aws_profile means a few things: in the gateway I think it has to do with the s3 repository for storing llm fine tune jobs, while in celery workers it determines where completed tasks get written. I think this is fine though, since it's all related to s3.
Guess the perms for this are actually "read, write, maybe list" as opposed to just "write", which is probably fine.
discussed offline. reverting back to profile_ml_worker
seanshi-scale left a comment
Could you test getting async task results from the gateway as well? I think if we know that works, then that plus (endpoint builds successfully, batch job completes successfully with all the task results) should be sufficient for testing. LGTM once that's done.
Follow-up to #327
Pull Request Summary
The celery task queue is instantiated with the `aws_profile` provided in this line. Tasks were completed, but the final state was not written to s3 and failed silently, since the profile used was not the one with s3 write permissions. The real issue is only with the Celery forwarder, which is created with the wrong profile, so the change applies only to its instantiation.
Testing Plan
Created a new deployment in our clusters and launched a batch job. The pods were built correctly and the job completed successfully.