Properly return PENDING status for docker image batch jobs/fine tune jobs #318

seanshi-scale merged 11 commits into main
Conversation
```python
logger.exception("Got an exception when trying to list the Jobs")
raise EndpointResourceInfraException from exc

core_client = get_kubernetes_core_client()
```
Apparently this fn is used in the list docker image batch jobs API call for some reason, which feels wrong to me. Not sure how we didn't catch this before, but oh well.
Could you link to the code that is the issue? Is the problem that we are breaking abstraction layers?
model_engine_server/domain/use_cases/batch_job_use_cases.py:ListDockerImageBatchJobV1UseCase.execute
This feels like we're breaking abstraction layers, at least to me (e.g. batch jobs and cron jobs should be different), although I guess that broken abstraction gets propagated through the API as well (since ListBatchJobs has a trigger_id parameter).
yixu34 left a comment
Can we add some unit tests for these gateways? We could mock out the k8s layer, since we're at the end of the line anyway.
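A minimal sketch of what mocking the k8s layer could look like, using `unittest.mock` and a toy `get_pod_phases` helper — the helper name and test are illustrative, not the repo's actual gateway API:

```python
from types import SimpleNamespace
from unittest import mock


def get_pod_phases(core_client, namespace: str, job_name: str) -> list:
    """Toy stand-in for the gateway's pod lookup: the Job controller
    labels each pod it creates with job-name=<job>, so a Job's pods
    can be listed via that label selector."""
    pods = core_client.list_namespaced_pod(
        namespace=namespace, label_selector=f"job-name={job_name}"
    )
    return [pod.status.phase for pod in pods.items]


def test_get_pod_phases_with_mocked_k8s_client():
    # Fake pod object shaped like the kubernetes client's V1Pod.
    fake_pod = SimpleNamespace(status=SimpleNamespace(phase="Pending"))
    core_client = mock.Mock()
    core_client.list_namespaced_pod.return_value = SimpleNamespace(items=[fake_pod])

    assert get_pod_phases(core_client, "default", "my-job") == ["Pending"]
    core_client.list_namespaced_pod.assert_called_once_with(
        namespace="default", label_selector="job-name=my-job"
    )
```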
```python
return BatchJobStatus.RUNNING  # empirically this doesn't happen
if status.active is not None and status.active > 0:
    return BatchJobStatus.RUNNING  # TODO this might be a mix of pending and running
for pod in pods:
```
Might be worth leaving a comment here on why a single Job resource could have multiple pods, to clarify the logic.
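For context: a Job can own several pods at once (retries under `backoffLimit`, or `parallelism > 1`), and the Job's `status.active` counts Pending pods as active. A minimal sketch of the pending-vs-running distinction, assuming standard pod phases and a stand-in enum rather than the repo's actual code:

```python
from enum import Enum
from typing import Optional, Sequence


class BatchJobStatus(str, Enum):  # stand-in for the repo's enum
    PENDING = "PENDING"
    RUNNING = "RUNNING"


def status_for_active_job(
    active: Optional[int], pod_phases: Sequence[str]
) -> BatchJobStatus:
    # 'active' counts pods that have started but not finished, which
    # includes pods still Pending (unscheduled, or pulling images), so
    # inspect the pods themselves before reporting RUNNING.
    if active and any(phase == "Running" for phase in pod_phases):
        return BatchJobStatus.RUNNING
    return BatchJobStatus.PENDING
```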
Before this change, jobs would immediately report RUNNING even if no pods had been allocated, or if all pods were still PENDING.
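For reference, the pod lookup behind this could look roughly like the following with the official `kubernetes` Python client; the namespace and job name are placeholders, and the repo wraps this behind `get_kubernetes_core_client()`:

```python
from kubernetes import client, config

config.load_kube_config()  # devbox-style auth; in-cluster code would use load_incluster_config()
core_v1 = client.CoreV1Api()

# The Job controller labels each pod it creates with job-name=<job-name>.
pods = core_v1.list_namespaced_pod(
    namespace="default", label_selector="job-name=my-batch-job"
)
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)  # phase is e.g. "Pending" or "Running"
```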
Testing:
- Started the gateway on a devbox, made get/list docker image batch job requests against our clusters, and confirmed that a job previously shown as RUNNING correctly changed to PENDING.
- Also added unit tests.