Skip to content

Properly return PENDING status for docker image batch jobs/fine tune jobs#318

Merged
seanshi-scale merged 11 commits intomainfrom
seanshi/dibjb-pending-actually
Oct 12, 2023
Merged

Properly return PENDING status for docker image batch jobs/fine tune jobs#318
seanshi-scale merged 11 commits intomainfrom
seanshi/dibjb-pending-actually

Conversation

@seanshi-scale
Copy link
Copy Markdown
Contributor

@seanshi-scale seanshi-scale commented Oct 11, 2023

Before, jobs would immediately start as RUNNING even if no pods were allocated, or if pods were PENDING.

testing:
started gateway on devbox, made requests to get/list docker image batch jobs on our clusters, saw that a RUNNING job had changed to PENDING correctly.
also unit tests

@seanshi-scale seanshi-scale self-assigned this Oct 11, 2023
@seanshi-scale seanshi-scale marked this pull request as ready for review October 11, 2023 23:58
logger.exception("Got an exception when trying to list the Jobs")
raise EndpointResourceInfraException from exc

core_client = get_kubernetes_core_client()
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apparently this fn is used in the list docker image batch jobs api call for some reason, feels wrong to me, idk how we didn't catch this before, but oh well

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you like to the code that is the issue? Is the problem that we are breaking abstraction layers?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

model_engine_server/domain/use_cases/batch_job_use_cases.py:ListDockerImageBatchJobV1UseCase.execute

This feels like we're breaking abstraction layers to me at least (e.g. batch jobs and cron jobs should be different), although I guess that broken abstraction gets propagated through the API as well (since the ListBatchJobs has a trigger_id parameter)

Copy link
Copy Markdown
Member

@yixu34 yixu34 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add some unit tests for these gateways? We could mock out the k8s layer, since we're at the end of the line anyway.

return BatchJobStatus.RUNNING # empirically this doesn't happen
if status.active is not None and status.active > 0:
return BatchJobStatus.RUNNING # TODO this might be a mix of pending and running
for pod in pods:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth leaving a comment here on why a single Job resource could have multiple pods due to clarify the logic here.

logger.exception("Got an exception when trying to list the Jobs")
raise EndpointResourceInfraException from exc

core_client = get_kubernetes_core_client()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you like to the code that is the issue? Is the problem that we are breaking abstraction layers?

@seanshi-scale seanshi-scale merged commit d30a1a5 into main Oct 12, 2023
@seanshi-scale seanshi-scale deleted the seanshi/dibjb-pending-actually branch October 12, 2023 21:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants