[SDK] Use Katib SDK for E2E Tests #2075
Conversation
I've changed the tune, train, and CMA-ES SDK examples.
@andreyvelich Thanks for improving our E2E. Also, it would be good to modify the E2E test to verify Python SDK operations on various Python versions, as discussed in #2057 (comment). Though we can follow up on that in another PR.
```python
max_trial_count = experiment.spec.max_trial_count + 1
parallel_trial_count = experiment.spec.parallel_trial_count + 1
print(
    f"Restarting Experiment {exp_namespace}/{exp_name} "
```
Why is this restarting the Experiment for random search?
@johnugeorge We also use the random search Experiment to test the long-running Experiment case.
```python
verify_experiment_results(katib_client, experiment, exp_name, exp_namespace)

# Describe the Experiment and Suggestion.
print(os.popen(f"kubectl describe experiment {exp_name} -n {exp_namespace}").read())
```
Can we add verbose as an optional parameter that can be enabled/disabled to control verbose logs? (Enabled by default.)
This will help in tests like conformance tests, where we are only interested in the final Experiment success/failure.
Sure, let me enable it.
@johnugeorge I enabled configurable logging for the Katib E2Es, what do you think?
```python
)
)
try:
    response = utils.FakeResponse(thread.get(constants.APISERVER_TIMEOUT))
```
I think it might help to make the TIMEOUT configurable. Can we add the duration until timeout to the function args?
WDYT?

```python
def get_experiment(
    self,
    name: str,
    namespace: str = utils.get_default_target_namespace(),
    duration: int = constants.APISERVER_TIMEOUT,
):
```
Sure, let me update it.
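The idea can be sketched as follows. This is a simplified stand-in for the real SDK method (no Kubernetes calls), and `APISERVER_TIMEOUT = 120` is an assumed default:

```python
# Sketch of the reviewer's proposal (hypothetical, not the real SDK method):
# expose the API-server timeout as a keyword argument with a sane default.
APISERVER_TIMEOUT = 120  # assumed default, in seconds

def get_experiment(name, namespace="kubeflow", timeout=APISERVER_TIMEOUT):
    # The real method would call the Kubernetes custom-objects API in a
    # background thread and wait up to `timeout` seconds: thread.get(timeout).
    return {"name": name, "namespace": namespace, "timeout": timeout}

# Callers can now override the timeout per call.
exp = get_experiment("random-example", timeout=60)
```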
```python
namespace,
constants.EXPERIMENT_PLURAL,
name,
body=client.V1DeleteOptions(),
```
Can we make client.V1DeleteOptions() configurable? It would help users set grace_period_seconds and more.
WDYT?
Yes, I think we can allow it.
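A minimal sketch of what that could look like, with a plain dict standing in for `client.V1DeleteOptions` so the snippet runs without a cluster (the function signature here is hypothetical):

```python
# Hypothetical signature: callers pass their own delete options (for example
# grace_period_seconds) instead of the client hard-coding V1DeleteOptions().
def delete_experiment(name, namespace, delete_options=None):
    # The real client would build client.V1DeleteOptions(**(delete_options or {}))
    # and pass it as `body` to delete_namespaced_custom_object.
    body = delete_options or {}
    return {"name": name, "namespace": namespace, "body": body}

resp = delete_experiment("random-example", "kubeflow",
                         delete_options={"grace_period_seconds": 30})
```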
```python
args = parser.parse_args()

if args.verbose == "False":
    logging.getLogger().setLevel(logging.WARNING)
```
Instead, set the default level to DEBUG and change it to INFO when verbose is False?
Also, log the start/end messages at INFO and everything else at DEBUG. Then, if verbose is not set, the user will just see the start logs and the final Experiment status.
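The suggestion above could be sketched like this; `configure_logging` is a hypothetical helper, not part of the PR:

```python
import logging

def configure_logging(verbose: bool) -> None:
    # Detailed logs go to DEBUG; start/end markers go to INFO. With
    # verbose=False the root logger only shows INFO and above, so users
    # see just the start logs and the final Experiment status.
    logging.getLogger().setLevel(logging.DEBUG if verbose else logging.INFO)

configure_logging(verbose=False)
logging.info("E2E started")             # always visible
logging.debug("kubectl describe ...")   # hidden unless verbose is set
```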
```python
try:
    run_e2e_experiment(katib_client, experiment, exp_name, exp_namespace)
    logging.info("---------------------------------------------------------------")
    logging.info(f"E2E is completed for Experiment: {exp_namespace}/{exp_name}")
```
For more clarity, keep "succeeded" instead of "completed".
```python
# Describe the Experiment and Suggestion.
logging.info(
    os.popen(f"kubectl describe experiment {exp_name} -n {exp_namespace}").read()
)
logging.info("---------------------------------------------------------------")
logging.info("---------------------------------------------------------------")
logging.info(
    os.popen(f"kubectl describe suggestion {exp_name} -n {exp_namespace}").read()
)
```
Is there a reason to use the kubectl command rather than the Python SDK?
It would be good to use the Python SDK.
@tenzen-y I guess we used kubectl describe for better visibility. Let me just print the Experiment and Suggestion:

```python
logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))
```
```python
try:
    run_e2e_experiment(katib_client, experiment, exp_name, exp_namespace)
    logging.info("---------------------------------------------------------------")
    logging.info(f"E2E is completed for Experiment: {exp_namespace}/{exp_name}")
    logging.info("---------------------------------------------------------------")
    logging.info("---------------------------------------------------------------")
    # Delete the Experiment.
    katib_client.delete_experiment(exp_name, exp_namespace)
except Exception as e:
    logging.info("---------------------------------------------------------------")
    logging.info(f"E2E is failed for Experiment: {exp_namespace}/{exp_name}")
    logging.info("---------------------------------------------------------------")
    logging.info("---------------------------------------------------------------")
    # Delete the Experiment and raise an Exception.
    katib_client.delete_experiment(exp_name, exp_namespace)
    raise e
```
Since we need to delete the Experiment regardless of pass/fail, I think using finally would be nice:
Suggested change:

```python
logging.info("---------------------------------------------------------------")
try:
    run_e2e_experiment(katib_client, experiment, exp_name, exp_namespace)
    logging.info(f"E2E is completed for Experiment: {exp_namespace}/{exp_name}")
except Exception as e:
    logging.info(f"E2E is failed for Experiment: {exp_namespace}/{exp_name}")
    raise e
finally:
    logging.info("---------------------------------------------------------------")
    logging.info("---------------------------------------------------------------")
    # Delete the Experiment.
    katib_client.delete_experiment(exp_name, exp_namespace)
```
Sure! Good suggestion.
```python
parser.add_argument(
    "--verbose",
    type=str,
    default=True,
    choices=("True", "False"),
```
How about using action="store_true" rather than type=str with choices=("True", "False") for the boolean type?
Then we could change the check on the following line 238 from `if args.verbose == "False":` to `if not args.verbose:`.
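A runnable sketch of this suggestion. Note that with `action="store_true"` the flag defaults to False when absent, so the check is inverted relative to the current `default=True`:

```python
import argparse
import logging

parser = argparse.ArgumentParser()
# store_true yields a real bool: False when the flag is absent, True when
# it is passed, replacing type=str with choices=("True", "False").
parser.add_argument("--verbose", action="store_true")

args = parser.parse_args(["--verbose"])  # simulating `run_e2e.py --verbose`
if not args.verbose:
    # Quiet mode: suppress everything below WARNING.
    logging.getLogger().setLevel(logging.WARNING)
```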
```python
PVCs = client.CoreV1Api().list_namespaced_persistent_volume_claim(
    exp_namespace
)
is_deleted = 1
for i in PVCs.items:
    if i.metadata.name == resource_name:
        is_deleted = 0
if is_deleted == 1:
    raise Exception(
        "PVC is deleted for FromVolume resume policy. "
        f"Alive PVCs: {[i.metadata.name for i in PVCs.items]}."
    )
```
How about changing these lines to use read_namespaced_persistent_volume_claim(resource_name, exp_namespace) and checking whether that method raises a 404 Not Found exception, and then raising our Exception?
I think that would make it easier to understand what this code wants to test.
Sure, I thought we only had the list API 😄
I am not sure why the Kubernetes Python client named the API get for Custom Resources and read for core Kubernetes resources.
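The reviewed check could look roughly like this. `read_pvc` and `NotFoundError` are stand-ins for `client.CoreV1Api().read_namespaced_persistent_volume_claim` and the Kubernetes `ApiException`, so the sketch runs without a cluster:

```python
# Stand-in for kubernetes.client.exceptions.ApiException with status == 404.
class NotFoundError(Exception):
    status = 404

def read_pvc(name, namespace, existing_pvcs):
    # The real call, read_namespaced_persistent_volume_claim, raises
    # ApiException(status=404) when the PVC does not exist.
    if name not in existing_pvcs:
        raise NotFoundError(f"PVC {namespace}/{name} not found")
    return {"name": name}

def check_pvc_alive(name, namespace, existing_pvcs):
    # The FromVolume resume policy requires the Suggestion PVC to survive.
    try:
        read_pvc(name, namespace, existing_pvcs)
    except NotFoundError:
        raise Exception("PVC is deleted for FromVolume resume policy.")

check_pvc_alive("random-fromvolume", "kubeflow", {"random-fromvolume"})
```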
Force-pushed 6d9afd3 to a84c920.
Thanks for your review @tenzen-y @anencore94 @johnugeorge @terrytangyuan.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: andreyvelich, terrytangyuan. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
tenzen-y left a comment:
@andreyvelich Looks great! Thanks for your tremendous effort!
/lgtm
LGTM. Thanks @andreyvelich

Thanks everyone for the review!
Related: #2024, #2044
I used Katib SDK to run the E2E test.
Also, I made the following changes in the SDK:
I used a unified style for our Python APIs and Exceptions, so users can easily understand them. We need to create a script to automatically generate docs from our Katib Client. Currently, this doc is outdated: https://github.com/kubeflow/katib/blob/master/sdk/python/v1beta1/docs/KatibClient.md.
Are there any existing tools we can use?
I introduced the following changes in APIs:
- `get_experiment_status` -> `get_experiment_conditions` to return Experiment conditions.
- `is_experiment_created` - new API to check Experiment status.
- `is_experiment_running` - new API to check Experiment status.
- `is_experiment_restarting` - new API to check Experiment status.
- `is_experiment_succeeded` - new API to check Experiment status.
- `is_experiment_failed` - new API to check Experiment status.
- `wait_for_experiment_condition` - new API to wait until the Experiment reaches a condition (similar to the Training Operator).
- `edit_experiment_budget` - new API to modify the Experiment budget in-place.
- `get_suggestion` split into `get_suggestion` and `list_suggestions`, similar to `list_experiments`.
- `get_trial` to get the Trial CR object.
- `get_optimal_hyperparameters` now returns a `V1beta1OptimalTrial` object.

Please let me know what you think about the API style, and whether we can improve it.
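To illustrate the intended semantics of `wait_for_experiment_condition`, here is a toy polling loop under assumed behavior. The real SDK method reads conditions from the Kubernetes API; `get_condition` here is a hypothetical callback:

```python
import time

def wait_for_condition(get_condition, expected="Succeeded",
                       timeout=5.0, polling_interval=0.01):
    # Poll until the callback reports the expected condition, or time out.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_condition() == expected:
            return True
        time.sleep(polling_interval)
    raise TimeoutError(f"condition {expected!r} not reached in {timeout}s")

# Simulate an Experiment moving through its lifecycle.
states = iter(["Created", "Running", "Succeeded"])
reached = wait_for_condition(lambda: next(states))
```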
It would be great if you could start reviewing this.
Also, if you think this change is too big for the upcoming release, we can postpone it.
TODO: I need to update examples that use Katib SDK. Will do it during the week.
/hold
cc @gaocegege @johnugeorge @anencore94 @tenzen-y @kubeflow/wg-training-leads