[BUG] RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists. #2698
Closed
Project-MONAI/MONAI #7916
Labels: bug (Something isn't working)
Description
2024-07-14 10:02:17,649 - INFO - Load site-1 weights...
2024-07-14 10:02:17,652 - INFO - Converted 148 global variables to match 148 local variables.
2024-07-14 10:02:17,654 - INFO - 'dst' model updated: 148 of 148 variables.
2024-07-14 10:02:17,661 - Communicator - INFO - Received from secure_project server. getTask: train size: 19.3MB (19280090 Bytes) time: 0.301346 seconds
2024-07-14 10:02:17,661 - FederatedClient - INFO - pull_task completed. Task name:train Status:True
2024-07-14 10:02:17,661 - ClientRunner - INFO - [identity=site-2, run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, peer=secure_project, peer_run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781]: got task assignment: name=train, id=00b8bb4c-1fbd-421d-81b3-19472481fd48
2024-07-14 10:02:17,661 - ClientRunner - INFO - [identity=site-2, run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, peer=secure_project, peer_run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, task_name=train, task_id=00b8bb4c-1fbd-421d-81b3-19472481fd48]: invoking task executor ClientAlgoExecutor
2024-07-14 10:02:17,662 - INFO - Start site-1 evaluating...
2024-07-14 10:02:17,662 - ClientAlgoExecutor - INFO - [identity=site-2, run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, peer=secure_project, peer_run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, task_name=train, task_id=00b8bb4c-1fbd-421d-81b3-19472481fd48]: Client trainer got task: train
2024-07-14 10:02:17,662 - ignite.engine.engine.SupervisedEvaluator - INFO - Engine run resuming from iteration 0, epoch 0 until 1 epochs
2024-07-14 10:02:17,662 - INFO - Load site-2 weights...
2024-07-14 10:02:17,664 - INFO - Converted 148 global variables to match 148 local variables.
2024-07-14 10:02:17,665 - INFO - 'dst' model updated: 148 of 148 variables.
2024-07-14 10:02:17,672 - INFO - Start site-2 evaluating...
2024-07-14 10:02:17,672 - ignite.engine.engine.SupervisedEvaluator - INFO - Engine run resuming from iteration 0, epoch 0 until 1 epochs
2024-07-14 10:02:17,743 - ignite.engine.engine.SupervisedEvaluator - ERROR - Engine run is terminating due to exception: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.
2024-07-14 10:02:17,744 - ERROR - Exception: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 946, in _internal_run_as_gen
self._fire_event(Events.STARTED)
File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 425, in _fire_event
func(*first, *(event_args + others), **kwargs)
File "/usr/local/lib/python3.10/dist-packages/monai/handlers/mlflow_handler.py", line 209, in start
self._set_experiment()
File "/usr/local/lib/python3.10/dist-packages/monai/handlers/mlflow_handler.py", line 241, in _set_experiment
experiment_id = self.client.create_experiment(self.experiment_name)
File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/client.py", line 599, in create_experiment
return self._tracking_client.create_experiment(name, artifact_location, tags)
File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/_tracking_service/client.py", line 251, in create_experiment
return self.store.create_experiment(
File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 101, in create_experiment
response_proto = self._call_endpoint(CreateExperiment, req_body)
File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 60, in _call_endpoint
return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 290, in call_endpoint
response = verify_rest_response(response, endpoint)
File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 173, in verify_rest_response
raise RestException(json.loads(response.text))
mlflow.exceptions.RestException: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.
2024-07-14 10:02:17,745 - ClientAlgoExecutor - ERROR - [identity=site-2, run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, peer=secure_project, peer_run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, task_name=train, task_id=00b8bb4c-1fbd-421d-81b3-19472481fd48]: client_algo execute exception: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.
2024-07-14 10:02:17,745 - ClientAlgoExecutor - ERROR - Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/monai_nvflare/client_algo_executor.py", line 114, in execute
return self.train(shareable, fl_ctx, abort_signal)
File "/usr/local/lib/python3.10/dist-packages/monai_nvflare/client_algo_executor.py", line 132, in train
test_report = self.client_algo.evaluate(exchangeobj_from_shareable(shareable))
File "/usr/local/lib/python3.10/dist-packages/monai/fl/client/monai_algo.py", line 664, in evaluate
self.evaluator.run(self.trainer.state.epoch + 1)
File "/usr/local/lib/python3.10/dist-packages/monai/engines/evaluator.py", line 150, in run
super().run()
File "/usr/local/lib/python3.10/dist-packages/monai/engines/workflow.py", line 283, in run
super().run(data=self.data_loader, max_epochs=self.state.max_epochs)
File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 892, in run
return self._internal_run()
File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 935, in _internal_run
return next(self._internal_run_generator)
File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 993, in _internal_run_as_gen
self._handle_exception(e)
File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 636, in _handle_exception
self._fire_event(Events.EXCEPTION_RAISED, e)
File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 425, in _fire_event
func(*first, *(event_args + others), **kwargs)
File "/usr/local/lib/python3.10/dist-packages/monai/handlers/stats_handler.py", line 202, in exception_raised
raise e
File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 946, in _internal_run_as_gen
self._fire_event(Events.STARTED)
File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 425, in _fire_event
func(*first, *(event_args + others), **kwargs)
File "/usr/local/lib/python3.10/dist-packages/monai/handlers/mlflow_handler.py", line 209, in start
self._set_experiment()
File "/usr/local/lib/python3.10/dist-packages/monai/handlers/mlflow_handler.py", line 241, in _set_experiment
experiment_id = self.client.create_experiment(self.experiment_name)
File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/client.py", line 599, in create_experiment
return self._tracking_client.create_experiment(name, artifact_location, tags)
File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/_tracking_service/client.py", line 251, in create_experiment
return self.store.create_experiment(
File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 101, in create_experiment
response_proto = self._call_endpoint(CreateExperiment, req_body)
File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 60, in _call_endpoint
return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 290, in call_endpoint
response = verify_rest_response(response, endpoint)
File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 173, in verify_rest_response
raise RestException(json.loads(response.text))
mlflow.exceptions.RestException: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.
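It looks like there is an issue when running the MONAI real-world example.
When site-2 starts evaluating, the monai_nvflare experiment already exists (site-1 started evaluating about 10 ms earlier, per the log above), so the create_experiment call in MLFlowHandler._set_experiment fails with RESOURCE_ALREADY_EXISTS. This case should be handled.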
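One possible way to handle this case is a "get or create" step that tolerates another site creating the experiment first. The sketch below is only an illustration, not the actual MONAI fix: `get_or_create_experiment`, `FakeClient` and `FakeRestException` are hypothetical names, and the fake client only stands in for an MLflow tracking client so that the example runs without a server. It assumes the client exposes `get_experiment_by_name` and `create_experiment`, and that the error message contains `RESOURCE_ALREADY_EXISTS` as in the log above.

```python
from types import SimpleNamespace


class FakeRestException(Exception):
    """Hypothetical stand-in for mlflow.exceptions.RestException."""


class FakeClient:
    """Hypothetical in-memory stand-in for an MLflow tracking client."""

    def __init__(self):
        self._experiments = {}  # experiment name -> experiment id
        # When True, the next lookup returns None to simulate the window in
        # which another site creates the experiment after our lookup.
        self.hide_next_lookup = False

    def get_experiment_by_name(self, name):
        if self.hide_next_lookup:
            self.hide_next_lookup = False
            return None
        if name not in self._experiments:
            return None
        return SimpleNamespace(name=name, experiment_id=self._experiments[name])

    def create_experiment(self, name):
        if name in self._experiments:
            raise FakeRestException(
                f"RESOURCE_ALREADY_EXISTS: Experiment '{name}' already exists."
            )
        self._experiments[name] = str(len(self._experiments) + 1)
        return self._experiments[name]


def get_or_create_experiment(client, name):
    """Return the id of experiment `name`, creating it if it does not exist.

    If creation fails because another client created the experiment in the
    meantime, look it up again instead of propagating the error.
    """
    existing = client.get_experiment_by_name(name)
    if existing is not None:
        return existing.experiment_id
    try:
        return client.create_experiment(name)
    except Exception as exc:
        # Only swallow the "already exists" case; re-raise anything else.
        if "RESOURCE_ALREADY_EXISTS" not in str(exc):
            raise
        return client.get_experiment_by_name(name).experiment_id


if __name__ == "__main__":
    client = FakeClient()
    first = get_or_create_experiment(client, "monai_nvflare")   # "site-1" creates it
    client.hide_next_lookup = True                              # "site-2" loses the race
    second = get_or_create_experiment(client, "monai_nvflare")
    print(first == second)
```

With a real tracking server, the same pattern would be applied to the MLflow client in place of `FakeClient`. Matching on the message text is a simplification for this sketch; checking the exception's error code would be more robust if the installed MLflow version exposes it.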