
[BUG] RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists. #2698

@KumoLiu

Description

2024-07-14 10:02:17,649 - INFO - Load site-1 weights...
2024-07-14 10:02:17,652 - INFO - Converted 148 global variables to match 148 local variables.
2024-07-14 10:02:17,654 - INFO - 'dst' model updated: 148 of 148 variables.
2024-07-14 10:02:17,661 - Communicator - INFO - Received from secure_project server. getTask: train size: 19.3MB (19280090 Bytes) time: 0.301346 seconds
2024-07-14 10:02:17,661 - FederatedClient - INFO - pull_task completed. Task name:train Status:True 
2024-07-14 10:02:17,661 - ClientRunner - INFO - [identity=site-2, run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, peer=secure_project, peer_run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781]: got task assignment: name=train, id=00b8bb4c-1fbd-421d-81b3-19472481fd48
2024-07-14 10:02:17,661 - ClientRunner - INFO - [identity=site-2, run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, peer=secure_project, peer_run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, task_name=train, task_id=00b8bb4c-1fbd-421d-81b3-19472481fd48]: invoking task executor ClientAlgoExecutor
2024-07-14 10:02:17,662 - INFO - Start site-1 evaluating...
2024-07-14 10:02:17,662 - ClientAlgoExecutor - INFO - [identity=site-2, run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, peer=secure_project, peer_run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, task_name=train, task_id=00b8bb4c-1fbd-421d-81b3-19472481fd48]: Client trainer got task: train
2024-07-14 10:02:17,662 - ignite.engine.engine.SupervisedEvaluator - INFO - Engine run resuming from iteration 0, epoch 0 until 1 epochs
2024-07-14 10:02:17,662 - INFO - Load site-2 weights...
2024-07-14 10:02:17,664 - INFO - Converted 148 global variables to match 148 local variables.
2024-07-14 10:02:17,665 - INFO - 'dst' model updated: 148 of 148 variables.
2024-07-14 10:02:17,672 - INFO - Start site-2 evaluating...
2024-07-14 10:02:17,672 - ignite.engine.engine.SupervisedEvaluator - INFO - Engine run resuming from iteration 0, epoch 0 until 1 epochs
2024-07-14 10:02:17,743 - ignite.engine.engine.SupervisedEvaluator - ERROR - Engine run is terminating due to exception: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.
2024-07-14 10:02:17,744 - ERROR - Exception: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 946, in _internal_run_as_gen
    self._fire_event(Events.STARTED)
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 425, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/monai/handlers/mlflow_handler.py", line 209, in start
    self._set_experiment()
  File "/usr/local/lib/python3.10/dist-packages/monai/handlers/mlflow_handler.py", line 241, in _set_experiment
    experiment_id = self.client.create_experiment(self.experiment_name)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/client.py", line 599, in create_experiment
    return self._tracking_client.create_experiment(name, artifact_location, tags)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/_tracking_service/client.py", line 251, in create_experiment
    return self.store.create_experiment(
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 101, in create_experiment
    response_proto = self._call_endpoint(CreateExperiment, req_body)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 60, in _call_endpoint
    return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 290, in call_endpoint
    response = verify_rest_response(response, endpoint)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 173, in verify_rest_response
    raise RestException(json.loads(response.text))
mlflow.exceptions.RestException: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.
2024-07-14 10:02:17,745 - ClientAlgoExecutor - ERROR - [identity=site-2, run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, peer=secure_project, peer_run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, task_name=train, task_id=00b8bb4c-1fbd-421d-81b3-19472481fd48]: client_algo execute exception: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.
2024-07-14 10:02:17,745 - ClientAlgoExecutor - ERROR - Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/monai_nvflare/client_algo_executor.py", line 114, in execute
    return self.train(shareable, fl_ctx, abort_signal)
  File "/usr/local/lib/python3.10/dist-packages/monai_nvflare/client_algo_executor.py", line 132, in train
    test_report = self.client_algo.evaluate(exchangeobj_from_shareable(shareable))
  File "/usr/local/lib/python3.10/dist-packages/monai/fl/client/monai_algo.py", line 664, in evaluate
    self.evaluator.run(self.trainer.state.epoch + 1)
  File "/usr/local/lib/python3.10/dist-packages/monai/engines/evaluator.py", line 150, in run
    super().run()
  File "/usr/local/lib/python3.10/dist-packages/monai/engines/workflow.py", line 283, in run
    super().run(data=self.data_loader, max_epochs=self.state.max_epochs)
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 892, in run
    return self._internal_run()
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 935, in _internal_run
    return next(self._internal_run_generator)
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 993, in _internal_run_as_gen
    self._handle_exception(e)
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 636, in _handle_exception
    self._fire_event(Events.EXCEPTION_RAISED, e)
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 425, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/monai/handlers/stats_handler.py", line 202, in exception_raised
    raise e
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 946, in _internal_run_as_gen
    self._fire_event(Events.STARTED)
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 425, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/monai/handlers/mlflow_handler.py", line 209, in start
    self._set_experiment()
  File "/usr/local/lib/python3.10/dist-packages/monai/handlers/mlflow_handler.py", line 241, in _set_experiment
    experiment_id = self.client.create_experiment(self.experiment_name)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/client.py", line 599, in create_experiment
    return self._tracking_client.create_experiment(name, artifact_location, tags)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/_tracking_service/client.py", line 251, in create_experiment
    return self.store.create_experiment(
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 101, in create_experiment
    response_proto = self._call_endpoint(CreateExperiment, req_body)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 60, in _call_endpoint
    return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 290, in call_endpoint
    response = verify_rest_response(response, endpoint)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 173, in verify_rest_response
    raise RestException(json.loads(response.text))
mlflow.exceptions.RestException: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.

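It looks like there is an issue when running the MONAI real-world example.
When site-2 starts evaluating, the 'monai_nvflare' experiment already exists (presumably created by site-1, which starts evaluating just before it in the log), so create_experiment raises RESOURCE_ALREADY_EXISTS. The MLflow handler should handle this case instead of failing.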
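
One possible way to handle this, sketched under assumptions rather than taken from the actual handler code: look the experiment up by name first, create it only if it is missing, and fall back to a second lookup if creation fails because another site created it in the meantime. The helper name get_or_create_experiment_id and the FakeTrackingClient class are hypothetical and exist only for illustration; the sketch relies on the client exposing get_experiment_by_name(name) (returning None when nothing matches) and create_experiment(name) (returning the new id), as MlflowClient does.

```python
from types import SimpleNamespace


def get_or_create_experiment_id(client, name):
    """Return the id of experiment `name`, creating it only if it is missing."""
    existing = client.get_experiment_by_name(name)
    if existing is not None:
        return existing.experiment_id
    try:
        return client.create_experiment(name)
    except Exception:
        # Another site may have created the experiment between the lookup
        # and the create call, so look it up again before giving up.
        existing = client.get_experiment_by_name(name)
        if existing is None:
            raise
        return existing.experiment_id


class FakeTrackingClient:
    """Hypothetical in-memory stand-in for a tracking client (illustration only)."""

    def __init__(self):
        self._experiments = {}

    def get_experiment_by_name(self, name):
        return self._experiments.get(name)

    def create_experiment(self, name):
        if name in self._experiments:
            raise RuntimeError(
                f"RESOURCE_ALREADY_EXISTS: Experiment '{name}' already exists."
            )
        exp = SimpleNamespace(experiment_id=str(len(self._experiments) + 1), name=name)
        self._experiments[name] = exp
        return exp.experiment_id


# Two sites sharing one tracking server end up with the same experiment id.
client = FakeTrackingClient()
site_1_id = get_or_create_experiment_id(client, "monai_nvflare")
site_2_id = get_or_create_experiment_id(client, "monai_nvflare")
```

The broad except is deliberate in this sketch so that it runs without MLflow installed; in the real handler it would probably be narrowed to MLflow's own exception type.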

Metadata

Labels

bug: Something isn't working
