Conversation

@vincbeck (Contributor):

Add a retry option to RedshiftDeleteClusterOperator so it can retry when an operation is already running on the cluster. In that case the deletion fails and throws an exception like the one below. This option lets users retry in such a scenario.

INFO    airflow.task:taskinstance.py:1278 Executing <Task(RedshiftDeleteClusterOperator): delete_cluster> on 2021-01-01 00:00:00+00:00
INFO    airflow.task:taskinstance.py:1487 Exporting the following env vars:
AIRFLOW_CTX_DAG_OWNER=airflow
AIRFLOW_CTX_DAG_ID=example_redshift_to_s3
AIRFLOW_CTX_TASK_ID=delete_cluster
AIRFLOW_CTX_EXECUTION_DATE=2021-01-01T00:00:00+00:00
AIRFLOW_CTX_TRY_NUMBER=1
AIRFLOW_CTX_DAG_RUN_ID=backfill__2021-01-01T00:00:00+00:00
INFO     airflow.hooks.base:base.py:71 Using connection ID 'aws_default' for task execution.
INFO    botocore.credentials:credentials.py:1180 Found credentials in environment variables.
ERROR   airflow.task:taskinstance.py:1746 Task failed with exception
Traceback (most recent call last):
  File "/opt/airflow/airflow/providers/amazon/aws/operators/redshift_cluster.py", line 492, in execute
    final_cluster_snapshot_identifier=self.final_cluster_snapshot_identifier,
  File "/opt/airflow/airflow/providers/amazon/aws/hooks/redshift_cluster.py", line 112, in delete_cluster
    FinalClusterSnapshotIdentifier=final_cluster_snapshot_identifier,
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 495, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 914, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.InvalidClusterStateFault: An error occurred (InvalidClusterState) when calling the DeleteCluster operation: There is an operation running on the Cluster. Please try to delete it at a later time.
INFO     airflow.task:taskinstance.py:1301 Marking task as FAILED. dag_id=example_redshift_to_s3, task_id=delete_cluster, execution_date=20210101T000000, start_date=20221113T120403, end_date=20221113T120403
ERROR   airflow.executors.debug_executor.DebugExecutor:debug_executor.py:85 Failed to execute task: An error occurred (InvalidClusterState) when calling the DeleteCluster operation: There is an operation running on the Cluster. Please try to delete it at a later time..
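The behavior being added can be sketched roughly as follows. This is an illustrative, self-contained sketch, not the operator's actual code: the function and exception names, and the `attempts`/`attempt_interval` parameters, are assumptions for illustration (the parameter names follow the naming proposed later in this thread).

```python
import time


class InvalidClusterStateError(Exception):
    """Stand-in for botocore's InvalidClusterStateFault (illustrative only)."""


def delete_with_retry(delete_fn, attempts: int = 10, attempt_interval: float = 30.0):
    """Call delete_fn, retrying while the cluster reports a running operation.

    delete_fn is any zero-argument callable that performs the deletion and
    raises InvalidClusterStateError while another operation is in progress.
    """
    for attempt in range(attempts):
        try:
            return delete_fn()
        except InvalidClusterStateError:
            # On the last attempt, give up and surface the error to the task.
            if attempt == attempts - 1:
                raise
            # Otherwise wait and try the deletion again.
            time.sleep(attempt_interval)
```

In the real operator the callable would wrap the hook's boto3 `delete_cluster` call and the caught exception would be the botocore fault shown in the traceback above.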

Comment on lines 505 to 506
retry: bool = False,
retry_attempts: int = 10,

Member:

I wonder if we should just have one tries: int = 1; having two separate arguments for this feels like a potential tripwire.
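The two signatures under discussion can be contrasted in a small sketch (hypothetical stubs, returning only the effective number of attempts; neither is the operator's real signature):

```python
def delete_cluster(retry: bool = False, retry_attempts: int = 10) -> int:
    """Two-argument form from the diff above: the flag gates the count."""
    # Potential tripwire: passing retry_attempts=5 alone does nothing
    # unless retry=True is also set.
    return retry_attempts if retry else 1


def delete_cluster_single(tries: int = 1) -> int:
    """Single-argument form: tries=1 means no retry, tries>1 enables it."""
    return tries
```

With the single argument there is no inconsistent state to guard against, at the cost of users having to pick an attempt count themselves when they opt in.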

Contributor Author:

That's a good idea. The only downside is that there is no default value for when you want to retry: if you do, you have to come up with a number of retries yourself, which may not be very intuitive for users.

Contributor:

I think we need some standardization (see #27276 (comment)). It's better to decide on a convention and then enforce it across the board; discussing this per operator is less productive, and we may eventually end up with different names per operator.

Contributor Author:

I am not sure it is the same use case, though. In Syed's PR the retry logic works around a bug in boto3; here it handles a valid use case: deleting a Redshift cluster that has an operation still running. That's why, in my opinion, we need a flag to toggle the retry: some users might want the operator to fail in that scenario, while others might want to retry. Regarding the naming, I am happy to rename the parameters to attempts and attempt_interval to make them more consistent across operators.

Contributor:

Unless I'm missing something, the goal of these parameters, from the user's perspective, is to retry without counting against Airflow's task-level retries, so I do find this to be the same case (regardless of the reason the functionality was needed to begin with).
WDYT?

Contributor Author:

You are correct, but the way I see it, one is a bug fix (through a hack) and the other is a feature.

Contributor Author (@vincbeck, Nov 22, 2022):

But maybe you are right, and in the end we should always retry, since most likely everyone wants to retry when the deletion fails because of a running operation on the cluster.

Contributor Author:

Should be good now @eladkal

@potiuk potiuk merged commit 2ab5c1f into apache:main Nov 26, 2022
@vincbeck vincbeck deleted the vincbeck/redshift_delete branch December 6, 2022 16:36