AIRFLOW-3149 Support dataproc cluster deletion on ERROR #4064

dossett · 2018-10-17T16:58:52Z

Sometimes a dataproc cluster creation results in a
cluster in a state of ERROR, which makes it unsuable.
Subsequent Airflow retries will fail because a cluster
already exists. This change adds the option to delete
an ERROR cluster on creation so that subsequent attempts
might succeed. There are also some other small cleanups.

Make sure you have checked all steps below.

Jira

My PR addresses the following Airflow Jira issues and references them in the PR title.

Description

See commit message above

Tests

My PR adds the following unit tests OR does not need testing for this extremely good reason:

My change does not include tests, I did not see any integration tests in the code base that this could fit into.

Commits

My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Documentation

In case of new functionality, my PR adds documentation that describes how to use it.
- When adding new operators/hooks/sensors, the autoclass documentation generation needs to be added.

Code Quality

Passes flake8

fenglu-db · 2018-10-18T05:58:13Z

Instead of adding the delete_on_error into the dataproc cluster create operator, curious why you can't have a downstream delete cluster operator with a trigger_rule = upstream failed? Would rather keep the operator logic atomic and simple.

dossett · 2018-10-18T13:10:36Z

Hi @fenglu-g thanks for your comment. My goal wasn't just to make sure the ERROR cluster gets deleted but to give the cluster creation a chance to succeed with a retry.

The behavior we have observed is this:

Sometimes a cluster create fails and the cluster exists in an ERROR state
the cluster create operator retries based on our DAG configuration
the retries fail because a cluster with the same name already exists in the ERROR state
after the the retries are exhausted the DAG proceeds with that step as failed

After applying this patch internally we observe:

Sometimes a cluster create fails and the cluster exists in an ERROR state
We immediately delete the cluster within the create operator
the cluster create operator retries based on our DAG configuration
the cluster creation succeeds because whatever led to the initial cluster creation ERROR was a transient problem

It has greatly increased the reliability and stability of our GCP DAGS.

fenglu-db · 2018-10-31T21:15:07Z

Sorry for the late reply.

Thank you @dossett for the detailed explanation. The root cause seems to be that DataprocClusterCreateOperator is not idempotent. Similar to what you have described, how about we re-factor the operator based on the following logic?

check existence of dataproc cluster
if yes and in "running" state, no-op.
else, delete the cluster, and attempt create again.

So we don't have to export this additional delete_cluster_on_error?

dossett · 2018-11-10T01:28:05Z

Hi @fenglu-g thank you for the follow-up! What about taking both approaches? Deleting a cluster first if it already exists, creating the cluster, and if the creation fails and results in an ERROR state deleting it before exiting. That could be the default behavior without the need to add the new parameter (delete_cluster_on_error). Having the operator clean up after itself and not leave errant resources behind seems like a nice property.

stale · 2018-12-25T01:59:19Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

kaxil · 2018-12-31T12:41:09Z

@dossett Deleting the existing cluster without prior warnings, wouldn't be a good option. I would rather go the appraoch @fenglu-g mentioned.

dossett · 2019-01-04T20:47:37Z

Thank you @kaxil and @fenglu-g for the feedback. I will work on this when I'm back in the office next week.

mik-laj · 2019-01-13T20:29:31Z

airflow/contrib/operators/dataproc_operator.py

It's better to put this code in Hook.

Agree with @mik-laj - makes sense to move it to Hooks

Thanks @kaxil and @mik-laj I missed this the first time. I will make this change.

dossett · 2019-01-25T16:37:44Z

@kaxil @fenglu-g I have redone this PR with the new approach per our discussion. Looking forward to additional feedback!

dossett · 2019-01-30T15:18:23Z

Moved common code to the hook and cleaned up some flake8 errors

mik-laj · 2019-01-31T07:38:44Z

airflow/contrib/hooks/gcp_dataproc_hook.py

Suggested change

).execute()

).execute(num_retries=5)

Google API sometimes does not respond correctly. You should send a request again.

Great, thanks!

mik-laj · 2019-02-02T21:50:50Z

airflow/contrib/operators/dataproc_operator.py

Suggested change

self.log.info('Diagnostic information for ERROR cluster available at [' +

output_url .= diagnose_result.get('response').get('outputUri')

self.log.info('Diagnostic information for ERROR cluster available at [%s]', output_url)

You should avoid formatting the text before passing it to the logger. When text and data are passef to the logger separately, special loggers allow to analyze in a deeper way.

Thank you again @mik-laj, that's a great point. Change pushed.

OmerJog · 2019-02-10T12:27:40Z

@dossett Note the flake8 errors:

./airflow/contrib/hooks/gcp_dataproc_hook.py:255:5: E303 too many blank lines (2)
./airflow/contrib/hooks/gcp_dataproc_hook.py:268:5: E303 too many blank lines (2)
./airflow/contrib/operators/dataproc_operator.py:275:5: E303 too many blank lines (2)
./airflow/contrib/operators/dataproc_operator.py:287:5: E303 too many blank lines (2)

dossett · 2019-02-11T17:00:59Z

Thank you @OmerJog, fixed!

dossett · 2019-02-12T22:27:55Z

I seem to have broken the test_cluster_name_log_sub test, so I'll try to figure that out

OmerJog · 2019-03-07T08:15:54Z

@dossett There is still an error:

53) FAIL: test_cluster_name_log_sub (tests.contrib.operators.test_dataproc_operator.DataprocClusterDeleteOperatorTest)
----------------------------------------------------------------------
   Traceback (most recent call last):
    tests/contrib/operators/test_dataproc_operator.py line 511 in test_cluster_name_log_sub
      dataproc_task.execute(None)
   AssertionError: TypeError not raised

dossett · 2019-03-12T17:53:51Z

Yeah @OmerJog I haven't been able to track that down

OmerJog · 2019-04-03T08:49:29Z

@dossett You have some errors:

20) FAIL: test_cluster_name_log_sub (tests.contrib.operators.test_dataproc_operator.DataprocClusterDeleteOperatorTest)
----------------------------------------------------------------------
   Traceback (most recent call last):
    tests/contrib/operators/test_dataproc_operator.py line 511 in test_cluster_name_log_sub
      dataproc_task.execute(None)
   AssertionError: TypeError not raised

dossett · 2019-04-17T17:46:42Z

@OmerJog I went over that test in detail and it seems like the with self.assertRaises(TypeError) was a defensive measure to catch an exception within the test and not that raising a TypeError was part of the test specification. I've deleted that line to see if the rest of the test is successful.

cc @piffall who added that line about 11 months ago

piffall · 2019-04-17T20:19:42Z

@OmerJog I went over that test in detail and it seems like the with self.assertRaises(TypeError) was a defensive measure to catch an exception within the test and not that raising a TypeError was part of the test specification. I've deleted that line to see if the rest of the test is successful.

cc @piffall who added that line about 11 months ago

Hi @dossett, in fact, I didn't add that line, I just keep that test with a minor change.

codecov-io · 2019-04-17T22:30:23Z

Codecov Report

Merging #4064 into master will increase coverage by 0.23%.
The diff coverage is 51.57%.

@@            Coverage Diff             @@
##           master    #4064      +/-   ##
==========================================
+ Coverage   78.76%   78.99%   +0.23%     
==========================================
  Files         481      488       +7     
  Lines       30215    30642     +427     
==========================================
+ Hits        23800    24207     +407     
- Misses       6415     6435      +20

Impacted Files	Coverage Δ
airflow/contrib/operators/dataproc_operator.py	`79.43% <51.28%> (-4.38%)`	⬇️
airflow/contrib/hooks/gcp_dataproc_hook.py	`46.66% <52.94%> (+9.62%)`	⬆️
airflow/contrib/operators/gcs_download_operator.py	`78.12% <0%> (-9.88%)`	⬇️
airflow/api/common/experimental/get_code.py	`76.92% <0%> (-6.42%)`	⬇️
...rflow/api/common/experimental/get_task_instance.py	`84.61% <0%> (-5.39%)`	⬇️
airflow/api/common/experimental/mark_tasks.py	`95.2% <0%> (-1.57%)`	⬇️
airflow/contrib/operators/mlengine_operator.py	`76.4% <0%> (-0.17%)`	⬇️
airflow/kubernetes/pod.py	`100% <0%> (ø)`	⬆️
airflow/contrib/operators/gcs_list_operator.py	`100% <0%> (ø)`	⬆️
airflow/api/common/experimental/get_task.py	`100% <0%> (ø)`	⬆️
... and 53 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2fd7567...eaed844. Read the comment docs.

dossett · 2019-04-18T13:20:27Z

@fenglu-g @kaxil This PR is complete and now passes all tests. We (Etsy) have been running with change locally for a couple of months now and it's been terrific. Clusters in ERROR no longer derail DAGs and automatically retrieving diagnostic information about them has let us diagnose underlying causes with Google support, where before this we couldn't because we had no information.

Thank you @OmerJog and @mik-laj for helpful comments along the way!

fenglu-db

Thank you @dossett, mostly LGTM. Two nits and one test case request.

fenglu-db · 2019-05-07T20:04:22Z

airflow/contrib/hooks/gcp_dataproc_hook.py

nit: s/Exception/AirflowException,

fenglu-db · 2019-05-07T20:18:52Z

airflow/contrib/operators/dataproc_operator.py

just double check, should the regex pattern be [smdh] as well to be consistent with the get_graceful_decommission_timeout?

Possibly? That would require some other substantive changes right below this line to handle the d and h. My change was just to make the regex a little more standardized, but sometimes those can spill over to more substantive changes. I would want to err on the side of making that a separate change.

fenglu-db · 2019-05-07T20:19:59Z

tests/contrib/operators/test_dataproc_operator.py

could you also add a test cast that checks that the error cluster is deleted?

OmerJog · 2019-05-20T05:35:21Z

@dossett Can you address the comments and rebase? @kaxil is picking PRs for Airflow 1.10.4 - it would be nice to have this PR in.

dossett · 2019-05-20T17:12:00Z

Thanks @OmerJog and @fenglu-g ! I've addressed the first two comments, and will rebase and try to add the requested unit test as well. Looking forward to 1.10.4 !

mik-laj · 2019-09-05T13:59:07Z

Hi.

I made a change in the base class - GoogleCloudBaseHook. Your PR may need to be changed. Could you do rebase?

Thanks

Refenence:
#5907

potiuk · 2019-09-06T13:42:42Z

Same here @dossett -> we are just moving from contrib to core and we would love to get this one rebased/merged.

dossett · 2019-09-13T16:40:36Z

Rebase and pushed, awaiting travis

mik-laj · 2019-09-16T10:50:06Z

Pylint is sad.

dossett · 2019-09-16T13:24:44Z

Pushed more changes, I'm learning a lot about pylint 😂

dossett · 2019-09-16T17:28:53Z

@mik-laj pylint is happy now, the kubernetes test suite failed but that seems unrelated to my change?

dossett · 2019-09-16T20:46:36Z

@mik-laj @potiuk All green!

dossett · 2019-09-17T13:18:09Z

Thank you @mik-laj @potiuk @OmerJog @fenglu-g for all the feedback along the way, it's exciting to see this get merged. Looking forward to making more contributions.

mik-laj · 2019-09-17T14:32:50Z

Thank you very much for your cooperation. I am waiting for your next PR. 😸

stale bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Dec 25, 2018

kaxil removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Dec 31, 2018

mik-laj reviewed Jan 13, 2019

View reviewed changes

dossett force-pushed the AIRFLOW-3149 branch 2 times, most recently from 6750b8c to 8aa742c Compare January 25, 2019 16:31

mik-laj reviewed Jan 31, 2019

View reviewed changes

mik-laj reviewed Feb 2, 2019

View reviewed changes

dossett force-pushed the AIRFLOW-3149 branch from 275855c to c19f2f1 Compare March 29, 2019 19:03

fenglu-db reviewed May 7, 2019

View reviewed changes

dossett force-pushed the AIRFLOW-3149 branch from 7be066f to 5892119 Compare August 17, 2019 11:53

dossett added 8 commits September 13, 2019 10:30

Remove defensive TypeError catch in test

1bdb977

Test fixes

f984939

More test fix

56726ec

Different test mock approach

edf070c

Add requested test

1172356

Fix whitespacing

6ea5bec

Refactor several methods per PR feedback

ed0e24b

rebase

1fc8fa5

dossett force-pushed the AIRFLOW-3149 branch from 5892119 to 1fc8fa5 Compare September 13, 2019 16:40

dossett added 3 commits September 13, 2019 12:15

fix linter and patch paths

35d48c5

More pylin

1c2548e

even more pylint

3c94d0d

Yet more pylint

77fb07a

mik-laj approved these changes Sep 17, 2019

View reviewed changes

mik-laj merged commit 578c57f into apache:master Sep 17, 2019

dossett deleted the AIRFLOW-3149 branch September 17, 2019 13:18

turbaszek mentioned this pull request Sep 24, 2019

[AIRFLOW-5546] Fix DataprocClusterCreateOperator #6178

Closed

5 tasks

turbaszek mentioned this pull request Jun 30, 2020

Improve handling Dataproc cluster creation with ERROR state #9593

Merged

6 tasks

	self.log.info('Diagnostic information for ERROR cluster available at [' +
	output_url .= diagnose_result.get('response').get('outputUri')
	self.log.info('Diagnostic information for ERROR cluster available at [%s]', output_url)

AIRFLOW-3149 Support dataproc cluster deletion on ERROR #4064

AIRFLOW-3149 Support dataproc cluster deletion on ERROR #4064

Uh oh!

Conversation

dossett commented Oct 17, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Jira

Description

Tests

Commits

Documentation

Code Quality

Uh oh!

fenglu-db commented Oct 18, 2018

Uh oh!

dossett commented Oct 18, 2018

Uh oh!

fenglu-db commented Oct 31, 2018

Uh oh!

dossett commented Nov 10, 2018

Uh oh!

stale bot commented Dec 25, 2018

Uh oh!

kaxil commented Dec 31, 2018

Uh oh!

dossett commented Jan 4, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dossett commented Jan 25, 2019

Uh oh!

dossett commented Jan 30, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mik-laj Feb 2, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

OmerJog commented Feb 10, 2019

Uh oh!

dossett commented Feb 11, 2019

Uh oh!

dossett commented Feb 12, 2019

Uh oh!

OmerJog commented Mar 7, 2019

Uh oh!

dossett commented Mar 12, 2019

Uh oh!

OmerJog commented Apr 3, 2019

Uh oh!

dossett commented Apr 17, 2019

Uh oh!

piffall commented Apr 17, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-io commented Apr 17, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

dossett commented Apr 18, 2019

Uh oh!

fenglu-db left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

dossett commented Oct 17, 2018 •

edited

Loading

mik-laj Feb 2, 2019 •

edited

Loading

piffall commented Apr 17, 2019 •

edited

Loading

codecov-io commented Apr 17, 2019 •

edited

Loading