Skip to content

Conversation

@dossett
Copy link
Contributor

@dossett dossett commented Oct 17, 2018

Sometimes a dataproc cluster creation results in a
cluster in a state of ERROR, which makes it unsuable.
Subsequent Airflow retries will fail because a cluster
already exists. This change adds the option to delete
an ERROR cluster on creation so that subsequent attempts
might succeed. There are also some other small cleanups.

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title.

Description

  • See commit message above

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

My change does not include tests, I did not see any integration tests in the code base that this could fit into.

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • When adding new operators/hooks/sensors, the autoclass documentation generation needs to be added.

Code Quality

  • Passes flake8

@fenglu-db
Copy link
Contributor

Instead of adding the delete_on_error into the dataproc cluster create operator, curious why you can't have a downstream delete cluster operator with a trigger_rule = upstream failed? Would rather keep the operator logic atomic and simple.

@dossett
Copy link
Contributor Author

dossett commented Oct 18, 2018

Hi @fenglu-g thanks for your comment. My goal wasn't just to make sure the ERROR cluster gets deleted but to give the cluster creation a chance to succeed with a retry.

The behavior we have observed is this:

  • Sometimes a cluster create fails and the cluster exists in an ERROR state
  • the cluster create operator retries based on our DAG configuration
  • the retries fail because a cluster with the same name already exists in the ERROR state
  • after the the retries are exhausted the DAG proceeds with that step as failed

After applying this patch internally we observe:

  • Sometimes a cluster create fails and the cluster exists in an ERROR state
  • We immediately delete the cluster within the create operator
  • the cluster create operator retries based on our DAG configuration
  • the cluster creation succeeds because whatever led to the initial cluster creation ERROR was a transient problem

It has greatly increased the reliability and stability of our GCP DAGS.

@fenglu-db
Copy link
Contributor

Sorry for the late reply.

Thank you @dossett for the detailed explanation. The root cause seems to be that DataprocClusterCreateOperator is not idempotent. Similar to what you have described, how about we re-factor the operator based on the following logic?

  • check existence of dataproc cluster
  • if yes and in "running" state, no-op.
  • else, delete the cluster, and attempt create again.

So we don't have to export this additional delete_cluster_on_error?

@dossett
Copy link
Contributor Author

dossett commented Nov 10, 2018

Hi @fenglu-g thank you for the follow-up! What about taking both approaches? Deleting a cluster first if it already exists, creating the cluster, and if the creation fails and results in an ERROR state deleting it before exiting. That could be the default behavior without the need to add the new parameter (delete_cluster_on_error). Having the operator clean up after itself and not leave errant resources behind seems like a nice property.

@stale
Copy link

stale bot commented Dec 25, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Dec 25, 2018
@kaxil kaxil removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Dec 31, 2018
@kaxil
Copy link
Member

kaxil commented Dec 31, 2018

@dossett Deleting the existing cluster without prior warnings, wouldn't be a good option. I would rather go the appraoch @fenglu-g mentioned.

@dossett
Copy link
Contributor Author

dossett commented Jan 4, 2019

Thank you @kaxil and @fenglu-g for the feedback. I will work on this when I'm back in the office next week.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to put this code in Hook.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @mik-laj - makes sense to move it to Hooks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kaxil and @mik-laj I missed this the first time. I will make this change.

@dossett dossett force-pushed the AIRFLOW-3149 branch 2 times, most recently from 6750b8c to 8aa742c Compare January 25, 2019 16:31
@dossett
Copy link
Contributor Author

dossett commented Jan 25, 2019

@kaxil @fenglu-g I have redone this PR with the new approach per our discussion. Looking forward to additional feedback!

@dossett
Copy link
Contributor Author

dossett commented Jan 30, 2019

Moved common code to the hook and cleaned up some flake8 errors

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
).execute()
).execute(num_retries=5)

Google API sometimes does not respond correctly. You should send a request again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks!

Copy link
Member

@mik-laj mik-laj Feb 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.log.info('Diagnostic information for ERROR cluster available at [' +
output_url .= diagnose_result.get('response').get('outputUri')
self.log.info('Diagnostic information for ERROR cluster available at [%s]', output_url)

You should avoid formatting the text before passing it to the logger. When text and data are passef to the logger separately, special loggers allow to analyze in a deeper way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you again @mik-laj, that's a great point. Change pushed.

@OmerJog
Copy link
Contributor

OmerJog commented Feb 10, 2019

@dossett Note the flake8 errors:

./airflow/contrib/hooks/gcp_dataproc_hook.py:255:5: E303 too many blank lines (2)
./airflow/contrib/hooks/gcp_dataproc_hook.py:268:5: E303 too many blank lines (2)
./airflow/contrib/operators/dataproc_operator.py:275:5: E303 too many blank lines (2)
./airflow/contrib/operators/dataproc_operator.py:287:5: E303 too many blank lines (2)

@dossett
Copy link
Contributor Author

dossett commented Feb 11, 2019

Thank you @OmerJog, fixed!

@dossett
Copy link
Contributor Author

dossett commented Feb 12, 2019

I seem to have broken the test_cluster_name_log_sub test, so I'll try to figure that out

@OmerJog
Copy link
Contributor

OmerJog commented Mar 7, 2019

@dossett There is still an error:

53) FAIL: test_cluster_name_log_sub (tests.contrib.operators.test_dataproc_operator.DataprocClusterDeleteOperatorTest)
----------------------------------------------------------------------
   Traceback (most recent call last):
    tests/contrib/operators/test_dataproc_operator.py line 511 in test_cluster_name_log_sub
      dataproc_task.execute(None)
   AssertionError: TypeError not raised

@dossett
Copy link
Contributor Author

dossett commented Mar 12, 2019

Yeah @OmerJog I haven't been able to track that down

@OmerJog
Copy link
Contributor

OmerJog commented Apr 3, 2019

@dossett You have some errors:

20) FAIL: test_cluster_name_log_sub (tests.contrib.operators.test_dataproc_operator.DataprocClusterDeleteOperatorTest)
----------------------------------------------------------------------
   Traceback (most recent call last):
    tests/contrib/operators/test_dataproc_operator.py line 511 in test_cluster_name_log_sub
      dataproc_task.execute(None)
   AssertionError: TypeError not raised

@dossett
Copy link
Contributor Author

dossett commented Apr 17, 2019

@OmerJog I went over that test in detail and it seems like the with self.assertRaises(TypeError) was a defensive measure to catch an exception within the test and not that raising a TypeError was part of the test specification. I've deleted that line to see if the rest of the test is successful.

cc @piffall who added that line about 11 months ago

@piffall
Copy link
Contributor

piffall commented Apr 17, 2019

@OmerJog I went over that test in detail and it seems like the with self.assertRaises(TypeError) was a defensive measure to catch an exception within the test and not that raising a TypeError was part of the test specification. I've deleted that line to see if the rest of the test is successful.

cc @piffall who added that line about 11 months ago

Hi @dossett, in fact, I didn't add that line, I just keep that test with a minor change.

@codecov-io
Copy link

codecov-io commented Apr 17, 2019

Codecov Report

Merging #4064 into master will increase coverage by 0.23%.
The diff coverage is 51.57%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4064      +/-   ##
==========================================
+ Coverage   78.76%   78.99%   +0.23%     
==========================================
  Files         481      488       +7     
  Lines       30215    30642     +427     
==========================================
+ Hits        23800    24207     +407     
- Misses       6415     6435      +20
Impacted Files Coverage Δ
airflow/contrib/operators/dataproc_operator.py 79.43% <51.28%> (-4.38%) ⬇️
airflow/contrib/hooks/gcp_dataproc_hook.py 46.66% <52.94%> (+9.62%) ⬆️
airflow/contrib/operators/gcs_download_operator.py 78.12% <0%> (-9.88%) ⬇️
airflow/api/common/experimental/get_code.py 76.92% <0%> (-6.42%) ⬇️
...rflow/api/common/experimental/get_task_instance.py 84.61% <0%> (-5.39%) ⬇️
airflow/api/common/experimental/mark_tasks.py 95.2% <0%> (-1.57%) ⬇️
airflow/contrib/operators/mlengine_operator.py 76.4% <0%> (-0.17%) ⬇️
airflow/kubernetes/pod.py 100% <0%> (ø) ⬆️
airflow/contrib/operators/gcs_list_operator.py 100% <0%> (ø) ⬆️
airflow/api/common/experimental/get_task.py 100% <0%> (ø) ⬆️
... and 53 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2fd7567...eaed844. Read the comment docs.

@dossett
Copy link
Contributor Author

dossett commented Apr 18, 2019

@fenglu-g @kaxil This PR is complete and now passes all tests. We (Etsy) have been running with change locally for a couple of months now and it's been terrific. Clusters in ERROR no longer derail DAGs and automatically retrieving diagnostic information about them has let us diagnose underlying causes with Google support, where before this we couldn't because we had no information.

Thank you @OmerJog and @mik-laj for helpful comments along the way!

Copy link
Contributor

@fenglu-db fenglu-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @dossett, mostly LGTM. Two nits and one test case request.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/Exception/AirflowException,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just double check, should the regex pattern be [smdh] as well to be consistent with the get_graceful_decommission_timeout?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly? That would require some other substantive changes right below this line to handle the d and h. My change was just to make the regex a little more standardized, but sometimes those can spill over to more substantive changes. I would want to err on the side of making that a separate change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you also add a test cast that checks that the error cluster is deleted?

@OmerJog
Copy link
Contributor

OmerJog commented May 20, 2019

@dossett Can you address the comments and rebase? @kaxil is picking PRs for Airflow 1.10.4 - it would be nice to have this PR in.

@dossett
Copy link
Contributor Author

dossett commented May 20, 2019

Thanks @OmerJog and @fenglu-g ! I've addressed the first two comments, and will rebase and try to add the requested unit test as well. Looking forward to 1.10.4 !

@mik-laj
Copy link
Member

mik-laj commented Sep 5, 2019

Hi.

I made a change in the base class - GoogleCloudBaseHook. Your PR may need to be changed. Could you do rebase?

Thanks

Refenence:
#5907

@potiuk
Copy link
Member

potiuk commented Sep 6, 2019

Same here @dossett -> we are just moving from contrib to core and we would love to get this one rebased/merged.

@dossett
Copy link
Contributor Author

dossett commented Sep 13, 2019

Rebase and pushed, awaiting travis

@mik-laj
Copy link
Member

mik-laj commented Sep 16, 2019

Pylint is sad.

@dossett
Copy link
Contributor Author

dossett commented Sep 16, 2019

Pushed more changes, I'm learning a lot about pylint 😂

@dossett
Copy link
Contributor Author

dossett commented Sep 16, 2019

@mik-laj pylint is happy now, the kubernetes test suite failed but that seems unrelated to my change?

@dossett
Copy link
Contributor Author

dossett commented Sep 16, 2019

@mik-laj @potiuk All green!

@mik-laj mik-laj merged commit 578c57f into apache:master Sep 17, 2019
@dossett
Copy link
Contributor Author

dossett commented Sep 17, 2019

Thank you @mik-laj @potiuk @OmerJog @fenglu-g for all the feedback along the way, it's exciting to see this get merged. Looking forward to making more contributions.

@dossett dossett deleted the AIRFLOW-3149 branch September 17, 2019 13:18
@mik-laj
Copy link
Member

mik-laj commented Sep 17, 2019

Thank you very much for your cooperation. I am waiting for your next PR. 😸

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

provider:google Google (including GCP) related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants