-
Notifications
You must be signed in to change notification settings - Fork 16.3k
AIRFLOW-3149 Support dataproc cluster deletion on ERROR #4064
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Instead of adding the delete_on_error into the dataproc cluster create operator, curious why you can't have a downstream delete cluster operator with a trigger_rule = upstream failed? Would rather keep the operator logic atomic and simple. |
|
Hi @fenglu-g thanks for your comment. My goal wasn't just to make sure the ERROR cluster gets deleted but to give the cluster creation a chance to succeed with a retry. The behavior we have observed is this:
After applying this patch internally we observe:
It has greatly increased the reliability and stability of our GCP DAGS. |
|
Sorry for the late reply. Thank you @dossett for the detailed explanation. The root cause seems to be that DataprocClusterCreateOperator is not idempotent. Similar to what you have described, how about we re-factor the operator based on the following logic?
So we don't have to export this additional delete_cluster_on_error? |
|
Hi @fenglu-g thank you for the follow-up! What about taking both approaches? Deleting a cluster first if it already exists, creating the cluster, and if the creation fails and results in an ERROR state deleting it before exiting. That could be the default behavior without the need to add the new parameter ( |
|
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
|
@dossett Deleting the existing cluster without prior warnings, wouldn't be a good option. I would rather go the appraoch @fenglu-g mentioned. |
|
Thank you @kaxil and @fenglu-g for the feedback. I will work on this when I'm back in the office next week. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to put this code in Hook.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with @mik-laj - makes sense to move it to Hooks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
6750b8c to
8aa742c
Compare
|
@kaxil @fenglu-g I have redone this PR with the new approach per our discussion. Looking forward to additional feedback! |
|
Moved common code to the hook and cleaned up some flake8 errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ).execute() | |
| ).execute(num_retries=5) |
Google API sometimes does not respond correctly. You should send a request again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| self.log.info('Diagnostic information for ERROR cluster available at [' + | |
| output_url .= diagnose_result.get('response').get('outputUri') | |
| self.log.info('Diagnostic information for ERROR cluster available at [%s]', output_url) |
You should avoid formatting the text before passing it to the logger. When text and data are passef to the logger separately, special loggers allow to analyze in a deeper way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you again @mik-laj, that's a great point. Change pushed.
|
@dossett Note the flake8 errors: |
|
Thank you @OmerJog, fixed! |
|
I seem to have broken the |
|
@dossett There is still an error: |
|
Yeah @OmerJog I haven't been able to track that down |
|
@dossett You have some errors: |
|
@OmerJog I went over that test in detail and it seems like the cc @piffall who added that line about 11 months ago |
Hi @dossett, in fact, I didn't add that line, I just keep that test with a minor change. |
Codecov Report
@@ Coverage Diff @@
## master #4064 +/- ##
==========================================
+ Coverage 78.76% 78.99% +0.23%
==========================================
Files 481 488 +7
Lines 30215 30642 +427
==========================================
+ Hits 23800 24207 +407
- Misses 6415 6435 +20
Continue to review full report at Codecov.
|
|
@fenglu-g @kaxil This PR is complete and now passes all tests. We (Etsy) have been running with change locally for a couple of months now and it's been terrific. Clusters in ERROR no longer derail DAGs and automatically retrieving diagnostic information about them has let us diagnose underlying causes with Google support, where before this we couldn't because we had no information. Thank you @OmerJog and @mik-laj for helpful comments along the way! |
fenglu-db
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @dossett, mostly LGTM. Two nits and one test case request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/Exception/AirflowException,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just double check, should the regex pattern be [smdh] as well to be consistent with the get_graceful_decommission_timeout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly? That would require some other substantive changes right below this line to handle the d and h. My change was just to make the regex a little more standardized, but sometimes those can spill over to more substantive changes. I would want to err on the side of making that a separate change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you also add a test cast that checks that the error cluster is deleted?
|
Thanks @OmerJog and @fenglu-g ! I've addressed the first two comments, and will rebase and try to add the requested unit test as well. Looking forward to 1.10.4 ! |
|
Hi. I made a change in the base class - GoogleCloudBaseHook. Your PR may need to be changed. Could you do rebase? Thanks Refenence: |
|
Same here @dossett -> we are just moving from contrib to core and we would love to get this one rebased/merged. |
5892119 to
1fc8fa5
Compare
|
Rebase and pushed, awaiting travis |
|
Pylint is sad. |
|
Pushed more changes, I'm learning a lot about pylint 😂 |
|
@mik-laj pylint is happy now, the kubernetes test suite failed but that seems unrelated to my change? |
|
Thank you very much for your cooperation. I am waiting for your next PR. 😸 |
Sometimes a dataproc cluster creation results in a
cluster in a state of ERROR, which makes it unsuable.
Subsequent Airflow retries will fail because a cluster
already exists. This change adds the option to delete
an ERROR cluster on creation so that subsequent attempts
might succeed. There are also some other small cleanups.
Make sure you have checked all steps below.
Jira
Description
Tests
My change does not include tests, I did not see any integration tests in the code base that this could fit into.
Commits
Documentation
Code Quality
flake8