-
Notifications
You must be signed in to change notification settings - Fork 16.3k
[AIRFLOW-5691] Rewrite Dataproc operators to use python library #6371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you tested this change with the original DAG?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that was how I checked backwards compatibility. Channing example DAG was last thing I did. If you wish I can change the DAG in separate PR.
airflow/gcp/hooks/dataproc.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| DeprecationWarning, | |
| DeprecationWarning, | |
| stacklevel=2 |
|
@mik-laj, @potiuk can you take a look at the Travis? I am not sure if this is something flaky or not. |
|
It looks like some kind of dependencies has exploded again. |
|
I restarted failed job. |
b0a746b to
a78ff2c
Compare
Codecov Report
@@ Coverage Diff @@
## master #6371 +/- ##
=========================================
+ Coverage 83.3% 83.5% +0.19%
=========================================
Files 645 645
Lines 37291 37356 +65
=========================================
+ Hits 31067 31194 +127
+ Misses 6224 6162 -62
Continue to review full report at Codecov.
|
6bba8b2 to
263bda4
Compare
4f56a2f to
6661b9a
Compare
6e90966 to
4c1e2e9
Compare
potiuk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one change -> providers/google/cloud
fixup! [AIRFLOW-5691] Rewrite Dataproc operators to use python library
4c1e2e9 to
d2f5a00
Compare
d2f5a00 to
ec33459
Compare
|
This seems to have made some substantive changes to the behavior of the operators, was that intended? For example, functionality to detect and reattach to a running job is gone. See the code starting here for functionality that wasn't ported. d633d3a#diff-0be5a6bccaef6a25d9ff5d63a92a12f0L64-L65 |
|
That link may not expand well. It's to line 64 in the |
|
The functionality that was apparently lost is AIRFLOW-3211 |
|
@dossett, the functionality added in AIRFLOW-3211 actually broke the behavior of the dataproc hook and made a few 1.10.x releases unusable for dataproc users. The problem is that the hook only uses the task ID part of the dataproc job ID when looking for previous invocations of the job, so if dataproc history still has jobs corresponding to any of the previous dag runs, the dataproc hook doesn't execute the job. A proper way to implement this would be to associate dataproc jobs with particular dag runs by e.g. embedding a dag run id hash in the dataproc job id. In any case the functionality added in AIRFLOW-3211 has to be optional. In our experience, users expect dataproc jobs to be re-executed when they re-execute the task, and this new behavior creates a lot of confusion. |
|
@digger I take your points about the original change. Do you know if reverting that functionality was an intentional part of this PR? |
|
@dossett, I don't know if that was intentional. I just shared my feedback on changes made in AIRFLOW-3211. In the company I work for we use Airflow and we had to patch Airflow 1.10.7 and 1.10.9, reverting changes from AIRFLOW-3211, in order to make the dataproc functionality work. |
|
@digger is right, Dataproc was not working (tested using example DAG). I've tried to do my best to preserve as much backward compatibility as I could. The present implementation is much clearer and uses Google's python library. Without operators - hooks coupling it should be easier to add any needed changes. I am happy to review PR that adds missing functionality @dossett :) |
|
@nuclearpinguin It looks like functionality added by AIRFLOW-3149 (to more gracefully handle the creation of clusters in an ERROR state) was also reverted, was that also intentional? Was there a JIRA or mailing list discussion of the functional changes that accompanied this API migration? (I'm seriously asking, I searched but did not find anything.) |
|
There's a JIRA: https://issues.apache.org/jira/browse/AIRFLOW-5691 No change was intentional. Also, this change was not backported to 1.10.x . So, from what I understand the submit job operator should:
Doesn't
|
|
@turbaszek I hope to return to this topic once world events stabilize and my kids return to school. Cheers! -Aaron |
Make sure you have checked all steps below.
Jira
Description
This PR introduces refactored Dataproc operators that use python client for Dataproc instead of discovery API. Job operators like
DataProcHadoopOperatorare going to be deprecated and user should use genericDataprocSubmitJobOperator. In general new operators will accept only request objects likeClusterorJob. I've added helper methods that will help users to generate those objects using their old operators.This PR also decouples hook, operators any additional Dataproc helpers.
Tests
Commits
Documentation