-
Notifications
You must be signed in to change notification settings - Fork 16.3k
[AIRFLOW-2221] Create DagFetcher abstraction #3138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3138 +/- ##
==========================================
- Coverage 76.67% 73.16% -3.51%
==========================================
Files 199 187 -12
Lines 16186 12716 -3470
==========================================
- Hits 12410 9304 -3106
+ Misses 3776 3412 -364
Continue to review full report at Codecov.
|
|
Great that this is coming! I would prefer a high level design before doing this as it is quite core. Couple of things that should be present:
Obviously, not everything needs to be in place right away, but having an generic idea where we are moving towards to seems smart. Security should not be an after thought :-). |
airflow/dag_fetcher.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please do not put this in the root dir but in something like airflow/dag/fetcher/filesystem.py
airflow/dag_fetcher.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: rename to BaseDagFetcher
|
Woot! That was fast :) Similar to what @bolkedebruin mentioned, please break the dag_fetcher.py up into separate modules for each:
Also would be useful to add dag fetchers to plugin manager so custom fetchers can be written for internal usages. It's consistent with operators/hooks/executors/etc. (see |
airflow/dag_fetcher.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.stats should be called in process_file, otherwise if the uri is a single file, fetch won't be able to generate stats for it (I'm aware this is just copy/paste from old code, but just pointing it out)
airflow/dag_fetcher.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a bit rigid, it doesn't handle scenario where multiple dag fetchers are written for a specific service
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it maybe worth adding dag fetcher as a default config.
b37cc10 to
081f7ee
Compare
Refactors DagBag to prepare the possibility of fetching dags remotely, by creating the DagFetcher abstraction. Defaults to FileSystemDagFetcher, maintining backwards compatibility.
081f7ee to
6979f0d
Compare
|
Thank you both for your help! Updated the PR:
Other topics:
I felt like starting the implementation would help to get things moving, but I'm happy to create a design doc. |
|
A few points to consider:
I think the URI is really key here. Assuming that the URI contains a provenance (s3, hive, hdfs, git), and can contain a version number (people are free to version their artifacts, create symlinks to latests, git refs, ...) and that an extension A few more thing:
Sorry for the large brain dump here. This is a bit of an hairy issue. |
|
@diogoalexandrefranco a design doc would be the ideal next step. This PR is a really awesome initiative, and it would be good that we prune out the design before merging this in case folks start adapting this current design and we end up making bw-incompatible changes later. @mistercrunch Using URI scheme to determine dag fetching mechanism is fine with me as well, but I think adding flexibility to the fetcher on how the URI is parsed will be useful (A bit far-fetched (no pun intended) example, which relates to manually maintaining a list of dags, would be someone decided to use a text file on a local machine to maintain their dag list, then it would be useful to have a custom text file fetcher that allow them to fetch the dags from those listed in the text file.) I'm also excited that the dag fetcher should be able to get rid of the work-around with https://issues.apache.org/jira/browse/AIRFLOW-97 |
|
Yeah About the DAGS_MANIFEST idea, if it can receive a callable, people can do whatever they want to manage their dictionary of dag_ids -> URIs. If they want a hard coded list that's fine, if they want to parse a yaml file they can do that, use the current crawler logic equivalent it'sok, if they want to put it in a In retrospec the |
|
A lot to unpack here :) Thank you for your help! With so much on the table, I'm thinking that it makes sense to explicitly divide this into phases. The base re-architecture seems like the low hanging fruit we could merge first, allowing independent development of many of these features afterwards. Something like: Base re-arch:
It seems like the above could be done in such a way where things would be strictly improved, even before further work. Independent features:
How should we go about starting the design doc? The Airflow space in Confluence seems like the right place, but I don't have permission to create a new page. |
|
https://github.com/RaRe-Technologies/smart_open may help, not sure if it's a perfect fit though. You should be able to get a confluence account as you get an Apache JIRA account, in the meantime a public gdoc or gist or anything that doesn't requires creds works fine by me. |
|
@diogoalexandrefranco just want to check up with you to make sure you are not blocked from the confluence permission issue, and how the design doc is going :) |
|
@diogoalexandrefranco Friendly ping - are you still planning on abstracting out the BaseDagFetcher, or is that work up for grabs? |
|
@astahlman looks like it's up for grabs, happy to help |
|
@astahlman , let me know how I can help. This should be very interesting. |
|
Hi guys,
Sorry I never got around to the design doc, my free time completely shrank.
Feel free to own this of course, I hope to be able to help again soon
enough.
There is an open PR which at most may be a decent starting point for some
of the changes that are required, it is a working implementation of the
abstraction with the current file system dag fetcher and the plug-in system
for dag fetchers.
Cheers,
Diogo
A Ter, 14 de ago de 2018, 17:08, Tao Feng <[email protected]>
escreveu:
… @astahlman <https://github.com/astahlman> , let me know how I can help.
This should be very interesting.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3138 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AHMy3SpyayZAAZh9bUQitn6PhN-8ry0_ks5uQvYagaJpZM4SvBuq>
.
|
|
Andrew @astahlman , FYI, Airflow starts using AIF(airflow improvement proposal). With this kinda change, I think we need to have a wiki to document the initial design(https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals). |
|
@astahlman Are you still working on this PR? My coworker and I are interested in helping out with this, including drafting up an AIP. |
|
@idavison I am not working on it - all yours! |
|
Hey guys, I've drafted up a proposal for the dag fetcher. I've tried to keep in mind everything people have discussed here previously, but I'm sure I've missed some stuff. I'd appreciate people's thoughts on this and I have some questions about different design decisions we'll need to make . Thanks! |
|
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
|
Hello guys , this is an important feature for large scale Airflow setup . what are the news ? |
|
looking forward to this feature. it might be good idea to have rest endpoint to trigger the fetch operation
|
|
For some reason this PR and the accompanying AIP have been either forgotten or abandoned .. I am posting this comment in hopes that someone can push this forward.. |
|
Any news? |
Refactors DagBag to prepare the possibility of fetching dags remotely, by creating the DagFetcher abstraction, and a factory function that instantiates the right DagFetcher based on the dags_folder URI prefix.
Implements the default FileSystemDagFetcher (using the code that was part of DagBag).
Adds the possibility of adding DagFetchers as plugins.
JIRA
Description
Because the file system related code from DagBag was moved to FileSystemDagFetcher, the diff is bigger than the actual changes. Is is basically:
Tests
Added test_get_dag_fetcher in tests.models:DagBagTest
Removed test_get_dag_without_refresh as I didn't find an easy way to implement this test under the new architecture.
Added test_fetchers in test.plugins_manager:PluginsTest
Commits
My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
Passes
git diff upstream/master -u -- "*.py" | flake8 --diff