-
Notifications
You must be signed in to change notification settings - Fork 16.3k
[AIRFLOW-4115] Multi-staging Aiflow Docker image [Step 1/3] #4936
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
potiuk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dimberman @Fokko @ashb - this is the first of the steps, each of the changes is much smaller and can be reviewed more easily I hope.
101eccf to
55d4afc
Compare
Dockerfile
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having this after copying source files seems out of place/not where I would have put this.
Is there a reason this is in this image rather than, say before the first COPY in this image, or in the airflow-apt-deps image?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason was to make sure we always use latest dependencies and update frequently. I think that originated from our original (long time ago :) ) discussion with @Fokko - he was concerned that we will be "stuck" with some older version of apt-get dependencies, rather than latest ones.
There must be some trigger to invalidate such apt-get and we can play a bit with it.
I also don't find it particularly good to run apt-get upgrade after adding sources. I'd rather do it after copying "setup.py" and related files - this way we would always upgrade to latest apt dependencies together with updates to pip dependencies (which makes perfect sense).
This will be less frequent on one hand but also more stable (less risk that build starts failing because some dependency-triggered issue independent of code change). Similarly as in case of PIP dependencies that gives IMHO the best of both worlds - fairly frequent, automated update of dependencies and relative stability - the dependencies will always be "fixed" between a change in setup.py - so no more broken builds "out-of-the-blue".
Paired with regularly daily-cron-triggered fresh rebuild from the scratch, I think we might have perfect solution that detects dependency-triggered errors quickly but does not stop contributors from working and running Travis CI tests.
I think we are aligned on that @ashb and I am happy to move it there.
@Fokko - what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we move this to before the COPY --chown=airflow:airflow . /opt/airflow/ line then?
.dockerignore
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why ignore everything and then add a few things back in? The list of things we dont want in isn't that big.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have already given a talk on that actually I plan to write blog post :).
It's a practice I now absolutely recommend. I read it in many places as recommendation (for example here: https://youknowfordevs.com/2018/12/07/getting-control-of-your-dockerignore-files.html). But I experienced the effect of it in airflow. Besides longer time to build, this is a problem of cache invalidation when you do 'COPY .' .
The thing is that in this case the whole context has to match when you run rebuild of the image and if you add everything by default then every single file generated or added by accident in your source dir will cause context invalidation.
We already generate a lot of stuff in the sources (node_modules, 'static' and a number of others). When you do local development you often run stuff (such as document generation) locally and they introduce a lot of garbage in your context and invalidate the cache (thus pretty much every single time you restart dockerfile from the COPY . step no matter if the actual sources changed or not.
Moreover - you are not able to foresee what other stuff will be generated in the sources in the future. So no matter how hard you try and exclude everything you do not want, somebody few months from now might add a single generated file (or likely directory) that will cause subsequent frequent cache invalidations. Therefore "exclude everything" and then "add what you need" works much better. There is no chance you forget to add something important then - because your tests will fail.
This is actually the reason I now deliberately build Docker image and run tests from the binary image rather than using sources mounted (as it was before). This way your full docker image is tested. Previously some files could be ignored on build but then mounted from sources - which is quite bad practice - because it does not test the Docker image but some hybrid of Docker image and mounted sources (which are not necessarily the same).
One - hugely important additional thing - making context as minimal as possible is a huge time saver for local builds. What happens under the hood when Docker build runs, the whole context is compressed/packed and sent to the engine rather than used from local sources. This is the reason why Docker build sometimes pauses for many seconds until build starts - if you see it, it usually means you have not done good job in excluding some generated binaries. If you don't ignore such generated files (especially node_modules which are huge/lots of files) then your context might grow very large (and uncontrollably).
Excluding ** is all but guarantee that the context will not grow accidentally and uncontrollably.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worry that by excluding ** we will miss new files is all.
You also have a number of duplicates in this file
There's no need for both right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And if we are excluding everything then do we need these
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct with the duplicates. I reviewed and removed it.
There is little worry that we will forget something - the tests will catch it (and it should rarely be needed - adding a new folder or file at the top level should be really rare event)
And yes - you need those extra specs because they will exclude what should not be part of those included dirs above. The pattern I follow is (and I saw it suggested by several people):
- ** -> Exclude everything
- !/airflow -> include only those top-level folders/top-level files that you need
- **/pycache/ -> from those folders that you included in 2) exclude all the generated content that follows common patterns for generated stuff (like *.pyc file or /pycache/ folders)
I feel very strong about it - having to debug the context changes (and figuring out what the hell changed to trigger copying of the files again) for many hours. Example: only after a while I realised that there is a generated folder in docs which is _build next to _template (which is not generated, thought the name kind of suggest it). And just generating the docs locally triggered rebuild of Dockerfile (that's why I did not include the whole docs folder :) ). Otherwise I would have to specifically exclude that particular folder (if I knew that I have to do it). I really think adding stuff deliberately to Docker context makes much more sense.
Alternative is to move ALL generated content to specific directory, and exclude that directory - but this is not always possible (node_modules) and from my experience people will not "stick" to it - they will start generating random files in different folders. Of course we are not protected if someone starts generating "xyz" file/folder inside "airflow" folder. But it's rather unlikely someone will do something that crazy. I think generating something in sources is rather bad practice and you do it only if you cannot do it otherwise (that's one of the parts I don't like about python - .pyc files are generated next to actual source files. It reminds me of times when I used RCS versioning system and we had source.c,v next to every file.... BRRR)
But to be honest - It was much bigger problem and I eventually introduced checksums for important files (setup/docker) to check if we need to run docker build at all.
But still I think it makes it much more resilient for future changes that might produce uncontrollable and unpredictable context "growth".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can give some more examples where things could go south. When I run unit tests from the IDE, the logs by default are written in the working directory - which is by default airflow's root folder. They are now ignored with .gitignore but when docker build is executed the logs are potentially big and should not be sent as part of the context. It's safer to exclude those by default.
Dockerfile
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are splitting out to multi-stage images it would be nice to node and the asset build stage as it's own step so that in the final image we have
COPY --from=airflow-assets /opt/airflow/airflow/www/static/dist /opt/airflow/airflow/www/static/dist
No node, no node modules in final image that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that would prevent you from re-running npm ci/npm run prod when you work with mounted sources. It's not really acceptable for local development with the CI image.
I think it's good to keep it - at the very least - for the CI image. I can think about some optional step to build it as another stage for "Airflow" image - but I think it's not really good idea. I will think a bit more about it, but as I see it, it would complicate the whole Dockerfile and image a lot with little gain. I'd rather leave that in for now.
Especially that we have bright future. I think it's a good candidate for upcoming changes in Docker - especially Buildkit. There is a new, experimental feature with mount type in RUN command. It could be used for that purpose. Secret type seems especially interesting as it prevents the layer generated by a RUN command to ever appear in the final image (it will be present during the build):
https://github.com/moby/buildkit/blob/b521aae3ea1e1bb73298481ac538f9880ccd3854/frontend/dockerfile/docs/experimental.md#run---mounttypesecret
I already saw people commenting that they would use secret for that very purpose you described. It was foreseen for other purpose (to keep secretes from leaking) but maybe in the final buildkit release there will be a new type "ignored" or smth. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair, on CI this makes sense.
8235502 to
346bb16
Compare
Codecov Report
@@ Coverage Diff @@
## master #4936 +/- ##
==========================================
+ Coverage 76.36% 76.36% +<.01%
==========================================
Files 471 471
Lines 30295 30295
==========================================
+ Hits 23135 23136 +1
+ Misses 7160 7159 -1
Continue to review full report at Codecov.
|
potiuk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @ashb @Fokko -> I moved setup.py modification to the 3rd commit where I get rid of tox (I will move discussion there).
I think the only unresolved point here is the approach for .dockerignore exclusion/inclusion patterns.
But I have strong feeling that this is the right approach.
Does it look good for you so that I can proceed with further "advertising" of the AIP on devlist?
|
@potiuk Yes, this looks very good. Now we have Python2 support dropped for Airflow 2.0, we possibly even simplify even more. I can also open a |
|
Great @Fokko . If you think it's ready for a vote I am happy :). Let's start it. I guess It might start with me summarising (quickly) on the devlist where we are (I will also quickly explain how we get through 1-2-3 PRs - there are some steps needed to go between 1, 2, 3 (including modification of airflow Dockerhub configuration. |
|
@potiuk Late happy birthday to you :-) I've started the vote. The VOTE is more on the concept than the actual implementation. Commenting on the implementation is easier to do on the PR. |
Dockerfile
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo:
| # Always add-get update/upgrade here to get latest dependencies before | |
| # Always apt-get update/upgrade here to get latest dependencies before |
ashb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like (but don't require) to re-order the COPY and RUN apt-get update+upgrade otherwise looks good to me.
|
Sure - I forgot about this. I will re-order. |
Fokko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @potiuk
(cherry picked from commit 5b53d99)
(cherry picked from commit 5b53d99)
(cherry picked from commit 5b53d99)
(cherry picked from commit 5b53d99)
Make sure you have checked all steps below.
Jira
Description
This is the first step in implementing Multi-stage Docker image and use it in CI tests.
Note that due to GPL licence of hadolint we might ask for exemption of ASF to use it.
More details here:
Tests
The original Docker file did not have tests. Tests are coming as part of the follow-up changes where all automated tests of Airflow will be run using the staged image.
Commits
Documentation
No documentation needed for that change. Design is documented in https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-10+Multi-layered+and+multi-stage+official+Airflow+image
Follow up documentation will be added in Simplified Development Workflow change: #4932
Code Quality
flake8