Skip to content

Conversation

@potiuk
Copy link
Member

@potiuk potiuk commented Mar 18, 2019

Make sure you have checked all steps below.

Jira

Description

  • Here are some details about my PR, including screenshots of any UI changes:

This is the first step in implementing Multi-stage Docker image and use it in CI tests.
Note that due to GPL licence of hadolint we might ask for exemption of ASF to use it.

More details here:

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
  • automated lint on the Dockerfile run on Travis CI

The original Docker file did not have tests. Tests are coming as part of the follow-up changes where all automated tests of Airflow will be run using the staged image.

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

No documentation needed for that change. Design is documented in https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-10+Multi-layered+and+multi-stage+official+Airflow+image

Follow up documentation will be added in Simplified Development Workflow change: #4932

Code Quality

  • Passes flake8

@potiuk potiuk closed this Mar 18, 2019
@potiuk potiuk reopened this Mar 18, 2019
@potiuk potiuk changed the title [AIRFLOW-4115] Multi-staging Aiflow Docker image with linting [AIRFLOW-4115] Multi-staging Aiflow Docker image [Step 1/3] Mar 18, 2019
@potiuk potiuk closed this Mar 18, 2019
@potiuk potiuk reopened this Mar 18, 2019
Copy link
Member Author

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dimberman @Fokko @ashb - this is the first of the steps, each of the changes is much smaller and can be reviewed more easily I hope.

Dockerfile Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having this after copying source files seems out of place/not where I would have put this.

Is there a reason this is in this image rather than, say before the first COPY in this image, or in the airflow-apt-deps image?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason was to make sure we always use latest dependencies and update frequently. I think that originated from our original (long time ago :) ) discussion with @Fokko - he was concerned that we will be "stuck" with some older version of apt-get dependencies, rather than latest ones.

There must be some trigger to invalidate such apt-get and we can play a bit with it.

I also don't find it particularly good to run apt-get upgrade after adding sources. I'd rather do it after copying "setup.py" and related files - this way we would always upgrade to latest apt dependencies together with updates to pip dependencies (which makes perfect sense).

This will be less frequent on one hand but also more stable (less risk that build starts failing because some dependency-triggered issue independent of code change). Similarly as in case of PIP dependencies that gives IMHO the best of both worlds - fairly frequent, automated update of dependencies and relative stability - the dependencies will always be "fixed" between a change in setup.py - so no more broken builds "out-of-the-blue".

Paired with regularly daily-cron-triggered fresh rebuild from the scratch, I think we might have perfect solution that detects dependency-triggered errors quickly but does not stop contributors from working and running Travis CI tests.

I think we are aligned on that @ashb and I am happy to move it there.

@Fokko - what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move this to before the COPY --chown=airflow:airflow . /opt/airflow/ line then?

.dockerignore Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ignore everything and then add a few things back in? The list of things we dont want in isn't that big.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have already given a talk on that actually I plan to write blog post :).

It's a practice I now absolutely recommend. I read it in many places as recommendation (for example here: https://youknowfordevs.com/2018/12/07/getting-control-of-your-dockerignore-files.html). But I experienced the effect of it in airflow. Besides longer time to build, this is a problem of cache invalidation when you do 'COPY .' .

The thing is that in this case the whole context has to match when you run rebuild of the image and if you add everything by default then every single file generated or added by accident in your source dir will cause context invalidation.

We already generate a lot of stuff in the sources (node_modules, 'static' and a number of others). When you do local development you often run stuff (such as document generation) locally and they introduce a lot of garbage in your context and invalidate the cache (thus pretty much every single time you restart dockerfile from the COPY . step no matter if the actual sources changed or not.

Moreover - you are not able to foresee what other stuff will be generated in the sources in the future. So no matter how hard you try and exclude everything you do not want, somebody few months from now might add a single generated file (or likely directory) that will cause subsequent frequent cache invalidations. Therefore "exclude everything" and then "add what you need" works much better. There is no chance you forget to add something important then - because your tests will fail.

This is actually the reason I now deliberately build Docker image and run tests from the binary image rather than using sources mounted (as it was before). This way your full docker image is tested. Previously some files could be ignored on build but then mounted from sources - which is quite bad practice - because it does not test the Docker image but some hybrid of Docker image and mounted sources (which are not necessarily the same).

One - hugely important additional thing - making context as minimal as possible is a huge time saver for local builds. What happens under the hood when Docker build runs, the whole context is compressed/packed and sent to the engine rather than used from local sources. This is the reason why Docker build sometimes pauses for many seconds until build starts - if you see it, it usually means you have not done good job in excluding some generated binaries. If you don't ignore such generated files (especially node_modules which are huge/lots of files) then your context might grow very large (and uncontrollably).

Excluding ** is all but guarantee that the context will not grow accidentally and uncontrollably.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@ashb ashb Mar 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct with the duplicates. I reviewed and removed it.

There is little worry that we will forget something - the tests will catch it (and it should rarely be needed - adding a new folder or file at the top level should be really rare event)

And yes - you need those extra specs because they will exclude what should not be part of those included dirs above. The pattern I follow is (and I saw it suggested by several people):

  1. ** -> Exclude everything
  2. !/airflow -> include only those top-level folders/top-level files that you need
  3. **/pycache/ -> from those folders that you included in 2) exclude all the generated content that follows common patterns for generated stuff (like *.pyc file or /pycache/ folders)

I feel very strong about it - having to debug the context changes (and figuring out what the hell changed to trigger copying of the files again) for many hours. Example: only after a while I realised that there is a generated folder in docs which is _build next to _template (which is not generated, thought the name kind of suggest it). And just generating the docs locally triggered rebuild of Dockerfile (that's why I did not include the whole docs folder :) ). Otherwise I would have to specifically exclude that particular folder (if I knew that I have to do it). I really think adding stuff deliberately to Docker context makes much more sense.

Alternative is to move ALL generated content to specific directory, and exclude that directory - but this is not always possible (node_modules) and from my experience people will not "stick" to it - they will start generating random files in different folders. Of course we are not protected if someone starts generating "xyz" file/folder inside "airflow" folder. But it's rather unlikely someone will do something that crazy. I think generating something in sources is rather bad practice and you do it only if you cannot do it otherwise (that's one of the parts I don't like about python - .pyc files are generated next to actual source files. It reminds me of times when I used RCS versioning system and we had source.c,v next to every file.... BRRR)

But to be honest - It was much bigger problem and I eventually introduced checksums for important files (setup/docker) to check if we need to run docker build at all.

But still I think it makes it much more resilient for future changes that might produce uncontrollable and unpredictable context "growth".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can give some more examples where things could go south. When I run unit tests from the IDE, the logs by default are written in the working directory - which is by default airflow's root folder. They are now ignored with .gitignore but when docker build is executed the logs are potentially big and should not be sent as part of the context. It's safer to exclude those by default.

Dockerfile Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are splitting out to multi-stage images it would be nice to node and the asset build stage as it's own step so that in the final image we have

COPY --from=airflow-assets /opt/airflow/airflow/www/static/dist /opt/airflow/airflow/www/static/dist

No node, no node modules in final image that way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that would prevent you from re-running npm ci/npm run prod when you work with mounted sources. It's not really acceptable for local development with the CI image.

I think it's good to keep it - at the very least - for the CI image. I can think about some optional step to build it as another stage for "Airflow" image - but I think it's not really good idea. I will think a bit more about it, but as I see it, it would complicate the whole Dockerfile and image a lot with little gain. I'd rather leave that in for now.

Especially that we have bright future. I think it's a good candidate for upcoming changes in Docker - especially Buildkit. There is a new, experimental feature with mount type in RUN command. It could be used for that purpose. Secret type seems especially interesting as it prevents the layer generated by a RUN command to ever appear in the final image (it will be present during the build):
https://github.com/moby/buildkit/blob/b521aae3ea1e1bb73298481ac538f9880ccd3854/frontend/dockerfile/docs/experimental.md#run---mounttypesecret

I already saw people commenting that they would use secret for that very purpose you described. It was foreseen for other purpose (to keep secretes from leaking) but maybe in the final buildkit release there will be a new type "ignored" or smth. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair, on CI this makes sense.

@potiuk potiuk force-pushed the ms-main-image branch 4 times, most recently from 8235502 to 346bb16 Compare April 1, 2019 05:51
@codecov-io
Copy link

codecov-io commented Apr 1, 2019

Codecov Report

Merging #4936 into master will increase coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4936      +/-   ##
==========================================
+ Coverage   76.36%   76.36%   +<.01%     
==========================================
  Files         471      471              
  Lines       30295    30295              
==========================================
+ Hits        23135    23136       +1     
+ Misses       7160     7159       -1
Impacted Files Coverage Δ
airflow/contrib/operators/ssh_operator.py 83.54% <0%> (+1.26%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bd08034...925397d. Read the comment docs.

Copy link
Member Author

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ashb @Fokko -> I moved setup.py modification to the 3rd commit where I get rid of tox (I will move discussion there).

I think the only unresolved point here is the approach for .dockerignore exclusion/inclusion patterns.

But I have strong feeling that this is the right approach.

Does it look good for you so that I can proceed with further "advertising" of the AIP on devlist?

@Fokko
Copy link
Contributor

Fokko commented Apr 2, 2019

@potiuk Yes, this looks very good. Now we have Python2 support dropped for Airflow 2.0, we possibly even simplify even more. I can also open a [VOTE] for the API. Let me know!

@potiuk
Copy link
Member Author

potiuk commented Apr 3, 2019

Great @Fokko . If you think it's ready for a vote I am happy :). Let's start it.

I guess It might start with me summarising (quickly) on the devlist where we are (I will also quickly explain how we get through 1-2-3 PRs - there are some steps needed to go between 1, 2, 3 (including modification of airflow Dockerhub configuration.

@Fokko
Copy link
Contributor

Fokko commented Apr 5, 2019

@potiuk Late happy birthday to you :-) I've started the vote. The VOTE is more on the concept than the actual implementation. Commenting on the implementation is easier to do on the PR.

Dockerfile Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo:

Suggested change
# Always add-get update/upgrade here to get latest dependencies before
# Always apt-get update/upgrade here to get latest dependencies before

Copy link
Member

@ashb ashb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like (but don't require) to re-order the COPY and RUN apt-get update+upgrade otherwise looks good to me.

@potiuk
Copy link
Member Author

potiuk commented Apr 6, 2019

Sure - I forgot about this. I will re-order.

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @potiuk

@Fokko Fokko merged commit 5b53d99 into apache:master Apr 13, 2019
cthenderson pushed a commit to cthenderson/apache-airflow that referenced this pull request Apr 16, 2019
potiuk added a commit that referenced this pull request Jul 19, 2019
potiuk added a commit that referenced this pull request Jul 20, 2019
ashb pushed a commit to ashb/airflow that referenced this pull request Jul 23, 2019
andriisoldatenko pushed a commit to andriisoldatenko/airflow that referenced this pull request Jul 26, 2019
wmorris75 pushed a commit to modmed-external/incubator-airflow that referenced this pull request Jul 29, 2019
dharamsk pushed a commit to postmates/airflow that referenced this pull request Aug 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants