
Conversation

@potiuk
Member

@potiuk potiuk commented Oct 10, 2020

We seem to have a problem with running all tests at once - most
likely due to resource problems in our CI - therefore it makes
sense to split the tests into more batches. This is not yet a full
implementation of selective tests, but it goes in that direction
by splitting the tests into Core/Providers/API/CLI batches. The
full selective-tests approach will be implemented as part of
issue #10507.

This split is possible thanks to #10422, which moved building the
image to a separate workflow. This way each image is only built
once and uploaded to a shared registry, from which it is quickly
downloaded by the jobs rather than being built by each of them
separately. As a result we can have many more jobs, because there
is very little per-job overhead before the tests start running.
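
For illustration, a minimal sketch of what the per-type invocation looks like with the `--test-type` switch added here (the Core/Providers/API/CLI names come from this PR; in CI each type runs as a separate parallel job rather than a loop, and the backend/Python values below are just example settings):

```bash
# Minimal sketch: run each test-type batch locally via Breeze.
# In CI these are separate parallel jobs; the loop is only for illustration.
for TEST_TYPE in Core Providers API CLI; do
  ./breeze --backend postgres --python 3.6 --db-reset --test-type "${TEST_TYPE}" tests
done
```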



@boring-cyborg boring-cyborg bot added area:dev-tools area:Scheduler including HA (high availability) scheduler labels Oct 10, 2020
@potiuk potiuk requested a review from kaxil October 10, 2020 16:09
@potiuk potiuk assigned mik-laj and unassigned mik-laj Oct 10, 2020
@potiuk potiuk requested review from ashb, mik-laj and turbaszek October 10, 2020 16:09
@potiuk
Member Author

potiuk commented Oct 10, 2020

This is an attempt to improve the stability of our tests. I am still trying it out and it will likely fail at first (I had to move tests from the "tests" directory to a "core" directory, which will probably cause some more trouble), but I think it's going in the right direction: we will have far fewer tests to run per job but many more jobs to run. I think that will be fine, because those jobs will generally run much, much faster, and I hope the 137 "errors" will be gone (I also moved "backfill_job" to Heisentests for now).

The next step will be to run only a subset of tests for non-core-related changes, as described in #10507.
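
As a rough illustration of what the directory move and the Heisentests group mean for running things locally, here is a sketch; it assumes the relocated tests live under tests/core/ and that the group is selected by a pytest marker named "heisentests", both of which are worth verifying against the repo:

```bash
# Rough sketch, under the assumptions stated above.
# Run only the tests relocated to the new core directory:
pytest tests/core/
# Run only the Heisentests group (e.g. the relocated backfill_job tests):
pytest -m heisentests tests/
```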

@github-actions

The Workflow run is cancelling this PR. Building image for the PR has been cancelled

@potiuk potiuk force-pushed the split-tests branch 3 times, most recently from 67d4aca to 828f305 Compare October 10, 2020 18:04
@potiuk potiuk marked this pull request as draft October 10, 2020 18:31
@potiuk potiuk force-pushed the split-tests branch 4 times, most recently from dd8392d to adc99d4 Compare October 11, 2020 07:40
@potiuk potiuk marked this pull request as ready for review October 11, 2020 07:42
@potiuk
Member Author

potiuk commented Oct 11, 2020

Hey Everyone. I think with this change we have a chance to finally reach stability of the tests.

I split the tests into multiple jobs, and I think it is a very nice split. I believe that with the split I introduced we will hit resource limitations far less frequently (even though we run many more jobs). None of the test jobs runs longer than ~7 minutes (including pulling the image built once in the separate workflow). Additionally, I've added all the test types to Breeze, so it will be really easy to reproduce every test type locally. If you have the image built locally with Breeze (also built only once), it takes literally 4-5 minutes or less (depending on your machine) to re-run the complete failed set for a given test type and reproduce the failure; then you can enter Breeze and reproduce the failures one-by-one.

Additionally, in case of a test job failure, I print some useful instructions on how to reproduce the failed run locally. In case you have not noticed, it is now very easy to reproduce a failed build from CI using the RUN_ID from GitHub Actions (you can pull the very image that was used to run the tests). This information is now printed out when tests fail, so it will be immediately visible to the author and committer, and reproducing the failed tests will be a ..... BREEZE.

*******************************************************************************************************
*
* ERROR! Some tests failed, unfortunately. Those might be transient errors,
*        but usually you have to fix something.
*        See the above log for details.
*
*******************************************************************************************************
*  You can easily reproduce the failed tests on your dev machine.
*
*   When you have the source branch checked out locally:
*
*     Run all tests:
*
*       ./breeze --backend postgres --python 3.6 --db-reset --test-type Core  tests
*
*     Enter docker shell:
*
*       ./breeze --backend postgres --python 3.6 --db-reset --test-type Core  shell
*
*   When you do not have sources:
*
*     Run all tests:
*
*       ./breeze --github-image-id NNNNNNNN --backend postgres --python 3.6 --db-reset --test-type Core  tests
*
*     Enter docker shell:
*
*       ./breeze --github-image-id NNNNNNNN  --backend postgres --python 3.6 --db-reset --test-type Core  shell
*
*
*   NOTE! Once you are in the docker shell, you can run failed test with:
*
*            pytest [TEST_NAME]
*
*   You can copy the test name from the output above
*
***************************************************************************************************************
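
For example, putting the instructions above together for a single failed test could look roughly like the sketch below (NNNNNNNN stands for the RUN_ID from GitHub Actions, and the pytest target is a hypothetical example you would copy from the failing job's output):

```bash
# Sketch only: RUN_ID and the test id below are placeholders.
./breeze --github-image-id NNNNNNNN --backend postgres --python 3.6 --db-reset --test-type Core shell
# ...then, inside the container shell:
pytest tests/core/test_example.py::TestExample::test_something
```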

@potiuk potiuk force-pushed the split-tests branch 5 times, most recently from 08e2f64 to f0d6435 Compare October 11, 2020 09:38
@potiuk
Member Author

potiuk commented Oct 11, 2020

Some of the process-related tests were flaky, dropping connections to MySQL/Postgres, so I moved them to Quarantined. However, I have not yet seen a single transient error caused by resource problems (Exit 137), so it looks really good. Paired with the workaround I implemented for the "unknown blob" problem (#11411), we might finally be back to a reasonably stable state of the tests.
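
For reference, running just the quarantined group locally could look like the sketch below; this assumes the tests are tagged with a pytest marker named "quarantined" (the exact marker name and any extra flags required by the repo's conftest are worth double-checking):

```bash
# Sketch: enter the Breeze container, then run only tests marked as quarantined.
./breeze --backend postgres --python 3.6 --db-reset shell
# inside the container:
pytest -m quarantined tests/
```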

@potiuk
Member Author

potiuk commented Oct 11, 2020

Success! I got a green build. There are a few follow-up tasks: the Quarantined tests need some love, and full selective tests still need to be implemented (I am running some last tests of that), and then we might be back in the Green CI business.

Looking forward to reviews, but I think it's going to be quite a game-changer.

@dimberman
Contributor

@potiuk Oh heck yes! I was about to suggest exactly this!

@dimberman dimberman merged commit 5bc5994 into apache:master Oct 11, 2020
@dimberman
Contributor

Yes, exactly. Before, we couldn't do this because we would have had to rebuild the image a bunch of times, but I think this will be great for reducing strain on the CI.

@potiuk
Member Author

potiuk commented Oct 11, 2020

Just look out for #11417! This will be the killer one.

@potiuk
Member Author

potiuk commented Oct 11, 2020

BTW. @dimberman -> This is where I wanted to get with CI when I joined the project ~ 2 years ago :).

Pretty much ALL the work I've done with Breeze and CI was to reach this very point where we can do this thing and massively speed CI up.

It was maaaaaaaaany PRs to get us here :).

Especially since it will now be so easy and straightforward to reproduce any failure locally. This is what I am especially happy about: when one of those jobs fails for a good reason, it's literally one command to reproduce the failed build and another to enter the container and re-run the test.

It should now take literally a few minutes to reproduce any failure, and we even show you in the logs how to do it with Breeze.

@ashb
Member

ashb commented Oct 11, 2020

Nice one!

@potiuk
Member Author

potiuk commented Oct 11, 2020

I have already run several hundred of those tests and not hit a single intermittent problem yet. I have really high hopes for this one!

potiuk added a commit that referenced this pull request Nov 14, 2020
(cherry picked from commit 5bc5994)
potiuk added a commit that referenced this pull request Nov 14, 2020
(cherry picked from commit 5bc5994)
@potiuk potiuk added the type:misc/internal Changelog: Misc changes that should appear in change log label Nov 14, 2020
@potiuk potiuk added this to the Airflow 1.10.13 milestone Nov 14, 2020
potiuk added a commit that referenced this pull request Nov 16, 2020
(cherry picked from commit 5bc5994)
potiuk added a commit that referenced this pull request Nov 16, 2020
(cherry picked from commit 5bc5994)
kaxil pushed a commit that referenced this pull request Nov 18, 2020
(cherry picked from commit 5bc5994)
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
(cherry picked from commit 5bc5994)