
Conversation

@potiuk
Member

@potiuk potiuk commented Oct 10, 2020

We seem to have a problem with running all tests at once - most
likely due to resource problems in our CI - therefore it makes
sense to split the tests into more batches. This is not yet a full
implementation of selective tests, but it goes in that direction
by splitting the tests into Core/Providers/API/CLI batches. The
full selective-tests approach will be implemented as part of
issue #10507.

This split is possible thanks to #10422, which moved building the
image to a separate workflow. This way each image is only built
once and uploaded to a shared registry, from which it is quickly
downloaded by the jobs rather than being built by each of them
separately. As a result we can have many more jobs, because there
is very little per-job overhead before the tests start running.
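
For illustration, a minimal sketch of what the per-type invocation looks like with the `--test-type` switch added here (the Core/Providers/API/CLI names come from this PR; in CI each type runs as a separate parallel job rather than a loop, and the backend/Python values below are just example settings):

```bash
# Minimal sketch: run each test-type batch locally via Breeze.
# In CI these are separate parallel jobs; the loop is only for illustration.
for TEST_TYPE in Core Providers API CLI; do
  ./breeze --backend postgres --python 3.6 --db-reset --test-type "${TEST_TYPE}" tests
done
```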



@boring-cyborg boring-cyborg bot added area:dev-tools area:Scheduler including HA (high availability) scheduler labels Oct 10, 2020
@potiuk potiuk requested a review from kaxil October 10, 2020 16:09
@potiuk potiuk assigned mik-laj and unassigned mik-laj Oct 10, 2020
@potiuk potiuk requested review from ashb, mik-laj and turbaszek October 10, 2020 16:09
@potiuk
Member Author

potiuk commented Oct 10, 2020

This is an attempt to improve the stability of our tests. I am still trying it out and it will likely fail at first (I had to move tests from the "tests" directory to a "core" directory, which will probably cause some more trouble), but I think it's going in the right direction: we will have far fewer tests to run per job but many more jobs to run. I think that will be fine, because those jobs will generally run much, much faster, and I hope the 137 "errors" will be gone (I also moved "backfill_job" to Heisentests for now).

The next step will be to run only a subset of tests for non-core-related changes, as described in #10507.
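
As a rough illustration of what the directory move and the Heisentests group mean for running things locally, here is a sketch; it assumes the relocated tests live under tests/core/ and that the group is selected by a pytest marker named "heisentests", both of which are worth verifying against the repo:

```bash
# Rough sketch, under the assumptions stated above.
# Run only the tests relocated to the new core directory:
pytest tests/core/
# Run only the Heisentests group (e.g. the relocated backfill_job tests):
pytest -m heisentests tests/
```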

@github-actions

The Workflow run is cancelling this PR. Building image for the PR has been cancelled

@potiuk potiuk force-pushed the split-tests branch 3 times, most recently from 67d4aca to 828f305 Compare October 10, 2020 18:04
@potiuk potiuk marked this pull request as draft October 10, 2020 18:31
@potiuk potiuk force-pushed the split-tests branch 4 times, most recently from dd8392d to adc99d4 Compare October 11, 2020 07:40
@potiuk potiuk marked this pull request as ready for review October 11, 2020 07:42
@potiuk
Member Author

potiuk commented Oct 11, 2020

Hey Everyone. I think with this change we have a chance to finally reach stability of the tests.

I split the tests into multiple jobs, and I think it is a very nice split. I believe that with the split I introduced we will hit resource limitations far less frequently (even though we run many more jobs). None of the test jobs runs longer than ~7 minutes (including pulling the image built once in the separate workflow). Additionally, I've added all the test types to Breeze, so it will be really easy to reproduce every test type locally. If you have the image built locally with Breeze (also built only once), it takes literally 4-5 minutes or less (depending on your machine) to re-run the complete failed set for a given test type and reproduce the failure; then you can enter Breeze and reproduce the failures one-by-one.

Additionally, in case of a test job failure, I print some useful instructions on how to reproduce the failed run locally. In case you have not noticed, it is now very easy to reproduce a failed build from CI using the RUN_ID from GitHub Actions (you can pull the very image that was used to run the tests). This information is now printed out when tests fail, so it will be immediately visible to the author and committer, and reproducing the failed tests will be a ..... BREEZE.

*******************************************************************************************************
*
* ERROR! Some tests failed, unfortunately. Those might be transient errors,
*        but usually you have to fix something.
*        See the above log for details.
*
*******************************************************************************************************
*  You can easily reproduce the failed tests on your dev machine.
*
*   When you have the source branch checked out locally:
*
*     Run all tests:
*
*       ./breeze --backend postgres --python 3.6 --db-reset --test-type Core  tests
*
*     Enter docker shell:
*
*       ./breeze --backend postgres --python 3.6 --db-reset --test-type Core  shell
*
*   When you do not have sources:
*
*     Run all tests:
*
*       ./breeze --github-image-id NNNNNNNN --backend postgres --python 3.6 --db-reset --test-type Core  tests
*
*     Enter docker shell:
*
*       ./breeze --github-image-id NNNNNNNN  --backend postgres --python 3.6 --db-reset --test-type Core  shell
*
*
*   NOTE! Once you are in the docker shell, you can run failed test with:
*
*            pytest [TEST_NAME]
*
*   You can copy the test name from the output above
*
***************************************************************************************************************
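
For example, putting the instructions above together for a single failed test could look roughly like the sketch below (NNNNNNNN stands for the RUN_ID from GitHub Actions, and the pytest target is a hypothetical example you would copy from the failing job's output):

```bash
# Sketch only: RUN_ID and the test id below are placeholders.
./breeze --github-image-id NNNNNNNN --backend postgres --python 3.6 --db-reset --test-type Core shell
# ...then, inside the container shell:
pytest tests/core/test_example.py::TestExample::test_something
```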

@potiuk potiuk force-pushed the split-tests branch 5 times, most recently from 08e2f64 to f0d6435 Compare October 11, 2020 09:38
@potiuk
Member Author

potiuk commented Oct 11, 2020

Some of the process-related tests were flaky, dropping connections to MySQL/Postgres, so I moved them to Quarantined. However, I have not yet seen a single transient error caused by resource problems (Exit 137), so it looks really good. Paired with the workaround I implemented for the "unknown blob" problem (#11411), we might finally be back to a reasonably stable state of the tests.
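
For reference, running just the quarantined group locally could look like the sketch below; this assumes the tests are tagged with a pytest marker named "quarantined" (the exact marker name and any extra flags required by the repo's conftest are worth double-checking):

```bash
# Sketch: enter the Breeze container, then run only tests marked as quarantined.
./breeze --backend postgres --python 3.6 --db-reset shell
# inside the container:
pytest -m quarantined tests/
```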

@potiuk
Member Author

potiuk commented Oct 11, 2020

Success! I got a green build. There are a few follow-up tasks: the Quarantined tests need some love, and full selective tests still need to be implemented (I am running some last tests of that), and then we might be back in the Green CI business.

Looking forward to reviews, but I think it's going to be quite a game-changer.

@dimberman
Contributor

@potiuk Oh heck yes! I was about to suggest exactly this!

@dimberman dimberman merged commit 5bc5994 into apache:master Oct 11, 2020
@dimberman
Contributor

Yes, exactly. Before, we couldn't do this because we would have had to rebuild the image a bunch of times, but I think this will be great for reducing strain on the CI.

@potiuk
Member Author

potiuk commented Oct 11, 2020

Just look out for #11417! This will be the killer one.

@potiuk
Member Author

potiuk commented Oct 11, 2020

BTW. @dimberman -> This is where I wanted to get with CI when I joined the project ~ 2 years ago :).

Pretty much ALL the work I've done with Breeze and CI was to reach this very point where we can do this thing and massively speed CI up.

It was maaaaaaaaany PRs to get us here :).

Especially since it will now be so easy and straightforward to reproduce any failure locally. This is what I am especially happy about: when one of those jobs fails for a good reason, it's literally one command to reproduce the failed build and another to enter the container and re-run the test.

It should now take literally a few minutes to reproduce any failure, and we even show you in the logs how to do it with Breeze.

@ashb
Member

ashb commented Oct 11, 2020

Nice one!

@potiuk
Member Author

potiuk commented Oct 11, 2020

I have already run several hundred of those tests and not hit a single intermittent problem yet. I have really high hopes for this one!

potiuk added a commit that referenced this pull request Nov 14, 2020
(cherry picked from commit 5bc5994)
potiuk added a commit that referenced this pull request Nov 14, 2020
(cherry picked from commit 5bc5994)
@potiuk potiuk added the type:misc/internal Changelog: Misc changes that should appear in change log label Nov 14, 2020
@potiuk potiuk added this to the Airflow 1.10.13 milestone Nov 14, 2020
potiuk added a commit that referenced this pull request Nov 16, 2020
(cherry picked from commit 5bc5994)
potiuk added a commit that referenced this pull request Nov 16, 2020
(cherry picked from commit 5bc5994)
kaxil pushed a commit that referenced this pull request Nov 18, 2020
(cherry picked from commit 5bc5994)
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
(cherry picked from commit 5bc5994)