
Conversation

@crusaderky
Collaborator

@crusaderky crusaderky commented Feb 12, 2021

Changes:

  • Moved Linux and MacOS CI from Travis to GitHub Actions
  • Now running MacOS CI on Python 3.8 in addition to 3.7
  • Run coverage everywhere
  • Added a new GitHub workflow to debug ssh-related issues (disabled by default)
  • Removed python-snappy and python-blosc from 3.7 (they're already tested on 3.8)
  • Moved lz4 from 3.6 to 3.8
  • Moved crick from 3.6 to 3.7
  • Use the git tip of s3fs, zict, filesystem_spec, and joblib on the latest version of Python only
  • "slow" tests are now running on Windows too
  • Added the verbose flag to pytest (vital in figuring out pytest-timeout failures)
  • Relaxed a wealth of timeouts to prevent random failures
  • Marked several tests as flaky or xfail to prevent random failures
  • Introduced pytest_rerunfailures (see the sketch after this list)
  • Disabled IPv6 testing on Linux; see "ipv6 broken on github actions" #4514. I could not figure out the issue.
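
A minimal sketch of how pytest_rerunfailures is typically used to rerun flaky tests; the flaky marker and its arguments are the plugin's standard API, but the test name and values here are illustrative only:

import pytest

# Rerun this test up to 3 times, waiting 2 seconds between attempts,
# before reporting it as a real failure.
@pytest.mark.flaky(reruns=3, reruns_delay=2)
def test_sometimes_flaky():
    ...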

Stress tests evidence:
https://docs.google.com/spreadsheets/d/1BN_85gjT9AoGJUGPvhIKSiEcIzfS3VV6O_1zSAtBhBM/edit?usp=sharing

@crusaderky crusaderky force-pushed the migrate-to-github-actions branch 3 times, most recently from 709f318 to ba264dc on February 12, 2021 17:19
@crusaderky crusaderky force-pushed the migrate-to-github-actions branch 8 times, most recently from 8eb367b to c7592c2 on February 16, 2021 15:22
Comment on lines 214 to 217
await client.restart()
await asyncio.sleep(3)

assert len(cluster.workers) == 2
while len(cluster.workers) != 2:
    await asyncio.sleep(0.5)

Member

Thanks for doing this. Explicit sleep calls are the bane of CI.

Question: is there anything here to cause this test to time out if it fails? If not, this is the sort of thing that can cause CI to hang without feedback. pytest-asyncio might do this for us; I'm not sure. If not, then we might want to verify that time() < start + 10 or something.
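
A minimal sketch of the deadline-bounded polling being suggested here, assuming the same cluster object as in the snippet above; the 10-second budget is illustrative:

import asyncio
from time import time

start = time()
while len(cluster.workers) != 2:
    # Fail loudly instead of letting CI hang with no feedback.
    assert time() < start + 10, "cluster did not reach 2 workers in time"
    await asyncio.sleep(0.5)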

Member

Ah, I see in setup.cfg that you've added a timeout option. Great.

Collaborator Author

pytest-timeout will kill off the whole test suite (not just the single test) after 5 minutes
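
For reference, the suite-wide limit lives in setup.cfg and looks roughly like this (300 seconds matching the 5 minutes mentioned above; the exact contents of the real file may differ):

[tool:pytest]
# pytest-timeout: abort any test that runs longer than 300 seconds
timeout = 300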

Collaborator Author

@mrocklin I didn't add it; I just moved it from the command-line arguments to setup.cfg

Collaborator Author

Added an individual timeout to the test.
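
A minimal sketch of what a per-test timeout looks like with pytest-timeout's marker; the value and test name are illustrative, not necessarily what the PR uses:

import pytest

# Override the suite-wide pytest-timeout limit for this one test.
@pytest.mark.timeout(60)
def test_restart_waits_for_workers():
    ...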

SSHCluster(hosts=[])


@pytest.mark.ssh
Member

I wonder if we can mark the entire file with ssh somehow?

Collaborator Author

I wondered that too but found nothing in the pytest docs

Member

Collaborator Author

fixed
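
For the record, pytest supports marking a whole module through the module-level pytestmark variable, which is presumably what the fix uses; a minimal sketch:

import pytest

# Applies the "ssh" marker to every test defined in this file.
pytestmark = pytest.mark.ssh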



@gen_cluster(client=True, timeout=1000)
@gen_cluster(client=True, timeout=None)
Member

This seems off. timeout=None can cause CI to hang without giving us much feedback.

Collaborator Author

pytest-timeout will kill everything after 300 seconds. You'll never reach the 1000 seconds listed there.

exc_info=True,
)
await asyncio.sleep(1)
await asyncio.sleep(5)
Member

Personally I would rather have more short sleeps than long ones. This change may have been necessary for some reason though.

Collaborator Author

I was systematically getting failures here. I'll try dropping down the sleep again and increasing the number of retries to compensate.
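
A rough sketch of that "shorter sleep, more retries" approach; the helper name, counts, and delays are illustrative, not the actual utils_test code:

import asyncio

async def start_with_retries(start, attempts=10, delay=1):
    # Many short waits instead of a few long ones: similar total budget,
    # but a healthy run never sits through a long sleep.
    for attempt in range(attempts):
        try:
            return await start()
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(delay)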

@mrocklin
Member

Looking at the recent failed test run in https://github.com/dask/distributed/runs/1919308651, two things stand out:

1. There are several IO Loop and Profile threads still active. There are about 10 of these at the end of the report. So something is leaking here.

~~~~~~~~~~~~~~~~~~~~~~ Stack of IO loop (139674081294080) ~~~~~~~~~~~~~~~~~~~~~~
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.7/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/runner/work/distributed/distributed/distributed/utils.py", line 417, in run_loop
    loop.start()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 199, in start
    self.asyncio_loop.run_forever()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.7/asyncio/base_events.py", line 541, in run_forever
    self._run_once()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.7/asyncio/base_events.py", line 1750, in _run_once
    event_list = self._selector.select(timeout)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.7/selectors.py", line 468, in select
    fd_event_list = self._selector.poll(timeout, max_ev)

2. There is a small error at the beginning with Client._asynchronous not being set before __del__ is called:

Exception ignored in: <function Client.__del__ at 0x7f089a528f80>
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/client.py", line 1197, in __del__
    self.close()
  File "/home/runner/work/distributed/distributed/distributed/client.py", line 1421, in close
    if self.asynchronous:
  File "/home/runner/work/distributed/distributed/distributed/client.py", line 806, in asynchronous
    return self._asynchronous and self.loop is IOLoop.current()
AttributeError: 'Client' object has no attribute '_asynchronous'

This is probably minor and doesn't impact the main problem here, but can probably also be fixed easily by moving the Client._asynchronous = asynchronous line before we possibly break out of the constructor, such as with a raised exception.
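
A rough sketch of the suggested reordering (paraphrased, not the actual Client code): set the attribute before anything in the constructor can raise, so that __del__ can always read it.

class Client:
    def __init__(self, address=None, asynchronous=False, **kwargs):
        # Assign this first: __del__ -> close() -> self.asynchronous reads it,
        # and __del__ may run even if the rest of __init__ raises.
        self._asynchronous = asynchronous
        ...  # the rest of the constructor, which may raise

    def close(self):
        if self._asynchronous:  # safe now, even after a failed __init__
            ...

    def __del__(self):
        self.close()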

@mrocklin
Member

Feel free to ignore my comments though. I'm just glancing at this in hopes that I stumble upon something useful. I estimate the probability of that happening at less than 20%

@crusaderky crusaderky force-pushed the migrate-to-github-actions branch from d5b0217 to 9d46cf9 on February 17, 2021 20:54
logger.error(
"Failed to start gen_cluster, retrying",
"Failed to start gen_cluster: "
f"{e.__class__.__name__}: {e}; retrying",
Collaborator Author

exc_info=True doesn't print out anything for some reason
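
If exc_info really is being swallowed by the logging setup, an alternative (just a sketch, going beyond what the diff above already does) is to format the traceback explicitly into the message:

import logging
import traceback

logger = logging.getLogger(__name__)

try:
    ...
except Exception as e:
    # Embed the traceback in the message itself so it survives
    # whatever is dropping exc_info output.
    logger.error(
        "Failed to start gen_cluster: %s: %s\n%s",
        type(e).__name__, e, traceback.format_exc(),
    )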


- name: Debug with tmate on failure
if: ${{ failure() }}
uses: mxschmitt/action-tmate@v3
Collaborator Author

I think this should get into master; it will save a lot of work next time SSH breaks.

each_frame_nbytes = nbytes(each_frame)
if each_frame_nbytes:
if stream._write_buffer is None:
raise StreamClosedError()
Collaborator Author

This converts the AttributeError "'NoneType' object has no attribute 'append'" into a StreamClosedError; see the handler below, which:

  • if shutting_down(), silently suppress the error
  • otherwise, re-raise it as CommClosedError
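
A sketch of the overall pattern described above, reusing the stream and each_frame names from the diff (paraphrased, not the exact tcp.py source; the import paths for the two helpers are assumed):

from tornado.iostream import StreamClosedError

from distributed.comm import CommClosedError   # import path assumed
from distributed.utils import shutting_down    # import path assumed

try:
    if stream._write_buffer is None:
        # The stream was torn down concurrently; surface it as the same
        # error a normal write on a closed stream would raise.
        raise StreamClosedError()
    stream.write(each_frame)
except StreamClosedError as exc:
    if shutting_down():
        # The interpreter is shutting down: suppress silently.
        pass
    else:
        # Re-raise with the comm-level exception type callers expect.
        raise CommClosedError(str(exc)) from exc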



@pytest.mark.xfail(reason="Intermittent failure")
@pytest.mark.xfail()
Collaborator Author

It's a deterministic failure, not an intermittent one.

@crusaderky crusaderky force-pushed the migrate-to-github-actions branch from 2466099 to ceb3009 on February 22, 2021 15:31
@crusaderky crusaderky changed the title WIP Migrate from travis to github actions Migrate from travis to github actions Feb 24, 2021
@crusaderky crusaderky marked this pull request as ready for review February 24, 2021 12:33
@crusaderky
Collaborator Author

crusaderky commented Feb 24, 2021

@jrbourbeau @mrocklin this is now ready for review.

Failure rate on Windows/Linux is < 5% per job; on MacOS it's 20-30%.
MacOS is currently disabled on PRs and only running on push.
See the opening post for the full summary of changes.

Stress tests evidence:
https://docs.google.com/spreadsheets/d/1BN_85gjT9AoGJUGPvhIKSiEcIzfS3VV6O_1zSAtBhBM/edit?usp=sharing

Member

@jacobtomlinson jacobtomlinson left a comment

This looks great! Thanks so much for picking this up.

Member

@jrbourbeau jrbourbeau left a comment

Thanks for all your work on this @crusaderky! In particular, thanks for opening up several issues to inform follow-up work to further improve our CI. This is in!

@jrbourbeau jrbourbeau merged commit d24d62f into dask:master Feb 24, 2021
@jakirkham
Member

Seeing this get merged made my day! Thanks for all of the hard work here! 😄


Development

Successfully merging this pull request may close these issues:

  • Segmentation fault on travis
  • Migrate CI to GitHub Actions
