
Conversation

@crusaderky
Collaborator

@crusaderky crusaderky commented Feb 12, 2021

Changes:

  • Moved Linux and MacOS CI from Travis to GitHub Actions
  • Now running MacOS CI on Python 3.8 in addition to 3.7
  • Run coverage everywhere
  • Added a new GitHub workflow to debug ssh-related issues (disabled by default)
  • Removed python-snappy and python-blosc from 3.7 (they're already tested on 3.8)
  • Moved lz4 from 3.6 to 3.8
  • Moved crick from 3.6 to 3.7
  • Use the git tip of s3fs, zict, filesystem_spec, and joblib on the latest version of Python only
  • "slow" tests are now running on Windows too
  • Added the verbose flag to pytest (vital in figuring out pytest-timeout failures)
  • Relaxed a wealth of timeouts to prevent random failures
  • Marked several tests as flaky or xfail to prevent random failures
  • Introduced pytest_rerunfailures (see the sketch after this list)
  • Disabled IPv6 testing on Linux; see "ipv6 broken on github actions" #4514. I could not figure out the issue.
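
A minimal sketch of how pytest_rerunfailures is typically used to rerun flaky tests; the flaky marker and its arguments are the plugin's standard API, but the test name and values here are illustrative only:

import pytest

# Rerun this test up to 3 times, waiting 2 seconds between attempts,
# before reporting it as a real failure.
@pytest.mark.flaky(reruns=3, reruns_delay=2)
def test_sometimes_flaky():
    ...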

Stress tests evidence:
https://docs.google.com/spreadsheets/d/1BN_85gjT9AoGJUGPvhIKSiEcIzfS3VV6O_1zSAtBhBM/edit?usp=sharing

@crusaderky crusaderky force-pushed the migrate-to-github-actions branch 3 times, most recently from 709f318 to ba264dc on February 12, 2021 17:19
@crusaderky crusaderky force-pushed the migrate-to-github-actions branch 8 times, most recently from 8eb367b to c7592c2 on February 16, 2021 15:22
Comment on lines 214 to 217
await client.restart()
await asyncio.sleep(3)

assert len(cluster.workers) == 2
while len(cluster.workers) != 2:
    await asyncio.sleep(0.5)

Member

Thanks for doing this. Explicit sleep calls are the bane of CI.

Question: is there anything here to cause this test to time out if it fails? If not, this is the sort of thing that can cause CI to hang without feedback. pytest-asyncio might do this for us; I'm not sure. If not, then we might want to verify that time() < start + 10 or something.
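
A minimal sketch of the deadline-bounded polling being suggested here, assuming the same cluster object as in the snippet above; the 10-second budget is illustrative:

import asyncio
from time import time

start = time()
while len(cluster.workers) != 2:
    # Fail loudly instead of letting CI hang with no feedback.
    assert time() < start + 10, "cluster did not reach 2 workers in time"
    await asyncio.sleep(0.5)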

Member

Ah, I see in setup.cfg that you've added a timeout option. Great.

Collaborator Author

pytest-timeout will kill off the whole test suite (not just the single test) after 5 minutes
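
For reference, the suite-wide limit lives in setup.cfg and looks roughly like this (300 seconds matching the 5 minutes mentioned above; the exact contents of the real file may differ):

[tool:pytest]
# pytest-timeout: abort any test that runs longer than 300 seconds
timeout = 300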

Collaborator Author

@mrocklin I didn't add it; I just moved it from the command-line arguments to setup.cfg

Collaborator Author

Added an individual timeout to the test.
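
A minimal sketch of what a per-test timeout looks like with pytest-timeout's marker; the value and test name are illustrative, not necessarily what the PR uses:

import pytest

# Override the suite-wide pytest-timeout limit for this one test.
@pytest.mark.timeout(60)
def test_restart_waits_for_workers():
    ...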

SSHCluster(hosts=[])


@pytest.mark.ssh
Member

I wonder if we can mark the entire file with ssh somehow?

Collaborator Author

I wondered that too but found nothing in the pytest docs

Member

Collaborator Author

fixed
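
For the record, pytest supports marking a whole module through the module-level pytestmark variable, which is presumably what the fix uses; a minimal sketch:

import pytest

# Applies the "ssh" marker to every test defined in this file.
pytestmark = pytest.mark.ssh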



@gen_cluster(client=True, timeout=1000)
@gen_cluster(client=True, timeout=None)
Member

This seems off. timeout=None can cause CI to hang without giving us much feedback.

Collaborator Author

pytest-timeout will kill everything after 300 seconds. You'll never reach the 1000 seconds listed there.

exc_info=True,
)
await asyncio.sleep(1)
await asyncio.sleep(5)
Member

Personally I would rather have more short sleeps than long ones. This change may have been necessary for some reason though.

Collaborator Author

I was systematically getting failures here. I'll try dropping down the sleep again and increasing the number of retries to compensate.
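
A rough sketch of that "shorter sleep, more retries" approach; the helper name, counts, and delays are illustrative, not the actual utils_test code:

import asyncio

async def start_with_retries(start, attempts=10, delay=1):
    # Many short waits instead of a few long ones: similar total budget,
    # but a healthy run never sits through a long sleep.
    for attempt in range(attempts):
        try:
            return await start()
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(delay)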

@mrocklin
Member

Looking at the recent failed test run in https://github.com/dask/distributed/runs/1919308651, two things stand out:

1. There are several IO Loop and Profile threads still active. There are about 10 of these at the end of the report. So something is leaking here.

~~~~~~~~~~~~~~~~~~~~~~ Stack of IO loop (139674081294080) ~~~~~~~~~~~~~~~~~~~~~~
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.7/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/runner/work/distributed/distributed/distributed/utils.py", line 417, in run_loop
    loop.start()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 199, in start
    self.asyncio_loop.run_forever()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.7/asyncio/base_events.py", line 541, in run_forever
    self._run_once()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.7/asyncio/base_events.py", line 1750, in _run_once
    event_list = self._selector.select(timeout)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.7/selectors.py", line 468, in select
    fd_event_list = self._selector.poll(timeout, max_ev)

2. There is a small error at the beginning with Client._asynchronous not being set before __del__ is called:

Exception ignored in: <function Client.__del__ at 0x7f089a528f80>
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/client.py", line 1197, in __del__
    self.close()
  File "/home/runner/work/distributed/distributed/distributed/client.py", line 1421, in close
    if self.asynchronous:
  File "/home/runner/work/distributed/distributed/distributed/client.py", line 806, in asynchronous
    return self._asynchronous and self.loop is IOLoop.current()
AttributeError: 'Client' object has no attribute '_asynchronous'

This is probably minor and doesn't impact the main problem here, but can probably also be fixed easily by moving the Client._asynchronous = asynchronous line before we possibly break out of the constructor, such as with a raised exception.
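
A rough sketch of the suggested reordering (paraphrased, not the actual Client code): set the attribute before anything in the constructor can raise, so that __del__ can always read it.

class Client:
    def __init__(self, address=None, asynchronous=False, **kwargs):
        # Assign this first: __del__ -> close() -> self.asynchronous reads it,
        # and __del__ may run even if the rest of __init__ raises.
        self._asynchronous = asynchronous
        ...  # the rest of the constructor, which may raise

    def close(self):
        if self._asynchronous:  # safe now, even after a failed __init__
            ...

    def __del__(self):
        self.close()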

@mrocklin
Member

Feel free to ignore my comments though. I'm just glancing at this in hopes that I stumble upon something useful. I estimate the probability of that happening at less than 20%

@crusaderky crusaderky force-pushed the migrate-to-github-actions branch from d5b0217 to 9d46cf9 on February 17, 2021 20:54
logger.error(
"Failed to start gen_cluster, retrying",
"Failed to start gen_cluster: "
f"{e.__class__.__name__}: {e}; retrying",
Collaborator Author

exc_info=True doesn't print out anything for some reason
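
If exc_info really is being swallowed by the logging setup, an alternative (just a sketch, going beyond what the diff above already does) is to format the traceback explicitly into the message:

import logging
import traceback

logger = logging.getLogger(__name__)

try:
    ...
except Exception as e:
    # Embed the traceback in the message itself so it survives
    # whatever is dropping exc_info output.
    logger.error(
        "Failed to start gen_cluster: %s: %s\n%s",
        type(e).__name__, e, traceback.format_exc(),
    )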


- name: Debug with tmate on failure
if: ${{ failure() }}
uses: mxschmitt/action-tmate@v3
Collaborator Author

I think this should get into master; it will save a lot of work next time SSH breaks.

each_frame_nbytes = nbytes(each_frame)
if each_frame_nbytes:
if stream._write_buffer is None:
raise StreamClosedError()
Collaborator Author

This converts the AttributeError "'NoneType' object has no attribute 'append'" into a StreamClosedError; see the handler below, which:

  • if shutting_down(), silently suppress the error
  • otherwise, re-raise it as CommClosedError
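
A sketch of the overall pattern described above, reusing the stream and each_frame names from the diff (paraphrased, not the exact tcp.py source; the import paths for the two helpers are assumed):

from tornado.iostream import StreamClosedError

from distributed.comm import CommClosedError   # import path assumed
from distributed.utils import shutting_down    # import path assumed

try:
    if stream._write_buffer is None:
        # The stream was torn down concurrently; surface it as the same
        # error a normal write on a closed stream would raise.
        raise StreamClosedError()
    stream.write(each_frame)
except StreamClosedError as exc:
    if shutting_down():
        # The interpreter is shutting down: suppress silently.
        pass
    else:
        # Re-raise with the comm-level exception type callers expect.
        raise CommClosedError(str(exc)) from exc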



@pytest.mark.xfail(reason="Intermittent failure")
@pytest.mark.xfail()
Collaborator Author

It's a deterministic failure, not an intermittent one.

@crusaderky crusaderky force-pushed the migrate-to-github-actions branch from 2466099 to ceb3009 on February 22, 2021 15:31
@crusaderky crusaderky changed the title WIP Migrate from travis to github actions Migrate from travis to github actions Feb 24, 2021
@crusaderky crusaderky marked this pull request as ready for review February 24, 2021 12:33
@crusaderky
Collaborator Author

crusaderky commented Feb 24, 2021

@jrbourbeau @mrocklin this is now ready for review.

Failure rate on Windows/Linux is < 5% per job; on MacOS it's 20-30%.
MacOS is currently disabled on PRs and only running on push.
See the opening post for the full summary of changes.

Stress tests evidence:
https://docs.google.com/spreadsheets/d/1BN_85gjT9AoGJUGPvhIKSiEcIzfS3VV6O_1zSAtBhBM/edit?usp=sharing

Member

@jacobtomlinson jacobtomlinson left a comment

This looks great! Thanks so much for picking this up.

Member

@jrbourbeau jrbourbeau left a comment

Thanks for all your work on this @crusaderky! In particular, thanks for opening up several issues to inform follow-up work to further improve our CI. This is in!

@jrbourbeau jrbourbeau merged commit d24d62f into dask:master Feb 24, 2021
@jakirkham
Member

Seeing this get merged made my day! Thanks for all of the hard work here! 😄


Development

Successfully merging this pull request may close these issues:

  • Segmentation fault on travis
  • Migrate CI to GitHub Actions
