Migrate from travis to github actions #4504
Conversation
Force-pushed from 709f318 to ba264dc
Force-pushed from 8eb367b to c7592c2
```diff
     await client.restart()
     await asyncio.sleep(3)

-    assert len(cluster.workers) == 2
+    while len(cluster.workers) != 2:
+        await asyncio.sleep(0.5)
```
Thanks for doing this. Explicit sleep calls are the bane of CI.
Question: is there anything here to cause this test to time out if it fails? If not, this is the sort of thing that can cause CI to hang without feedback. pytest-asyncio might do this for us; I'm not sure. If not, then we might want to verify that time() < start + 10 or something.
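A minimal sketch of such a bounded wait (the helper name and limits are illustrative; the real test polls `cluster.workers`):

```python
import asyncio
from time import time


async def poll_until(predicate, timeout=10, interval=0.5):
    """Poll ``predicate`` until it returns True, failing after ``timeout`` seconds."""
    start = time()
    while not predicate():
        # Fail with a clear assertion error instead of hanging CI forever
        assert time() < start + timeout, "condition not met within timeout"
        await asyncio.sleep(interval)


# e.g.: await poll_until(lambda: len(cluster.workers) == 2)
```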
Ah, I see in setup.cfg that you've added a timeout option. Great.
pytest-timeout will kill off the whole test suite (not just the single test) after 5 minutes
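For reference, the relevant setup.cfg section looks roughly like this (the 300-second value matches the 5 minutes mentioned above; the `timeout_method` line is illustrative):

```ini
[tool:pytest]
# pytest-timeout: abort after 300 seconds; with the thread method this dumps
# the stacks of all threads and then terminates the whole process
timeout = 300
timeout_method = thread
```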
@mrocklin I didn't add it; I just moved it from the command-line arguments to setup.cfg
Added an individual timeout to the test.
distributed/deploy/tests/test_ssh.py
Outdated
```diff
     SSHCluster(hosts=[])


+@pytest.mark.ssh
```
I wonder if we can mark the entire file with ssh somehow?
I wondered that too but found nothing in the pytest docs.
It looks like there's a special pytestmark variable we can set at the module-level https://docs.pytest.org/en/stable/example/markers.html#marking-whole-classes-or-modules
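A sketch of what that looks like at the top of the test module:

```python
import pytest

# A module-level ``pytestmark`` applies the marker to every test in this file,
# making individual @pytest.mark.ssh decorators unnecessary.
pytestmark = pytest.mark.ssh
```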
fixed
```diff
-@gen_cluster(client=True, timeout=1000)
+@gen_cluster(client=True, timeout=None)
```
This seems off. timeout=None can cause CI to hang without giving us much feedback.
pytest-timeout will kill everything after 300 seconds. You'll never reach the 1000 seconds listed there.
distributed/utils_test.py
Outdated
```diff
         exc_info=True,
     )
-    await asyncio.sleep(1)
+    await asyncio.sleep(5)
```
Personally I would rather have more short sleeps than long ones. This change may have been necessary for some reason though.
I was systematically getting failures here. I'll try dropping down the sleep again and increasing the number of retries to compensate.
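A rough sketch of that approach (the helper name, retry count, and interval are illustrative, not the actual utils_test.py code):

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def start_with_retries(start, retries=60, interval=1):
    """Call the async ``start`` callable, retrying with short sleeps on failure.

    Many short sleeps keep the total wait time the same while letting the
    happy path proceed as soon as startup succeeds.
    """
    for _ in range(retries):
        try:
            return await start()
        except Exception:
            logger.error("Failed to start gen_cluster, retrying", exc_info=True)
            await asyncio.sleep(interval)
    raise RuntimeError(f"failed to start after {retries} attempts")
```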
Looking at the recent failed test run in https://github.com/dask/distributed/runs/1919308651, two things stand out.

There are several IO Loop and Profile threads still active. There are about 10 of these at the end of the report, so something is leaking here:

```
~~~~~~~~~~~~~~~~~~~~~~ Stack of IO loop (139674081294080) ~~~~~~~~~~~~~~~~~~~~~~
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.7/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/runner/work/distributed/distributed/distributed/utils.py", line 417, in run_loop
    loop.start()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 199, in start
    self.asyncio_loop.run_forever()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.7/asyncio/base_events.py", line 541, in run_forever
    self._run_once()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.7/asyncio/base_events.py", line 1750, in _run_once
    event_list = self._selector.select(timeout)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.7/selectors.py", line 468, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
```

There is a small error at the beginning with `Exception ignored in: <function Client.__del__ at 0x7f089a528f80>`:

```
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/client.py", line 1197, in __del__
    self.close()
  File "/home/runner/work/distributed/distributed/distributed/client.py", line 1421, in close
    if self.asynchronous:
  File "/home/runner/work/distributed/distributed/distributed/client.py", line 806, in asynchronous
    return self._asynchronous and self.loop is IOLoop.current()
AttributeError: 'Client' object has no attribute '_asynchronous'
```

This is probably minor and doesn't impact the main problem here, but can probably also be fixed easily by moving the
Feel free to ignore my comments though. I'm just glancing at this in hopes that I stumble upon something useful. I estimate the probability of that happening at less than 20%.
Force-pushed from d5b0217 to 9d46cf9
```diff
     logger.error(
-        "Failed to start gen_cluster, retrying",
+        "Failed to start gen_cluster: "
+        f"{e.__class__.__name__}: {e}; retrying",
```
`exc_info=True` doesn't print out anything for some reason.
```diff
+      - name: Debug with tmate on failure
+        if: ${{ failure() }}
+        uses: mxschmitt/action-tmate@v3
```
I think this should get into master; it will save a lot of work next time SSH breaks.
```diff
     each_frame_nbytes = nbytes(each_frame)
     if each_frame_nbytes:
+        if stream._write_buffer is None:
+            raise StreamClosedError()
```
This turns `AttributeError: 'NoneType' object has no attribute 'append'` into a `StreamClosedError`; see the handler below: if `shutting_down()`, the error is silently suppressed; otherwise, it is re-raised as `CommClosedError`.
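A self-contained sketch of that pattern (here `StreamClosedError` and `shutting_down` are simplified stand-ins for tornado's exception and distributed's interpreter-shutdown check):

```python
class StreamClosedError(IOError):
    pass


class CommClosedError(IOError):
    pass


SHUTTING_DOWN = False


def shutting_down():
    # Stand-in for distributed's check of whether the interpreter is exiting
    return SHUTTING_DOWN


def write_frames(stream, frames):
    try:
        for frame in frames:
            if stream._write_buffer is None:
                # The stream was closed under us: raise a well-defined error
                # instead of "'NoneType' object has no attribute 'append'"
                raise StreamClosedError()
            stream._write_buffer.append(frame)
    except StreamClosedError as e:
        if shutting_down():
            return  # interpreter is exiting; suppress silently
        raise CommClosedError(str(e))
```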
```diff
-@pytest.mark.xfail(reason="Intermittent failure")
+@pytest.mark.xfail()
```
It's a deterministic failure, not an intermittent one.
Exclude MacOS
Force-pushed from 2466099 to ceb3009
@jrbourbeau @mrocklin this is now ready for review. Failure rate on Windows/Linux is < 5% per job; on MacOS it's 20-30%. Stress test evidence:
jacobtomlinson left a comment:
This looks great! Thanks so much for picking this up.
jrbourbeau left a comment:
Thanks for all your work on this @crusaderky! In particular, thanks for opening up several issues to inform follow-up work to further improve our CI. This is in!
Seeing this get merged made my day! Thanks for all of the hard work here! 😄
Changes:
Stress test evidence:
https://docs.google.com/spreadsheets/d/1BN_85gjT9AoGJUGPvhIKSiEcIzfS3VV6O_1zSAtBhBM/edit?usp=sharing