Description
We recently started seeing a high failure rate in a job that runs Go tests with the race detector on Linux runners (ubuntu-latest). We see the following in the logs, but lack the context to say why the process receives a SIGTERM:
2022-12-01T17:39:21.3663768Z make: *** [Makefile:22: test] Terminated
2022-12-01T17:39:21.5137635Z ##[error]Process completed with exit code 143.
2022-12-01T17:39:21.5192954Z ##[error]The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
2022-12-01T17:39:21.7086950Z Cleaning up orphan processes
This seems to be less of an issue with the codebase itself (the same set of tests pass under stress on dedicated Linux workstations and cloud VMs), and more with the action runner VMs. That said, the failure rate seems to have markedly increased after a recent change to the codebase.
We're speculating that we are hitting some kind of resource limit due to the recent code change, though it's hard to say definitively.
More context in cockroachdb/pebble#2159.
Platforms affected
- Azure DevOps
- GitHub Actions - Standard Runners
- GitHub Actions - Larger Runners
Runner images affected
- Ubuntu 18.04
- Ubuntu 20.04
- Ubuntu 22.04
- macOS 10.15
- macOS 11
- macOS 12
- Windows Server 2019
- Windows Server 2022
Image version and build link
Image: ubuntu-22.04
Version: 20221119.2
Is it regression?
No - we've seen the same job passing with the same image.
Expected behavior
The job should complete without error.
Actual behavior
Job fails with exit code 143.
Repro steps
Run the linux-race job in the Pebble repo (e.g. via a PR). NOTE: we've since temporarily disabled that job until we resolve this particular issue.
We used cockroachdb/pebble#2158 to bisect down to the code change that increased the failure rate, though it's not clear why the job is failing with exit code 143.