-
Notifications
You must be signed in to change notification settings - Fork 329
Add end-to-end testing infra #3599
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e009661 to
f09915e
Compare
45be616 to
ff2c367
Compare
c339a38 to
49b4919
Compare
internal/e2e/basic_test.go
Outdated
| if err := agent.Process.Signal(syscall.SIGTERM); err != nil { | ||
| t.Errorf("agent.Process.Signal(%d) = %v", syscall.SIGTERM, err) | ||
| } | ||
| if err := agent.Wait(); err != nil { | ||
| t.Errorf("agent.Wait() = %v", err) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should get handled by the test completing, right? when the test finishes, its context gets cancelled, killing the process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The agent is killed, so on the BK side it becomes lost. The queue can't be deleted because an agent is "still running".
Maybe this should go in a cleanup function registered in startAgent ? Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, it's complicated by the ordering of cleanups and context cancellation. I've changed it to a deferred function that sends SIGQUIT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should go in a cleanup function registered in startAgent ? Thoughts?
yeah, i think that's a good call — there's a bit of nuance there that could be easy to miss, but much harder if there's a default cleanup function.
so i'm sure i have it right, SIGTERM is the one we don't catch, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We catch SIGINT, SIGTERM, SIGQUIT, and SIGHUP. We're not allowed to catch SIGKILL. The difference between SIGQUIT and SIGTERM is that we proceed directly to cancelling jobs with SIGQUIT, but that only happens for the second SIGTERM.
agent/clicommand/agent_start.go
Lines 1385 to 1456 in c1039ae
| func (ps *poolSignals) handle(ctx context.Context) chan os.Signal { | |
| signals := make(chan os.Signal, 1) | |
| signal.Notify( | |
| signals, | |
| os.Interrupt, | |
| syscall.SIGHUP, | |
| syscall.SIGTERM, | |
| syscall.SIGINT, | |
| syscall.SIGQUIT, | |
| ) | |
| go ps.handleLoop(ctx, signals) | |
| return signals | |
| } | |
| func (ps *poolSignals) handleLoop(ctx context.Context, signals chan os.Signal) { | |
| _, setStatus, done := status.AddSimpleItem(ctx, "Handle Pool Signals") | |
| defer done() | |
| setStatus("⏳ Waiting for a signal") | |
| interruptCount := 0 | |
| if ps.skipGraceful { | |
| interruptCount = 1 | |
| } | |
| ungracefulStop := func() { | |
| // We shouldn't block the signal handler loop either by waiting | |
| // for the jobs to cancel or by waiting for the cancel grace | |
| // period to expire. | |
| go ps.pool.StopUngracefully() // one last chance to stop | |
| go func() { | |
| // Assuming cancelling jobs takes the full cancel grace period, | |
| // allow 1 second to send agent disconnects. | |
| time.Sleep(ps.cancelGracePeriod + 1*time.Second) | |
| // We get here if the main goroutine hasn't returned yet. | |
| ps.log.Info("Timed out waiting for agents to exit; exiting immediately with status 1") | |
| os.Exit(1) | |
| }() | |
| } | |
| for sig := range signals { | |
| ps.log.Debug("Received signal `%v`", sig) | |
| setStatus(fmt.Sprintf("Received signal `%v`", sig)) | |
| switch sig { | |
| case syscall.SIGQUIT: | |
| ungracefulStop() | |
| case syscall.SIGTERM, syscall.SIGINT: | |
| interruptCount++ | |
| switch interruptCount { | |
| case 1: | |
| ps.log.Info("Received CTRL-C, send again to forcefully kill the agent(s)") | |
| ps.pool.StopGracefully() | |
| case 2: | |
| ps.log.Info("Forcefully stopping running jobs and stopping the agent(s) in %v", ps.cancelGracePeriod) | |
| if !ps.skipGraceful { | |
| ps.log.Info("Press Ctrl-C one more time to exit immediately without disconnecting - note that agents will be considered lost!") | |
| } | |
| ungracefulStop() | |
| case 3: | |
| ps.log.Info("Exiting immediately with status 1") | |
| os.Exit(1) | |
| } | |
| default: | |
| ps.log.Debug("Ignoring signal `%s`", sig.String()) | |
| } | |
| } | |
| } |
dd01ab9 to
7d13f5b
Compare
zhming0
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No blocker, looks good to me 🚀
| build: | ||
| context: . | ||
| dockerfile: Dockerfile-e2e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need build when the dockerfile only contain a single line of FROM ...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Dockerfile mainly exists so that dependabot has something it can easily update. It might work with docker-compose.yml now, I haven't checked lately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL!
| CI_E2E_TESTS_TARGET_ORG: agent-e2e-testing | ||
| CI_E2E_TESTS_TARGET_CLUSTER: e1773a06-b43c-4b77-84e5-5e088a737f6a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be good to log a task in linear to remove these two?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove and replace with what? Hard-code in the tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking replacing with automatically inspecting agent cluster token and find the org and cluster, like we did in the k8s stack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh ok. Sure.
| state, _, err := tc.bkClient.Builds.Get(ctx, targetOrg, tc.pipeline.Slug, strconv.Itoa(build.Number), &buildkite.BuildGetOptions{}) | ||
| if err != nil { | ||
| return "", err | ||
| } | ||
| switch state.State { | ||
| case "passed", "failed", "canceled", "canceling": | ||
| return state.State, nil | ||
|
|
||
| case "scheduled", "running": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without looking into how bk client work, this is a bit confusing? Why is there a "scheduled" state on build? I thought it's a job thing 🤔 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied this part with modifications from agent-stack-k8s. It seems there is a scheduled build state: https://github.com/buildkite/agent-stack-k8s/blob/eab654d3dda6574d1aa3fe9ba67c1dcb17987e8d/internal/integration/api/generated.go#L358-L359
internal/e2e/testcase.go
Outdated
| // The agent should be cancelled automatically by t.Context. | ||
| // But cancellation SIGKILLs the agent, so it doesn't disconnect cleanly. | ||
| // It is eventually marked lost on the backend, but not for a few minutes, | ||
| // which is long enough to block queue cleanup. | ||
| // This cleanup function SIGQUITs it, which forces an ungraceful | ||
| // but clean exit (jobs are cancelled but it has time to disconnect). | ||
| // This cleanup function has to happen *within* the test (use defer!), | ||
| // because the test context will otherwise kill the agent! | ||
| return cmd, func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏿
| }, extraArgs...) | ||
| tc.Logf("Starting agent with args: %q", args) | ||
|
|
||
| cmd := exec.CommandContext(tc.Context(), agentPath, args...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[totally not blocking] I wonder if we could intercept tc.Context()'s cancelation and change that to SIGTERM to the agent command? It would save the defer on the caller side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a similar idea that might work... please stand by...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now it replaces the command's Cancel func, and uses WaitDelay to ensure it is killed later. No special cleanup or defer needed!
7d13f5b to
3b6f49d
Compare
3b6f49d to
4167772
Compare
Description
Adds a new package with some test helper code, and new pipeline yml.
So far it "end-to-end tests" the same agent version that runs the test, and does not inspect the job logs. These will be the next steps.
Context
PB-698
Changes
Similar to agent-stack-k8s integration test setup.
Testing
go test ./...). Buildkite employees may check this if the pipeline has run automatically.go tool gofumpt -extra -w .)Disclosures / Credits
I did not use AI tools at all