Skip to content

Conversation

@DrJosh9000
Copy link
Contributor

@DrJosh9000 DrJosh9000 commented Nov 27, 2025

Description

Adds a new package with some test helper code, and new pipeline yml.

So far it "end-to-end tests" the same agent version that runs the test, and does not inspect the job logs. These will be the next steps.

Context

PB-698

Changes

Similar to agent-stack-k8s integration test setup.

Testing

  • Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
  • Code is formatted (with go tool gofumpt -extra -w .)

Disclosures / Credits

I did not use AI tools at all

@DrJosh9000 DrJosh9000 marked this pull request as draft November 27, 2025 06:47
@DrJosh9000 DrJosh9000 force-pushed the pb-968-create-a-testing-framework branch 5 times, most recently from e009661 to f09915e Compare December 1, 2025 06:35
@DrJosh9000 DrJosh9000 changed the title Add e2e testing package in internal Add e2e testing infra Dec 1, 2025
@DrJosh9000 DrJosh9000 changed the title Add e2e testing infra Add end-to-end testing infra Dec 1, 2025
@DrJosh9000 DrJosh9000 force-pushed the pb-968-create-a-testing-framework branch 22 times, most recently from 45be616 to ff2c367 Compare December 2, 2025 03:06
@DrJosh9000 DrJosh9000 force-pushed the pb-968-create-a-testing-framework branch 3 times, most recently from c339a38 to 49b4919 Compare December 2, 2025 03:10
@DrJosh9000 DrJosh9000 marked this pull request as ready for review December 2, 2025 03:13
@DrJosh9000 DrJosh9000 requested a review from a team December 2, 2025 03:16
Comment on lines 31 to 36
if err := agent.Process.Signal(syscall.SIGTERM); err != nil {
t.Errorf("agent.Process.Signal(%d) = %v", syscall.SIGTERM, err)
}
if err := agent.Wait(); err != nil {
t.Errorf("agent.Wait() = %v", err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should get handled by the test completing, right? when the test finishes, its context gets cancelled, killing the process?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The agent is killed, so on the BK side it becomes lost. The queue can't be deleted because an agent is "still running".

Maybe this should go in a cleanup function registered in startAgent ? Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, it's complicated by the ordering of cleanups and context cancellation. I've changed it to a deferred function that sends SIGQUIT.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should go in a cleanup function registered in startAgent ? Thoughts?

yeah, i think that's a good call — there's a bit of nuance there that could be easy to miss, but much harder if there's a default cleanup function.

so i'm sure i have it right, SIGTERM is the one we don't catch, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We catch SIGINT, SIGTERM, SIGQUIT, and SIGHUP. We're not allowed to catch SIGKILL. The difference between SIGQUIT and SIGTERM is that we proceed directly to cancelling jobs with SIGQUIT, but that only happens for the second SIGTERM.

func (ps *poolSignals) handle(ctx context.Context) chan os.Signal {
signals := make(chan os.Signal, 1)
signal.Notify(
signals,
os.Interrupt,
syscall.SIGHUP,
syscall.SIGTERM,
syscall.SIGINT,
syscall.SIGQUIT,
)
go ps.handleLoop(ctx, signals)
return signals
}
func (ps *poolSignals) handleLoop(ctx context.Context, signals chan os.Signal) {
_, setStatus, done := status.AddSimpleItem(ctx, "Handle Pool Signals")
defer done()
setStatus("⏳ Waiting for a signal")
interruptCount := 0
if ps.skipGraceful {
interruptCount = 1
}
ungracefulStop := func() {
// We shouldn't block the signal handler loop either by waiting
// for the jobs to cancel or by waiting for the cancel grace
// period to expire.
go ps.pool.StopUngracefully() // one last chance to stop
go func() {
// Assuming cancelling jobs takes the full cancel grace period,
// allow 1 second to send agent disconnects.
time.Sleep(ps.cancelGracePeriod + 1*time.Second)
// We get here if the main goroutine hasn't returned yet.
ps.log.Info("Timed out waiting for agents to exit; exiting immediately with status 1")
os.Exit(1)
}()
}
for sig := range signals {
ps.log.Debug("Received signal `%v`", sig)
setStatus(fmt.Sprintf("Received signal `%v`", sig))
switch sig {
case syscall.SIGQUIT:
ungracefulStop()
case syscall.SIGTERM, syscall.SIGINT:
interruptCount++
switch interruptCount {
case 1:
ps.log.Info("Received CTRL-C, send again to forcefully kill the agent(s)")
ps.pool.StopGracefully()
case 2:
ps.log.Info("Forcefully stopping running jobs and stopping the agent(s) in %v", ps.cancelGracePeriod)
if !ps.skipGraceful {
ps.log.Info("Press Ctrl-C one more time to exit immediately without disconnecting - note that agents will be considered lost!")
}
ungracefulStop()
case 3:
ps.log.Info("Exiting immediately with status 1")
os.Exit(1)
}
default:
ps.log.Debug("Ignoring signal `%s`", sig.String())
}
}
}

@DrJosh9000 DrJosh9000 force-pushed the pb-968-create-a-testing-framework branch 2 times, most recently from dd01ab9 to 7d13f5b Compare December 2, 2025 04:12
Copy link
Contributor

@zhming0 zhming0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blocker, looks good to me 🚀

Comment on lines +34 to +36
build:
context: .
dockerfile: Dockerfile-e2e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need build when the dockerfile only contain a single line of FROM ...?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Dockerfile mainly exists so that dependabot has something it can easily update. It might work with docker-compose.yml now, I haven't checked lately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL!

Comment on lines +5 to +6
CI_E2E_TESTS_TARGET_ORG: agent-e2e-testing
CI_E2E_TESTS_TARGET_CLUSTER: e1773a06-b43c-4b77-84e5-5e088a737f6a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be good to log a task in linear to remove these two?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove and replace with what? Hard-code in the tests?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking replacing with automatically inspecting agent cluster token and find the org and cluster, like we did in the k8s stack

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh ok. Sure.

Comment on lines +175 to +183
state, _, err := tc.bkClient.Builds.Get(ctx, targetOrg, tc.pipeline.Slug, strconv.Itoa(build.Number), &buildkite.BuildGetOptions{})
if err != nil {
return "", err
}
switch state.State {
case "passed", "failed", "canceled", "canceling":
return state.State, nil

case "scheduled", "running":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without looking into how bk client work, this is a bit confusing? Why is there a "scheduled" state on build? I thought it's a job thing 🤔 .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 298 to 306
// The agent should be cancelled automatically by t.Context.
// But cancellation SIGKILLs the agent, so it doesn't disconnect cleanly.
// It is eventually marked lost on the backend, but not for a few minutes,
// which is long enough to block queue cleanup.
// This cleanup function SIGQUITs it, which forces an ungraceful
// but clean exit (jobs are cancelled but it has time to disconnect).
// This cleanup function has to happen *within* the test (use defer!),
// because the test context will otherwise kill the agent!
return cmd, func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏿

}, extraArgs...)
tc.Logf("Starting agent with args: %q", args)

cmd := exec.CommandContext(tc.Context(), agentPath, args...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[totally not blocking] I wonder if we could intercept tc.Context()'s cancelation and change that to SIGTERM to the agent command? It would save the defer on the caller side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a similar idea that might work... please stand by...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it replaces the command's Cancel func, and uses WaitDelay to ensure it is killed later. No special cleanup or defer needed!

@DrJosh9000 DrJosh9000 force-pushed the pb-968-create-a-testing-framework branch from 7d13f5b to 3b6f49d Compare December 2, 2025 05:21
@DrJosh9000 DrJosh9000 force-pushed the pb-968-create-a-testing-framework branch from 3b6f49d to 4167772 Compare December 2, 2025 05:23
@DrJosh9000 DrJosh9000 merged commit c9ca995 into main Dec 2, 2025
2 checks passed
@DrJosh9000 DrJosh9000 deleted the pb-968-create-a-testing-framework branch December 2, 2025 05:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants