fix(ci): address flaky test infrastructure issues #7706

Merged
bm1549 merged 3 commits into master from brian.marks/fix-flaky-infra
Mar 6, 2026


@bm1549 bm1549 commented Mar 6, 2026

Summary

  • Debugger / busboy: Added a readiness health-check after docker compose up -d testagent. Previously, tests could start sending multipart form data to the test agent container before it was ready to accept connections, causing busboy to receive truncated requests and throw Error: Unexpected end of form.

  • Azure Service Bus / queue not found: The azure-service-bus CI job was missing the config-copy + emulator-restart steps that the azure-functions job already has. GitHub Actions services don't support volume mounts, so the servicebus emulator was starting with its default (empty) config — no queue.1 or topic.1 existed. Added docker cp to inject servicebus-emulator-config.json and a docker restart to apply it.

Note: The AI Guard / macOS "Timeout in beforeEach" was also investigated. The root cause is unclear (likely slow proxyquire-based module loading on macOS runners, not a fixable infrastructure issue). That failure is tracked separately.

Risk analysis

testagent health check

  • Blast radius is wide: this action is used by lambda, aws-sdk, and everything routed through plugins/test. If the /info endpoint doesn't exist, or returns an HTTP error status (4xx or 5xx) while booting, the curl -sf retry loop will keep polling until the 30s timeout and then fail, turning a flake into a hard breakage across many jobs.
  • Mitigation: /info is a standard diagnostic endpoint on the DD APM test agent, and this polling pattern is well-established. The 30s timeout should be sufficient on normal runners, though a cold pull on a slow runner could exceed it.
  • Confidence: high that it fixes the race condition; moderate confidence in the specific endpoint choice.
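
The readiness check discussed in this section could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code added by this PR: the helper name `wait_for_http` is hypothetical, and only `curl -sf`, the `/info` endpoint, and the 30s budget come from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name is illustrative, not taken from the PR):
# poll a URL until it answers successfully or a deadline passes.
# Usage: wait_for_http URL [TIMEOUT_SECONDS]   (timeout defaults to 30)
wait_for_http() {
  local url="$1"
  local timeout="${2:-30}"
  local deadline=$(( $(date +%s) + timeout ))

  # -s hides progress output; -f makes curl exit non-zero on HTTP status >= 400.
  # A refused connection (container still booting) also exits non-zero.
  until curl -sf -o /dev/null "$url"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out after ${timeout}s waiting for ${url}" >&2
      return 1
    fi
    sleep 1
  done
}
```

A step could then call something like `wait_for_http "http://localhost:8126/info" 30` right after `docker compose up -d testagent`. The host and port in that URL are assumptions, since the PR does not state where the test agent listens. Returning non-zero on timeout is what produces the hard failure described in the blast-radius bullet.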

Azure Service Bus config copy + restart

  • docker cp timing: when a service defines no health check, GitHub Actions treats it as "running" as soon as the container starts, so the emulator may not have created /ServiceBus_Emulator/ConfigFiles/ yet when docker cp runs. A failure here would be a hard error rather than a flake.
  • No explicit readiness wait after docker restart: relies on the implicit time taken by testagent/start + node setup + install inside plugins/test. The azure-functions job uses the same pattern and works in practice, which is the main evidence this is safe.
  • Confidence: high that the missing config was the root cause of "Queue not found"; moderate confidence in the timing assumptions.

Test plan

  • Debugger tests no longer show Error: Unexpected end of form from busboy
  • azure-service-bus job no longer fails with "Queue not found"
  • No regressions in other jobs that use the testagent start action

🤖 Generated with Claude Code

@bm1549 bm1549 added the "AI Generated" label (Largely based on code generated by an AI or LLM. This label is the same across all dd-trace-* repos) Mar 6, 2026

github-actions bot commented Mar 6, 2026

Overall package size

Self size: 4.93 MB
Deduped: 5.77 MB
No deduping: 5.77 MB

Dependency sizes

| name | version | self size | total size |
|------|---------|-----------|------------|
| import-in-the-middle | 3.0.0 | 81.15 kB | 815.98 kB |
| dc-polyfill | 0.1.10 | 26.73 kB | 26.73 kB |

🤖 This report was automatically generated by heaviest-objects-in-the-universe


codecov bot commented Mar 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.35%. Comparing base (b6f7a69) to head (03a2197).
⚠️ Report is 15 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7706      +/-   ##
==========================================
+ Coverage   80.31%   80.35%   +0.04%     
==========================================
  Files         739      741       +2     
  Lines       31946    32004      +58     
==========================================
+ Hits        25657    25718      +61     
+ Misses       6289     6286       -3     
Flag Coverage Δ
aiguard-macos 39.01% <ø> (+0.01%) ⬆️
aiguard-ubuntu 39.13% <ø> (+0.01%) ⬆️
aiguard-windows 38.85% <ø> (+<0.01%) ⬆️
apm-capabilities-tracing-macos 48.77% <ø> (+0.05%) ⬆️
apm-capabilities-tracing-ubuntu 48.81% <ø> (+0.01%) ⬆️
apm-capabilities-tracing-windows 47.75% <ø> (-0.79%) ⬇️
apm-integrations-child-process 38.56% <ø> (-0.02%) ⬇️
apm-integrations-couchbase-18 37.47% <ø> (+0.10%) ⬆️
apm-integrations-couchbase-eol 37.94% <ø> (+0.10%) ⬆️
apm-integrations-oracledb 37.78% <ø> (-0.04%) ⬇️
appsec-express 55.30% <ø> (-0.08%) ⬇️
appsec-fastify 51.63% <ø> (-0.08%) ⬇️
appsec-graphql 51.81% <ø> (-0.07%) ⬇️
appsec-kafka 44.33% <ø> (-0.06%) ⬇️
appsec-ldapjs 44.03% <ø> (-0.05%) ⬇️
appsec-lodash 43.68% <ø> (-0.05%) ⬇️
appsec-macos 58.30% <ø> (-0.08%) ⬇️
appsec-mongodb-core 48.82% <ø> (-0.07%) ⬇️
appsec-mongoose 49.49% <ø> (-0.07%) ⬇️
appsec-mysql 50.87% <ø> (-0.05%) ⬇️
appsec-node-serialize 43.20% <ø> (-0.05%) ⬇️
appsec-passport 47.63% <ø> (-0.07%) ⬇️
appsec-postgres 50.61% <ø> (-0.07%) ⬇️
appsec-sourcing 42.61% <ø> (-0.05%) ⬇️
appsec-template 43.37% <ø> (-0.05%) ⬇️
appsec-ubuntu 58.37% <ø> (-0.08%) ⬇️
appsec-windows 58.13% <ø> (-0.10%) ⬇️
instrumentations-instrumentation-bluebird 32.39% <ø> (+<0.01%) ⬆️
instrumentations-instrumentation-body-parser 40.50% <ø> (-0.04%) ⬇️
instrumentations-instrumentation-child_process 37.87% <ø> (-0.03%) ⬇️
instrumentations-instrumentation-cookie-parser 34.37% <ø> (-0.01%) ⬇️
instrumentations-instrumentation-express 34.69% <ø> (-0.02%) ⬇️
instrumentations-instrumentation-express-mongo-sanitize 34.50% <ø> (-0.01%) ⬇️
instrumentations-instrumentation-express-session 40.14% <ø> (-0.04%) ⬇️
instrumentations-instrumentation-fs 32.00% <ø> (+<0.01%) ⬆️
instrumentations-instrumentation-generic-pool 29.91% <ø> (+0.20%) ⬆️
instrumentations-instrumentation-http 39.86% <ø> (-0.04%) ⬇️
instrumentations-instrumentation-knex 32.39% <ø> (+<0.01%) ⬆️
instrumentations-instrumentation-mongoose 33.52% <ø> (-0.01%) ⬇️
instrumentations-instrumentation-multer 40.25% <ø> (-0.04%) ⬇️
instrumentations-instrumentation-mysql2 38.34% <ø> (-0.03%) ⬇️
instrumentations-instrumentation-passport 44.01% <ø> (-0.06%) ⬇️
instrumentations-instrumentation-passport-http 43.69% <ø> (-0.06%) ⬇️
instrumentations-instrumentation-passport-local 44.22% <ø> (-0.06%) ⬇️
instrumentations-instrumentation-pg 37.77% <ø> (-0.03%) ⬇️
instrumentations-instrumentation-promise 32.32% <ø> (+<0.01%) ⬆️
instrumentations-instrumentation-promise-js 32.33% <ø> (+<0.01%) ⬆️
instrumentations-instrumentation-q 32.37% <ø> (+<0.01%) ⬆️
instrumentations-instrumentation-url 32.29% <ø> (+<0.01%) ⬆️
instrumentations-instrumentation-when 32.34% <ø> (+<0.01%) ⬆️
llmobs-ai 42.34% <ø> (+0.15%) ⬆️
llmobs-anthropic 40.31% <ø> (-0.05%) ⬇️
llmobs-bedrock 39.27% <ø> (-0.03%) ⬇️
llmobs-google-genai 39.84% <ø> (-0.05%) ⬇️
llmobs-langchain 40.07% <ø> (+0.13%) ⬆️
llmobs-openai 44.06% <ø> (-0.14%) ⬇️
llmobs-vertex-ai 40.10% <ø> (-0.05%) ⬇️
platform-core 31.53% <ø> (ø)
platform-esbuild 34.48% <ø> (ø)
platform-instrumentations-misc 48.40% <ø> (+2.87%) ⬆️
platform-shimmer 37.63% <ø> (ø)
platform-unit-guardrails 32.95% <ø> (ø)
plugins-azure-event-hubs 25.83% <ø> (ø)
plugins-azure-service-bus 25.19% <ø> (ø)
plugins-bullmq 44.14% <ø> (+0.14%) ⬆️
plugins-cassandra 37.81% <ø> (-0.04%) ⬇️
plugins-cookie 26.89% <ø> (ø)
plugins-cookie-parser 26.67% <ø> (ø)
plugins-crypto 26.79% <ø> (ø)
plugins-dd-trace-api 38.39% <ø> (-0.05%) ⬇️
plugins-express-mongo-sanitize 26.82% <ø> (ø)
plugins-express-session 26.63% <ø> (ø)
plugins-fastify 42.23% <ø> (-0.06%) ⬇️
plugins-fetch 38.38% <ø> (-0.02%) ⬇️
plugins-fs 38.67% <ø> (-0.02%) ⬇️
plugins-generic-pool 25.87% <ø> (ø)
plugins-google-cloud-pubsub 45.37% <ø> (-0.07%) ⬇️
plugins-grpc 40.94% <ø> (-0.05%) ⬇️
plugins-handlebars 26.86% <ø> (ø)
plugins-hapi 40.14% <ø> (-0.05%) ⬇️
plugins-hono 40.40% <ø> (-0.05%) ⬇️
plugins-ioredis 38.47% <ø> (-0.02%) ⬇️
plugins-knex 26.50% <ø> (ø)
plugins-ldapjs 24.36% <ø> (ø)
plugins-light-my-request 26.23% <ø> (ø)
plugins-limitd-client 32.67% <ø> (-0.01%) ⬇️
plugins-lodash 25.96% <ø> (ø)
plugins-mariadb 39.53% <ø> (-0.02%) ⬇️
plugins-memcached 38.19% <ø> (-0.04%) ⬇️
plugins-microgateway-core 39.19% <ø> (-0.05%) ⬇️
plugins-moleculer 40.54% <ø> (-0.03%) ⬇️
plugins-mongodb 39.21% <ø> (-0.05%) ⬇️
plugins-mongodb-core 39.05% <ø> (-0.05%) ⬇️
plugins-mongoose 38.90% <ø> (-0.02%) ⬇️
plugins-multer 26.63% <ø> (ø)
plugins-mysql 39.21% <ø> (-0.02%) ⬇️
plugins-mysql2 39.31% <ø> (+0.01%) ⬆️
plugins-node-serialize 26.93% <ø> (ø)
plugins-opensearch 37.65% <ø> (-0.04%) ⬇️
plugins-passport-http 26.68% <ø> (ø)
plugins-postgres 35.75% <ø> (-0.03%) ⬇️
plugins-process 26.79% <ø> (ø)
plugins-pug 26.89% <ø> (ø)
plugins-redis 38.94% <ø> (-0.02%) ⬇️
plugins-router 42.95% <ø> (-0.20%) ⬇️
plugins-sequelize 25.47% <ø> (ø)
plugins-test-and-upstream-amqp10 38.51% <ø> (-0.05%) ⬇️
plugins-test-and-upstream-amqplib 43.83% <ø> (-0.04%) ⬇️
plugins-test-and-upstream-apollo 39.04% <ø> (-0.04%) ⬇️
plugins-test-and-upstream-avsc 38.73% <ø> (-0.05%) ⬇️
plugins-test-and-upstream-bunyan 33.94% <ø> (-0.02%) ⬇️
plugins-test-and-upstream-connect 40.79% <ø> (-0.06%) ⬇️
plugins-test-and-upstream-graphql 40.17% <ø> (-0.02%) ⬇️
plugins-test-and-upstream-koa 40.38% <ø> (-0.05%) ⬇️
plugins-test-and-upstream-protobufjs 38.95% <ø> (-0.05%) ⬇️
plugins-test-and-upstream-rhea 44.03% <ø> (-0.05%) ⬇️
plugins-undici 39.15% <ø> (-0.02%) ⬇️
plugins-url 26.79% <ø> (ø)
plugins-valkey 38.14% <ø> (-0.01%) ⬇️
plugins-vm 26.79% <ø> (ø)
plugins-winston 34.13% <ø> (-0.02%) ⬇️
plugins-ws 41.90% <ø> (-0.03%) ⬇️
profiling-macos 39.84% <ø> (-0.05%) ⬇️
profiling-ubuntu 39.97% <ø> (-0.05%) ⬇️
profiling-windows 41.16% <ø> (-0.07%) ⬇️
serverless-azure-functions-client 25.54% <ø> (ø)
serverless-azure-functions-eventhubs 25.54% <ø> (ø)
serverless-azure-functions-servicebus 25.54% <ø> (ø)



pr-commenter bot commented Mar 6, 2026

Benchmarks

Benchmark execution time: 2026-03-06 21:32:54

Comparing candidate commit 03a2197 in PR branch brian.marks/fix-flaky-infra with baseline commit b6f7a69 in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 231 metrics, 29 unstable metrics.

- Add readiness wait after `docker compose up -d testagent` so tests
  don't start sending multipart data before busboy is ready
- Copy servicebus emulator config file and restart emulator in the
  azure-service-bus CI job (queue.1/topic.1 were missing because
  GitHub Actions services don't support volume mounts; the
  azure-functions job already does this with docker cp + restart)
- Wait for emulator to be ready after restart before running tests

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@bm1549 bm1549 force-pushed the brian.marks/fix-flaky-infra branch from a3b82c3 to b33d345 Compare March 6, 2026 18:42
@bm1549 bm1549 marked this pull request as ready for review March 6, 2026 19:15
@bm1549 bm1549 requested review from a team as code owners March 6, 2026 19:15
@bm1549 bm1549 requested a review from shreyamalpani March 6, 2026 19:15
jcstorms1 previously approved these changes Mar 6, 2026

@jcstorms1 jcstorms1 left a comment

Just one comment for Service Bus but LGTM otherwise.

jcstorms1 previously approved these changes Mar 6, 2026
…service-bus

SQL Server Edge takes 45-90s to start; without a health check the Service Bus
emulator connects before SQL is ready. Add a health check using bash TCP
built-in (nc is not installed in the azure-sql-edge image) so GitHub Actions
gates job steps until SQL is healthy. After docker restart, wait for port 5672
to be ready before running tests.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
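
The "bash TCP built-in" probe named in this commit message could look roughly like the sketch below. It is illustrative only and is not the code from the commit: the helper names `port_open` and `wait_for_port` and the 120s default timeout are assumptions. Only the technique (bash's `/dev/tcp` redirection instead of `nc`) and port 5672 come from the message.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a TCP connection to HOST:PORT can be opened.
# /dev/tcp/HOST/PORT is a bash redirection feature, not a real file, so no
# extra tool such as nc has to be installed in the image.
port_open() {
  bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Hypothetical helper: retry port_open once per second until it succeeds
# or the deadline passes.
# Usage: wait_for_port HOST PORT [TIMEOUT_SECONDS]
# The 120s default is an arbitrary assumption, not a value from the PR.
wait_for_port() {
  local host="$1" port="$2" timeout="${3:-120}"
  local deadline=$(( $(date +%s) + timeout ))

  until port_open "$host" "$port"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out after ${timeout}s waiting for ${host}:${port}" >&2
      return 1
    fi
    sleep 1
  done
}
```

After a `docker restart` of the emulator, a step could call `wait_for_port localhost 5672` before running tests. A successful connect only shows that something is listening on the port, so it is a weaker signal than an application-level health endpoint.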
@bm1549 bm1549 force-pushed the brian.marks/fix-flaky-infra branch from 7511b88 to b2a6f6a Compare March 6, 2026 21:19
…us job

The azure-service-bus plugin tests use the built-in OOB emulator config
(queue.1, topic.1), so copying a custom config file and restarting the
emulator is not needed.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@bm1549 bm1549 requested a review from jcstorms1 March 6, 2026 21:22
@bm1549 bm1549 enabled auto-merge (squash) March 6, 2026 21:28
@bm1549 bm1549 merged commit c48215d into master Mar 6, 2026
789 checks passed
@bm1549 bm1549 deleted the brian.marks/fix-flaky-infra branch March 6, 2026 21:42
dd-octo-sts bot pushed a commit that referenced this pull request Mar 7, 2026
* fix(ci): address flaky test infrastructure issues

- Add readiness wait after `docker compose up -d testagent` so tests
  don't start sending multipart data before busboy is ready
- Copy servicebus emulator config file and restart emulator in the
  azure-service-bus CI job (queue.1/topic.1 were missing because
  GitHub Actions services don't support volume mounts; the
  azure-functions job already does this with docker cp + restart)
- Wait for emulator to be ready after restart before running tests

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* fix(ci): add SQL Edge health check and emulator ready-wait for azure-service-bus

SQL Server Edge takes 45-90s to start; without a health check the Service Bus
emulator connects before SQL is ready. Add a health check using bash TCP
built-in (nc is not installed in the azure-sql-edge image) so GitHub Actions
gates job steps until SQL is healthy. After docker restart, wait for port 5672
to be ready before running tests.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

* fix(ci): remove unnecessary emulator config copy from azure-service-bus job

The azure-service-bus plugin tests use the built-in OOB emulator config
(queue.1, topic.1), so copying a custom config file and restarting the
emulator is not needed.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

---------

Co-authored-by: Claude Sonnet 4.6 <[email protected]>
@atlassian atlassian bot mentioned this pull request Mar 7, 2026
juan-fernandez pushed a commit that referenced this pull request Mar 10, 2026
Labels

AI Generated (Largely based on code generated by an AI or LLM. This label is the same across all dd-trace-* repos)
semver-patch

2 participants