
Emit OTEL error events when any downstream service it depends on returns a rate-limiting or quota-related failure #256

@nicobistolfi

Description

Summary

Vigilante should emit OTEL error events when any downstream service it depends on returns a rate-limiting or quota-related failure. The codebase already classifies some provider quota/rate-limit failures and has OTEL workflow/event plumbing, but there is not yet a clear, dedicated telemetry signal for service-side rate limiting across the systems Vigilante uses.

Issue Type: feature

Problem

  • Vigilante depends on multiple external services and CLIs, including GitHub and coding-agent providers.
  • When one of those services returns a rate-limit or quota-related failure, Vigilante can block or fail locally, but the rate-limit event itself is not surfaced as a dedicated OTEL error signal.
  • This makes it harder to observe systemic throttling problems, distinguish service-side quota exhaustion from other failures, and build alerting or dashboards around rate-limit pressure.

Context

  • Repository: aliengiraffe/vigilante
  • Existing OTEL and analytics support lives in internal/telemetry/telemetry.go.
  • Existing workflow telemetry already captures bounded product/workflow events via telemetry.CaptureWorkflowEvent(...).
  • The codebase already recognizes quota/rate-limit style failures in some places, for example provider-side classification that maps usage-limit/rate-limit/quota failures into provider_quota.
  • This classification should be extended into telemetry so that rate-limited external dependencies become observable events rather than just local blocked states or log lines; a wiring sketch follows this list.
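
As a rough sketch of how the existing classification could feed such a signal, assuming a CaptureWorkflowEvent-style helper that takes an event name plus a small property map (the real signature in internal/telemetry/telemetry.go may differ, and the class names below are illustrative):

```go
package main

import "fmt"

// FailureClass mirrors the kind of classification the codebase already
// performs, e.g. mapping usage-limit/rate-limit/quota failures into
// provider_quota.
type FailureClass string

const (
	ClassProviderQuota FailureClass = "provider_quota" // quota exhausted, blocking
	ClassRateLimited   FailureClass = "rate_limited"   // transient throttling, retryable
	ClassOther         FailureClass = "other"
)

// captureWorkflowEvent stands in for telemetry.CaptureWorkflowEvent(...).
func captureWorkflowEvent(name string, props map[string]string) {
	fmt.Println("event:", name, props)
}

// emitDownstreamRateLimit converts an already-classified downstream failure
// into a bounded telemetry event; the raw error is deliberately never
// forwarded.
func emitDownstreamRateLimit(service, operation string, class FailureClass) {
	if class != ClassProviderQuota && class != ClassRateLimited {
		return // not a rate-limit/quota failure, nothing to emit
	}
	captureWorkflowEvent("downstream_rate_limited", map[string]string{
		"service":   service,       // normalized category, e.g. "provider"
		"operation": operation,     // high-level operation, e.g. "run_agent"
		"class":     string(class), // retryable vs quota-blocking
	})
}

func main() {
	emitDownstreamRateLimit("provider", "run_agent", ClassProviderQuota)
}
```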

Desired Outcome

  • When any downstream service used by Vigilante returns a rate-limit or quota-related error, Vigilante emits an OTEL error event describing that failure in a bounded, privacy-safe way.
  • The event should make it possible to identify (one possible shape is sketched after this list):
      • which service category was rate limiting Vigilante
      • what high-level operation was being attempted
      • whether the failure was classified as retryable/transient or as quota-related/blocking
  • The emitted telemetry should avoid leaking sensitive request content: prompts, tokens, raw arguments, or other free-form payloads.
  • Non-goals:
      • sending raw API responses or full stderr/stdout bodies to OTEL
      • building a full retry/backoff policy in the same change, unless necessary for instrumentation correctness
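
One possible bounded shape for that event; every field name here is hypothetical rather than the repository's actual schema:

```go
package telemetry

// RateLimitEvent is an illustrative bounded event shape: each field is a
// short, normalized or enumerated string, never a free-form payload.
type RateLimitEvent struct {
	Service   string // normalized category: "github", "provider", "telemetry_export"
	Operation string // high-level operation: "clone", "create_pr", "run_agent"
	Class     string // "retryable_transient" or "quota_blocking"
}
```

Keeping every field to a closed set of values is what makes the dashboarding and alerting goals below achievable.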

Implementation Notes

  • Treat this as a feature request for operational observability.
  • Required behavior:
      • detect rate-limit/quota-style failures from the services Vigilante uses
      • emit an OTEL error event when those failures occur (see the emission sketch after this list)
      • keep emitted properties bounded and privacy-safe
  • Plausible implementation areas:
      • internal/telemetry/telemetry.go for the event helper/schema
      • failure-classification paths in internal/blocking, internal/app, and internal/runner
      • shared command/service execution layers where service-specific errors are normalized
  • Required constraints:
      • do not emit raw prompts, tokens, repo-private payloads, or full free-form error text that may contain sensitive information
      • keep the event taxonomy consistent with existing OTEL/workflow telemetry
      • cover at least the known quota/rate-limit failure shapes already classified in the codebase
  • Flexible details:
      • whether the signal is emitted as an OTEL log record, a workflow analytics event, or both
      • whether service names are normalized into categories such as github, provider, or telemetry_export
  • Tradeoffs to consider:
      • generic runner-level detection provides broad coverage, but service-specific classification may be needed to avoid false positives
      • call-site emission can attach richer context, but makes it easier to miss future rate-limit paths
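
A minimal sketch of the emission side, assuming the event is attached to the active span via the standard go.opentelemetry.io/otel API; whether Vigilante ultimately uses span events, OTEL log records, or its workflow analytics path is one of the flexible details above, and the helper name here is hypothetical:

```go
package telemetry

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

// RecordDownstreamRateLimit attaches a bounded error event to the span in
// ctx. All three arguments must already be normalized, enumerated strings;
// raw error text never passes through this function.
func RecordDownstreamRateLimit(ctx context.Context, service, operation, class string) {
	span := trace.SpanFromContext(ctx)
	span.AddEvent("downstream.rate_limited", trace.WithAttributes(
		attribute.String("service.category", service),
		attribute.String("operation", operation),
		attribute.String("failure.class", class),
	))
	// Mark the span as errored so backends surface it as an error signal.
	span.SetStatus(codes.Error, "downstream rate limited")
}
```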

Acceptance Criteria

  • Vigilante emits an OTEL error event when a downstream service returns a rate-limit or quota-related error.
  • The event includes bounded metadata identifying the affected service/category, the high-level operation, and the outcome/classification.
  • Known provider quota/rate-limit failures are covered by the new telemetry signal.
  • Sensitive raw payloads such as prompts, tokens, raw request bodies, and unrestricted stderr/stdout are not emitted.
  • The emitted telemetry is consistent enough to support dashboarding or alerting around downstream throttling.

Testing Expectations

  • Add tests for detection and telemetry emission of provider quota/rate-limit failures.
  • Add tests for at least one non-provider downstream rate-limit scenario when applicable, such as GitHub/API throttling.
  • Add negative tests proving that sensitive raw payloads are not emitted in the telemetry event (a sketch of such a test follows this list).
  • Add coverage for the bounded event fields/schema so future changes do not silently break observability.
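
For the negative tests, one shape that works without a real exporter; buildRateLimitProps is a hypothetical stand-in for however the real helper assembles event properties:

```go
package telemetry

import (
	"strings"
	"testing"
)

// buildRateLimitProps stands in for the real property builder; it must only
// copy enumerated, normalized values into the event.
func buildRateLimitProps(service, operation, class string) map[string]string {
	return map[string]string{
		"service":   service,
		"operation": operation,
		"class":     class,
	}
}

func TestRateLimitEventPropsAreBoundedAndSafe(t *testing.T) {
	props := buildRateLimitProps("github", "create_pr", "retryable")

	// The schema must stay small and enumerable.
	if len(props) > 5 {
		t.Fatalf("event has %d properties; schema should stay bounded", len(props))
	}
	// No property value may contain token-like or free-form sensitive markers.
	for key, val := range props {
		for _, marker := range []string{"ghp_", "sk-", "prompt"} {
			if strings.Contains(val, marker) {
				t.Fatalf("property %q leaked sensitive content: %q", key, val)
			}
		}
	}
}
```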

Operational / UX Considerations

  • These events should make downstream throttling visible without requiring operators to infer the pattern from scattered blocked sessions or logs.
  • Keep the event names and fields stable enough for dashboards and alerts.
  • If multiple services can rate limit Vigilante, normalize service categories so aggregate analysis remains useful; a normalization sketch follows.
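
If normalization is adopted, it helps to keep the mapping in one place so every emitter agrees on categories; a sketch, with all matching rules illustrative:

```go
package telemetry

import "strings"

// NormalizeServiceCategory collapses concrete endpoints and CLIs into the
// small category set used by the rate-limit events.
func NormalizeServiceCategory(raw string) string {
	switch {
	case strings.Contains(raw, "github"):
		return "github"
	case strings.Contains(raw, "agent"), strings.Contains(raw, "provider"):
		return "provider"
	case strings.Contains(raw, "otlp"), strings.Contains(raw, "collector"):
		return "telemetry_export"
	default:
		return "other"
	}
}
```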

Metadata

Labels

vigilante:automerge, vigilante:done (Vigilante completed its work on the issue and no further automation is expected.)
