Summary
Vigilante should emit OTEL error events when any downstream service it depends on returns a rate-limiting or quota-related failure. The codebase already classifies some provider quota/rate-limit failures and already has OTEL workflow/event plumbing, but there is not yet a clear, dedicated telemetry signal for service-side rate limiting across the systems Vigilante uses.
Issue Type: feature
Problem
- Vigilante depends on multiple external services and CLIs, including GitHub and coding-agent providers.
- When one of those services returns a rate-limit or quota-related failure, Vigilante can block or fail locally, but the rate-limit event itself is not being surfaced as a dedicated OTEL error signal.
- This makes it harder to observe systemic throttling problems, distinguish service-side quota exhaustion from other failures, and build alerting or dashboards around rate-limit pressure.
Context
- Repository: aliengiraffe/vigilante
- Existing OTEL and analytics support lives in internal/telemetry/telemetry.go.
- Existing workflow telemetry already captures bounded product/workflow events via telemetry.CaptureWorkflowEvent(...).
- The codebase already recognizes quota/rate-limit style failures in some places, for example provider-side classification that maps usage-limit/rate-limit/quota failures into provider_quota.
- Similar behavior should be extended into telemetry so rate-limited external dependencies become observable events, not just local blocked states or logs.
Desired Outcome
- When any downstream service used by Vigilante returns a rate-limit or quota-related error, Vigilante emits an OTEL error event describing that failure in a bounded, privacy-safe way.
- The event should make it possible to identify:
  - which service category was rate limiting Vigilante
  - what high-level operation was being attempted
  - whether the failure was classified as retryable/transient vs quota-related/blocking
- The emitted telemetry should avoid leaking sensitive request content, prompts, tokens, raw arguments, or other free-form payloads.
- Non-goals:
  - sending raw API responses or full stderr/stdout bodies to OTEL
  - building a full retry/backoff policy in the same change unless necessary for instrumentation correctness
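As an illustration of the bounded event described above, here is a minimal Go sketch of a possible event shape. Every type, constant, and attribute key is an assumption made for this example; none is taken from the Vigilante codebase. The service category values reuse the examples named in this issue, and the operation names are invented. The sketch deliberately stops short of calling telemetry.CaptureWorkflowEvent(...) or any OTEL API, since their signatures are not shown here.

```go
package main

// Every identifier below is hypothetical and chosen for illustration;
// none is taken from the Vigilante codebase.

// ServiceCategory is a small, closed set of downstream service categories.
type ServiceCategory string

const (
	ServiceGitHub          ServiceCategory = "github"
	ServiceProvider        ServiceCategory = "provider"
	ServiceTelemetryExport ServiceCategory = "telemetry_export"
)

// Operation names the high-level action that was being attempted.
type Operation string

const (
	OpIssueSync Operation = "issue_sync" // assumed operation name
	OpAgentRun  Operation = "agent_run"  // assumed operation name
)

// LimitClass separates retryable throttling from blocking quota exhaustion.
type LimitClass string

const (
	LimitTransient LimitClass = "transient"
	LimitQuota     LimitClass = "quota"
)

// RateLimitEvent has no free-form text field. The typed fields steer callers
// toward the enumerated constants rather than raw error strings.
type RateLimitEvent struct {
	Service   ServiceCategory
	Operation Operation
	Class     LimitClass
}

// Attributes flattens the event into the key/value pairs that would be
// attached to the emitted error event.
func (e RateLimitEvent) Attributes() map[string]string {
	return map[string]string{
		"service_category": string(e.Service),
		"operation":        string(e.Operation),
		"limit_class":      string(e.Class),
	}
}
```

With this shape, a call site would build something like RateLimitEvent{Service: ServiceGitHub, Operation: OpIssueSync, Class: LimitTransient} and hand Attributes() to whichever emitter the change settles on. Go's type system does not stop a caller from converting arbitrary text into these types, so a schema test is still worthwhile.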
Implementation Notes
- Treat this as a feature request for operational observability.
- Required behavior:
  - detect rate-limit/quota-style failures from the services Vigilante uses
  - emit an OTEL error event when those failures occur
  - keep emitted properties bounded and privacy-safe
- Plausible implementation areas:
  - internal/telemetry/telemetry.go for the event helper/schema
  - failure-classification paths in internal/blocking, internal/app, and internal/runner
  - shared command/service execution layers where service-specific errors are normalized
- Required constraints:
  - do not emit raw prompts, tokens, repo-private payloads, or full free-form error text when it may contain sensitive information
  - keep the event taxonomy consistent with existing OTEL/workflow telemetry
  - cover at least the known quota/rate-limit failure shapes already classified in the codebase
- Flexible details:
  - whether the signal is emitted as an OTEL log record, a workflow analytics event, or both
  - whether service names are normalized into categories such as github, provider, telemetry_export, etc.
- Tradeoffs to consider:
  - generic runner-level detection provides broad coverage, but service-specific classification may be needed to avoid false positives
  - call-site emission can attach better context, but it is easier to miss future rate-limit paths
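To make the detection step concrete, here is a rough Go sketch of a generic classifier that maps an HTTP status and error text to a bounded label. The function name, the labels, and the marker substrings are all assumptions for illustration, not Vigilante's existing classification. The raw error text is only inspected and never returned, so it cannot end up in the emitted event.

```go
package main

import "strings"

// Hypothetical classification labels; not Vigilante's actual taxonomy.
const (
	ClassNone      = ""
	ClassTransient = "transient"
	ClassQuota     = "quota"
)

// Illustrative marker substrings (lowercase). A real change would start from
// the failure shapes the codebase already classifies.
var (
	quotaMarkers    = []string{"quota exceeded", "usage limit", "insufficient_quota"}
	throttleMarkers = []string{"rate limit", "too many requests"}
)

// containsAny reports whether text contains at least one of the markers.
func containsAny(text string, markers []string) bool {
	for _, m := range markers {
		if strings.Contains(text, m) {
			return true
		}
	}
	return false
}

// ClassifyLimit inspects an HTTP status (0 when unknown) and the raw error
// text, and returns only a bounded label. Quota markers are checked first so
// that a 429 carrying a quota message is reported as blocking, not transient.
func ClassifyLimit(httpStatus int, errText string) string {
	text := strings.ToLower(errText)
	if containsAny(text, quotaMarkers) {
		return ClassQuota
	}
	if httpStatus == 429 || containsAny(text, throttleMarkers) {
		return ClassTransient
	}
	return ClassNone
}
```

Plain substring matching like this is where the false-positive tradeoff shows up: any stderr that happens to mention "rate limit" would be classified. Keeping a separate marker list per service category is one way to narrow that.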
Acceptance Criteria
Testing Expectations
- Add tests for detection and telemetry emission of provider quota/rate-limit failures.
- Add tests for at least one non-provider downstream rate-limit scenario when applicable, such as GitHub/API throttling.
- Add negative tests proving that sensitive raw payloads are not emitted in the telemetry event.
- Add coverage for the bounded event fields/schema so future changes do not silently break observability.
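One way to back the negative and schema tests is a small validator that rejects any attribute key or value outside a fixed allowlist. The Go sketch below assumes a hypothetical three-field schema. The keys and operation names are invented for this example, and the service categories reuse the ones named in this issue.

```go
package main

import "fmt"

// allowedValues is a hypothetical schema: each permitted attribute key maps
// to the full set of values it may take.
var allowedValues = map[string][]string{
	"service_category": {"github", "provider", "telemetry_export"},
	"operation":        {"issue_sync", "agent_run"}, // assumed operation names
	"limit_class":      {"transient", "quota"},
}

// ValidateAttributes returns an error when attrs has an unknown key, is
// missing a required key, or carries a value outside the enumerated set.
// The offending value is never echoed back, in case it is sensitive.
func ValidateAttributes(attrs map[string]string) error {
	for key, value := range attrs {
		allowed, ok := allowedValues[key]
		if !ok {
			return fmt.Errorf("unexpected attribute key %q", key)
		}
		found := false
		for _, candidate := range allowed {
			if candidate == value {
				found = true
				break
			}
		}
		if !found {
			return fmt.Errorf("attribute %q has a non-enumerated value", key)
		}
	}
	for key := range allowedValues {
		if _, ok := attrs[key]; !ok {
			return fmt.Errorf("missing attribute key %q", key)
		}
	}
	return nil
}
```

A test could feed a simulated failure whose stderr contains a fake token through the emission path, capture the resulting attributes, and assert that ValidateAttributes returns nil. Any leaked free-form text would then fail the enumerated-value check.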
Operational / UX Considerations
- These events should make downstream throttling visible without requiring operators to infer the pattern from scattered blocked sessions or logs.
- Keep the event names and fields stable enough for dashboards and alerts.
- If multiple services can rate limit Vigilante, normalize service categories so aggregate analysis remains useful.