feat: implement multi-layer AI resilience with retries and circuit br…#29

Merged
dedilaniado merged 6 commits into main from feature/ai-resilience
Apr 25, 2026

Conversation

@jonatan5524
Owner

@jonatan5524 jonatan5524 commented Apr 7, 2026

📋 PR Summary: AI Resilience & Error Handling

Goal: Implement a robust safety layer for AI model calls to handle rate limits (429) and service outages (5xx) gracefully, preventing system crashes and providing instant feedback to users.

🚀 Key Features

  1. Smart Retries (tenacity):

    • Added exponential backoff with jitter to all Gemini API calls in llm_service.py.
    • Automatically retries on transient errors (Rate Limits or Server Errors) up to 3 times before failing.
  2. Circuit Breaker Pattern:

    • Implemented a thread-safe CircuitBreaker singleton that acts as a "safety switch."
    • Trips (OPEN) after 3 consecutive failures, preventing further API calls for 60 seconds.
    • Fast Failure: While the circuit is OPEN, requests fail immediately (milliseconds) instead of hanging, saving server resources and improving UX.
    • Self-Healing: Automatically enters a "HALF-OPEN" state after the timeout to test if the service has recovered.
  3. End-to-End Error Mapping:

    • Python Engine: Maps internal resilience errors to standard HTTP codes (429 for rate limits, 503 for service unavailable/circuit open).
    • NestJS Orchestrator: Catches these specific codes and translates them into a user-friendly AI_SERVICE_BUSY message: "The AI engine is currently at capacity. Please try again in a minute."
  4. Observability & Docs:

    • Replaced print statements with proper Python logging.
    • Added a "Resilience & Stability" section to the data-engine README to explain the logic to future developers.
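For readers unfamiliar with the pattern, the circuit breaker behavior described above (trip after 3 consecutive failures, fail fast for 60 seconds, then probe in HALF-OPEN) might look roughly like the sketch below. This is a minimal illustration and not the code from this PR: the `call` method, the `CircuitOpenError` name, the constructor parameters, and the injectable `clock` are all assumptions, and a module-level instance stands in for the singleton.

```python
import threading
import time


class CircuitOpenError(RuntimeError):
    """Hypothetical error raised when a call is rejected because the circuit is OPEN."""


class CircuitBreaker:
    """Illustrative breaker: CLOSED -> OPEN after N consecutive failures,
    OPEN -> HALF_OPEN after a timeout, HALF_OPEN -> CLOSED on the next success."""

    def __init__(self, failure_threshold=3, recovery_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock  # injectable so tests do not have to wait 60 seconds
        self._lock = threading.Lock()
        self._failures = 0
        self._opened_at = None
        self.state = "CLOSED"

    def call(self, func, *args, **kwargs):
        with self._lock:
            if self.state == "OPEN":
                if self._clock() - self._opened_at < self.recovery_timeout:
                    # Fast failure: reject without touching the remote service.
                    raise CircuitOpenError("circuit is OPEN; failing fast")
                # Timeout elapsed: let a trial call through.
                self.state = "HALF_OPEN"
        try:
            result = func(*args, **kwargs)
        except Exception:
            with self._lock:
                self._failures += 1
                # A failed probe, or too many consecutive failures, (re)opens the circuit.
                if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
                    self.state = "OPEN"
                    self._opened_at = self._clock()
            raise
        with self._lock:
            # Any success resets the failure count and closes the circuit.
            self._failures = 0
            self.state = "CLOSED"
        return result


# Module-level instance as a simple stand-in for the singleton (assumption).
breaker = CircuitBreaker()
```

One simplification to be aware of: this sketch does not limit how many concurrent callers may probe while the circuit is HALF_OPEN, which a production implementation would probably want to restrict.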

🧪 Testing & Verification

  • Python Unit Tests (test_resilience.py): Verified that the circuit trips after failures, fast-fails when open, and correctly executes retries.
  • NestJS Integration Tests (ai-resilience.spec.ts): Verified that the orchestrator correctly identifies and maps technical failures from the Python service.
  • Manual Verification: Successfully verified "Fast Failure" behavior by simulating API errors in main.py.
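The error mapping that these tests exercise can be pictured with the small sketch below. It is written in Python for both sides purely for illustration (the orchestrator in this PR is NestJS), and every name in it is hypothetical: the exception classes, `to_http_status`, and `user_facing_error` are assumptions, as is the fallback to 500 for unrecognised errors. Only the 429/503 codes, the AI_SERVICE_BUSY code, and the message text come from the summary above.

```python
class RateLimitError(Exception):
    """Hypothetical: the model API reported a rate limit."""


class ServiceUnavailableError(Exception):
    """Hypothetical: the model API returned a 5xx after retries."""


class CircuitOpenError(Exception):
    """Hypothetical: the circuit breaker rejected the call."""


def to_http_status(exc):
    """Engine side: map an internal resilience error to an HTTP status code."""
    if isinstance(exc, RateLimitError):
        return 429
    if isinstance(exc, (ServiceUnavailableError, CircuitOpenError)):
        return 503
    return 500  # assumption: anything else is treated as a generic server error


AI_SERVICE_BUSY_MESSAGE = (
    "The AI engine is currently at capacity. Please try again in a minute."
)


def user_facing_error(status):
    """Orchestrator side: translate 429/503 into the friendly AI_SERVICE_BUSY payload."""
    if status in (429, 503):
        return {"code": "AI_SERVICE_BUSY", "message": AI_SERVICE_BUSY_MESSAGE}
    return None  # other statuses are left to the caller's normal error handling
```

Keeping the mapping in two small pure functions like this makes each half easy to unit test without starting either service.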

Note for Reviewer: This implementation uses the tenacity library for declarative retries and a custom CircuitBreaker implementation to keep the service stable during peaks in AI model traffic.
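For reviewers who have not used tenacity, the behavior it provides declaratively (exponential backoff with jitter, giving up after a fixed number of attempts) is roughly equivalent to the hand-rolled loop below. This is a standard-library stand-in, not the decorator used in llm_service.py: `call_with_retries`, `TransientError`, and the delay values are hypothetical, and "up to 3 times" is read here as 3 total attempts, which is an assumption.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure (429 or 5xx)."""


def call_with_retries(func, max_attempts=3, base_delay=1.0, max_delay=10.0,
                      sleep=time.sleep, rng=random.uniform):
    """Call func, retrying on TransientError with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # The delay ceiling doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random amount between 0 and the ceiling so that
            # many clients retrying at once do not hit the API in lockstep.
            sleep(rng(0, ceiling))
```

Non-transient exceptions are deliberately not caught, so programming errors and permanent failures propagate immediately instead of being retried.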

@jonatan5524 jonatan5524 marked this pull request as draft April 7, 2026 12:52
@jonatan5524 jonatan5524 marked this pull request as ready for review April 7, 2026 13:27
@jonatan5524 jonatan5524 linked an issue Apr 24, 2026 that may be closed by this pull request
@dedilaniado dedilaniado merged commit f186ca9 into main Apr 25, 2026

Development

Successfully merging this pull request may close these issues.

error handling model limit

2 participants