Skip to content

fix: bound live UI reads and DAG run timeouts#1827

Merged
yottahmd merged 4 commits intodagucloud:mainfrom
yottahmd:fix/sse-conn-2
Mar 22, 2026
Merged

fix: bound live UI reads and DAG run timeouts#1827
yottahmd merged 4 commits intodagucloud:mainfrom
yottahmd:fix/sse-conn-2

Conversation

@yottahmd
Copy link
Copy Markdown
Collaborator

@yottahmd yottahmd commented Mar 22, 2026

Summary

  • add bounded request timeouts and abort handling for live UI fetches
  • serialize DAG run detail polling and harden hover detail fetches with abort and cache
  • propagate DAG run read deadlines through the API and file-backed store so timeouts return cleanly

Testing

  • go test ./internal/persis/filedagrun -count=1
  • go test ./internal/service/frontend/api/v1 -count=1

Summary by CodeRabbit

  • New Features

    • Added timeout protection (10 seconds) for DAG run read operations to prevent hanging requests
    • Added error handling and UI messages for timeout and cancellation scenarios
    • Implemented caching for hover-triggered step details to improve performance
    • Added automatic polling for DAG run details when modals/panels are open
  • Bug Fixes

    • Improved error messages displayed when operations timeout or are canceled
  • Refactor

    • Enhanced internal cancellation support across status reading and directory scanning operations

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 22, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c18b5fa1-3d5a-46ea-869c-adbffaf4d294

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Introduces context-aware cancellation through the request pipeline: backend persistence layer propagates context.Context through status reading with cancellation checks; API handlers enforce 10-second timeouts on DAG-run reads; frontend uses timeout-aware fetch utilities and refactors DAG-run details fetching via a new bounded hook with controlled concurrency and polling.

Changes

Cohort / File(s) Summary
Backend Context Propagation
internal/persis/filedagrun/attempt.go, internal/persis/filedagrun/attempt_test.go
Added context awareness to status reading/parsing with cancellation checks; ReadStatus and parseLocked now accept context.Context; introduced parseStatusFileWithContext helper; error wrapping upgraded to use %w format.
Backend Cancellation Integration
internal/persis/filedagrun/dagrun.go, internal/persis/filedagrun/dataroot.go, internal/persis/filedagrun/dataroot_test.go
Extended directory traversal operations to respect context cancellation: LatestAttempt and FindByDAGRunID now check ctx.Err() during iteration and return immediately on cancellation.
API Timeout Wrapper & Handlers
internal/service/frontend/api/v1/dagruns.go, internal/service/frontend/api/v1/dagruns_internal_test.go
Added withDAGRunReadTimeout[T] wrapper enforcing 10s deadline for DAG-run reads; updated ListDAGRuns, ListDAGRunsByName, GetDAGRunDetails, and GetSubDAGRunDetails handlers to use wrapper and return gateway-timeout responses; added test validating timeout behavior.
Frontend Timeout Utilities
ui/src/lib/requestTimeout.ts
New module providing timeout-aware fetch with RequestTimeoutError and RequestAbortError custom classes; determines request type (read/write) and applies corresponding timeouts; sophisticated error mapping for timeout/abort scenarios; includes helper predicates and retry logic.
Frontend Request Infrastructure
ui/src/lib/fetchJson.ts, ui/src/hooks/api.ts, ui/src/App.tsx
Integrated fetchWithTimeout throughout request pipeline: fetchJson uses timeout-aware fetch; API client initialized with fetchWithTimeout as default; app-level fetch updated with timeout behavior and new retry predicate.
Frontend DAG Run Details Request
ui/src/features/dag-runs/hooks/dagRunDetailsRequest.ts
New module exporting DAGRunDetailsRequestTarget type and utilities: buildDAGRunDetailsPath constructs conditional API paths for sub-DAG vs main DAG lookups; fetchDAGRunDetails handles fetching with request init support.
Frontend Bounded Details Hook
ui/src/features/dag-runs/hooks/useBoundedDAGRunDetails.ts
New React hook managing DAG run details with concurrency control, polling, and error handling; prevents overlapping requests via inFlightRef/pendingRef queuing; cancels in-flight requests using AbortController; provides data, error, isLoading, isValidating, and async refresh() method.
Frontend Component Updates
ui/src/features/dag-runs/components/dag-run-details/DAGRunDetailsModal.tsx, ui/src/features/dag-runs/components/dag-run-details/DAGRunDetailsPanel.tsx, ui/src/pages/dag-runs/dag-run/index.tsx
Refactored components to use useBoundedDAGRunDetails hook instead of conditional useQuery calls; added target-based routing logic (sub-DAG vs main DAG); introduced error UI states; improved data validation with dagRunId matching.
Frontend Hover-Driven Fetch
ui/src/features/dag-runs/components/dag-run-list/StepDetailsTooltip.tsx
Replaced useQuery-based fetching with manual hover-driven flow: introduces delay timer, request abort on close, module-level hover cache with TTL, and error message display from fetch failure.
Frontend Live Invalidation
ui/src/hooks/useAppLive.ts
Enhanced useLiveInvalidation with cooldown tracking after abort-like errors; computes effective revalidation delay respecting cooldown period; updates cleanup logic to reset cooldown on unmount.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant API as API Handler
    participant Persist as Persistence Layer
    participant Cache as Status Cache
    
    Client->>API: ReadStatus(ctx with timeout)
    API->>Persist: FindByDAGRunID(ctx)
    Persist->>Persist: Check ctx.Err() during iteration
    alt Context Cancelled
        Persist-->>API: context.Canceled
    else Context Valid
        Persist->>Persist: LatestAttempt(ctx)
        Persist->>Persist: Check ctx.Err() in loop
        Persist->>Persist: parseLocked(ctx)
        Persist->>Cache: Check cached status
        alt Cache Miss
            Persist->>Persist: parseStatusFileWithContext(ctx)
            Persist->>Persist: Check ctx.Err() during scan
        else Cache Hit
            Persist->>Persist: parseLocked(ctx) on cached data
        end
        Persist-->>API: *DAGRunStatus
    end
    API-->>Client: Response or timeout error
Loading
sequenceDiagram
    participant Component as React Component
    participant Hook as useBoundedDAGRunDetails
    participant Timer as Polling Timer
    participant Fetch as fetchWithTimeout
    participant API as Backend API
    
    Component->>Hook: useBoundedDAGRunDetails(target, enabled)
    activate Hook
    Hook->>Hook: Initialize refs (inFlight, pending)
    Hook->>Hook: Schedule initial fetch
    deactivate Hook
    
    Loop Polling (when enabled & pollIntervalMs > 0)
        Timer->>Hook: Poll interval elapsed
        Hook->>Hook: Check inFlightRef (prevent overlap)
        alt Request In Flight
            Hook->>Hook: Queue in pendingRef
        else No Request In Flight
            Hook->>Hook: Create AbortController
            Hook->>Fetch: fetchWithTimeout(target)
            Fetch->>Fetch: Link abort signals, set timeout
            Fetch->>API: fetch(request, {signal})
            alt Timeout or Abort
                API-->>Fetch: AbortSignal fired
                Fetch-->>Hook: RequestTimeoutError or RequestAbortError
            else Success
                API-->>Fetch: Response
                Fetch-->>Hook: data (mapped)
            end
            Hook->>Hook: Update state (data, error, isValidating)
            alt Pending Refetch Queued
                Hook->>Hook: Clear pending, reschedule polling
            end
        end
    end
    
    Component->>Hook: Call refresh()
    Hook->>Fetch: fetchWithTimeout(target, {signal: abortController.signal})
    Fetch->>API: fetch(request, {signal})
    API-->>Fetch: Response
    Fetch-->>Hook: DAGRunDetails
    Hook->>Hook: Update data, clear error
    Hook-->>Component: Refreshed details
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 5.56% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: bound live UI reads and DAG run timeouts' directly and clearly summarizes the main changes: adding timeout bounds and controls to live UI reads and DAG run operations.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (4)
ui/src/features/dag-runs/hooks/dagRunDetailsRequest.ts (1)

1-4: Missing GPL v3 license header.

As per coding guidelines, source files should have GPL v3 license headers managed via make addlicense.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui/src/features/dag-runs/hooks/dagRunDetailsRequest.ts` around lines 1 - 4,
This file is missing the required GPL v3 license header; add the standard GPL v3
license header to the top of the file (above the imports) by running the
repository tool: execute make addlicense so the header is applied automatically,
ensuring it precedes the existing imports and type export (references: the
import lines for components and fetchJson and the exported type DAGRunDetails).
ui/src/features/dag-runs/hooks/useBoundedDAGRunDetails.ts (2)

188-196: Redundant mountedRef.current = true assignment.

The ref is already initialized to true on line 48, and refs persist across renders. The assignment on line 189 is unnecessary.

🧹 Proposed cleanup
   useEffect(() => {
-    mountedRef.current = true;
     return () => {
       mountedRef.current = false;
       clearPollTimer();
       abortActiveRequest();
       pendingRef.current = false;
     };
   }, [abortActiveRequest, clearPollTimer]);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui/src/features/dag-runs/hooks/useBoundedDAGRunDetails.ts` around lines 188 -
196, In useBoundedDAGRunDetails, remove the redundant mountedRef.current = true
assignment inside the useEffect since mountedRef is already initialized to true
and refs persist; keep the cleanup callback that sets mountedRef.current = false
and calls clearPollTimer(), abortActiveRequest(), and sets pendingRef.current =
false, leaving the effect dependencies ([abortActiveRequest, clearPollTimer])
unchanged.

1-8: Missing GPL v3 license header.

As per coding guidelines, source files should have GPL v3 license headers managed via make addlicense.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui/src/features/dag-runs/hooks/useBoundedDAGRunDetails.ts` around lines 1 -
8, This file lacks the required GPL v3 license header; run the project's license
tooling to add it (e.g., execute the make addlicense target) so the header is
inserted at the top of the module containing useBoundedDAGRunDetails (the file
that imports fetchDAGRunDetails and isAbortLikeError and declares
useBoundedDAGRunDetails). Ensure the header matches the project's GPL v3
template used by addlicense and is placed before any imports or code.
ui/src/lib/requestTimeout.ts (1)

1-6: Missing GPL v3 license header.

As per coding guidelines, source files should have GPL v3 license headers managed via make addlicense.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui/src/lib/requestTimeout.ts` around lines 1 - 6, This file is missing the
required GPL v3 license header; add the standard GPL v3 header to the top of the
file (above the existing declarations like READ_METHODS,
DEFAULT_READ_TIMEOUT_MS, DEFAULT_WRITE_TIMEOUT_MS, and
LIVE_INVALIDATION_COOLDOWN_MS) by running the repository's license tool (make
addlicense) or inserting the project's standard header so the file complies with
the project's licensing guidelines.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/service/frontend/api/v1/dagruns.go`:
- Around line 1359-1371: The handler currently only checks for
context.DeadlineExceeded from withDAGRunReadTimeout and treats other errors
(including context.Canceled) as not-found; update the error handling in
GetDAGRunDetails (the block that calls withDAGRunReadTimeout and uses
dagRunReadRequestInfo) to first detect errors.Is(err, context.Canceled) and
return an appropriate "request canceled" response (same style as the deadline
branch — e.g., a short-circuit response with a client-canceled status/body)
instead of falling through to the 404 branch; apply the identical change to the
GetSubDAGRunDetails handler so canceled contexts are preserved there as well.
- Around line 82-100: The helper currently returns a successful result if
err==nil even when the child context (readCtx) has already been cancelled or hit
its deadline; after calling read(readCtx) and before returning result, check
readCtx.Err() and if it is context.DeadlineExceeded or context.Canceled handle
it the same way as in the switch: log via logger.Warn with info.attrs(duration)
and return a zero T plus the corresponding context error (use startedAt to
compute duration). In short: after obtaining result, inspect readCtx.Err() and
convert late-success into the appropriate timeout/cancel error instead of
returning result.

In `@ui/src/features/dag-runs/components/dag-run-details/DAGRunDetailsModal.tsx`:
- Around line 103-105: The check for freshDetails uses expectedDagRunId which
equals 'latest' for dagRunId === 'latest', but the server returns a resolved
UUID so equality fails; update the logic in DAGRunDetailsModal.tsx to consider a
resolved response as fresh when the requested dagRunId was the sentinel 'latest'
(or when canQuerySubDag and subDAGRunId is used) — e.g., detect if dagRunId ===
'latest' and then accept latestDetails as fresh when latestDetails?.dagRunId is
defined (or compare against latestDetails?.dagRunId instead of the literal
'latest'); adjust the assignment of expectedDagRunId/freshDetails (references:
expectedDagRunId, canQuerySubDag, subDAGRunId, dagRunId, latestDetails,
freshDetails) so the first open shows the returned details for 'latest'.

In `@ui/src/features/dag-runs/components/dag-run-details/DAGRunDetailsPanel.tsx`:
- Around line 63-67: The current guard compares latestDetails?.dagRunId to
expectedDagRunId which can be the literal 'latest', so when the API returns a
concrete DAG run id for a "latest" request data stays null; change the match
logic so that when the requested dagRunId is 'latest' (or dagRunId is falsy and
you defaulted to 'latest') you treat any returned latestDetails as a match.
Concretely, update the conditional that sets data (using expectedDagRunId,
isSubDAGRun, subDAGRunId, dagRunId, and latestDetails?.dagRunId) to also
consider expectedDagRunId === 'latest' as a successful match (i.e., set data = {
dagRunDetails: latestDetails } if latestDetails exists and either
latestDetails.dagRunId === expectedDagRunId OR expectedDagRunId === 'latest').

In `@ui/src/features/dag-runs/components/dag-run-list/StepDetailsTooltip.tsx`:
- Around line 109-117: The tooltip clears details then sets isLoading false
before the hover delay, causing a “No running…” flash and allowing an older
aborted request to hide the spinner for a newer one; fix by setting
setIsLoading(true) immediately when scheduling the fetch (before/when calling
setTimeout via fetchTimerRef) so the loading state shows during the delay, and
in the async fetch handler use a local controller variable (const controller =
new AbortController(); controllerRef.current = controller) and only call
setIsLoading(false) in finally if controllerRef.current === controller to avoid
older requests clearing the spinner; apply the same change to the corresponding
block around lines 143-147.

In `@ui/src/features/dag-runs/hooks/useBoundedDAGRunDetails.ts`:
- Around line 129-153: The finally block currently contains return statements
which can suppress thrown exceptions; remove those returns and instead capture
necessary post-finally actions with local flags so the control flow happens
after the try/catch/finally completes. Specifically, inside the finally block
(where controllerRef, inFlightRef, mountedRef, setIsLoading and setIsValidating
are updated) only clear refs and update state, and read-and-clear
pendingRef.current into a local boolean (e.g., const hadPending =
pendingRef.current; pendingRef.current = false) and also read
enabledRef.current, targetRef.current, and pollIntervalRef.current into locals;
then after the try/catch/finally (outside it) use those locals: if mountedRef
indicates mounted and hadPending then call runFetch(); else if enabled/local
target/pollInterval indicate polling should continue then call
schedulePoll(runFetch). This preserves behavior while removing unsafe returns in
the finally block.

In `@ui/src/pages/dag-runs/dag-run/index.tsx`:
- Around line 94-96: The guard comparing expectedDagRunId to
latestDetails.dagRunId incorrectly treats the sentinel 'latest' as a literal id;
update the logic around expectedDagRunId and dagRunDetails (used where
expectedDagRunId, subDAGRunId, dagRunId, latestDetails are referenced) so that
if expectedDagRunId === 'latest' it is treated as a wildcard (i.e., accept the
response) — for example, set dagRunDetails = latestDetails if latestDetails
exists and (expectedDagRunId === 'latest' || latestDetails.dagRunId ===
expectedDagRunId).

---

Nitpick comments:
In `@ui/src/features/dag-runs/hooks/dagRunDetailsRequest.ts`:
- Around line 1-4: This file is missing the required GPL v3 license header; add
the standard GPL v3 license header to the top of the file (above the imports) by
running the repository tool: execute make addlicense so the header is applied
automatically, ensuring it precedes the existing imports and type export
(references: the import lines for components and fetchJson and the exported type
DAGRunDetails).

In `@ui/src/features/dag-runs/hooks/useBoundedDAGRunDetails.ts`:
- Around line 188-196: In useBoundedDAGRunDetails, remove the redundant
mountedRef.current = true assignment inside the useEffect since mountedRef is
already initialized to true and refs persist; keep the cleanup callback that
sets mountedRef.current = false and calls clearPollTimer(),
abortActiveRequest(), and sets pendingRef.current = false, leaving the effect
dependencies ([abortActiveRequest, clearPollTimer]) unchanged.
- Around line 1-8: This file lacks the required GPL v3 license header; run the
project's license tooling to add it (e.g., execute the make addlicense target)
so the header is inserted at the top of the module containing
useBoundedDAGRunDetails (the file that imports fetchDAGRunDetails and
isAbortLikeError and declares useBoundedDAGRunDetails). Ensure the header
matches the project's GPL v3 template used by addlicense and is placed before
any imports or code.

In `@ui/src/lib/requestTimeout.ts`:
- Around line 1-6: This file is missing the required GPL v3 license header; add
the standard GPL v3 header to the top of the file (above the existing
declarations like READ_METHODS, DEFAULT_READ_TIMEOUT_MS,
DEFAULT_WRITE_TIMEOUT_MS, and LIVE_INVALIDATION_COOLDOWN_MS) by running the
repository's license tool (make addlicense) or inserting the project's standard
header so the file complies with the project's licensing guidelines.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cd737f33-e229-43da-b2ee-5c56081a68fc

📥 Commits

Reviewing files that changed from the base of the PR and between c33cb46 and e6c6fda.

📒 Files selected for processing (18)
  • internal/persis/filedagrun/attempt.go
  • internal/persis/filedagrun/attempt_test.go
  • internal/persis/filedagrun/dagrun.go
  • internal/persis/filedagrun/dataroot.go
  • internal/persis/filedagrun/dataroot_test.go
  • internal/service/frontend/api/v1/dagruns.go
  • internal/service/frontend/api/v1/dagruns_internal_test.go
  • ui/src/App.tsx
  • ui/src/features/dag-runs/components/dag-run-details/DAGRunDetailsModal.tsx
  • ui/src/features/dag-runs/components/dag-run-details/DAGRunDetailsPanel.tsx
  • ui/src/features/dag-runs/components/dag-run-list/StepDetailsTooltip.tsx
  • ui/src/features/dag-runs/hooks/dagRunDetailsRequest.ts
  • ui/src/features/dag-runs/hooks/useBoundedDAGRunDetails.ts
  • ui/src/hooks/api.ts
  • ui/src/hooks/useAppLive.ts
  • ui/src/lib/fetchJson.ts
  • ui/src/lib/requestTimeout.ts
  • ui/src/pages/dag-runs/dag-run/index.tsx

Comment thread internal/service/frontend/api/v1/dagruns.go
Comment thread internal/service/frontend/api/v1/dagruns.go
Comment thread ui/src/features/dag-runs/components/dag-run-details/DAGRunDetailsModal.tsx Outdated
Comment thread ui/src/features/dag-runs/hooks/useBoundedDAGRunDetails.ts
Comment thread ui/src/pages/dag-runs/dag-run/index.tsx Outdated
@yottahmd yottahmd merged commit 7314d03 into dagucloud:main Mar 22, 2026
6 checks passed
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 22, 2026

Codecov Report

❌ Patch coverage is 76.00000% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.25%. Comparing base (ae62a46) to head (fe1f968).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
internal/persis/filedagrun/dataroot.go 63.63% 2 Missing and 2 partials ⚠️
internal/persis/filedagrun/attempt.go 83.33% 2 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1827      +/-   ##
==========================================
+ Coverage   69.21%   69.25%   +0.03%     
==========================================
  Files         429      429              
  Lines       51586    51600      +14     
==========================================
+ Hits        35706    35735      +29     
+ Misses      12857    12845      -12     
+ Partials     3023     3020       -3     
Files with missing lines Coverage Δ
internal/persis/filedagrun/dagrun.go 63.73% <100.00%> (+0.37%) ⬆️
internal/persis/filedagrun/attempt.go 72.81% <83.33%> (+0.70%) ⬆️
internal/persis/filedagrun/dataroot.go 68.80% <63.63%> (-0.37%) ⬇️

... and 18 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c33cb46...fe1f968. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant