Skip to content

fix(auth): persist OAuth state for restart resilience#6034

Closed
clubanderson wants to merge 1 commit intofix/6022-6024-6026-securityfrom
fix/6028-oauth-persistence
Closed

fix(auth): persist OAuth state for restart resilience#6034
clubanderson wants to merge 1 commit intofix/6022-6024-6026-securityfrom
fix/6028-oauth-persistence

Conversation

@clubanderson
Copy link
Copy Markdown
Collaborator

Fixes #6028. Stacked on #6031 (base branch will auto-update to `main` once #6031 merges).

Problem

`pkg/api/handlers/auth.go` stored OAuth CSRF state tokens in an in-memory `map[string]time.Time`. If the backend restarted between `/auth/login` (where the state is generated and the user is redirected to GitHub) and `/auth/github/callback` (where the state must be validated), the map was gone and every in-flight user hit `csrf_validation_failed`.

Fix

Persist OAuth state in the backing store so in-flight flows survive restarts, as long as the user completes them within `oauthStateExpiration` (10 minutes, unchanged).

  • New table `oauth_states (state PRIMARY KEY, created_at, expires_at)` with an index on `expires_at`, added to `pkg/store/sqlite.go` `migrate()` next to `revoked_tokens`.
  • Store interface gains three additive methods:
    • `StoreOAuthState(state, ttl)`
    • `ConsumeOAuthState(state) (bool, error)` — single transaction that `SELECT`s, `DELETE`s, and returns whether the row existed and had not expired. Always deletes so a row cannot be reused even on the expired/invalid path.
    • `CleanupExpiredOAuthStates() (int64, error)`
  • auth.go:
    • Removed the global `oauthStateStore` map, its `sync.RWMutex`, the package-level `init()` goroutine, `purgeExpiredOAuthStates`, and `inlineCleanupThreshold`.
    • `storeOAuthState` / `validateAndConsumeOAuthState` are now methods on `*AuthHandler` delegating to the store.
    • `NewAuthHandler` starts a per-handler cleanup ticker (`oauthStateCleanupInterval = 5m`) that calls `CleanupExpiredOAuthStates`.
    • `GitHubLogin` now returns `oauth_state_store_failed` if the DB write fails (new, narrow error path).
  • MockStore in `pkg/test/mock_store.go` gains overridable implementations of all three methods.

Tests

  • `pkg/store/oauth_states_test.go` — round-trip, single-use semantics, unknown state, expired state (including rejection on retry), bulk cleanup, and a reopen-the-same-SQLite-file restart test.
  • `pkg/api/handlers/auth_test.go` — three new tests using a real `SQLiteStore`:
  • Fixed `TestGitHubLogin_Redirects` to pass a real store instead of `nil` (storeOAuthState now requires one).

Verification

```
go build ./... # clean
go vet ./... # clean
go test ./pkg/store/... ./pkg/api/handlers/... # PASS
cd web && npm run build # PASS
```

Notes

  • No magic numbers — all intervals/TTLs go through the existing `oauthStateExpiration` / `oauthStateCleanupInterval` constants.
  • No other handlers in `auth.go` were touched.
  • The DB-backed approach is consistent with the existing `revoked_tokens` pattern added for token revocation.

Replace the in-memory oauthStateStore map with a persistent oauth_states
table so in-flight OAuth flows survive a backend restart between
/auth/login and /auth/callback. Previously a restart between those two
redirects would drop every pending state token and users would see
csrf_validation_failed on the callback.

- New oauth_states table + index in pkg/store/sqlite.go migrate()
- Store interface gains StoreOAuthState / ConsumeOAuthState /
  CleanupExpiredOAuthStates (atomic single-transaction consume)
- auth.go helpers become methods on *AuthHandler that delegate to the
  store; the in-memory map, global init goroutine, and inlineCleanup
  threshold are removed
- NewAuthHandler starts a per-handler cleanup ticker that sweeps
  expired rows every oauthStateCleanupInterval
- MockStore gains overridable StoreOAuthState / ConsumeOAuthState
- Tests: store-level round-trip, single-use, expiry, cleanup, and a
  reopen-the-same-db "restart" test; handler-level restart test that
  instantiates two AuthHandlers against the same SQLite file

Fixes #6028

Signed-off-by: Andrew Anderson <[email protected]>
@kubestellar-prow
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign eeshaansa for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@clubanderson clubanderson added the ai-generated Pull request generated by AI label Apr 10, 2026
@kubestellar-prow kubestellar-prow Bot added the dco-signoff: yes Indicates the PR's author has signed the DCO. label Apr 10, 2026
@github-actions
Copy link
Copy Markdown
Contributor

👋 Hey @clubanderson — thanks for opening this PR!

🤖 This project is developed exclusively using AI coding assistants.

Please do not attempt to code anything for this project manually.
All contributions should be authored using an AI coding tool such as:

This ensures consistency in code style, architecture patterns, test coverage,
and commit quality across the entire codebase.


This is an automated message.

@kubestellar-prow kubestellar-prow Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 10, 2026
@kubestellar-prow kubestellar-prow Bot deleted the branch fix/6022-6024-6026-security April 10, 2026 12:36
@kubestellar-prow kubestellar-prow Bot closed this Apr 10, 2026
clubanderson added a commit that referenced this pull request Apr 10, 2026
…covery, oauth_configured requires secret

Fixes three backend auth edges:

- #6056: /health reported oauth_configured=true when only GITHUB_CLIENT_ID
  was set but GITHUB_CLIENT_SECRET was missing. Such a config is unusable
  for the token exchange, so the probe misled the frontend into showing
  a GitHub login button that was guaranteed to fail. The check now
  requires both values.

- #6063: A structurally malformed Authorization header (no Bearer prefix,
  'Bearer' with no token, whitespace-only) returned 401 immediately even
  when a perfectly valid kc_auth cookie was attached. The middleware now
  treats a malformed header the same as an empty header and falls through
  to the cookie path. Complements #6026 (stale-but-valid-format header
  fallback already landed in #6031).

- #6064: When validateAndConsumeOAuthState failed (stale OAuth tab,
  server restart that cleared the in-memory state store, back-button
  replay) the callback unconditionally redirected to
  /login?error=csrf_validation_failed even if the user already had a
  valid kc_auth cookie. The callback now checks for a non-expired,
  non-revoked JWT cookie on state failure and, if present, redirects to
  / so the live session is preserved. Complements #6034 which fixes the
  root cause of state loss on the server side.

Signed-off-by: Andrew Anderson <[email protected]>
clubanderson added a commit that referenced this pull request Apr 10, 2026
…covery, oauth_configured requires secret (#6076)

Fixes three backend auth edges:

- #6056: /health reported oauth_configured=true when only GITHUB_CLIENT_ID
  was set but GITHUB_CLIENT_SECRET was missing. Such a config is unusable
  for the token exchange, so the probe misled the frontend into showing
  a GitHub login button that was guaranteed to fail. The check now
  requires both values.

- #6063: A structurally malformed Authorization header (no Bearer prefix,
  'Bearer' with no token, whitespace-only) returned 401 immediately even
  when a perfectly valid kc_auth cookie was attached. The middleware now
  treats a malformed header the same as an empty header and falls through
  to the cookie path. Complements #6026 (stale-but-valid-format header
  fallback already landed in #6031).

- #6064: When validateAndConsumeOAuthState failed (stale OAuth tab,
  server restart that cleared the in-memory state store, back-button
  replay) the callback unconditionally redirected to
  /login?error=csrf_validation_failed even if the user already had a
  valid kc_auth cookie. The callback now checks for a non-expired,
  non-revoked JWT cookie on state failure and, if present, redirects to
  / so the live session is preserved. Complements #6034 which fixes the
  root cause of state loss on the server side.

Signed-off-by: Andrew Anderson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai-generated Pull request generated by AI dco-signoff: yes Indicates the PR's author has signed the DCO. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant