Skip to content

feat(#10794): adds replication failure log#10823

Open
dianabarsan wants to merge 26 commits intomasterfrom
10794-replication-fails
Open

feat(#10794): adds replication failure log#10823
dianabarsan wants to merge 26 commits intomasterfrom
10794-replication-fails

Conversation

@dianabarsan
Copy link
Copy Markdown
Member

@dianabarsan dianabarsan commented Apr 7, 2026

Description

  • Moves all replication related api-services into a new replication dedicated folder. The list became huge!
  • Add replication failure logging to /api/v1/replication/get-ids, matching the existing setup on the initial-replication route
  • Each failure records timestamp, status code, duration, request ID, roles, and subjects count (if available)
  • Failures are stored per-user per-month in the logs DB, capped at 50 detailed entries with a running total count
  • Captures both server errors and client cancellations
  • Add e2e tests for failure logging, accumulation, monthly separation, per-user separation, and the 50-entry cap
  • Adds api endpoint to get replication failures for a given month. Defaults to the current month.

closes #10794

AI Disclosure

This is developed with the help of Claude. I have made all design decisions, reviewed and corrected all code.

Code review checklist

  • UI/UX backwards compatible: Test it works for the new design (enabled by default). And test it works in the old design, enable can_view_old_navigation permission to see the old design. Test it has appropriate design for RTL languages.
  • Readable: Concise, well named, follows the style guide
  • Documented: Configuration and user documentation on cht-docs
  • Tested: Unit and/or e2e where appropriate
  • Internationalised: All user facing text
  • Backwards compatible: Works with existing data and configuration or includes a migration. Any breaking changes documented in the release notes.
  • AI disclosure: Please disclose use of AI per the guidelines.

Compose URLs

If Build CI hasn't passed, these may 404:

License

The software is provided under AGPL-3.0. Contributions to this project are accepted under the same license.

Signed-off-by: Diana Barsan <[email protected]>
# Conflicts:
#	api/tests/mocha/controllers/replication-limit-log.spec.js
#	api/tests/mocha/controllers/users.spec.js
# Conflicts:
#	api/src/middleware/authorization.js
#	api/src/services/replication/bulk-docs.js
#	api/tests/mocha/controllers/users.spec.js
#	api/tests/mocha/middleware/authorization.spec.js
#	tests/integration/api/controllers/replication.spec.js
Signed-off-by: Diana Barsan <[email protected]>
… for replication failure log service

Signed-off-by: Diana Barsan <[email protected]>
# Conflicts:
#	tests/integration/api/controllers/replication.spec.js
… comments for nouveau behavior in k3d

Signed-off-by: Diana Barsan <[email protected]>
Signed-off-by: Diana Barsan <[email protected]>
@dianabarsan dianabarsan requested a review from jkuester April 21, 2026 10:05
@dianabarsan
Copy link
Copy Markdown
Member Author

I know your review queue is long, but i have to ask for another one. I appreciate the time you spend on this review.

Copy link
Copy Markdown
Contributor

@jkuester jkuester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! I left a bunch of random thoughts inline.

Have you thought at all about if we should report a total value of these failures via the /monitoring api? Maybe even just a count of the distinct users that have replication failures.

Comment thread api/src/services/replication/authorization.js Outdated
Comment thread api/src/middleware/authorization.js Outdated
Comment thread api/src/controllers/replication-failure-log.js Outdated
Comment thread api/src/controllers/replication-failure-log.js Outdated
Comment thread api/src/controllers/replication-failure-log.js Outdated
Comment thread api/src/services/replication/replication-failure-log.js Outdated
Comment thread api/src/controllers/replication-failure-log.js Outdated
Comment thread tests/integration/api/controllers/replication-failure-log.spec.js
Comment thread tests/integration/api/controllers/replication-failure-log.spec.js Outdated
Comment thread tests/integration/api/controllers/replication-failure-log.spec.js
@mrjones-plip
Copy link
Copy Markdown
Contributor

if we should report a total value of these failures via the /monitoring api

a million times yes!! But also, I can wait 'til another iteration if it delays this PR/is too much feature creep.

@dianabarsan
Copy link
Copy Markdown
Member Author

Hi @jkuester

A bunch of changes occured, mostly as part of your review feedback:

  • API

    • month -> reporting_period
    • cht-datasource-like cursor based pagination and response style (to keep PR small i didn't make changes to datasource, but added a small lib into api)
    • user is now a query param and I improved performance for getting all user failed periods
  • Doc format

    • timestamp -> date to match connected user logs
    • added docs_count and unpurged_docs_count along subjects_count, to show how far the request progressed. all these default to unknown.
  • Routing

    • I finally understood the confusion about endpoints, because there were two :D. initial-replication/get-ids was added in 4.2 when we did the initial replication refactor, and in 4.4 became deprecated with replication/get-ids added with the Nairobi Protocol. We left initial-replication/get-ids in case older clients tried to replicate and then in 5.0 we forgot to remove the endpoint. Because updating from 4.2 to 5.x requires a step to 4.5 (?), it is now completely safe to be removed, so I did.

I hope I didn't miss anything. Review at your discretion. Appreciate your time and thoughts!

@dianabarsan dianabarsan requested a review from jkuester May 6, 2026 12:52
Copy link
Copy Markdown
Contributor

@jkuester jkuester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple minor suggestions/thoughts, but this is good to go! 👍 LGTM

Comment thread api/src/routing.js
Comment on lines -916 to -921
app.get(
'/api/v1/initial-replication/get-ids',
authorization.handleAuthErrors,
authorization.onlineUserPassThrough,
replication.getDocIds,
);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Image

* summary: Get replication failure logs
* operationId: v1ReplicationFailureLogsGet
* description: >
* Returns a paginated page of full replication failure log documents. Optionally filter by `user`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Returns a paginated page of full replication failure log documents. Optionally filter by `user`
* Returns a page of replication failure log documents. Optionally filter by `user`

Comment on lines +14 to +16
// Counts are only set on userCtx as each phase of the request completes. When a count is missing
// we record 'unknown' instead of omitting the key, so a stable shape on the failure entry tells you
// (by which counters are 'unknown') how far the request progressed before failing.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't have to change anything here, but setting unknown in these cases does not seem to make it any easier for consumer code to parse the data from the doc when compared to setting:

  • "",
  • null
  • undefined

I kind of understand wanting to keep the object shape "stable", but I am not really sure how much value there is in that when we don't have any meaningful default value to use... 🤷

Comment on lines +134 to +141
const oldestPeriod = await findOldestReportingPeriod();
if (!oldestPeriod) {
return { data: [], cursor: null };
}

// Username is the doc-id suffix, preventing prefix-scan; bulk-fetch the user's candidate keys (max 60).
const candidateKeys = enumerateReportingPeriods(oldestPeriod).map(period => getDocId(user, period));
const result = await db.medicLogs.allDocs({ keys: candidateKeys, include_docs: true });
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a strong reason we cannot just index these docs by user? This is kind of a crazy amount of logic to have/maintain (even if the performance is okay). It is fancy that all of this actually works, but at the same time it really smells like a premature optimization.

const getUserFailureLog = async (username) => {
const response = await getFailureLogs({ user: username });
const currentPeriodId = getFailureLogId(username);
return response.data.find(log => log._id === currentPeriodId) || response.data[0] || null;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, but the currentPeriodId logic here feels like surprizing behavior that might be confusing to debug later. (When we call this function are we expecting a doc for the current period or are we expecting to just get back the first one? How can someone reading the code know?) Would rather just have two dedicated functions...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement failed user replication tracking

3 participants