Skip to content

Fix Critical Memory Leak in Session Replay and Refactor for Better Error Handling#1328

Merged
matux merged 5 commits into
masterfrom
matux/fix-queue-leak
Sep 24, 2025
Merged

Fix Critical Memory Leak in Session Replay and Refactor for Better Error Handling#1328
matux merged 5 commits into
masterfrom
matux/fix-queue-leak

Conversation

@matux

@matux matux commented Sep 24, 2025

Copy link
Copy Markdown
Contributor

Note

PR description written by Claude with my guidance and by analyzing the diff of this branch.

Problem

We discovered a critical memory leak in the Session Replay feature where replays were not being properly discarded when API errors occurred. This could lead to unbounded memory growth in production applications experiencing API failures.

Solution

This PR fixes the memory leak and significantly improves the Session Replay architecture through better separation of concerns, explicit error handling, and API consistency.

Key Changes

1. Memory Leak Fix 🚨

  • Before: Replays were only discarded on successful API responses with specific conditions, causing them to accumulate in memory on errors
  • After: Replays are always cleaned up - either sent successfully or explicitly discarded on any failure
  • Added _canSendReplay() and _sendOrDiscardReplay() methods to Queue for clearer logic

2. API Refactoring - dump()exportRecordingSpan()

  • Renamed Recorder.dump() to exportRecordingSpan() for consistency with exportTelemetrySpan()
  • Changed from returning a payload to being a pure side-effect function
  • Now accepts an attributes object instead of individual parameters for better extensibility
  • Throws errors instead of returning null for invalid recordings
  • Renamed _processReplay() to _exportSpansAndAddTracingPayload() for clarity
  • Made span export async to avoid blocking error submission

3. Improved Error Handling

  • Recording span export is now wrapped in try/catch with explicit error handling
  • Telemetry span is only exported if recording span succeeds
  • ReplayManager methods now throw errors instead of logging and returning false
  • Queue properly discards replays even when exceptions occur during API calls

4. Better Separation of Concerns

  • exportRecordingSpan() no longer returns a value - it purely exports the span
  • _processReplay() explicitly calls toPayload() after exporting spans
  • Cleaner flow: export recording → export telemetry → generate payload → store

5. Enhanced Test Coverage

  • Added comprehensive tests for memory leak prevention scenarios
  • Tests verify replays are discarded on API errors, network failures, and exceptions
  • Added test to verify the "end event" is properly added to recordings
  • Coverage for rate limiting, retries, and various failure modes

Technical Details

Files Changed:

  • src/queue.js - Fixed memory leak, added replay handling methods
  • src/browser/replay/recorder.js - Renamed dump to exportRecordingSpan, refactored
  • src/browser/replay/replayManager.js - Improved error handling, synchronous processing
  • 6 test files with extensive new test coverage

Breaking Changes: None - All changes are internal

Testing

  • ✅ All existing tests pass
  • ✅ Added 15+ new test cases covering error scenarios
  • ✅ Verified memory leak is fixed in multiple failure conditions
  • ✅ Tested retry logic and rate limiting behavior

Impact

This fix is critical for production stability. Applications using Session Replay that experience API errors will no longer accumulate replays in memory, preventing potential out-of-memory crashes.

@matux matux self-assigned this Sep 24, 2025
@matux matux requested a review from Copilot September 24, 2025 15:21

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors replay-related functionality with synchronous processing, improved error handling, and comprehensive test coverage for edge cases and memory leak prevention.

  • Replaces asynchronous replay processing with synchronous execution to improve error handling and control flow
  • Updates the Recorder API from dump() to exportRecordingSpan() with proper error boundaries and span export separation
  • Enhances error handling in Queue and ReplayManager with proper exception propagation instead of silent failures

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
test/replay/unit/replayManager.test.js Updates unit tests to match new synchronous API and error handling patterns
test/replay/unit/queue.replayManager.test.js Adds comprehensive tests for replay cleanup scenarios and renames methods to match new API
test/replay/integration/sessionRecording.test.js Updates integration tests to use new exportRecordingSpan API
test/replay/integration/replayManager.test.js Updates integration tests for synchronous processing and proper error propagation
test/replay/integration/queue.replayManager.test.js Adds extensive memory leak prevention tests and modernizes callback syntax
test/replay/integration/e2e.test.js Updates E2E tests for new recorder API and span ordering changes
test/browser.replay.recorder.test.js Updates recorder tests for new exportRecordingSpan API with proper error handling
src/queue.js Refactors replay handling with better error boundaries and cleanup logic
src/browser/replay/replayManager.js Changes from async to sync processing with proper error propagation
src/browser/replay/recorder.js Replaces dump() with exportRecordingSpan() and improves event collection logic

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment thread src/browser/replay/recorder.js
Comment thread test/replay/unit/replayManager.test.js Outdated
// Recording may be up to two checkout intervals, therefore the checkout
// interval is set to half of the maxSeconds.
return (this.options.maxSeconds || 10) * 1000 / 2;
return ((this.options.maxSeconds || 10) * 1000) / 2;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason either Prettier or ESLint decided to add parenthesis now. This shouldn't make any difference, but 🤷‍♂️ .

Comment on lines 74 to 85
/**
* Converts recorded events into a formatted payload ready for transport.
* Exports the recording span with all recorded events.
*
* This method takes the recorder's stored events, creates a new span with the
* provided tracing context, attaches all events with their timestamps as span
* events, and then returns a payload ready for transport to the server.
* events, and exports the span to the tracing exporter. This is a side-effect
* function that doesn't return anything - the span is exported internally.
*
* @param {Object} tracing - The tracing system instance to create spans
* @param {string} replayId - Unique identifier to associate with this replay recording
* @returns {Object|null} A formatted payload containing spans data in OTLP format, or null if no events exist
* @param {Object} attributes - Attributes to add to the span
* (e.g., rollbar.replay.id, rollbar.occurrence.uuid)
*/

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function now has the same shape as Telemeter's exportTelemetrySpan, and instead of returning the payload (which forces us to produce all export side-effects before calling this function), it now returns nothing, making it a pure side-effect like Telemeter's exportTelemetrySpan.

if (events.length < 2) {
logger.error('Replay recording cannot have less than 2 events');
return null;
if (events.length < 3) {

@matux matux Sep 24, 2025

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+ 1 since now we have the final ending event, which is being added in _collectEvents now.

*/
dump(tracing, replayId, occurrenceUuid) {
const events = this._events.previous.concat(this._events.current);
exportRecordingSpan(tracing, attributes = {}) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To follow exportTelemetrySpan, this function now accepts attributes instead of explicitly requiring to pass the occurrence uuid and the replay id, which is controlled by the ReplayManager now as it should.

Comment on lines 44 to 54
/**
* Processes a replay by converting recorder events into a transport-ready payload.
* Processes a replay by exporting spans and generating a transport-ready payload.
*
* Calls recorder.dump() to capture events as spans, formats them into a proper payload,
* and stores the result in the map using replayId as the key.
* Exports both telemetry and recording spans, then generates the complete payload
* using the tracing exporter and stores it in the map using replayId as the key.
*
* @param {string} replayId - The unique ID for this replay
* @returns {Promise<string>} A promise resolving to the processed replayId
* @param {string} occurrenceUuid - The UUID of the associated error occurrence
* @returns {string|null} The replayId if successful, or null if an error occurred
* @private
*/

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed a couple of things in this function:

  1. exportTelemetrySpan should happen after exporting the recording span since the recording span export can throw if the recording happens to be invalid. If this happens, we don't want the telemetry span to end. We want to bail out from the whole situation.
  2. Recorder.dump doesn't exist anymore and now uses the same strategy as Telemeter.exportTelemetrySpan, this is useful because it not only makes our code simpler and more consistent, but it allows us to control the order of exports in order to fix the problem described in 1.
  3. ReplayManager is now in charge of calling tracing's toPayload to generate the payload necessary instead of Recording.
  4. If the _processReplay process fails, we don't return a replay id anymore, we return null. This is because the replay doesn't exist if this function fails. So we end up with a replay id that points at nothing.
  5. We don't map the replayId to null in our Map collection if processing the replay fails, since the replay doesn't exist and we have no logic to make any use of null replays.
  6. This function isn't async anymore because it doesn't have to be.

Comment on lines 74 to 82
/**
* Adds a replay to the map and returns a uniquely generated replay ID.
*
* This method immediately returns the replayId and asynchronously processes
* the replay data in the background. The processing involves converting
* recorder events into a payload format and storing it in the map.
* The processing involves converting recorder events into a payload format
* and storing it in the map.
*
* @returns {string} A unique identifier for this replay
* @returns {string|null} A unique identifier for this replay, or null if an
* error occurred
*/

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function now returns the replay id if we were able to add it, or null if we didn't (signaling the replay doesn't exist).

The catch in _processReplay did nothing so it was removed together with making the function non-async.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_processReplay is async so that queue.add_item doesn't have to wait for this work to happen before sending the occurrence payload.

The main benefit of making it sync is so the id can be omitted from the occurrence payload, but the great majority of replays not sent will be due to API rate limits and account quotas, which won't be known until send is called. So there's limited benefit to making this sync.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I was too focused on error handling and I didn't see that. I understand, _processReplay happens asynchronously, we assume the replay was exported successfully, but at the point of sending, if anything went wrong during process, we will just not find the replay now (before, we'd actually end up in an invalid state with a null replay).

I'll make _processReplay async again, no trouble!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the benefit you mention would be the fact that, if the processing fails, the replayId in item wouldn't be set. So the occurrence wouldn't be mapped to a replay that doesn't actually exist.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be interesting that in those cases (which I've seen happen a lot, though not enough to justify not making this async), to actually send a replay span that's basically an error span, signaling that the replay failed to be processed. 🤔

We talked about this, that's why I had the "error span" comment when setting the replayId to null in the map if the replay processing fails.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be an error event on another span. I don't know that we'd want replay spans that aren't actually replays.

@@ -103,15 +99,11 @@
*/
async send(replayId) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The send function doesn't return either true or false, but throws if there was a problem sending the replay. This makes it much easier to control the problematic flows to discard replays that fail to send.

Comment thread src/queue.js
Comment on lines 104 to 110
if (this.replayManager && data.body) {
const replayId = data?.attributes?.find(
(a) => a.key === 'replay_id',
)?.value;
const rid = data.attributes?.find((a) => a.key === 'replay_id')?.value;

if (replayId) {
item.replayId = this.replayManager.add(replayId, data.uuid);
if (rid) {
item.replayId = this.replayManager.add(rid, data.uuid);
}
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed replayId to rid just because I felt the find was easier to read in a single line.

Comment thread src/queue.js
Comment on lines +123 to +130
} catch (err) {
this._dequeuePendingRequest(data);
callback(e);

if (item.replayId) {
this.replayManager?.discard(item.replayId);
}

callback(err);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find a more graceful way to discard the replay if makeApiRequest throws (I wanted to control the discard from a single place: the new _sendOrDiscardReplay function. But, oh well.

Comment thread src/queue.js
Comment on lines +333 to 345
/**
* Sends or discards a replay based on the API response.
*
* Sends the replay if the response indicates success and replay is enabled.
*
* Discards the replay if there's an error, replay is disabled, or rate limit
* is exceeded.
*
* @param {string} replayId - The ID of the replay to handle
* @param {string} replayId - The ID of the replay to send or discard
* @param {Object} err - The error object from the API request, if any
* @param {Object} response - The API response
* @param {Object} headers - The response headers
* @returns {Promise<boolean>} A promise that resolves to true if replay was sent successfully,
* false if replay was discarded or an error occurred
*/

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is now much simpler and renamed it to make it more explicit.

Basically:

  1. If we can't send the replay, discard the replay.
  2. If we send the replay and fails, discard the replay.

This function is async but the promise doesn't resolve to any value since we don't need to.

@matux matux changed the title Replay related fixes and refactoring + new tests Fix Critical Memory Leak in Session Replay and Refactor for Better Error Handling Sep 24, 2025
@matux matux marked this pull request as ready for review September 24, 2025 18:04
@matux matux force-pushed the matux/fix-queue-leak branch from 90425f3 to 40cd8ab Compare September 24, 2025 18:51
@matux matux requested a review from waltjones September 24, 2025 18:51

@waltjones waltjones left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Nice improvement!

@matux matux merged commit b99de9b into master Sep 24, 2025
4 checks passed
@matux matux deleted the matux/fix-queue-leak branch September 24, 2025 20:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants