Fix Critical Memory Leak in Session Replay and Refactor for Better Error Handling by matux · Pull Request #1328 · rollbar/rollbar.js

matux · 2025-09-24T15:20:53Z

Note

PR description written by Claude with my guidance and by analyzing the diff of this branch.

Problem

We discovered a critical memory leak in the Session Replay feature where replays were not being properly discarded when API errors occurred. This could lead to unbounded memory growth in production applications experiencing API failures.

Solution

This PR fixes the memory leak and significantly improves the Session Replay architecture through better separation of concerns, explicit error handling, and API consistency.

Key Changes

1. Memory Leak Fix 🚨

Before: Replays were only discarded on successful API responses with specific conditions, causing them to accumulate in memory on errors
After: Replays are always cleaned up - either sent successfully or explicitly discarded on any failure
Added _canSendReplay() and _sendOrDiscardReplay() methods to Queue for clearer logic

2. API Refactoring - `dump()` → `exportRecordingSpan()`

Renamed Recorder.dump() to exportRecordingSpan() for consistency with exportTelemetrySpan()
Changed from returning a payload to being a pure side-effect function
Now accepts an attributes object instead of individual parameters for better extensibility
Throws errors instead of returning null for invalid recordings
Renamed _processReplay() to _exportSpansAndAddTracingPayload() for clarity
Made span export async to avoid blocking error submission

3. Improved Error Handling

Recording span export is now wrapped in try/catch with explicit error handling
Telemetry span is only exported if recording span succeeds
ReplayManager methods now throw errors instead of logging and returning false
Queue properly discards replays even when exceptions occur during API calls

4. Better Separation of Concerns

exportRecordingSpan() no longer returns a value - it purely exports the span
_processReplay() explicitly calls toPayload() after exporting spans
Cleaner flow: export recording → export telemetry → generate payload → store

5. Enhanced Test Coverage

Added comprehensive tests for memory leak prevention scenarios
Tests verify replays are discarded on API errors, network failures, and exceptions
Added test to verify the "end event" is properly added to recordings
Coverage for rate limiting, retries, and various failure modes

Technical Details

Files Changed:

src/queue.js - Fixed memory leak, added replay handling methods
src/browser/replay/recorder.js - Renamed dump to exportRecordingSpan, refactored
src/browser/replay/replayManager.js - Improved error handling, synchronous processing
6 test files with extensive new test coverage

Breaking Changes: None - All changes are internal

Testing

✅ All existing tests pass
✅ Added 15+ new test cases covering error scenarios
✅ Verified memory leak is fixed in multiple failure conditions
✅ Tested retry logic and rate limiting behavior

Impact

This fix is critical for production stability. Applications using Session Replay that experience API errors will no longer accumulate replays in memory, preventing potential out-of-memory crashes.

Copilot

Pull Request Overview

This PR refactors replay-related functionality with synchronous processing, improved error handling, and comprehensive test coverage for edge cases and memory leak prevention.

Replaces asynchronous replay processing with synchronous execution to improve error handling and control flow
Updates the Recorder API from dump() to exportRecordingSpan() with proper error boundaries and span export separation
Enhances error handling in Queue and ReplayManager with proper exception propagation instead of silent failures

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
test/replay/unit/replayManager.test.js	Updates unit tests to match new synchronous API and error handling patterns
test/replay/unit/queue.replayManager.test.js	Adds comprehensive tests for replay cleanup scenarios and renames methods to match new API
test/replay/integration/sessionRecording.test.js	Updates integration tests to use new exportRecordingSpan API
test/replay/integration/replayManager.test.js	Updates integration tests for synchronous processing and proper error propagation
test/replay/integration/queue.replayManager.test.js	Adds extensive memory leak prevention tests and modernizes callback syntax
test/replay/integration/e2e.test.js	Updates E2E tests for new recorder API and span ordering changes
test/browser.replay.recorder.test.js	Updates recorder tests for new exportRecordingSpan API with proper error handling
src/queue.js	Refactors replay handling with better error boundaries and cleanup logic
src/browser/replay/replayManager.js	Changes from async to sync processing with proper error propagation
src/browser/replay/recorder.js	Replaces dump() with exportRecordingSpan() and improves event collection logic

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

matux · 2025-09-24T15:26:18Z

    // Recording may be up to two checkout intervals, therefore the checkout
    // interval is set to half of the maxSeconds.
-    return (this.options.maxSeconds || 10) * 1000 / 2;
+    return ((this.options.maxSeconds || 10) * 1000) / 2;


For some reason either Prettier or ESLint decided to add parenthesis now. This shouldn't make any difference, but 🤷‍♂️ .

matux · 2025-09-24T15:28:49Z

  /**
-   * Converts recorded events into a formatted payload ready for transport.
+   * Exports the recording span with all recorded events.
   *
   * This method takes the recorder's stored events, creates a new span with the
   * provided tracing context, attaches all events with their timestamps as span
-   * events, and then returns a payload ready for transport to the server.
+   * events, and exports the span to the tracing exporter. This is a side-effect
+   * function that doesn't return anything - the span is exported internally.
   *
   * @param {Object} tracing - The tracing system instance to create spans
-   * @param {string} replayId - Unique identifier to associate with this replay recording
-   * @returns {Object|null} A formatted payload containing spans data in OTLP format, or null if no events exist
+   * @param {Object} attributes - Attributes to add to the span
+   *  (e.g., rollbar.replay.id, rollbar.occurrence.uuid)
   */


This function now has the same shape as Telemeter's exportTelemetrySpan, and instead of returning the payload (which forces us to produce all export side-effects before calling this function), it now returns nothing, making it a pure side-effect like Telemeter's exportTelemetrySpan.

matux · 2025-09-24T15:30:31Z

-    if (events.length < 2) {
-      logger.error('Replay recording cannot have less than 2 events');
-      return null;
+    if (events.length < 3) {


+ 1 since now we have the final ending event, which is being added in _collectEvents now.

matux · 2025-09-24T15:31:22Z

   */
-  dump(tracing, replayId, occurrenceUuid) {
-    const events = this._events.previous.concat(this._events.current);
+  exportRecordingSpan(tracing, attributes = {}) {


To follow exportTelemetrySpan, this function now accepts attributes instead of explicitly requiring to pass the occurrence uuid and the replay id, which is controlled by the ReplayManager now as it should.

matux · 2025-09-24T16:47:46Z

  /**
-   * Processes a replay by converting recorder events into a transport-ready payload.
+   * Processes a replay by exporting spans and generating a transport-ready payload.
   *
-   * Calls recorder.dump() to capture events as spans, formats them into a proper payload,
-   * and stores the result in the map using replayId as the key.
+   * Exports both telemetry and recording spans, then generates the complete payload
+   * using the tracing exporter and stores it in the map using replayId as the key.
   *
   * @param {string} replayId - The unique ID for this replay
-   * @returns {Promise<string>} A promise resolving to the processed replayId
+   * @param {string} occurrenceUuid - The UUID of the associated error occurrence
+   * @returns {string|null} The replayId if successful, or null if an error occurred
   * @private
   */


Fixed a couple of things in this function:

exportTelemetrySpan should happen after exporting the recording span since the recording span export can throw if the recording happens to be invalid. If this happens, we don't want the telemetry span to end. We want to bail out from the whole situation.

Recorder.dump doesn't exist anymore and now uses the same strategy as Telemeter.exportTelemetrySpan, this is useful because it not only makes our code simpler and more consistent, but it allows us to control the order of exports in order to fix the problem described in 1.

ReplayManager is now in charge of calling tracing's toPayload to generate the payload necessary instead of Recording.

If the _processReplay process fails, we don't return a replay id anymore, we return null. This is because the replay doesn't exist if this function fails. So we end up with a replay id that points at nothing.

We don't map the replayId to null in our Map collection if processing the replay fails, since the replay doesn't exist and we have no logic to make any use of null replays.

This function isn't async anymore because it doesn't have to be.

matux · 2025-09-24T16:48:57Z

  /**
   * Adds a replay to the map and returns a uniquely generated replay ID.
   *
-   * This method immediately returns the replayId and asynchronously processes
-   * the replay data in the background. The processing involves converting
-   * recorder events into a payload format and storing it in the map.
+   * The processing involves converting recorder events into a payload format
+   * and storing it in the map.
   *
-   * @returns {string} A unique identifier for this replay
+   * @returns {string|null} A unique identifier for this replay, or null if an
+   *  error occurred
   */


This function now returns the replay id if we were able to add it, or null if we didn't (signaling the replay doesn't exist).

The catch in _processReplay did nothing so it was removed together with making the function non-async.

_processReplay is async so that queue.add_item doesn't have to wait for this work to happen before sending the occurrence payload.

The main benefit of making it sync is so the id can be omitted from the occurrence payload, but the great majority of replays not sent will be due to API rate limits and account quotas, which won't be known until send is called. So there's limited benefit to making this sync.

Right, I was too focused on error handling and I didn't see that. I understand, _processReplay happens asynchronously, we assume the replay was exported successfully, but at the point of sending, if anything went wrong during process, we will just not find the replay now (before, we'd actually end up in an invalid state with a null replay).

I'll make _processReplay async again, no trouble!

And the benefit you mention would be the fact that, if the processing fails, the replayId in item wouldn't be set. So the occurrence wouldn't be mapped to a replay that doesn't actually exist.

It could be interesting that in those cases (which I've seen happen a lot, though not enough to justify not making this async), to actually send a replay span that's basically an error span, signaling that the replay failed to be processed. 🤔

We talked about this, that's why I had the "error span" comment when setting the replayId to null in the map if the replay processing fails.

Could be an error event on another span. I don't know that we'd want replay spans that aren't actually replays.

matux · 2025-09-24T16:50:13Z

@@ -103,15 +99,11 @@
   */
  async send(replayId) {


The send function doesn't return either true or false, but throws if there was a problem sending the replay. This makes it much easier to control the problematic flows to discard replays that fail to send.

matux · 2025-09-24T16:51:01Z

    if (this.replayManager && data.body) {
-      const replayId = data?.attributes?.find(
-        (a) => a.key === 'replay_id',
-      )?.value;
+      const rid = data.attributes?.find((a) => a.key === 'replay_id')?.value;

-      if (replayId) {
-        item.replayId = this.replayManager.add(replayId, data.uuid);
+      if (rid) {
+        item.replayId = this.replayManager.add(rid, data.uuid);
      }
    }


Renamed replayId to rid just because I felt the find was easier to read in a single line.

matux · 2025-09-24T16:51:55Z

+    } catch (err) {
      this._dequeuePendingRequest(data);
-      callback(e);
+
+      if (item.replayId) {
+        this.replayManager?.discard(item.replayId);
+      }
+
+      callback(err);


I couldn't find a more graceful way to discard the replay if makeApiRequest throws (I wanted to control the discard from a single place: the new _sendOrDiscardReplay function. But, oh well.

matux · 2025-09-24T16:54:54Z

+  /**
+   * Sends or discards a replay based on the API response.
+   *
+   * Sends the replay if the response indicates success and replay is enabled.
+   *
+   * Discards the replay if there's an error, replay is disabled, or rate limit
+   * is exceeded.
   *
-   * @param {string} replayId - The ID of the replay to handle
+   * @param {string} replayId - The ID of the replay to send or discard
+   * @param {Object} err - The error object from the API request, if any
   * @param {Object} response - The API response
   * @param {Object} headers - The response headers
-   * @returns {Promise<boolean>} A promise that resolves to true if replay was sent successfully,
-   *                             false if replay was discarded or an error occurred
   */


This function is now much simpler and renamed it to make it more explicit.

Basically:

If we can't send the replay, discard the replay.

If we send the replay and fails, discard the replay.

This function is async but the promise doesn't resolve to any value since we don't need to.

waltjones

✅ Nice improvement!

matux self-assigned this Sep 24, 2025

matux requested a review from Copilot September 24, 2025 15:21

Copilot AI reviewed Sep 24, 2025

View reviewed changes

Comment thread src/browser/replay/recorder.js

Comment thread test/replay/unit/replayManager.test.js Outdated

matux commented Sep 24, 2025

View reviewed changes

matux changed the title ~~Replay related fixes and refactoring + new tests~~ Fix Critical Memory Leak in Session Replay and Refactor for Better Error Handling Sep 24, 2025

matux marked this pull request as ready for review September 24, 2025 18:04

matux added 4 commits September 24, 2025 15:50

Replay related fixes and refactoring + new tests

11cc408

changes

fc04983

Some more changes

bc5a2f0

ReplayManager.add now returns immediately, span exporting is done async

40cd8ab

matux force-pushed the matux/fix-queue-leak branch from 90425f3 to 40cd8ab Compare September 24, 2025 18:51

matux requested a review from waltjones September 24, 2025 18:51

Add comment

228e395

waltjones approved these changes Sep 24, 2025

View reviewed changes

matux merged commit b99de9b into master Sep 24, 2025
4 checks passed

matux deleted the matux/fix-queue-leak branch September 24, 2025 20:04

Uh oh!

Conversation

matux commented Sep 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Solution

Key Changes

1. Memory Leak Fix 🚨

2. API Refactoring - dump() → exportRecordingSpan()

3. Improved Error Handling

4. Better Separation of Concerns

5. Enhanced Test Coverage

Technical Details

Testing

Impact

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

matux Sep 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

waltjones left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

matux commented Sep 24, 2025 •

edited

Loading

2. API Refactoring - `dump()` → `exportRecordingSpan()`

matux Sep 24, 2025 •

edited

Loading