Fix Critical Memory Leak in Session Replay and Refactor for Better Error Handling#1328
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR refactors replay-related functionality with synchronous processing, improved error handling, and comprehensive test coverage for edge cases and memory leak prevention.
- Replaces asynchronous replay processing with synchronous execution to improve error handling and control flow
- Updates the Recorder API from
dump()toexportRecordingSpan()with proper error boundaries and span export separation - Enhances error handling in Queue and ReplayManager with proper exception propagation instead of silent failures
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| test/replay/unit/replayManager.test.js | Updates unit tests to match new synchronous API and error handling patterns |
| test/replay/unit/queue.replayManager.test.js | Adds comprehensive tests for replay cleanup scenarios and renames methods to match new API |
| test/replay/integration/sessionRecording.test.js | Updates integration tests to use new exportRecordingSpan API |
| test/replay/integration/replayManager.test.js | Updates integration tests for synchronous processing and proper error propagation |
| test/replay/integration/queue.replayManager.test.js | Adds extensive memory leak prevention tests and modernizes callback syntax |
| test/replay/integration/e2e.test.js | Updates E2E tests for new recorder API and span ordering changes |
| test/browser.replay.recorder.test.js | Updates recorder tests for new exportRecordingSpan API with proper error handling |
| src/queue.js | Refactors replay handling with better error boundaries and cleanup logic |
| src/browser/replay/replayManager.js | Changes from async to sync processing with proper error propagation |
| src/browser/replay/recorder.js | Replaces dump() with exportRecordingSpan() and improves event collection logic |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| // Recording may be up to two checkout intervals, therefore the checkout | ||
| // interval is set to half of the maxSeconds. | ||
| return (this.options.maxSeconds || 10) * 1000 / 2; | ||
| return ((this.options.maxSeconds || 10) * 1000) / 2; |
There was a problem hiding this comment.
For some reason either Prettier or ESLint decided to add parenthesis now. This shouldn't make any difference, but 🤷♂️ .
| /** | ||
| * Converts recorded events into a formatted payload ready for transport. | ||
| * Exports the recording span with all recorded events. | ||
| * | ||
| * This method takes the recorder's stored events, creates a new span with the | ||
| * provided tracing context, attaches all events with their timestamps as span | ||
| * events, and then returns a payload ready for transport to the server. | ||
| * events, and exports the span to the tracing exporter. This is a side-effect | ||
| * function that doesn't return anything - the span is exported internally. | ||
| * | ||
| * @param {Object} tracing - The tracing system instance to create spans | ||
| * @param {string} replayId - Unique identifier to associate with this replay recording | ||
| * @returns {Object|null} A formatted payload containing spans data in OTLP format, or null if no events exist | ||
| * @param {Object} attributes - Attributes to add to the span | ||
| * (e.g., rollbar.replay.id, rollbar.occurrence.uuid) | ||
| */ |
There was a problem hiding this comment.
This function now has the same shape as Telemeter's exportTelemetrySpan, and instead of returning the payload (which forces us to produce all export side-effects before calling this function), it now returns nothing, making it a pure side-effect like Telemeter's exportTelemetrySpan.
| if (events.length < 2) { | ||
| logger.error('Replay recording cannot have less than 2 events'); | ||
| return null; | ||
| if (events.length < 3) { |
There was a problem hiding this comment.
+ 1 since now we have the final ending event, which is being added in _collectEvents now.
| */ | ||
| dump(tracing, replayId, occurrenceUuid) { | ||
| const events = this._events.previous.concat(this._events.current); | ||
| exportRecordingSpan(tracing, attributes = {}) { |
There was a problem hiding this comment.
To follow exportTelemetrySpan, this function now accepts attributes instead of explicitly requiring to pass the occurrence uuid and the replay id, which is controlled by the ReplayManager now as it should.
| /** | ||
| * Processes a replay by converting recorder events into a transport-ready payload. | ||
| * Processes a replay by exporting spans and generating a transport-ready payload. | ||
| * | ||
| * Calls recorder.dump() to capture events as spans, formats them into a proper payload, | ||
| * and stores the result in the map using replayId as the key. | ||
| * Exports both telemetry and recording spans, then generates the complete payload | ||
| * using the tracing exporter and stores it in the map using replayId as the key. | ||
| * | ||
| * @param {string} replayId - The unique ID for this replay | ||
| * @returns {Promise<string>} A promise resolving to the processed replayId | ||
| * @param {string} occurrenceUuid - The UUID of the associated error occurrence | ||
| * @returns {string|null} The replayId if successful, or null if an error occurred | ||
| * @private | ||
| */ |
There was a problem hiding this comment.
Fixed a couple of things in this function:
exportTelemetrySpanshould happen after exporting the recording span since the recording span export can throw if the recording happens to be invalid. If this happens, we don't want the telemetry span to end. We want to bail out from the whole situation.Recorder.dumpdoesn't exist anymore and now uses the same strategy asTelemeter.exportTelemetrySpan, this is useful because it not only makes our code simpler and more consistent, but it allows us to control the order of exports in order to fix the problem described in 1.ReplayManageris now in charge of calling tracing'stoPayloadto generate the payload necessary instead ofRecording.- If the
_processReplayprocess fails, we don't return a replay id anymore, we returnnull. This is because the replay doesn't exist if this function fails. So we end up with a replay id that points at nothing. - We don't map the
replayIdtonullin our Map collection if processing the replay fails, since the replay doesn't exist and we have no logic to make any use ofnullreplays. - This function isn't
asyncanymore because it doesn't have to be.
| /** | ||
| * Adds a replay to the map and returns a uniquely generated replay ID. | ||
| * | ||
| * This method immediately returns the replayId and asynchronously processes | ||
| * the replay data in the background. The processing involves converting | ||
| * recorder events into a payload format and storing it in the map. | ||
| * The processing involves converting recorder events into a payload format | ||
| * and storing it in the map. | ||
| * | ||
| * @returns {string} A unique identifier for this replay | ||
| * @returns {string|null} A unique identifier for this replay, or null if an | ||
| * error occurred | ||
| */ |
There was a problem hiding this comment.
This function now returns the replay id if we were able to add it, or null if we didn't (signaling the replay doesn't exist).
The catch in _processReplay did nothing so it was removed together with making the function non-async.
There was a problem hiding this comment.
_processReplay is async so that queue.add_item doesn't have to wait for this work to happen before sending the occurrence payload.
The main benefit of making it sync is so the id can be omitted from the occurrence payload, but the great majority of replays not sent will be due to API rate limits and account quotas, which won't be known until send is called. So there's limited benefit to making this sync.
There was a problem hiding this comment.
Right, I was too focused on error handling and I didn't see that. I understand, _processReplay happens asynchronously, we assume the replay was exported successfully, but at the point of sending, if anything went wrong during process, we will just not find the replay now (before, we'd actually end up in an invalid state with a null replay).
I'll make _processReplay async again, no trouble!
There was a problem hiding this comment.
And the benefit you mention would be the fact that, if the processing fails, the replayId in item wouldn't be set. So the occurrence wouldn't be mapped to a replay that doesn't actually exist.
There was a problem hiding this comment.
It could be interesting that in those cases (which I've seen happen a lot, though not enough to justify not making this async), to actually send a replay span that's basically an error span, signaling that the replay failed to be processed. 🤔
We talked about this, that's why I had the "error span" comment when setting the replayId to null in the map if the replay processing fails.
There was a problem hiding this comment.
Could be an error event on another span. I don't know that we'd want replay spans that aren't actually replays.
| @@ -103,15 +99,11 @@ | |||
| */ | |||
| async send(replayId) { | |||
There was a problem hiding this comment.
The send function doesn't return either true or false, but throws if there was a problem sending the replay. This makes it much easier to control the problematic flows to discard replays that fail to send.
| if (this.replayManager && data.body) { | ||
| const replayId = data?.attributes?.find( | ||
| (a) => a.key === 'replay_id', | ||
| )?.value; | ||
| const rid = data.attributes?.find((a) => a.key === 'replay_id')?.value; | ||
|
|
||
| if (replayId) { | ||
| item.replayId = this.replayManager.add(replayId, data.uuid); | ||
| if (rid) { | ||
| item.replayId = this.replayManager.add(rid, data.uuid); | ||
| } | ||
| } |
There was a problem hiding this comment.
Renamed replayId to rid just because I felt the find was easier to read in a single line.
| } catch (err) { | ||
| this._dequeuePendingRequest(data); | ||
| callback(e); | ||
|
|
||
| if (item.replayId) { | ||
| this.replayManager?.discard(item.replayId); | ||
| } | ||
|
|
||
| callback(err); |
There was a problem hiding this comment.
I couldn't find a more graceful way to discard the replay if makeApiRequest throws (I wanted to control the discard from a single place: the new _sendOrDiscardReplay function. But, oh well.
| /** | ||
| * Sends or discards a replay based on the API response. | ||
| * | ||
| * Sends the replay if the response indicates success and replay is enabled. | ||
| * | ||
| * Discards the replay if there's an error, replay is disabled, or rate limit | ||
| * is exceeded. | ||
| * | ||
| * @param {string} replayId - The ID of the replay to handle | ||
| * @param {string} replayId - The ID of the replay to send or discard | ||
| * @param {Object} err - The error object from the API request, if any | ||
| * @param {Object} response - The API response | ||
| * @param {Object} headers - The response headers | ||
| * @returns {Promise<boolean>} A promise that resolves to true if replay was sent successfully, | ||
| * false if replay was discarded or an error occurred | ||
| */ |
There was a problem hiding this comment.
This function is now much simpler and renamed it to make it more explicit.
Basically:
- If we can't send the replay, discard the replay.
- If we send the replay and fails, discard the replay.
This function is async but the promise doesn't resolve to any value since we don't need to.
90425f3 to
40cd8ab
Compare
Note
PR description written by Claude with my guidance and by analyzing the diff of this branch.
Problem
We discovered a critical memory leak in the Session Replay feature where replays were not being properly discarded when API errors occurred. This could lead to unbounded memory growth in production applications experiencing API failures.
Solution
This PR fixes the memory leak and significantly improves the Session Replay architecture through better separation of concerns, explicit error handling, and API consistency.
Key Changes
1. Memory Leak Fix 🚨
_canSendReplay()and_sendOrDiscardReplay()methods toQueuefor clearer logic2. API Refactoring -
dump()→exportRecordingSpan()Recorder.dump()toexportRecordingSpan()for consistency withexportTelemetrySpan()_processReplay()to_exportSpansAndAddTracingPayload()for clarity3. Improved Error Handling
4. Better Separation of Concerns
exportRecordingSpan()no longer returns a value - it purely exports the span_processReplay()explicitly callstoPayload()after exporting spans5. Enhanced Test Coverage
Technical Details
Files Changed:
src/queue.js- Fixed memory leak, added replay handling methodssrc/browser/replay/recorder.js- Renamed dump to exportRecordingSpan, refactoredsrc/browser/replay/replayManager.js- Improved error handling, synchronous processingBreaking Changes: None - All changes are internal
Testing
Impact
This fix is critical for production stability. Applications using Session Replay that experience API errors will no longer accumulate replays in memory, preventing potential out-of-memory crashes.