
OOM crash on large transcript files — split_mega_files.py and normalize.py load entire file into memory #396

@igorls

Description

split_mega_files.py and normalize.py eagerly load entire file contents into memory, causing MemoryError / OOM kills on large transcript exports (multi-GB Slack or ChatGPT bulk exports).

Root Cause

split_mega_files.py

Line 185 — split_file() loads the entire file:

lines = path.read_text(errors="replace").splitlines(keepends=True)

Line 270 — main() scan loop loads it again to count session boundaries:

lines = f.read_text(errors="replace").splitlines(keepends=True)

Every file is read fully into memory twice.

normalize.py

Line 29-30:

with open(filepath, "r", encoding="utf-8", errors="replace") as f:
    content = f.read()

Memory Impact (measured)

Input: 1MB test file

  • read_text(): 1,040,041 bytes allocated
  • splitlines(): 4,071,081 bytes allocated
  • Overhead ratio: 2.9x the input size (string + list of lines)
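The numbers above can be reproduced with tracemalloc. This is a standalone sketch, not the original measurement harness: the file contents, size, and `measure_peak` helper are illustrative.

```python
import os
import tempfile
import tracemalloc
from pathlib import Path

def measure_peak(path: Path) -> int:
    """Peak bytes allocated while replicating split_file()'s eager read."""
    tracemalloc.start()
    lines = path.read_text(errors="replace").splitlines(keepends=True)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del lines
    return peak

# Build a ~1MB file of short lines, similar to the test input above.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("session line\n" * 80_000)
    name = tmp.name

path = Path(name)
size = path.stat().st_size
peak = measure_peak(path)
print(f"input: {size:,} bytes, peak: {peak:,} bytes")
os.unlink(name)
```

The exact ratio varies with line length (per-line string object overhead dominates for short lines), but the peak is always a multiple of the input size.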

For a 2GB Slack export:

  • read_text() allocates ~2GB
  • .splitlines(keepends=True) allocates an additional ~2-3GB
  • Total peak: ~5.8GB for a 2GB file
  • On an 8GB machine → OOM kill
  • split_mega_files.py reads the file twice → could peak at ~11.6GB

Impact

  • MemoryError crash on large transcript files
  • OOM killer terminates the process on memory-constrained systems
  • ChatGPT conversations.json exports can easily exceed 1GB
  • Slack workspace exports can exceed 5GB

Suggested Fix

Stream files line-by-line instead of loading everything into memory:

def find_session_boundaries_streaming(filepath):
    boundaries = []
    with open(filepath, errors="replace") as f:
        for i, line in enumerate(f):
            if "Claude Code v" in line:
                # peek ahead for context restore detection
                boundaries.append(i)
    return boundaries

And refactor split_file() to use a two-pass streaming approach or split using byte offsets.
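A minimal sketch of the byte-offset variant, assuming the "Claude Code v" marker is the only boundary signal (the function names and part-file naming scheme here are illustrative, not existing mempalace API): pass one records the byte offset of each boundary line, pass two copies each byte range to its own file in fixed-size chunks, so memory use stays constant regardless of file size.

```python
from pathlib import Path

def find_boundary_offsets(path: Path) -> list[int]:
    """First pass: byte offset of each line that starts a new session."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for raw in f:  # iterates line-by-line; one line in memory at a time
            if b"Claude Code v" in raw:
                offsets.append(pos)
            pos += len(raw)
    return offsets

def split_by_offsets(path: Path, offsets: list[int], out_dir: Path,
                     chunk_size: int = 1 << 20) -> list[Path]:
    """Second pass: copy each [start, end) byte range to its own file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    ends = offsets[1:] + [path.stat().st_size]
    parts = []
    with open(path, "rb") as src:
        for n, (start, end) in enumerate(zip(offsets, ends)):
            part = out_dir / f"{path.stem}.part{n:03d}{path.suffix}"
            src.seek(start)
            remaining = end - start
            with open(part, "wb") as dst:
                while remaining > 0:
                    buf = src.read(min(chunk_size, remaining))
                    if not buf:
                        break
                    dst.write(buf)
                    remaining -= len(buf)
            parts.append(part)
    return parts
```

Working in binary mode sidesteps decoding entirely during the split; errors="replace" decoding can stay in the per-session processing stage where each part is already small.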

Environment

  • mempalace v3.0.13 (current main)
  • split_mega_files.py lines 185, 270
  • normalize.py lines 29-30
