Description
split_mega_files.py and normalize.py eagerly load entire file contents into memory, causing MemoryError / OOM kills on large transcript exports (multi-GB Slack or ChatGPT bulk exports).
Root Cause
split_mega_files.py
Line 185 — split_file() loads the entire file:
lines = path.read_text(errors="replace").splitlines(keepends=True)
Line 270 — main() scan loop loads it again to count session boundaries:
lines = f.read_text(errors="replace").splitlines(keepends=True)
Every file is read fully into memory twice.
normalize.py
Lines 29-30:
with open(filepath, "r", encoding="utf-8", errors="replace") as f:
content = f.read()
Memory Impact (measured)
- Input: 1MB test file
- read_text(): 1,040,041 bytes allocated
- splitlines(): 4,071,081 bytes allocated
- Overhead ratio: 2.9x the input size (string + list of lines)
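The figures above can be reproduced by tracing allocations with tracemalloc; a minimal sketch, assuming any ~1MB transcript file stands in for the placeholder sample.txt:

import tracemalloc
from pathlib import Path

path = Path("sample.txt")  # placeholder: any ~1MB text file

tracemalloc.start()
text = path.read_text(errors="replace")   # whole file as one str
lines = text.splitlines(keepends=True)    # second copy: list of line strs
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current={current:,} B  peak={peak:,} B  "
      f"ratio={peak / path.stat().st_size:.1f}x")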
For a 2GB Slack export:
- read_text() allocates ~2GB
- .splitlines(keepends=True) allocates an additional ~2-3GB
- Total peak: ~5.8GB for a 2GB file
- On an 8GB machine → OOM kill
- split_mega_files.py reads the file twice → could peak at ~11.6GB
Impact
- MemoryError crash on large transcript files
- OOM killer terminates the process on memory-constrained systems
- ChatGPT conversations.json exports can easily exceed 1GB
- Slack workspace exports can exceed 5GB
Suggested Fix
Stream files line-by-line instead of loading everything into memory:
def find_session_boundaries_streaming(filepath):
    boundaries = []
    with open(filepath, errors="replace") as f:
        for i, line in enumerate(f):
            if "Claude Code v" in line:
                # peek ahead for context restore detection
                boundaries.append(i)
    return boundaries
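The same pattern applies to normalize.py: iterating over the file object instead of calling f.read() bounds peak memory by the longest line. A minimal sketch, assuming the normalization can be applied per line (normalize_line is a hypothetical stand-in for the actual transformation):

def normalize_streaming(filepath, outpath):
    # Stream input to output one line at a time; the whole file is never
    # held in memory at once.
    with open(filepath, "r", encoding="utf-8", errors="replace") as src, \
         open(outpath, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(normalize_line(line))  # hypothetical per-line helper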
And refactor split_file() to use a two-pass streaming approach or split using byte offsets.
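One possible shape for the byte-offset variant, illustrative only: pass 1 records the byte offset of each session boundary (e.g. by iterating in binary mode and accumulating line lengths), and pass 2 copies each range in fixed-size chunks. The boundary list and output naming below are placeholders, not the existing split_file() behavior:

def split_by_byte_offsets(filepath, boundaries, out_template="part_{:03d}.txt",
                          chunk_size=1 << 20):
    # boundaries: sorted byte offsets where each session starts (first entry 0).
    # Each part is copied in chunk_size pieces, so peak memory stays around
    # 1MB regardless of input size.
    edges = list(boundaries) + [None]          # None = copy through end of file
    with open(filepath, "rb") as src:
        for part, (start, end) in enumerate(zip(edges, edges[1:])):
            src.seek(start)
            remaining = None if end is None else end - start
            with open(out_template.format(part), "wb") as dst:
                while remaining is None or remaining > 0:
                    size = chunk_size if remaining is None else min(chunk_size, remaining)
                    chunk = src.read(size)
                    if not chunk:              # EOF
                        break
                    dst.write(chunk)
                    if remaining is not None:
                        remaining -= len(chunk)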
Environment
- mempalace v3.0.13 (current main)
- split_mega_files.py lines 185, 270
- normalize.py lines 29-30