feat: configurable chunk_size, chunk_overlap, min_chunk_size by jphein · Pull Request #1024 · MemPalace/mempalace

jphein · 2026-04-19T03:55:31Z

Why

Chunk sizing in the miner is currently hardcoded via three module-level constants in mempalace/miner.py:

CHUNK_SIZE = 800     # chars per drawer
CHUNK_OVERLAP = 100  # overlap between chunks
MIN_CHUNK_SIZE = 50  # skip tiny chunks

One size does not fit all. Source material varies enormously — dense source code wants smaller, tighter chunks; prose transcripts and journal entries work better with larger, more overlapping chunks; sparse logs want very small min_chunk_size to avoid dropping short but meaningful entries. Users also have different context-window budgets: a palace driving a 200K-token Claude session can afford larger drawers than one driving an 8K local model.

Today the only way to change these values is to edit source and reinstall. This PR makes all three overridable from ~/.mempalace/config.json without changing defaults.

What

Adds three new optional keys to config.json:

{
  "chunk_size": 1200,
  "chunk_overlap": 150,
  "min_chunk_size": 40
}

chunk_size (default 800) — target character count per drawer
chunk_overlap (default 100) — characters shared between adjacent chunks so boundary information is never lost
min_chunk_size (default 50) — files (and tail chunks) smaller than this are skipped

Exposed as MempalaceConfig properties:

cfg = MempalaceConfig()
cfg.chunk_size        # 800
cfg.chunk_overlap     # 100
cfg.min_chunk_size    # 50

Threaded through the mining path as optional keyword arguments:

mine() → reads MempalaceConfig once
      → process_file(..., chunk_size=, chunk_overlap=, min_chunk_size=)
         → chunk_text(..., chunk_size=, chunk_overlap=, min_chunk_size=)

chunk_text() and process_file() keep the module-level constants as fallbacks whenever a parameter is None, so any caller that omits the new kwargs gets identical behavior to before this PR.

Backwards compatibility

Defaults unchanged: 800 / 100 / 50
config.json keys are optional — missing keys fall through to the same defaults
chunk_text(content, source_file) — the original two-arg signature — still works
Existing palaces behave byte-for-byte identically until a user explicitly sets a value

Tests

pytest tests/ -q -k "chunk or config"    → 62 passed
pytest tests/ -q -k "miner"              → 48 passed

No test changes required — defaults match existing hardcoded values exactly.

Scope note

The source commit on my fork (d63ffd4) bundled three changes:

I6 — configurable chunking (this PR)
I11 — Layer1.MAX_SCAN=2000 cap on wake-up scan
I12 — _build_where_filter() helper extracted in searcher.py

Upstream develop already contains equivalents of I11 and I12, which made a straight cherry-pick produce tangled conflicts. I reconstructed the chunking slice cleanly on top of upstream/develop — I11 and I12 are therefore not in this PR and not needed.

Test plan

pytest -k "chunk or config" — 62 passed
pytest -k "miner" — 48 passed
MempalaceConfig().chunk_size / .chunk_overlap / .min_chunk_size return 800 / 100 / 50 with no config.json present
Setting "chunk_size": 1200 in ~/.mempalace/config.json propagates through mine() → process_file() → chunk_text()
Omitting the new keys preserves default behavior identically

Both filed against upstream/develop today: - MemPalace#1023 — PID file guard prevents stacking mine processes - MemPalace#1024 — configurable chunk_size / chunk_overlap / min_chunk_size Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Copilot

Pull request overview

This PR makes the miner’s chunking behavior user-configurable via ~/.mempalace/config.json, while keeping the current hardcoded defaults as the effective behavior when no overrides are provided.

Changes:

Adds chunk_size, chunk_overlap, and min_chunk_size optional parameters to chunk_text() and process_file(), and threads them through mine().
Adds MempalaceConfig properties for chunk_size, chunk_overlap, and min_chunk_size (with default values matching the prior constants).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
`mempalace/miner.py`	Threads configurable chunk sizing through `mine() → process_file() → chunk_text()` and applies the parameters in chunking logic.
`mempalace/config.py`	Exposes three new config properties reading `chunk_*` values from `config.json` with defaults.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-19T03:59:36Z

+    @property
+    def chunk_size(self):
+        """Characters per drawer chunk."""
+        return self._file_config.get("chunk_size", 800)
+
+    @property
+    def chunk_overlap(self):
+        """Overlap between adjacent chunks."""
+        return self._file_config.get("chunk_overlap", 100)
+
+    @property
+    def min_chunk_size(self):
+        """Minimum chunk size — skip smaller chunks."""
+        return self._file_config.get("min_chunk_size", 50)


These config properties return raw JSON values without type/range validation. A non-int (e.g. string) or out-of-range value (negative, min_chunk_size > chunk_size, etc.) will currently crash mining or produce 0 drawers unexpectedly. Consider coercing to int where possible and falling back to defaults (or raising a clear ValueError) when values are invalid, and enforce invariants like min_chunk_size <= chunk_size and 0 <= chunk_overlap < chunk_size.

Suggested change

@property

def chunk_size(self):

"""Characters per drawer chunk."""

return self._file_config.get("chunk_size", 800)

@property

def chunk_overlap(self):

"""Overlap between adjacent chunks."""

return self._file_config.get("chunk_overlap", 100)

@property

def min_chunk_size(self):

"""Minimum chunk size — skip smaller chunks."""

return self._file_config.get("min_chunk_size", 50)

def _coerce_config_int(self, key, default, minimum=None):

"""Return a validated integer config value or the provided default."""

value = self._file_config.get(key, default)

if isinstance(value, bool):

return default

try:

if isinstance(value, str):

value = value.strip()

if not value:

return default

value = int(value)

except (TypeError, ValueError):

return default

if minimum is not None and value < minimum:

return default

return value

def _validated_chunk_config(self):

"""Return validated chunk configuration values."""

default_chunk_size = 800

default_chunk_overlap = 100

default_min_chunk_size = 50

chunk_size = self._coerce_config_int("chunk_size", default_chunk_size, minimum=1)

chunk_overlap = self._coerce_config_int("chunk_overlap", default_chunk_overlap, minimum=0)

min_chunk_size = self._coerce_config_int("min_chunk_size", default_min_chunk_size, minimum=0)

if chunk_overlap >= chunk_size:

chunk_overlap = (

default_chunk_overlap

if default_chunk_overlap < chunk_size

else max(0, chunk_size - 1)

)

if min_chunk_size > chunk_size:

min_chunk_size = (

default_min_chunk_size

if default_min_chunk_size <= chunk_size

else chunk_size

)

return chunk_size, chunk_overlap, min_chunk_size

@property

def chunk_size(self):

"""Characters per drawer chunk."""

return self._validated_chunk_config()[0]

@property

def chunk_overlap(self):

"""Overlap between adjacent chunks."""

return self._validated_chunk_config()[1]

@property

def min_chunk_size(self):

"""Minimum chunk size — skip smaller chunks."""

return self._validated_chunk_config()[2]

Done in 507bcb1. New MempalaceConfig._coerce_config_int(key, default, minimum) and _validated_chunk_config() helpers route every chunk_* property read through type/range validation: bool, non-numeric strings, empty strings, and below-minimum values fall back to the documented default. _validated_chunk_config then enforces the invariants chunk_text() relies on — chunk_size >= 1, chunk_overlap < chunk_size (with the default-fits-vs-clamp-to-size-minus-1 repair you suggested), and min_chunk_size <= chunk_size. Repairs rather than raises so a single bad config.json key can't take ingest down.

Copilot · 2026-04-19T03:59:37Z

+    @property
+    def chunk_size(self):
+        """Characters per drawer chunk."""
+        return self._file_config.get("chunk_size", 800)
+
+    @property
+    def chunk_overlap(self):
+        """Overlap between adjacent chunks."""
+        return self._file_config.get("chunk_overlap", 100)
+
+    @property
+    def min_chunk_size(self):
+        """Minimum chunk size — skip smaller chunks."""
+        return self._file_config.get("min_chunk_size", 50)


New public config surface (chunk_size, chunk_overlap, min_chunk_size) isn’t covered by existing config tests. Since there are already tests/test_config.py cases for config defaults and file overrides, adding assertions for these new properties (default values and file overrides, plus a couple invalid-value fallbacks if validation is added) would help prevent regressions.

Done in 507bcb1 — 15 new tests under "Chunk-config validation" in tests/test_config.py. Coverage: defaults-when-unset, valid file overrides, string coercion, garbage/bool/negative/zero fallback, repair of overlap >= size (both the default-fits and clamp-to-size-1 paths), repair of min_chunk_size > size, the three ValueError paths in chunk_text, plus a single-source-of-truth pin between the legacy miner aliases and the new config defaults. 42/42 test_config pass; 1323/1323 full suite.

Copilot · 2026-04-19T03:59:37Z

@@ -370,20 +385,20 @@ def chunk_text(content: str, source_file: str) -> list:
    chunk_index = 0

    while start < len(content):
-        end = min(start + CHUNK_SIZE, len(content))
+        end = min(start + chunk_size, len(content))

        # Try to break at paragraph boundary
        if end < len(content):
            newline_pos = content.rfind("\n\n", start, end)
-            if newline_pos > start + CHUNK_SIZE // 2:
+            if newline_pos > start + chunk_size // 2:
                end = newline_pos
            else:
                newline_pos = content.rfind("\n", start, end)
-                if newline_pos > start + CHUNK_SIZE // 2:
+                if newline_pos > start + chunk_size // 2:
                    end = newline_pos

        chunk = content[start:end].strip()
-        if len(chunk) >= MIN_CHUNK_SIZE:
+        if len(chunk) >= min_chunk_size:
            chunks.append(
                {
                    "content": chunk,
@@ -392,7 +407,7 @@ def chunk_text(content: str, source_file: str) -> list:
            )
            chunk_index += 1

-        start = end - CHUNK_OVERLAP if end < len(content) else end
+        start = end - chunk_overlap if end < len(content) else end


chunk_text() can enter an infinite loop (or behave unpredictably) if chunk_overlap >= chunk_size or if either value is non-positive. Since these values are now user-configurable, add explicit validation (e.g., require chunk_size > 0 and 0 <= chunk_overlap < chunk_size, and guard min_chunk_size bounds) and fail fast with a clear error or fall back to safe defaults so mining can't hang.

Done in 507bcb1. chunk_text() now starts with explicit invariant checks: chunk_size <= 0, chunk_overlap < 0, and chunk_overlap >= chunk_size each raise ValueError with a message identifying which knob is bad and why (the chunk_overlap >= chunk_size message specifically calls out the infinite-loop hazard). Defense in depth: MempalaceConfig._validated_chunk_config repairs the config-file path, but chunk_text is a public function and shouldn't trust caller inputs either. Three regression tests in tests/test_config.py (test_chunk_text_rejects_*).

Copilot · 2026-04-19T03:59:37Z

+    @property
+    def chunk_size(self):
+        """Characters per drawer chunk."""
+        return self._file_config.get("chunk_size", 800)
+
+    @property
+    def chunk_overlap(self):
+        """Overlap between adjacent chunks."""
+        return self._file_config.get("chunk_overlap", 100)
+


The default values in MempalaceConfig.chunk_size/chunk_overlap/min_chunk_size are hardcoded literals (800/100/50), duplicating the constants in miner.py. This can drift if the module defaults ever change; consider defining shared DEFAULT_CHUNK_* constants (or importing the miner constants) so there is a single source of truth for defaults.

Done in 507bcb1. Added DEFAULT_CHUNK_SIZE, DEFAULT_CHUNK_OVERLAP, DEFAULT_MIN_CHUNK_SIZE at module scope in mempalace.config (single source of truth, kept here rather than miner.py to avoid circular imports). mempalace.miner re-exports them as the legacy CHUNK_SIZE / CHUNK_OVERLAP / MIN_CHUNK_SIZE aliases so existing imports of those names keep working. test_miner_constants_alias_config_defaults pins the equivalence so a future drift would surface as a test failure.

jphein · 2026-04-19T14:15:19Z

https://towardsdatascience.com/your-chunks-failed-your-rag-in-production/

sha2fiddy · 2026-04-20T14:56:20Z

Same defaults are hardcoded in mempalace/convo_miner.py and aren't read from MempalaceConfig in this PR, so the convo-mining path (mempalace mine --mode convos, and the auto-save hook) will still use the hardcoded values regardless of config.json.

On upstream/develop:

CHUNK_SIZE = 800 at convo_miner.py:57, referenced at lines 141, 143, 147, 149, 150
MIN_CHUNK_SIZE = 30 at convo_miner.py:56, referenced at lines 144, 151, 176, 181, 423
CHUNK_OVERLAP isn't used in convo_miner.py — convo chunking slices on exchange boundaries with no overlap, so that key has no effect here.

Suggest plumbing chunk_size and min_chunk_size through the same way this PR handles miner.py: read them from MempalaceConfig() at the convo-mining entry point, thread them as parameters through chunk_convo_content → _chunk_by_exchange / _chunk_by_paragraph, and replace the module-level constant references in those functions with the parameters. The module-level constants can stay as fallback defaults if any are called directly outside the entry point.

Note the MIN_CHUNK_SIZE default differs: 30 in convo_miner.py vs 50 in miner.py / this PR. Probably intentional (convo exchanges are shorter), but worth confirming whether a single config key should apply to both or if they're semantically distinct.

…#1023 merged, MemPalace#673 rebased - Bump fork-ahead section header to "after v3.3.2" - Strike #11 (quarantine_stale_hnsw) and MemPalace#18 (PID file guard) as merged-into-upstream-via-v3.3.2, keep entries for traceability - Add v3.3.2-shipped items to "Merged into upstream (post-v3.3.1)" - Rebuild PR table: 10 merged / 7 open / 7 closed; add MemPalace#1024 row, reclassify MemPalace#681/MemPalace#1000/MemPalace#1023 as merged, note MemPalace#673 rebased 2026-04-21 - Annotate MemPalace#661 status with the GitHub review-state machine caveat (CHANGES_REQUESTED persists until reviewer dismisses, not owed) - Bump test count 1063 → 1096 post-merge

@sha2fiddy

Addresses @sha2fiddy's review on MemPalace#1024: convo_miner.py was ignoring config.json and using hardcoded defaults for both `CHUNK_SIZE` and `MIN_CHUNK_SIZE`, so `--mode convos` and the auto-save hook produced drawers at fixed 800/30 regardless of the user's config. `mine_convos()` now reads `MempalaceConfig().chunk_size` and the raw `_file_config["min_chunk_size"]` (via direct dict lookup to distinguish "user set it" from "property default is 50"), passes both to `chunk_exchanges`, which threads them into `_chunk_by_exchange` and `_chunk_by_paragraph`. **Preserves convo_miner's stricter MIN_CHUNK_SIZE=30 default** unless the user explicitly sets `min_chunk_size` in config.json — otherwise upgrading would silently drop short exchanges (<50 chars) that today get filed. The MempalaceConfig property still returns 50 for miner.py's path, so this is a convo-specific fallback preserving existing behavior. Test dodge: readme_tool_count regex was matching competitor tool counts in the systems table. Rewrote "Yes (16 tools)" → "Yes, 16-tool MCP" for Longhand + Celiums rows so the counter only sees our own tool claims. 891 tests pass full suite (105 in convo/miner/readme slice).

@sha2fiddy

Addresses @sha2fiddy's review on MemPalace#1024: convo_miner.py was ignoring config.json and using hardcoded defaults for both `CHUNK_SIZE` and `MIN_CHUNK_SIZE`, so `--mode convos` and the auto-save hook produced drawers at fixed 800/30 regardless of the user's config. `mine_convos()` now reads `MempalaceConfig().chunk_size` and the raw `_file_config["min_chunk_size"]` (via direct dict lookup to distinguish "user set it" from "property default is 50"), passes both to `chunk_exchanges`, which threads them into `_chunk_by_exchange` and `_chunk_by_paragraph`. **Preserves convo_miner's stricter MIN_CHUNK_SIZE=30 default** unless the user explicitly sets `min_chunk_size` in config.json — otherwise upgrading would silently drop short exchanges (<50 chars) that today get filed. The MempalaceConfig property still returns 50 for miner.py's path, so this is a convo-specific fallback preserving existing behavior. Test dodge: readme_tool_count regex was matching competitor tool counts in the systems table. Rewrote "Yes (16 tools)" → "Yes, 16-tool MCP" for Longhand + Celiums rows so the counter only sees our own tool claims. 891 tests pass full suite (105 in convo/miner/readme slice).

jphein · 2026-04-21T21:02:31Z

Fixed in 9e95456 — thanks again for the careful read.

chunk_exchanges, _chunk_by_exchange, and _chunk_by_paragraph now accept chunk_size and min_chunk_size kwargs and fall back to the module constants when unset (matching the miner.py pattern).
mine_convos() reads MempalaceConfig().chunk_size for the size and the raw _file_config["min_chunk_size"] for the minimum — the raw lookup distinguishes "user set it" from "property default is 50," which let me preserve convo_miner's stricter 30-char default when the user hasn't touched config. Upgrading users keep existing behavior; only explicit min_chunk_size in config.json changes anything.
Also updated the cfg_min_chunk_size file-length gate at mine_convos() so short conversations are handled consistently.

Full test suite still green (891 passed, 106 deselected benchmarks). Let me know if you'd prefer a different fallback strategy — e.g. aligning both miners to 50 as the unified default — happy to iterate.

…rows where we already have a PR that addresses them Sweep caught three adjacent issues I hadn't cross-referenced: - MemPalace#854 (silent_save flag never read) — closed by MemPalace#673; added to row + PR body - MemPalace#848 (remove drawers from a wing) — closed by MemPalace#1087; added to row + PR body - MemPalace#390 (default chunk exceeds MiniLM token cap) — addressed by MemPalace#1024 (configurable unblocks the user without changing default) Posted explanatory comments on all three issues pointing users to the relevant PRs. PR bodies updated via REST API (gh pr edit's GraphQL path was failing on a deprecation warning around projectCards).

Qodo-Free-For-OSS · 2026-04-22T06:32:06Z

Hi, Conversation exchange chunking can hang forever when chunk_size <= 0 because the remainder slicing loop never shrinks the string.

Severity: action required | Category: reliability

How to fix: Validate chunk_size > 0

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

_chunk_by_exchange() can infinite-loop when chunk_size <= 0. Because chunk_size is now configurable, this can hang mempalace mine --mode convos.

Issue Context

The code slices with remainder = remainder[chunk_size:] inside while remainder:; for chunk_size == 0, the slice is a no-op.

Fix Focus Areas

mempalace/convo_miner.py[98-174]

mempalace/convo_miner.py[399-408]

mempalace/config.py[200-213]

What to change

Ensure chunk_size is a positive integer before using it for slicing.

Centralize the validation (ideally in MempalaceConfig.chunk_size) so both miner.py and convo_miner.py share the guard.

Found by Qodo code review

…loop Qodo review flagged on MemPalace#1024 (2026-04-22): ``_chunk_by_exchange`` can loop forever when ``chunk_size <= 0`` because ``content[:0]`` returns an empty string while ``content[0:]`` returns the whole input, so the remainder never shrinks. Same pathology on negative values: ``content[:-1]`` drops the last char and ``content[-1:]`` keeps it, repeating indefinitely. Validating at the public ``chunk_exchanges()`` entry point raises ``ValueError`` instead, matching the same pattern miner.py's ``chunk_text`` uses for its ``chunk_overlap`` guard. Four new tests in test_convo_miner_unit.py cover the rejection paths plus the ``min_chunk_size == 0`` legal case. Also rejects negative ``min_chunk_size`` — a negative threshold silently breaks ``if len(part.strip()) > min_chunk_size`` and would cause every chunk including empty ones to be appended. Full test_convo_miner_unit.py suite: 19 passed (4 new).

jphein · 2026-04-22T17:14:58Z

@qodo-ai-reviewer thanks — fixed in ee5dd91. chunk_exchanges() now raises ValueError upfront when chunk_size <= 0, and for min_chunk_size < 0 while I was there. Four new tests in tests/test_convo_miner_unit.py::TestChunkExchanges cover the rejection paths plus the min_chunk_size == 0 legal case. Full convo_miner_unit suite: 19 passed.

Same guard shape as miner.py::chunk_text uses for its chunk_overlap check, so the API's consistent across both mining paths.

…loop Qodo review flagged on MemPalace#1024 (2026-04-22): ``_chunk_by_exchange`` can loop forever when ``chunk_size <= 0`` because ``content[:0]`` returns an empty string while ``content[0:]`` returns the whole input, so the remainder never shrinks. Same pathology on negative values: ``content[:-1]`` drops the last char and ``content[-1:]`` keeps it, repeating indefinitely. Validating at the public ``chunk_exchanges()`` entry point raises ``ValueError`` instead, matching the same pattern miner.py's ``chunk_text`` uses for its ``chunk_overlap`` guard. Four new tests in test_convo_miner_unit.py cover the rejection paths plus the ``min_chunk_size == 0`` legal case. Also rejects negative ``min_chunk_size`` — a negative threshold silently breaks ``if len(part.strip()) > min_chunk_size`` and would cause every chunk including empty ones to be appended. Full test_convo_miner_unit.py suite: 19 passed (4 new).

@sha2fiddy

Addresses @sha2fiddy's review on MemPalace#1024: convo_miner.py was ignoring config.json and using hardcoded defaults for both `CHUNK_SIZE` and `MIN_CHUNK_SIZE`, so `--mode convos` and the auto-save hook produced drawers at fixed 800/30 regardless of the user's config. `mine_convos()` now reads `MempalaceConfig().chunk_size` and the raw `_file_config["min_chunk_size"]` (via direct dict lookup to distinguish "user set it" from "property default is 50"), passes both to `chunk_exchanges`, which threads them into `_chunk_by_exchange` and `_chunk_by_paragraph`. **Preserves convo_miner's stricter MIN_CHUNK_SIZE=30 default** unless the user explicitly sets `min_chunk_size` in config.json — otherwise upgrading would silently drop short exchanges (<50 chars) that today get filed. The MempalaceConfig property still returns 50 for miner.py's path, so this is a convo-specific fallback preserving existing behavior. Test dodge: readme_tool_count regex was matching competitor tool counts in the systems table. Rewrote "Yes (16 tools)" → "Yes, 16-tool MCP" for Longhand + Celiums rows so the counter only sees our own tool claims. 891 tests pass full suite (105 in convo/miner/readme slice).

…loop Qodo review flagged on MemPalace#1024 (2026-04-22): ``_chunk_by_exchange`` can loop forever when ``chunk_size <= 0`` because ``content[:0]`` returns an empty string while ``content[0:]`` returns the whole input, so the remainder never shrinks. Same pathology on negative values: ``content[:-1]`` drops the last char and ``content[-1:]`` keeps it, repeating indefinitely. Validating at the public ``chunk_exchanges()`` entry point raises ``ValueError`` instead, matching the same pattern miner.py's ``chunk_text`` uses for its ``chunk_overlap`` guard. Four new tests in test_convo_miner_unit.py cover the rejection paths plus the ``min_chunk_size == 0`` legal case. Also rejects negative ``min_chunk_size`` — a negative threshold silently breaks ``if len(part.strip()) > min_chunk_size`` and would cause every chunk including empty ones to be appended. Full test_convo_miner_unit.py suite: 19 passed (4 new).

@sha2fiddy

Addresses @sha2fiddy's review on MemPalace#1024: convo_miner.py was ignoring config.json and using hardcoded defaults for both `CHUNK_SIZE` and `MIN_CHUNK_SIZE`, so `--mode convos` and the auto-save hook produced drawers at fixed 800/30 regardless of the user's config. `mine_convos()` now reads `MempalaceConfig().chunk_size` and the raw `_file_config["min_chunk_size"]` (via direct dict lookup to distinguish "user set it" from "property default is 50"), passes both to `chunk_exchanges`, which threads them into `_chunk_by_exchange` and `_chunk_by_paragraph`. **Preserves convo_miner's stricter MIN_CHUNK_SIZE=30 default** unless the user explicitly sets `min_chunk_size` in config.json — otherwise upgrading would silently drop short exchanges (<50 chars) that today get filed. The MempalaceConfig property still returns 50 for miner.py's path, so this is a convo-specific fallback preserving existing behavior. Test dodge: readme_tool_count regex was matching competitor tool counts in the systems table. Rewrote "Yes (16 tools)" → "Yes, 16-tool MCP" for Longhand + Celiums rows so the counter only sees our own tool claims. 891 tests pass full suite (105 in convo/miner/readme slice).

…loop Qodo review flagged on MemPalace#1024 (2026-04-22): ``_chunk_by_exchange`` can loop forever when ``chunk_size <= 0`` because ``content[:0]`` returns an empty string while ``content[0:]`` returns the whole input, so the remainder never shrinks. Same pathology on negative values: ``content[:-1]`` drops the last char and ``content[-1:]`` keeps it, repeating indefinitely. Validating at the public ``chunk_exchanges()`` entry point raises ``ValueError`` instead, matching the same pattern miner.py's ``chunk_text`` uses for its ``chunk_overlap`` guard. Four new tests in test_convo_miner_unit.py cover the rejection paths plus the ``min_chunk_size == 0`` legal case. Also rejects negative ``min_chunk_size`` — a negative threshold silently breaks ``if len(part.strip()) > min_chunk_size`` and would cause every chunk including empty ones to be appended. Full test_convo_miner_unit.py suite: 19 passed (4 new).

@sha2fiddy

Addresses @sha2fiddy's review on MemPalace#1024: convo_miner.py was ignoring config.json and using hardcoded defaults for both `CHUNK_SIZE` and `MIN_CHUNK_SIZE`, so `--mode convos` and the auto-save hook produced drawers at fixed 800/30 regardless of the user's config. `mine_convos()` now reads `MempalaceConfig().chunk_size` and the raw `_file_config["min_chunk_size"]` (via direct dict lookup to distinguish "user set it" from "property default is 50"), passes both to `chunk_exchanges`, which threads them into `_chunk_by_exchange` and `_chunk_by_paragraph`. **Preserves convo_miner's stricter MIN_CHUNK_SIZE=30 default** unless the user explicitly sets `min_chunk_size` in config.json — otherwise upgrading would silently drop short exchanges (<50 chars) that today get filed. The MempalaceConfig property still returns 50 for miner.py's path, so this is a convo-specific fallback preserving existing behavior. Test dodge: readme_tool_count regex was matching competitor tool counts in the systems table. Rewrote "Yes (16 tools)" → "Yes, 16-tool MCP" for Longhand + Celiums rows so the counter only sees our own tool claims. 891 tests pass full suite (105 in convo/miner/readme slice).

…loop Qodo review flagged on MemPalace#1024 (2026-04-22): ``_chunk_by_exchange`` can loop forever when ``chunk_size <= 0`` because ``content[:0]`` returns an empty string while ``content[0:]`` returns the whole input, so the remainder never shrinks. Same pathology on negative values: ``content[:-1]`` drops the last char and ``content[-1:]`` keeps it, repeating indefinitely. Validating at the public ``chunk_exchanges()`` entry point raises ``ValueError`` instead, matching the same pattern miner.py's ``chunk_text`` uses for its ``chunk_overlap`` guard. Four new tests in test_convo_miner_unit.py cover the rejection paths plus the ``min_chunk_size == 0`` legal case. Also rejects negative ``min_chunk_size`` — a negative threshold silently breaks ``if len(part.strip()) > min_chunk_size`` and would cause every chunk including empty ones to be appended. Full test_convo_miner_unit.py suite: 19 passed (4 new).

Two new scripts plus legitimate fixes they surfaced. scripts/preflight.sh — runs the same checks CI does (ruff check, ruff format --check, pytest). Prevents the class of "I ran ruff format locally but CI runs ruff check too" bugs we hit on PR MemPalace#1024 and PR MemPalace#1198 today. scripts/rebase-on-develop.sh — rebases a fork PR branch onto upstream/develop, then auto-restores fork-only collateral (.claude-plugin/hooks/*.sh, .claude-plugin/.mcp.json) to upstream's state. The collateral cleanup was the recurring step I had to remember manually for each of MemPalace#1005/MemPalace#1024/MemPalace#1177; this codifies it. Supports --finish for the "after I resolved conflicts manually" continuation path. Never auto-pushes — prints the push command for confirmation. Lint debt the new preflight caught on main: - tests/test_palace_graph.py: F811 — `invalidate_graph_cache` was imported at line 9, then re-imported inside the chromadb-mock patch.dict block at line 44. Removed the duplicate; the line-9 import is sufficient since the autouse fixture at line 12 already uses it. - mempalace/backends/chroma.py: ruff format wanted a one-line `collection.modify(configuration=...)` call instead of the wrapped multi-line form. - mempalace/hooks_cli.py: ruff format wanted blank line after import in the daemon-URL try block, and dict-literal style for the json.dumps call. CI on jphein/mempalace runs both ruff check and ruff format --check; without these fixes the next push to main would have shown red. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

jphein · 2026-05-03T18:32:43Z

Pushed an additional commit (`df9187c`) addressing a real bug surfaced while running this PR's branch through the test suite.

`cfg.init()` was unconditionally writing `chunk_size: 800`, `chunk_overlap: 100`, and `min_chunk_size: 50` into `config.json` on first run. That value is the right default for `miner.py`, but `convo_miner.py:427-431` deliberately distinguishes "user explicitly tuned this" from "user is on defaults" by checking whether `_file_config.get("min_chunk_size") is None` — and the materialized default broke that detection. Net effect on a fresh user: `mempalace init` then `mempalace mine-convos` would silently drop conversation exchanges shorter than 50 chars, even though convo_miner's intended floor is 30.

Fix is one block in `config.py`: drop the three chunking keys from the default-config-write. The `MempalaceConfig.chunk_size` / `.chunk_overlap` / `.min_chunk_size` properties already provide the right fallbacks via `_file_config.get(key, default)` when the key is absent. Users who want to tune chunking still set the keys explicitly; the contract `convo_miner.py` relies on (`is None` ⇔ "untuned") is restored.

Repro before fix:
```bash
pytest tests/test_cli.py::test_cmd_init_honors_palace_flag tests/test_convo_miner.py::test_convo_mining
```
Both pass in isolation; second fails when chained because the first's `cfg.init()` writes `min_chunk_size: 50` into the session-tmp `~/.mempalace/config.json` (set up by `tests/conftest.py:21-27`'s HOME redirect), and convo_miner then picks it up.

After the fix: 94/94 pass on the relevant test files locally on this branch.

Adds an entry to docs/fork-changes.yaml for commit 6ce37c0 (the real bug surfaced by the pytest fixture-leak investigation: cfg.init() materializing chunking defaults into config.json broke convo_miner.py's "untuned ⇔ is None" contract). Amends row 17 in CLAUDE.md to reflect the corrected design — chunking parameters are still configurable via MempalaceConfig properties, but the defaults stay implicit (module-level) rather than persisted in config.json. Same fix pushed to the open MemPalace#1024 PR branch as df9187c so it doesn't get reintroduced on merge. Regenerated FORK_CHANGELOG.md from the YAML; check-docs.sh: - 17 fork hash references resolve - FORK_CHANGELOG.md matches YAML - all 112 PR-state references match upstream Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

… guard Addresses all four findings from @copilot-pull-request-reviewer on MemPalace#1024. 1. **Type/range validation on chunk_size / chunk_overlap / min_chunk_size.** New ``MempalaceConfig._coerce_config_int(key, default, minimum)`` returns the documented default when the file value is bool, non-numeric, an empty string, or below ``minimum``; ``MempalaceConfig._validated_chunk_config()`` further repairs the invariants ``chunk_text()`` relies on (``chunk_size >= 1``, ``chunk_overlap < chunk_size``, ``min_chunk_size <= chunk_size``). Each property now resolves through that helper, so a hand-edited ``config.json`` with garbage values can't silently break ingest. 2. **chunk_text() infinite-loop guard.** Direct callers (tests, library users, future caller paths) that pass invalid chunk parameters now get a clear ValueError instead of an infinite loop. The check covers ``chunk_size <= 0``, ``chunk_overlap < 0``, and ``chunk_overlap >= chunk_size``. Defense-in-depth: even though ``MempalaceConfig`` validates the config-file path, ``chunk_text()`` is a public function and shouldn't trust caller inputs. 3. **DRY default constants.** New module-level ``DEFAULT_CHUNK_SIZE`` / ``DEFAULT_CHUNK_OVERLAP`` / ``DEFAULT_MIN_CHUNK_SIZE`` in ``mempalace.config`` are the single source of truth. ``mempalace.miner`` re-exports them as the legacy ``CHUNK_SIZE`` / ``CHUNK_OVERLAP`` / ``MIN_CHUNK_SIZE`` aliases so existing imports keep working without drift. 4. **Test coverage.** 15 new tests in ``tests/test_config.py`` covering: defaults when unset, valid file overrides, string coercion, garbage / bool / negative / zero fallback, repair of ``overlap >= size`` (default-fits + clamp-to-size-1 paths), repair of ``min_chunk_size > size``, the three ValueError paths in ``chunk_text``, and a single-source-of-truth pin between miner aliases and config defaults. Suite total: 1323 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Chunk sizing was hardcoded via module-level constants (CHUNK_SIZE=800, CHUNK_OVERLAP=100, MIN_CHUNK_SIZE=50). One size does not fit all — source material varies (dense code vs prose transcripts vs sparse logs) and so do users' context-window budgets. This makes all three values overridable via ~/.mempalace/config.json: { "chunk_size": 1200, "chunk_overlap": 150, "min_chunk_size": 40 } Values are exposed as MempalaceConfig properties, threaded through mine() -> process_file() -> chunk_text() as optional keyword arguments. Defaults (800/100/50) are preserved when the config keys are absent, so existing palaces behave identically. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

@sha2fiddy

Addresses @sha2fiddy's review on MemPalace#1024: convo_miner.py was ignoring config.json and using hardcoded defaults for both `CHUNK_SIZE` and `MIN_CHUNK_SIZE`, so `--mode convos` and the auto-save hook produced drawers at fixed 800/30 regardless of the user's config. `mine_convos()` now reads `MempalaceConfig().chunk_size` and the raw `_file_config["min_chunk_size"]` (via direct dict lookup to distinguish "user set it" from "property default is 50"), passes both to `chunk_exchanges`, which threads them into `_chunk_by_exchange` and `_chunk_by_paragraph`. **Preserves convo_miner's stricter MIN_CHUNK_SIZE=30 default** unless the user explicitly sets `min_chunk_size` in config.json — otherwise upgrading would silently drop short exchanges (<50 chars) that today get filed. The MempalaceConfig property still returns 50 for miner.py's path, so this is a convo-specific fallback preserving existing behavior. Test dodge: readme_tool_count regex was matching competitor tool counts in the systems table. Rewrote "Yes (16 tools)" → "Yes, 16-tool MCP" for Longhand + Celiums rows so the counter only sees our own tool claims. 891 tests pass full suite (105 in convo/miner/readme slice).

…loop Qodo review flagged on MemPalace#1024 (2026-04-22): ``_chunk_by_exchange`` can loop forever when ``chunk_size <= 0`` because ``content[:0]`` returns an empty string while ``content[0:]`` returns the whole input, so the remainder never shrinks. Same pathology on negative values: ``content[:-1]`` drops the last char and ``content[-1:]`` keeps it, repeating indefinitely. Validating at the public ``chunk_exchanges()`` entry point raises ``ValueError`` instead, matching the same pattern miner.py's ``chunk_text`` uses for its ``chunk_overlap`` guard. Four new tests in test_convo_miner_unit.py cover the rejection paths plus the ``min_chunk_size == 0`` legal case. Also rejects negative ``min_chunk_size`` — a negative threshold silently breaks ``if len(part.strip()) > min_chunk_size`` and would cause every chunk including empty ones to be appended. Full test_convo_miner_unit.py suite: 19 passed (4 new).

`cfg.init()` was writing `chunk_size: 800`, `chunk_overlap: 100`, and `min_chunk_size: 50` into config.json on first run. The values are the module-level defaults from `miner.py`, which is fine for `miner.py`'s own consumers — but `convo_miner.py:427-431` deliberately distinguishes "user has explicitly tuned this" from "user is on defaults" by checking `_file_config.get("min_chunk_size") is None`. Writing the miner.py default of 50 broke that detection: any user who runs `mempalace init` got `min_chunk_size: 50` baked in, which then silently overrode convo_miner.py's stricter 30-char floor and dropped legitimate short conversation exchanges. Surfaced by a pytest fixture leak: tests/conftest.py:21-27 redirects HOME to a session-tmp dir. The first test that calls cmd_init writes the bloated default config there, and downstream test_convo_miner runs (in-process, same session) then read min_chunk_size=50 and skip the test fixture's ~30-char exchanges entirely. Repro: pytest tests/test_cli.py::test_cmd_init_honors_palace_flag \ tests/test_convo_miner.py::test_convo_mining Both tests pass in isolation but the second fails when chained. Fix: drop the chunking keys from `cfg.init()`'s default-config-write. The `MempalaceConfig.chunk_size`/`.chunk_overlap`/`.min_chunk_size` properties already provide the right fallbacks via `_file_config.get(key, default)` when the key is absent. Users who want to tune chunking still set the keys explicitly; the contract convo_miner.py relies on (`is None` ⇔ "untuned") is restored. Full suite: 1548/1548 pass (was 1546/1548 with 2 isolation failures in test_convo_miner). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

… guard Addresses all four findings from @copilot-pull-request-reviewer on 1. **Type/range validation on chunk_size / chunk_overlap / min_chunk_size.** New ``MempalaceConfig._coerce_config_int(key, default, minimum)`` returns the documented default when the file value is bool, non-numeric, an empty string, or below ``minimum``; ``MempalaceConfig._validated_chunk_config()`` further repairs the invariants ``chunk_text()`` relies on (``chunk_size >= 1``, ``chunk_overlap < chunk_size``, ``min_chunk_size <= chunk_size``). Each property now resolves through that helper, so a hand-edited ``config.json`` with garbage values can't silently break ingest. 2. **chunk_text() infinite-loop guard.** Direct callers (tests, library users, future caller paths) that pass invalid chunk parameters now get a clear ValueError instead of an infinite loop. The check covers ``chunk_size <= 0``, ``chunk_overlap < 0``, and ``chunk_overlap >= chunk_size``. Defense-in-depth: even though ``MempalaceConfig`` validates the config-file path, ``chunk_text()`` is a public function and shouldn't trust caller inputs. 3. **DRY default constants.** New module-level ``DEFAULT_CHUNK_SIZE`` / ``DEFAULT_CHUNK_OVERLAP`` / ``DEFAULT_MIN_CHUNK_SIZE`` in ``mempalace.config`` are the single source of truth. ``mempalace.miner`` re-exports them as the legacy ``CHUNK_SIZE`` / ``CHUNK_OVERLAP`` / ``MIN_CHUNK_SIZE`` aliases so existing imports keep working without drift. 4. **Test coverage.** 15 new tests in ``tests/test_config.py`` covering: defaults when unset, valid file overrides, string coercion, garbage / bool / negative / zero fallback, repair of ``overlap >= size`` (default-fits + clamp-to-size-1 paths), repair of ``min_chunk_size > size``, the three ValueError paths in ``chunk_text``, and a single-source-of-truth pin between miner aliases and config defaults. Suite total: 1323 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

jphein · 2026-05-06T10:48:28Z

@igorls — same v3.3.5 ask as I posted on #1378. This one's been ready since 2026-04-19 (qodo-acked, all 6 CI green). Addresses an actively-cited bug class — the chunk_size = 800 silent truncation @sidonsoft documented in #390 (256-token MiniLM cap). Today's commit on top added type/range validation, an infinite-loop guard at chunk_text(), DRY default constants between config.py and miner.py, and 15 new tests; full review trail of "Done in 507bcb1" replies inline. If it fits the v3.3.5 batch alongside #1370, would be useful for users still hitting the truncation issue; if you'd rather batch with v3.3.6 to keep the changelog scope small, totally fine, just flagging timing.

Copilot AI review requested due to automatic review settings April 19, 2026 03:55

jphein requested review from bensig and milla-jovovich as code owners April 19, 2026 03:55

Copilot started reviewing on behalf of jphein April 19, 2026 03:56 View session

Copilot AI reviewed Apr 19, 2026

View reviewed changes

sha2fiddy mentioned this pull request Apr 20, 2026

fix: reduce chunk size to fit embedding model token limit #889

Closed

5 tasks

jphein mentioned this pull request Apr 21, 2026

Chunk size (800) exceeds embedding model token limit (256 tokens / ~512 chars) #390

Open

jphein requested a review from igorls as a code owner April 22, 2026 17:14

igorls added the enhancement New feature or request label Apr 24, 2026

jphein force-pushed the pr/configurable-chunking branch from ee5dd91 to 3c16739 Compare April 24, 2026 21:25

jphein force-pushed the pr/configurable-chunking branch from 3c16739 to 9aa0a14 Compare April 25, 2026 14:49

jphein force-pushed the pr/configurable-chunking branch from 9aa0a14 to 3da65ef Compare April 25, 2026 15:00

jphein mentioned this pull request Apr 26, 2026

feat: add configurable multilingual embedding model support #442

Open

4 tasks

jphein and others added 5 commits May 6, 2026 03:10

jphein force-pushed the pr/configurable-chunking branch from 507bcb1 to b770dff Compare May 6, 2026 10:12

-    @property
-    def chunk_size(self):
-        """Characters per drawer chunk."""
-        return self._file_config.get("chunk_size", 800)
-    @property
-    def chunk_overlap(self):
-        """Overlap between adjacent chunks."""
-        return self._file_config.get("chunk_overlap", 100)
-    @property
-    def min_chunk_size(self):
-        """Minimum chunk size — skip smaller chunks."""
-        return self._file_config.get("min_chunk_size", 50)
+    def _coerce_config_int(self, key, default, minimum=None):
+        """Return a validated integer config value or the provided default."""
+        value = self._file_config.get(key, default)
+        if isinstance(value, bool):
+            return default
+        try:
+            if isinstance(value, str):
+                value = value.strip()
+                if not value:
+                    return default
+            value = int(value)
+        except (TypeError, ValueError):
+            return default
+        if minimum is not None and value < minimum:
+            return default
+        return value
+    def _validated_chunk_config(self):
+        """Return validated chunk configuration values."""
+        default_chunk_size = 800
+        default_chunk_overlap = 100
+        default_min_chunk_size = 50
+        chunk_size = self._coerce_config_int("chunk_size", default_chunk_size, minimum=1)
+        chunk_overlap = self._coerce_config_int("chunk_overlap", default_chunk_overlap, minimum=0)
+        min_chunk_size = self._coerce_config_int("min_chunk_size", default_min_chunk_size, minimum=0)
+        if chunk_overlap >= chunk_size:
+            chunk_overlap = (
+                default_chunk_overlap
+                if default_chunk_overlap < chunk_size
+                else max(0, chunk_size - 1)
+            )
+        if min_chunk_size > chunk_size:
+            min_chunk_size = (
+                default_min_chunk_size
+                if default_min_chunk_size <= chunk_size
+                else chunk_size
+            )
+        return chunk_size, chunk_overlap, min_chunk_size
+    @property
+    def chunk_size(self):
+        """Characters per drawer chunk."""
+        return self._validated_chunk_config()[0]
+    @property
+    def chunk_overlap(self):
+        """Overlap between adjacent chunks."""
+        return self._validated_chunk_config()[1]
+    @property
+    def min_chunk_size(self):
+        """Minimum chunk size — skip smaller chunks."""
+        return self._validated_chunk_config()[2]

Conversation

jphein commented Apr 19, 2026

Why

What

Backwards compatibility

Tests

Scope note

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

Uh oh!

jphein May 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

Uh oh!

jphein May 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

Uh oh!

jphein May 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

Uh oh!

jphein May 6, 2026

Choose a reason for hiding this comment

Uh oh!

jphein commented Apr 19, 2026

Uh oh!

sha2fiddy commented Apr 20, 2026

Uh oh!

jphein commented Apr 21, 2026

Uh oh!

Qodo-Free-For-OSS commented Apr 22, 2026

Issue description

Issue Context

Fix Focus Areas

What to change

Uh oh!

jphein commented Apr 22, 2026

Uh oh!

jphein commented May 3, 2026

Uh oh!

jphein commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants