Company or project name
No response
Describe what's wrong
On Proton 3.1.1, startup/catch-up is unstable for a Dynamic/JSON stream (default.stream_abc, 128 shards).
With typed-column streams we do not see this failure mode.
With a Dynamic/JSON stream, bootstrap typically progresses through:
- commit pool contention (No available threads in background commit pool...)
- part pressure (TOO_MANY_PARTS)
- memory growth
- OOM while loading outdated parts
- hard process termination
Once the crash loop starts, the query engine is unavailable because startup never completes.
Does it reproduce on the most recent release?
Yes
How to reproduce
- Server version: timeplusd 3.1.1 (revision 197, git 6b2e0109a188157d3985e4add5812967a9cb1c97)
- Interface: not interface-specific (the crash happens during bootstrap, before normal queries are available)
- Non-default settings tested:
  - storage_commit_pool_size tested at 32, 128, 256, 1659
  - tuned parts_to_delay_insert / parts_to_throw_insert
  - tuned background pool/concurrency settings in the profile
- Main stream involved: default.stream_abc (Dynamic/JSON, 128 shards)
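As a sketch of the kind of part-pressure tuning attempted, assuming ClickHouse-style ALTER ... MODIFY SETTING syntax carries over to Proton streams (the thresholds shown are illustrative, not the exact values used):

```sql
-- Illustrative only: raise the small-part thresholds that trigger
-- delay/throw behaviour during the bootstrap catch-up phase.
-- Syntax and runtime support are assumptions; see "Runtime tuning
-- limitations for Stream settings" under Additional context below.
ALTER STREAM default.stream_abc
    MODIFY SETTING parts_to_delay_insert = 1000, parts_to_throw_insert = 3000;
```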
Steps
- Start Proton with persisted data for default.stream_abc where there is a backlog/large part count.
- Wait for bootstrap/catch-up.
- Observe commit-pool contention warnings and part pressure.
- Observe OOM in outdated-parts loading and process termination.
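The backlog precondition in the first step can be produced by any high-frequency, small-batch write pattern against the Dynamic column. A hypothetical sketch (all column values are placeholders):

```sql
-- Hypothetical workload: frequent single-row inserts accumulate many
-- tiny parts, which is the state bootstrap later has to catch up on.
INSERT INTO default.stream_abc (chain_id, address, `key`, value, _index)
VALUES (1, '0xabc', 'balance', '{"amount": 1}', 42);
-- ...repeated at high frequency across many (chain_id, address) pairs,
-- then restart the server.
```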
SHOW CREATE statements
CREATE STREAM IF NOT EXISTS default.stream_abc
(
    chain_id uint16,
    address low_cardinality(string) CODEC(ZSTD(3)),
    `key` low_cardinality(string) CODEC(ZSTD(1)),
    value dynamic,
    _index uint128 CODEC(Delta(8), ZSTD(1)),
    _tp_time datetime64(3, 'UTC') DEFAULT now64(3, 'UTC') CODEC(DoubleDelta, ZSTD(1)),
    _tp_sn int64 CODEC(Delta(8), ZSTD(1)),
    INDEX bf_address address TYPE bloom_filter(0.001) GRANULARITY 4,
    INDEX bf_key `key` TYPE bloom_filter(0.001) GRANULARITY 4,
    INDEX _tp_time_index _tp_time TYPE minmax GRANULARITY 32
)
ENGINE = Stream(128, 1, city_hash64((chain_id, address)))
PRIMARY KEY (chain_id, address, `key`, _index)
PARTITION BY (chain_id, to_YYYYMMDD(_tp_time))
ORDER BY (chain_id, address, `key`, _index)
SETTINGS index_granularity = 8192;
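For contrast, a typed-column variant of the same stream does not show this failure mode. A minimal sketch (hypothetical stream name; the real typed schema may differ), where the only structural change is replacing value dynamic with a concrete type:

```sql
CREATE STREAM IF NOT EXISTS default.stream_abc_typed
(
    chain_id uint16,
    address low_cardinality(string) CODEC(ZSTD(3)),
    `key` low_cardinality(string) CODEC(ZSTD(1)),
    value string CODEC(ZSTD(1)),  -- typed column instead of `dynamic`
    _index uint128 CODEC(Delta(8), ZSTD(1))
)
ENGINE = Stream(128, 1, city_hash64((chain_id, address)))
PRIMARY KEY (chain_id, address, `key`, _index)
ORDER BY (chain_id, address, `key`, _index);
```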
Queries that trigger it
- No query needed. Repro occurs on server restart/bootstrap.
Expected behavior
Startup should complete and query engine should become available.
Even with a Dynamic/JSON stream backlog, Proton should not enter a tiny-part/commit-pressure cycle that ends in OOM and termination.
Error message and/or stacktrace
Main fatal error:
Loading of outdated parts failed. Will terminate to avoid undefined behaviour due to inconsistent set of parts. Exception: Code: 241. DB::Exception: Memory limit (total) exceeded ... (MEMORY_LIMIT_EXCEEDED)
Relevant stack path:
DB::MergeTreeData::loadOutdatedDataParts(bool)
DB::MergeTreeData::loadDataPart(...)
DB::IMergeTreeDataPart::loadColumnsChecksumsIndexes(...)
DB::IMergeTreeDataPart::loadIndex()
Also frequently before fatal:
No available threads in background commit pool with size=...
TOO_MANY_PARTS
Additional context
Code paths that look related:
- Immediate commit for dynamic subcolumns:
- Retry loop on commit exceptions:
- Commit scheduling timeout path:
- Fatal terminate on outdated-part loading failure:
- OOM in part index load path:
- Runtime tuning limitations for Stream settings:
Because startup does not complete, runtime mitigation by query is not possible in the failing state.