
perf(proton): 3x faster bulk insert for CSV to Proton ingestion #64

Merged
yokofly merged 8 commits into main from perf/proton-bulk-insert-optimization
Mar 31, 2026
Conversation

@yokofly (Collaborator) commented Mar 30, 2026

Summary

Optimizes the Proton bulk insert path. The main win is removing g.Debug() calls from the hot loop, which caused a 3x slowdown when the -d flag is enabled (-d is recommended for customer debugging).

Benchmark Results (fresh Proton server, 1.66M rows, controlled A/B)

| Variant | With -d (customer debug mode) | Without -d |
| --- | --- | --- |
| Baseline (v1.2.20-timeplus.4) | 37.3s | 13.2s |
| This PR | 13.3s | 13.0s |
| Improvement | 2.8x faster | no regression |

The -d flag is recommended for customer debugging. With -d enabled, the baseline spends most of its CPU on g.Debug() → runtime.Caller() stack walks for every nil value in every row. This PR removes those calls, making -d mode as fast as non-debug mode.

Changes (7 commits)

Performance

  1. Enable LZ4 compression (a9affc1) — reduces bytes on WAN
  2. Remove g.Debug() nil calls in hot loop (60a5ddd) — the main win: 37s → 13s with -d
  3. Remove per-row mutex lock (18193a0) — 1.66M unnecessary lock/unlock cycles

Refactoring

  1. Hoist invariant work out of per-batch loop (1668a82) — extract colClassification struct, compute once per schema change instead of per batch; remove no-op Begin/Commit

Bug fixes (pre-existing, found by Codex review)

  1. Stable idempotent insert keys (in retry fix commit) — old code used time.Now() which changed on retry, causing duplicate inserts in proton-to-proton path. Now uses TaskExecution.ExecID (stable across retries, unique across imports)
  2. Retry-safe array/map converters (in retry fix commit) — converters now accept already-converted native types, so batch retry after transient Send failure no longer causes permanent "expected string, got []string" errors

Infrastructure

  1. Go 1.24.11 → 1.24.13 (d0c311b) — security fixes for crypto/tls, crypto/x509

What we tested and ruled out

  • Batch-level context lock: tested, caused 5-30s regression by blocking CSV reader pipeline during Send(). Dropped.
  • Go 1.24 → 1.26: zero measurable improvement, not worth dependency churn
  • batch_limit increase (50K → 200K): negligible improvement on localhost
  • Multi-file folder import (6 files): no new hotspots in our code; per-file connection lifecycle is architectural (separate PR)
  • 10M row stress test: throughput scales linearly, no memory/GC issues

Data Integrity Verification

  • Row count: 1,660,051 exact match
  • Integer aggregates (sum_id): exact match
  • 5 random row spot-checks (floats like 16954534081.07, Chinese chars, NULLs): all exact
  • Idempotent dedup verified: same import twice without truncate → no duplicates
  • Separate imports verified: two different sling invocations both accepted (not suppressed)

Test plan

  • Controlled A/B benchmark (fresh server, alternating runs)
  • Row count and data integrity verification
  • Idempotent dedup correctness (same import = deduped, different import = accepted)
  • Multi-file folder import (6 files)
  • 10M row stress test
  • Customer UAT on their environment

yokofly and others added 6 commits March 30, 2026 02:13
Enable LZ4 compression for the proton-go-driver native protocol to reduce
bytes on the wire. Minimal impact on localhost (~1s), but significant for
remote connections through LB/NodePort where bandwidth and latency matter.

Benchmark (localhost, 1.66M rows, batch_limit=50K):
  Before: 62s
  After:  61s (~0% on localhost, expected to help on WAN)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
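The driver-level switch looks roughly like this (a sketch assuming proton-go-driver exposes the same Options/Compression API as clickhouse-go v2, from which it is forked; the address and import path are placeholders, not verified against this repo):

```go
// Sketch: enable LZ4 at connection time so native-protocol blocks
// are compressed on the wire (assumed API, mirroring clickhouse-go v2).
conn, err := proton.Open(&proton.Options{
	Addr: []string{"localhost:8463"},
	Compression: &proton.Compression{
		Method: proton.CompressionLZ4,
	},
})
if err != nil {
	return err
}
defer conn.Close()
```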
…t loop

Remove 37 g.Debug() calls that fired for every nil value in every row.
g.Debug() calls runtime.Caller() for stack walking + string formatting +
zerolog event creation on each invocation. With -d flag enabled, 10 nullable
columns, and 1.66M rows, this generated millions of expensive debug calls
that provided zero diagnostic value (nil is normal for nullable columns).

Benchmark (localhost, 1.66M rows, batch_limit=50K, cumulative):
  Baseline:     62s (41s user, 77% CPU)
  + LZ4:        61s
  + this:       28s (10s user, 40% CPU) ← 54% faster, biggest single win

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
The ds.Context.Lock()/Unlock() around batched.Append(row...) was unnecessary:
- batched is a local object created in the operation closure
- batchRows is a local snapshot fully drained from the channel
- No concurrent goroutine accesses either during the append loop

This saved 1.66M mutex lock/unlock cycles.

Benchmark (localhost, 1.66M rows, batch_limit=50K, cumulative):
  Baseline:          62s (41.0s user)
  + LZ4:             61s
  + g.Debug removal: 28s (10.5s user)
  + this:            27s ( 9.8s user) ← ~1s improvement

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…fake transactions

Three structural improvements:

1. Extract colClassification struct + classifyColumns() function:
   Column type classification (the big switch mapping DbType → column index
   lists) is now computed once in BulkImportStream and reused across all
   batches, instead of being rebuilt inside every batch's retry closure.

2. Hoist ValidateColumnNames + GenerateInsertStatement to BulkImportStream:
   These only change on schema change events, not per-batch. With
   batch_limit=1000 and 1660 batches, this avoids 1659 redundant
   recomputations per import.

3. Remove fake Begin/Commit per batch:
   Proton does not support transactions (line 168 comment says so).
   The driver's Begin() returns nil, Commit() returns nil. These were
   pure ceremony adding method-chain overhead per batch.

The retry loop now only contains: PrepareBatch → Append rows → Send.

Benchmark (localhost, 1.66M rows, cumulative):
  batch_limit=50K:  27s → 28s (no change, expected on localhost)
  batch_limit=1000: 36s → 36s (no change on localhost; expected to help
                     on WAN where per-batch CPU overhead compounds with
                     network round-trip latency)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Security fixes since 1.24.11:
- go1.24.12: security fixes in crypto/tls, archive/zip, net/url, go cmd
- go1.24.13: security fixes in crypto/tls, crypto/x509, go cmd

Relevant for fintech deployments connecting through TLS-terminated
load balancers.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
go mod tidy with Go 1.24.13 removed the local replace directives
for github.com/flarco/g and github.com/slingdata-io/sling.
These are required for development against our local forks.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
yokofly linked an issue on Mar 30, 2026 that may be closed by this pull request
yokofly force-pushed the perf/proton-bulk-insert-optimization branch 10 times, most recently from b94f5c9 to edb2fcc on March 31, 2026 06:02
… inserts

Fix two pre-existing bugs found by Codex review:

P1 - Idempotent ID unstable across whole-import retries:
  runProtonToProton retries by calling runFileToDB() from scratch,
  which creates a new connection and re-enters BulkImportStream().
  The old code generated idempotent_id from time.Now().UnixNano(),
  so retried batches got new IDs and were inserted again.
  Fix: use TaskExecution.ExecID (a KSUID generated once per task) as
  the session component of the idempotent key. ExecID is set in
  runFileToDB via conn.SetProp("exec_id", t.ExecID) and read in
  processBatch. The resulting key is: exec_id + table + batch_number.
  This is stable across retries (same ExecID) but different across
  separate imports (new ExecID per task).

P2 - In-place row mutation breaks retry for array/map types:
  batchRows is reused across retries. Type converters (e.g.,
  convertToArrayString) mutate row values in-place from string to
  []string. On retry, the converter sees []string instead of string
  and returns "expected string, got []string".
  Fix: add idempotent type-check at the top of all 22 converters.
  If the value is already the target type, return it as-is.

Both bugs are pre-existing (identical in v1.2.20-timeplus.4).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
yokofly force-pushed the perf/proton-bulk-insert-optimization branch from edb2fcc to 9d8ca07 on March 31, 2026 06:04
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
yokofly force-pushed the perf/proton-bulk-insert-optimization branch from 9d8ca07 to 0ef27eb on March 31, 2026 06:11
yokofly merged commit d20ae91 into main on Mar 31, 2026
yokofly deleted the perf/proton-bulk-insert-optimization branch on March 31, 2026 06:12

Linked issue (may be closed by this pull request): Ingestion slow