Scaling chDB inserts from 5 million to 99 million rows/second
How an embedded ClickHouse — running in-process, not as a server — scales its
write path with a connection pool, and how far that actually goes when you give
it real cores.
chDB is ClickHouse compiled into a library. There's no server, no socket, no
network round-trip: you link libchdb.so into your process and run SQL against
an embedded engine. That's wonderful for an ingest pipeline — the data never
leaves your address space — but it raises an obvious question for anyone who has
run ClickHouse the normal way: a real ClickHouse server scales inserts
horizontally by accepting many client connections at once. Does the embedded
engine do the same thing, or does running in-process quietly serialize you down
to one core?
We built a set of Rust microbenchmarks against the prebuilt libchdb.so
(ClickHouse v26 vintage) to answer that, and to find the absolute ceiling for our
ingest shape. Most numbers below come from a 6-core/12-thread laptop chip;
the final scaling sweep runs the same binary on a 24-core/48-thread server
to find where the connection lever actually tops out. The benchmarks use chDB's
zero-copy Arrow insert path — Arrow RecordBatch → FFI →INSERT INTO t SELECT * FROM ArrowStream(...) — which is roughly 10× faster thanINSERT … VALUES and is what a columnar ingest pipeline would actually use.
The headline: an 8-connection pool sustains ~26 million rows/second into a
single table on a 6-core laptop chip — and the same code, unchanged, hits ~99
million rows/second on a 24-core server. A single connection already clears 5–7
million rows/second with the right schema. All of it is comfortably past our
1M-rows/s design target. Here's how each lever contributes — and how far the
connection lever scales when you stop being core-bound.
Lever 1: schema — a single connection, 2.5M → 7.5M samples/s
Before touching concurrency, it's worth knowing where the time in a single
insert actually goes. We measured the Arrow→ClickHouse conversion cost in
isolation (same batch into an ENGINE = Null sink) and compared it to the full
insert into a real MergeTree. The insert is write-bound, not
conversion-bound — conversion is only ~28% of the time; the other ~72% is
sort + compress + part-write.
That single fact decides the schema. If the engine spends most of its time
writing columns to disk, then the layout that writes the fewest and narrowest
columns wins. We compared four ways of storing the same 1,000,000 samples
(1,000 series × 1,000 timestamps, 6 tags, 5 integer fields), inserted in
50k-row batches on one connection:
| Layout | samples/s | rows/s | convert share | on-disk |
|---|---|---|---|---|
cols (one column/tag) |
2,489,553 | 2,489,553 | 36% | 15.3 MiB |
map(tags) |
3,608,356 | 3,608,356 | 43% | 16.7 MiB |
series_id (wide fact) |
7,493,418 | 7,493,418 | 28% | 15.3 MiB |
series_id + long |
2,361,445 | 11,807,228 | 26% | 19.2 MiB |
series_id (Buffer) |
5,341,809 | 5,341,809 | 33% | 15.3 MiB |
The winner is the series_id layout: tags are hashed to a single UInt64
series id and normalized out into a small dimension table, so the hot "fact"
table is fixed-width — (time, series_id, field…) with ORDER BY (series_id, time). Storing tags as one column each (cols) forces the MergeTree sort key to
sort six string columns per block; folding them into one UInt64 is 3× faster
to insert and writes the same bytes to disk. That's the entire gap, and it's
free — same data, smaller sort key.
(The series_id + long row is a curiosity: it explodes each sample into one row
per field, so it does 5× the rows and "only" 2.36M samples/s — but that's
11.8M rows/s, which tells you the raw row engine is fast; the cost is always in
how wide and how sorted the rows are.)
Takeaway: the cheapest 3× you'll ever get is picking the schema that matches
the write path. Conversion-thread and insert-thread knobs (max_threads,max_insert_threads) don't move a single insert at all — one Arrow stream
becomes one block becomes one part becomes one writer, so there's nothing to
split. Schema is the lever, not parallelism settings.
Lever 2: connections — does the embedded engine overlap inserts?
Now the real question. In a single process, can two inserts run at the same
time? The answer is buried in how chDB is wired:
- The
EmbeddedServeris a process-global singleton — one engine, one data
path per process. Asking for a second connection to a different path is
rejected outright (we probe this in the benchmark and it fails, as expected). - But each connection is its own
ChdbClientwith its own lock
(client_mutex), and every query holds that lock for its full duration. So
two queries on the same connection serialize — one connection is strictly
one-insert-at-a-time. Multiple connections, though, are independent clients on
the one shared engine, and those can genuinely overlap.
So the model is exactly like a real ClickHouse server: you scale by opening more
client connections, each issuing inserts concurrently. We tested both arms —
hammering one shared connection from N threads, versus giving each of N threads
its own connection — sweeping 1 → 8 workers:
| W | shared 1 connection | connection per worker |
|---|---|---|
| 1 | 6,047,261 rows/s | 5,303,067 rows/s |
| 2 | 7,486,733 (1.24×) | 9,904,246 (1.87×) |
| 4 | 8,052,054 (1.33×) | 17,043,113 (3.21×) |
| 8 | 8,345,705 (1.38×) | 22,691,351 (4.28×) |
(48 inserts × 200,000 rows = 9.6M rows total, each worker writing its own table.)
The shared connection is flat — it tops out around 1.38× no matter how many
threads pile on, because they all queue behind that one client's mutex. Give each
worker its own connection and it scales 4.3×, to nearly 23 million rows/s.
That confirms the engine really does overlap inserts across clients; the only
thing serializing you is sharing a single connection.
The same shape holds with small batches — it's just lower in absolute terms
because per-insert overhead dominates (20k-row inserts): the shared connection
crawls from 3.4M to 4.8M rows/s and regresses at W=8, while
connection-per-worker climbs from 4.0M to 13.3M rows/s (3.34×). Concurrency is
the binding lever precisely in this small/frequent-flush regime, where a single
connection's throughput collapses.
Putting both levers together: 26M rows/s into one table
The table above lets each worker write its own table, which is the friendliest
case (no contention on a shared table's part-commit lock). The real ingest
target is harder: many connections all writing one series_id table. We ran
a connection pool toward that, 12.8M rows split across the pool in 200k-row
batches:
| Workers | wall (s) | rows/s | speedup | ≥ 1M/s |
|---|---|---|---|---|
| 1 | 2.221 | 5,764,271 | 1.00× | yes |
| 2 | 1.097 | 11,665,425 | 2.02× | yes |
| 4 | 0.657 | 19,491,238 | 3.38× | yes |
| 8 | 0.484 | 26,419,637 | 4.58× | yes |
Even with every connection contending on the same table, an 8-connection pool
sustains 26.4 million rows/second — 4.58× a single connection, which itself
is already ~6× our 1M-rows/s goal. The shared-table case actually edged out the
separate-table case here (26.4M vs 22.7M); both land in the same ~22–26M band,
and the part-commit contention people worry about turns out to be small, because
the expensive convert/sort/compress all runs before the brief commit.
How far does it actually go? 50M was the floor, not the ceiling
The obvious next question: keep adding connections — where does it top out, and
can we hit 50M? We built a ceiling-finder (pushit) that sweeps connection
counts past the core count, in the friendliest setup (connection-per-worker,
separate tables, 200k-row batches), using fixed work per worker so each added
connection adds exactly one core's worth of load.
On the 6-core laptop chip (AMD Ryzen 5 8500G, 6 physical cores / 12 threads), the
answer was a hard no — it plateaus at ~26M:
| W | rows/s | speedup | rows/s per worker |
|---|---|---|---|
| 6 | 20.3M | 3.61× | 3.4M |
| 8 | 20.1M | 3.58× | 2.5M |
| 12 | 25.2M | 4.49× | 2.1M |
| 16 | 25.7M | 4.58× | 1.6M |
The per-worker column falling off a cliff is the tell: inserts are CPU-bound
(convert + sort + compress), so six physical cores cap you at ~6× regardless of
how many connections you open. SMT buys a little through W=12; past that, extra
connections just oversubscribe. 50M is simply not reachable on six cores — it's a
hardware wall, not a software one.
So we ran the same binary, unchanged on a 24-core server (AMD EPYC 7413, 24
physical cores / 48 threads). It scales almost linearly up to the physical-core
count and sails past 50M:
| W | rows/s | speedup | ≥50M |
|---|---|---|---|
| 4 | 16.8M | 3.95× | no |
| 8 | 38.1M | 8.95× | no |
| 16 | 69.1M | 16.24× | yes |
| 24 | 90.7M | 21.32× | yes |
| 48 | 98.8M | 23.23× | yes |
| 64 | 96.4M | 22.67× | yes (oversubscribed) |
Peak ~98.8 million rows/second — nearly 2× the 50M target. The shape is
textbook CPU-bound scaling: near-linear to W=24 (one connection per physical
core, 21×), a smaller bump from SMT out to W=48 (the 48 threads, ~99M), then a
regression at W=64 as oversubscription adds context-switching with no spare
cores to absorb it. 50M is first crossed around W=12 — you don't even need to
saturate the box to double the target.
Two things worth calling out. First, the single-connection number is actually
lower on the EPYC (4.25M vs 5.6M rows/s) — its cores clock lower than the
laptop's — yet aggregate throughput is ~4× higher purely because there are 4× the
cores to overlap across. chDB insert throughput tracks physical core count, not
per-core clock. Second, this is the same story a real ClickHouse server
tells: you scale writes by adding concurrent client connections, bounded by
cores. The only chDB-specific caveat is the process-global engine — one
engine/data-path per process — so to scale past a single machine's cores you run
multiple processes, each its own embedded engine.
Two more things that matter for real ingest
A bulk backfill is the easy case. Production ingest is small, frequent flushes —
and one app-level flush usually fans out across many tables
(<db>_<rp>_<measurement>). Two findings there:
Buffer tables earn their keep when flushes are small. Inserting one timestamp
at a time (1,000 flushes × 1,000 rows), the direct path sprays a part per insert;
routing through an ENGINE = Buffer that coalesces in memory gives 1.64× the
throughput and 1 part instead of 7+. Across the realistic 8-table fan-out
(mp, 2 retention policies × 4 measurements), the Buffer wins ~1.8× and
collapses to 1 part per table (8 vs 34). Fewer parts means less merge pressure
later, and insert calls that return fast. For big bulk loads the Buffer is pure
overhead (an extra in-memory copy) — its payoff scales with flush frequency.
async_insert does not replace a Buffer for the Arrow path. ClickHouse's
async-insert batching only applies to inline-data inserts (VALUES/FORMAT),
not to INSERT … SELECT FROM ArrowStream(...). In this build it coalesced parts
modestly (5 vs 8) but moved throughput by zero. To coalesce the columnar path you
want a Buffer table, not async_insert.
The recipe for fastest inserts
If you're inserting into embedded chDB and want the ceiling:
- Insert via Arrow, not VALUES. ~10× faster; it's the columnar path.
- Pick a write-friendly schema. A narrow, fixed-width fact table with a
single-column sort key (series_id) beats tag-per-column by 3× on insert, at
the same on-disk size. The insert is write-bound — minimize columns written
and the width of the sort key. - Use big batches. 200k-row batches amortize per-insert pipeline overhead;
single-connection throughput jumps from ~2.5M to 5–7M rows/s as batches grow. - Scale with a connection pool, one connection per worker. Never share a
connection across threads — the per-client mutex serializes you. This scales
near-linearly with physical cores: ~26M rows/s on 6 cores, ~99M on 24. Open
roughly one connection per physical core; past that, SMT adds a little and
then oversubscription costs you. The hard ceiling is core count and a single
engine/data-path per process — to go beyond one machine's cores, run multiple
processes. - Add a Buffer table for small, frequent, fanned-out flushes. ~1.8–2×
throughput and far fewer parts — exactly the regime real ingest lives in.
The short version: chDB's embedded engine scales inserts the same way the
ClickHouse server does — through concurrent connections, bounded by physical
cores — and with the right schema and a connection pool, an in-process database
sustains tens of millions of rows per second on a laptop and ~100 million on a
single server.