From 99 million to 1.9 million: what an ingest benchmark doesn't tell you
We measured an embedded ClickHouse inserting ~99 million rows/second. Then we
put a real database in front of it — a network endpoint, line/msgpack parsing, a
durable write-ahead log, series indexing — and ran a ramp test on the same
machine. It ingests ~1.9 million points/second at the wall, ~1.5M with latency
you'd actually ship. This is the story of the missing 98%, and why the number on
the benchmark slide is almost never the number you get.
In an earlier post we pushed
chDB — ClickHouse compiled into a library, running in-process — to ~99 million
rows/second on a 24-core/48-thread AMD EPYC 7413. The recipe was: pre-built
Arrow RecordBatches, a connection pool with one connection per core, a narrowseries_id fact table, and 200k-row batches inserted straight through the FFI.
No network. No parsing. No durability. Just the engine, fed exactly the shape it
likes, as fast as 48 cores can convert-sort-compress.
That number is real, and it's useful: it's the ceiling. But it's a component
benchmark — it measures the insert engine in isolation. HyperbyteDB is a
time-series database built on that engine, and a database has to do all the
things the microbenchmark deliberately skipped:
- accept writes over HTTP, from many clients at once;
- parse them (InfluxDB line protocol or msgpack) into typed points;
- hash each point's tag-set into a stable
series_idand check cardinality; - build the columnar Arrow batch;
- durably log every write to a WAL before acknowledging, so a crash doesn't
lose data; - and only then flush to chDB.
So we asked the obvious question: on the same EPYC box, with the same chDB
underneath, how many points/second can the whole database actually ingest — and
where does it stop?
We ran it as a ramp, not a saturation test, because the difference between
those two is most of the point of this post.
The ramp
We drove the server with k6 using an
open-model load generator (constant-arrival-rate): we offer a fixed
number of write requests per second and watch what comes back. Each request is a
50,000-point msgpack batch, pre-encoded to a file so the client is never the
bottleneck. We stepped the offered rate up and recorded, at each step, the
achieved throughput, the 95th-percentile request latency, and how many requests
k6 had to drop because it couldn't even start them on time.
| Offered | Offered pts/s | Achieved pts/s | % of offered | p95 latency | dropped |
|---|---|---|---|---|---|
| 8 req/s | 400,000 | 397,664 | 99.4% | 248 ms | 0 |
| 16 req/s | 800,000 | 794,445 | 99.3% | 220 ms | 0 |
| 24 req/s | 1,200,000 | 1,190,597 | 99.2% | 226 ms | 0 |
| 30 req/s | 1,500,000 | 1,486,613 | 99.1% | 281 ms | 0 |
| 34 req/s | 1,700,000 | 1,624,099 | 95.5% | 733 ms | 0 |
| 40 req/s | 2,000,000 | 1,777,944 | 88.9% | 3.46 s | 0 |
| 50 req/s | 2,500,000 | 1,898,060 | 75.9% | 5.89 s | 54 |
There are three regimes here, and they matter more than any single number:
- Linear (≤ 1.5M pts/s). Achieved tracks offered to within 1%, and p95
latency sits flat around a quarter-second. The database is comfortably
keeping up; you could run here all day. - The knee (~1.7M). Achieved starts falling behind offered (95.5%), and p95
latency 2.6×'s from 281 ms to 733 ms. The queue is starting to back up. - Saturation (≥ 2.0M). Throughput flattens toward a ceiling — it crawls
from 1.78M to 1.9M while offered load jumps 25% — and latency goes to
seconds. At 2.5M offered, k6 gives up on 54 requests entirely.
Peak achievable: ~1.9M points/second. Comfortably sustainable: ~1.5M. On the
same box where the engine alone does 99M.
Why the closed-loop benchmark would have lied to us
Before the ramp, we ran the conventional thing: a closed-model saturation
test — N virtual users, each firing the next request as soon as the last
returns, sweeping N from 64 to 512. Here's what that reported:
| Virtual users | points/s | p95 latency | errors |
|---|---|---|---|
| 64 | 1.63M | 2.29 s | 0 |
| 128 | 1.57M | 4.50 s | 0 |
| 256 | 1.68M | 8.79 s | 0 |
| 512 | 1.65M | 19.57 s | 0 |
Read that table the way a benchmark slide would: "1.6M points/sec, zero
errors, scales cleanly to 512 concurrent clients." Every run "passed." Nothing
failed. If you're reporting a headline throughput number, this is the table you
screenshot.
It's also profoundly misleading. Throughput is flat while latency climbs
linearly with concurrency — from 2.3s to 19.6s — which is the unmistakable
signature of a system that saturated at 64 users and spent the next 448 just
queueing. A closed loop can't drop a request; it just waits. So it never shows
you backpressure. It reports a peak (1.6M) as if it were an operating point,
when in reality nobody would accept 19-second write latency.
The open-model ramp tells the truth the closed loop hides: the usable number
is ~1.5M (before the latency knee), the peak is ~1.9M, and past that the
system sheds load. Same server, same minute — the benchmark methodology alone
moved the story from "1.6M, scales great" to "1.5M usable, 1.9M cliff."
That's the first way benchmarks don't show real-world values: a saturation
number measured without a latency budget is a number no one can run at.
Where it's actually constrained
So why 1.9M and not 99M? We instrumented every stage of the write path
(Prometheus histograms on parse, series registration, Arrow build, and WAL
commit) and watched the box under load. Two findings.
Finding 1: the durable WAL is single-threaded — and it's the binding
constraint. Every write is acknowledged only after it's in a RocksDB
write-ahead log. HyperbyteDB uses a group-commit design: a single writer thread
drains a queue, coalesces many requests into one RocksDB batch, and commits. At
saturation that one thread was processing ~35 requests (1.75M points) per batch
in ~1.0–1.3s — i.e. it tops out around 1.3–1.7M points/s, which is exactly
the ceiling we measured. Under load the box was only ~60% busy with the server
pinned to ~6–10 of its 48 cores: not compute-bound, serialization-bound.
We landed one optimization for this post. The writer was bincode-serializing
every point inline, on the single thread. Since those bytes don't depend on
the commit sequence number, we moved the serialization off the writer into
the parallel per-request tasks; the writer now just commits pre-encoded bytes.
That lifted the ceiling ~1.3M → ~1.65M (+27%) with no change to durability
semantics. The remaining writer cost is the RocksDB commit itself (a ~175MB
batch) plus per-row sequence stamping — both still single-threaded, and the next
wall.
Finding 2: even with a free WAL, the front door has a per-point CPU floor.
Strip the WAL away entirely and you still pay, per request, for the work that
makes this a database instead of a memcpy:
| Stage (per 50k-point request) | Time | Notes |
|---|---|---|
| msgpack parse | 64 ms | typed points |
| series-id hashing + cardinality | 478 ms | the expensive part |
| Arrow batch build | 264 ms | columnar encode |
| total parallelizable work | ~806 ms | ≈ 16 µs / point |
16 µs/point, spread across 48 cores, is a ~3M points/s ceiling — before a
single byte is made durable. To hit 100M points/s you'd need ~480 ns/point of
total budget; we spend ~16,000 ns. The front door is ~33× over budget on CPU
alone, and that's on a friendly, low-cardinality dataset where the schema is
cached. The headline engine number lives on the other side of all of this.
Finding 3 (the reassuring one): chDB isn't the problem. When we measured
HyperbyteDB's flush path — the WAL-to-chDB side, using the same connection
pool from the original benchmark — it sustained ~22M rows/s, and was
starved: ingest could only feed it 1.5M/s, so it was never working hard. The
engine underneath is fine. The cost is everything you wrap around it to make it
usable.
The real lesson: every layer is an order of magnitude
Line them up and the shape is unmistakable:
| What you're measuring | Throughput | What it includes |
|---|---|---|
| chDB raw insert (component benchmark) | ~99M rows/s | pre-built Arrow → 48 conns → 48 tables. No network, parse, or durability. |
| HyperbyteDB flush → chDB | ~22M rows/s | columnar Arrow + the connection pool + part-commit. Still no front door. |
| HyperbyteDB end-to-end, peak | ~1.9M pts/s | HTTP + parse + series hashing + durable WAL. Latency in seconds. |
| HyperbyteDB end-to-end, sustainable | ~1.5M pts/s | the same, at a latency you'd actually run (p95 ~250 ms). |
Each step down is roughly an order of magnitude, and each step is something
the layer above it deliberately didn't measure. 99M is the engine doing nothing
but inserting. 22M is the engine doing durable, columnar, pooled inserts. 1.9M
is the engine plus a network, a parser, an index, and a write-ahead log. 1.5M is
all of that, measured honestly — at an acceptable latency rather than at the
cliff.
None of these numbers is "wrong." The 99M is a true ceiling and a useful one —
it tells you the engine will never be your bottleneck, which is exactly what
Finding 3 confirmed. But it is not, and was never, an ingest rate. The number
you can put in a capacity plan is the one at the bottom of that table: ~1.5
million points per second, on a 48-thread server, end to end, durable, at
sub-second p95 — about 66× below the engine's component number.
So when you read "X million inserts per second," ask the three questions this
ramp answered: Inserts of what, through what? (a raw columnar batch, or a
parsed-and-indexed write over the wire). At what latency? (the knee, or the
cliff). And measured how? (an open model that can drop requests, or a closed
loop that just queues and calls it a pass). The gap between the slide and
production is usually hiding in those three answers.
Reproducing these numbers
Hardware: AMD EPYC 7413, 24 cores / 48 threads, 251 GB RAM, Debian 12, the same
box as the chDB component benchmark. Single HyperbyteDB node, pool_size=32,max_points_per_batch=200000, msgpack writes, a low-cardinality dataset (a
handful of series across cpu/memory/disk) chosen to stress the front door
rather than chDB. Numbers are single representative runs and vary ±10%
run-to-run; the shape — linear region, knee, cliff — is stable.
# Pre-generate a fixed 50k-point msgpack payload (no client-side encoding cost):
python3 gen_payload.py 50000 payload_50k.msgpack
# Open-model ramp: step the offered request rate and watch achieved pts/s,
# p95 latency, and dropped_iterations (the saturation signal a closed loop hides):
for R in 8 16 24 30 34 40 50; do
RATE=$R POINTS_PER_REQUEST=50000 DURATION=20s \
PAYLOAD_FILE=payload_50k.msgpack k6 run --quiet ramp_ingest.js
done
# Closed-model contrast (the table that "passes" but lies):
for V in 64 128 256 512; do
VUS=$V POINTS_PER_REQUEST=50000 DURATION=15s \
PAYLOAD_FILE=payload_50k.msgpack k6 run --quiet bench_ingest.js
done