From 99 million to 1.9 million: what an ingest benchmark doesn't tell you

From 99 million to 1.9 million: what an ingest benchmark doesn't tell you

We measured an embedded ClickHouse inserting ~99 million rows/second. Then we
put a real database in front of it — a network endpoint, line/msgpack parsing, a
durable write-ahead log, series indexing — and ran a ramp test on the same
machine. It ingests ~1.9 million points/second at the wall, ~1.5M with latency
you'd actually ship. This is the story of the missing 98%, and why the number on
the benchmark slide is almost never the number you get.


In an earlier post we pushed
chDB — ClickHouse compiled into a library, running in-process — to ~99 million
rows/second
on a 24-core/48-thread AMD EPYC 7413. The recipe was: pre-built
Arrow RecordBatches, a connection pool with one connection per core, a narrow
series_id fact table, and 200k-row batches inserted straight through the FFI.
No network. No parsing. No durability. Just the engine, fed exactly the shape it
likes, as fast as 48 cores can convert-sort-compress.

That number is real, and it's useful: it's the ceiling. But it's a component
benchmark — it measures the insert engine in isolation. HyperbyteDB is a
time-series database built on that engine, and a database has to do all the
things the microbenchmark deliberately skipped:

  • accept writes over HTTP, from many clients at once;
  • parse them (InfluxDB line protocol or msgpack) into typed points;
  • hash each point's tag-set into a stable series_id and check cardinality;
  • build the columnar Arrow batch;
  • durably log every write to a WAL before acknowledging, so a crash doesn't
    lose data;
  • and only then flush to chDB.

So we asked the obvious question: on the same EPYC box, with the same chDB
underneath, how many points/second can the whole database actually ingest — and
where does it stop?

We ran it as a ramp, not a saturation test, because the difference between
those two is most of the point of this post.


The ramp

We drove the server with k6 using an
open-model load generator (constant-arrival-rate): we offer a fixed
number of write requests per second and watch what comes back. Each request is a
50,000-point msgpack batch, pre-encoded to a file so the client is never the
bottleneck. We stepped the offered rate up and recorded, at each step, the
achieved throughput, the 95th-percentile request latency, and how many requests
k6 had to drop because it couldn't even start them on time.

Offered Offered pts/s Achieved pts/s % of offered p95 latency dropped
8 req/s 400,000 397,664 99.4% 248 ms 0
16 req/s 800,000 794,445 99.3% 220 ms 0
24 req/s 1,200,000 1,190,597 99.2% 226 ms 0
30 req/s 1,500,000 1,486,613 99.1% 281 ms 0
34 req/s 1,700,000 1,624,099 95.5% 733 ms 0
40 req/s 2,000,000 1,777,944 88.9% 3.46 s 0
50 req/s 2,500,000 1,898,060 75.9% 5.89 s 54

There are three regimes here, and they matter more than any single number:

  1. Linear (≤ 1.5M pts/s). Achieved tracks offered to within 1%, and p95
    latency sits flat around a quarter-second. The database is comfortably
    keeping up; you could run here all day.
  2. The knee (~1.7M). Achieved starts falling behind offered (95.5%), and p95
    latency 2.6×'s from 281 ms to 733 ms. The queue is starting to back up.
  3. Saturation (≥ 2.0M). Throughput flattens toward a ceiling — it crawls
    from 1.78M to 1.9M while offered load jumps 25% — and latency goes to
    seconds. At 2.5M offered, k6 gives up on 54 requests entirely.

Peak achievable: ~1.9M points/second. Comfortably sustainable: ~1.5M. On the
same box where the engine alone does 99M.


Why the closed-loop benchmark would have lied to us

Before the ramp, we ran the conventional thing: a closed-model saturation
test — N virtual users, each firing the next request as soon as the last
returns, sweeping N from 64 to 512. Here's what that reported:

Virtual users points/s p95 latency errors
64 1.63M 2.29 s 0
128 1.57M 4.50 s 0
256 1.68M 8.79 s 0
512 1.65M 19.57 s 0

Read that table the way a benchmark slide would: "1.6M points/sec, zero
errors, scales cleanly to 512 concurrent clients."
Every run "passed." Nothing
failed. If you're reporting a headline throughput number, this is the table you
screenshot.

It's also profoundly misleading. Throughput is flat while latency climbs
linearly with concurrency — from 2.3s to 19.6s — which is the unmistakable
signature of a system that saturated at 64 users and spent the next 448 just
queueing. A closed loop can't drop a request; it just waits. So it never shows
you backpressure. It reports a peak (1.6M) as if it were an operating point,
when in reality nobody would accept 19-second write latency.

The open-model ramp tells the truth the closed loop hides: the usable number
is ~1.5M (before the latency knee), the peak is ~1.9M, and past that the
system sheds load. Same server, same minute — the benchmark methodology alone
moved the story from "1.6M, scales great" to "1.5M usable, 1.9M cliff."

That's the first way benchmarks don't show real-world values: a saturation
number measured without a latency budget is a number no one can run at.


Where it's actually constrained

So why 1.9M and not 99M? We instrumented every stage of the write path
(Prometheus histograms on parse, series registration, Arrow build, and WAL
commit) and watched the box under load. Two findings.

Finding 1: the durable WAL is single-threaded — and it's the binding
constraint.
Every write is acknowledged only after it's in a RocksDB
write-ahead log. HyperbyteDB uses a group-commit design: a single writer thread
drains a queue, coalesces many requests into one RocksDB batch, and commits. At
saturation that one thread was processing ~35 requests (1.75M points) per batch
in ~1.0–1.3s — i.e. it tops out around 1.3–1.7M points/s, which is exactly
the ceiling we measured. Under load the box was only ~60% busy with the server
pinned to ~6–10 of its 48 cores: not compute-bound, serialization-bound.

We landed one optimization for this post. The writer was bincode-serializing
every point inline, on the single thread. Since those bytes don't depend on
the commit sequence number, we moved the serialization off the writer into
the parallel per-request tasks; the writer now just commits pre-encoded bytes.
That lifted the ceiling ~1.3M → ~1.65M (+27%) with no change to durability
semantics. The remaining writer cost is the RocksDB commit itself (a ~175MB
batch) plus per-row sequence stamping — both still single-threaded, and the next
wall.

Finding 2: even with a free WAL, the front door has a per-point CPU floor.
Strip the WAL away entirely and you still pay, per request, for the work that
makes this a database instead of a memcpy:

Stage (per 50k-point request) Time Notes
msgpack parse 64 ms typed points
series-id hashing + cardinality 478 ms the expensive part
Arrow batch build 264 ms columnar encode
total parallelizable work ~806 ms 16 µs / point

16 µs/point, spread across 48 cores, is a ~3M points/s ceiling — before a
single byte is made durable. To hit 100M points/s you'd need ~480 ns/point of
total budget; we spend ~16,000 ns. The front door is ~33× over budget on CPU
alone, and that's on a friendly, low-cardinality dataset where the schema is
cached. The headline engine number lives on the other side of all of this.

Finding 3 (the reassuring one): chDB isn't the problem. When we measured
HyperbyteDB's flush path — the WAL-to-chDB side, using the same connection
pool from the original benchmark — it sustained ~22M rows/s, and was
starved: ingest could only feed it 1.5M/s, so it was never working hard. The
engine underneath is fine. The cost is everything you wrap around it to make it
usable.


The real lesson: every layer is an order of magnitude

Line them up and the shape is unmistakable:

What you're measuring Throughput What it includes
chDB raw insert (component benchmark) ~99M rows/s pre-built Arrow → 48 conns → 48 tables. No network, parse, or durability.
HyperbyteDB flush → chDB ~22M rows/s columnar Arrow + the connection pool + part-commit. Still no front door.
HyperbyteDB end-to-end, peak ~1.9M pts/s HTTP + parse + series hashing + durable WAL. Latency in seconds.
HyperbyteDB end-to-end, sustainable ~1.5M pts/s the same, at a latency you'd actually run (p95 ~250 ms).

Each step down is roughly an order of magnitude, and each step is something
the layer above it deliberately didn't measure. 99M is the engine doing nothing
but inserting. 22M is the engine doing durable, columnar, pooled inserts. 1.9M
is the engine plus a network, a parser, an index, and a write-ahead log. 1.5M is
all of that, measured honestly — at an acceptable latency rather than at the
cliff.

None of these numbers is "wrong." The 99M is a true ceiling and a useful one —
it tells you the engine will never be your bottleneck, which is exactly what
Finding 3 confirmed. But it is not, and was never, an ingest rate. The number
you can put in a capacity plan is the one at the bottom of that table: ~1.5
million points per second, on a 48-thread server, end to end, durable, at
sub-second p95
— about 66× below the engine's component number.

So when you read "X million inserts per second," ask the three questions this
ramp answered: Inserts of what, through what? (a raw columnar batch, or a
parsed-and-indexed write over the wire). At what latency? (the knee, or the
cliff). And measured how? (an open model that can drop requests, or a closed
loop that just queues and calls it a pass). The gap between the slide and
production is usually hiding in those three answers.


Reproducing these numbers

Hardware: AMD EPYC 7413, 24 cores / 48 threads, 251 GB RAM, Debian 12, the same
box as the chDB component benchmark. Single HyperbyteDB node, pool_size=32,
max_points_per_batch=200000, msgpack writes, a low-cardinality dataset (a
handful of series across cpu/memory/disk) chosen to stress the front door
rather than chDB. Numbers are single representative runs and vary ±10%
run-to-run; the shape — linear region, knee, cliff — is stable.

# Pre-generate a fixed 50k-point msgpack payload (no client-side encoding cost):
python3 gen_payload.py 50000 payload_50k.msgpack

# Open-model ramp: step the offered request rate and watch achieved pts/s,
# p95 latency, and dropped_iterations (the saturation signal a closed loop hides):
for R in 8 16 24 30 34 40 50; do
  RATE=$R POINTS_PER_REQUEST=50000 DURATION=20s \
    PAYLOAD_FILE=payload_50k.msgpack k6 run --quiet ramp_ingest.js
done

# Closed-model contrast (the table that "passes" but lies):
for V in 64 128 256 512; do
  VUS=$V POINTS_PER_REQUEST=50000 DURATION=15s \
    PAYLOAD_FILE=payload_50k.msgpack k6 run --quiet bench_ingest.js
done