
One Thread to Poll Them All: How a Single Pipe Made WaterDrop 50% Faster

This is Part 2 of the "Karafka to Async Journey" series. Part 1 covered WaterDrop's integration with Ruby's async ecosystem and how fibers can yield during Kafka dispatches. This article covers another improvement in this area: migration of the producer polling engine to file descriptor-based polling.

When I released WaterDrop's async/fiber support in September 2025, the results were promising - fibers significantly outperformed multiple producer instances while consuming less memory. But something kept nagging me.

Every WaterDrop producer spawns a dedicated background thread for polling librdkafka's event queue. For one or two producers, nobody cares. But Karafka runs in hundreds of thousands of production processes. Some deployments use transactional producers, where each worker thread needs its own producer instance. Ten worker threads means ten producers and ten background polling threads - each competing for Ruby's GVL, each consuming memory, each doing the same repetitive work. Things will get even more intense once the Karafka consumer becomes async-friendly - work that is already under development.

The Thread Problem

Every time you create a WaterDrop producer, rdkafka-ruby spins up a background thread (rdkafka.native_kafka#<n>) that calls rd_kafka_poll(timeout) in a loop. Its job is to check whether librdkafka has delivery reports ready and to invoke the appropriate callbacks.

With one producer, you get one extra thread. With 25, you get 25. Each consumes roughly 1MB of stack space. Each competes with your application threads for the GVL. And most of the time, they're doing nothing - sleeping inside poll(timeout), waiting for events that may arrive once every few milliseconds.

I wanted one thread that could monitor all producers simultaneously, reacting only when there's actual work to do.

How librdkafka Polling Works (and Why It's Wasteful)

librdkafka is inherently asynchronous. When you produce a message, it gets buffered internally and dispatched by librdkafka's own I/O threads. When the broker acknowledges delivery, librdkafka places a delivery report on an internal event queue. rd_kafka_poll() drains that queue and invokes your callbacks.

The problem is how rd_kafka_poll(timeout) waits. Calling rd_kafka_poll(250) blocks for up to 250 milliseconds. From Ruby's perspective, this is a blocking C function call. The rdkafka-ruby FFI binding releases the GVL during this call so other threads can run, but the calling thread is stuck until either an event arrives or the timeout expires.

Every rd_kafka_poll(timeout) call must release the GVL before entering C and reacquire it afterward. This cycle happens continuously, even when the queue is empty. With 25 producers, that's 25 threads constantly cycling through GVL release/reacquire. And there's no way to say "watch these 25 queues and wake me when any of them has events."

The File Descriptor Alternative

Luckily for me, librdkafka has a lesser-known API that solves both problems: rd_kafka_queue_io_event_enable().

You can create an OS pipe and hand the write end to librdkafka:

int pipefd[2];
if (pipe(pipefd) != 0) { /* handle errno */ }
/* write the byte "1" to pipefd[1] on each empty -> non-empty transition */
rd_kafka_queue_io_event_enable(queue, pipefd[1], "1", 1);

Whenever the queue transitions from empty to non-empty, librdkafka writes a single byte to the pipe. The actual events are still on librdkafka's internal queue - the pipe is purely a wake-up signal. This is edge-triggered: it only fires on the empty-to-non-empty transition, not per-event.
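The mechanism is easy to see with a plain Ruby pipe standing in for the one handed to librdkafka (a sketch, not WaterDrop code):

```ruby
# A plain Ruby pipe standing in for the librdkafka wake-up pipe.
reader, writer = IO.pipe

writer.write("1")                          # queue went empty -> non-empty
ready, = IO.select([reader], nil, nil, 1)  # poller wakes on readability
byte = reader.read_nonblock(1) if ready    # drain the signal byte
# byte is "1"; the actual events still sit on the internal queue
```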

The read end of the pipe is a regular file descriptor that works with Ruby's IO.select. The Poller thread spends most of its time in IO.select, which handles GVL release natively. When a pipe signals readiness, we call poll_nb(0) - a non-blocking variant that skips GVL release entirely:

100,000 iterations:
  rd_kafka_poll:    ~19ms (5.1M calls/s) - releases GVL
  rd_kafka_poll_nb: ~12ms (8.1M calls/s) - keeps GVL
  poll_nb is ~1.6x faster

Instead of 25 threads each paying the GVL tax on every iteration, one thread pays it once in IO.select and then drains events across all producers without GVL overhead.

One Thread to Poll Them All

By default, a singleton Poller manages all FD-mode producers in a single thread.

When a producer is created with config.polling.mode = :fd, it registers with the global Poller instead of spawning its own thread. The Poller creates a pipe for each producer and tells librdkafka to signal through it.

The polling loop calls IO.select on all registered pipes. When any pipe becomes readable, the Poller drains it and runs a tight loop that processes events until the queue is empty or a configurable time limit is hit:

def poll_drain_nb(max_time_ms)
  deadline = monotonic_now + max_time_ms
  loop do
    events = rd_kafka_poll_nb(0)
    return true if events.zero?       # fully drained
    return false if monotonic_now >= deadline  # hit time limit
  end
end

When IO.select times out (~1 second by default), the Poller does a periodic poll on all producers regardless of pipe activity - a safety net for edge cases like OAuth token refresh that may not trigger a queue write. Regular events, including statistics.emitted callbacks, do write to the pipe and wake the Poller immediately.
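Putting those pieces together, the Poller's main loop can be sketched roughly like this (class and method names here are illustrative, not WaterDrop's actual internals):

```ruby
# Hedged sketch of the Poller loop; names are illustrative only.
class ToyPoller
  PERIODIC_TIMEOUT = 1.0 # seconds; the ~1s safety-net interval

  def initialize
    @pipes = {} # read end of each pipe => its producer
  end

  # In WaterDrop the write end would be handed to librdkafka via
  # rd_kafka_queue_io_event_enable; here we return it to the caller.
  def register(producer)
    reader, writer = IO.pipe
    @pipes[reader] = producer
    writer
  end

  def run_once
    ready, = IO.select(@pipes.keys, nil, nil, PERIODIC_TIMEOUT)

    if ready
      ready.each do |reader|
        reader.read_nonblock(16) # drain the wake-up byte(s)
        @pipes[reader].drain     # tight poll_nb loop for this producer
      end
    else
      # select timed out: periodic poll on all producers regardless of
      # pipe activity (OAuth refresh and similar signal-less events)
      @pipes.each_value(&:drain)
    end
  end
end
```

Everything sleeps inside the single IO.select call; a registered producer is drained the moment its pipe becomes readable.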

The Numbers

Benchmarked on Ruby 4.0.1 with a local Kafka broker, 1,000 messages per producer, 100-byte payloads:

Producers | Thread Mode  | FD Mode      | Improvement
        1 | 27,300 msg/s | 41,900 msg/s | +54%
        2 | 29,260 msg/s | 40,740 msg/s | +39%
        5 | 27,850 msg/s | 40,080 msg/s | +44%
       10 | 26,170 msg/s | 39,590 msg/s | +51%
       25 | 24,140 msg/s | 36,110 msg/s | +50%

39-54% faster across the board. The improvement comes from three things: immediate event notification via the pipe, the 1.6x faster poll_nb that skips GVL overhead, and consolidating all producers into a single polling thread that eliminates GVL contention.

The Trade-offs

Callbacks execute on the Poller thread. In thread mode, each producer's callbacks ran on its own polling thread. In FD mode with the default singleton Poller, all callbacks share the single Poller thread. Don't perform expensive or blocking operations inside message.acknowledged or statistics.emitted. This was never recommended in thread mode either, but FD mode makes it worse - if your callback takes 500ms, it delays polling for all producers on that Poller, not just one.

Don't close a producer from within its own callback when using FD mode. Callbacks execute on the Poller thread, and closing from within would cause synchronization issues. Close producers from your application threads.

How to Use It

producer = WaterDrop::Producer.new do |config|
  config.kafka = { 'bootstrap.servers': 'localhost:9092' }
  config.polling.mode = :fd
end

Pipe creation, Poller registration, lifecycle management - all handled internally.

You can differentiate priorities between producers:

high = WaterDrop::Producer.new do |config|
  config.polling.mode = :fd
  config.polling.fd.max_time = 200  # more polling time
end

low = WaterDrop::Producer.new do |config|
  config.polling.mode = :fd
  config.polling.fd.max_time = 50   # less polling time
end

max_time controls how long the Poller spends draining events for each producer per cycle. Higher values mean more events processed per wake-up but less fair scheduling across producers.

Dedicated Pollers for Callback Isolation

By default, all FD-mode producers share a single global Poller. If a slow callback in one producer risks starving others, you can assign a dedicated Poller via config.polling.poller:

dedicated_poller = WaterDrop::Polling::Poller.new

producer = WaterDrop::Producer.new do |config|
  config.kafka = { 'bootstrap.servers': 'localhost:9092' }
  config.polling.mode = :fd
  config.polling.poller = dedicated_poller
end

Each dedicated Poller runs its own thread (waterdrop.poller#0, waterdrop.poller#1, etc.). You can also share a dedicated Poller between a subset of producers to group them - for example, giving critical producers their own shared Poller while background producers use the global singleton. The dedicated Poller shuts down automatically when its last producer closes.

When config.polling.poller is nil (the default), the global singleton is used. Setting a custom Poller is only valid with config.polling.mode = :fd.

The Rollout Plan

I'm being deliberately cautious. Karafka runs in too many production environments to rush this.

Phase 1 (WaterDrop 2.8, now): FD mode is opt-in. Thread mode stays the default.

Phase 2 (WaterDrop 2.9): FD mode becomes the default. Thread mode remains available with a deprecation warning.

Phase 3 (WaterDrop 2.10): Thread mode is removed. Every producer uses FD-based polling.

A full major version cycle to test before it becomes mandatory.

What's Next: The Consumer Side

The producer was the easier target - simpler event loop, more straightforward queue management. I'm working on similar improvements for Karafka's consumer, where the gains could be even more significant. Consumer polling has additional complexity around max.poll.interval.ms and consumer group membership, but the core idea is the same: replace per-thread blocking polls with file descriptor notifications and efficient multiplexing.


Find WaterDrop on GitHub and check PR #780 for the full implementation details.

Ruby Floats: When 2.6x Faster Is Actually Slower (and Then Faster Again)

Update: This article originally concluded that Eisel-Lemire wasn't worth it for Ruby. I was wrong. After revisiting the problem, I found a way to make it work - and submitted a PR to Ruby. Read the full update at the end.

Recently, I submitted a PR to Ruby that optimizes Float#to_s using the Ryu algorithm, achieving 2-4x performance improvements for float-to-string conversion. While that work deserves its own article, this article is about what happened when I tried to optimize the other direction: string-to-float parsing.

String-to-float seemed like an equally promising target. It's a fundamental operation used everywhere - parsing JSON, reading configuration files, processing CSV data, and handling user input. Since the Ryu optimization worked so well for float-to-string, surely the reverse direction would yield similar gains?

I did my research. I found a state-of-the-art algorithm backed by an academic paper. I implemented it. All tests passed. It worked exactly as promised.

And then I threw it all away.

Finding the "Perfect" Algorithm

The Eisel-Lemire algorithm, published by Daniel Lemire in 2021 in his paper "Number Parsing at a Gigabyte per Second", looked like exactly what I needed. It's a modern approach to converting decimal strings to floating-point numbers, using 128-bit multiplication with precomputed powers of 5.

Rust uses it. Go adopted it in 1.16. The fast_float C++ library is built around it.

When two performance-conscious language communities both adopt the same algorithm, you pay attention.

The Implementation

I wrote about 1,100 lines of C: 128-bit multiplication helpers, a ~10KB lookup table for powers of 5, the core algorithm, and a wrapper matching Ruby's existing strtod interface. For edge cases (hex floats, numbers with more than 19 significant digits, ambiguous rounding), it falls back to the original implementation. In practice, maybe 0.01% of inputs hit the fallback.

All 59 Float tests passed. Round-trip verification worked.
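The round-trip property is simple to state and check (a sketch of the kind of verification involved, not the actual test suite): Float#to_s emits the shortest string that round-trips, so parsing it back must reproduce the exact same double.

```ruby
# Sketch of a round-trip check (not the actual Float test suite).
srand(1234) # deterministic sample
10_000.times do
  f = rand * 10.0**rand(-20..20)
  raise "round-trip failed for #{f}" unless Float(f.to_s) == f
end
```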

So how much faster was it?

The First Benchmark

Here's where I almost made a mistake.

I ran a benchmark with 3 million iterations across various float formats:

Test Case              | Unmodified | Eisel-Lemire | Speedup
Decimal (0.123456789)  | 0.185s     | 0.186s       | 1.00x
Scientific notation    | 0.162s     | 0.182s       | 0.89x
Math constants (Pi, E) | 0.538s     | 0.205s       | 2.62x
Currency values        | 0.155s     | 0.167s       | 0.93x
Coordinates            | 0.172s     | 0.171s       | 1.01x
Very small (1e-15)     | 0.220s     | 0.171s       | 1.29x
Very large (1e15)      | 0.218s     | 0.169s       | 1.29x
TOTAL                  | 2.316s     | 1.948s       | 1.19x

19% faster overall. The math constants case was 2.62x faster. I was ready to open a PR.

But something about the benchmark bothered me. I'd designed it to cover "various float formats" - which sounds reasonable until you realize I was testing what I expected to matter, not what actually matters.

The Second Benchmark

What numbers do Ruby applications actually parse?

Ruby runs web apps, reads config files, processes business data. It's not crunching scientific datasets. The floats it sees are prices, percentages, coordinates, timeouts. Mostly simple stuff.

So I benchmarked that:

Test Case                 | Unmodified | Eisel-Lemire | Change
Single digit (1-9)        | 0.236s     | 0.255s       | -8%
Two digits (10-99)        | 0.240s     | 0.289s       | -17%
Simple decimal (1.5, 2.0) | 0.244s     | 0.281s       | -13%
Price-like (9.99, 19.95)  | 0.258s     | 0.272s       | -5%
Short decimal (0.5, 0.25) | 0.255s     | 0.277s       | -8%
Simple scientific (1e5)   | 0.250s     | 0.268s       | -7%
Common short (3.14, 2.71) | 0.253s     | 0.264s       | -4%
TOTAL                     | 2.482s     | 2.710s       | -9%

9% slower on simple numbers. The numbers Ruby actually parses.

What Went Wrong

Eisel-Lemire has fixed overhead: parse the string, look up powers of 5, do 128-bit multiplication, construct the IEEE 754 double. That overhead pays off when the alternative is expensive.

But Ruby's existing strtod - based on David Gay's code from 1991 - has been tuned for 30+ years. It has fast paths for simple inputs like "1.5" or "99.99". For those cases, the old code is already fast. Eisel-Lemire's setup cost ends up being more expensive than the work it replaces.

The algorithm works exactly as advertised. In my view, it just solves a different problem than the one Ruby has.

Trying to Have It Both Ways

What if I used strtod for simple numbers and Eisel-Lemire only for complex ones?

Approach                    | Total Time | vs Baseline
Unmodified strtod           | 2.316s     | baseline
Pure Eisel-Lemire           | 1.948s     | +19%
Hybrid (digit threshold=8)  | 2.164s     | +7%
Hybrid (digit threshold=10) | 2.194s     | +6%
Hybrid (length-based)       | 2.060s     | +11%

Any dispatch check adds overhead. Counting digits or checking string length isn't free. The check itself eats into the gains.


Update: It Worked After All

After publishing this article, I decided to revisit the problem. The insight came from re-reading Nigel Tao's blog post, which mentions that the algorithm includes a "simple case" optimization for small mantissas that can be multiplied exactly by powers of 10.
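That simple case is easy to demonstrate in Ruby (a toy illustration, not Ruby's C implementation): when the decimal significand fits in 53 bits and the power of ten is itself exactly representable as a double, a single multiply or divide is guaranteed by IEEE 754 to produce the correctly rounded result.

```ruby
# Toy illustration of the "simple case" (not Ruby's actual C code):
# a 53-bit-or-smaller significand times an exactly representable power
# of ten needs only one floating-point operation, and IEEE 754
# guarantees that operation is correctly rounded.
def simple_case(significand, exp10)
  return nil if significand >= 2**53 || exp10.abs > 22

  # 10.0**n is exact for |n| <= 22 (5**22 still fits in 53 bits)
  exp10 >= 0 ? significand * 10.0**exp10 : significand / 10.0**-exp10
end

simple_case(314159, -5) # => 3.14159, bit-for-bit equal to Float("3.14159")
```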

The key realization: don't fight strtod on its home turf. Instead of replacing strtod entirely, I added fast paths that intercept simple numbers before they ever reach either algorithm:

  1. Ultra-fast path for small integers - handles "5", "42", "-123" (up to 3 digits) with direct digit parsing
  2. Ultra-fast path for simple decimals - handles "1.5", "9.99", "199.95" (up to 3+3 digits) using precomputed divisors
  3. Eisel-Lemire - handles complex numbers with many significant digits
  4. Fallback to strtod - for edge cases (hex floats, >19 digits, ambiguous rounding)

The fast paths are trivial - just a few comparisons and arithmetic operations. No 128-bit multiplication, no table lookups. For simple inputs, they're faster than both strtod and Eisel-Lemire.
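A Ruby rendition of those two paths might look like this (illustrative only - the real implementation is C in object.c, and the names and digit limits here are made up for the sketch):

```ruby
# Illustrative Ruby version of the two ultra-fast paths; the real code
# is C, and these names/limits are invented for the sketch.
POW10 = [1.0, 10.0, 100.0, 1000.0].freeze

def fast_parse(str)
  case str
  when /\A(-?)(\d{1,3})\z/            # path 1: small integers
    ($1 == "-" ? -1 : 1) * $2.to_i.to_f
  when /\A(-?)(\d{1,3})\.(\d{1,3})\z/ # path 2: simple decimals
    # Concatenate the digits into one exact integer, then do a single
    # division by an exact power of ten: one rounding, correct result.
    ($1 == "-" ? -1 : 1) * (($2 + $3).to_i / POW10[$3.length])
  else
    Float(str)                        # complex inputs: the full parser
  end
end
```

The single-division trick in path 2 matters: computing 9 + 0.99 would round twice, while 999 / 100.0 rounds exactly once and therefore matches the correctly rounded parse.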

New Benchmark Results

After implementing the fast paths, I ran the same benchmarks against Ruby master (3 million iterations):

Input Type                           | Master | Optimized | Improvement
Simple decimals ("1.5", "3.14")      | 0.154s | 0.125s    | 19% faster
Prices ("9.99", "19.95")             | 0.155s | 0.125s    | 19% faster
Small integers ("5", "42")           | 0.149s | 0.116s    | 22% faster
Math constants ("3.141592653589793") | 0.674s | 0.197s    | 3.4x faster
High precision ("0.123456789012345") | 0.554s | 0.199s    | 2.8x faster
Scientific ("1e5", "2e10")           | 0.154s | 0.153s    | ~same

The numbers that were 9% slower are now 19-22% faster. The complex numbers that were 2.6x faster are now 2.8-3.4x faster. No regressions anywhere.

The PR

Based on this work, I submitted PR #15655 to Ruby. The implementation adds about 320 lines to object.c plus a 10KB lookup table for powers of 5.

Summary

My first benchmark was designed to make me feel good about my work. It covered "various formats" which happened to include cases where Eisel-Lemire shines. Only when I forced myself to benchmark what Ruby actually does did reality show up.

My original conclusion wasn't wrong - pure Eisel-Lemire is slower for simple numbers. The mistake was treating it as an all-or-nothing choice. Theoretical performance gains are hypotheses. Benchmarks against real workloads are proof. And sometimes the best optimization isn't replacing an algorithm - it's knowing when not to run it.


Copyright © 2026 Closer to Code
