
One Thread to Poll Them All: How a Single Pipe Made WaterDrop 50% Faster

This is Part 2 of the "Karafka to Async Journey" series. Part 1 covered WaterDrop's integration with Ruby's async ecosystem and how fibers can yield during Kafka dispatches. This article covers another improvement in this area: migration of the producer polling engine to file descriptor-based polling.

When I released WaterDrop's async/fiber support in September 2025, the results were promising - fibers significantly outperformed multiple producer instances while consuming less memory. But something kept nagging me.

Every WaterDrop producer spawns a dedicated background thread for polling librdkafka's event queue. For one or two producers, nobody cares. But Karafka runs in hundreds of thousands of production processes. Some deployments use transactional producers, where each worker thread needs its own producer instance. Ten worker threads means ten producers and ten background polling threads - each competing for Ruby's GVL, each consuming memory, each doing the same repetitive work. Things will get even more intense once the Karafka consumer becomes async-friendly - work that is currently underway.

The Thread Problem

Every time you create a WaterDrop producer, rdkafka-ruby spins up a background thread (rdkafka.native_kafka#<n>) that calls rd_kafka_poll(timeout) in a loop. Its job is to check whether librdkafka has delivery reports ready and to invoke the appropriate callbacks.

With one producer, you get one extra thread. With 25, you get 25. Each consumes roughly 1MB of stack space. Each competes with your application threads for the GVL. And most of the time, they're doing nothing - sleeping inside poll(timeout), waiting for events that may arrive once every few milliseconds.

I wanted one thread that could monitor all producers simultaneously, reacting only when there's actual work to do.

How librdkafka Polling Works (and Why It's Wasteful)

librdkafka is inherently asynchronous. When you produce a message, it gets buffered internally and dispatched by librdkafka's own I/O threads. When the broker acknowledges delivery, librdkafka places a delivery report on an internal event queue. rd_kafka_poll() drains that queue and invokes your callbacks.

The problem is how rd_kafka_poll(timeout) waits. Calling rd_kafka_poll(250) blocks for up to 250 milliseconds. From Ruby's perspective, this is a blocking C function call. The rdkafka-ruby FFI binding releases the GVL during this call so other threads can run, but the calling thread is stuck until either an event arrives or the timeout expires.

Every rd_kafka_poll(timeout) call must release the GVL before entering C and reacquire it afterward. This cycle happens continuously, even when the queue is empty. With 25 producers, that's 25 threads constantly cycling through GVL release/reacquire. And there's no way to say "watch these 25 queues and wake me when any of them has events."
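Conceptually, thread mode boils down to one loop like the following per producer. This is a schematic sketch, not rdkafka-ruby's actual code: `polling_thread` is a hypothetical name, and `handle.poll` stands in for the FFI call to rd_kafka_poll.

```ruby
# Schematic of thread mode: each producer gets its own loop like this.
# `handle` is a stand-in for the native producer; its poll blocks for up
# to timeout_ms and returns the number of events served (0 on timeout).
def polling_thread(handle, timeout_ms: 250)
  Thread.new do
    loop do
      # In the real binding this blocks in C with the GVL released and
      # must reacquire the GVL on every return, event or not
      events = handle.poll(timeout_ms)
      break if events == :closed # stand-in shutdown signal
    end
  end
end
```

Multiply this loop by 25 producers and the GVL release/reacquire churn is constant, even when every queue is empty.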

The File Descriptor Alternative

Luckily for me, librdkafka has a lesser-known API that solves both problems: rd_kafka_queue_io_event_enable().

You can create an OS pipe and hand the write end to librdkafka:

int pipefd[2];
pipe(pipefd);
rd_kafka_queue_io_event_enable(queue, pipefd[1], "1", 1);

Whenever the queue transitions from empty to non-empty, librdkafka writes a single byte to the pipe. The actual events are still on librdkafka's internal queue - the pipe is purely a wake-up signal. This is edge-triggered: it only fires on the empty-to-non-empty transition, not per-event.
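The wake-up-signal pattern itself is easy to demonstrate in plain Ruby, with no librdkafka involved (a standalone sketch of the mechanism):

```ruby
r, w = IO.pipe

# Nothing has been signaled yet, so a zero-timeout select returns nil
ready, = IO.select([r], nil, nil, 0)
puts ready.nil? # => true

# Simulate librdkafka's empty->non-empty signal: a single byte
w.write("1")

# Now the read end reports readiness without any blocking poll
ready, = IO.select([r], nil, nil, 0)
puts ready == [r] # => true

# Drain the signal; the events themselves would still sit on
# librdkafka's internal queue, waiting for a non-blocking poll
r.read_nonblock(1)
```

The byte carries no information - it exists only to make the file descriptor readable so a multiplexer like IO.select can wake up.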

The read end of the pipe is a regular file descriptor that works with Ruby's IO.select. The Poller thread spends most of its time in IO.select, which handles GVL release natively. When a pipe signals readiness, we call poll_nb(0) - a non-blocking variant that skips GVL release entirely:

100,000 iterations:
  rd_kafka_poll:    ~19ms (5.1M calls/s) - releases GVL
  rd_kafka_poll_nb: ~12ms (8.1M calls/s) - keeps GVL
  poll_nb is ~1.6x faster

Instead of 25 threads each paying the GVL tax on every iteration, one thread pays it once in IO.select and then drains events across all producers without GVL overhead.

One Thread to Poll Them All

By default, a singleton Poller manages all FD-mode producers in a single thread.

When a producer is created with config.polling.mode = :fd, it registers with the global Poller instead of spawning its own thread. The Poller creates a pipe for each producer and tells librdkafka to signal through it.

The polling loop calls IO.select on all registered pipes. When any pipe becomes readable, the Poller drains it and runs a tight loop that processes events until the queue is empty or a configurable time limit is hit:

def poll_drain_nb(max_time_ms)
  deadline = monotonic_now + max_time_ms
  loop do
    events = rd_kafka_poll_nb(0)
    return true if events.zero?       # fully drained
    return false if monotonic_now >= deadline  # hit time limit
  end
end

When IO.select times out (~1 second by default), the Poller does a periodic poll on all producers regardless of pipe activity - a safety net for edge cases like OAuth token refresh that may not trigger a queue write. Regular events, including statistics.emitted callbacks, do write to the pipe and wake the Poller immediately.
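Putting the pieces together, one cycle of the Poller loop looks roughly like this. It is an illustrative sketch, not WaterDrop's actual implementation: `poll_cycle` is a hypothetical name, and `poll_drain_nb` / `periodic_poll` are assumed methods on the registered producers.

```ruby
# One iteration: wait for any pipe to signal, drain the producers behind
# the ready pipes; on timeout, fall back to the safety-net poll.
def poll_cycle(pipes, select_timeout: 1.0)
  # pipes maps each pipe's read IO to its producer
  ready, = IO.select(pipes.keys, nil, nil, select_timeout)

  if ready
    ready.each do |pipe|
      pipe.read_nonblock(16, exception: false) # consume the wake-up byte(s)
      pipes[pipe].poll_drain_nb(100)           # non-blocking drain, no GVL churn
    end
  else
    # Timeout: periodic poll covers events that never write to the pipe,
    # such as OAuth token refreshes
    pipes.each_value(&:periodic_poll)
  end
end
```

The key property is that the only blocking call is IO.select itself, and it watches every producer's pipe at once.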

The Numbers

Benchmarked on Ruby 4.0.1 with a local Kafka broker, 1,000 messages per producer, 100-byte payloads:

Producers | Thread Mode  | FD Mode      | Improvement
----------|--------------|--------------|------------
1         | 27,300 msg/s | 41,900 msg/s | +54%
2         | 29,260 msg/s | 40,740 msg/s | +39%
5         | 27,850 msg/s | 40,080 msg/s | +44%
10        | 26,170 msg/s | 39,590 msg/s | +51%
25        | 24,140 msg/s | 36,110 msg/s | +50%

39-54% faster across the board. The improvement comes from three things: immediate event notification via the pipe, the 1.6x faster poll_nb that skips GVL overhead, and consolidating all producers into a single polling thread that eliminates GVL contention.

The Trade-offs

Callbacks execute on the Poller thread. In thread mode, each producer's callbacks ran on its own polling thread. In FD mode with the default singleton Poller, all callbacks share the single Poller thread. Don't perform expensive or blocking operations inside message.acknowledged or statistics.emitted. This was never recommended in thread mode either, but FD mode makes it worse - if your callback takes 500ms, it delays polling for all producers on that Poller, not just one.

Don't close a producer from within its own callback when using FD mode. Callbacks execute on the Poller thread, and closing from within would cause synchronization issues. Close producers from your application threads.
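One way to keep callbacks cheap is to hand events off to an application-owned thread. The sketch below uses a stub monitor so it runs without Kafka - with a real producer you would subscribe on `producer.monitor` the same way; the stub class and its methods are otherwise assumptions for illustration.

```ruby
# Minimal stand-in for an instrumentation monitor, so the pattern is
# runnable without Kafka
class StubMonitor
  def initialize
    @subscribers = Hash.new { |hash, key| hash[key] = [] }
  end

  def subscribe(event, &block)
    @subscribers[event] << block
  end

  def instrument(event, payload)
    @subscribers[event].each { |block| block.call(payload) }
  end
end

monitor = StubMonitor.new
ack_queue = Queue.new
processed = []

# The callback only enqueues and returns immediately, so the (shared)
# polling thread is never blocked by application work
monitor.subscribe("message.acknowledged") { |event| ack_queue << event }

# Expensive handling happens on a thread the application owns
worker = Thread.new do
  loop do
    event = ack_queue.pop
    break if event == :stop
    processed << event # e.g. update metrics, write audit rows...
  end
end
```

With this shape, even a 500ms piece of work costs the Poller thread only a Queue push.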

How to Use It

producer = WaterDrop::Producer.new do |config|
  config.kafka = { 'bootstrap.servers': 'localhost:9092' }
  config.polling.mode = :fd
end

Pipe creation, Poller registration, lifecycle management - all handled internally.

You can give different producers different polling priorities:

high = WaterDrop::Producer.new do |config|
  config.polling.mode = :fd
  config.polling.fd.max_time = 200  # more polling time
end

low = WaterDrop::Producer.new do |config|
  config.polling.mode = :fd
  config.polling.fd.max_time = 50   # less polling time
end

max_time controls how long the Poller spends draining events for each producer per cycle. Higher values mean more events processed per wake-up but less fair scheduling across producers.

Dedicated Pollers for Callback Isolation

By default, all FD-mode producers share a single global Poller. If a slow callback in one producer risks starving others, you can assign a dedicated Poller via config.polling.poller:

dedicated_poller = WaterDrop::Polling::Poller.new

producer = WaterDrop::Producer.new do |config|
  config.kafka = { 'bootstrap.servers': 'localhost:9092' }
  config.polling.mode = :fd
  config.polling.poller = dedicated_poller
end

Each dedicated Poller runs its own thread (waterdrop.poller#0, waterdrop.poller#1, etc.). You can also share a dedicated Poller between a subset of producers to group them - for example, giving critical producers their own shared Poller while background producers use the global singleton. The dedicated Poller shuts down automatically when its last producer closes.

When config.polling.poller is nil (the default), the global singleton is used. Setting a custom Poller is only valid with config.polling.mode = :fd.

The Rollout Plan

I'm being deliberately cautious. Karafka runs in too many production environments to rush this.

Phase 1 (WaterDrop 2.8, now): FD mode is opt-in. Thread mode stays the default.

Phase 2 (WaterDrop 2.9): FD mode becomes the default. Thread mode remains available with a deprecation warning.

Phase 3 (WaterDrop 2.10): Thread mode is removed. Every producer uses FD-based polling.

A full major version cycle to test before it becomes mandatory.

What's Next: The Consumer Side

The producer was the easier target - simpler event loop, more straightforward queue management. I'm working on similar improvements for Karafka's consumer, where the gains could be even more significant. Consumer polling has additional complexity around max.poll.interval.ms and consumer group membership, but the core idea is the same: replace per-thread blocking polls with file descriptor notifications and efficient multiplexing.


Find WaterDrop on GitHub and check PR #780 for the full implementation details.

The 60-Second Wait: How I Spent Months Solving Ruby's Most Annoying Gem Installation Problem

Notice: While native extensions for rdkafka have been extensively tested and are no longer experimental, they may not work in all environments or configurations. If you find any issues with the precompiled extensions, please report them immediately and they will be resolved.

Every Ruby developer knows this excruciating feeling: you're setting up a project, running bundle install, and then... you wait. And wait. And wait some more as rdkafka compiles for what feels like eternity. Sixty to ninety seconds of pure frustration, staring at a seemingly frozen terminal that gives no indication of progress while your coffee is getting cold.

I've been there countless times. As the maintainer of the Karafka ecosystem, I've watched developers struggle with this for years. The rdkafka gem - essential for Apache Kafka integration in Ruby - was notorious for its painfully slow installation. Docker builds took forever. CI pipelines crawled. New developers gave up before they even started. Not to mention the countless compilation crashes that were nearly impossible to debug.

Something had to change.

The Moment of Truth

It was during a particularly frustrating debugging session that I realized the real scope of the problem. I was helping a developer who couldn't get rdkafka to install on his macOS dev machine. Build tools missing. Compilation failing. Dependencies conflicting. The usual nightmare.

As I walked him through the solution for the hundredth time, I did some quick math. The rdkafka gem gets downloaded over a million times per month. Each installation takes about 60 seconds to compile. That's 60 million seconds of CPU time every month - nearly two years of continuous processing power wasted on compilation alone.

But the real kicker? All those installations were essentially building the same thing over and over again. The same librdkafka library. The same OpenSSL. The same compression libraries. Millions of identical compilations happening across the world, burning through CPU cycles and developer patience.

That's when I decided to solve this once and for all.

Why Nobody Had Done This Before

You might wonder: if this was such an obvious problem, why hadn't anyone solved it already? The answer lies in what I call "compatibility hell."

Unlike many other Ruby gems that might need basic compilation, rdkafka is a complex beast. It wraps librdkafka, a sophisticated C library that depends on a web of other libraries:

  • OpenSSL for encryption
  • Cyrus SASL for authentication
  • MIT Kerberos for enterprise security
  • Multiple compression libraries (zlib, zstd, lz4, snappy)
  • System libraries that vary wildly across platforms

Every Linux distribution has slightly different versions of these libraries. Ubuntu uses one version of OpenSSL, CentOS uses another. Alpine Linux uses musl instead of glibc. macOS has its own quirks. Creating a single binary that works everywhere seemed impossible.

My previous attempts had failed because they tried to link against system libraries dynamically. This works great... until you deploy to a system with different library versions. Then everything breaks spectacularly.

The Deep Dive Begins

I started by studying how other Ruby gems had tackled similar problems. The nokogiri gem had become the gold standard for this approach - they'd successfully shipped precompiled binaries that eliminated the notorious compilation headaches that had plagued XML processing in Ruby for years. Their success proved that it was possible.

Other ecosystems had figured this out years ago. Python has wheels. Go has static binaries. Rust has excellent cross-compilation. While Ruby has improved with precompiled gems for many platforms, the ecosystem still feels inconsistent - you never know if you'll get a precompiled gem or need to compile from source. The solution, I realized, was static linking. Instead of depending on system libraries, I would bundle everything into self-contained binaries.

Every dependency would be compiled from source and linked statically into the final library.

Sounds simple, right? It wasn't.

The First Breakthrough

My first success came with Linux x86_64 GNU systems - your typical Ubuntu or CentOS server. After days of tweaking compiler flags and build scripts, I had a working prototype. The binary was larger than the dynamically linked version, but it worked anywhere.

The installation time dropped from 60+ seconds to under 5 seconds!

But then I tried it on Alpine Linux. Complete failure. Alpine uses musl libc instead of glibc, and my carefully crafted build didn't work at all.

Platform-Specific Nightmares

Each platform brought its own specific challenges:

  • Alpine Linux (musl): Different system calls, different library conventions, different compiler behavior. I had to rebuild the entire toolchain with musl-specific flags. The Cyrus SASL library was particularly troublesome - it kept trying to use glibc-specific functions that don't exist in musl.

  • macOS ARM64: Apple Silicon Macs use a completely different architecture. The build system had to use different SDK paths and handle Apple's unique library linking requirements. Plus, macOS has its own ideas about where libraries should live.

  • Security concerns: Precompiled binaries are inherently less trustworthy than source code. How do you prove that a binary contains exactly what it claims to contain? I implemented SHA256 verification for every dependency, but that was just the beginning.

The Compilation Ballet

Building a single native extension became an intricate dance of dependencies. Each library had to be compiled in the correct order, with the right flags, targeting the right architecture. One mistake and the entire build would fail.

I developed a common build system that could handle all platforms:

# Download and verify every dependency
secure_download "$(get_openssl_url)" "$OPENSSL_TARBALL"
verify_checksum "$OPENSSL_TARBALL"

# Build in the correct order
build_openssl_for_platform "$PLATFORM"
build_kerberos_for_platform "$PLATFORM"
build_sasl_for_platform "$PLATFORM"
# ... and so on

The build scripts grew to over 2,000 lines of carefully crafted shell code. Every platform had its own nuances, its own gotchas, its own way of making my life difficult.

The Security Rabbit Hole

Precompiled binaries introduce a fundamental security challenge: trust. When you compile from source, you can theoretically inspect every line of code. With precompiled binaries, you're trusting that the binary contains exactly what it claims to contain.

I spent weeks implementing a comprehensive security model consisting of:

  • SHA256 verification for every downloaded dependency
  • Cryptographic attestation through RubyGems Trusted Publishing
  • Reproducible builds with pinned dependency versions
  • Supply chain protection against malicious dependencies
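The checksum step itself boils down to comparing a freshly computed digest against a pinned value. Here is a simplified Ruby sketch of that idea - the real builds do this in shell, and `verify_checksum!` is a hypothetical name:

```ruby
require "digest"

# Fails loudly if a downloaded tarball does not match its pinned digest
def verify_checksum!(path, expected_sha256)
  actual = Digest::SHA256.file(path).hexdigest
  unless actual == expected_sha256
    raise "Checksum mismatch for #{File.basename(path)}: " \
          "expected #{expected_sha256}, got #{actual}"
  end
  true
end
```

The crucial detail is that the expected digests are pinned in the repository, so a tampered download fails the build rather than silently shipping.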

The build logs became a security audit trail:

[SECURITY] Verifying checksum for openssl-3.0.16.tar.gz...
[SECURITY] ✅ Checksum verified for openssl-3.0.16.tar.gz
[SECURITY] 🔒 SECURITY VERIFICATION COMPLETE

The CI/CD Nightmare

Testing native extensions across multiple platforms and Ruby versions created a combinatorial explosion of complexity. My GitHub Actions configuration grew from a simple test matrix to a multi-stage pipeline with separate build and test phases.

Each platform needed its own runners:

  • Linux builds ran on Ubuntu with Docker containers
  • macOS builds ran on actual macOS runners
  • Each build had to be tested across Ruby 3.1, 3.2, 3.3, 3.4, and 3.5

The CI pipeline became a carefully choreographed dance of builds, tests, and releases. One failure anywhere would cascade through the entire system.

Each platform requires 10 separate CI actions - from compilation to testing across Ruby versions. Multiplied across all supported platforms, this creates a complex 30-action pipeline.

The Release Process Revolution

Publishing native extensions isn't just gem push. Each release now involves building on multiple platforms simultaneously, testing each binary across Ruby versions, and coordinating the release of multiple platform-specific gems.

I implemented RubyGems Trusted Publishing, which uses cryptographic tokens instead of API keys. This meant rebuilding the entire release process from scratch, but it provided better security and audit trails.

The First Success

After months of work, I finally had working native extensions for all three major platforms. The moment of truth came when I installed the gem for the first time using the precompiled binary:

$ gem install rdkafka
Successfully installed rdkafka-0.22.0-x86_64-linux-gnu
1 gem installed

I sat there staring at my terminal, hardly believing it had worked. Months of frustration, debugging, and near misses had led to this moment. Three seconds instead of sixty!

The Ripple Effect

The impact went beyond just faster installations. Docker builds that had previously taken several minutes now completed much faster. CI pipelines that developers had learned to ignore suddenly became responsive. New contributors could set up development environments without fighting compiler errors.

But the numbers that really struck me were the environmental ones. With over a million downloads per month, those 60 seconds of compilation time added up to 60 million seconds of CPU usage monthly. That's 16,667 hours of processing power - and all the associated energy consumption and CO2 emissions.

The Unexpected Challenges

Just when I thought I was done, new challenges emerged. Some users needed to stick with source compilation for custom configurations. Others wanted to verify that the precompiled binaries were truly equivalent to the source builds.

I added fallback mechanisms and comprehensive documentation. You can still force source compilation if needed:

gem 'rdkafka', force_ruby_platform: true

But for most users, the native extensions work transparently. Install the gem, and you automatically get the fastest experience possible.

Looking Back

This project taught me that sometimes the most valuable improvements are the ones users never notice. Nobody celebrates faster gem installation. There are no awards for reducing compilation time. But those small improvements compound into something much larger.

Every developer who doesn't wait for rdkafka to compile can focus on building something amazing instead. Every CI pipeline that completes faster means more iterations, more experiments, more innovation.

The 60-second problem is solved. It took months of engineering effort, thousands of lines of code, and more debugging sessions than I care to count. But now, when you run gem install rdkafka or bundle install, it just works.

Fast.


The rdkafka and karafka-rdkafka gems with native extensions are available now. Your next bundle install will be faster than ever. For complete documentation, visit the Karafka documentation.

Copyright © 2026 Closer to Code
