This is Part 2 of the "Karafka to Async Journey" series. Part 1 covered WaterDrop's integration with Ruby's async ecosystem and how fibers can yield during Kafka dispatches. This article covers another improvement in this area: migration of the producer polling engine to file descriptor-based polling.
When I released WaterDrop's async/fiber support in September 2025, the results were promising - fibers significantly outperformed multiple producer instances while consuming less memory. But something kept nagging me.
Every WaterDrop producer spawns a dedicated background thread for polling librdkafka's event queue. For one or two producers, nobody cares. But Karafka runs in hundreds of thousands of production processes. Some deployments use transactional producers, where each worker thread needs its own producer instance. Ten worker threads means ten producers and ten background polling threads - each competing for Ruby's GVL, each consuming memory, each doing the same repetitive work. Things will get even more intense once the Karafka consumer becomes async-friendly - work that is currently underway.
The Thread Problem
Every time you create a WaterDrop producer, rdkafka-ruby spins up a background thread (rdkafka.native_kafka#<n>) that calls rd_kafka_poll(timeout) in a loop. Its job is to check whether librdkafka has delivery reports ready and to invoke the appropriate callbacks.
With one producer, you get one extra thread. With 25, you get 25. Each consumes roughly 1MB of stack space. Each competes with your application threads for the GVL. And most of the time, they're doing nothing - sleeping inside poll(timeout), waiting for events that may arrive once every few milliseconds.
I wanted one thread that could monitor all producers simultaneously, reacting only when there's actual work to do.
How librdkafka Polling Works (and Why It's Wasteful)
librdkafka is inherently asynchronous. When you produce a message, it gets buffered internally and dispatched by librdkafka's own I/O threads. When the broker acknowledges delivery, librdkafka places a delivery report on an internal event queue. rd_kafka_poll() drains that queue and invokes your callbacks.
The problem is how rd_kafka_poll(timeout) waits. Calling rd_kafka_poll(250) blocks for up to 250 milliseconds. From Ruby's perspective, this is a blocking C function call. The rdkafka-ruby FFI binding releases the GVL during this call so other threads can run, but the calling thread is stuck until either an event arrives or the timeout expires.
Every rd_kafka_poll(timeout) call must release the GVL before entering C and reacquire it afterward. This cycle happens continuously, even when the queue is empty. With 25 producers, that's 25 threads constantly cycling through GVL release/reacquire. And there's no way to say "watch these 25 queues and wake me when any of them has events."
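Conceptually, thread mode boils down to a per-producer loop like the sketch below (illustrative pseudocode - rd_kafka_poll, native_handle, and producer_closed? stand in for the real rdkafka-ruby internals):

# Simplified view of one producer's polling thread in thread mode
Thread.new do
  until producer_closed?
    # Blocks in C for up to 250ms; the GVL is released for the call and
    # reacquired when it returns - even if no events arrived
    rd_kafka_poll(native_handle, 250)
  end
end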
The File Descriptor Alternative
Luckily for me, librdkafka has a lesser-known API that solves both problems: rd_kafka_queue_io_event_enable().
You can create an OS pipe and hand the write end to librdkafka:
int pipefd[2];
pipe(pipefd);
rd_kafka_queue_io_event_enable(queue, pipefd[1], "1", 1);
Whenever the queue transitions from empty to non-empty, librdkafka writes a single byte to the pipe. The actual events are still on librdkafka's internal queue - the pipe is purely a wake-up signal. This is edge-triggered: it only fires on the empty-to-non-empty transition, not per-event.
The read end of the pipe is a regular file descriptor that works with Ruby's IO.select. The Poller thread spends most of its time in IO.select, which handles GVL release natively. When a pipe signals readiness, we call poll_nb(0) - a non-blocking variant that skips GVL release entirely.
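Conceptually, the cycle looks something like the sketch below. The names (pipes, producer_for, max_time_ms) are hypothetical shorthand, not WaterDrop's actual internals - the real loop lives in PR #780:

# Simplified Poller loop - hypothetical names, for illustration only
loop do
  # One blocking wait covering every registered producer; IO.select
  # releases the GVL while it sleeps
  ready, = IO.select(pipes, nil, nil, 1)

  if ready
    ready.each do |pipe|
      pipe.read_nonblock(16, exception: false)      # consume the wake-up byte(s)
      producer_for(pipe).poll_drain_nb(max_time_ms) # non-blocking drain, no GVL round-trip
    end
  else
    # select timed out: periodic safety-net poll across all producers
    producers.each { |producer| producer.poll_drain_nb(max_time_ms) }
  end
end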
Instead of 25 threads each paying the GVL tax on every iteration, one thread pays it once in IO.select and then drains events across all producers without GVL overhead.
One Thread to Poll Them All
By default, a singleton Poller manages all FD-mode producers in a single thread.
When a producer is created with config.polling.mode = :fd, it registers with the global Poller instead of spawning its own thread. The Poller creates a pipe for each producer and tells librdkafka to signal through it.
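Registration, conceptually (again with hypothetical names - the actual wiring mirrors the C snippet above):

# Conceptual Poller registration - hypothetical names, not WaterDrop's real code
def register(producer)
  reader, writer = IO.pipe

  # Ask librdkafka to write one byte to `writer` whenever this producer's
  # main queue goes from empty to non-empty (rd_kafka_queue_io_event_enable)
  producer.enable_queue_io_event(writer.fileno)

  @producers_by_pipe[reader] = producer
end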
The polling loop calls IO.select on all registered pipes. When any pipe becomes readable, the Poller drains it and runs a tight loop that processes events until the queue is empty or a configurable time limit is hit:
def poll_drain_nb(max_time_ms)
  deadline = monotonic_now + max_time_ms

  loop do
    events = rd_kafka_poll_nb(0)

    return true if events.zero? # fully drained
    return false if monotonic_now >= deadline # hit time limit
  end
end
When IO.select times out (~1 second by default), the Poller does a periodic poll on all producers regardless of pipe activity - a safety net for edge cases like OAuth token refresh that may not trigger a queue write. Regular events, including statistics.emitted callbacks, do write to the pipe and wake the Poller immediately.
The Numbers
Benchmarked on Ruby 4.0.1 with a local Kafka broker, 1,000 messages per producer, 100-byte payloads:
Producers | Thread Mode  | FD Mode      | Improvement
1         | 27,300 msg/s | 41,900 msg/s | +54%
2         | 29,260 msg/s | 40,740 msg/s | +39%
5         | 27,850 msg/s | 40,080 msg/s | +44%
10        | 26,170 msg/s | 39,590 msg/s | +51%
25        | 24,140 msg/s | 36,110 msg/s | +50%
39-54% faster across the board. The improvement comes from three things: immediate event notification via the pipe, the 1.6x faster poll_nb that skips GVL overhead, and consolidating all producers into a single polling thread that eliminates GVL contention.
The Trade-offs
Callbacks execute on the Poller thread. In thread mode, each producer's callbacks ran on its own polling thread. In FD mode with the default singleton Poller, all callbacks share the single Poller thread. Don't perform expensive or blocking operations inside message.acknowledged or statistics.emitted. This was never recommended in thread mode either, but FD mode makes it worse - if your callback takes 500ms, it delays polling for all producers on that Poller, not just one.
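One way to keep Poller-thread callbacks cheap is to hand the work off to an application-owned thread. A minimal sketch, assuming WaterDrop's monitor subscription API and a hypothetical persist_delivery_report helper:

# Hand slow callback work to your own thread; keep the Poller thread snappy
ACK_EVENTS = Queue.new

# Application-owned worker does the heavy lifting (DB writes, HTTP calls, ...)
Thread.new do
  while (event = ACK_EVENTS.pop)
    persist_delivery_report(event) # hypothetical helper
  end
end

producer.monitor.subscribe('message.acknowledged') do |event|
  # Runs on the Poller thread - just enqueue and return immediately
  ACK_EVENTS << event
end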
Don't close a producer from within its own callback when using FD mode. Callbacks execute on the Poller thread, and closing from within would cause synchronization issues. Close producers from your application threads.
How to Use It
producer = WaterDrop::Producer.new do |config|
  config.kafka = { 'bootstrap.servers': 'localhost:9092' }
  config.polling.mode = :fd
end
Pipe creation, Poller registration, lifecycle management - all handled internally.
You can differentiate priorities between producers:
high = WaterDrop::Producer.new do |config|
  config.polling.mode = :fd
  config.polling.fd.max_time = 200 # more polling time
end

low = WaterDrop::Producer.new do |config|
  config.polling.mode = :fd
  config.polling.fd.max_time = 50 # less polling time
end
max_time controls how long the Poller spends draining events for each producer per cycle. Higher values mean more events processed per wake-up but less fair scheduling across producers.
Dedicated Pollers for Callback Isolation
By default, all FD-mode producers share a single global Poller. If a slow callback in one producer risks starving others, you can assign a dedicated Poller via config.polling.poller.
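A sketch of what that looks like - the Poller constructor name here is an assumption, so check the WaterDrop documentation and PR #780 for the exact API:

# Hypothetical constructor name - verify against the actual WaterDrop API
dedicated_poller = WaterDrop::Poller.new

critical = WaterDrop::Producer.new do |config|
  config.kafka = { 'bootstrap.servers': 'localhost:9092' }
  config.polling.mode = :fd
  config.polling.poller = dedicated_poller # isolated from the global singleton
end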
Each dedicated Poller runs its own thread (waterdrop.poller#0, waterdrop.poller#1, etc.). You can also share a dedicated Poller between a subset of producers to group them - for example, giving critical producers their own shared Poller while background producers use the global singleton. The dedicated Poller shuts down automatically when its last producer closes.
When config.polling.poller is nil (the default), the global singleton is used. Setting a custom Poller is only valid with config.polling.mode = :fd.
The Rollout Plan
I'm being deliberately cautious. Karafka runs in too many production environments to rush this.
Phase 1 (WaterDrop 2.8, now): FD mode is opt-in. Thread mode stays the default.
Phase 2 (WaterDrop 2.9): FD mode becomes the default. Thread mode remains available with a deprecation warning.
Phase 3 (WaterDrop 2.10): Thread mode is removed. Every producer uses FD-based polling.
A full major version cycle to test before it becomes mandatory.
What's Next: The Consumer Side
The producer was the easier target - simpler event loop, more straightforward queue management. I'm working on similar improvements for Karafka's consumer, where the gains could be even more significant. Consumer polling has additional complexity around max.poll.interval.ms and consumer group membership, but the core idea is the same: replace per-thread blocking polls with file descriptor notifications and efficient multiplexing.
Find WaterDrop on GitHub and check PR #780 for the full implementation details.
Every developer who maintains Ruby gems knows that sinking feeling when a user reports an error that shouldn't be possible. Not "difficult to reproduce", but truly impossible according to everything you know about how your code works.
That's exactly what hit me when a Karafka user's error tracker logged 2,700 identical errors in a single incident:
NoMethodError: undefined method 'default' for an instance of String
vendor/bundle/ruby/3.4.0/gems/karafka-rdkafka-0.22.2-x86_64-linux-musl/lib/rdkafka/consumer/topic_partition_list.rb:112 FFI::Struct#[]
The error meant that something was calling #default on a String. I had never used a #default method anywhere in Karafka or rdkafka-ruby. Suddenly, there were 2,700 reports in rapid succession until the process restarted and everything went back to normal.
The user added casually: "No worry, no harm done since this hasn't occurred on prod yet."
Yet. That word stuck with me.
Something had to change. Fast.
TL;DR: FFI < 1.17.0 has missing write barriers that cause Ruby's GC to free internal Hashes, allowing them to be replaced by other objects at the same memory address. Rare but catastrophic.
The Impossible Error
I opened the rdkafka-ruby code at line 112:
native_tpl[:cnt].times do |i|
  ptr = native_tpl[:elems] + (i * Rdkafka::Bindings::TopicPartition.size)
  elem = Rdkafka::Bindings::TopicPartition.new(ptr)

  # Line 112 - Where everything exploded
  if elem[:partition] == -1
The crash happened when accessing elem[:partition]. But elem is an FFI::Struct - a foreign function interface structure that bridges Ruby and C code - and partition was declared as a plain integer field.
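For reference, an FFI struct layout of that shape looks roughly like this - a simplified illustration, not the exact rdkafka-ruby binding:

# Simplified illustration of such a layout; field names and order are not
# the exact rdkafka-ruby TopicPartition binding
class TopicPartition < FFI::Struct
  layout :topic,     :string,
         :partition, :int32,  # the integer field read at line 112
         :offset,    :int64,
         :err,       :int32
end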
I dove into FFI's internals to understand what was happening. FFI doesn't use many Hashes, either in Ruby or in its C extension - there are only a few critical data structures. The most important one is rbFieldMap, an internal Hash that every struct layout maintains to store field definitions. When you access elem[:partition], FFI looks up :partition in this Hash to find the field's type, offset, and size.
This Hash is the heart of FFI's struct system. Without it, FFI can't map field names to their C memory locations.
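You can poke at the Ruby-visible side of that metadata yourself - the field map proper lives in the C extension, but the layout exposes the same information:

require 'ffi'

class Example < FFI::Struct
  layout :partition, :int32,
         :offset,    :int64
end

field = Example.layout[:partition] # the same lookup rbFieldMap serves in C
field.offset # => 0
field.size   # => 4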
Why would it be calling default on a String?
I searched the entire codebase. No calls to #default anywhere in my code. I checked FFI's Ruby code. No calls to #default there either.
But #default is a Hash method. Ruby's Hash implementation calls #default when you access a key that doesn't exist.
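You can see that behavior in plain Ruby:

h = Hash.new
h[:missing]    # => nil - Hash#[] falls back to the default for a missing key
h.default = 0
h[:missing]    # => 0
# FFI's C code performs the same kind of lookup on rbFieldMap - if the object
# at that address is no longer a Hash, dispatching #default blows up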
I stared at the backtrace. After billions of messages processed successfully, something in FFI's internals had fundamentally broken. An internal Hash that should contain field definitions was somehow... a String.
Investigating the musl Hypothesis
The gem was precompiled: karafka-rdkafka-0.22.2-x86_64-linux-musl. That suffix made me immediately suspicious. The user was running ruby:3.4.5-alpine in Docker, which uses musl libc instead of glibc.
I've debugged enough production issues to know that precompiled gems and Alpine Linux make a notorious combination. Different libc versions, different struct alignment assumptions, different CPU architecture quirks.
"This has to be musl," I thought. I spent some time building diagnostic scripts:
require 'ffi'

# Check FFI integer type sizes
module Test
  extend FFI::Library

  class IntTest < FFI::Struct
    layout :a, :int
  end

  class Int32Test < FFI::Struct
    layout :a, :int32
  end
end

int_size = Test::IntTest.size
int32_size = Test::Int32Test.size

puts "FFI :int size: #{int_size} bytes"
puts "FFI :int32 size: #{int32_size} bytes"
puts "Match: #{int_size == int32_size ? 'Yes' : 'No'}"
Actual gem struct: size=64, err_offset=48
Expected: size=64, err_offset=48
Everything matched and every "obvious" explanation had failed. The struct definitions were perfect. The memory layout was correct. There was no ABI mismatch, no musl-specific quirk, no CPU architecture issue.
And yet the undefined method 'default' for an instance of String error occurred.
The Moment Everything Stopped Making Sense
I went back to that error message with fresh eyes. Why default specifically?
In Ruby, when you access a Hash with hash[key], the implementation can call hash.default to check for a default value if the key doesn't exist. So if FFI is trying to call #default on a String, this would mean that rbFieldMap - the internal Hash that stores field definitions - is actually a String.
Sounds crazy, but wait! What if there was a case where Ruby could replace a Hash with a String at runtime? Not corrupt the Hash's data, but literally free the Hash and allocate a String in the same memory location?
That would explain everything. The C code would still have a pointer to memory address 0x000078358a3dfd28, thinking it points to a Hash. But Ruby's GC would have freed that Hash, and the memory allocator could create a String at the exact same address. The pointer would be valid. The memory would contain valid data. Just... the wrong type of data.
Not corrupted.
Not misaligned.
Not reading wrong offsets.
An object changing type at runtime. That shouldn't be possible unless... I searched FFI's GitHub issues and found #1079: "Crash with [BUG] try to mark T_NONE object" - about segfaults, not this specific error. But buried in the comments, KJ mentioned "missing write barriers" in FFI's C extension.
A write barrier is a mechanism that tells Ruby's garbage collector about references between objects. When C code stores a Ruby object pointer without using RB_OBJ_WRITE, the GC doesn't know that reference exists. The GC can then free the object, thinking nothing needs it anymore.
That's when it clicked. If FFI's rbFieldMap Hash was being freed by the GC, then Ruby could allocate a String in that exact memory location.
But first, I needed to understand the #1079 issue better. I wrote a simple reproduction:
require 'ffi'

puts "Ruby: #{RUBY_VERSION} | FFI: #{FFI::VERSION}"

# Enable aggressive GC to trigger the bug faster
GC.stress = 0x01 | 0x04

i = 0

loop do
  i += 1

  # Create transient struct class that immediately goes out of scope
  struct_class = Class.new(FFI::Struct) do
    layout :field1, :int32,
           :field2, :int64,
           :field3, :pointer,
           :field4, :string,
           :field5, :double,
           :field6, :uint8,
           :field7, :uint32,
           :field8, :pointer
  end

  instance = struct_class.new
  instance[:field1] = rand
  instance[:field2]
  # ... access various fields

  field = struct_class.layout[:field5]
  field.offset
  field.size

  print "." if i % 1000 == 0
end
This reproduced the #1079 segfaults beautifully - the "T_NONE object" errors where the GC frees objects so aggressively that Ruby tries to access null pointers.
rb_obj_info_dump:
/3.4.0/gems/ffi-1.16.3/lib/ffi/struct_layout_builder.rb:171: [BUG] try to mark T_NONE object
ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [x86_64-linux]
-- Control frame information -----------------------------------------------
c:0044 p:---- s:0246 e:000245 CFUNC :initialize
c:0043 p:---- s:0243 e:000242 CFUNC :new
c:0042 p:0033 s:0236 e:000235 METHOD /gems/3.4.0/gems/ffi-1.16.3/lib/ffi/struct_layout_builder.rb:171
But my production bug wasn't a segfault. It was a magical transformation. The timing had to be different.
With GC.stress = true, the GC runs after every possible allocation. That causes immediate segfaults because objects get freed before Ruby can even allocate new objects in their memory slots.
But for a Hash to become a String, you need:
GC to run and free the Hash
Time to pass between the free and the next access
Ruby to allocate a String in that exact memory slot
Code to try accessing the "Hash" that's now a String
I couldn't use GC.stress. I needed natural GC timing with precise memory pressure.
Down the Rabbit Hole
I dove deeper into FFI's C extension code. In ext/ffi_c/StructLayout.c, I found the vulnerable code:
static VALUE
struct_layout_initialize(VALUE self, VALUE fields, VALUE size, VALUE align)
{
    StructLayout* layout;

    // ... initialization code ...

    layout->rbFieldMap = rb_hash_new(); // ← NO WRITE BARRIER
    layout->rbFields = rb_ary_new();
    layout->rbFieldNames = rb_ary_new();

    // Without RB_OBJ_WRITE, the GC doesn't know about these references!
    // ...
}
When FFI creates a struct layout, it allocates three Ruby objects:
a Hash for field lookups,
an Array of fields,
and an Array of field names.
It stores pointers to these objects in a C struct.
But in FFI 1.16.3, it didn't use RB_OBJ_WRITE to register these references with Ruby's garbage collector.
From the GC's perspective, the following is happening:
A Hash is allocated at memory address 0x000078358a3dfd28.
No Ruby code stores a reference to this memory address (as far as the GC can see).
The struct class goes out of scope.
The GC thinks: "Nobody needs this Hash anymore".
The GC frees the memory.
Ruby allocates a String at the address 0x000078358a3dfd28.
FFI's C code still has the pointer, thinking it points to a Hash.
Boom: undefined method 'default' for String.
The fix in FFI 1.17.0 added proper write barriers:
static VALUE
struct_layout_initialize(VALUE self, VALUE fields, VALUE size, VALUE align)
{
    StructLayout* layout;

    // ... initialization code ...

    RB_OBJ_WRITE(self, &layout->rbFieldMap, rb_hash_new()); // ← FIXED!
    RB_OBJ_WRITE(self, &layout->rbFields, rb_ary_new());
    RB_OBJ_WRITE(self, &layout->rbFieldNames, rb_ary_new());

    // Now the GC knows: "self owns these objects, don't free them"
    // ...
}
This single macro call, RB_OBJ_WRITE, tells Ruby's garbage collector: "This C struct holds a reference to this Ruby object. Don't free it while the struct is alive."
Without it, you have a use-after-free vulnerability where C thinks that it has a valid pointer, but Ruby has freed the memory and reused it for something else entirely.
Reproducing the Bug
Understanding the bug wasn't enough. I needed to reproduce it. Not the #1079 segfaults - the specific case where a Hash becomes something else.
The requirements were precise:
Thousands of transient struct class definitions that go out of scope.
Natural memory pressure to trigger GC (not GC.stress which causes segfaults).
Time between GC and field access for Ruby to allocate new objects.
Multi-threaded execution to increase memory churn.
Constant struct creation to maximize the replacement window.
Here's what I built:
#!/usr/bin/env ruby

require 'ffi'

# Unbuffer stdout so we see output immediately
$stdout.sync = true
$stderr.sync = true

2.times do
  Thread.new do
    loop do
      # Create an array to hold references temporarily
      # This creates more allocation pressure
      arr = []

      # Allocate many strings rapidly
      5000.times do
        arr << rand.to_s * 100
        arr << Time.now.to_s
        arr << "test string #{rand(10000)}"
      end
    end
  end
end

sleep(0.1)

# Shared buffer of throwaway strings used to increase memory churn
garbage_strings = []

ars = Array.new(5) do |round|
  Thread.new do
    round_instances = []

    10000.times do |i|
      # Create a new struct class - this creates an rbFieldMap
      klass = Class.new(FFI::Struct) do
        layout :partition, :int32,
               :offset, :int64,
               :metadata, :pointer,
               :err, :int32,
               :value, :int64
      end

      # Create instance from this class
      ptr = FFI::MemoryPointer.new(klass.size)
      instance = klass.new(ptr)
      instance[:partition] = round * 100 + i
      instance[:offset] = (round * 100 + i) * 1000
      instance[:err] = 0

      round_instances << instance
    end

    round_instances.each_with_index do |instance, i|
      begin
        partition = instance[:partition]
        offset = instance[:offset]
        err = instance[:err]
      rescue NoMethodError => e
        puts "\n" + "=" * 60
        puts "🐛 BUG REPRODUCED! 🐛"
        puts "=" * 60
        puts "Error: #{e.message}"
        puts "\nBacktrace:"
        puts e.backtrace[0..10]
        exit 1
      end
    end

    # Clear old strings to increase memory churn
    if garbage_strings.size > 50_000
      garbage_strings.shift(25_000)
    end
  end
end

ars.each(&:join)
Key differences from typical FFI tests:
No GC.stress (it would cause segfaults, not object replacement)
Transient classes that go out of scope immediately
I wrapped it in a Docker container with memory constraints:
FROM ruby:3.4.5-alpine
RUN apk add --no-cache build-base
RUN gem install ffi -v 1.16.3
WORKDIR /app
COPY poc.rb .
CMD ["ruby", "poc.rb"]
Then I created a bash script to run it in a loop, filtering for the specific error:
#!/bin/bash

run_count=0
log_dir="./logs"
mkdir -p "$log_dir"

echo "Building Docker image..."
docker build -t ffi-bug-poc .

echo "Running POC in a loop until bug is reproduced..."
echo "Looking for exit code 1 with 'undefined' in output"
echo

while true; do
  run_count=$((run_count + 1))
  timestamp=$(date +%Y%m%d_%H%M%S)
  log_file="${log_dir}/run_${run_count}_${timestamp}.log"

  echo -n "Run #${run_count} at $(date +%H:%M:%S)... "

  # Run with memory constraints to increase GC pressure
  docker run --rm \
    --memory=512m \
    --memory-swap=0m \
    ffi-bug-poc > "$log_file" 2>&1

  exit_code=$?

  # Filter: only care about exit code 1 with "undefined" in output
  # Ignore segfaults (exit 139) - those are from #1079
  if [ $exit_code -eq 1 ] && grep -qi "undefined" "$log_file"; then
    echo ""
    echo "🐛 BUG REPRODUCED on run #${run_count}! 🐛"
    cat "$log_file"
    exit 0
  elif [ $exit_code -eq 0 ]; then
    echo "completed successfully (no bug)"
    rm "$log_file"
  else
    echo "exit code $exit_code (segfault) - continuing..."
  fi

  sleep 0.1
done
I hit Enter and watched the terminal:
Building Docker image...
Running POC in a loop until bug is reproduced...
Looking for exit code 1 with 'undefined' in output
Run #1 at 14:32:15... completed successfully (no bug)
Run #2 at 14:32:18... completed successfully (no bug)
Run #3 at 14:32:21... exit code 139 (segfault) - continuing...
Run #4 at 14:32:24... completed successfully (no bug)
Lots of segfaults - those were the #1079 issue. I was hunting for the specific undefined method error.
After realizing I needed even more memory churn, I opened multiple terminals and ran the loop script several times in parallel. Within minutes:
Run #23 at 15:18:42... exit code 139 (segfault) - continuing...
Run #24 at 15:18:45... completed successfully (no bug)
Run #25 at 15:18:48...
============================================================
🐛 BUG REPRODUCED! 🐛
============================================================
Error: undefined method 'default' for an instance of String
Backtrace:
poc.rb:82:in `[]'
poc.rb:82:in `block (2 levels) in <main>'
poc.rb:80:in `each'
poc.rb:80:in `each_with_index'
poc.rb:80:in `block in <main>'
<internal:numeric>:237:in `times'
poc.rb:50:in `<main>'
============================================================
There!
Not a segfault. Not the T_NONE error from #1079. There it is, the exact error from production: undefined method 'default' for an instance of String
An FFI internal Hash had been freed by the GC and replaced by a String object in the same memory location!
The Microsecond Window
Here's what happens in those microseconds when the bug triggers:
The Hash didn't get corrupted. It ceased to exist. A String was born in its place, wearing the Hash's memory address like a stolen identity.
What This Means for Ruby's Memory Model
This bug reveals something fundamental about how Ruby manages memory at the lowest level.
Objects don't have permanent identities. They're data structures living at particular memory addresses. When the garbage collector frees memory, Ruby will reuse it. If you're holding a C pointer to that address without proper write barriers, you're now pointing at whatever Ruby decided to create there next.
This is why write barriers exist. They're not optional extras for C extension authors. They're how you tell the garbage collector: "I'm holding a reference. Don't free this." Without them, you have use-after-free bugs that can manifest as objects changing identity at runtime.
The Fix and The Future
If you're using FFI < 1.17.0, the fix is straightforward:
# Gemfile
gem 'ffi', '~> 1.17.0'
That's it. Upgrade and the bug goes from million-to-one to zero.
The fix made by KJ adds proper write barriers throughout FFI's C codebase. The garbage collector now knows not to free rbFieldMap while it's still needed. Your Hashes stay Hashes. Your Strings stay Strings. Reality remains consistent.
Lessons From the Hunt
After spending days debugging what seemed impossible, a few things stood out. Sometimes the obvious answer is wrong. I burned hours convinced this was a musl issue - every diagnostic came back green, but the bug had nothing to do with data layout. It was about object identity.
The timing of garbage collection matters as much as whether it happens. GC.stress triggers immediate segfaults, while natural GC timing reveals delayed object transformations. My diagnostic scripts verified struct layouts perfectly but couldn't detect that FFI's internal Hash could be freed and replaced at runtime. They checked structure, not behavior.
Initially, I blamed myself. I guess that's what maintainership feels like sometimes - you own the stack, even when the bug is deeper than your code. The fix was already in FFI 1.17.0 when this happened. The user just hadn't upgraded yet.
Acknowledgments
The root cause - missing write barriers in FFI < 1.17.0 - was fixed by KJ (tracked in FFI issue #1079), who has also been my invaluable rubber duck throughout this debugging journey.
The Bottom Line
If you're running FFI < 1.17.0 in production - especially in high-restart environments like Kubernetes, ECS, or serverless platforms - upgrade today. The bug may be one in a million restarts, but at scale, million-to-one odds aren't odds at all.
They're inevitabilities waiting to happen.
If you want to understand more about write barriers and Ruby's GC internals, start here: Garbage Collection in Ruby.