When Your Hash Becomes a String: Hunting Ruby’s Million-to-One Memory Bug

Every developer who maintains Ruby gems knows that sinking feeling when a user reports an error that shouldn't be possible. Not "difficult to reproduce", but truly impossible according to everything you know about how your code works.

That's exactly what hit me when a Karafka user's error tracker logged 2,700 identical errors in a single incident:

NoMethodError: undefined method 'default' for an instance of String
vendor/bundle/ruby/3.4.0/gems/karafka-rdkafka-0.22.2-x86_64-linux-musl/lib/rdkafka/consumer/topic_partition_list.rb:112 FFI::Struct#[]

The error meant something was calling #default on a String - yet I had never used a #default method anywhere in Karafka or rdkafka-ruby. And suddenly there were 2,700 reports in rapid succession until the process restarted and everything went back to normal.

The user added casually: "No worry, no harm done since this hasn't occurred on prod yet."

Yet. That word stuck with me.

Something had to change. Fast.

TL;DR: FFI < 1.17.0 has missing write barriers that cause Ruby's GC to free internal Hashes, allowing them to be replaced by other objects at the same memory address. Rare but catastrophic.

The Impossible Error

I opened the rdkafka-ruby code at line 112:

native_tpl[:cnt].times do |i|
  ptr = native_tpl[:elems] + (i * Rdkafka::Bindings::TopicPartition.size)
  elem = Rdkafka::Bindings::TopicPartition.new(ptr)

  # Line 112 - Where everything exploded
  if elem[:partition] == -1
    # ...
  end
end

The crash happened when accessing elem[:partition]. But elem is an FFI::Struct - a foreign function interface structure that bridges Ruby and C code - and :partition was declared as an integer:

class TopicPartition < FFI::Struct
  layout :topic, :string,
         :partition, :int32,
         :offset, :int64,
         :metadata, :pointer,
         :metadata_size, :size_t,
         :opaque, :pointer,
         :err, :int,
         :_private, :pointer
end

I dove into FFI's internals to understand what was happening. FFI doesn't use many Hashes in either its Ruby code or its C extension - there are only a few critical data structures. The most important one is rbFieldMap, an internal Hash that every struct layout maintains to store field definitions. When you access elem[:partition], FFI looks up :partition in this Hash to find the field's type, offset, and size.

This Hash is the heart of FFI's struct system. Without it, FFI can't map field names to their C memory locations.
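The lookup is visible from Ruby through FFI's public API. A minimal example with a throwaway struct rather than rdkafka's bindings:

require 'ffi'

class Point < FFI::Struct
  layout :x, :int32,
         :y, :int32
end

# Every field access consults the layout's internal field map
field = Point.layout[:y]
puts field.offset # => 4 (byte offset of :y inside the struct)
puts field.size   # => 4 (an int32 occupies four bytes)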

Why would it be calling default on a String?

I searched the entire codebase. No calls to #default anywhere in my code. I checked FFI's Ruby code. No calls to #default there either.

But #default is a Hash method. When you access a missing key with hash[key], Ruby's Hash implementation consults Hash#default for the fallback value.
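The mechanism is easy to see in plain Ruby - this illustrates Hash#default itself, not FFI's C internals:

h = { a: 1 }
h.default     # => nil - consulted when a key is missing
h[:missing]   # => nil, supplied via #default

# Strings don't implement it, so if Hash internals ever run against
# a String, you get exactly the error from the tracker:
'oops'.respond_to?(:default) # => false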

I stared at the backtrace. After billions of messages processed successfully, something in FFI's internals had fundamentally broken. An internal Hash that should contain field definitions was somehow... a String.

Investigating the musl Hypothesis

The gem was precompiled: karafka-rdkafka-0.22.2-x86_64-linux-musl. That suffix made me immediately suspicious. The user was running ruby:3.4.5-alpine in Docker, which uses musl libc instead of glibc.

I've debugged enough production issues to know that precompiled gems and Alpine Linux make a notorious combination. Different libc versions, different struct alignment assumptions, different CPU architecture quirks.

"This has to be musl," I thought. I spent some time building diagnostic scripts:

require 'ffi'

# Check FFI integer type sizes
module Test
  extend FFI::Library

  class IntTest < FFI::Struct
    layout :a, :int
  end

  class Int32Test < FFI::Struct
    layout :a, :int32
  end
end

int_size = Test::IntTest.size
int32_size = Test::Int32Test.size

puts "FFI :int size: #{int_size} bytes"
puts "FFI :int32 size: #{int32_size} bytes"
puts "Match: #{int_size == int32_size ? 'Yes' : 'No'}"

The response came back:

FFI :int size: 4 bytes
FFI :int32 size: 4 bytes
Match: Yes

The sizes matched. That ruled out basic type mismatches. But maybe alignment?

I sent another diagnostic to check struct padding:

# Check actual struct field offsets
module AlignTest
  extend FFI::Library

  class WithInt < FFI::Struct
    layout :topic, :pointer, :partition, :int32, :offset, :int64,
           :metadata, :pointer, :metadata_size, :size_t, :opaque, :pointer,
           :err, :int, :_private, :pointer
  end

  class WithInt32 < FFI::Struct
    layout :topic, :pointer, :partition, :int32, :offset, :int64,
           :metadata, :pointer, :metadata_size, :size_t, :opaque, :pointer,
           :err, :int32, :_private, :pointer
  end
end

err_offset_int = AlignTest::WithInt.offset_of(:err)
err_offset_int32 = AlignTest::WithInt32.offset_of(:err)
puts "Struct alignment: :err offset #{err_offset_int} vs #{err_offset_int32}"

Response:

Struct alignment: :err offset 48 vs 48

Perfect alignment. Now let's check the actual compiled struct from the gem:

actual_size = Rdkafka::Bindings::TopicPartition.size
actual_err_offset = Rdkafka::Bindings::TopicPartition.offset_of(:err)
puts "Actual gem struct: size=#{actual_size}, err_offset=#{actual_err_offset}"

expected_size = 64
expected_err_offset = 48
puts "Expected: size=#{expected_size}, err_offset=#{expected_err_offset}"

Response:

Actual gem struct: size=64, err_offset=48
Expected: size=64, err_offset=48

Everything matched and every "obvious" explanation had failed. The struct definitions were perfect. The memory layout was correct. There was no ABI mismatch, no musl-specific quirk, no CPU architecture issue.

And yet undefined method 'default' for an instance of String kept occurring.

The Moment Everything Stopped Making Sense

I went back to that error message with fresh eyes. Why default specifically?

In Ruby, when you access a Hash with hash[key], the implementation can call hash.default to check for a default value if the key doesn't exist. So if FFI is trying to call #default on a String, this would mean that rbFieldMap - the internal Hash that stores field definitions - is actually a String.

Sounds crazy, but wait! What if there was a case where Ruby could replace a Hash with a String at runtime? Not corrupt the Hash's data, but literally free the Hash and allocate a String in the same memory location?

That would explain everything. The C code would still have a pointer to memory address 0x000078358a3dfd28, thinking it points to a Hash. But Ruby's GC would have freed that Hash, and the memory allocator could create a String at the exact same address. The pointer would be valid. The memory would contain valid data. Just... the wrong type of data.

  • Not corrupted.
  • Not misaligned.
  • Not reading wrong offsets.

An object changing type at runtime. That shouldn't be possible unless... I searched FFI's GitHub issues and found #1079: "Crash with [BUG] try to mark T_NONE object" - about segfaults, not this specific error. But buried in the comments, KJ mentioned "missing write barriers" in FFI's C extension.

A write barrier is a mechanism that tells Ruby's garbage collector about references between objects. When C code stores a Ruby object pointer without using RB_OBJ_WRITE, the GC doesn't know that reference exists. The GC can then free the object, thinking nothing needs it anymore.

That's when it clicked. If FFI's rbFieldMap Hash was being freed by the GC, then Ruby could allocate a String in that exact memory location.

But first, I needed to understand the #1079 issue better. I wrote a simple reproduction:

require 'ffi'

puts "Ruby: #{RUBY_VERSION} | FFI: #{FFI::VERSION}"

# Enable aggressive GC to trigger the bug faster
GC.stress = 0x01 | 0x04

i = 0

loop do
  i += 1

  # Create transient struct class that immediately goes out of scope
  struct_class = Class.new(FFI::Struct) do
    layout :field1, :int32,
           :field2, :int64,
           :field3, :pointer,
           :field4, :string,
           :field5, :double,
           :field6, :uint8,
           :field7, :uint32,
           :field8, :pointer
  end

  instance = struct_class.new
  instance[:field1] = rand
  instance[:field2]
  # ... access various fields

  field = struct_class.layout[:field5]
  field.offset
  field.size

  print "." if i % 1000 == 0
end

This reproduced the #1079 segfaults beautifully - the "T_NONE object" errors where the GC frees objects so aggressively that Ruby tries to access null pointers.

rb_obj_info_dump: 
/3.4.0/gems/ffi-1.16.3/lib/ffi/struct_layout_builder.rb:171: [BUG] try to mark T_NONE object
ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [x86_64-linux]

-- Control frame information -----------------------------------------------
c:0044 p:---- s:0246 e:000245 CFUNC  :initialize
c:0043 p:---- s:0243 e:000242 CFUNC  :new
c:0042 p:0033 s:0236 e:000235 METHOD /gems/3.4.0/gems/ffi-1.16.3/lib/ffi/struct_layout_builder.rb:171

But my production bug wasn't a segfault. It was a magical transformation. The timing had to be different.

With GC.stress = true, the GC runs after every possible allocation. That causes immediate segfaults because objects get freed before Ruby can even allocate new objects in their memory slots.

But for a Hash to become a String, you need:

  1. GC to run and free the Hash
  2. Time to pass between the free and the next access
  3. Ruby to allocate a String in that exact memory slot
  4. Code to try accessing the "Hash" that's now a String

I couldn't use GC.stress. I needed natural GC timing with precise memory pressure.

Down the Rabbit Hole

I dove deeper into FFI's C extension code. In ext/ffi_c/StructLayout.c, I found the vulnerable code:

static VALUE
struct_layout_initialize(VALUE self, VALUE fields, VALUE size, VALUE align)
{
    StructLayout* layout;
    // ... initialization code ...

    layout->rbFieldMap = rb_hash_new();  // ← NO WRITE BARRIER
    layout->rbFields = rb_ary_new();
    layout->rbFieldNames = rb_ary_new();

    // Without RB_OBJ_WRITE, the GC doesn't know about these references!
    // ...
}

When FFI creates a struct layout, it allocates three Ruby objects:

  • a Hash for field lookups,
  • an Array of fields,
  • and an Array of field names.

It stores pointers to these objects in a C struct.

But in FFI 1.16.3 it didn't use RB_OBJ_WRITE to register these references with Ruby's garbage collector.

From the GC's perspective, the following is happening:

  1. A Hash is allocated at memory address 0x000078358a3dfd28.
  2. No Ruby code stores a reference to this memory address (as far as the GC can see).
  3. The struct class goes out of scope.
  4. The GC thinks: "Nobody needs this Hash anymore".
  5. The GC frees the memory.
  6. Ruby allocates a String at the address 0x000078358a3dfd28.
  7. FFI's C code still has the pointer, thinking it points to a Hash.
  8. Boom: undefined method 'default' for String.
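You can actually watch Ruby recycle heap slots from plain Ruby. A toy sketch using ObjectSpace - whether the freed slot really gets reused on any given run is up to the allocator and Ruby version:

require 'json'
require 'objspace'

address_of = ->(obj) { JSON.parse(ObjectSpace.dump(obj))['address'] }

hash_addr = address_of.call({}) # the Hash becomes garbage immediately
GC.start

# Allocate fresh Strings; one of them may land in the Hash's old slot
strings = Array.new(100_000) { +"fresh string" }
reused  = strings.find { |s| address_of.call(s) == hash_addr }

puts reused ? "Hash slot reused by a String" : "slot not reused this run"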

The fix in FFI 1.17.0 added proper write barriers:

static VALUE
struct_layout_initialize(VALUE self, VALUE fields, VALUE size, VALUE align)
{
    StructLayout* layout;
    // ... initialization code ...

    RB_OBJ_WRITE(self, &layout->rbFieldMap, rb_hash_new());  // ← FIXED!
    RB_OBJ_WRITE(self, &layout->rbFields, rb_ary_new());
    RB_OBJ_WRITE(self, &layout->rbFieldNames, rb_ary_new());

    // Now the GC knows: "self owns these objects, don't free them"
    // ...
}

This single macro call, RB_OBJ_WRITE, tells Ruby's garbage collector: "This C struct holds a reference to this Ruby object. Don't free it while the struct is alive."

Without it, you have a use-after-free vulnerability where the C side thinks it has a valid pointer, but Ruby has freed the memory and reused it for something else entirely.

Reproducing the Bug

Understanding the bug wasn't enough. I needed to reproduce it. Not the #1079 segfaults - the specific case where a Hash becomes something else.

The requirements were precise:

  • Thousands of transient struct class definitions that go out of scope.
  • Natural memory pressure to trigger GC (not GC.stress which causes segfaults).
  • Time between GC and field access for Ruby to allocate new objects.
  • Multi-threaded execution to increase memory churn.
  • Constant struct creation to maximize the replacement window.

Here's what I built:

#!/usr/bin/env ruby

require 'ffi'

# Unbuffer stdout so we see output immediately
$stdout.sync = true
$stderr.sync = true

2.times do
  Thread.new do
    loop do
      # Create an array to hold references temporarily
      # This creates more allocation pressure
      arr = []

      # Allocate many strings rapidly
      5000.times do
        arr << rand.to_s * 100
        arr << Time.now.to_s
        arr << "test string #{rand(10000)}"
      end
    end
  end
end

sleep(0.1)

# Shared pool of junk strings - the worker threads below feed it
garbage_strings = []

ars = Array.new(5) do |round|
  Thread.new do
    round_instances = []

    10000.times do |i|
      # Create a new struct class - this creates an rbFieldMap
      klass = Class.new(FFI::Struct) do
        layout :partition, :int32,
               :offset, :int64,
               :metadata, :pointer,
               :err, :int32,
               :value, :int64
      end

      # Create instance from this class
      ptr = FFI::MemoryPointer.new(klass.size)
      instance = klass.new(ptr)
      instance[:partition] = round * 100 + i
      instance[:offset] = (round * 100 + i) * 1000
      instance[:err] = 0
      garbage_strings << rand.to_s * 50 # feed the shared junk pool

      round_instances << instance
    end

    round_instances.each_with_index do |instance, i|
      begin
        partition = instance[:partition]
        offset = instance[:offset]
        err = instance[:err]
      rescue NoMethodError => e
        puts "\n" + "=" * 60
        puts "🐛 BUG REPRODUCED! 🐛"
        puts "=" * 60
        puts "Error: #{e.message}"
        puts "\nBacktrace:"
        puts e.backtrace[0..10]
        exit 1
      end
    end

    # Clear old strings to increase memory churn
    if garbage_strings.size > 50_000
      garbage_strings.shift(25_000)
    end
  end
end

ars.each(&:join)

Key differences from typical FFI tests:

  • No GC.stress (it would cause segfaults, not object replacement)
  • Multiple threads creating memory pressure simultaneously
  • Natural GC timing from memory allocation patterns
  • Time gap between struct creation and field access
  • Transient classes that go out of scope immediately

I wrapped it in a Docker container with memory constraints:

FROM ruby:3.4.5-alpine

RUN apk add --no-cache build-base

RUN gem install ffi -v 1.16.3

WORKDIR /app
COPY poc.rb .

CMD ["ruby", "poc.rb"]

Then I created a bash script to run it in a loop, filtering for the specific error:

#!/bin/bash

run_count=0
log_dir="./logs"
mkdir -p "$log_dir"

echo "Building Docker image..."
docker build -t ffi-bug-poc .

echo "Running POC in a loop until bug is reproduced..."
echo "Looking for exit code 1 with 'undefined' in output"
echo

while true; do
  run_count=$((run_count + 1))
  timestamp=$(date +%Y%m%d_%H%M%S)
  log_file="${log_dir}/run_${run_count}_${timestamp}.log"

  echo -n "Run #${run_count} at $(date +%H:%M:%S)... "

  # Run with memory constraints to increase GC pressure
  docker run --rm \
    --memory=512m \
    --memory-swap=0m \
    ffi-bug-poc > "$log_file" 2>&1

  exit_code=$?

  # Filter: only care about exit code 1 with "undefined" in output
  # Ignore segfaults (exit 139) - those are from #1079
  if [ $exit_code -eq 1 ] && grep -qi "undefined" "$log_file"; then
    echo ""
    echo "🐛 BUG REPRODUCED on run #${run_count}! 🐛"
    cat "$log_file"
    exit 0
  elif [ $exit_code -eq 0 ]; then
    echo "completed successfully (no bug)"
    rm "$log_file"
  else
    echo "exit code $exit_code (segfault) - continuing..."
  fi

  sleep 0.1
done

I hit Enter and watched the terminal:

Building Docker image...
Running POC in a loop until bug is reproduced...
Looking for exit code 1 with 'undefined' in output

Run #1 at 14:32:15... completed successfully (no bug)
Run #2 at 14:32:18... completed successfully (no bug)
Run #3 at 14:32:21... exit code 139 (segfault) - continuing...
Run #4 at 14:32:24... completed successfully (no bug)

Lots of segfaults - those were the #1079 issue. I was hunting for the specific undefined method error.

After realizing I needed even more memory churn, I opened multiple terminals and ran the loop script several times in parallel. Within minutes:

Run #23 at 15:18:42... exit code 139 (segfault) - continuing...
Run #24 at 15:18:45... completed successfully (no bug)
Run #25 at 15:18:48... 

============================================================
🐛 BUG REPRODUCED! 🐛
============================================================
Error: undefined method 'default' for an instance of String

Backtrace:
  poc.rb:82:in `[]'
  poc.rb:82:in `block (2 levels) in <main>'
  poc.rb:80:in `each'
  poc.rb:80:in `each_with_index'
  poc.rb:80:in `block in <main>'
  <internal:numeric>:237:in `times'
  poc.rb:50:in `<main>'
============================================================

There!

Not a segfault. Not the T_NONE error from #1079. There it was, the exact error from production: undefined method 'default' for an instance of String.

An FFI internal Hash had been freed by the GC and replaced by a String object in the same memory location!

The Microsecond Window

Here's what happens in those microseconds when the bug triggers: the Hash doesn't get corrupted. It ceases to exist. A String is born in its place, wearing the Hash's memory address like a stolen identity.

What This Means for Ruby's Memory Model

This bug reveals something fundamental about how Ruby manages memory at the lowest level.

Objects don't have permanent identities. They're data structures at memory addresses. When the garbage collector frees memory, Ruby will reuse it. If you're holding a C pointer to that address without proper write barriers, you're now pointing at whatever Ruby decided to create there next.

This is why write barriers exist. They're not optional extras for C extension authors. They're how you tell the garbage collector: "I'm holding a reference. Don't free this." Without them, you have use-after-free bugs that can manifest as objects changing identity at runtime.

The Fix and The Future

If you're using FFI < 1.17.0, the fix is straightforward:

# Gemfile
gem 'ffi', '~> 1.17.0'

That's it. Upgrade and the bug goes from million-to-one to zero.
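If an immediate upgrade isn't possible, a boot-time check at least makes the exposure visible. A hypothetical guard of my own, not something Karafka ships:

require 'ffi'

if Gem::Version.new(FFI::VERSION) < Gem::Version.new('1.17.0')
  warn "FFI #{FFI::VERSION} is missing GC write barriers (FFI issue #1079). " \
       "Upgrade to >= 1.17.0 to avoid use-after-free bugs."
end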

The fix made by KJ adds proper write barriers throughout FFI's C codebase. The garbage collector now knows not to free rbFieldMap while it's still needed. Your Hashes stay Hashes. Your Strings stay Strings. Reality remains consistent.

Lessons From the Hunt

After spending days debugging what seemed impossible, a few things stood out. Sometimes the obvious answer is wrong. I burned hours convinced this was a musl issue - every diagnostic came back green, but the bug had nothing to do with data layout. It was about object identity.

The timing of garbage collection matters as much as whether it happens. GC.stress triggers immediate segfaults, while natural GC timing reveals delayed object transformations. My diagnostic scripts verified struct layouts perfectly but couldn't detect that FFI's internal Hash could be freed and replaced at runtime. They checked structure, not behavior.

Initially, I blamed myself. I guess that's what maintainership feels like sometimes - you own the stack, even when the bug is deeper than your code. The fix was already in FFI 1.17.0 when this happened. The user just hadn't upgraded yet.

Acknowledgments

The root cause - missing write barriers in FFI < 1.17.0 - was tracked in FFI issue #1079 and fixed by KJ, who has been my invaluable rubber duck throughout this debugging journey.

The Bottom Line

If you're running FFI < 1.17.0 in production - especially in high-restart environments like Kubernetes, ECS, or serverless platforms - upgrade today. The bug may be one in a million restarts, but at scale, million-to-one odds aren't odds at all.

They're inevitabilities waiting to happen.

If you want to understand more about write barriers and Ruby's GC internals, start here: Garbage Collection in Ruby.

The 60-Second Wait: How I Spent Months Solving Ruby's Most Annoying Gem Installation Problem

Notice: While native extensions for rdkafka have been extensively tested and are no longer experimental, they may not work in all environments or configurations. If you find any issues with the precompiled extensions, please report them immediately and they will be resolved.

Every Ruby developer knows this excruciating feeling: you're setting up a project, running bundle install, and then... you wait. And wait. And wait some more as rdkafka compiles for what feels like eternity. Sixty to ninety seconds of pure frustration, staring at a seemingly frozen terminal that gives no indication of progress while your coffee goes cold.

I've been there countless times. As the maintainer of the Karafka ecosystem, I've watched developers struggle with this for years. The rdkafka gem - essential for Apache Kafka integration in Ruby - was notorious for its painfully slow installation. Docker builds took forever. CI pipelines crawled. New developers gave up before they even started. Not to mention the countless compilation crashes that were nearly impossible to debug.

Something had to change.

The Moment of Truth

It was during a particularly frustrating debugging session that I realized the real scope of the problem. I was helping a developer who couldn't get rdkafka to install on his macOS dev machine. Build tools missing. Compilation failing. Dependencies conflicting. The usual nightmare.

As I walked him through the solution for the hundredth time, I did some quick math. The rdkafka gem gets downloaded over a million times per month. Each installation takes about 60 seconds to compile. That's 60 million seconds of CPU time every month - nearly two years of continuous processing power wasted on compilation alone.
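The back-of-the-envelope math, with Ruby as the calculator (figures rounded, but the order of magnitude speaks for itself):

downloads_per_month = 1_000_000
seconds_per_install = 60

total_seconds = downloads_per_month * seconds_per_install # => 60_000_000
total_seconds / 3_600.0  # => ~16,667 hours of CPU time per month
total_seconds / 86_400.0 # => ~694 days - nearly two years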

But the real kicker? All those installations were essentially building the same thing over and over again. The same librdkafka library. The same OpenSSL. The same compression libraries. Millions of identical compilations happening across the world, burning through CPU cycles and developer patience.

That's when I decided to solve this once and for all.

Why Nobody Had Done This Before

You might wonder: if this was such an obvious problem, why hadn't anyone solved it already? The answer lies in what I call "compatibility hell."

Unlike many other Ruby gems that might need basic compilation, rdkafka is a complex beast. It wraps librdkafka, a sophisticated C library that depends on a web of other libraries:

  • OpenSSL for encryption
  • Cyrus SASL for authentication
  • MIT Kerberos for enterprise security
  • Multiple compression libraries (zlib, zstd, lz4, snappy)
  • System libraries that vary wildly across platforms

Every Linux distribution has slightly different versions of these libraries. Ubuntu uses one version of OpenSSL, CentOS uses another. Alpine Linux uses musl instead of glibc. macOS has its own quirks. Creating a single binary that works everywhere seemed impossible.

My previous attempts had failed because they tried to link against system libraries dynamically. This works great... until you deploy to a system with different library versions. Then everything breaks spectacularly.

The Deep Dive Begins

I started by studying how other Ruby gems had tackled similar problems. The nokogiri gem had become the gold standard for this approach - they'd successfully shipped precompiled binaries that eliminated the notorious compilation headaches that had plagued XML processing in Ruby for years. Their success proved that it was possible.

Other ecosystems had figured this out years ago. Python has wheels. Go has static binaries. Rust has excellent cross-compilation. While Ruby has improved with precompiled gems for many platforms, the ecosystem still feels inconsistent - you never know if you'll get a precompiled gem or need to compile from source. The solution, I realized, was static linking. Instead of depending on system libraries, I would bundle everything into self-contained binaries.

Every dependency would be compiled from source and linked statically into the final library.

Sounds simple, right? It wasn't.

The First Breakthrough

My first success came with Linux x86_64 GNU systems - your typical Ubuntu or CentOS server. After days of tweaking compiler flags and build scripts, I had a working prototype. The binary was larger than the dynamically linked version, but it worked anywhere.

The installation time dropped from 60+ seconds to under 5 seconds!

But then I tried it on Alpine Linux. Complete failure. Alpine uses musl libc instead of glibc, and my carefully crafted build didn't work at all.

Platform-Specific Nightmares

Each platform brought its own specific challenges:

  • Alpine Linux (musl): Different system calls, different library conventions, different compiler behavior. I had to rebuild the entire toolchain with musl-specific flags. The Cyrus SASL library was particularly troublesome - it kept trying to use glibc-specific functions that don't exist in musl.

  • macOS ARM64: Apple Silicon Macs use a completely different architecture. The build system had to use different SDK paths and handle Apple's unique library linking requirements. Plus, macOS has its own ideas about where libraries should live.

  • Security concerns: Precompiled binaries are inherently less trustworthy than source code. How do you prove that a binary contains exactly what it claims to contain? I implemented SHA256 verification for every dependency, but that was just the beginning.

The Compilation Ballet

Building a single native extension became an intricate dance of dependencies. Each library had to be compiled in the correct order, with the right flags, targeting the right architecture. One mistake and the entire build would fail.

I developed a common build system that could handle all platforms:

# Download and verify every dependency
secure_download "$(get_openssl_url)" "$OPENSSL_TARBALL"
verify_checksum "$OPENSSL_TARBALL"

# Build in the correct order
build_openssl_for_platform "$PLATFORM"
build_kerberos_for_platform "$PLATFORM"
build_sasl_for_platform "$PLATFORM"
# ... and so on

The build scripts grew to over 2,000 lines of carefully crafted shell code. Every platform had its own nuances, its own gotchas, its own way of making my life difficult.

The Security Rabbit Hole

Precompiled binaries introduce a fundamental security challenge: trust. When you compile from source, you can theoretically inspect every line of code. With precompiled binaries, you're trusting that the binary contains exactly what it claims to contain.

I spent weeks implementing a comprehensive security model consisting of:

  • SHA256 verification for every downloaded dependency
  • Cryptographic attestation through RubyGems Trusted Publishing
  • Reproducible builds with pinned versions
  • Supply chain protection against malicious dependencies

The build logs became a security audit trail:

[SECURITY] Verifying checksum for openssl-3.0.16.tar.gz...
[SECURITY] ✅ Checksum verified for openssl-3.0.16.tar.gz
[SECURITY] 🔒 SECURITY VERIFICATION COMPLETE
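The check behind those log lines boils down to very little code. A Ruby sketch of the idea - the real build scripts do it in shell, and the digest below is a placeholder, not the real one:

require 'digest'

# Digests are pinned ahead of time, next to the dependency versions
EXPECTED_SHA256 = {
  'openssl-3.0.16.tar.gz' => '<pinned digest>' # placeholder
}.freeze

def verify_checksum!(tarball)
  actual   = Digest::SHA256.file(tarball).hexdigest
  expected = EXPECTED_SHA256.fetch(File.basename(tarball))

  raise "Checksum mismatch for #{tarball}" unless actual == expected

  puts "[SECURITY] ✅ Checksum verified for #{File.basename(tarball)}"
end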

The CI/CD Nightmare

Testing native extensions across multiple platforms and Ruby versions created a combinatorial explosion of complexity. My GitHub Actions configuration grew from a simple test matrix to a multi-stage pipeline with separate build and test phases.

Each platform needed its own runners:

  • Linux builds ran on Ubuntu with Docker containers
  • macOS builds ran on actual macOS runners
  • Each build had to be tested across Ruby 3.1, 3.2, 3.3, 3.4, and 3.5

The CI pipeline became a carefully choreographed dance of builds, tests, and releases. One failure anywhere would cascade through the entire system.

Each platform requires 10 separate CI actions - from compilation to testing across Ruby versions. Multiplied across all supported platforms, this creates a complex 30-action pipeline.

The Release Process Revolution

Publishing native extensions isn't just gem push. Each release now involves building on multiple platforms simultaneously, testing each binary across Ruby versions, and coordinating the release of multiple platform-specific gems.

I implemented RubyGems Trusted Publishing, which uses cryptographic tokens instead of API keys. This meant rebuilding the entire release process from scratch, but it provided better security and audit trails.

The First Success

After months of work, I finally had working native extensions for all three major platforms. The moment of truth came when I installed the gem for the first time using the precompiled binary:

$ gem install rdkafka
Successfully installed rdkafka-0.22.0-x86_64-linux-gnu
1 gem installed

I sat there staring at my terminal, hardly believing it had worked. Months of frustration, debugging, and near misses had led to this moment. Three seconds instead of sixty!

The Ripple Effect

The impact went beyond faster installations. Docker builds that had previously taken several minutes now completed much faster. CI pipelines that developers had learned to ignore suddenly became responsive. New contributors could set up development environments without fighting compiler errors.

But the numbers that really struck me were the environmental ones. With over a million downloads per month, those 60 seconds of compilation time added up to 60 million seconds of CPU usage monthly. That's 16,667 hours of processing power - and all the associated energy consumption and CO2 emissions.

The Unexpected Challenges

Just when I thought I was done, new challenges emerged. Some users needed to stick with source compilation for custom configurations. Others wanted to verify that the precompiled binaries were truly equivalent to the source builds.

I added fallback mechanisms and comprehensive documentation. You can still force source compilation if needed:

gem 'rdkafka', force_ruby_platform: true

But for most users, the native extensions work transparently. Install the gem, and you automatically get the fastest experience possible.
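If you're curious which variant Bundler picked, RubyGems can tell you - a quick check, with the platform string in the comment being just an example:

spec = Gem::Specification.find_by_name('rdkafka')
puts spec.platform # e.g. "x86_64-linux-gnu" for precompiled, "ruby" for source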

Looking Back

This project taught me that sometimes the most valuable improvements are the ones users never notice. Nobody celebrates faster gem installation. There are no awards for reducing compilation time. But those small improvements compound into something much larger.

Every developer who doesn't wait for rdkafka to compile can focus on building something amazing instead. Every CI pipeline that completes faster means more iterations, more experiments, more innovation.

The 60-second problem is solved. It took months of engineering effort, thousands of lines of code, and more debugging sessions than I care to count. But now, when you run gem install rdkafka or bundle install, it just works.

Fast.


The rdkafka and karafka-rdkafka gems with native extensions are available now. Your next bundle install will be faster than ever. For complete documentation, visit the Karafka documentation.
