
One Thread to Poll Them All: How a Single Pipe Made WaterDrop 50% Faster

This is Part 2 of the "Karafka to Async Journey" series. Part 1 covered WaterDrop's integration with Ruby's async ecosystem and how fibers can yield during Kafka dispatches. This article covers another improvement in this area: migration of the producer polling engine to file descriptor-based polling.

When I released WaterDrop's async/fiber support in September 2025, the results were promising - fibers significantly outperformed multiple producer instances while consuming less memory. But something kept nagging me.

Every WaterDrop producer spawns a dedicated background thread for polling librdkafka's event queue. For one or two producers, nobody cares. But Karafka runs in hundreds of thousands of production processes. Some deployments use transactional producers, where each worker thread needs its own producer instance. Ten worker threads means ten producers and ten background polling threads - each competing for Ruby's GVL, each consuming memory, each doing the same repetitive work. Things will get even more intense once the Karafka consumer becomes async-friendly - work that is already under development.

The Thread Problem

Every time you create a WaterDrop producer, rdkafka-ruby spins up a background thread (rdkafka.native_kafka#<n>) that calls rd_kafka_poll(timeout) in a loop. Its job is to check whether librdkafka has delivery reports ready and to invoke the appropriate callbacks.

With one producer, you get one extra thread. With 25, you get 25. Each consumes roughly 1MB of stack space. Each competes with your application threads for the GVL. And most of the time, they're doing nothing - sleeping inside poll(timeout), waiting for events that may arrive once every few milliseconds.

I wanted one thread that could monitor all producers simultaneously, reacting only when there's actual work to do.

How librdkafka Polling Works (and Why It's Wasteful)

librdkafka is inherently asynchronous. When you produce a message, it gets buffered internally and dispatched by librdkafka's own I/O threads. When the broker acknowledges delivery, librdkafka places a delivery report on an internal event queue. rd_kafka_poll() drains that queue and invokes your callbacks.

The problem is how rd_kafka_poll(timeout) waits. Calling rd_kafka_poll(250) blocks for up to 250 milliseconds. From Ruby's perspective, this is a blocking C function call. The rdkafka-ruby FFI binding releases the GVL during this call so other threads can run, but the calling thread is stuck until either an event arrives or the timeout expires.

Every rd_kafka_poll(timeout) call must release the GVL before entering C and reacquire it afterward. This cycle happens continuously, even when the queue is empty. With 25 producers, that's 25 threads constantly cycling through GVL release/reacquire. And there's no way to say "watch these 25 queues and wake me when any of them has events."

The File Descriptor Alternative

Luckily for me, librdkafka has a lesser-known API that solves both problems: rd_kafka_queue_io_event_enable().

You can create an OS pipe and hand the write end to librdkafka:

int pipefd[2];
if (pipe(pipefd) != 0) { /* handle errno */ }
/* write the byte "1" to pipefd[1] on each empty -> non-empty transition */
rd_kafka_queue_io_event_enable(queue, pipefd[1], "1", 1);

Whenever the queue transitions from empty to non-empty, librdkafka writes a single byte to the pipe. The actual events are still on librdkafka's internal queue - the pipe is purely a wake-up signal. This is edge-triggered: it only fires on the empty-to-non-empty transition, not per-event.
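The mechanism is easy to see with a plain Ruby pipe standing in for the one handed to librdkafka (a sketch, not WaterDrop code):

```ruby
# A plain Ruby pipe standing in for the librdkafka wake-up pipe.
reader, writer = IO.pipe

writer.write("1")                          # queue went empty -> non-empty
ready, = IO.select([reader], nil, nil, 1)  # poller wakes on readability
byte = reader.read_nonblock(1) if ready    # drain the signal byte
# byte is "1"; the actual events still sit on the internal queue
```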

The read end of the pipe is a regular file descriptor that works with Ruby's IO.select. The Poller thread spends most of its time in IO.select, which handles GVL release natively. When a pipe signals readiness, we call poll_nb(0) - a non-blocking variant that skips GVL release entirely:

100,000 iterations:
  rd_kafka_poll:    ~19ms (5.1M calls/s) - releases GVL
  rd_kafka_poll_nb: ~12ms (8.1M calls/s) - keeps GVL
  poll_nb is ~1.6x faster

Instead of 25 threads each paying the GVL tax on every iteration, one thread pays it once in IO.select and then drains events across all producers without GVL overhead.

One Thread to Poll Them All

By default, a singleton Poller manages all FD-mode producers in a single thread.

When a producer is created with config.polling.mode = :fd, it registers with the global Poller instead of spawning its own thread. The Poller creates a pipe for each producer and tells librdkafka to signal through it.

The polling loop calls IO.select on all registered pipes. When any pipe becomes readable, the Poller drains it and runs a tight loop that processes events until the queue is empty or a configurable time limit is hit:

def poll_drain_nb(max_time_ms)
  deadline = monotonic_now + max_time_ms
  loop do
    events = rd_kafka_poll_nb(0)
    return true if events.zero?       # fully drained
    return false if monotonic_now >= deadline  # hit time limit
  end
end

When IO.select times out (~1 second by default), the Poller does a periodic poll on all producers regardless of pipe activity - a safety net for edge cases like OAuth token refresh that may not trigger a queue write. Regular events, including statistics.emitted callbacks, do write to the pipe and wake the Poller immediately.
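Putting those pieces together, the Poller's main loop can be sketched roughly like this (class and method names here are illustrative, not WaterDrop's actual internals):

```ruby
# Hedged sketch of the Poller loop; names are illustrative only.
class ToyPoller
  PERIODIC_TIMEOUT = 1.0 # seconds; the ~1s safety-net interval

  def initialize
    @pipes = {} # read end of each pipe => its producer
  end

  # In WaterDrop the write end would be handed to librdkafka via
  # rd_kafka_queue_io_event_enable; here we return it to the caller.
  def register(producer)
    reader, writer = IO.pipe
    @pipes[reader] = producer
    writer
  end

  def run_once
    ready, = IO.select(@pipes.keys, nil, nil, PERIODIC_TIMEOUT)

    if ready
      ready.each do |reader|
        reader.read_nonblock(16) # drain the wake-up byte(s)
        @pipes[reader].drain     # tight poll_nb loop for this producer
      end
    else
      # select timed out: periodic poll on all producers regardless of
      # pipe activity (OAuth refresh and similar signal-less events)
      @pipes.each_value(&:drain)
    end
  end
end
```

Everything sleeps inside the single IO.select call; a registered producer is drained the moment its pipe becomes readable.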

The Numbers

Benchmarked on Ruby 4.0.1 with a local Kafka broker, 1,000 messages per producer, 100-byte payloads:

Producers | Thread Mode  | FD Mode      | Improvement
        1 | 27,300 msg/s | 41,900 msg/s | +54%
        2 | 29,260 msg/s | 40,740 msg/s | +39%
        5 | 27,850 msg/s | 40,080 msg/s | +44%
       10 | 26,170 msg/s | 39,590 msg/s | +51%
       25 | 24,140 msg/s | 36,110 msg/s | +50%

39-54% faster across the board. The improvement comes from three things: immediate event notification via the pipe, the 1.6x faster poll_nb that skips GVL overhead, and consolidating all producers into a single polling thread that eliminates GVL contention.

The Trade-offs

Callbacks execute on the Poller thread. In thread mode, each producer's callbacks ran on its own polling thread. In FD mode with the default singleton Poller, all callbacks share the single Poller thread. Don't perform expensive or blocking operations inside message.acknowledged or statistics.emitted. This was never recommended in thread mode either, but FD mode makes it worse - if your callback takes 500ms, it delays polling for all producers on that Poller, not just one.

Don't close a producer from within its own callback when using FD mode. Callbacks execute on the Poller thread, and closing from within would cause synchronization issues. Close producers from your application threads.

How to Use It

producer = WaterDrop::Producer.new do |config|
  config.kafka = { 'bootstrap.servers': 'localhost:9092' }
  config.polling.mode = :fd
end

Pipe creation, Poller registration, lifecycle management - all handled internally.

You can differentiate priorities between producers:

high = WaterDrop::Producer.new do |config|
  config.polling.mode = :fd
  config.polling.fd.max_time = 200  # more polling time
end

low = WaterDrop::Producer.new do |config|
  config.polling.mode = :fd
  config.polling.fd.max_time = 50   # less polling time
end

max_time controls how long the Poller spends draining events for each producer per cycle. Higher values mean more events processed per wake-up but less fair scheduling across producers.

Dedicated Pollers for Callback Isolation

By default, all FD-mode producers share a single global Poller. If a slow callback in one producer risks starving others, you can assign a dedicated Poller via config.polling.poller:

dedicated_poller = WaterDrop::Polling::Poller.new

producer = WaterDrop::Producer.new do |config|
  config.kafka = { 'bootstrap.servers': 'localhost:9092' }
  config.polling.mode = :fd
  config.polling.poller = dedicated_poller
end

Each dedicated Poller runs its own thread (waterdrop.poller#0, waterdrop.poller#1, etc.). You can also share a dedicated Poller between a subset of producers to group them - for example, giving critical producers their own shared Poller while background producers use the global singleton. The dedicated Poller shuts down automatically when its last producer closes.

When config.polling.poller is nil (the default), the global singleton is used. Setting a custom Poller is only valid with config.polling.mode = :fd.

The Rollout Plan

I'm being deliberately cautious. Karafka runs in too many production environments to rush this.

Phase 1 (WaterDrop 2.8, now): FD mode is opt-in. Thread mode stays the default.

Phase 2 (WaterDrop 2.9): FD mode becomes the default. Thread mode remains available with a deprecation warning.

Phase 3 (WaterDrop 2.10): Thread mode is removed. Every producer uses FD-based polling.

A full major version cycle to test before it becomes mandatory.

What's Next: The Consumer Side

The producer was the easier target - simpler event loop, more straightforward queue management. I'm working on similar improvements for Karafka's consumer, where the gains could be even more significant. Consumer polling has additional complexity around max.poll.interval.ms and consumer group membership, but the core idea is the same: replace per-thread blocking polls with file descriptor notifications and efficient multiplexing.


Find WaterDrop on GitHub and check PR #780 for the full implementation details.

Ruby Floats: When 2.6x Faster Is Actually Slower (and Then Faster Again)

Update: This article originally concluded that Eisel-Lemire wasn't worth it for Ruby. I was wrong. After revisiting the problem, I found a way to make it work - and submitted a PR to Ruby. Read the full update at the end.

Recently, I submitted a PR to Ruby that optimizes Float#to_s using the Ryu algorithm, achieving 2-4x performance improvements for float-to-string conversion. While that work deserves its own article, this article is about what happened when I tried to optimize the other direction: string-to-float parsing.

String-to-float seemed like an equally promising target. It's a fundamental operation used everywhere - parsing JSON, reading configuration files, processing CSV data, and handling user input. Since the Ryu optimization worked so well for float-to-string, surely the reverse direction would yield similar gains?

I did my research. I found a state-of-the-art algorithm backed by an academic paper. I implemented it. All tests passed. It worked exactly as promised.

And then I threw it all away.

Finding the "Perfect" Algorithm

The Eisel-Lemire algorithm, published by Daniel Lemire in 2021 in his paper "Number Parsing at a Gigabyte per Second", looked like exactly what I needed. It's a modern approach to converting decimal strings to floating-point numbers, using 128-bit multiplication with precomputed powers of 5.

Rust uses it. Go adopted it in 1.16. The fast_float C++ library is built around it.

When two performance-conscious language communities both adopt the same algorithm, you pay attention.

The Implementation

I wrote about 1,100 lines of C: 128-bit multiplication helpers, a ~10KB lookup table for powers of 5, the core algorithm, and a wrapper matching Ruby's existing strtod interface. For edge cases (hex floats, numbers with more than 19 significant digits, ambiguous rounding), it falls back to the original implementation. In practice, maybe 0.01% of inputs hit the fallback.

All 59 Float tests passed. Round-trip verification worked.
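The round-trip property is simple to state and check (a sketch of the kind of verification involved, not the actual test suite): Float#to_s emits the shortest string that round-trips, so parsing it back must reproduce the exact same double.

```ruby
# Sketch of a round-trip check (not the actual Float test suite).
srand(1234) # deterministic sample
10_000.times do
  f = rand * 10.0**rand(-20..20)
  raise "round-trip failed for #{f}" unless Float(f.to_s) == f
end
```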

So how much faster was it?

The First Benchmark

Here's where I almost made a mistake.

I ran a benchmark with 3 million iterations across various float formats:

Test Case              | Unmodified | Eisel-Lemire | Speedup
Decimal (0.123456789)  | 0.185s     | 0.186s       | 1.00x
Scientific notation    | 0.162s     | 0.182s       | 0.89x
Math constants (Pi, E) | 0.538s     | 0.205s       | 2.62x
Currency values        | 0.155s     | 0.167s       | 0.93x
Coordinates            | 0.172s     | 0.171s       | 1.01x
Very small (1e-15)     | 0.220s     | 0.171s       | 1.29x
Very large (1e15)      | 0.218s     | 0.169s       | 1.29x
TOTAL                  | 2.316s     | 1.948s       | 1.19x

19% faster overall. The math constants case was 2.62x faster. I was ready to open a PR.

But something about the benchmark bothered me. I'd designed it to cover "various float formats" - which sounds reasonable until you realize I was testing what I expected to matter, not what actually matters.

The Second Benchmark

What numbers do Ruby applications actually parse?

Ruby runs web apps, reads config files, processes business data. It's not crunching scientific datasets. The floats it sees are prices, percentages, coordinates, timeouts. Mostly simple stuff.

So I benchmarked that:

Test Case                 | Unmodified | Eisel-Lemire | Change
Single digit (1-9)        | 0.236s     | 0.255s       | -8%
Two digits (10-99)        | 0.240s     | 0.289s       | -17%
Simple decimal (1.5, 2.0) | 0.244s     | 0.281s       | -13%
Price-like (9.99, 19.95)  | 0.258s     | 0.272s       | -5%
Short decimal (0.5, 0.25) | 0.255s     | 0.277s       | -8%
Simple scientific (1e5)   | 0.250s     | 0.268s       | -7%
Common short (3.14, 2.71) | 0.253s     | 0.264s       | -4%
TOTAL                     | 2.482s     | 2.710s       | -9%

9% slower on simple numbers. The numbers Ruby actually parses.

What Went Wrong

Eisel-Lemire has fixed overhead: parse the string, look up powers of 5, do 128-bit multiplication, construct the IEEE 754 double. That overhead pays off when the alternative is expensive.

But Ruby's existing strtod - based on David Gay's code from 1991 - has been tuned for 30+ years. It has fast paths for simple inputs like "1.5" or "99.99". For those cases, the old code is already fast. Eisel-Lemire's setup cost ends up being more expensive than the work it replaces.

The algorithm works exactly as advertised. In my view, it just solves a different problem than the one Ruby has.

Trying to Have It Both Ways

What if I used strtod for simple numbers and Eisel-Lemire only for complex ones?

Approach                    | Total Time | vs Baseline
Unmodified strtod           | 2.316s     | baseline
Pure Eisel-Lemire           | 1.948s     | +19%
Hybrid (digit threshold=8)  | 2.164s     | +7%
Hybrid (digit threshold=10) | 2.194s     | +6%
Hybrid (length-based)       | 2.060s     | +11%

Any dispatch check adds overhead. Counting digits or checking string length isn't free. The check itself eats into the gains.


Update: It Worked After All

After publishing this article, I decided to revisit the problem. The insight came from re-reading Nigel Tao's blog post, which mentions that the algorithm includes a "simple case" optimization for small mantissas that can be multiplied exactly by powers of 10.
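That simple case is easy to demonstrate in Ruby (a toy illustration, not Ruby's C implementation): when the decimal significand fits in 53 bits and the power of ten is itself exactly representable as a double, a single multiply or divide is guaranteed by IEEE 754 to produce the correctly rounded result.

```ruby
# Toy illustration of the "simple case" (not Ruby's actual C code):
# a 53-bit-or-smaller significand times an exactly representable power
# of ten needs only one floating-point operation, and IEEE 754
# guarantees that operation is correctly rounded.
def simple_case(significand, exp10)
  return nil if significand >= 2**53 || exp10.abs > 22

  # 10.0**n is exact for |n| <= 22 (5**22 still fits in 53 bits)
  exp10 >= 0 ? significand * 10.0**exp10 : significand / 10.0**-exp10
end

simple_case(314159, -5) # => 3.14159, bit-for-bit equal to Float("3.14159")
```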

The key realization: don't fight strtod on its home turf. Instead of replacing strtod entirely, I added fast paths that intercept simple numbers before they ever reach either algorithm:

  1. Ultra-fast path for small integers - handles "5", "42", "-123" (up to 3 digits) with direct digit parsing
  2. Ultra-fast path for simple decimals - handles "1.5", "9.99", "199.95" (up to 3+3 digits) using precomputed divisors
  3. Eisel-Lemire - handles complex numbers with many significant digits
  4. Fallback to strtod - for edge cases (hex floats, >19 digits, ambiguous rounding)

The fast paths are trivial - just a few comparisons and arithmetic operations. No 128-bit multiplication, no table lookups. For simple inputs, they're faster than both strtod and Eisel-Lemire.
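A Ruby rendition of those two paths might look like this (illustrative only - the real implementation is C in object.c, and the names and digit limits here are made up for the sketch):

```ruby
# Illustrative Ruby version of the two ultra-fast paths; the real code
# is C, and these names/limits are invented for the sketch.
POW10 = [1.0, 10.0, 100.0, 1000.0].freeze

def fast_parse(str)
  case str
  when /\A(-?)(\d{1,3})\z/            # path 1: small integers
    ($1 == "-" ? -1 : 1) * $2.to_i.to_f
  when /\A(-?)(\d{1,3})\.(\d{1,3})\z/ # path 2: simple decimals
    # Concatenate the digits into one exact integer, then do a single
    # division by an exact power of ten: one rounding, correct result.
    ($1 == "-" ? -1 : 1) * (($2 + $3).to_i / POW10[$3.length])
  else
    Float(str)                        # complex inputs: the full parser
  end
end
```

The single-division trick in path 2 matters: computing 9 + 0.99 would round twice, while 999 / 100.0 rounds exactly once and therefore matches the correctly rounded parse.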

New Benchmark Results

After implementing the fast paths, I ran the same benchmarks against Ruby master (3 million iterations):

Input Type                           | Master | Optimized | Improvement
Simple decimals ("1.5", "3.14")      | 0.154s | 0.125s    | 19% faster
Prices ("9.99", "19.95")             | 0.155s | 0.125s    | 19% faster
Small integers ("5", "42")           | 0.149s | 0.116s    | 22% faster
Math constants ("3.141592653589793") | 0.674s | 0.197s    | 3.4x faster
High precision ("0.123456789012345") | 0.554s | 0.199s    | 2.8x faster
Scientific ("1e5", "2e10")           | 0.154s | 0.153s    | ~same

The numbers that were 9% slower are now 19-22% faster. The complex numbers that were 2.6x faster are now 2.8-3.4x faster. No regressions anywhere.

The PR

Based on this work, I submitted PR #15655 to Ruby. The implementation adds about 320 lines to object.c plus a 10KB lookup table for powers of 5.

Summary

My first benchmark was designed to make me feel good about my work. It covered "various formats" which happened to include cases where Eisel-Lemire shines. Only when I forced myself to benchmark what Ruby actually does did reality show up.

My original conclusion wasn't wrong - pure Eisel-Lemire is slower for simple numbers. The mistake was treating it as an all-or-nothing choice. Theoretical performance gains are hypotheses. Benchmarks against real workloads are proof. And sometimes the best optimization isn't replacing an algorithm - it's knowing when not to run it.


Copyright © 2026 Closer to Code
