parallel | Closer to Code

Table of Contents

1 Ephemeral bug from a test-suite
2 Process shutdown coordination
3 Pinpointing the issue
4 Reproduction
5 Fixing the issue
6 Fixing the world
7 Summary
8 Afterwords

For the past several weeks, I've been trying to fix a cranky spec in Karafka integrations suite, which in the end, lead me to become a Ruby on Rails micro-contributor and submitting similar fix to several other high-popularity projects from the Ruby ecosystem. Here's my story of trying to make sense of my specs and Ruby concurrency.

Ephemeral bug from a test-suite

Karafka is a Ruby and Rails multi-threaded efficient Kafka processing framework. To provide reliable OSS that is multi-threaded, I had to have the option to run my test suite concurrently to simulate how Karafka operates. Since it was a specific use case, I created my micro-framework.

Long story short: It runs end-to-end integration specs by running them in separate Ruby processes. Each starts Karafka, runs all the code in various configurations, connects to Kafka, checks assertions, and at the end, shuts down.

Such an approach allowed me to ensure that the process's whole lifecycle and its components work as expected. Specs are started with supervision, so in case of any hang, it will be killed after 5 minutes.

Karafka itself also has an internal shutdown supervisor. In case of a user shutdown request, if the shutdown takes longer than the defined expected time, Karafka will stop despite having things running. And this is what was happening with this single spec:

E, [2022-11-19T16:47:49.602718 #14843] ERROR -- : Forceful Karafka server stop
F, [2022-11-19T16:47:49.602825 #14843] FATAL -- : #<Karafka::Core::Monitoring::Event:0x0000562932d752b0 @id="error.occurred", @payload={:caller=>Karafka::Server, :error=>#<Karafka::Errors::ForcefulShutdownError: Karafka::Errors::ForcefulShutdownError>, :type=>"app.stopping.error"}>

This damn spec did not want to stop!

Many things are working under the hood:

workers that process jobs that could hang and force the process to wait
jobs queue that is also connected to the polling thread (to poll more data when no work is to be done)
listeners that poll data from Kafka that could hang
consumer groups with several threads polling Kafka data that might get stuck because of some underlying error
Other bugs in the coordination of work and states.

One thing that certainly worked was the process supervision that would forcefully kill it after 30 seconds.

Process shutdown coordination

The graceful shutdown of such a process takes work. When you have many connections to Kafka, upon a poorly organized shutdown, you may trigger several rebalances that may cause short-lived topics assignments causing nothing except friction and potentially blocking the whole process.

To mitigate this, Karafka shuts down actively and gracefully. That is, until the absolute end, it claims the ownership of given topics and partitions, actively waiting for all the current work to be finished. This looks more or less like so:

Note: Consumer groups internally in Karafka are a bit different than Kafka consumer groups. Here we focus on internal Karafka concepts.

Pinpointing the issue

After several failed attempts and fixing other bugs, I added a lot of extra instrumentation to check what Karafka hangs on. It was hanging because there were hanging listener threads!

As stated above, to close Karafka gracefully, all work from the jobs queue needs to be finished, and listeners that poll data from Kafka need to be able to exit the polling loops. It's all coordinated using a job queue. The job queue we're using is pretty complex with some blocking capabilities, and you can read about it here, but the interesting part of the code can be reduced to this:

@semaphores = Concurrent::Map.new { |h, k| h[k] = Queue.new }

Those queues are used as semaphores in the polling loops until all the current work is done. Since each Queue is assigned to a different subscription group within its thread and hidden behind a concurrent map, there should be no problem. Right?

Reproduction

Once I had my crazy suspicion, I decided to reduce it down to a proof of concept:

require 'concurrent-ruby'

100.times do
  ids = Set.new
  semaphores = Concurrent::Hash.new { |h, k| h[k] = Queue.new }

  100.times.map do
    Thread.new do
      ids << semaphores['test'].object_id
    end
  end.each(&:join)

  raise "I expected 1 semaphore but got #{ids.size}" if ids.size != 1
end

once executed, boom:

poc.rb:13:in `<main>': I expected 1 semaphore but got 2 (RuntimeError)

There is more than one semaphore for one listener! This caused Karafka to wait until forced to stop because the worker thread would use a different semaphore than the listener thread.

But how is that even possible?

Well, Concurrent::Hash and Concurrent::Map initialization is indeed thread-safe but not precisely as you would expect them to be. The docs state that:

This version locks against the object itself for every method call, ensuring only one thread can be reading or writing at a time. This includes iteration methods like #each, which takes the lock repeatedly when reading an item.

"only one thread can be reading or writing at a time". However, we are doing both at different times. Our code:

semaphores = Concurrent::Hash.new { |h, k| h[k] = Queue.new }

is actually equivalent to:

semaphores = Concurrent::Hash.new do |h, k|
  queue = Queue.new
  h[k] = queue
end

and the block content is not locked fully. One threads queue can overwrite the other if the Ruby scheduler stops the execution in the middle. Here's the flow of things happening in the form of a diagram:

Once in a while listener would receive a dangling queue object, effectively blocking the polling process.

Fixing the issue

This can be fixed either by replacing the Concurrent::Hash with Concurrent::Map and using the #compute_if_absent method or by introducing a lock inside of the Concurrent::Hash initialization block:

Concurrent::Map.new do |k, v|
  k.compute_if_absent(v) { [] }
end

mutex = Mutex.new

Concurrent::Hash.new do |k, v|
  mutex.synchronize do
    break k[v] if k.key?(v)
    k[v] = []
  end
end

Okay, but what does Ruby on Rails and other projects do with all of this?

Fixing the world

I figured out that if I made this mistake, maybe others did. I decided to check my local gems to find occurrences quickly. Inside my local gem cache, I executed the following code:

fgrep -R 'Concurrent::Hash.new {' ./
fgrep -R 'Concurrent::Hash.new do' ./
fgrep -R 'Concurrent::Map.new {' ./
fgrep -R 'Concurrent::Map.new do' ./

and validated that I'm not an isolated case. I wasn't alone!

Then using Sourcegraph I pinpointed a few projects that had the potential for fixes:

rails (activesupport and actionview)
i18n
dry-schema
finite_machine
graphql-ruby
rom-factory
apache whimsy
krane
puppet

I am not a domain expert in any of those, and understanding the severity of each was beyond my time constraints, but I decided to give it a shot.

Rails (ActiveSupport and ActionView)

Within Rails, this "pattern" was used twice: in ActiveSupport and ActionView.

In ActionView, it was used within a cache:

PREFIXED_PARTIAL_NAMES = Concurrent::Map.new do |h, k|
  h[k] = Concurrent::Map.new
end

and assuming that the cached result is stateless (same result each time for the same key), the issue could only cause an extra computation upon first parallel requests to this cache.

In the case of ActiveSupport, none of the concurrency code was needed, so I just replaced it with a simple Hash.

Both, luckily, were not that severe, though worth fixing nonetheless.

PR: https://github.com/rails/rails/pull/46536
PR: https://github.com/rails/rails/pull/46534

Both were merged, and this is how I became a Ruby on Rails contributor :)

i18n

This case was slightly more interesting because the concurrent cache stores all translations. In theory, this could cause similar leakage as in Karafka, effectively losing a language by loading it to a different Concurrent::Hash:

100.times.map do
  Thread.new do
    I18n.backend.store_translations(rand.to_s, :foo => { :bar => 'bar', :baz => 'baz' })
  end
end.each(&:join)

I18n.available_locales.count #=> 1

This could lead to hard-to-debug problems. Once in a while, your system could raise something like this:

:en is not a valid locale (I18n::InvalidLocale)

without an apparent reason, and this problem would go away after a restart.

PR: https://github.com/ruby-i18n/i18n/pull/644

dry-schema

Another cache case where the risk would revolve around double-computing.

PR: https://github.com/dry-rb/dry-schema/pull/440

rom-factory

This one is interesting! Let's reduce the code to a smaller POC first and see what will happen under heavy threading:

require 'singleton'
require 'concurrent-ruby'

class Sequences
  include Singleton

  attr_reader :registry

  def initialize
    reset
  end

  def next(key)
    registry[key] += 1
  end

  def reset
    @registry = Concurrent::Map.new { |h, k| h[k] = 0 }
    self
  end
end

seq = Sequences.instance

loop do
  100.times.map do
    Thread.new { seq.next('boom') }
  end.each(&:join)

  size = seq.registry['boom']

  raise "Wanted 100 but got #{size}" unless size == 100

  seq.reset
end

poc.rb:37:in `block in <main>': Wanted 100 but got 1 (RuntimeError)

The counter value gets biased. What is even more interesting is that making the map safe won't be enough:

@registry = Concurrent::Map.new { |h, k| h.compute_if_absent(k) { 0 } }

poc.rb:36:in `block in <main>': Wanted 100 but got 55 (RuntimeError)

there is one more "unsafe" method:

def next(key)
  registry[key] += 1
end

this operation also is not atomic, thus needs to be wrapped with a mutex:

def initialize
  @mutex = Mutex.new
  reset
end

def next(key)
  @mutex.synchronize do
    registry[key] += 1
  end
end

Only then is this code safe to be used.

https://github.com/rom-rb/rom-factory/pull/80

Other repositories

finite-machine: https://github.com/piotrmurach/finite_machine/pull/77
graphql-Ruby: https://github.com/rmosolgo/graphql-ruby/pull/4251
Apache Whimsy: https://github.com/apache/whimsy/pull/169
Puppet: https://github.com/puppetlabs/puppet/pull/8951
Shopify Krane: https://github.com/Shopify/krane/pull/911
Issue reported in the concurrent-ruby repo: https://github.com/ruby-concurrency/concurrent-ruby/issues/970

Summary

In my opinion, there are a few outcomes of this story:

Karafka has a solid test-suite!
If you are doing concurrency-related work, you better test it in a multi-threaded environment and test it well.
Concurrency is hard to many of us (maybe that's because we are special ;) ).
RTFM and read it well :)
Do not be afraid to help others by submitting pull requests!

On the other hand, looking at the frequency of this issue, it may be worth opening a discussion about changing this behavior and making the initialization fully locked.

Afterwords

Concurrent::Hash under cRuby is just a Hash. You can check it out here.

Cover photo by James Broad on Attribution-NonCommercial-ShareAlike 2.0 Generic (CC BY-NC-SA 2.0). Image has been cropped.

There are various methods for calling shell commands in Ruby. Most of them either not wait at all for the results, or wait until the command execution is finished.

In many cases it is ok, because programmers want shell command results for further processing. Unfortunately this means that while a shell command runs, there's no way to get partial results and process them (multitasking FTW). It also means that all the results have to be buffered. It might be (for a long running intensive commands) a source of memory leaks in your applications.

Luckily there's a great way to process shell command data in a stream (row after row).

The task

Lets assume that we want to find first 10 files in our file system that match a given pattern (note that it could be achieved way better with just shell commands but it's not the point here).

The bad way

Here's a typical code to achieve that (and believe me - I've encountered solutions like that in production systems):

require 'memory_profiler'
report = MemoryProfiler.report do
  pattern = /test/
  results = `find / 2> /dev/null`.split("\n")
  selection = results.select { |file| file =~ pattern }
  selection[0..10]
end

report.pretty_print

This might seem elegant and it definitely works, but let's check Ruby's process memory usage:

Memory usage in MB (before and after find)

Total allocated: 661999052 bytes (2925598 objects)
Total retained:  40 bytes (1 objects)

allocated memory by gem
-----------------------------------
 661999052  other

allocated memory by class
-----------------------------------
 632408588  String
  26181752  Array
   3408440  MatchData
       232  Hash
        40  Process::Status

We had to use nearly 700MB of memory and it took us 4.7 seconds just to find few matching files. The time wouldn't be that bad, but memory usage like this is a bit overkill. It is bad mostly because find / lists all the files and the more things we have, the bigger output we get. This also means that our code will behave differently dependent on what machine it will run. For some we might not have memory problems but for others it might grow over 1GB.

Now imagine what would happen if we would execute this code in 25 Sidekiq concurrent workers...

Of course with GC running we might not kill our machine, but memory spikes will look kinda weird and suspicious.

The good way - hello IO.popen

Instead of waiting for all the files from find / command, let's process each line separately. To do so, we can use the IO.popen method.

IO.popen runs the specified command as a subprocess; the subprocess’s standard input and output will be connected to the returned IO object. (source)

It means that we can execute find command and feed our main process with every line of the output separately.

Note: IO.popen executed without a block will not wait for the subprocess to finish!

require 'memory_profiler'
report = MemoryProfiler.report do
  pattern = /test/
  selection = []

  IO.popen('find / 2> /dev/null') do |io|
    while (line = io.gets) do
      # Note - here you could use break to get out and sigpipe
      # subprocess to finish it early. It will however mean that your subprocess
      # will stop running early and you need to test if it will stop without
      # causing any trouble
      next if selection.size > 10
      selection << line if line =~ pattern
    end
  end

  selection[0..10]
end

report.pretty_print

Results:

Total allocated: 362723119 bytes (2564923 objects)
Total retained:  394 bytes (3 objects)

allocated memory by gem
-----------------------------------
 362723119  other

allocated memory by class
-----------------------------------
 362713215  String
      8432  IO
      1120  MatchData
       232  Hash
        80  Array
        40  Process::Status

45% less memory required.

The best way (for some cases)

There's also one more way to do the same with the same #popen but in a slightly different style. If you:

Don't need to process all the lines from the executed command
Can terminate subprocess early
Are aware of how to manage subprocesses

you can then stream data into Ruby as long as you need and terminate once you're done. Than way Ruby won't fetch new lines and won't have to GC them later on.

require 'memory_profiler'
report = MemoryProfiler.report do
  pattern = /test/
  selection = []
  run = true

  io = IO.popen('find / 2> /dev/null')

  while (run && line = io.gets) do
    if selection.size > 10
      Process.kill('TERM', io.pid)
      io.close
      run = false
    end

    selection << line if line =~ pattern
  end

  selection
end

report.pretty_print

Since we don't wait for the subprocess to finish, it definitely will be faster but what about memory consumption?

Total allocated: 509989 bytes (5613 objects)
Total retained:  448 bytes (4 objects)

allocated memory by gem
-----------------------------------
    509989  other

allocated memory by class
-----------------------------------
    499965  String
      8432  IO
      1120  MatchData
       232  Hash
       200  Array
        40  Process::Status

99% less than the original solution!

Note: This solution is not always applicable.

Summary

The way you execute shell commands really depends on few factors:

Do you need an output results at all?
Do you need all the lines from the output at the same time?
Can you do other stuff and return once the data is ready?
Can you process partial data?
Can you terminate subprocess early?

When you go with your code out of Ruby scope and when you execute shell commands, it is always good to ask yourself those questions. Sometimes achieving stream processing ability can be done only when the system is being built, so it is really good to think about that before the implementation. In general I would recommend to always consider streaming in every place where we cannot exactly estimate the external command result size. That way you won't be surprised when there will be a lot more data that initially anticipated.

Note: Attentive readers will notice, that I didn't benchmark memory used in the subprocess. That is true, however it was irrelevant to our case as the shell command for all the cases was exactly the same.

Cover photo by: heiwa4126 on Creative Commons license.

Tag: parallel

Ruby concurrency is hard: how I became a Ruby on Rails contributor

Ephemeral bug from a test-suite

Process shutdown coordination

Pinpointing the issue

Reproduction

Fixing the issue

Fixing the world

Rails (ActiveSupport and ActionView)

i18n

dry-schema

rom-factory

Other repositories

Summary

Afterwords

Ruby: Stream processing of shell command results

The task

The bad way

The good way - hello IO.popen

The best way (for some cases)

Summary