Opening Note
We all make mistakes, and fundamentally, the havoc caused by this incident was due to a flaw in the design of rdkafka-ruby. While the disappearance of librdkafka from GitHub was unexpected, this article aims to clarify and explain how rdkafka-ruby should have prevented it and what was poorly designed. By examining this incident, I hope to provide insights into better practices for managing dependencies and ensuring more resilient software builds for the Ruby ecosystem.
Incident Summary
On July 10, 2024, at 15:47 UTC, users of the rdkafka gem faced issues when the librdkafka repository on GitHub unexpectedly went private. This break in the supply chain disrupted installations, causing widespread frustration and, in many cases, completely blocking the ability to deploy rdkafka-based software.
Fetching rdkafka 0.16.0
Installing rdkafka 0.16.0 with native extensions
Gem::Ext::BuildError: ERROR: Failed to build gem native extension.
current directory: /rdkafka-0.16.0/ext
/usr/local/bin/ruby -rrubygems
/rake-13.2.1/exe/rake
RUBYARCHDIR\=/home/circleci/.rubygems/extensions/x86_64-linux/3.3.0/rdkafka-0.16.0
RUBYLIBDIR\=/home/circleci/.rubygems/extensions/x86_64-linux/3.3.0/rdkafka-0.16.0
2 retrie(s) left for v2.4.0 (404 Not Found)
1 retrie(s) left for v2.4.0 (404 Not Found)
0 retrie(s) left for v2.4.0 (404 Not Found)
404 Not Found
rake aborted!
Errno::ENOENT: No such file or directory @ rb_sysopen - ports/archives/v2.4.0 (Errno::ENOENT)
/mini_portile2-2.8.7/lib/mini_portile2/mini_portile.rb:496:in `verify_file'
/mini_portile2-2.8.7/lib/mini_portile2/mini_portile.rb:133:in `block in download'
/mini_portile2-2.8.7/lib/mini_portile2/mini_portile.rb:131:in `each'
/mini_portile2-2.8.7/lib/mini_portile2/mini_portile.rb:131:in `download'
/mini_portile2-2.8.7/lib/mini_portile2/mini_portile.rb:232:in `cook'
/rdkafka-0.16.0/ext/Rakefile:38:in `block in <top (required)>'
/rake-13.2.1/exe/rake:27:in `<main>'
Tasks: TOP => default
(See full trace by running task with --trace)
Detailed Explanation
The rdkafka gem used to rely on downloading librdkafka from the Confluent GitHub repository during the installation process. As a huge proponent of immutable builds that do not depend on external resources, I had planned to change this model for a long time. Several months ago, I created a GitHub issue to address this transition. However, the change was delayed due to other priorities within the karafka ecosystem. Unfortunately, this delay resulted in the recent outage.
# Just the relevant code here
recipe.files << {
  :url => "https://codeload.github.com/edenhill/librdkafka/tar.gz/v#{Rdkafka::LIBRDKAFKA_VERSION}",
  :sha256 => Rdkafka::LIBRDKAFKA_SOURCE_SHA256
}
recipe.configure_options = ["--host=#{recipe.host}"]
recipe.cook
This setup meant that during the bundle install process, the required librdkafka source was fetched and compiled on the fly, which inherently relied on the availability of the external GitHub repository.
Upon discovery, it took me 59 minutes to release the first patched version and approximately four hours to prepare fixes and backport them to all relevant versions of the rdkafka gem, including older ones. Luckily, I was in front of my computer when the incident occurred, allowing me to quickly create and release the needed fixes.
Future Steps
Going forward, all future releases will depend only on RubyGems, ensuring no reliance on external build sources like GitHub. I decided to ship the librdkafka releases inside the gem itself, enhancing both its reliability and the stability of the ecosystem.
releases = File.expand_path(File.join(File.dirname(__FILE__), '../dist'))
recipe.files << {
  :url => "file://#{releases}/librdkafka_#{Rdkafka::LIBRDKAFKA_VERSION}.tar.gz",
  :sha256 => Rdkafka::LIBRDKAFKA_SOURCE_SHA256
}
recipe.configure_options = ["--host=#{recipe.host}"]
recipe.cook
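Since the archive now ships inside the gem, its integrity can be checked without any network access. The following is a minimal sketch of such a check, not the code rdkafka-ruby actually uses: the helper name verify_archive! and the stand-in file are assumptions made for illustration, and only Ruby's standard library is involved.

```ruby
require "digest"
require "tempfile"

# Hypothetical helper (not part of rdkafka-ruby or mini_portile2):
# raises unless the file at `path` has the expected SHA256 digest.
def verify_archive!(path, expected_sha256)
  raise ArgumentError, "archive not found: #{path}" unless File.file?(path)

  actual = Digest::SHA256.file(path).hexdigest
  return true if actual == expected_sha256

  raise "checksum mismatch for #{path}: expected #{expected_sha256}, got #{actual}"
end

# Demo with a stand-in file instead of a real librdkafka tarball.
Tempfile.create(["librdkafka_demo", ".tar.gz"]) do |file|
  file.binmode
  file.write("stand-in archive bytes")
  file.flush

  expected = Digest::SHA256.hexdigest("stand-in archive bytes")
  verify_archive!(file.path, expected)
end
```

A check along these lines could run at release time to confirm that the archive matching the pinned version is present and unchanged before the gem is pushed.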
Fragility of the OSS Supply Chain
This incident highlights our dependence on other OSS projects and repositories. It's essential to remember that mistakes can happen, and we must be prepared. This wasn't the first issue with GitHub downloads. In 2023, a change in GitHub's tar layout broke a lot of software, including ours, that relied on checksums for artifact verification. To be honest, if we had migrated the building process of rdkafka at that time, this article would not have had to be written.
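To illustrate why a checksum over a generated archive is fragile, here is a small sketch showing that the same content can yield different archive digests when only the way it is packed changes. It is a general illustration under the assumption that the archive bytes changed while the files inside did not; it does not reproduce the exact change GitHub made, and the payload is a stand-in rather than a real tarball.

```ruby
require "digest"
require "zlib"

# Stand-in for an archive's uncompressed contents (not a real tarball).
payload = "librdkafka source stand-in\n" * 1_000

# Same bytes in, packed with two different compression settings.
fast = Zlib.gzip(payload, level: Zlib::BEST_SPEED)
small = Zlib.gzip(payload, level: Zlib::BEST_COMPRESSION)

# Compare digests of the compressed archives and of their unpacked contents.
archive_digests_match = Digest::SHA256.hexdigest(fast) == Digest::SHA256.hexdigest(small)
content_digests_match =
  Digest::SHA256.hexdigest(Zlib.gunzip(fast)) == Digest::SHA256.hexdigest(Zlib.gunzip(small))

puts "archive digests match: #{archive_digests_match}"
puts "content digests match: #{content_digests_match}"
```

A digest pinned against an archive that someone else regenerates on demand only holds for as long as they keep producing byte-identical output, which is one more reason to ship a fixed file.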
Here are my main takeaways from this incident:
- Design Flaws Can Amplify Issues: The incident highlighted how design flaws in dependency management can lead to significant disruptions.
- Dependency on External Repositories: Relying on external data sources during the build process poses risks, especially when unexpected changes occur.
- Importance of Immutable Builds: Adopting immutable builds that do not depend on external resources can enhance reliability and stability.