
Karafka 2.5 and Web UI 0.11: Next-Gen Consumer Control and Operational Excellence

Introduction

Imagine pausing a problematic partition, skipping a corrupted message, and resuming processing - all within 60 seconds, without deployments or restarts. This is now a reality with Karafka 2.5 and Web UI 0.11.

This release fundamentally changes how you operate Kafka applications. It introduces live consumer management, which transforms incident response from a deployment-heavy process into direct, real-time control.

I'm excited to share these capabilities with the Karafka community, whose feedback and support have driven these innovations forward.

Sponsorship and Community Support

The progress of the Karafka ecosystem has been significantly powered by a collaborative effort, a blend of community contributions, and financial sponsorships. I want to extend my deepest gratitude to all our supporters. Your contributions, whether through code or funding, make this project possible. Thank you for your continued trust and commitment to making Karafka a success. It would not be what it is without you!

Notable Enhancements and New Features

Below you can find detailed coverage of the major enhancements, from live consumer management capabilities to advanced performance optimizations that can improve resource utilization by up to 50%. For a complete list of all changes, improvements, and technical details, please visit the changelog page.

Live Consumer Management: A New Operational Model

This update introduces live consumer management capabilities that change how you operate Karafka applications. You can now pause partitions, adjust offsets, manage topics, and control consumer processes in real time through the web interface.

Note: While the pause and resume functionality provides immediate operational control, these controls are session-based and will reset during consumer rebalances caused by group membership changes or restarts. I am developing persistent pause capabilities that survive rebalances, though this feature is still several development cycles away. The current implementation remains highly valuable for incident response, temporary load management, and real-time debugging where quick action is needed within typical rebalance intervals.

Why This Matters

Traditional Kafka operations often require:

  • Deploying code changes to skip problematic messages
  • Restarting entire applications to reset consumer positions
  • Manual command-line interventions during incidents
  • Complex deployment coordination

This release provides an alternative approach. You now have more control over your Karafka infrastructure, which can significantly reduce incident response times.

Consider these operational scenarios:

  • Incident Response: A corrupted message is causing consumer crashes. You can pause the affected partition, adjust the offset to skip the problematic message, and resume processing - typically completed in under a minute.

  • Data Replay: After fixing a bug in your processing logic, you need to reprocess recent messages. You can navigate to a specific timestamp in the Web UI and adjust the consumer position directly.

  • Load Management: Your downstream database is experiencing a high load. You can selectively pause non-critical partitions to reduce load while keeping essential data flowing.

  • Production Debugging: A consumer appears stuck. You can trace the running process to see what threads are doing and identify bottlenecks before taking action.

This level of operational control transforms Karafka from a "deploy and monitor" system into a directly manageable platform where you maintain control during challenging situations.

Partition-Level Control

The Web UI now provides granular partition management, giving you surgical precision over message processing:

  • Real-Time Pause/Resume: This capability addresses the rigidity of traditional consumer group management. You can temporarily halt the processing of specific data streams during maintenance, throttle consumption from high-volume partitions, or coordinate processing across multiple systems.

  • Live Offset Management: The ability to adjust consumer positions in real time eliminates the traditional "stop consumer → calculate offsets → update code → deploy → restart" cycle, which previously required a full deployment.

Complete Topic Lifecycle Management

Pro Web UI 0.11 introduces comprehensive topic administration capabilities that transform how you can manage your Kafka infrastructure. The interface now supports complete topic creation with custom partition counts and replication factors, while topic deletion includes impact assessment and confirmation workflows to prevent accidental removal of critical topics.

The live configuration management system lets you view and modify all topic settings, including retention policies, cleanup strategies, and compression settings. Dynamic partition scaling enables you to increase partition counts to scale throughput.
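
While these operations are exposed through the Web UI, the same kind of administration can also be scripted. Below is a brief, hedged sketch using Karafka's admin API; the topic name, counts, and config values are illustrative assumptions, not values from this release.

# Illustrative only: create a topic with custom settings, then scale it
Karafka::Admin.create_topic(
  'orders_states',
  6,  # partitions
  3,  # replication factor
  { 'retention.ms' => '604800000', 'cleanup.policy' => 'delete' }
)

# Increase the partition count of an existing topic
Karafka::Admin.create_partitions('orders_states', 12)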

This approach unifies Kafka administration in a single interface, eliminating the need to context-switch between multiple tools and command-line interfaces.

UI Customization and Branding

When managing multiple Karafka environments, visual distinction becomes critical for preventing costly mistakes. Web UI 0.11 introduces customization capabilities that allow you to brand different environments distinctively - think red borders for production, blue gradients for staging, or custom logos for other teams. Beyond safety, these features enable organizations to seamlessly integrate Karafka's Web UI into their existing design systems and operational workflows.

Karafka::Web.setup do |config|
  # Custom CSS for environment-specific styling
  config.ui.custom_css = '.dashboard { background: linear-gradient(45deg, #1e3c72, #2a5298); }'

  # Custom JavaScript for enhanced functionality
  config.ui.custom_js = 'document.addEventListener("DOMContentLoaded", () => { 
    console.log("Production environment detected"); 
  });'
end

The UI automatically adds controller and action-specific CSS classes for targeted styling:

/* Style only the dashboard */
body.controller-dashboard {
  background-color: #f8f9fa;
}

/* Highlight error pages */
body.controller-errors {
  border-top: 5px solid #dc3545;
}

Enhanced OSS Monitoring Capabilities

Web UI 0.11 promotes two monitoring features from Pro to the open source version: consumer lag statistics charts and consumer RSS memory statistics charts. These visual monitoring tools now provide all users with essential insights into consumer performance and resource utilization.

This change reflects my core principle: the more successful Karafka Pro becomes, the more I can give back to the OSS version. Your Pro subscriptions directly fund the research and development that benefits the entire ecosystem.

Open-source software drives innovation, and I'm committed to contributing meaningfully. By making advanced monitoring capabilities freely available, I ensure that teams of all sizes can access the tools needed to build robust Kafka applications. Pro users get cutting-edge features and support, while the broader community gains battle-tested tools for their production environments.

Performance and Reliability Improvements

While Web UI 0.11 delivers management capabilities, Karafka 2.5 focuses equally on performance optimization and advanced processing strategies. This version includes throughput improvements, enhanced error-handling mechanisms, and reliability enhancements.

Balanced Virtual Partitions Distribution

The new balanced distribution strategy for Virtual Partitions is a significant improvement, delivering up to 50% better resource utilization in high-throughput scenarios.

The Challenge:

Until now, Virtual Partitions have used consistent distribution, where messages with the same partitioner result go to the same virtual partition consumer. While predictable, this led to resource underutilization when certain keys contained significantly more messages than others, leaving some worker threads idle while others were overloaded.

The Solution:

routes.draw do
  topic :orders_states do
    consumer OrdersStatesConsumer

    virtual_partitions(
      partitioner: ->(message) { message.headers['order_id'] },
      # New balanced distribution for optimal resource utilization
      distribution: :balanced
    )
  end
end

The balanced strategy dynamically distributes workloads (a simplified sketch follows the list) by:

  • Grouping messages by partition key
  • Sorting key groups by size (largest first)
  • Assigning each group to the worker with the least current workload
  • Preserving message order within each key group
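
For intuition, here is a minimal, illustrative sketch of this greedy assignment. It is not Karafka's internal implementation; the messages, workers_count, and partitioner names are assumptions for demonstration only.

# Greedy "largest group to least-loaded worker" assignment sketch
def balanced_assignment(messages, workers_count, partitioner)
  # Group messages by their partitioner key, preserving in-group order
  groups = messages.group_by { |message| partitioner.call(message) }

  loads = Array.new(workers_count, 0)
  assignments = Array.new(workers_count) { [] }

  # Largest key groups first, each going to the currently least-loaded worker
  groups.values.sort_by { |group| -group.size }.each do |group|
    target = loads.each_with_index.min_by { |load, _index| load }.last
    assignments[target].concat(group)
    loads[target] += group.size
  end

  assignments
end

# Example: six key groups of sizes [50, 30, 20, 10, 5, 5] across two workers
# end up as two evenly loaded virtual partitions of 60 messages each
sizes = [50, 30, 20, 10, 5, 5]
messages = sizes.flat_map.with_index { |size, key| Array.new(size) { { key: key } } }
balanced_assignment(messages, 2, ->(m) { m[:key] }).map(&:size)
#=> [60, 60]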

When to Use Each Strategy:

Use :consistent when:

  • Processing requires stable assignment across batches
  • Implementing window-based aggregations spanning multiple polls
  • Keys have relatively similar message counts
  • Predictable routing is more important than utilization

Use :balanced when:

  • Processing is stateless, or state is managed externally
  • Maximizing worker thread utilization is a priority
  • Message keys have highly variable message counts
  • Optimizing throughput with uneven workloads

The performance gains are most significant with IO-bound processing, highly variable key distributions, and when keys outnumber available worker threads.

Advanced Error Handling: Dynamic DLQ Strategies

Karafka 2.5 Pro introduces context-aware DLQ strategies with multiple target topics:

class DynamicDlqStrategy
  def call(errors_tracker, attempt)
    if errors_tracker.last.is_a?(DatabaseError)
      [:dispatch, 'dlq_database_errors']
    elsif errors_tracker.last.is_a?(ValidationError)
      [:dispatch, 'dlq_validation_errors']
    elsif attempt > 5
      [:dispatch, 'dlq_persistent_failures']
    else
      [:retry]
    end
  end
end

class KarafkaApp < Karafka::App
  routes.draw do
    topic :orders_states do
      consumer OrdersStatesConsumer

      dead_letter_queue(
        topic: :strategy,
        strategy: DynamicDlqStrategy.new
      )
    end
  end
end

This enables:

  • Error-type-specific handling pipelines
  • Escalation strategies based on retry attempts
  • Separation of transient vs permanent failures
  • Specialized recovery workflows

Enhanced Error Tracking

The errors tracker now includes a distributed correlation trace_id that is added both to DLQ-dispatched messages and to errors reported in the Web UI, making it easier to track and correlate error occurrences with their DLQ dispatches.
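
As a hedged illustration, a custom DLQ strategy could log that identifier alongside its dispatch decision. The sketch below assumes the tracker exposes it as errors_tracker.trace_id, which is an assumption, not a documented API reference.

# Hypothetical sketch: logging the correlation trace_id from the errors
# tracker inside a custom DLQ strategy (the trace_id accessor is assumed)
class TracingDlqStrategy
  def call(errors_tracker, attempt)
    Karafka.logger.error(
      "DLQ dispatch (trace: #{errors_tracker.trace_id}, attempt: #{attempt})"
    )

    [:dispatch, 'dlq_failures']
  end
end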

Worker Thread Priority Control

Karafka 2.5 introduces configurable worker thread priority with intelligent defaults:

# Workers now run at priority -1 by default (a ~50ms scheduling timeslice
# in CRuby) for better system responsiveness
# This can be customized based on your system requirements
config.worker_thread_priority = -2

This prevents worker threads from monopolizing system resources, leading to more responsive overall system behavior under high load.

FIPS Compliance and Security

All internal cryptographic operations now use SHA256 instead of MD5, ensuring FIPS compliance for environments with strict enterprise security requirements.

Coming Soon: Parallel Segments

Karafka 2.5 Pro also introduces Parallel Segments, a new feature for concurrent processing within the same partition when there are more processes than partitions. This capability will be covered in a dedicated blog post once the documentation is finalized.

Migration Notes

Karafka 2.5 and its related components introduce a few minor breaking changes necessary to advance the ecosystem. None of them disrupt routing or consumer group configuration. Detailed information and guidance can be found on the Karafka Upgrades 2.5 documentation page.

Migrating to Karafka 2.5 should be manageable. I have made every effort to ensure that breaking changes are justified and well-documented, minimizing potential disruptions.

Conclusion

Karafka 2.5 and Web UI 0.11 mark another step in the framework's evolution, continuing to address real-world operational needs and performance challenges in Kafka environments.

I thank the Karafka community for their ongoing feedback, contributions, and support. Your input drives the framework's continued development and improvements.


The complete changelog and upgrade instructions are available in the Karafka documentation.

Breaking the Rules: RPC Pattern with Apache Kafka and Karafka

Introduction

Using Kafka for Remote Procedure Calls (RPC) might raise eyebrows among seasoned developers. At its core, RPC is a programming technique that creates the illusion of running a function on a local machine when it executes on a remote server. When you make an RPC call, your application sends a request to a remote service, waits for it to execute some code, and then receives the results - all while making it feel like a regular function call in your code.

Apache Kafka, however, was designed as an event log, optimizing for throughput over latency. Yet, sometimes unconventional approaches yield surprising results. This article explores implementing RPC patterns with Kafka using the Karafka framework. While this approach might seem controversial - and rightfully so - understanding its implementation, performance characteristics, and limitations may provide valuable insights into Kafka's capabilities and distributed system design.

The idea emerged from discussing synchronous communication in event-driven architectures. What started as a theoretical question - "Could we implement RPC with Kafka?" - evolved into a working proof of concept that achieved millisecond response times in local testing.

In modern distributed systems, the default response to new requirements often involves adding another specialized tool to the technology stack. However, this approach comes with its own costs:

  • Increased operational complexity,
  • Additional maintenance overhead,
  • And more potential points of failure.

Sometimes, stretching the capabilities of existing infrastructure - even in unconventional ways - can provide a pragmatic solution that avoids these downsides.

Disclaimer: This implementation serves as a proof-of-concept and learning resource. While functional, it lacks production-ready features like proper timeout handling, resource cleanup after timeouts, error propagation, retries, message validation, security measures, and proper metrics/monitoring. The implementation also doesn't handle edge cases like Kafka cluster failures. Use this as a starting point to build a more robust solution.

Architecture Overview

Building an RPC pattern on top of Kafka requires careful consideration of both synchronous and asynchronous aspects of communication. At its core, we're creating a synchronous-feeling operation by orchestrating asynchronous message flows underneath. From the client's perspective, making an RPC call should feel synchronous - send a request and wait for a response. However, once a command enters Kafka, all the underlying operations are asynchronous.

Core Components

Such an architecture has to rely on several key components working together:

  • Two Kafka topics form the backbone - a command topic for requests and a result topic for responses.
  • A client-side consumer, running without a consumer group, that actively matches correlation IDs and starts from the latest offset to ensure we only process relevant messages.
  • The commands consumer in our RPC server that processes requests and publishes results.
  • A synchronization mechanism using mutexes and condition variables that maintains thread safety and handles concurrent requests.

Implementation Flow

A unique correlation ID is always generated when a client initiates an RPC call. The command is then published to Kafka, where it's processed asynchronously. The client blocks execution using a mutex and condition variable while waiting for the response. Meanwhile, the message flows through several stages:

  • command topic persistence,
  • consumer polling and processing,
  • result publishing,
  • result topic persistence,
  • and finally, the client-side consumer matching the correlation ID with the response and signaling completion.

Diagram: the RPC flow over Kafka - the journey of a single request-response cycle through the command and result topics.

Design Considerations

This architecture makes several conscious trade-offs. We use single-partition topics to ensure strict ordering, which limits throughput but simplifies correlation and provides exactly-once processing semantics - though the partition count and related settings could be adjusted if higher scale becomes necessary. The custom consumer approach avoids consumer group rebalancing delays, while the synchronization mechanism bridges the gap between Kafka's asynchronous nature and our desired synchronous behavior. This design prioritizes correctness over maximum throughput, which aligns well with typical RPC use cases where reliability and simplicity are the key requirements.

Implementation Components

Getting from concept to working code requires several key components to work together. Let's examine the implementation of our RPC pattern with Kafka.

Topic Configuration

First, we need to define our topics. We use a single-partition configuration to maintain message ordering:

topic :commands do
  config(partitions: 1)
  consumer CommandsConsumer
end

topic :commands_results do
  config(partitions: 1)
  active false
end

This configuration defines two essential topics:

  • Command topic that receives and processes RPC requests
  • Results topic marked as inactive since we'll use a custom iterator instead of a standard consumer group consumer

Command Consumer

The consumer handles incoming commands and publishes results back to the results topic:

class CommandsConsumer < ApplicationConsumer
  def consume
    messages.each do |message|
      Karafka.producer.produce_async(
        topic: 'commands_results',
        # We evaluate whatever Ruby code comes in the payload
        # We return stringified result of evaluation
        payload: eval(message.raw_payload).to_s,
        key: message.key
      )

      mark_as_consumed(message)
    end
  end
end

We're using a simple eval to process commands for demonstration purposes. You'd want to implement proper command validation, deserialization, and secure processing logic in production.
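
As one hedged sketch of what that might look like, the eval call could be replaced with a whitelist of named commands carrying JSON arguments. The command names and JSON format below are illustrative assumptions:

require 'json'

# Dispatch table of allowed commands instead of evaluating arbitrary Ruby
class CommandRegistry
  HANDLERS = {
    'add'    => ->(args) { args.fetch('a') + args.fetch('b') },
    'upcase' => ->(args) { args.fetch('value').to_s.upcase }
  }.freeze

  def self.call(raw_payload)
    request = JSON.parse(raw_payload)
    handler = HANDLERS.fetch(request.fetch('command'))
    handler.call(request.fetch('args', {}))
  end
end

# In the consumer, `eval(message.raw_payload)` would then become:
#   CommandRegistry.call(message.raw_payload)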

Synchronization Mechanism

To bridge Kafka's asynchronous nature with synchronous RPC behavior, we implement a synchronization mechanism using Ruby's mutex and condition variables:

require 'singleton'

# Registry of in-flight RPC requests, keyed by correlation ID
class Accu
  include Singleton

  def initialize
    @running = {}
    @results = {}
  end

  def register(id)
    @running[id] = [Mutex.new, ConditionVariable.new]
  end

  def unlock(id, result)
    return false unless @running.key?(id)

    @results[id] = result
    mutex, cond = @running.delete(id)
    mutex.synchronize { cond.signal }
  end

  def result(id)
    @results.delete(id)
  end
end

This mechanism maintains a registry of pending requests and coordinates the blocking and unblocking of client threads based on correlation IDs. When a response arrives, it signals the corresponding condition variable to unblock the waiting thread.

The Client

Our client implementation brings everything together with two main components:

  1. A response listener that continuously checks for matching results
  2. A blocking command dispatcher that waits for responses

class Client
  class << self
    def run
      iterator = Karafka::Pro::Iterator.new(
        { 'commands_results' => true },
        settings: {
          'bootstrap.servers': '127.0.0.1:9092',
          'enable.partition.eof': false,
          'auto.offset.reset': 'latest'
        },
        yield_nil: true,
        max_wait_time: 100
      )

      iterator.each do |message|
        next unless message

        Accu.instance.unlock(message.key, message.raw_payload)
      rescue StandardError => e
        puts e
        sleep(rand)
        next
      end
    end

    def perform(ruby_remote_code)
      cmd_id = SecureRandom.uuid

      # Register before publishing so the response listener can find this
      # request even if the result arrives very quickly
      mutex, cond = Accu.instance.register(cmd_id)

      Karafka.producer.produce_sync(
        topic: 'commands',
        payload: ruby_remote_code,
        key: cmd_id
      )

      mutex.synchronize { cond.wait(mutex) }

      Accu.instance.result(cmd_id)
    end
  end
end

The client uses Karafka's Iterator to consume responses without joining a consumer group, which avoids rebalancing delays and ensures we only process new messages. The perform method handles the synchronous aspects:

  • Generates a unique correlation ID
  • Registers the request with our synchronization mechanism
  • Sends the command
  • Blocks until the response arrives

Using the Implementation

To use this RPC implementation, first start the response listener in a background thread:

# Do this only once per process
Thread.new { Client.run }

Then, you can make synchronous RPC calls from your application:

Client.perform('1 + 1')
#=> "2"

Each call blocks until the response arrives, making it feel like a regular synchronous method call despite the underlying asynchronous message flow.

Despite its simplicity, this implementation achieves impressive performance in local testing - roundtrip times as low as 3ms. However, remember this assumes ideal conditions and minimal command processing time. Real-world usage would need additional error handling, timeouts, and more robust command processing logic.
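
As a starting point for that hardening, a timeout could be added with the optional timeout argument of ConditionVariable#wait. The sketch below mirrors the Client example above and is an assumption, not part of the original implementation:

# Hedged sketch: a perform variant that gives up after `timeout` seconds
def perform_with_timeout(ruby_remote_code, timeout: 5)
  cmd_id = SecureRandom.uuid
  mutex, cond = Accu.instance.register(cmd_id)

  Karafka.producer.produce_sync(
    topic: 'commands',
    payload: ruby_remote_code,
    key: cmd_id
  )

  # ConditionVariable#wait returns after the signal or once the timeout elapses
  mutex.synchronize { cond.wait(mutex, timeout) }

  # A nil result means no response arrived within the timeout window
  Accu.instance.result(cmd_id) || raise("RPC timeout for command #{cmd_id}")
end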

Performance Considerations

The performance characteristics of this RPC implementation are surprisingly good, but they come with important caveats and considerations that need to be understood for proper usage.

Local Testing Results

In our local testing environment, the implementation showed impressive numbers.

A single roundtrip can be completed in as little as 3ms, and the cost stays consistent even when executing 100 sequential commands:

require 'benchmark'

Benchmark.measure do
  100.times { Client.perform('1 + 1') }
end
#=> 0.035734   0.011570   0.047304 (  0.316631)

That works out to roughly 3ms per call (about 0.32 seconds of wall-clock time for 100 sequential requests). However, it's crucial to understand that these numbers represent ideal conditions:

  • Local Kafka cluster
  • Minimal command processing time
  • No network latency
  • No concurrent load

Summary

While Kafka wasn't designed for RPC patterns, this implementation demonstrates that with careful consideration and proper use of Karafka's features, we can build reliable request-response patterns on top of it. The approach shines particularly in environments where Kafka is already a central infrastructure, allowing messaging architecture to be extended without introducing additional technologies.

However, this isn't a silver bullet solution. Success with this pattern requires careful attention to timeouts, error handling, and monitoring. It works best when Kafka is already part of your stack, and your use case can tolerate slightly higher latencies than traditional RPC solutions.

This fascinating pattern challenges our preconceptions about messaging systems and RPC. It demonstrates that understanding your tools deeply often reveals capabilities beyond their primary use cases. While unsuitable for every situation, it provides a pragmatic alternative when adding new infrastructure components isn't desirable.
