Simulating the benefits of Kafka with Redis
Apache Kafka has become a mainstream component in most technology stacks. Its benefits range from causal ordering of events while maintaining parallelism, to resilience to failure through partition replication across servers, to raw speed, among many others.
However, running Kafka comes with its own set of challenges. While a lot of engineering teams would love to add Kafka to their stack and earn a seat at the table with the “real” engineers, the operational overhead poses a strong barrier to entry. One may argue that managed services make a lot of this achievable, but it also sucks to be poor.
In this post, we focus on what it takes to build a system that looks like a traditionally monolithic application but is a loosely coupled, event-driven system. For this we rely on learnings from concepts such as Domain Driven Design, Event Sourcing and consistent hashing.
Most systems care about the ordering of events. The ordering in most systems is limited to the domain in consideration. For example, when we look at posts to a thread, we care about ordering within that thread. When we look at financial systems, the ordering is largely limited to the account. Global ordering of events in large systems is rarely useful but can be relevant.
In the thread scenario above, posts are added to a thread. Assume we have a fair bit of post-processing to do on each added post, which in turn updates certain attributes of the thread. That makes it a reasonably good scenario to illustrate the use of partitioning.
In such a scenario the default approach is to pump all posts to a queue and have a bunch of workers (or consumers) do the work. This offers us the parallelism that our system needs but the minute we deal with multiple consumers the ordering is lost. The only way we retain ordering is by ensuring we process tasks one at a time to reflect the true sequence in which things happened on that thread. The obvious next thought is to deal with the same problem with a dedicated queue per thread but that immediately feels like overkill if we know we’d be generating a whole lot of threads.
Partitioning simply breaks our queuing system down into a fixed set of dedicated queues, or partitions. So, if we start with a naïve estimate that 8 workers would be able to process 1600 events per minute (200 per worker), we start our design with 8 partitions. You may need a little more work to ensure your estimates are good, but for this example we'll assume they are. We also allocate a single worker to each partition, since we want each partition to always maintain causal ordering.
Now we need to ensure that the posts for a specific thread are all routed to the same partition. Each partition is managed by a single consumer, so our ordering isn’t disrupted.
It’s important to remember that a “queue partition” or the “dedicated partition” is an abstract construct. It’s really just a queue. We use the term partition because it makes it easy to align with the terms used widely in the domain.
We’ll use consistent hashing as a means to route all posts that belong to a specific thread to the same queue partition (or queue). There is a wide range of posts explaining consistent hashing, so we won’t dive into how it works. In our examples, we’ll use MurmurHash and a continuum managed by a library called uHashRing. Technically, you can manage the continuum yourself, but we’ll touch upon this a little later.
Viewing our queues as a continuum
Now if we place all 8 of our queues in a circle, numbered 0 to 7, queue 7 is followed by queue 0, the first one. Let’s call this circle a continuum.
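To make the wrap-around concrete, here is a minimal sketch. The circle size of 256 and the evenly spaced queue positions are assumptions chosen for illustration, not something a real hash ring guarantees; `queue_for_position` is a hypothetical helper name.

```python
from bisect import bisect_left

# Assumed layout: 8 queues spaced evenly around a circle of 256 positions.
RING_SIZE = 256
QUEUE_POSITIONS = [i * (RING_SIZE // 8) for i in range(8)]  # 0, 32, ..., 224


def queue_for_position(position: int) -> int:
    """Return the index of the first queue at or after `position`, walking clockwise."""
    index = bisect_left(QUEUE_POSITIONS, position % RING_SIZE)
    # Past the last queue (queue 7 at position 224) we wrap around to queue 0.
    return index % len(QUEUE_POSITIONS)
```

A point that lands after queue 7, say position 250, has nowhere further to go on the circle, so it is handed to queue 0.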
Now consistent hashing allows us to map a given Task/Job to a specific queue using the threadId. So, we’re using the threadId as the partitioning key in this case. The important aspect to notice here is that we’re not calling our queues PostProcessing queues. They’re not dedicated queues. You can throw a Transaction event into one of them and expect the corresponding consumers (and event handlers) to handle it. I know we’ve not touched upon the event handlers yet but stay with me on this.
We’ve spoken a lot about events in the last couple of paragraphs, but we’ve not really defined what events mean. Our systems will look at events as facts that happen across our systems. Facts are usually things that have altered the state of the system in some way (or failed to do so). For example, a PostCreatedEvent happens when a new post is created. Similarly, a PostUpdatedEvent occurs when the post is updated. You can map an event to most of the CRUD operations in your system. If you design your system as domains, you’ll be surprised by the number of events that are triggered by an application service.
An event also maps the surrounding state of the system. Let’s consider an application service that creates a post.
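A minimal sketch of such a service might look like the following. Only `trigger_events`, `SystemEventsService.trigger`, the `params` attribute and the partition key come from this post; the class name `CreatePostService`, the field names and the in-memory `published` list are assumptions made so the example runs on its own.

```python
import uuid
from datetime import datetime, timezone


class SystemEventsService:
    """Stand-in for the publishing service; here it only records what it is given."""

    published = []

    @classmethod
    def trigger(cls, event: dict, partition_key: str) -> None:
        cls.published.append((partition_key, event))


class CreatePostService:
    """Hypothetical application service that creates a post on a thread."""

    def __init__(self, thread_id: str, author_id: str, params: dict):
        self.thread_id = thread_id
        self.author_id = author_id
        self.params = params

    def call(self) -> dict:
        post = {
            "id": str(uuid.uuid4()),
            "thread_id": self.thread_id,
            "author_id": self.author_id,
            "body": self.params["body"],
        }
        # Saving the post to the database would happen here.
        self.trigger_events(post)
        return post

    def trigger_events(self, post: dict) -> None:
        # Decide which fact to publish and gather the surrounding context.
        event = {
            "name": "PostCreatedEvent",
            "thread_id": post["thread_id"],
            "post_id": post["id"],
            "author_id": post["author_id"],
            "params": dict(self.params),  # request params captured as context
            "created_at": datetime.now(timezone.utc).isoformat(),
        }
        # The thread id is the partition key, so every post on a thread
        # ends up on the same partition.
        SystemEventsService.trigger(event, partition_key=post["thread_id"])
```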
In this example, the trigger_events method determines the fact to publish. It gathers the surrounding context in this case. This could also potentially include the request params (like the params attribute passed into the event). However, what correctly captures the context depends on the context :).
So, our final event could look something like this. Notice the event does not have an updated_at attribute because we consider events to be immutable facts. We cannot undo something that actually happened.
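One way to make that immutability explicit in Python is a frozen dataclass. This is a sketch under our own assumptions: the field names and sample values are illustrative, and the params are stored as a tuple of pairs purely to keep the whole object read-only.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PostCreatedEvent:
    event_id: str
    thread_id: str   # doubles as the partition key
    post_id: str
    author_id: str
    params: tuple    # request params as (key, value) pairs, kept immutable
    created_at: str  # there is deliberately no updated_at
    name: str = "PostCreatedEvent"


event = PostCreatedEvent(
    event_id="evt-1",
    thread_id="thread-42",
    post_id="post-9",
    author_id="user-7",
    params=(("body", "hello"),),
    created_at=datetime.now(timezone.utc).isoformat(),
)
```

Assigning to any field of `event` after construction raises an error, which matches the idea that a recorded fact is never edited. A correction would be published as a new event instead.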
The application service in our example dumped the event over to a service called SystemEventsService by calling the trigger method on it. The method does a bunch of things before it actually publishes that event for us. It runs through the continuum we saw earlier to identify the queue (and the respective worker) based on the partition key we passed it. That’s pretty much what we need consistent hashing for. This ensures our events are always handled by the same partition (and worker).
So, once we’ve identified the worker for our task, we ask the worker to:
Persist the event in case we need to come back to it at a later time.
Publish it to all the relevant workers.
Let our event-driven workloads that subscribe to the task trigger their workflows.
Our Continuum definition
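As a rough sketch of what a continuum over 8 queues can look like, here is a hand-rolled ring that runs on the standard library alone. It swaps SHA-256 in for MurmurHash and a small `Continuum` class in for uHashRing, so treat the queue names, the replica count of 64 and the class itself as assumptions rather than the library's behaviour.

```python
import hashlib
from bisect import bisect_right

QUEUES = [f"queue-{i}" for i in range(8)]  # hypothetical queue names


def _hash(value: str) -> int:
    # Stand-in for MurmurHash: any stable, well-distributed hash works here.
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class Continuum:
    def __init__(self, nodes, replicas: int = 64):
        # Each queue gets `replicas` points on the circle to even out the spread.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._keys = [point for point, _ in self._points]

    def get_node(self, partition_key: str) -> str:
        # Take the first point clockwise from the key's hash, wrapping to the start.
        index = bisect_right(self._keys, _hash(partition_key)) % len(self._keys)
        return self._points[index][1]


continuum = Continuum(QUEUES)
```

With uHashRing installed, `HashRing(nodes=QUEUES)` and its `get_node(key)` method play the same role. Either way, the property that matters is that the same thread id always resolves to the same queue.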
Dispatching to the right partition
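The dispatch step can be sketched as below. To keep it runnable without a Redis server, an in-memory `BROKER` of deques stands in for Redis lists (with redis-py, the append would be a push onto the list stored at the queue's key). The compact ring, the queue names and the use of SHA-256 in place of MurmurHash are all assumptions for illustration.

```python
import hashlib
import json
from bisect import bisect_right
from collections import defaultdict, deque

QUEUES = [f"queue-{i}" for i in range(8)]  # hypothetical queue names


def _hash(value: str) -> int:
    # Stand-in for MurmurHash.
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


# Compact continuum: 64 points per queue, sorted around the circle.
_POINTS = sorted((_hash(f"{q}#{i}"), q) for q in QUEUES for i in range(64))
_KEYS = [point for point, _ in _POINTS]


def queue_for(partition_key: str) -> str:
    return _POINTS[bisect_right(_KEYS, _hash(partition_key)) % len(_KEYS)][1]


# In-memory stand-in for Redis lists, one deque per queue name.
BROKER = defaultdict(deque)


class SystemEventsService:
    @staticmethod
    def trigger(event: dict, partition_key: str) -> str:
        queue = queue_for(partition_key)
        # Appending to the tail keeps events in arrival order (FIFO).
        BROKER[queue].append(json.dumps(event))
        return queue
```

Because the queue is derived only from the partition key, two events for the same thread land on the same queue in the order they were triggered.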
Didn’t you say event-driven system?
This brings us to the best part. The whole system now lets you run your application as a series of asynchronous event handlers that can be invoked on a specific event. When an event arrives at the right partition the worker dispatches the event to a series of event handlers.
The partition worker persists the event and dispatches it to the EventHandler. Event handlers are a series of independently deployable functions that can do whatever you want them to.
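Here is a minimal sketch of a partition worker. The `PartitionWorker` class, the deque standing in for a Redis list and the plain list standing in for the event store are our assumptions; a real worker would block on its Redis queue and write to durable storage.

```python
import json
from collections import deque
from typing import Callable


class PartitionWorker:
    """Consumes one partition, one event at a time, so arrival order is preserved."""

    def __init__(self, queue: deque, event_store: list, dispatch: Callable[[dict], None]):
        self.queue = queue              # stand-in for this partition's Redis list
        self.event_store = event_store  # stand-in for durable event storage
        self.dispatch = dispatch        # hands the event to the event handlers

    def run_once(self) -> bool:
        if not self.queue:
            return False
        event = json.loads(self.queue.popleft())
        self.event_store.append(event)  # persist first, so the fact is never lost
        self.dispatch(event)            # then let the handlers react to it
        return True

    def drain(self) -> int:
        handled = 0
        while self.run_once():
            handled += 1
        return handled
```

Since there is exactly one worker per partition and it finishes one event before taking the next, handlers see a thread's events in the order they were published.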
And our processor could handle them in any way you like. Here we handle them in sequence, but they could just as well be dispatched in parallel.
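A sequential processor can be sketched like this. The `EventProcessor` name is hypothetical, and catching exceptions so that one failing handler does not block the rest is a design choice we are assuming here, not a requirement of the approach.

```python
import logging

logger = logging.getLogger("event_processor")


class EventProcessor:
    def __init__(self, handlers):
        self.handlers = list(handlers)

    def process(self, event: dict) -> list:
        """Run every handler in registration order; return the handlers that failed."""
        failed = []
        for handler in self.handlers:
            try:
                handler(event)
            except Exception:
                # Log and move on so the remaining handlers still get the event.
                logger.exception("handler failed for %s", event.get("name"))
                failed.append(handler)
        return failed
```

Swapping the loop for a thread pool would dispatch the handlers in parallel. That is safe only when the handlers for a single event are independent of each other.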
The handler itself is a fairly naive class that checks for the relevant event.
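A handler along those lines might look like the following sketch. The base class, the `handles` attribute and the `UpdateThreadStats` example are illustrative names of our own; the in-memory counter stands in for whatever the handler really updates.

```python
class EventHandler:
    """Base class: reacts only to the event names listed in `handles`."""

    handles = ()

    def __call__(self, event: dict):
        if event.get("name") not in self.handles:
            return None  # not an event this handler cares about
        return self.handle(event)

    def handle(self, event: dict):
        raise NotImplementedError


class UpdateThreadStats(EventHandler):
    """Hypothetical handler that keeps a post count per thread."""

    handles = ("PostCreatedEvent",)

    def __init__(self):
        self.post_counts = {}

    def handle(self, event: dict) -> int:
        thread_id = event["thread_id"]
        self.post_counts[thread_id] = self.post_counts.get(thread_id, 0) + 1
        return self.post_counts[thread_id]
```

Because every handler ignores events it does not list, the same partition can carry unrelated event types without the handlers needing to know about each other.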
If you found this useful or have questions around the implementation details, do write to us!