Discover an Apache Kafka’s game-changing feature: log compaction
Updated: Aug 21, 2019
The fact that Apache KafkaⓇ is not a traditional message system can seem quite obvious. It is well known to be a scalable event-stream processor which transports high-volume of data in motion from producers to consumers in real-time. However Kafka has another particularity which makes it considerably different from traditional message system: it can store data.
Kafka can be used as a storage layer by storing data in regular topics for a configurable period of time (retention-based) and in compacted topic for longer-term or indefinite-time storage thanks to a mechanism called log compaction.
From an architecture point of view, Kafka is actually “much closer in architecture to a distributed file system or database then to traditional message queue” (Kreps, J. (2017). It's Okay to store Data in Apache Kafka).
Here is how he justifies this architectural patterns:
“One way to think about the relationship between messaging systems, storage systems, and Kafka is the following. Messaging systems are all about propagating future messages: when you connect to one you are waiting for new messages to arrive. Storage systems such as a filesystem or a database are all about storing past writes: when you query or read from them you get results based on the updates you did in the past. The essence of stream processing is being able to combine these two—being able to process from the past, and continue processing into the future as the future occurs. This is why Kafka’s core abstraction is a continuous time-ordered log. The key to this abstraction is that it is a kind of structured “file” that doesn’t end when you come to the last byte, but rather, at least logically, goes on forever. A program written to work with a log therefore doesn’t need to differentiate between data that already occurred and data that is going to happen in the future, it all appears as one continuous stream. This combination of storing the past and propagating the future behind a single uniform protocol and API is exactly what makes Kafka work well for stream processing”
Kreps, J. (2017). It's Okay to store Data in Apache Kafka.
Although there is often analogy made between Kafka and a database, Kafka is not a database, it is a “commit log offering no broad query functionality” (Stopford, B. (2018). Designing Event-Driven Systems. 1st ed. O'reilly, p.25.). Kafka’s priority is indeed first the real-time processing, then long-term storage.
So how can you store long-term data on Kafka and why is it useful?
Long compaction is basically a feature that can be “switched on or off” while writing the event stream within a framework such as Kafka Streams or Apache Flink.
Once “on” it will for a single topic partition retain (at least) the last known value for each message key within the log of data. In other words it stores a "snapshot" of the latest records for an indefinite time.
“The great thing about log compaction is that it blurs the distinction between the initial snapshot of the database and the ongoing change stream”.
Kleppmann, M. (2016). Making sense of stream processing. 1st ed. O'reilly, p.91.
This article doesn’t aim at explaining how log compaction works technically but to dives into the benefits of log compaction and how we experienced its value. If you are interested in getting more details on its functioning, please consult this thorough documentation from ASF.
There are of course many situations in which log compaction can be very valuable. An obvious one is that it allows downstream consumers to restore their state after that a system fails or an application crashes. In the same way it can reload caches during operational maintenance once an application restarts.
Another common use of this feature is when dealing with stateful event processing. This might sound less straightforward. However it reveals to be extremely valuable to better manage the performance of the overall systems. Let’s dig into it….
Stateful event stream processing
Compacted logs mean smaller datasets which are easier to move from one system to another one. This is essential when it comes to stateful stream processing.
Stateful stream processing refers to “an application design pattern for processing unbounded streams of events and is applicable to many different use cases in the IT infrastructure of a company” (Hueske, F. and Kalavri, V. (2019). Stream processing with Apache Flink. O'reilly.).
In simpler words it consists in transforming multiple streams into one piece of information, the state of data. Let’s take a concrete example.
Imagine a stream application that computes the click behaviour of visitors on a given e-commerce website. Each time a person clicks on a page, a record appears in real-time in the commit log and says e.g. “John has clicked on page 1”. An important KPI for this e-commerce company is the number of pageviews per person. It is calculated via a “count” function that summarizes these events into one state using stateful stream processing (ex. “John has clicked 4 times on this page” or “this page has been clicked 750 times”)
So where would log compaction be useful in this context ? Well let's get back to our example.
The number of pageviews per person KPI is stored within the database of a CRM system. Imagine now that several persons click all at once on different pages and that each person visits different pages within a very small interval between each visit. If the CRM system is not scalable enough it may easily meet difficulties in handling the load and … crash. To avoid this scenario, Kafka Streams (or another event-stream processing framework) would make sure that the count function would be applied on events while being processed by Kafka brokers and stored directly in a compacted log. The external CRM database would receive the latest computed value as stored in the compacted log, without the whole history of events.
To sum up: on top of Kafka’s inherent quality such as parallelism management ( thanks to its specific partitioning system) and fault-tolerance, log compaction will ensure scalability and a more performant and efficient load of data to the consumers in a context of high-volume traffic generated by event-stream processing.
We hope we triggered your interest for this less well-known but yet extremely valuable aspect of Kafka. Don’t hesitate to share your own experience with Log compaction in the comments below.
Thanks for reading!