What is Kafka in big data?


This article presents some general notions about Kafka and the role it plays in the Big Data ecosystem.

One of Big Data’s biggest challenges is data analysis. But to meet this challenge, we first have to focus on how the data is collected. This is where Apache Kafka comes to our aid.

Kafka stands out as a high-quality open-source project that has attracted a lot of attention and numerous contributors.
Apache Kafka is a publish-subscribe messaging system built in the form of a distributed commit log.

Kafka uses the following terminology:

1. Kafka keeps streams of messages in categories called topics.

2. Processes that publish messages are called producers.

3. Processes that subscribe to topics and process the published messages are called consumers.

4. Kafka runs as a cluster of servers called brokers.
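
The four terms above can be illustrated with a toy in-memory model. This is a sketch for intuition only: the `Broker` class and its `publish` and `fetch` methods are hypothetical names invented for this example and are not the Kafka client API, and a real cluster would spread topics across several brokers rather than one.

```python
from collections import defaultdict


class Broker:
    """Toy stand-in for a Kafka broker (hypothetical, not the Kafka API)."""

    def __init__(self):
        # topic name -> ordered list of messages
        self.topics = defaultdict(list)

    def publish(self, topic, message):
        """Called by a producer: append a message and return its position."""
        self.topics[topic].append(message)
        return len(self.topics[topic]) - 1

    def fetch(self, topic, position):
        """Called by a consumer: return the messages at or after a position."""
        return self.topics.get(topic, [])[position:]


broker = Broker()

# A producer publishes messages to the "page-views" topic.
for page in ["/home", "/jobs", "/profile"]:
    broker.publish("page-views", page)

# Two independent consumers read the same topic from different positions.
analytics_view = broker.fetch("page-views", 0)
audit_view = broker.fetch("page-views", 1)

print(analytics_view)
print(audit_view)
```

Both consumers read the same stored messages; neither one removes anything from the topic, which is the property the rest of the article builds on.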

Because enterprise applications are processing more and more data, the performance of messaging systems is becoming increasingly crucial for the smooth running of applications, requiring fast and scalable platforms.

Apache Kafka is a relatively new messaging system that is one of the best-performing solutions at the moment: it can transfer up to a million messages per second on a cluster of three medium-capacity machines.

Until 2014 we were talking about Hadoop, then Spark appeared, and now Kafka completes the triad. Together they form the three primary pillars of data flow and data analysis in modern architectures.

Kafka was originally developed at LinkedIn, at a time when LinkedIn was migrating from a monolithic database to a service-based architecture in which each service had its own data storage model.

One of the issues that emerged during the migration was the real-time distribution of access logs from the web servers to the user activity analysis service. LinkedIn engineers needed a platform that could transfer large amounts of data to multiple services in as short a time as possible. Existing platforms proved inefficient for their data volume, so they developed their own messaging system under the name Kafka. The project was later released as open source and donated to the Apache Software Foundation. After its release, Kafka was adopted by several companies with similar messaging needs.

The main goal in Kafka’s design was to maximize message transfer speed. To achieve the highest speeds, Kafka rethinks the publish-subscribe model and gives up some of the features offered by classic messaging platforms.

One of the most significant changes is the retention of published messages. Producers publish messages to Kafka, where they become available for consumers to process, but consumers do not need to acknowledge that a message has been processed. Instead, Kafka retains all the messages it receives for a fixed period, and consumers are free to consume any retained message. Although it seems inefficient at first glance, this pattern brings several advantages:

1. It simplifies the architecture of the system.

2. Kafka does not have to track which messages have been consumed and which have not. This also isolates producers from consumers.

3. Consumers do not have to run permanently and can be stopped at any time without affecting the messaging system. They can even be batch jobs executed periodically. Of course, this approach works only if messages are retained in Kafka long enough to process the data accumulated between runs.

4. Because messages do not have to be selectively retained, Kafka can use a simple and efficient storage model: the message log. The log is just a list of messages in which new messages are always appended to the end. Existing messages never change. Since the log is only modified at the end, it can be stored efficiently on magnetic storage media: writes to the hard disk are sequential, which avoids moving the disk head, a costly operation in terms of performance. Also, if consumers manage to keep up with producers, Kafka can serve messages directly from memory, relying on the caching mechanisms provided by the operating system.
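
The retention model described above can be sketched as an append-only log in which each consumer keeps its own position. This is a minimal illustration under assumptions: `MessageLog`, `append`, `expire`, and `read_from` are hypothetical names, timestamps are passed in explicitly to keep the example deterministic, and none of it mirrors Kafka's actual storage code.

```python
class MessageLog:
    """Append-only log with time-based retention (toy model, not Kafka's code)."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.entries = []      # (timestamp, message) pairs, oldest first
        self.base_offset = 0   # offset of the oldest entry still retained

    def append(self, message, now):
        """Add a message at the end of the log and return its offset."""
        self.entries.append((now, message))
        return self.base_offset + len(self.entries) - 1

    def expire(self, now):
        """Drop messages older than the retention period, consumed or not."""
        while self.entries and now - self.entries[0][0] > self.retention_seconds:
            self.entries.pop(0)
            self.base_offset += 1

    def read_from(self, offset):
        """Return (next_offset, messages) from offset onward, skipping expired ones."""
        start = max(offset, self.base_offset) - self.base_offset
        messages = [message for _, message in self.entries[start:]]
        return self.base_offset + len(self.entries), messages


log = MessageLog(retention_seconds=60)
for timestamp, message in enumerate(["a", "b", "c"]):
    log.append(message, now=timestamp)

# The consumer, not the log, remembers how far it has read.
batch_offset = 0
batch_offset, first_run = log.read_from(batch_offset)

# Time passes between two runs of a periodic batch job.
log.append("d", now=30)
log.expire(now=62)  # "a" and "b" are now older than 60 seconds
batch_offset, second_run = log.read_from(batch_offset)

# A consumer that starts late sees only what is still retained.
_, late_run = log.read_from(0)

print(first_run, second_run, late_run)
```

The log never records who has read what: expiry depends only on message age, so a batch consumer misses data only if it runs less often than the retention period allows.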

Kafka is a good solution for applications that require high transfer speed and low latency when delivering messages. Its simple architecture and flexible grouping of consumers make it suitable for a variety of applications: collecting logs and performance metrics, data stream processing, and event processing.

Like any technology, Kafka also has its limitations. The fact that messages cannot be processed individually may be a problem for some types of applications. Another issue can be the lack of utilities and management tools. Unlike other platforms such as ActiveMQ or RabbitMQ, Kafka has a poorly developed ecosystem. The need to run a ZooKeeper cluster alongside the Kafka brokers may also be a financial or administrative impediment. However, we hope that some of these limitations will disappear as the technology spreads and matures.
