Apache Kafka Terminologies

Jigar Rathod
4 min read · Mar 6, 2020

Part one

Kafka can be thought of as a giant buffer between producers and consumers. Just like YouTube, Kafka is a streaming platform. Your favorite artist publishes a new video on YouTube; if you have subscribed to their channel, you get a notification about it. In this case, your favorite artist is a producer and you are a consumer. In Kafka, a channel is analogous to a topic.

Let’s assume that you have written a producer that streams payments into a topic. You have a few use cases: process payments, give bonus points to customers who paid in full, and notify customers who have not paid in full. We can write a consumer for each individual use case. For now, we will focus on processing payments. It takes about 10 seconds to process one payment, so a single consumer can only process 6 payments every minute. During Black Friday, people start to purchase more items, and we see about 10 payments every minute. What should we do?
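The arithmetic above can be checked with a quick sketch (the 10-second processing time and the per-minute loads are the article's numbers; the function name is mine):

```python
import math

def consumers_needed(seconds_per_payment: int, payments_per_minute: int) -> int:
    # One consumer handles 60 / seconds_per_payment payments per minute.
    capacity = 60 // seconds_per_payment          # 6 payments/min at 10 s each
    return math.ceil(payments_per_minute / capacity)

print(consumers_needed(10, 6))    # 1 — baseline load: a single consumer suffices
print(consumers_needed(10, 10))   # 2 — Black Friday load needs a second consumer
```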

We can start another consumer process to process payments. But now we notice another problem: consumer-1 and consumer-2 both read the same message from the topic and both processed the payment :(. To resolve this issue, topics are divided into partitions. Consumer-1 and consumer-2 read data from different partitions, and the producer can choose which partition a message should go to. By creating a topic with 2 partitions, we were able to process payments faster. Consumer-1 and consumer-2 are logically similar; the only difference is that each consumes data from a different partition. Hence, consumer-1 and consumer-2 form a consumer group.
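How a producer chooses which partition a message should go to can be sketched with the common key-hash rule (a simplified stand-in for Kafka's real default partitioner, which uses murmur2 hashing):

```python
from zlib import crc32

def choose_partition(key: bytes, num_partitions: int) -> int:
    # The same key always lands in the same partition, so all payments
    # from one customer stay ordered within a single partition.
    return crc32(key) % num_partitions

p1 = choose_partition(b"customer-42", 2)
p2 = choose_partition(b"customer-42", 2)
assert p1 == p2  # deterministic routing
```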

So far, we have covered five terms: producer, consumer, topic, partition & consumer group.

Part two

Assume that consumer-2 ran into some kind of error. Consumers send periodic heartbeat signals to the cluster. As soon as consumer-2 is identified as offline, consumer-1 needs to take over from where consumer-2 left off. This is called a rebalance. By committing offsets, a consumer tells Kafka precisely how many messages it has consumed so far, so whoever takes over knows where to resume.
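A toy simulation of that failover, assuming a coordinator marks a consumer dead once its last heartbeat is older than a session timeout (the names and the timeout value are illustrative, not Kafka's actual protocol code):

```python
def rebalance(assignment, last_heartbeat, now, session_timeout=10.0):
    """Reassign partitions owned by consumers whose heartbeat has expired."""
    alive = {c for c, t in last_heartbeat.items() if now - t <= session_timeout}
    survivors = sorted(alive)
    new_assignment = {c: [] for c in survivors}
    for partition in sorted(p for parts in assignment.values() for p in parts):
        owner = survivors[partition % len(survivors)]  # simple round-robin
        new_assignment[owner].append(partition)
    return new_assignment

assignment = {"consumer-1": [0], "consumer-2": [1]}
heartbeats = {"consumer-1": 100.0, "consumer-2": 85.0}  # consumer-2 went silent
print(rebalance(assignment, heartbeats, now=100.0))
# {'consumer-1': [0, 1]} — consumer-1 takes over partition 1 as well
```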

Let’s assume that the producer somehow created an erroneous message. When a consumer fails to process this message, it can put it in a dead-letter queue. Using a dead-letter queue, consumers filter out erroneous messages, which can then be handled manually.
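A dead-letter queue can be sketched as a consumer loop that diverts messages it cannot process (the failure condition in `process` is made up for illustration):

```python
def consume(messages, process):
    dead_letters = []
    for msg in messages:
        try:
            process(msg)
        except Exception:
            # Park the bad message for manual inspection instead of
            # crashing the consumer or blocking the partition.
            dead_letters.append(msg)
    return dead_letters

def process(payment):
    if payment["amount"] < 0:          # pretend negative amounts are erroneous
        raise ValueError("bad payment")

dlq = consume([{"amount": 10}, {"amount": -5}], process)
print(dlq)  # [{'amount': -5}]
```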

In an x-node Kafka cluster, one of the nodes becomes the controller node. When a user creates a topic with 3 partitions, the controller node has to decide which partition will go where. It employs a round-robin algorithm to do so. So in a three-node cluster, partition-0 will be assigned to node-0, partition-1 to node-1, and partition-2 to node-2. In other words, partition-0 is led by node-0, partition-1 is led by node-1, and so on. When a producer wants to push data into partition-0, it fetches the cluster metadata and finds out that partition-0 is led by node-0. The producer caches this information and communicates with node-0 to push the relevant message.
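The controller's round-robin placement described above reduces to a one-line rule (a sketch of the idea, not the controller's actual code):

```python
def assign_leaders(num_partitions: int, num_nodes: int) -> dict:
    # partition-i is led by node-(i mod number_of_nodes)
    return {f"partition-{p}": f"node-{p % num_nodes}" for p in range(num_partitions)}

print(assign_leaders(3, 3))
# {'partition-0': 'node-0', 'partition-1': 'node-1', 'partition-2': 'node-2'}
```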

In the above example, if node-0 is unavailable for any reason, then all the information associated with partition-0 won’t be available. This is a big problem. To withstand this scenario, one can choose to replicate partition-0 to another host, e.g. node-1. Partition-0 on node-0 is the original partition (the leader replica), whereas partition-0 on node-1 is a copy of the original (a follower replica). Thus, by having 2 replicas of partition-0, we can withstand a node-0 failure.

If a user creates a topic with 3 partitions and 2 replicas, then each partition’s leader lives on one node and its follower on another.

If node-0 and node-1 both become unavailable, then we won’t have any information about partition-0. This can be prevented by replicating partition-0 on node-2 as well. With a topic of 3 partitions and 3 replicas, every node holds a copy of every partition.
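One common placement scheme puts each follower on the node after the leader, wrapping around the cluster (a sketch of the idea; Kafka's actual assignment also adds a randomized starting offset):

```python
def assign_replicas(num_partitions, num_nodes, replication_factor):
    layout = {}
    for p in range(num_partitions):
        # Leader on node (p mod n); followers on the next nodes, wrapping around.
        replicas = [(p + r) % num_nodes for r in range(replication_factor)]
        layout[f"partition-{p}"] = [f"node-{n}" for n in replicas]  # first = leader
    return layout

# 3 partitions, 3 replicas on a 3-node cluster: every node holds every partition.
print(assign_replicas(3, 3, 3))
```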

Producers feed data to the leader replica; it is the follower replicas’ duty to copy the leader. Depending on the physical placement of nodes and the network connectivity between them, follower replicas are expected to lag behind by some amount of time. In Kafka, one can define the acceptable lag time (replica.lag.time.max.ms). If a follower lags behind by more than replica.lag.time.max.ms, it is out of sync; otherwise the follower is in sync. In-sync followers are also referred to as ISRs (in-sync replicas).
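The in-sync test reduces to comparing each follower's lag against replica.lag.time.max.ms (a sketch; the real broker tracks fetch timestamps rather than a precomputed lag number, and the 30-second default applies to recent Kafka versions):

```python
def in_sync_replicas(follower_lag_ms, max_lag_ms=30_000):
    # Followers lagging by more than replica.lag.time.max.ms are out of sync.
    return [f for f, lag in follower_lag_ms.items() if lag <= max_lag_ms]

lags = {"node-1": 1_200, "node-2": 45_000}   # node-2 fell too far behind
print(in_sync_replicas(lags))  # ['node-1']
```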

Some critical applications require at least one consistent backup of the leader replica. At topic creation, one can specify this requirement via min.insync.replicas. Producers can be configured with:

  • acks=0 — after sending a message, do not wait for any acknowledgement from Kafka brokers
  • acks=1 — after feeding data to the leader replica, wait for an acknowledgement from it
  • acks=all — after feeding data to the leader replica, wait until acknowledgements are received from all ISRs

If a user has specified min.insync.replicas=2 at topic creation, then a producer with the acks=all setting will wait for an acknowledgement from:

  • the leader replica &
  • at least one follower replica
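How many acknowledgements a producer effectively waits for under each setting can be captured in a small sketch (illustrative only; the leader counts toward min.insync.replicas, and the default of 2 here matches the example above):

```python
def acks_to_wait_for(acks, min_insync_replicas=2):
    # acks=0: fire and forget; acks=1: leader only;
    # acks="all": leader plus enough in-sync followers to satisfy
    # min.insync.replicas (the leader counts toward that total).
    if acks == 0:
        return 0
    if acks == 1:
        return 1
    return min_insync_replicas  # acks="all"

print(acks_to_wait_for("all"))  # 2 — the leader plus one follower
```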

How long a producer should wait can be fine-tuned as needed (e.g. via request.timeout.ms).

So far we have covered rebalance, dead-letter queue, controller, leader replica, follower replica, ISR and the various acks settings for producers.


Jigar Rathod

DevOps Engineer and a part time investor | feel free to reach out to me | LinkedIn — https://www.linkedin.com/in/jigarrathod/