In this section, you will learn Apache Kafka terminology such as topics, partitions, and offsets.
A Kafka topic is a particular stream of data. It is identified by a name, which you choose yourself. Producers publish records to a topic, and consumers read records from the topics they subscribe to.
A topic is loosely comparable to a table in a database: just as a database can hold many tables, Kafka can hold as many topics as you need.
Topics are split into partitions, and each partition is an ordered sequence of records.
So within a topic there are partitions, and within each partition there are records. When you create a topic, you specify how many partitions it has.
Each message within a partition gets an incremental id, called an offset.
The order of offsets is guaranteed only within a partition, not across partitions.
Once data is written to a partition, it can never be changed: data within a partition is immutable.
Data within a partition is kept only for a limited time (the retention period).
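The properties above (per-partition offsets, append-only writes, immutable records) can be sketched with a small in-memory model. This is only an illustration, not how a broker stores data: the names Record and Partition and the sample messages are hypothetical, and retention is left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """One message in a partition; frozen, so it cannot be modified."""
    offset: int
    value: str


@dataclass
class Partition:
    """Toy append-only log (hypothetical, retention is not modelled)."""
    records: list = field(default_factory=list)
    next_offset: int = 0

    def append(self, value: str) -> int:
        # Every new record takes the next offset of this partition only.
        offset = self.next_offset
        self.records.append(Record(offset, value))
        self.next_offset += 1
        return offset

    def read(self, offset: int) -> Record:
        return self.records[offset]


# A hypothetical topic with two partitions, each with its own offsets.
topic = {0: Partition(), 1: Partition()}
first = topic[0].append("order-created")
second = topic[0].append("order-paid")
other = topic[1].append("user-signed-up")
print(first, second, other)  # 0 1 0
```

Offsets restart at 0 in each partition, which is why an offset only identifies a message together with its partition.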
The diagram below shows how data is allocated within the partitions.
Once the Kafka topic is created with the number of partitions you specified, the first message written to partition 0 gets offset 0, the next message gets offset 1, and so on.
As you can see, every message in partition 0 has an incremental id, its offset. Offsets are unbounded: they keep increasing for as long as messages are appended.
Similarly, partition 1 has offsets from 0 to 8 and partition 2 has offsets from 0 to 10.
It is not necessary that all partitions have the same number of messages.
Unless a key is given, data is assigned to a partition without regard to its content. When a key is given, all messages with the same key go to the same partition.
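As a rough sketch of how a key influences partition choice, the function below sends keyed records to a partition derived from a hash of the key and spreads unkeyed records over the partitions in turn. Everything here is an assumption for illustration: choose_partition is a hypothetical name, CRC32 is a stand-in rather than the hash the Kafka producer uses, and round-robin is only one simple way of spreading unkeyed data.

```python
import itertools
import zlib

NUM_PARTITIONS = 3  # hypothetical partition count
_round_robin = itertools.count()


def choose_partition(key, num_partitions=NUM_PARTITIONS):
    """Pick a partition for a record (illustrative only)."""
    if key is None:
        # No key: spread records over the partitions in turn.
        return next(_round_robin) % num_partitions
    # Key present: a deterministic hash keeps one key on one partition.
    # CRC32 is a stand-in, not the hash Kafka's producer uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keyed = [choose_partition("user-42") for _ in range(3)]
unkeyed = [choose_partition(None) for _ in range(3)]
print(keyed)    # the same partition three times
print(unkeyed)  # a different partition each time
```

Because ordering is only guaranteed within a partition, giving related messages the same key is what keeps them in order relative to each other.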
The servers on which topics are stored are called Kafka brokers.
A Kafka cluster is composed of multiple brokers. Once you have connected to any one broker (called a bootstrap broker), you can connect to every other broker in the cluster.
Each Kafka broker is identified by an integer id.
Each Kafka broker holds only some of the topic partitions: the partitions of a topic are spread across different brokers.
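To make the idea of spreading partitions over brokers concrete, here is a minimal sketch that hands out partitions to brokers in round-robin order. The broker ids, the topic names and partition counts, and the spread_partitions helper are all hypothetical, and a real cluster decides placement with its own logic.

```python
from collections import defaultdict

broker_ids = [1, 2, 3]                  # hypothetical integer broker ids
topics = {"Topic-A": 3, "Topic-B": 2}   # hypothetical: name -> partition count


def spread_partitions(topics, broker_ids):
    """Assign each partition to one broker in round-robin order."""
    layout = defaultdict(list)
    slot = 0
    for topic, partition_count in topics.items():
        for partition in range(partition_count):
            broker = broker_ids[slot % len(broker_ids)]
            layout[broker].append((topic, partition))
            slot += 1
    return dict(layout)


layout = spread_partitions(topics, broker_ids)
for broker, partitions in sorted(layout.items()):
    print(broker, partitions)
```

No single broker ends up holding every partition of a topic, which is what lets the load of one topic be shared by several servers.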
Because Apache Kafka is distributed, you can define a replication factor to make it fault tolerant. Fault tolerance means that if a broker goes down, another broker takes over as leader for the affected partitions of the topic and continues to serve the data.
Topics should have a replication factor greater than 1, usually 2 or 3.
The diagram below illustrates this.
In this example, Topic-A has 2 partitions and a replication factor of 2.
As you can see, partition 0 of Topic-A is on Broker 1 and partition 1 of Topic-A is on Broker 2. The replica of partition 0 is on Broker 2.
Similarly, the replica of partition 1 is on Broker 3.
Therefore each partition has 2 copies on different Kafka Brokers.
Let’s see what happens if we lose Broker 2.
As you can see, Broker 1 and Broker 3 can still serve the data. Replication ensures that the data is not lost when a broker fails.
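The effect of losing a broker can be checked with a few lines of code. The sketch below assumes the placement that makes each partition live on two different brokers (partition 0 on Brokers 1 and 2, partition 1 on Brokers 2 and 3); the surviving_replicas helper is a hypothetical name used only for this illustration.

```python
# Assumed replica placement: (topic, partition) -> brokers holding a copy.
replicas = {
    ("Topic-A", 0): [1, 2],
    ("Topic-A", 1): [2, 3],
}


def surviving_replicas(replicas, failed_broker):
    """Return the brokers that still hold each partition after a failure."""
    return {
        topic_partition: [b for b in brokers if b != failed_broker]
        for topic_partition, brokers in replicas.items()
    }


after_failure = surviving_replicas(replicas, failed_broker=2)
print(after_failure)
# {('Topic-A', 0): [1], ('Topic-A', 1): [3]}

# Every partition still has at least one copy, so no data is lost.
print(all(after_failure.values()))  # True
```

With a replication factor of 2, the topic survives the loss of any single broker; losing a second broker could leave a partition without any copy.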
At any given time only one Kafka broker can be a leader for a given partition.
Producers write only to that leader, and by default consumers read from it as well.
The other brokers holding replicas only keep their copies in sync with the leader.
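The leader and follower roles can be modelled in a few lines. This is a simplified sketch under stated assumptions: ReplicatedPartition and its methods are hypothetical names, followers are synchronized immediately after every write, and the new leader is simply the first remaining replica, which is not how a real cluster elects leaders.

```python
class ReplicatedPartition:
    """Toy model of one partition with a leader and follower replicas."""

    def __init__(self, replica_brokers):
        # Each broker id maps to its local copy of the partition log.
        self.logs = {broker: [] for broker in replica_brokers}
        self.leader = replica_brokers[0]

    def produce(self, broker, message):
        # Only the current leader accepts writes for this partition.
        if broker != self.leader:
            raise RuntimeError(f"broker {broker} is not the leader")
        self.logs[self.leader].append(message)
        self._sync_followers()

    def _sync_followers(self):
        # Followers copy whatever the leader has that they are missing.
        leader_log = self.logs[self.leader]
        for broker, log in self.logs.items():
            if broker != self.leader:
                log.extend(leader_log[len(log):])

    def fail(self, broker):
        # Drop the broker; promote a remaining replica if it was the leader.
        del self.logs[broker]
        if broker == self.leader:
            self.leader = next(iter(self.logs), None)


partition = ReplicatedPartition([2, 3])  # leader on broker 2, follower on 3
partition.produce(2, "m0")
partition.produce(2, "m1")
partition.fail(2)
print(partition.leader, partition.logs[partition.leader])  # 3 ['m0', 'm1']
```

Because the follower already held a full copy, it can take over as leader and serve the same messages after the original leader is gone.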
You have learned about Apache Kafka topics, partitions, offsets, and brokers, as well as how replication works across brokers and what the leader of a partition is.