1. Overview

Kafka is a popular open-source distributed event-streaming platform. It decouples message producers from message consumers using the publish-subscribe pattern.

Kafka has many configuration parameters. Two of them, log.segment.bytes and log.retention.hours, control how Kafka stores messages in segments and how long it keeps them.

In this tutorial, we’ll discuss the difference between the log.segment.bytes and log.retention.hours parameters. The version of Kafka we use in the examples is 3.7.0.

2. Meaning of log.segment.bytes and log.retention.hours

Kafka keeps messages in partitions. A topic may have more than one partition. All messages with the same key go to the same partition. However, if a message has no key, Kafka's default partitioner chooses a partition for it (round-robin in older versions, sticky partitioning since Kafka 2.4).
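To illustrate the idea, here's a minimal sketch of key-based partition selection. This is not Kafka's actual implementation (the default Java partitioner uses murmur2 hashing); it only shows why messages with the same key always land in the same partition:

```python
# Illustrative sketch of key-based partition selection.
# Kafka's real partitioner uses murmur2 hashing; md5 is used here
# purely for demonstration.
import hashlib

def choose_partition(key, num_partitions, fallback=0):
    """Map a message key to a partition index.

    Keyed messages always map to the same partition; for keyless
    messages, the partitioner picks one (modeled here by a fixed
    fallback value).
    """
    if key is None:
        return fallback  # keyless: partitioner's choice
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always yields the same partition:
assert choose_partition(b"user-42", 6) == choose_partition(b"user-42", 6)
```

Because the mapping depends only on the key, all messages for a given key preserve their relative order within one partition.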

Partitions consist of multiple segments, each covering a different range of offsets. Kafka writes messages to segments sequentially. Each partition has exactly one active segment, the last one in the sequence, to which Kafka appends incoming messages.

Several configuration parameters affect the behavior of segments. The first parameter we're interested in is log.segment.bytes, which specifies the maximum size of a segment in bytes. Its default value is 1 GB. When the active segment reaches this size, Kafka closes it and opens a new segment, which becomes the active one. Kafka stores each segment in a log segment file in the file system.
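We can sketch the rolling behavior with a simplified model in which a segment is just a base offset plus an accumulated byte count. Real Kafka segments also roll based on time (log.roll.ms) and index size, which this sketch ignores:

```python
# Simplified model of segment rolling driven by log.segment.bytes.
# A segment is modeled as [base_offset, size_in_bytes]; the last
# list entry is the active segment.

class Partition:
    def __init__(self, segment_bytes):
        self.segment_bytes = segment_bytes
        self.segments = [[0, 0]]  # start with one empty active segment
        self.next_offset = 0

    def append(self, message):
        active = self.segments[-1]
        if active[1] + len(message) > self.segment_bytes and active[1] > 0:
            # Active segment is full: close it and open a new one whose
            # base offset is the offset of the next incoming message.
            self.segments.append([self.next_offset, 0])
            active = self.segments[-1]
        active[1] += len(message)
        self.next_offset += 1

# With a 128-byte limit, each 76-byte record forces a new segment:
p = Partition(segment_bytes=128)
for _ in range(3):
    p.append(b"x" * 76)
assert [seg[0] for seg in p.segments] == [0, 1, 2]
```

Note how each new segment's base offset equals the offset of the first message written to it, which is exactly what the segment file names in a partition directory reflect.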

The second parameter we're interested in is log.retention.hours. Its default value is 168 hours, i.e., one week. Once Kafka closes a segment, it doesn't delete it immediately, even if the log cleanup policy has the delete value, which is the default. Kafka deletes an old segment only after the duration specified by log.retention.hours passes. Therefore, the messages within an old segment remain available to consumers until Kafka deletes it.

There’s another parameter, log.retention.ms, that has the same functionality as log.retention.hours, but its unit is milliseconds. If both parameters are set, log.retention.ms takes precedence.
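The precedence rule can be captured in a few lines. Kafka actually has three retention-time parameters, log.retention.ms, log.retention.minutes, and log.retention.hours, resolved in that order; this sketch models the resolution:

```python
# Sketch of Kafka's retention-time precedence:
# log.retention.ms > log.retention.minutes > log.retention.hours.

def effective_retention_ms(retention_ms=None, retention_minutes=None,
                           retention_hours=168):
    """Return the retention period in milliseconds, applying precedence."""
    if retention_ms is not None:
        return retention_ms
    if retention_minutes is not None:
        return retention_minutes * 60 * 1000
    return retention_hours * 60 * 60 * 1000

# Default: 168 hours = 604,800,000 ms (one week)
assert effective_retention_ms() == 604_800_000
# An explicit log.retention.ms overrides the hours setting:
assert effective_retention_ms(retention_ms=5000, retention_hours=168) == 5000
```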

3. Effects of Changing the Parameters

Generally, we don’t need to change the values of these two parameters. However, if we do change them, we should be aware of their effects. For example, if we set log.segment.bytes to a low value, there will be more segments per partition, so log segment deletion happens more frequently. Additionally, Kafka has to keep a larger number of files open, which may lead to errors such as Too many open files.
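A quick back-of-the-envelope calculation shows how the segment size drives the file count. The numbers below are hypothetical and only illustrate the scaling:

```python
# Rough estimate: segment files per partition is approximately the
# retained data volume divided by log.segment.bytes (ceiling division).
# The data volumes below are made-up illustrative figures.

def approx_segments(retained_bytes, segment_bytes):
    return max(1, -(-retained_bytes // segment_bytes))  # ceiling division

GiB = 1024 ** 3
MiB = 1024 ** 2

# 100 GiB of retained data with the default 1 GiB segments: ~100 files.
assert approx_segments(100 * GiB, 1 * GiB) == 100
# The same data with 64 MiB segments: ~1600 files per partition.
assert approx_segments(100 * GiB, 64 * MiB) == 1600
```

Multiplied across many partitions, a small segment size can easily exhaust a process's open-file limit.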

Setting log.retention.hours to a high value requires more disk space, whereas a lower value implies that Kafka keeps less data. Late-joining consumers might miss data in the latter case.

4. An Example

In this section, we’ll examine the behavior when we change the default values of the two parameters. However, we’ll use the log.retention.ms parameter instead of log.retention.hours as we want to see the effect of changing the parameter’s value quickly.

All of the scripts used come with the installation of Kafka.

4.1. Starting the Kafka Server

We use Kafka Raft (KRaft) mode to start the Kafka server. We begin by generating a cluster identifier using kafka-storage.sh:

$ kafka-storage.sh random-uuid
VkjEOXEjTieLimFLjhvRKA

Next, we format the storage using the cluster identifier:

$ kafka-storage.sh format -t VkjEOXEjTieLimFLjhvRKA -c /home/baeldung/work/kafka/config/kraft/server.properties
metaPropertiesEnsemble=MetaPropertiesEnsemble(metadataLogDir=Optional.empty, dirs={/tmp/kraft-combined-logs: EMPTY})
Formatting /tmp/kraft-combined-logs with metadata.version 3.7-IV4.

The logs are in the /tmp/kraft-combined-logs directory according to the last line of the output. Finally, we start the Kafka server using kafka-server-start.sh:

$ kafka-server-start.sh /home/baeldung/work/kafka/config/kraft/server.properties
[2024-07-23 05:15:50,075] INFO Registered kafka:type=kafka.Log4jController MBean (kafka.utils.Log4jControllerRegistration$)
...

The server is up and running.

4.2. Configuring the Broker’s log.segment.bytes Parameter

It’s possible to configure Kafka dynamically using the kafka-configs.sh script. We can also use the same script to check the current dynamic configuration:

$ kafka-configs.sh --bootstrap-server localhost:9092 --entity-type brokers --entity-name 1 --describe 
Dynamic configs for broker 1 are:

The --bootstrap-server localhost:9092 option is for connecting to the Kafka server running on localhost and listening for client connections on port 9092 by default. The --entity-type brokers option specifies that we want to list the dynamic configuration of brokers. Additionally, the --entity-name 1 option specifies the broker’s name. As we have only one broker, its identification number is 1. Finally, the --describe option lists the configuration for the specified entity.

As is apparent from the output, we don’t have a dynamic configuration for the broker yet. So, let’s update the configuration:

$ kafka-configs.sh --bootstrap-server localhost:9092 --entity-type brokers --entity-name 1 --alter --add-config log.segment.bytes=128
Completed updating config for broker 1.

The --alter option changes the configuration for the specified entity. The --add-config log.segment.bytes=128 option sets log.segment.bytes to 128 bytes. Let’s recheck the configuration:

$ kafka-configs.sh --bootstrap-server localhost:9092 --entity-type brokers --entity-name 1 --describe
Dynamic configs for broker 1 are:
  log.segment.bytes=128 sensitive=false synonyms={DYNAMIC_BROKER_CONFIG:log.segment.bytes=128, STATIC_BROKER_CONFIG:log.segment.bytes=1073741824, DEFAULT_CONFIG:log.segment.bytes=1073741824}

As is apparent from the output, the value of log.segment.bytes is 128 bytes now. We also see the default value of log.segment.bytes, which is 1073741824 bytes, i.e., 1 GB.

This configuration affects the segment size of all topics since we apply the change at the broker level.

4.3. Creating a Topic

We create a topic using kafka-topics.sh:

$ kafka-topics.sh --bootstrap-server localhost:9092 --topic my-topic --create
Created topic my-topic.

The --bootstrap-server localhost:9092 option is for connecting to the Kafka server as before. The --topic my-topic option specifies the topic’s name, my-topic. The --create option specifies the creation of a topic.

We succeeded in creating the topic. Kafka also creates a directory, my-topic-0, corresponding to the topic’s partition in /tmp/kraft-combined-logs:

$ ls -l /tmp/kraft-combined-logs/my-topic-0/
total 8
-rw-r--r-- 1 debian debian 10485760 Jul 23 05:20 00000000000000000000.index
-rw-r--r-- 1 debian debian        0 Jul 23 05:20 00000000000000000000.log
-rw-r--r-- 1 debian debian 10485756 Jul 23 05:20 00000000000000000000.timeindex
-rw-r--r-- 1 debian debian        8 Jul 23 05:20 leader-epoch-checkpoint
-rw-r--r-- 1 debian debian       43 Jul 23 05:20 partition.metadata

The 00000000000000000000.log file is an actual segment containing messages starting from a specific offset. The file’s name is the segment’s base (starting) offset, zero-padded to 20 digits. Its size is 0 bytes since the segment contains no messages yet.
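The naming scheme is simple enough to sketch: the base offset is formatted as a 20-digit, zero-padded number, followed by the file extension:

```python
# Sketch of how a segment's base offset maps to its log file name:
# the offset, zero-padded to 20 digits, plus the .log extension.

def segment_file_name(base_offset):
    return f"{base_offset:020d}.log"

assert segment_file_name(0) == "00000000000000000000.log"
assert segment_file_name(1) == "00000000000000000001.log"
```

The matching .index and .timeindex files share the same zero-padded prefix, so the base offset of every segment is visible at a glance in the partition directory.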

4.4. Writing Messages

Let’s start a producer using the kafka-console-producer.sh script:

$ kafka-console-producer.sh --bootstrap-server localhost:9092 --topic my-topic
>

The --bootstrap-server localhost:9092 option is for connecting to the server. The --topic my-topic option specifies the name of the topic to which we want to produce messages.

The arrowhead symbol, >, shows that we’re ready to send messages to my-topic. Let’s send a message, Message1, to my-topic:

$ kafka-console-producer.sh --bootstrap-server localhost:9092 --topic my-topic
>Message1
>

Let’s check the /tmp/kraft-combined-logs/my-topic-0 directory again after writing the message:

$ ls -l /tmp/kraft-combined-logs/my-topic-0/
total 12
-rw-r--r-- 1 debian debian 10485760 Jul 23 05:27 00000000000000000000.index
-rw-r--r-- 1 debian debian       76 Jul 23 05:28 00000000000000000000.log
-rw-r--r-- 1 debian debian 10485756 Jul 23 05:27 00000000000000000000.timeindex
-rw-r--r-- 1 debian debian        8 Jul 23 05:27 leader-epoch-checkpoint
-rw-r--r-- 1 debian debian       43 Jul 23 05:27 partition.metadata

We have the same set of files as in the previous check. However, the size of the 00000000000000000000.log file is now 76 bytes since we’ve appended a message to the active segment.

Let’s now send a second message, Message2:

$ kafka-console-producer.sh --bootstrap-server localhost:9092 --topic my-topic
>Message1
>Message2
>

Let’s check the /tmp/kraft-combined-logs/my-topic-0 directory once more:

$ ls -l /tmp/kraft-combined-logs/my-topic-0/
total 24
-rw-r--r-- 1 debian debian        0 Jul 23 05:31 00000000000000000000.index
-rw-r--r-- 1 debian debian       76 Jul 23 05:28 00000000000000000000.log
-rw-r--r-- 1 debian debian       12 Jul 23 05:31 00000000000000000000.timeindex
-rw-r--r-- 1 debian debian 10485760 Jul 23 05:31 00000000000000000001.index
-rw-r--r-- 1 debian debian       76 Jul 23 05:31 00000000000000000001.log
-rw-r--r-- 1 debian debian       56 Jul 23 05:31 00000000000000000001.snapshot
-rw-r--r-- 1 debian debian 10485756 Jul 23 05:31 00000000000000000001.timeindex
-rw-r--r-- 1 debian debian        8 Jul 23 05:27 leader-epoch-checkpoint
-rw-r--r-- 1 debian debian       43 Jul 23 05:27 partition.metadata

Since we set the maximum size of a segment to a small value, 128 bytes, Kafka created a new segment and appended the second message to it. The file corresponding to the new active segment is 00000000000000000001.log. Its size is the same as that of the first segment file, 76 bytes, since each file contains a single message of the same length.

Therefore, Kafka creates a new segment when the active segment is full.

4.5. Reading Messages

Let’s now start a consumer using kafka-console-consumer.sh:

$ kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my-topic --from-beginning
Message1
Message2

The meaning of --bootstrap-server localhost:9092 is the same as before. The --topic my-topic option specifies that we read messages from my-topic. Finally, the --from-beginning option is for reading the messages previously written by producers.

We read the messages successfully.

4.6. Configuring the Broker’s log.retention.ms Parameter

Now, it’s time to update the broker’s log.retention.ms configuration:

$ kafka-configs.sh --bootstrap-server localhost:9092 --entity-type brokers --entity-name 1 --alter --add-config log.retention.ms=5000
Completed updating config for broker 1.

We set log.retention.ms to 5000 milliseconds this time. Here’s the broker’s current dynamic configuration:

$ kafka-configs.sh --bootstrap-server localhost:9092 --entity-type brokers --entity-name 1 --describe
Dynamic configs for broker 1 are:
  log.retention.ms=5000 sensitive=false synonyms={DYNAMIC_BROKER_CONFIG:log.retention.ms=5000, STATIC_BROKER_CONFIG:log.retention.hours=168, DEFAULT_CONFIG:log.retention.hours=168}
  log.segment.bytes=128 sensitive=false synonyms={DYNAMIC_BROKER_CONFIG:log.segment.bytes=128, STATIC_BROKER_CONFIG:log.segment.bytes=1073741824, DEFAULT_CONFIG:log.segment.bytes=1073741824}

In addition to log.segment.bytes, we’ve now also set log.retention.ms successfully. As is apparent from the output, the default retention comes from log.retention.hours and is 168 hours.

4.7. Reading Messages Again

Let’s reread the messages after setting log.retention.ms to 5000 milliseconds:

$ kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my-topic --from-beginning

We couldn’t read any messages, as more than 5000 milliseconds had passed since we wrote the two messages, Message1 and Message2. Let’s check the content of /tmp/kraft-combined-logs/my-topic-0:

$ ls -l /tmp/kraft-combined-logs/my-topic-0/
total 12
-rw-r--r-- 1 debian debian        0 Jul 23 05:35 00000000000000000002.log
-rw-r--r-- 1 debian debian       56 Jul 23 05:35 00000000000000000002.snapshot
-rw-r--r-- 1 debian debian 10485756 Jul 23 05:35 00000000000000000002.timeindex
-rw-r--r-- 1 debian debian        8 Jul 23 05:35 leader-epoch-checkpoint
-rw-r--r-- 1 debian debian       43 Jul 23 05:27 partition.metadata

Notably, the two log segment files corresponding to the two deleted messages no longer exist. Additionally, Kafka created a new active segment, whose file is 00000000000000000002.log. Its size is 0 bytes as there are no messages in this segment yet.

5. Conclusion

In this article, we discussed the difference between the log.segment.bytes and log.retention.hours parameters of Kafka. First, we learned that log.segment.bytes affects the number of segments in a partition. Then, we saw that log.retention.hours and log.retention.ms are related to how long Kafka keeps old segments before deleting them. Finally, we changed the default values of log.segment.bytes and log.retention.ms, and saw the effects in an example.