Kafka Github Notes
[Link]
Kafka notes
These are the most relevant notes that I collected from several books and
courses. I organized them and made some edits to help deliver the concepts of
Apache Kafka in the most comprehensible way. Anyone who wants to learn
Apache Kafka can use these notes without digging through too many
resources on the Internet. This may be the only material you need to build
a basic understanding of Kafka and to learn how to configure your applications
to use Kafka properly in production.
If you want a deeper understanding of Kafka, or of how to use Kafka Streams for
big data processing, I highly recommend checking out the books and courses
linked in the reference section.
Table of contents:
Kafka notes
Kafka introduction
Use Cases
Kafka architecture
Log
Topics
Partitions
Producer
Consumer
Consumer group
Broker
Cluster controller
Leader replica
Follower replica
Configurations
Hardware selection
Disk Throughput
Disk capacity
Memory
Partitions
Configure topic
Configure producer
Producer batching
Idempotent producer
Configure consumer
Consumer offset
Schema registry
Case study
GetTaxi
Campaign compare
Mysocial media
Kafka internal
Request processing
Physical storage
Partition Allocation
File Management
References
Kafka introduction
The data problem
You need a way to send data to a central storage quickly
Because machines frequently fail, you also need your data replicated, so
those inevitable failures don't cause downtime and data loss
Disk-based retention: messages are persisted on disk, so consumers that fall
behind do not lose data
Fast: Kafka is a good solution for applications that require a high-throughput,
low-latency messaging solution. Kafka can write up to 2 million requests per
second
Scalable:
Batch data in chunks: Kafka is all about batching data into chunks. This
minimises cross-machine latency and the buffering/copying that
accompanies it.
Can scale horizontally: the ability to have thousands of partitions for a single
topic spread among thousands of machines means Kafka can handle huge
loads.
Kafka architecture
Kafka is a message broker. A broker is an intermediary that brings together two
parties that don't necessarily know each other for a mutually beneficial
exchange or deal.
Log
The configuration setting [Link] specifies where Kafka stores log data on disk.
Topics
orders
customers
payments
To help manage the load of messages coming into a topic, Kafka uses partitions.
Partitions are the way that Kafka provides redundancy and scalability. Each
partition can be hosted on a different server, which means that a single topic can
be scaled horizontally across multiple servers.
Partitions
Help increase throughput
Partition: a logical unit used to break a topic into splits for redundancy and
scalability. You can see the Log stored on disk, but a Partition you can't:
partitions are handled logically.
Records with the same key will always be sent to the same partition, in order.
By default the producer does not care which partition a specific message is
written to and will balance messages evenly over all partitions of a topic.
In some cases, the producer will direct messages to specific partitions using a
message key. Messages with the same message key are guaranteed to arrive in
the right order within a partition.
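Key-based partitioning can be sketched as hashing the key modulo the partition count. Kafka's default partitioner actually uses a murmur2 hash; the CRC32 below is an illustrative stand-in with the same key-to-partition property:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Kafka's default partitioner uses murmur2; CRC32 here is only an
    # illustrative stand-in that preserves the same-key -> same-partition rule.
    return zlib.crc32(key) % num_partitions

# The same key always lands on the same partition, preserving per-key order.
p1 = partition_for(b"order-42", 6)
p2 = partition_for(b"order-42", 6)
```

Because the result depends on the partition count, increasing the number of partitions later changes which partition a key maps to.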
The consumer subscribes to one or more topics and reads the messages in
the order in which they were produced.
The offset is a simple integer that Kafka uses to maintain the current
position of a consumer.
Consumer group
Consumers in a group work together to consume a topic. The group assures that
each partition is consumed by only one member. If a single consumer fails, the
remaining members of the group will rebalance the partitions being consumed to
take over for the missing member.
Consumer groups are used to read and process data in parallel.
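The one-partition-per-member rule and the rebalance on failure can be sketched with a simple round-robin assignment. This is a simplification, not Kafka's actual group protocol:

```python
def assign(partitions, consumers):
    # Round-robin assignment: each partition goes to exactly one consumer.
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Two members share six partitions; after one member fails, a rebalance
# reassigns everything to the survivor.
before = assign(list(range(6)), ["c1", "c2"])
after_failure = assign(list(range(6)), ["c1"])
```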
The producer then serializes the key and value objects to byte arrays so they
can be sent over the network.
Once a partition is selected, the producer adds the record to a batch of
records that will be sent to the same topic and partition.
ZooKeeper acts as a centralized service and helps keep track of the status of the
Kafka cluster's nodes, as well as Kafka topics and partitions.
Cluster controller
In a cluster, one broker will also function as the cluster controller.
The cluster controller is one of the Kafka brokers that, in addition to the usual
broker functionality, is responsible for administrative tasks such as electing
partition leaders and monitoring for broker failures.
The first broker that starts in the cluster becomes the controller.
Leader replica
Each partition has a single replica designated as the leader.
Each partition also has a preferred leader: the replica that was the leader when
the topic was originally created.
Follower replica
All replicas for a partition that are not leaders are called followers.
Followers only replicate messages from the leader and stay up to date with the
most recent messages the leader has.
When a leader crashes, one of the follower replicas will be promoted to become
the new leader.
A follower replica that has caught up with the most recent messages of the leader
is called an in-sync replica.
Only in-sync replicas are eligible to be elected as partition leader if the
existing leader fails.
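The eligibility rule above can be sketched as a simple election: only a surviving in-sync replica may become the new leader. This is a minimal sketch, not the controller's actual election code:

```python
def elect_leader(replicas, in_sync, failed_leader):
    # Only in-sync replicas are eligible; pick the first surviving one
    # in replica-list order (a simplification of Kafka's election).
    for r in replicas:
        if r != failed_leader and r in in_sync:
            return r
    raise RuntimeError("no in-sync replica available")

# Broker 1 (the leader) fails; broker 3 is out of sync, so broker 2 wins.
leader = elect_leader([1, 2, 3], in_sync={1, 2}, failed_leader=1)
```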
Configurations
Hardware selection
Disk Throughput
Faster disk writes will equal lower produce latency
SSDs have drastically lower seek and access times and will provide the best
performance
Disk capacity
If the broker is expected to receive 1 TB of traffic each day, with 7 days of
retention, then the broker will need a minimum of 7 TB of usable storage for
message retention.
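The sizing arithmetic above is just traffic times retention, plus some headroom; the 10% overhead figure below is an assumption for illustration:

```python
def min_storage_tb(daily_tb, retention_days, overhead=0.1):
    # Retained data plus headroom for other files (assumed 10% overhead).
    return daily_tb * retention_days * (1 + overhead)

# 1 TB/day with 7-day retention -> 7 TB of messages, ~7.7 TB with headroom.
needed = min_storage_tb(1, 7)
```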
Memory
Having more memory available to the system for page cache will improve the
performance of consumer clients.
Partitions
Each partition can handle a throughput of a few MB/s
Guidelines:
Configure topic
Why should I care about topic config?
Compact: only stores the most recent value for each key in the topic. It only
works on topics for which applications produce events that contain both a key
and a value.
[Link]=delete :
[Link]=compact :
Will delete old duplicate keys after the active segment is committed.
[Link] :
A lower number means that less data is retained (if your consumers are
down for too long, they can miss data).
[Link] :
[Link] : the time Kafka will wait before closing the segment if it is not full.
Configure producer
You only need to connect to one broker, and you will be connected to the entire
cluster.
acks: controls how many partition replicas must receive the record before the
producer can consider the write successful.
acks=0: the producer will not wait for a reply from the broker before
assuming the message was sent successfully. The message may be lost,
but the producer can send messages as fast as the network will support, so
this setting can be used to achieve very high throughput.
acks=1: With a setting of 1, the producer will consider the write successful
when the leader receives the record. The leader replica will know to
immediately respond the moment it receives the record and not wait any
longer.
acks=all: the producer will consider the write successful when all of the in-
sync replicas receive the record. This is achieved by the leader broker
being smart as to when it responds to the request — it’ll send back a
response once all the in-sync replicas receive the record themselves.
In-sync replicas: An in-sync replica (ISR) is a replica that has the latest
data for a given partition. A leader is always an in-sync replica. A follower
is an in-sync replica only if it has fully caught up to the partition it’s
following.
If the number of in-sync replicas falls below the configured minimum, the
producer will start receiving exceptions.
[Link]
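The acks trade-off can be captured in the producer configuration. A hedged sketch using the Java client's property names (the broker address is a placeholder):

```python
# Producer settings favouring durability (Java-client property names).
producer_config = {
    "bootstrap.servers": "broker1:9092",  # one broker is enough to discover the cluster
    "acks": "all",                   # wait for all in-sync replicas
    "retries": "2147483647",         # retry on transient errors
    "enable.idempotence": "true",    # broker deduplicates retried records
    "[Redacted Person]": "33554432",    # 32 MB of buffered, unsent records
}
```

For a throughput-first profile you would instead set acks=0 or acks=1 and accept the possibility of losing messages.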
buffer memory: sets the amount of memory the producer will use to
buffer messages waiting to be sent to brokers.
retries: how many times the producer will retry sending the message.
[Link]: the producer will batch messages together; when a batch is full, all
the messages in the batch are sent at once.
Better throughput
Producer batching
If more messages have to be sent while others are in flight, Kafka is smart
and will batch them while they wait so it can send them all at once.
A batch is allocated per partition, so make sure you don't set the batch size
too high.
If the producer produces faster than the broker can accept, the records will be
buffered in memory.
[Link]=33554432 (32 MB)
[Link]=60000 : the time .send() will block before throwing an exception when
the buffer is full.
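Batching behaviour can be sketched as greedy accumulation: close the current batch when the next message would overflow it. This is a toy model of the producer, not its real logic (which also considers [Redacted Person] and in-flight requests):

```python
def count_batches(message_sizes, batch_size):
    # Greedy batching per partition: close the batch when the next
    # message would exceed batch_size.
    batches, current = 1, 0
    for size in message_sizes:
        if current + size > batch_size:
            batches += 1
            current = 0
        current += size
    return batches

# Ten 400-byte messages with a 1000-byte batch: two messages fit per batch.
n = count_batches([400] * 10, batch_size=1000)
```

Larger batches mean fewer requests (better throughput) at the cost of slightly higher latency per message.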
Idempotent producer
The producer can introduce duplicate messages in Kafka due to network errors:
if the broker's acknowledgment is lost, the producer retries and the broker
writes the record twice. Enabling idempotence lets the broker detect and
discard such retried duplicates.
Consumer offset
Kafka stores the offset at which a consumer group has been reading.
If a consumer dies, it will be able to read back from where it left off thanks to
the committed consumer offset.
If the offset is committed before processing and the processing then goes wrong,
the message will be lost (it won't be read again).
How to commit?
AutoCommit:
[Link]
Manual Commit:
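A hedged sketch of a manual-commit setup, using the Java client's property names; the group id is a placeholder:

```python
# Consumer settings for manual offset commits (Java-client property names).
consumer_config = {
    "group.id": "order-processors",    # placeholder group name
    "enable.auto.commit": "false",     # commit only after processing succeeds
    "auto.offset.reset": "earliest",   # where to start with no committed offset
}
# Typical loop (pseudocode): poll() -> process records -> commitSync().
# Committing after processing gives at-least-once delivery; committing
# before processing risks losing messages that fail mid-processing.
```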
Schema registry
Kafka takes bytes as input and publishes them with no data verification. What if
the producer sends bad data? The consumers break. So producers and consumers
need to agree on the shape of the data: a schema registry stores the schemas
and checks new versions for compatibility.
Case study
GetTaxi
The pricing should "surge" if the number of drivers is low or the number of
users is high
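The surge rule can be sketched as a function of the rider-to-driver ratio; the thresholds and multipliers below are invented for illustration:

```python
def surge_multiplier(drivers, riders, base=1.0):
    # Toy rule: price rises with the rider/driver ratio (assumed thresholds).
    if drivers == 0 or riders / drivers > 2:
        return base * 2.0
    if riders / drivers > 1:
        return base * 1.5
    return base

m = surge_multiplier(drivers=10, riders=25)  # high demand -> surge
```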
Campaign compare
Mysocial media
Users should be able to post, like and comment
Users should see the total number of likes and comments per post in real time
It is also very common to have Kafka serve a "speed layer" for real time
applications
Kafka internal
Request processing
Processor: request-handling threads are responsible for taking requests from
client connections, placing them in a request queue, picking up responses from
the response queue, and sending them back to clients. There are two common
types of request:
Produce requests
Fetch requests
The client uses another request type, called a metadata request, to get
information about where to send requests; it contains a list of topics the client
is interested in. The server's response specifies which partitions exist in those
topics, the replicas for each partition, and which replica is the leader.
Clients usually cache this information and periodically refresh it
(controlled by the [Link] configuration parameter).
If a new broker is added or some replicas are moved to a new broker, a client
may receive the error Not a Leader; it will then refresh its metadata before
trying to send the request again.
Partition Allocation
Suppose you have 6 brokers and you decide to create a topic with 10 partitions
and a replication factor of 3. Kafka now has 30 partition replicas to allocate to 6
brokers.
If the brokers have rack information, then assign the replicas for each partition
to different racks if possible
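The 30-replica example above can be sketched with a round-robin allocator: the leader of each partition lands on the next broker, and followers on the brokers after it. This is a simplification; Kafka's real algorithm also randomizes the starting broker and is rack-aware:

```python
def allocate(num_brokers, num_partitions, replication_factor):
    # Simplified round-robin: partition p's leader on broker p % n,
    # followers on the next brokers in sequence.
    assignment = {}
    for p in range(num_partitions):
        assignment[p] = [(p + i) % num_brokers
                         for i in range(replication_factor)]
    return assignment

# 10 partitions x replication factor 3 = 30 replicas over 6 brokers.
plan = allocate(num_brokers=6, num_partitions=10, replication_factor=3)
```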
File Management
The Kafka administrator configures a retention period for each topic: either the
amount of time to retain messages or the amount of data to retain.
Partitions are split into segments. Each segment contains either 1 GB of data
or a week of data, whichever is smaller. The segment we are currently writing
to is called the active segment. The active segment is never deleted.
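Retention then operates per segment: a closed segment becomes deletable once it is older than the retention window, while the active segment is always kept. A minimal sketch of that rule:

```python
def expired_segments(segment_ages_days, retention_days, active_index):
    # A segment is deletable once it is older than the retention window;
    # the active segment is never deleted, whatever its age.
    return [i for i, age in enumerate(segment_ages_days)
            if age > retention_days and i != active_index]

# Segments aged 10, 8, 3 and 0 days; 7-day retention; index 3 is active.
old = expired_segments([10, 8, 3, 0], retention_days=7, active_index=3)
```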
References