Kafka Interview Questions
Follow me Here:
LinkedIn:
https://www.linkedin.com/in/ajay026/
https://lnkd.in/geknpM4i
COMPLETE KAFKA
DATA ENGINEER INTERVIEW
QUESTIONS & ANSWERS
1. How would you explain Kafka to someone non-technical?
Answer:
Think of Kafka like a logistics hub. Producers are trucks delivering parcels (data), Kafka topics
are like storage shelves where those parcels are temporarily stored, and consumers are staff
who pick up the parcels for delivery elsewhere. It ensures smooth, real-time flow of data
from source to destination reliably.
2. You have 5 million events coming in every minute. How would you handle topic design?
Answer:
I'd design the topic with enough partitions to match or slightly exceed the number of
consumers for parallelism. For high-throughput, I’d consider 50+ partitions, ensure keys are
well distributed to avoid skew, and enable compression like Snappy to optimize throughput.
Also, monitor consumer lag constantly.
3. What are the downsides of having too many partitions in a topic?
Answer:
While more partitions improve parallelism and throughput, they increase overhead: more
open file handles, longer leader election, and slower controller operations. It also makes
topic-level operations more resource-intensive. I’ve seen degraded performance during
broker restarts in such setups.
4. What happens if a producer writes to a topic that doesn't exist?
Answer:
If auto.create.topics.enable is true, Kafka creates the topic with default partition/replication
settings — which might be misaligned with SLAs. I prefer keeping that disabled and using
infrastructure-as-code to define topics explicitly.
5. How do consumer groups handle rebalancing and what issues can it cause?
Answer:
Rebalancing is triggered when consumers join or leave the group, or when a consumer fails to poll within max.poll.interval.ms. During a rebalance, consumption pauses and partitions are reassigned, which can cause duplicate processing. Kafka doesn't rebalance based on consumer speed, so a slow consumer simply lags behind; we handle this by scaling out with more instances or increasing max.poll.interval.ms so slow consumers don't trigger unnecessary rebalances.
6. Does Kafka support exactly-once delivery?
Answer:
Yes, with idempotent producers, transactions, and Kafka Streams. I’ve enabled
enable.idempotence=true and used transactions to avoid double writes. But it’s only truly
exactly-once when the downstream sink supports idempotency too.
7. What is the ISR (in-sync replica set) and why does it matter?
Answer:
ISR is the set of replicas that are up-to-date with the leader. Only replicas in ISR are eligible
for leadership. This ensures durability and consistency. If ISR shrinks, durability is
compromised.
9. How do message keys affect partitioning?
Answer:
Keys control partitioning. If all events share a key (say user ID), they go to one partition,
causing skew. I use hashing strategies or composite keys to spread load evenly.
10. You get duplicated messages from a producer. How do you handle it in consumers?
Answer:
At-least-once delivery can cause duplication. I include a unique event ID and deduplicate
downstream (via upserts or state stores in Kafka Streams). Alternatively, use idempotent
writes to sinks.
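A minimal consumer-side dedup sketch (kafka-python assumed; the event_id field, topic names, and the in-memory TTL cache are illustrative — real pipelines usually dedupe in the sink or a state store):
from kafka import KafkaConsumer
import json, time

consumer = KafkaConsumer('orders', bootstrap_servers='localhost:9092',
                         group_id='order-processors', enable_auto_commit=False)
seen = {}                      # event_id -> timestamp of last sighting
TTL = 3600                     # remember IDs for an hour

for msg in consumer:
    event = json.loads(msg.value)
    event_id = event.get('event_id')        # unique ID stamped by the producer
    now = time.time()
    seen = {k: t for k, t in seen.items() if now - t < TTL}   # expire old IDs
    if event_id in seen:
        consumer.commit()                   # duplicate: skip but still advance the offset
        continue
    handle_order(event)                     # placeholder for real processing
    seen[event_id] = now
    consumer.commit()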
12. What is log compaction and where have you used it?
Answer:
It keeps only the latest message per key. We used this in a user-profile topic where only the
most recent profile data was needed — not the entire history.
13. Explain the difference between Kafka and traditional queuing systems in a
performance-critical system.
Answer:
Traditional queues (like RabbitMQ) push messages and delete them once consumed. Kafka
stores data for a time period, allowing multiple consumers to read independently. This
makes it better for event sourcing and backtracking.
14. How would you implement retry and DLQ logic in Kafka?
Answer:
Use retry topics with incremental delays (e.g. topic-retry-1, retry-2). After max retries, send
to DLQ. I build this using a middleware wrapper around consumers or Kafka Streams.
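A minimal sketch of that wrapper idea (kafka-python assumed; topic names, header handling, MAX_RETRIES, and process() are illustrative):
from kafka import KafkaConsumer, KafkaProducer

MAX_RETRIES = 3
consumer = KafkaConsumer('orders', bootstrap_servers='localhost:9092',
                         group_id='order-processors', enable_auto_commit=False)
producer = KafkaProducer(bootstrap_servers='localhost:9092')

for msg in consumer:
    attempts = int(dict(msg.headers or []).get('retries', b'0'))   # retry count carried in a header
    try:
        process(msg.value)                                         # business logic (placeholder)
    except Exception:
        target = 'orders-retry' if attempts < MAX_RETRIES else 'orders-dlq'
        producer.send(target, value=msg.value,
                      headers=[('retries', str(attempts + 1).encode())])
    consumer.commit()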
15. Scenario: Consumer reads data but dies before committing offset. What happens?
Answer:
If offset wasn’t committed, upon restart it re-reads the message — at-least-once semantics.
Hence, I always place offset commits after successful processing.
16. How do you scale consumers to keep up with growing traffic?
Answer:
Partitioning strategy is the base — consumers in a group scale up to the number of
partitions. I use Kubernetes HPA to auto-scale consumers and monitor lag via Prometheus.
17. In which scenario would you prefer compacted over delete-based retention?
Answer:
When only the latest value per key matters (for example, current user profile or configuration state) rather than the full history — compaction keeps the most recent record per key, while delete-based retention drops data purely by age or size.
18. Can you give a small Python snippet to consume from a topic with manual offset
management?
Answer:
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'orders',
    bootstrap_servers='localhost:9092',
    enable_auto_commit=False,
    group_id='order-processors'
)

for message in consumer:
    process_order(message.value)   # business logic; raise on failure so the offset isn't committed
    consumer.commit()
19. How would you handle ordering guarantees for multi-partition topics?
Answer:
Ordering is only guaranteed within a partition. So, I use a consistent key (e.g. order_id) to
ensure all related messages go to the same partition.
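A quick sketch of key-based routing (kafka-python assumed; topic, key, and payloads are illustrative) — every record with the same key hashes to the same partition, so its order is preserved:
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
# both events for order ORD-123 land in the same partition, in send order
producer.send('order-events', key=b'ORD-123', value=b'{"status": "CREATED"}')
producer.send('order-events', key=b'ORD-123', value=b'{"status": "PAID"}')
producer.flush()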
20. You’re receiving late-arriving messages in out-of-order fashion. How do you process
them correctly?
Answer:
Kafka delivers records in arrival order per partition and does not reorder by event time. I handle late arrivals with event-time windowing plus a grace period in Kafka Streams (or a buffer-and-sort step downstream), so late records land in the correct window instead of corrupting results.
1. How does Kafka guarantee durability and avoid data loss when a broker fails?
Answer:
Kafka writes messages to disk immediately and replicates them across brokers (based on
replication factor). A message is considered committed when it's replicated to all in-sync
replicas (ISR). If a leader broker crashes, one ISR becomes the new leader — ensuring no
data loss (assuming unclean leader election is disabled).
2. What are the trade-offs between increasing replication factor from 2 to 3 in a Kafka
cluster?
Answer:
Pros: higher fault tolerance — data survives up to two broker failures.
Cons: roughly 50% more disk usage and replication traffic, and slightly higher produce latency with acks=all, since one more replica has to stay in sync.
3. Explain Kafka's high availability mechanism with ISR and Leader Election.
Answer:
Kafka maintains a list of ISR for each partition. The controller (a special broker) triggers
leader election when the leader fails. Only brokers in ISR can become the leader, ensuring
no data loss. We've tuned replica.lag.time.max.ms to control how quickly out-of-sync
replicas are kicked out from ISR.
4. What happens during broker failure and how does Kafka recover?
Answer:
The controller detects the failure via ZooKeeper (or KRaft in newer versions) and reassigns
partition leadership. Kafka doesn’t re-replicate immediately — it waits for broker recovery.
I’ve added alerts on partition under-replicated metrics to track such situations.
5. Scenario: Producers are publishing, but consumers suddenly receive nothing. How do you debug it?
Answer:
• Confirm retention hasn't expired the messages
We debugged a case where the leader was stuck and unclean.leader.election.enable=false prevented failover — it caused silence until manual intervention.
6. How does Kafka's batching mechanism work internally in the producer and what
parameters affect it?
Answer:
Producer batches messages using batch.size and linger.ms. Messages are grouped into a
batch per partition and sent as a single request. Larger batch size improves throughput but
increases latency. We found linger.ms=20 optimal for balancing latency vs batching.
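A hedged sketch of those producer-side batching knobs in kafka-python (the values are illustrative, not recommendations):
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    batch_size=32768,          # max bytes per per-partition batch
    linger_ms=20,              # wait up to 20 ms for a batch to fill
    compression_type='snappy'  # compress whole batches for better throughput
)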
7. What role does the Kafka Controller play and how does leader election work at the
cluster level?
Answer:
The controller handles metadata management — partition leadership, topic changes, broker
registrations. Kafka elects one controller (via ZooKeeper/KRaft) and replicates metadata
updates to all brokers. We've hit issues during controller failovers causing brief metadata
inconsistencies — solved by controller prioritization.
8. How does Kafka handle backpressure and what architectural decisions support it?
Answer:
Kafka itself is resilient — it keeps writing to disk. Backpressure is mainly handled at the
consumer side. We tune max.poll.records, fetch.min.bytes, and consumer lag thresholds to
throttle or scale out. On the producer side, we control flow using buffer.memory and
max.in.flight.requests.
9. What's the difference between replica.lag.time.max.ms and replica.lag.max.messages?
Answer:
Both control ISR membership. replica.lag.time.max.ms removes a follower from the ISR if it hasn't caught up within that time window; replica.lag.max.messages (removed in newer Kafka versions) dropped followers that fell behind by a fixed message count, which behaved badly under bursty traffic.
10. What internal thread pools does Kafka broker use and how can thread tuning affect
throughput?
Answer:
Kafka brokers use:
• Network threads (num.network.threads) that accept requests and send responses
• I/O threads (num.io.threads) that read and write the log on disk
• Background threads for log cleaning, replication fetchers, and controller work
Increasing these improves concurrency but hits CPU and I/O bottlenecks fast. For high-
throughput clusters, we tune num.io.threads and num.network.threads based on partition
count and I/O load.
11. How does Kafka maintain offset consistency across consumer rebalances?
Answer:
Offsets are stored in the __consumer_offsets topic. On rebalance, new consumers fetch the
committed offset and resume. I’ve seen corrupted offset commits when auto-commit is
misused — best practice is manual commit after successful processing.
12. How does Kafka achieve horizontal scalability compared to traditional messaging
systems?
Answer:
Partitioning is the core — Kafka topics can be split into hundreds of partitions, each handled
independently by brokers and consumers. Unlike traditional queues, Kafka lets you scale
reads and writes linearly with partition count.
14. Why is Kafka's read path described as "zero-copy" and how does it improve performance?
Answer:
Kafka uses sendfile syscall to transfer data from page cache to network socket without
copying to user space — reducing CPU usage and speeding up throughput. This architecture
makes Kafka great for high-volume use cases.
15. In a Kafka cluster of 5 brokers, why do we often have more partitions than brokers?
Answer:
To enable parallelism across consumers and producers. Partition count defines the
concurrency. We often create 3–5x partitions per broker to allow better load balancing and
future scaling.
16. Scenario: Kafka topic shows 200 partitions, but only 20 consumers in a group. What’s
the effect?
Answer:
All 200 partitions are still consumed — each of the 20 consumers is assigned roughly 10 — but parallelism is capped at 20, so each consumer carries more load and lag can build up. Either add more consumers (up to 200) or split the work across multiple groups if the use-case permits.
17. How does Kafka ensure order within partitions but not across?
Answer:
Messages with the same key go to the same partition. Kafka writes/reads to partitions in
sequence. Since partitions are processed independently, there's no global ordering across
them.
18. How does Kafka maintain metadata and how can it become a bottleneck?
Answer:
Kafka stores topic, partition, broker, and offset metadata in memory. For large clusters
(thousands of topics/partitions), metadata size explodes, causing GC pressure and slow
rebalances. We’ve split topics across multiple clusters to mitigate.
19. What are log segments and index files, and how do they affect performance?
Answer:
Log segment files store the actual messages, while index files help locate offsets quickly.
Kafka rolls segments based on size or time. I’ve tuned log.segment.bytes in workloads with
high churn for better compaction and cleanup.
20. Coding: Write a script to list partitions with under-replicated replicas using Kafka
Admin API in Python.
Answer:
The idea is to pull cluster metadata and flag every partition whose ISR list is smaller than its replica list; the broker exposes the same signal as the JMX metric kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions.
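A runnable sketch of that check using the confluent-kafka AdminClient (assumed to be available); it flags partitions whose ISR is shorter than their replica list:
from confluent_kafka.admin import AdminClient

admin = AdminClient({'bootstrap.servers': 'localhost:9092'})
metadata = admin.list_topics(timeout=10)
for name, topic in metadata.topics.items():
    for pid, p in topic.partitions.items():
        if len(p.isrs) < len(p.replicas):           # fewer in-sync replicas than assigned
            print(f"Under-replicated: {name}[{pid}] replicas={p.replicas} isr={p.isrs}")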
1. When do you use synchronous vs asynchronous sends in a producer?
Answer:
For latency-sensitive or failure-critical paths, I use synchronous sends (blocking on the returned future) so failures surface immediately; for high-throughput pipelines I send asynchronously with callbacks to handle errors.
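A small sketch of both styles with kafka-python (topic and payloads are illustrative):
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')

# Synchronous: block on the returned future so failures surface immediately
metadata = producer.send('payments', b'{"amount": 100}').get(timeout=10)

# Asynchronous: attach callbacks and keep producing
future = producer.send('payments', b'{"amount": 200}')
future.add_callback(lambda md: None)                        # success hook
future.add_errback(lambda exc: print('send failed:', exc))  # failure hook
producer.flush()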
2. How do the different acks settings trade durability for latency?
Answer:
acks=all gives best durability (only marked successful when all ISR replicas acknowledge),
but increases latency. I’ve used it with idempotent producers to ensure exactly-once
semantics. In less critical logging pipelines, I relax it to acks=1.
3. Scenario: The producer fails with a "message too large" error. What do you do?
Answer:
That means the message size exceeds max.request.size on the producer or
message.max.bytes on the broker. I inspect both values and also compress messages using
compression.type=snappy. If it's one-off large messages, we chunk or route to a separate
topic.
4. Why might a producer retry cause message reordering even with idempotence
enabled?
Answer:
If max.in.flight.requests.per.connection > 1, retries of earlier batches can complete after
later ones. I set this to 1 for strict ordering or leave it at 5+ for higher throughput when
ordering isn’t critical.
5. What's the difference between enable.auto.commit=true and false?
Answer:
With true, Kafka commits offset periodically (e.g. every 5 seconds). If processing fails,
messages are lost. I always set false and commit after successful processing, ensuring at-
least-once delivery. This gives full control and safer recovery.
print("Shutting down...")
consumer.commit()
consumer.close()
sys.exit(0)
signal.signal(signal.SIGINT, shutdown)
process(message.value)
7. Describe a real issue you faced with high consumer lag and how you fixed it.
Answer:
We had a spike in consumer lag due to a downstream API outage. Messages piled up, and
rebalance caused reset to older offsets. I disabled auto commit, isolated the slow partition,
and processed it in a separate retry queue. Also added backpressure to throttle producers.
8. What is a producer interceptor and when have you used one?
Answer:
It allows you to modify or filter messages before they're published. I’ve used it for
appending trace IDs, enriching metadata, or blocking PII data in sensitive pipelines. It’s
cleaner than embedding this logic in the main code.
10. Can multiple threads consume the same partition? How do you parallelize processing?
Answer:
Kafka assigns a partition to one thread at a time. If you use multiple threads per partition,
you lose ordering. I use one thread per partition model and use a thread-safe queue to
batch process or offload.
13. When is a custom partitioner useful?
Answer:
When keys have an uneven distribution and the default partitioner causes skew — for example, hashing on user ID versus a length-based segmentation strategy.
14. Scenario: You notice that your consumer is reprocessing the same message multiple
times. Why?
Answer:
Likely causes: offsets aren't being committed (or auto-commit is misused), the consumer crashes or rebalances before committing, or processing exceeds max.poll.interval.ms and triggers re-delivery — all of which lead to duplicates under at-least-once semantics.
15. How do you throttle or rate-limit a Kafka consumer?
Answer:
Use Thread.sleep(), token buckets, or async queues to delay processing. I’ve also used
max.poll.records=1 and control the poll loop frequency based on TPS budget or downstream
system limits.
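A paced poll-loop sketch along those lines (kafka-python assumed; the ~100 records/sec budget and process() are illustrative):
import time
from kafka import KafkaConsumer

consumer = KafkaConsumer('clicks', bootstrap_servers='localhost:9092',
                         group_id='throttled-readers', enable_auto_commit=False,
                         max_poll_records=1)
PAUSE = 1 / 100                                 # ~100 records per second
while True:
    for tp, msgs in consumer.poll(timeout_ms=500).items():
        for msg in msgs:
            process(msg.value)                  # placeholder for downstream work
            consumer.commit()
            time.sleep(PAUSE)                   # crude pacing; swap in a token bucket if needed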
16. How do you ensure safe offset commits in batch processing pipelines?
Answer:
Commit only after the whole batch is processed, and commit the offset of the record after the last one handled:
from kafka.structs import TopicPartition, OffsetAndMetadata  # kafka-python; OffsetAndMetadata(offset, metadata)

for tp, batch in consumer.poll(timeout_ms=1000).items():
    process_batch(batch)                                      # handle the full batch first
    last_offset = batch[-1].offset
    consumer.commit({tp: OffsetAndMetadata(last_offset + 1, None)})
1. How do you decide the number of partitions for a new topic?
Answer:
I evaluate expected throughput (in MB/s or events/sec), consumer parallelism, and fault
tolerance. A good starting point is:
#Partitions = Max consumer threads needed
For example, for a user activity stream processing ~50K events/sec and needing 5 parallel
consumers, I’d go with 10–15 partitions for headroom. Also consider broker limits — too
many partitions impact memory and controller performance.
2. Scenario: You designed a topic with 50 partitions, but only 5 consumers are reading.
What’s the impact?
Answer:
All 50 partitions are still consumed — each consumer is assigned roughly 10 — but parallelism is capped at 5, so per-consumer load is high and lag builds up under heavy traffic. In one project, I solved this by horizontally scaling the consumer group (up to the partition count) or splitting the work across multiple consumer groups for different pipelines.
3. How do you avoid hot partitions and data skew when choosing keys?
Answer:
I use good partition key design — usually a hash of composite keys like user_id + region or a
round-robin strategy if ordering isn't required. I’ve faced a case where all records with null
keys were routed to partition 0 — we fixed it by ensuring every event had a non-null
sharding key.
4. Explain how log compaction influences topic design. When would you prefer it?
Answer:
Log compaction retains the latest record per key — perfect for "current state" data like user
preferences, configs, or sensor readings. I used it in a Kafka topic feeding a lookup table in
Redis. It drastically reduced storage while ensuring we only kept the most recent value.
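A sketch of creating such a compacted topic with kafka-python's admin client (topic name, partition count, and config values are illustrative):
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers='localhost:9092')
admin.create_topics([NewTopic(
    name='user-profiles',
    num_partitions=12,
    replication_factor=3,
    topic_configs={'cleanup.policy': 'compact',
                   'min.cleanable.dirty.ratio': '0.1'}   # compact more eagerly
)])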
5. How do you design topics for a multi-tenant platform?
Answer:
Avoid creating one topic per tenant — it explodes the topic count and kills performance.
Instead, I use a shared topic with a tenant_id inside the message and optionally use Kafka
Streams to route to tenant-specific downstream logic.
6. Scenario: You need to retain 3 months of transaction data per customer for audit.
What’s your topic design?
Answer:
• Topic: customer-transactions
• Key: customer_id, so each customer's transactions stay ordered within a partition
• retention.ms set to ~90 days with cleanup.policy=delete, sized against expected volume
• Replication factor 3, since audit data can't be lost
7. Why should you avoid too many small topics or partitions in Kafka?
Answer:
Each partition uses file handles, heap space, and increases metadata load on the controller.
We hit GC pressure and controller instability at ~50K partitions. Kafka is fast, but not a key-
value store substitute — use DBs for fine-grained storage.
8. What topic naming conventions do you follow and why?
Answer:
I follow a naming convention like:
<env>.<domain>.<service>.<entity>.<event_type>
Example: prod.payments.gateway.transaction.created
This helps enforce governance, ACLs, and observability. I also integrate topic creation with
Terraform and GitOps for version control.
9. Coding: How do you route a record to a specific partition with the Python producer?
Answer:
# kafka-python: send() accepts an explicit partition number (producer, topic, key, and payload are illustrative)
producer.send('tenant-events', key=b'tenant-42', value=payload, partition=2)
producer.flush()
This ensures manual routing to partition 2. Useful when we want dedicated partition per
tenant or category.
10. What are the pros/cons of splitting events into multiple topics vs using a single topic
with event-type field?
Answer:
• Multiple topics — pros: better isolation and ACL control; cons: harder to scale and manage as the topic count grows.
• Single topic + event_type field — pros: easier to scale and share consumption; cons: consumers must filter irrelevant events and access control is coarser.
1. How is Kafka Streams different from Spark Streaming or Flink?
Answer:
Kafka Streams is client-side and library-based, and doesn't require a separate cluster — unlike Spark/Flink, which need distributed infrastructure. It's great for lightweight, microservice-style event processing embedded directly in the application.
2. Scenario: You’re building a deduplication pipeline using Kafka Streams. How would you
design it?
Answer:
I use a state store to cache seen keys and filter out duplicates with a stateful transformer, where the transformer keeps a local store with a retention window. This approach
works well when window size is small and memory is manageable.
3. How does Kafka Streams manage state internally, and how do you ensure it’s fault-
tolerant?
Answer:
Kafka Streams uses RocksDB-backed local state stores that are backed up to changelog
topics. During failover, it restores state from these changelog topics; enabling standby replicas (num.standby.replicas) keeps a warm copy on another instance and shortens that restore window.
4. What are repartition topics and when are they triggered in Kafka Streams?
Answer:
Repartition topics are internally created topics when a key-changing operation (e.g. map,
flatMap, groupBy) requires reshuffling data across partitions. If not pre-partitioned
correctly, Kafka Streams inserts a repartition step automatically. I always pre-key the topic
to avoid unnecessary repartition overhead.
5. What windowing options does Kafka Streams offer, and where have you used them?
Answer:
Windowing buckets events by event-time. I’ve used tumbling windows for 5-minute
aggregations of user clicks and sliding windows for real-time fraud detection (e.g., 10-
second sliding window to track login attempts). The API makes it declarative:
.groupByKey()
.windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
.count()
6. What’s the difference between KTable and GlobalKTable? When would you prefer
each?
Answer:
• KTable: Partitioned and co-located — efficient if both streams are keyed similarly.
• GlobalKTable: Fully replicated to all instances — useful for small reference data
joins.
In one project, we used a GlobalKTable to join transactions with merchant metadata (~5 MB of JSON) that was replicated to every Streams instance.
7. Scenario: You’re getting high RocksDB disk I/O in Kafka Streams app. How do you
handle it?
Answer:
RocksDB can be I/O intensive under pressure. I:
• Increased cache.max.bytes.buffering to reduce flushes into RocksDB
• Tuned state.cleanup.delay.ms so unused local state isn't purged and rebuilt too aggressively
8. Can you perform joins across topics in Kafka Streams? How do you handle out-of-order
data?
Answer:
Yes, using join, leftJoin, or outerJoin across KStream-KStream, KStream-KTable, or KTable-
KTable.
To handle late events, we define a grace period:
.windowedBy(TimeWindows.of(Duration.ofMinutes(10)).grace(Duration.ofMinutes(2)))
Late events within grace are accepted; beyond that, they’re dropped.
9. How do you materialize an aggregation into a named, queryable state store?
Answer:
.groupByKey()
.aggregate(
    initializer,
    aggregator,
    Materialized.as("my-agg-store"))   // store name is illustrative
I use it when I want to query the store later via the Interactive Queries API.
10. How do you version schema in a Kafka Streams app using Avro or Protobuf?
Answer:
We use Confluent Schema Registry and enable schema evolution via compatibility modes. I
enforce schema validation in the Streams app to fail early on incompatible schemas, and
document schema changes clearly before rollout to avoid runtime errors.
11. Scenario: You deployed a Kafka Streams app, and one instance is stuck processing a
large key group. What's your fix?
Answer:
That's a key skew problem. I resolved it by salting the hot keys (appending a bounded suffix) so they spread across several partitions, re-aggregating the salted results downstream, and giving that instance more headroom while the backlog drained.
12. How do you use ksqlDB to join a stream and a table in real-time?
Answer:
SELECT p.user_id, p.amount, u.name   -- columns are illustrative
FROM purchases p
JOIN user_info u
ON p.user_id = u.user_id
EMIT CHANGES;
This powers real-time dashboards. Behind the scenes, ksqlDB uses stream-table join logic
similar to Kafka Streams.
13. What’s the internal flow of a Kafka Streams application when it starts up?
Answer:
On startup, the application builds its topology, creates the underlying producer/consumer clients, joins the consumer group, receives its task (partition) assignment, restores local state stores from their changelog topics, and only then starts polling and processing records.
14. How do you get exactly-once processing in Kafka Streams?
Answer:
Enable the processing guarantee in the Streams configuration:
processing.guarantee=exactly_once_v2
This uses Kafka transactions, stores offsets and state updates atomically. I also ensure
downstream systems are idempotent or transactional (like Kafka + JDBC Sink with upserts).
15. Coding: Write a simple Kafka Streams pipeline to count hashtags per minute.
builder.stream("tweets", Consumed.with(Serdes.String(), Serdes.String()))   // topic name illustrative
    .flatMapValues(text -> Arrays.asList(text.split(" ")))
    .filter((k, word) -> word.startsWith("#"))
    .selectKey((k, hashtag) -> hashtag)      // key by hashtag so groupByKey works
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
    .count()
    .toStream()
    .to("hashtag-counts",
        Produced.with(WindowedSerdes.timeWindowedSerdeFrom(String.class), Serdes.Long()));
This was part of a real Twitter sentiment pipeline I built — works well for time-bucketed
aggregations.
1. Why do you need a Schema Registry with Kafka?
Answer:
Without a schema registry, producers and consumers can silently break due to incompatible
data formats. Schema Registry enforces contract-first data validation, version control, and
compatibility checks. I’ve used it to avoid failures where a downstream consumer crashed
due to a missing field — the schema registry flagged it early during dev.
2. Scenario: You added a new field to your Avro schema. Producer works, but consumer
fails. What went wrong?
Answer:
Likely, the compatibility mode wasn't respected. If the consumer is expecting the old
schema and the new field has no default, deserialization fails. In our setup, we enforce BACKWARD compatibility and require defaults on newly added fields, which prevents exactly this.
3. How do different compatibility types work in Schema Registry? When would you use
each?
Answer:
• BACKWARD: consumers on the new schema can still read data written with the old schema (upgrade consumers first)
• FORWARD: consumers on the old schema can read data written with the new schema (upgrade producers first)
• FULL: compatible in both directions
• NONE: no compatibility checks
For long-lived pipelines, I always stick with BACKWARD. For short-term ingestion, NONE
might be OK during exploration.
4. What happens under the hood when a schema is registered for a Kafka topic?
Answer:
When a producer with a schema-aware serializer sends data:
1. The serializer looks the schema up in its local cache; if missing, it registers (or fetches) it under the subject (e.g. <topic>-value) and receives a schema ID.
2. The record is written with a small header — a magic byte plus the 4-byte schema ID — followed by the serialized payload.
3. Consumers read the ID from the header and fetch the matching schema from the registry to deserialize.
5. Scenario: Your team uses Protobuf and another uses Avro. Can both work with Schema
Registry?
Answer:
Yes. Schema Registry supports Avro, Protobuf, and JSON Schema. In a multi-team
environment, we maintained compatibility by:
6. How do you handle schema evolution when the downstream system is a JDBC sink
connector?
Answer:
We ensure:
• New fields are added with defaults (nullable), so the sink's auto.evolve can add columns safely
• Column reordering doesn't affect the schema (since the JDBC sink maps by name, not order)
7. How do you register and fetch schemas programmatically via REST? Show an example.
Answer:
# Register
curl -X POST http://localhost:8081/subjects/user-value/versions \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"schema":
"{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}"}'
# Fetch latest
curl http://localhost:8081/subjects/user-value/versions/latest
I’ve wrapped this into a GitOps CI job to validate every schema push via REST before
deployment.
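The same two calls from Python with requests (URLs and the user-value subject are assumed to match the curl example above):
import json, requests

schema = {"type": "record", "name": "User",
          "fields": [{"name": "id", "type": "string"}]}
resp = requests.post(
    "http://localhost:8081/subjects/user-value/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(schema)}),
)
print(resp.json())        # e.g. {"id": 1} — the registered schema ID
latest = requests.get("http://localhost:8081/subjects/user-value/versions/latest").json()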
8. What is subject naming strategy and how can it impact schema reuse across topics?
Answer:
It determines the subject a schema is registered under:
• TopicNameStrategy (default): <topic>-key / <topic>-value
• RecordNameStrategy: the fully qualified record name, independent of topic
• TopicRecordNameStrategy: <topic>-<record name>
We use RecordNameStrategy when we want to share a schema across multiple topics. This
avoids duplication and keeps compatibility checks centralized.
9. Can Schema Registry enforce contracts between microservices beyond Kafka itself?
Answer:
Yes — for services producing/consuming Kafka messages, schema registry ensures they’re
speaking the same language. I’ve used it in REST-to-Kafka microservices:
• The REST service validates and serializes the payload against the registered schema
• Pushes to Kafka
• The consumer decodes using Schema Registry
This enabled loosely coupled services with
strong contracts.
10. How do you test schema compatibility in CI/CD pipelines before deploying to Kafka?
Answer:
We use Confluent's Maven plugin or the Schema Registry REST API in Jenkins to check every proposed schema against the latest registered version before deploying, e.g. (schema payload elided):
curl -X POST http://localhost:8081/compatibility/subjects/user-value/versions/latest \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"schema": "..."}'
Kafka Connect
1. What is Kafka Connect and why use it instead of custom producers/consumers?
Answer:
Kafka Connect is a framework for scalable, fault-tolerant data integration. Instead of
writing custom ingestion logic, you just configure connectors. It handles offset tracking,
retries, scaling, and parallelism. In one of our use cases, we replaced flaky ingestion scripts
with JDBC + S3 sink connectors and reduced codebase complexity by 70%.
2. Scenario: You notice duplicate records in your sink system. Kafka topic has no
duplicates. What’s going wrong?
Answer:
This usually happens when:
• Offset commits happen before the actual DB insert, and a failure in between causes replay
We fixed it by committing only after successful writes and making the sink writes idempotent (upsert mode keyed on the primary key).
3. How does Kafka Connect track offsets for source connectors?
Answer:
Kafka Connect stores offsets in an internal topic (connect-offsets). Every source connector maintains its offset per task. These offsets are:
• Committed periodically (controlled by offset.flush.interval.ms)
• Used to resume from the last position after a restart or task rebalance
4. You want to ingest 10 MySQL tables using Kafka Connect. What are your options?
Answer:
• A JDBC Source Connector with a table whitelist, using incrementing or timestamp+incrementing mode — simple, but polling-based
• A log-based CDC connector such as Debezium for MySQL, which streams changes from the binlog with lower latency and captures deletes
Each table maps to its own topic, and tasks.max controls parallelism across tables.
5. How do you modify records in flight in Kafka Connect without writing a full connector?
Answer:
Using Single Message Transforms (SMTs). For anything beyond that, I’ve written custom
SMTs in Java. For example, we needed to:
...
6. Scenario: Your S3 Sink Connector is dumping one file per record. What’s the issue?
Answer:
This is usually caused by a misconfigured flush size / rotation interval. We fixed it by tuning flush.size and rotate.interval.ms so records are buffered into larger objects before upload. Batch-level compression (gzip or snappy) also reduced S3 storage cost by ~60%.
8. What are distributed vs standalone modes in Kafka Connect? When do you use each?
Answer:
• Standalone: single node, offsets kept in a local file — best for dev or quick one-off jobs
• Distributed: a cluster of workers with configs, offsets, and status stored in internal Kafka topics and managed over REST — the default for production, since tasks fail over between workers
9. Coding: How would you configure a Kafka JDBC Sink Connector to write to a Postgres
table using the key as primary key?
Answer:
"name": "pg-sink",
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"topics": "user-activity",
"connection.url": "jdbc:postgresql://host:5432/db",
"insert.mode": "upsert",
"pk.mode": "record_key",
"pk.fields": "user_id",
"auto.create": true,
This setup ensures idempotent writes and automatic schema evolution if Avro is used.
10. What’s the lifecycle of a Kafka Connect Source task? How is fault recovery handled?
Answer:
1. The worker starts the task and calls start() with its configuration
2. The task polls the source system for new records
3. Returns a batch of SourceRecords to the worker
4. Applies SMTs
5. Converts (serializes) the records and produces them to Kafka
6. Commits offsets
If a task crashes, it is rebalanced to another worker. We’ve added
errors.log.enable=true and errors.tolerance=all to isolate bad records without
crashing the connector.
1. What do you prioritize when monitoring a Kafka cluster in production?
Answer:
I prioritize:
• Consumer lag per group and partition
• UnderReplicatedPartitions and ISR shrink/expand rate
• Broker request latency, disk usage, and network throughput
2. Scenario: Your Kafka consumer lag alert fired at 2 AM. How do you troubleshoot it?
Answer:
First check whether lag is still growing or flat, whether consumers are alive and stable in the group (frequent rebalances?), whether a downstream dependency is slow or down, and whether there was a traffic spike on the topic. Based on that, I scale consumers out, fix the downstream issue, or divert problem messages to a retry/DLQ path.
3. How do you monitor Kafka Connect? What metrics do you care about?
Answer:
I track:
• Connector and task status via the REST API (RUNNING vs FAILED)
• Task error counts and dead-letter record rates
• Source/sink throughput and offset commit latency
6. Scenario: You see spikes in UnderReplicatedPartitions every few hours. What’s your
next move?
Answer:
Possible causes:
• Network blips between brokers
• Followers falling behind during traffic spikes or long GC pauses
• Slow disks on a specific broker
I correlate the spikes with broker GC, disk I/O, and network metrics, and review replica.lag.time.max.ms before deciding whether to rebalance load or add capacity.
7. What log patterns or exceptions are red flags for Kafka producers?
Answer:
TimeoutException (batch expired waiting for the broker), NotLeaderOrFollowerException during leader elections, RecordTooLargeException, and buffer-exhausted errors are the usual red flags — they point at broker pressure, metadata churn, or misconfigured batching and size limits.
10. Coding: How do you expose Kafka consumer lag metrics using Prometheus in a Spring
Boot app?
Answer (Java/Spring):
Register the consumer factory as a @Bean with manual commits:
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
and bind the Kafka client metrics to Micrometer (e.g. with KafkaClientMetrics) in another @Bean.
This exposes the native Kafka client metrics (including records-lag-max) via Actuator, scraped by Prometheus.
1. How do you enable SSL encryption between Kafka clients and brokers?
Answer:
SSL encryption in Kafka ensures data in transit is protected. We enable it on the broker side
by setting:
listeners=SSL://:9093
security.inter.broker.protocol=SSL
ssl.keystore.location=/path/to/keystore.jks
ssl.keystore.password=keystore-password
ssl.key.password=key-password
ssl.truststore.location=/path/to/truststore.jks
ssl.truststore.password=truststore-password
Benefits: data in transit is encrypted, and certificates let brokers and clients authenticate each other.
In production, we’ve used mutual TLS to ensure that both brokers and clients authenticate
each other.
2. Scenario: Producers fail the SSL handshake with the broker. What do you check?
Answer:
• Cipher mismatch: Ensure both producer and broker are using compatible SSL/TLS
versions and ciphers.
3. How do you implement SASL/PLAIN authentication in Kafka?
Answer:
To implement SASL/PLAIN for Kafka, set the following in the server.properties for brokers:
security.inter.broker.protocol=SASL_PLAINTEXT
sasl.enabled.mechanisms=PLAIN
sasl.mechanism.inter.broker.protocol=PLAIN
listeners=SASL_PLAINTEXT://:9093
And on the client side:
security.protocol=SASL_PLAINTEXT
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required
username="your-username" password="your-password";
4. Scenario: You’re asked to enable Kafka ACLs for controlling access. How do you
implement role-based access control?
Answer:
Kafka uses ACLs for controlling access to topics, consumer groups, and more. First, enable
ACLs in server.properties:
authorizer.class.name=kafka.security.auth.SimpleAclAuthorizer
Then, create ACLs for different roles (e.g., admin, producer, consumer):
kafka-acls.sh --bootstrap-server localhost:9092 --add --allow-principal User:svc-producer --operation Write --topic data-topic
kafka-acls.sh --bootstrap-server localhost:9092 --add --allow-principal User:svc-consumer --operation Read --topic data-topic --group order-processors
5. How do you monitor security-related events in Kafka (e.g., failed logins, ACL violations)?
Answer:
Kafka logs events related to authentication and authorization under the kafka.server
category. To monitor security events:
1. Turn up authorizer and authentication logging (the kafka.authorizer.logger in log4j.properties) so denied operations and failed logins are recorded.
2. Use Filebeat or Fluentd to send logs to ELK stack for real-time monitoring.
We’ve set up alerts using Prometheus + Grafana to notify us of failed logins or unauthorized
access attempts based on error logs.
6. Scenario: A producer suddenly gets authorization errors on a topic. How do you debug it?
Answer:
• Missing ACL: The producer doesn’t have write access to the topic. I’d check the ACLs
with kafka-acls to verify the permissions.
In our case, the issue was due to missing Write permission for a producer on the data-topic.
After adding the necessary ACL, the issue was resolved.
7. How do you configure Kerberos (SASL/GSSAPI) authentication for Kafka?
Answer:
To enable Kerberos for Kafka, you need to:
1. Configure brokers for the GSSAPI mechanism (sasl.enabled.mechanisms=GSSAPI, sasl.kerberos.service.name=kafka) with:
security.inter.broker.protocol=SASL_PLAINTEXT
2. Set up JAAS configuration for both producers and consumers to authenticate using
Kerberos tickets.
3. Ensure Kerberos credentials are available for both Kafka brokers and clients by using
a keytab file.
In one of our projects, we migrated from SASL/PLAIN to Kerberos for increased security
compliance with multi-tenant systems.
8. How do you configure Kafka for cross-cluster authentication and data replication
securely?
Answer:
Kafka supports SSL for securing cross-cluster replication:
1. Enable SSL listeners on both clusters.
2. Configure the replication tool (e.g. MirrorMaker 2) with client-side truststore/keystore settings.
3. Use mutual TLS so each cluster authenticates the replicating clients via certificates.
4. Use ACLs to ensure that only authorized clusters can push/pull data.
We implemented this for a multi-region setup, where replication was secured using SSL and
client certs, and we restricted access using ACLs.
9. Scenario: Your Kafka producers are facing issues with SSLProtocolException. How do
you debug this error?
Answer:
1. Check broker SSL config: Ensure the broker has the correct SSL/TLS version and
cipher suites configured (ssl.protocol=TLSv1.2).
2. Check certificates: expired certs or hostname-verification mismatches are a common cause.
3. Producer SSL Config: Ensure the producer has the correct ssl.truststore.location and
ssl.keystore.location.
10. How do you secure the communication between Kafka clients (producers/consumers)
and brokers in a cloud environment?
Answer:
In a cloud environment, I use:
• SASL for authentication (e.g., SASL/PLAIN or Kerberos depending on the use case).
• ACLs to control which users and services can access which Kafka topics or consumer
groups.
• VPC Peering / Private Link: Ensure Kafka brokers are only accessible within a private
network for internal communication.
In one case, we had Kafka brokers deployed in an AWS VPC, and all client access was
secured via AWS PrivateLink — avoiding public internet exposure.
1. How did you tune producer settings for throughput without giving up delivery guarantees?
Answer:
acks=all
enable.idempotence=true
batch.size=65536
This reduced per-record latency in our payments pipeline by 40% without risking duplicates
or ordering violations.
2. Scenario: Kafka consumers are slow despite zero lag. What would you check?
Answer:
If lag is zero but processing feels slow, the bottleneck is usually after the poll: slow downstream calls, expensive deserialization, small fetch/poll sizes (fetch.min.bytes, max.poll.records), or GC pauses in the consumer. I measure poll-to-commit time rather than lag.
3. How does partition count affect throughput and scalability?
Answer:
More partitions = more parallelism = better consumer scalability and write throughput.
BUT — too many partitions increase open file handles, controller and leader-election overhead, replication traffic, and rebalance times.
4. How do you tune Kafka for large message sizes (e.g., 10 MB payloads)?
Answer:
• Increase the size limits end to end:
message.max.bytes=10485760          (broker)
replica.fetch.max.bytes=10485760    (broker replication)
max.request.size=10485760           (producer)
max.partition.fetch.bytes=10485760  (consumer; replaces the older fetch.message.max.bytes)
7. Which broker metrics do you watch for performance issues?
Answer:
• BytesInPerSec, BytesOutPerSec
• UnderReplicatedPartitions
8. What does linger.ms control and how do you tune it?
Answer:
It controls how long the producer waits to batch messages before sending. Keeping it at 0 minimizes latency; a small positive value (we used ~20 ms elsewhere) lets batches fill up and improves throughput.
9. Which broker thread settings do you tune for throughput?
Answer:
num.network.threads=3
num.io.threads=8
Increase based on broker CPU & disk I/O. We tuned to 16 IO threads during a log replay
spike for smoother throughput.
12. What are some overlooked settings that can boost Kafka performance?
Answer:
Often-overlooked knobs include OS/socket buffer sizes, num.replica.fetchers, topic-level compression, and pairing batch.size with linger.ms properly — plus leaving enough free memory for the OS page cache.
13. How do you reduce consumer lag in near real-time analytics pipelines?
Answer:
• Increase max.poll.records and fetch sizes so each poll does more work
• Scale consumers out to match the partition count
• Keep downstream processing asynchronous or batched so the poll loop isn't blocked
14. Coding: How to configure a high-throughput producer in Python with optimal settings?
Answer:
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    batch_size=65536,
    compression_type='snappy',
    acks='all',
    buffer_memory=67108864,  # 64MB
    retries=5
)
16. What does replica.fetch.max.bytes control?
Answer:
Controls how much data a follower can fetch per request.
Higher = fewer requests, better replication throughput.
We increased this to 1MB in one use case where replication was lagging on large messages.
17. Scenario: Producer throughput is fine but consumers are randomly stuck. What could
cause it?
Answer:
Possible causes:
• Consumer GC pauses causing missed heartbeats and rebalances
• Slow batches exceeding max.poll.interval.ms, kicking the consumer out of the group
• A partition stuck behind a poison-pill message that keeps failing
18. How does Kafka handle disk I/O pressure and what can you do about it?
Answer:
Kafka leans on the OS page cache and sequential appends, so under disk pressure you tune how data is flushed and how segments roll:
log.flush.interval.messages
log.segment.bytes
log.flush.scheduler.interval.ms
We added SSDs to high-traffic brokers and moved log directories using RAID 10 for IOPS
optimization.
20. How do you benchmark a Kafka setup before going to production?
Answer:
• Run synthetic load with production-like batch size, key distribution, and volume
• Use kafka-producer-perf-test / kafka-consumer-perf-test to baseline throughput and latency while watching broker metrics
1. How does Kafka achieve exactly-once semantics?
Answer:
Kafka achieves exactly-once semantics using the idempotent producer and transactional
APIs. The idempotent producer ensures that resending the same message doesn't result in
duplicates, while transactional APIs allow grouping multiple operations into a single atomic
transaction. This combination ensures that messages are processed exactly once, even in
the event of failures.
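A transactional-producer sketch with the confluent-kafka client (assumed to be available; topic names and the transactional.id are illustrative):
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'localhost:9092',
    'enable.idempotence': True,
    'transactional.id': 'payments-tx-1',
})
producer.init_transactions(10)
producer.begin_transaction()
producer.produce('payments', key=b'ORD-1', value=b'{"amount": 100}')
producer.produce('payments-audit', key=b'ORD-1', value=b'{"amount": 100}')
producer.commit_transaction()   # both records become visible to read_committed consumers atomically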
2. Scenario: You need to process messages in the order they arrive for a specific key. How
would you design your Kafka consumer setup?
Answer:
• Partitioning Strategy: Ensure that all messages with the same key are sent to the
same partition. Kafka maintains order within a partition.
• Single Consumer per Partition: Assign one consumer to process messages from the
partition to maintain order.
3. How do you handle schema evolution in Kafka, and what role does the Schema Registry
play?
Answer:
Schema evolution is managed through the Schema Registry: each subject keeps versioned schemas, compatibility modes (BACKWARD/FORWARD/FULL) are enforced at registration time, and new fields carry defaults so existing consumers keep working while producers roll out the new version.
4. Write a Java code snippet for a Kafka producer that includes custom partitioning logic.
Answer:
import org.apache.kafka.clients.producer.*;
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("partitioner.class", "com.example.CustomPartitioner"); // class implementing Partitioner

Producer<String, String> producer = new KafkaProducer<>(props);
ProducerRecord<String, String> record = new ProducerRecord<>("orders", "key-1", "value-1"); // topic/key/value illustrative
producer.send(record);
producer.close();
5. What are Kafka Streams' state stores, and how do they enhance stream processing?
Answer:
State stores in Kafka Streams are used to maintain stateful information across stream
processing operations. They allow:
• Stateful Operations: aggregations, joins, windowing, and sessionization that must remember earlier records.
• Fault Tolerance: State stores are backed by changelogs in Kafka topics, ensuring
state recovery in case of failures.
State stores enable complex processing patterns like sessionization and real-time analytics.
6. Scenario: Your Kafka consumers are experiencing frequent rebalances. What could be
causing this, and how do you mitigate it?
Answer:
Frequent rebalances usually come from consumers exceeding max.poll.interval.ms (slow processing), session timeouts caused by GC pauses or network blips, or instances repeatedly joining and leaving during deployments. Mitigations:
• Tune max.poll.records, max.poll.interval.ms, and session.timeout.ms
• Use static group membership (group.instance.id) so restarts don't trigger rebalances
• Monitor and Scale: Use monitoring tools to detect issues and scale consumers
appropriately.
7. What does log.segment.bytes control and what are the trade-offs of tuning it?
Answer:
log.segment.bytes defines the size of a single log segment file in bytes. Adjusting this
parameter impacts:
• Disk I/O: Smaller segments lead to more frequent file creation and potential I/O
overhead, while larger segments reduce this but may delay log compaction and
deletion.
• Log Retention: Larger segments can delay the deletion of old data if retention is
based on segment files.
• Recovery Time: Smaller segments can speed up recovery time after a failure as there
are fewer messages to replay.
8. How does Kafka's exactly-once processing differ from at-least-once and at-most-once
semantics?
Answer:
• At-Most-Once: Messages are delivered at most once — no duplicates, but possible loss (offsets committed before processing).
• At-Least-Once: Messages are delivered one or more times, ensuring no loss but
potential duplicates.
• Exactly-Once: Messages are delivered once and only once, ensuring no loss and no
duplicates. Kafka achieves this using idempotent producers and transactional APIs.
9. Scenario: You need to integrate Kafka with a legacy system that only supports RESTful
APIs. How would you achieve this?
Answer:
• Kafka REST Proxy: Use Confluent's Kafka REST Proxy to produce and consume
messages over HTTP.
• Custom Middleware: Develop a middleware service that interfaces with Kafka and
exposes RESTful endpoints for the legacy system.
10. What happens when you set acks=0 on the producer?
Answer:
Setting acks=0 means the producer does not wait for any acknowledgment from the broker
before considering the message sent. This results in:
• High Throughput: Reduced latency as the producer doesn't wait for responses.
• Risk of Data Loss: Messages may be lost if the broker fails before writing the
message to disk.
This setting is suitable for scenarios where performance is prioritized over durability.
11. How do you monitor Kafka's consumer lag, and why is it important?
Answer:
Consumer lag is the difference between the latest offset and the consumer's committed
offset. Monitoring lag is crucial because:
• Affects Real-Time Processing: Ensuring low lag is essential for real-time applications.
Tools like Confluent Control Center, Burrow, and Kafka Manager can monitor consumer
lag.
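A lag-calculation sketch with kafka-python (the group ID is an assumption; real setups usually rely on Burrow or exporter tools instead):
from kafka import KafkaConsumer
from kafka.admin import KafkaAdminClient

admin = KafkaAdminClient(bootstrap_servers='localhost:9092')
committed = admin.list_consumer_group_offsets('order-processors')   # {TopicPartition: OffsetAndMetadata}

consumer = KafkaConsumer(bootstrap_servers='localhost:9092')
end_offsets = consumer.end_offsets(list(committed.keys()))          # latest offset per partition

for tp, meta in committed.items():
    print(f"{tp.topic}[{tp.partition}] lag = {end_offsets[tp] - meta.offset}")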
13. How does Kafka handle backpressure when consumers are slower than producers?
Answer:
• Flow Control: Consumers can control the rate of data consumption by adjusting
parameters like fetch.min.bytes and fetch.max.wait.ms.
• Scaling Consumers: Adding more consumer instances to a consumer group can help
distribute the load and reduce lag.
In practice, we implemented dynamic consumer scaling based on lag metrics, which helped
in maintaining optimal processing rates during traffic spikes.
14. Scenario: A Kafka topic has 10 partitions, but your consumer group has only 5
consumers. How are the partitions assigned?
Answer:
Kafka assigns partitions to consumers in a consumer group such that each consumer can
handle multiple partitions:
• Assignment: with the default assignor, each of the 5 consumers is assigned 2 partitions.
• Rebalancing: If consumers join or leave the group, Kafka will rebalance the partitions
among the available consumers.
15. How do you implement a dead-letter queue (DLQ) in Kafka to handle message
processing failures?
Answer:
• DLQ Topic: Create a dedicated dead-letter topic (e.g. orders-dlq).
• Error Handling Logic: In the consumer application, catch exceptions during message
processing and produce the failed messages to the DLQ topic.
• Monitoring: Set up monitoring and alerts on the DLQ to promptly address issues
causing message failures.
In our system, implementing a DLQ allowed us to isolate and analyze problematic messages
without affecting the main processing flow.
16. Explain the role of ZooKeeper in Kafka and the impact of its removal in newer Kafka
versions.
Answer:
ZooKeeper historically handled controller election, broker registration, and cluster metadata. With the introduction of KRaft (Kafka Raft), Kafka removes that dependency by managing metadata through an internal Raft quorum of controllers, simplifying the architecture and reducing operational complexity.
17. Scenario: You need to reprocess a topic from the beginning without disturbing existing consumers. How?
Answer:
• Create a New Consumer Group: Start a new consumer group with a unique group ID
to read from the beginning of the topic without impacting existing consumers.
• Seek to Beginning: Use the seekToBeginning method to reset the offset to the start
of the topic.
18. What are the trade-offs between using Kafka as a message queue versus a publish-
subscribe system?
Answer:
Trade-offs: a queue-style setup (a single consumer group) gives simple work distribution, but each message is processed by only one consumer; a pub-sub setup (multiple groups) lets many independent consumers read the same data at the cost of tracking more offsets and lag.
In our architecture, we used the Pub-Sub model for real-time analytics and the queue model
for task processing, balancing scalability and simplicity.
19. How do you handle schema evolution in a Kafka Streams application?
Answer:
• Schema Registry Integration: Use Confluent Schema Registry to manage and version
schemas.
In our Kafka Streams application, integrating Schema Registry allowed seamless schema
evolution without downtime.
20. Scenario: Your Kafka Streams application is experiencing high latency. What steps
would you take to diagnose and resolve the issue?
Answer:
• Monitor Metrics: Track Kafka Streams metrics like process-rate, punctuate-rate, and
commit-latency.
• Optimize State Stores: Ensure RocksDB state stores are configured optimally,
considering factors like cache size and compaction settings.
In one instance, optimizing RocksDB configurations reduced our application latency by 30%.
21. Explain the concept of log compaction in Kafka and its use cases.
Answer:
Log compaction in Kafka is a mechanism that ensures the retention of at least the latest
value for each key within a topic, even if older records exceed the configured retention time
or size. This process selectively removes obsolete records, retaining the most recent update
for each key.
Use Cases:
• Change Data Capture (CDC): Capturing changes from databases where only the
latest state of a record is relevant.
22. Scenario: You have a compacted topic, but consumers are still seeing multiple records
for the same key. Why might this occur?
Answer:
Compaction isn't immediate: the log cleaner never touches the active segment and only runs once the dirty ratio (min.cleanable.dirty.ratio) or min.compaction.lag.ms thresholds are met. Until a segment is cleaned, consumers still see older values for a key alongside the latest one.
23. How does Kafka ensure data integrity during log compaction?
Answer:
• Offset Preservation: Offsets are immutable; even after compaction, the offset for a
given record remains the same, ensuring consumers can maintain their position
accurately.
• Atomic Segment Replacement: Compacted segments are created as new files and
atomically swapped with the old segments, ensuring that consumers always see a
consistent log.
25. Scenario: After enabling log compaction, you notice increased disk I/O. What could be
the cause and how do you mitigate it?
Answer:
Increased disk I/O can result from the log cleaner threads actively compacting segments. To mitigate:
• Throttle the cleaner with log.cleaner.io.max.bytes.per.second
• Raise min.cleanable.dirty.ratio or min.compaction.lag.ms so cleaning runs less aggressively
• Monitor Resource Usage: Use monitoring tools to observe disk I/O patterns and
adjust configurations accordingly.
26. How does log compaction affect consumer offsets?
Answer:
Log compaction does not affect the sequential nature of offsets. Even if records are
removed during compaction, their offsets remain valid. Consumers can continue from their
last committed offset without disruption, though they may encounter gaps where records
have been compacted away.
27. Can log compaction be used alongside time-based retention policies? If so, how?
Answer:
Yes, Kafka allows combining log compaction with time-based retention by setting the
cleanup.policy to compact,delete. This configuration enables Kafka to retain the latest
record for each key (compaction) while also deleting records based on time or size
constraints.
28. What is a tombstone record in Kafka, and what role does it play in log compaction?
Answer:
A tombstone record is a message with a key and a null value. In log compaction, writing a
tombstone for a key indicates that the record should be deleted. During compaction, both
the tombstone and any prior records with that key are removed, effectively deleting the
record from the log.
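Producing a tombstone from Python (kafka-python; topic and key are illustrative) — the null value marks the key for deletion during compaction:
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('user-profiles', key=b'user-42', value=None)   # tombstone: delete user-42's record
producer.flush()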
29. Scenario: You need to ensure that certain records are retained longer before
compaction. How can you achieve this?
Answer:
Set min.compaction.lag.ms on the topic — a record is guaranteed to stay uncompacted for at least that long after being written, which gives consumers a minimum window to read every update before older values for the key are cleaned up.
30. How does Kafka's log compaction process handle delete operations, and what
configurations control the retention of tombstone records?
Answer:
Deletes are expressed as tombstone records (a key with a null value). During compaction, older values for that key are removed, and the tombstone itself is kept for delete.retention.ms before it becomes eligible for cleanup.
Properly configuring these parameters ensures that delete operations are handled correctly,
and tombstone records are retained long enough for consumers to acknowledge deletions
before they are permanently removed.
31. What's the difference between delete.retention.ms and log.retention.ms?
Answer:
In Kafka, both delete.retention.ms and log.retention.ms are configurations that manage the
retention of records, but they serve different purposes:
• delete.retention.ms: This setting applies to topics with the compact cleanup policy.
It specifies the duration (in milliseconds) for which tombstone records (records with
null values indicating deletions) are retained before they are eligible for removal
during log compaction. The default value is 24 hours (86400000 milliseconds). This
ensures that consumers have a window to recognize and process delete markers
before they are purged.
• log.retention.ms: This setting applies to topics with the delete cleanup policy. It
determines the maximum time (in milliseconds) that a log segment is retained
before it is discarded to free up space. Once this time elapses, the log segments are
deleted regardless of whether they have been consumed. The default value is 7 days
(604800000 milliseconds).
Understanding the distinction between these settings is crucial for configuring Kafka topics
to handle data retention and deletion appropriately based on the desired cleanup policy.
32. Scenario: You have a compacted topic with a high volume of updates and deletes. How
would you configure Kafka to ensure that delete markers (tombstones) are retained long
enough for all consumers to process them?
Answer:
To ensure that delete markers are retained long enough for all consumers to process them
in a high-throughput compacted topic, you should adjust the delete.retention.ms
configuration: set it to comfortably exceed the maximum downtime or lag of your slowest consumer group (for example several days instead of the 24-hour default), and alert on consumer lag so no group falls outside that window.
By carefully configuring these parameters and monitoring consumer behavior, you can
ensure that delete markers are retained appropriately, allowing all consumers adequate
time to process them.
33. How can you configure a Kafka topic to retain only the latest state of each key while
also ensuring that delete operations are propagated to consumers?
Answer:
To configure a Kafka topic to retain only the latest state of each key and ensure that delete
operations are propagated:
o Set cleanup.policy=compact on the topic and key every record by the entity's ID, so compaction keeps only the latest value per key.
o To propagate delete operations, produce a record with the key set to the
entity to be deleted and the value set to null. This null value acts as a
tombstone marker, indicating that the record should be deleted.
By implementing these configurations, the Kafka topic will maintain only the latest state for
each key and properly handle delete operations, ensuring consumers receive accurate and
up-to-date information.
34. What are the potential risks of setting a very low delete.retention.ms value in a
compacted Kafka topic, and how can they be mitigated?
Answer:
Setting a very low delete.retention.ms value in a compacted Kafka topic can lead to several
risks:
• Missed deletions: If tombstones are removed before all consumers have processed them, some consumers may miss the delete markers and retain outdated or incorrect state information.
• Data Inconsistency: Downstream stores rebuilt from the topic can drift from the source of truth.
Mitigation Strategies: keep delete.retention.ms well above the worst-case consumer downtime or lag, and monitor consumer lag so slow groups are caught before tombstones expire.
FREE RESOURCES
1. What is Apache Kafka?
https://www.youtube.com/watch?v=RDp33e3ttTE
https://www.youtube.com/watch?v=hNDjd9I_VGA&pp=ygUFa2Fma2E%3D
2. Kafka Tutorial
https://www.youtube.com/watch?v=tU_37niRh4U&pp=ygUFa2Fma2E%3D
3. Kafka fundamentals
https://www.youtube.com/watch?v=B5j3uNBH8X4&pp=ygUFa2Fma2E%3D
4. Kafka Project
https://www.youtube.com/watch?v=7n72snj0rqs&pp=ygUFa2Fma2E%3D
5. Kafka Architecture
https://www.youtube.com/watch?v=HUAa1Yg9NlI&pp=ygUSa2Fma2EgYXJjaGl0ZWN0dXJl
6. Kafka Security
https://www.youtube.com/watch?v=-HlRFh9GfWw&list=PLa7VYi0yPIH2t3_wc1tm1rHDO9tbtfX1T
7. Kafka Monitoring
https://www.youtube.com/watch?v=3fsquXqgb5w&pp=ygUQa2Fma2EgbW9uaXRvcmluZw%3D%3D