Sure!
Here's the full list of intermediate-level interview questions for Apache Kafka, Spark,
Hadoop, and ZooKeeper with simple one-line answers, all in one place for easy revision:
Apache Kafka
1. What is the role of a Kafka producer and consumer?
→ Producers publish messages to topics; consumers subscribe and read them (see the sketches after this list).
2. How does Kafka ensure message durability and fault tolerance?
→ Messages are stored on disk and replicated across brokers.
3. What is the purpose of partitions in Kafka topics?
→ Partitions allow parallel processing and scaling.
4. How does Kafka handle message ordering?
→ Kafka maintains order within a single partition.
5. Explain the difference between at most once, at least once, and exactly once delivery
semantics in Kafka.
→ At most once: messages may be lost; at least once: may be duplicated; exactly once: delivered exactly once.
6. What is the significance of consumer groups in Kafka?
→ Consumer groups allow load sharing and fault tolerance.
7. How does Kafka handle backpressure and slow consumers?
→ Kafka keeps data in the log; slow consumers can catch up.
8. How does Kafka achieve high throughput?
→ Kafka uses batching, compression, and sequential I/O.
9. What is Kafka's ISR (In-Sync Replicas) list?
→ The ISR is the set of replicas that are fully caught up with the partition leader.
10. Explain how Kafka handles leader election for partitions.
→ The controller broker assigns one in-sync replica as leader per partition and elects a new one from the ISR if it fails.
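To make questions 1, 4, 5 and 6 concrete, here is a minimal producer/consumer sketch using the third-party kafka-python package; the broker address, topic name and group ID are only placeholders:

    from kafka import KafkaProducer, KafkaConsumer

    # Producer: acks='all' waits for the in-sync replicas (durability);
    # messages with the same key land in the same partition, preserving their order.
    producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")
    producer.send("events", key=b"user-42", value=b"clicked")
    producer.flush()

    # Consumer: every consumer with the same group_id shares the topic's partitions.
    # Committing only after processing gives at-least-once delivery.
    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        group_id="billing",
        enable_auto_commit=False,
        auto_offset_reset="earliest",
    )
    for msg in consumer:
        print(msg.partition, msg.offset, msg.value)
        consumer.commit()
        break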
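And for 2, 3 and 9, a hedged sketch of creating a replicated topic through kafka-python's admin client; the topic name and settings are illustrative, not recommendations:

    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
    # 3 partitions for parallelism, 3 replicas for fault tolerance;
    # min.insync.replicas=2 means acks='all' writes survive one broker failure.
    topic = NewTopic(
        name="events",
        num_partitions=3,
        replication_factor=3,
        topic_configs={"min.insync.replicas": "2"},
    )
    admin.create_topics([topic])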
Apache Spark
1. What is an RDD and how is it different from a DataFrame?
→ RDD is a low-level, typed distributed collection; DataFrame adds a schema and is optimized by Catalyst, so it is usually faster.
2. Explain Spark’s execution model – DAG, stages, and tasks.
→ Spark builds a DAG of transformations, splits it into stages at shuffle boundaries, and runs each stage as parallel tasks.
3. What is lazy evaluation in Spark?
→ Transformations are only executed when an action is called.
4. What are transformations and actions in Spark?
→ Transformations lazily define new datasets; actions trigger execution and return results (see the sketches after this list).
5. What is a wide transformation vs. narrow transformation?
→ Wide needs shuffle across nodes; narrow doesn’t.
6. Explain the role of the Catalyst optimizer in Spark SQL.
→ It analyzes and rewrites query plans for better performance.
7. How does Spark handle data partitioning?
→ It splits data into chunks for parallel processing.
8. How does Spark’s memory management work?
→ Spark manages execution and storage memory dynamically.
9. What is the difference between persist() and cache()?
→ cache() is persist() with the default storage level; persist() lets you choose a level such as MEMORY_AND_DISK.
10. How would you optimize a slow-running Spark job?
→ Use caching, reduce shuffles, and balance partitions.
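Questions 2-5 in one small PySpark sketch (assumes a local Spark session; the numbers are arbitrary):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

    df = spark.range(1_000_000)                      # nothing executes yet (lazy)
    evens = df.filter(F.col("id") % 2 == 0)          # narrow transformation: no shuffle
    buckets = evens.groupBy((F.col("id") % 10).alias("bucket")).count()  # wide: shuffle

    print(buckets.count())   # action: Spark builds the DAG, splits it into stages, runs tasks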
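And for 7-10, a sketch of persist() vs cache() and partition tuning; the storage level and partition counts are illustrative:

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tuning-demo").getOrCreate()
    spark.conf.set("spark.sql.shuffle.partitions", "64")  # fewer shuffle partitions for a small job

    df = spark.range(10_000_000)
    # cache() is simply persist() with the default storage level;
    # persist() lets you pick one explicitly, e.g. spill to disk under memory pressure.
    df = df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()                              # first action materializes the persisted data

    balanced = df.repartition(32, "id")     # rebalance partitions to reduce skew before a wide op
    print(balanced.rdd.getNumPartitions())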
Apache Hadoop
1. What are the main components of Hadoop?
→ HDFS for storage, YARN for resource management, and MapReduce for processing.
2. Explain the purpose of HDFS and how data is stored in blocks.
→ HDFS stores large files in blocks across machines.
3. How does Hadoop ensure data replication and fault tolerance?
→ It replicates blocks to multiple DataNodes.
4. Explain the difference between MapReduce and Spark.
→ Spark is faster and processes in memory; MapReduce is disk-based.
5. What happens if a DataNode fails in Hadoop?
→ Data is read from replicated blocks on other nodes.
6. How does NameNode handle metadata and failover?
→ The NameNode keeps the filesystem metadata; in an HA setup a Standby NameNode takes over if the active one fails.
7. How do you tune the number of reducers in a MapReduce job?
→ Set mapreduce.job.reduces based on data size and cluster capacity (see the streaming sketch below).
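For questions 1, 4 and 7, a minimal word-count job for Hadoop Streaming; mapper.py and reducer.py are two separate illustrative files, and the reducer count would be set on the job, e.g. -D mapreduce.job.reduces=4:

    #!/usr/bin/env python3
    # mapper.py: emit "word<TAB>1" for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py: Hadoop sorts map output by key, so all counts for a word arrive together
    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")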
Apache ZooKeeper
1. What is ZooKeeper and why is it used in distributed systems?
→ It coordinates tasks and stores configuration in distributed systems.
2. What are znodes in ZooKeeper?
→ Znodes are data nodes in ZooKeeper’s tree structure.
3. How does leader election work in ZooKeeper?
→ Servers exchange votes; the one with the most up-to-date data (highest zxid) wins, with server ID as the tiebreaker.
4. What is the purpose of ephemeral and sequential znodes?
→ Ephemeral znodes are deleted when the client's session ends; sequential ones get a unique, increasing suffix.
5. How does ZooKeeper provide consistency guarantees?
→ Writes go through the leader and are committed by a quorum, so all servers apply updates in the same order.
6. Explain how ZooKeeper handles quorum and consensus.
→ Majority must agree for any update to happen.
7. What are watches in ZooKeeper and how are they used?
→ Watches notify clients when a znode's data or children change (see the sketch below).
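To tie 2, 4 and 7 together, a sketch with the third-party kazoo client; the server address and paths are placeholders:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # A plain znode holding configuration data.
    zk.ensure_path("/app")
    zk.create("/app/config", b"v1")

    # Ephemeral + sequential: removed when this session ends, and the server
    # appends a unique increasing suffix; this is the basis of the
    # "lowest sequence number wins" leader-election recipe.
    me = zk.create("/app/workers/worker-", b"", ephemeral=True, sequence=True, makepath=True)
    print("my znode:", me)

    # Watch: called when /app/config changes (kazoo re-registers it automatically).
    @zk.DataWatch("/app/config")
    def on_change(data, stat):
        print("config is now:", data)

    zk.set("/app/config", b"v2")   # triggers the watch
    zk.stop()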
Would you like this in PDF format for quick download before your interview?