Interview Questions and Answers on Apache Spark, Kafka, Airflow, and Druid
# Apache Spark
1. **What is Apache Spark, and how does it differ from Hadoop?**
- **Answer**: Apache Spark is an open-source, distributed computing system designed for fast
computation. Unlike Hadoop, which relies on MapReduce for batch processing, Spark offers
in-memory computation, making it faster for iterative tasks. Spark supports real-time stream
processing and interactive queries, unlike Hadoop's batch-only processing.
2. **Explain RDD in Spark. Why is it important?**
- **Answer**: RDD (Resilient Distributed Dataset) is the fundamental data structure in Spark,
representing an immutable, distributed collection of objects. RDDs are fault-tolerant through
lineage (lost partitions can be recomputed from their transformation history), support in-memory
computation, and expose transformations and actions.
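Lineage-based fault tolerance can be sketched with a toy class (the class and its `compute` method are illustrative only, not Spark's API): instead of replicating data, the dataset remembers how it was derived and can be recomputed on demand.

```python
# Toy sketch of RDD lineage: a dataset records the transformation that
# produced it, so a lost partition can be recomputed rather than restored
# from a replica. Not Spark's API -- a conceptual illustration.
class ToyRDD:
    def __init__(self, source, op=None):
        self.source = source   # parent data, or the parent ToyRDD
        self.op = op           # transformation used to derive this RDD

    def map(self, fn):
        return ToyRDD(self, ("map", fn))

    def compute(self):
        # Recompute from lineage on demand (Spark does this per partition).
        if self.op is None:
            return list(self.source)
        _, fn = self.op
        return [fn(x) for x in self.source.compute()]

rdd = ToyRDD([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 10)
print(rdd.compute())  # [20, 30, 40]
```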
3. **What are Spark transformations and actions? Provide examples.**
- **Answer**: Transformations (e.g., `map`, `filter`) create a new RDD from an existing one and are
lazy-evaluated. Actions (e.g., `collect`, `count`) trigger computation and return results to the driver.
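The lazy/eager split can be mimicked with plain-Python generators, no Spark cluster required (the RDD parallels are noted in comments):

```python
# A minimal sketch of lazy transformations vs. eager actions,
# mimicked with plain-Python generators.
data = range(1, 6)  # stand-in for an RDD of [1..5]

# "Transformations" build a lazy pipeline; nothing runs yet.
squared = (x * x for x in data)             # like rdd.map(lambda x: x * x)
evens = (x for x in squared if x % 2 == 0)  # like .filter(lambda x: x % 2 == 0)

# "Actions" force evaluation and return results to the caller,
# as collect() returns results to the driver.
result = list(evens)  # like .collect()
print(result)         # [4, 16]
```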
4. **How does Spark Streaming work?**
- **Answer**: Spark Streaming processes live data streams as a sequence of small batches using
DStreams (Discretized Streams), enabling near-real-time analytics. The newer Structured Streaming
API instead treats a stream as an unbounded table queried with DataFrame operations.
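The micro-batch idea can be sketched in a few lines: an unbounded stream is cut into small batches, each of which is then processed with ordinary batch logic (Spark batches by time interval; this sketch batches by count for simplicity):

```python
from typing import Iterable, Iterator, List

def micro_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group an unbounded stream into small batches, the way Spark
    Streaming discretizes a live stream into DStream batches
    (by count here; Spark batches by time interval)."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

# Each micro-batch is processed with ordinary batch logic.
totals = [sum(b) for b in micro_batches(range(10), 4)]
print(totals)  # [6, 22, 17]
```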
5. **What is the role of Spark's DAG Scheduler?**
- **Answer**: The Directed Acyclic Graph (DAG) Scheduler translates a job into stages of tasks,
splitting stages at shuffle boundaries, and handles task scheduling and recomputation after failures.
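The stage-splitting rule can be illustrated with a toy function (a deliberate simplification of what the DAG scheduler does): narrow transformations are pipelined into one stage, and each shuffle (wide) dependency starts a new stage.

```python
# Toy illustration of stage splitting: narrow transformations are
# pipelined together; a shuffle (wide) dependency starts a new stage.
# The (op, is_shuffle) encoding is ours, not Spark's internal model.
def split_into_stages(ops):
    stages, current = [], []
    for op, is_shuffle in ops:
        if is_shuffle and current:
            stages.append(current)  # close the stage at the shuffle boundary
            current = []
        current.append(op)
    stages.append(current)
    return stages

ops = [("map", False), ("filter", False), ("reduceByKey", True), ("map", False)]
print(split_into_stages(ops))  # [['map', 'filter'], ['reduceByKey', 'map']]
```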
# Apache Kafka
6. **What is Apache Kafka, and how is it used?**
- **Answer**: Apache Kafka is a distributed event-streaming platform for building real-time data
pipelines. It uses topics to organize data streams and supports high-throughput and fault tolerance.
7. **Explain the concept of Kafka topics and partitions.**
- **Answer**: Topics are named categories for data streams. Each topic is divided into partitions,
which enable parallelism and scalability; ordering is guaranteed only within a single partition.
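Keyed records are routed to partitions by hashing the key, so all records with the same key land on the same partition and preserve their order. A dependency-free sketch (Kafka's default partitioner actually uses murmur2; `crc32` stands in here):

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Sketch of Kafka's keyed partitioning: hash the key modulo the
    partition count, so the same key always maps to the same partition.
    Kafka's default partitioner uses murmur2; crc32 is a stand-in."""
    return zlib.crc32(key) % num_partitions

# Same key -> same partition, so per-key ordering is preserved.
p1 = partition_for(b"user-42", 6)
p2 = partition_for(b"user-42", 6)
print(p1 == p2)  # True
```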
8. **What is a Kafka consumer group?**
- **Answer**: A consumer group is a set of consumers that share the workload of reading a topic:
Kafka assigns each partition to exactly one consumer in the group, so records are processed once
per group and consumption scales horizontally up to the partition count.
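The assignment invariant (each partition owned by exactly one group member) can be sketched with a toy round-robin assignor; real Kafka supports several assignment strategies, and this is just one simplified scheme:

```python
def assign_partitions(partitions, consumers):
    """Toy round-robin partition assignment within a consumer group:
    each partition goes to exactly one consumer, so the group shares
    the topic's workload without duplicate delivery."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

print(assign_partitions([0, 1, 2, 3], ["c1", "c2"]))
# {'c1': [0, 2], 'c2': [1, 3]}
```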
9. **How does Kafka ensure message durability?**
- **Answer**: Kafka persists each partition as an append-only commit log on disk, replicates
partitions across brokers, and retains records according to configurable retention policies.
Producers can additionally require acknowledgement from in-sync replicas (`acks=all`) before a
write is considered successful.
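The interplay of replication and acknowledgement can be modeled with a toy class (a heavy simplification of `acks=all` with `min.insync.replicas`; the class and its methods are illustrative, not a Kafka API):

```python
class ReplicatedLog:
    """Toy model of Kafka durability: a record is acknowledged only when
    enough replicas have appended it, akin to acks=all combined with
    min.insync.replicas. Illustrative only."""
    def __init__(self, replicas: int, min_insync: int):
        self.logs = [[] for _ in range(replicas)]
        self.min_insync = min_insync

    def append(self, record: str, alive: int) -> bool:
        # 'alive' = how many replicas are currently reachable.
        if alive < self.min_insync:
            return False  # refuse the write rather than risk losing it
        for log in self.logs[:alive]:
            log.append(record)
        return True

log = ReplicatedLog(replicas=3, min_insync=2)
print(log.append("order-1", alive=3))  # True  (acknowledged)
print(log.append("order-2", alive=1))  # False (too few in-sync replicas)
```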
10. **What are Kafka Connect and Kafka Streams?**
- **Answer**: Kafka Connect is a framework for moving data between Kafka and external systems
via reusable source and sink connectors. Kafka Streams is a client library for building
stream-processing applications that read from and write to Kafka topics.
# Apache Airflow
11. **What is Apache Airflow, and why is it used?**
- **Answer**: Apache Airflow is a workflow orchestration tool used to automate and schedule
tasks. It ensures task dependencies are respected and provides monitoring capabilities.
12. **Explain Directed Acyclic Graph (DAG) in Airflow.**
- **Answer**: A DAG is a collection of tasks with dependencies that do not form cycles, ensuring
tasks execute in the correct order.
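The ordering guarantee a DAG provides is exactly a topological sort of the dependency graph, which the standard library can demonstrate (the task names below are hypothetical):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Dependencies expressed as {task: set(upstream_tasks)}, mirroring how an
# Airflow DAG guarantees upstream tasks finish before downstream ones run.
# Task names are hypothetical.
deps = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['extract', 'transform', 'load', 'notify']
```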
13. **What are Operators in Airflow?**
- **Answer**: Operators define individual tasks in a DAG, e.g., PythonOperator for Python scripts
or BashOperator for shell commands.
14. **How does Airflow handle task retries?**
- **Answer**: Airflow retries failed tasks based on task-level parameters such as `retries`,
`retry_delay`, `retry_exponential_backoff`, and `max_retry_delay`.
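The exponential-backoff timing can be sketched as follows (a simplified model of the `retry_exponential_backoff` behavior; Airflow also adds jitter, omitted here):

```python
from datetime import timedelta

def retry_delay_for(attempt: int,
                    retry_delay: timedelta,
                    max_retry_delay: timedelta) -> timedelta:
    """Sketch of exponential-backoff retry timing: double the base delay
    on each attempt, capped at max_retry_delay. Airflow additionally
    applies jitter, omitted here for clarity."""
    return min(retry_delay * (2 ** attempt), max_retry_delay)

for attempt in range(4):
    print(retry_delay_for(attempt, timedelta(minutes=1), timedelta(minutes=5)))
# 0:01:00, 0:02:00, 0:04:00, then capped at 0:05:00
```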
15. **What are XComs in Airflow?**
- **Answer**: XComs (Cross-Communications) enable data sharing between tasks within a DAG.
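The key idea, tasks communicating through a shared keyed store rather than return values, can be modeled in a few lines (the `push`/`pull` helpers and task names are illustrative, loosely mirroring `xcom_push`/`xcom_pull`):

```python
# Toy model of XCom-style data passing: tasks communicate through a
# shared store keyed by (task_id, key), loosely mirroring Airflow's
# xcom_push/xcom_pull. Helpers and task names are illustrative.
xcom_store = {}

def push(task_id: str, key: str, value):
    xcom_store[(task_id, key)] = value

def pull(task_id: str, key: str):
    return xcom_store[(task_id, key)]

def extract():
    push("extract", "row_count", 128)

def report():
    print(f"extracted {pull('extract', 'row_count')} rows")

extract()
report()  # extracted 128 rows
```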
# Apache Druid
16. **What is Apache Druid?**
- **Answer**: Apache Druid is a real-time analytics database optimized for OLAP queries on event
data. It supports high concurrency and low-latency data ingestion.
17. **How does Druid store data?**
- **Answer**: Druid organizes data into segments: immutable, time-partitioned files stored in a
columnar format and optimized for fast scans and aggregations.
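Time partitioning means every event is bucketed into the segment covering its timestamp's interval; a sketch with hourly granularity (the bucketing function is ours, not Druid's API):

```python
from collections import defaultdict
from datetime import datetime, timedelta

def segment_key(ts: datetime, granularity: timedelta = timedelta(hours=1)) -> datetime:
    """Bucket an event timestamp into its segment interval, the way
    Druid partitions data into immutable, time-chunked segments.
    Illustrative only -- not Druid's API."""
    epoch = datetime(1970, 1, 1)
    return epoch + ((ts - epoch) // granularity) * granularity

events = [datetime(2024, 1, 1, 10, 15), datetime(2024, 1, 1, 10, 45),
          datetime(2024, 1, 1, 11, 5)]
segments = defaultdict(list)
for e in events:
    segments[segment_key(e)].append(e)
print(len(segments))  # 2 -- one segment for 10:00, one for 11:00
```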
18. **What is the role of Druid's indexing service?**
- **Answer**: The indexing service ingests raw data and converts it into Druid's segment format for
storage and querying.
19. **Explain Druid's architecture.**
- **Answer**: Druid has a distributed architecture with specialized node types: Brokers route
queries, Historicals serve immutable segments, MiddleManagers run ingestion tasks, and the
Coordinator manages segment availability and balancing (with the Overlord assigning ingestion
tasks).
20. **What is a Druid query?**
- **Answer**: Druid supports native JSON queries (e.g., timeseries, topN, groupBy, scan) with
aggregations and filters, as well as Druid SQL, which compiles to native queries.
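A minimal native timeseries query looks like the following, built as a Python dict and serialized to the JSON that Druid's query endpoint accepts (the datasource name is hypothetical):

```python
import json

# A minimal Druid native timeseries query: count events per hour over
# one day. The dataSource name "web_events" is hypothetical.
query = {
    "queryType": "timeseries",
    "dataSource": "web_events",
    "granularity": "hour",
    "intervals": ["2024-01-01/2024-01-02"],
    "aggregations": [{"type": "count", "name": "events"}],
}
print(json.dumps(query, indent=2))  # the JSON body POSTed to Druid
```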