Interview Questions and Answers on Apache Spark, Kafka, Airflow, and Druid

# Apache Spark

1. **What is Apache Spark, and how does it differ from Hadoop?**

- **Answer**: Apache Spark is an open-source, distributed computing system designed for fast computation. Unlike Hadoop, which relies on MapReduce for batch processing, Spark offers in-memory computation, making it faster for iterative tasks. Spark supports real-time stream processing and interactive queries, unlike Hadoop's batch-only processing.

2. **Explain RDD in Spark. Why is it important?**

- **Answer**: RDD (Resilient Distributed Dataset) is the fundamental data structure in Spark, representing an immutable, distributed collection of objects. RDDs are fault-tolerant because lost partitions can be recomputed from their lineage, support in-memory computations, and allow transformations and actions.
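
A minimal PySpark sketch of this (local mode and the tiny in-memory dataset are assumptions for illustration):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# An RDD is an immutable, partitioned collection; here it is built
# from an in-memory range and split across two partitions.
rdd = sc.parallelize(range(10), numSlices=2)

# cache() keeps the partitions in memory, so later computations
# reuse them instead of recomputing the lineage from scratch.
rdd.cache()
print(rdd.count())  # 10

sc.stop()
```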

3. **What are Spark transformations and actions? Provide examples.**

- **Answer**: Transformations (e.g., `map`, `filter`) create a new RDD from an existing one and are lazily evaluated. Actions (e.g., `collect`, `count`) trigger computation and return results to the driver.
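
A short sketch (again assuming a local SparkContext) showing that transformations stay lazy until an action runs:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "transformations-vs-actions")

rdd = sc.parallelize([1, 2, 3, 4, 5, 6])

# Transformations are lazy: nothing executes yet, Spark only
# records the lineage (filter -> map).
evens = rdd.filter(lambda x: x % 2 == 0)
doubled = evens.map(lambda x: x * 2)

# Actions trigger the computation and return results to the driver.
print(doubled.collect())  # [4, 8, 12]
print(doubled.count())    # 3

sc.stop()
```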

4. **How does Spark Streaming work?**

- **Answer**: Spark Streaming processes live data streams in small batches using DStreams (Discretized Streams), enabling real-time analytics.
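
A hedged DStream sketch; the text source on localhost:9999 is an assumption (it could be fed with `nc -lk 9999`):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-demo")

# A 5-second batch interval: incoming data is grouped into one
# small RDD per interval, and the sequence of RDDs is the DStream.
ssc = StreamingContext(sc, 5)

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's word counts

ssc.start()
ssc.awaitTermination()
```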

5. **What is the role of Spark's DAG Scheduler?**

- **Answer**: The Directed Acyclic Graph (DAG) Scheduler manages job execution in stages, optimizing task scheduling and fault recovery.


# Apache Kafka

6. **What is Apache Kafka, and how is it used?**

- **Answer**: Apache Kafka is a distributed event-streaming platform for building real-time data pipelines. It uses topics to organize data streams and supports high throughput and fault tolerance.

7. **Explain the concept of Kafka topics and partitions.**

- **Answer**: Topics are categories for data streams. Each topic is divided into partitions to allow parallelism and scalability.
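
One way to create a partitioned topic programmatically, sketched with the `kafka-python` client (the broker address, topic name, and partition/replica counts are illustrative assumptions):

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Assumes a broker reachable on localhost:9092.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Six partitions allow up to six consumers in one group to read in
# parallel; each partition is replicated to three brokers.
admin.create_topics([
    NewTopic(name="page-views", num_partitions=6, replication_factor=3)
])
admin.close()
```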

8. **What is a Kafka consumer group?**

- **Answer**: A consumer group allows multiple consumers to coordinate and share the workload of reading data from Kafka topics; each partition is consumed by exactly one member of the group at a time.
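
A consumer-group sketch with `kafka-python` (topic, group, and broker address are placeholders); starting several copies of this process makes Kafka split the partitions among them:

```python
from kafka import KafkaConsumer

# Every consumer started with the same group_id shares the topic's
# partitions; Kafka assigns each partition to exactly one member of
# the group and rebalances when members join or leave.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    group_id="analytics-service",
    auto_offset_reset="earliest",
)

for message in consumer:
    print(message.partition, message.offset, message.value)
```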

9. **How does Kafka ensure message durability?**

- **Answer**: Kafka uses distributed logs, replication, and configurable retention policies to guarantee message durability.
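
A producer-side sketch of the durability settings, again with `kafka-python` (broker and topic are placeholders):

```python
from kafka import KafkaProducer

# acks="all" means the broker acknowledges a write only after all
# in-sync replicas have persisted it; together with a replication
# factor greater than one, this survives a broker failure.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",
    retries=5,
)
producer.send("page-views", b"user-123 viewed /home")
producer.flush()  # block until pending records are acknowledged
```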

10. **What are Kafka Connect and Kafka Streams?**

- **Answer**: Kafka Connect simplifies data integration between Kafka and external systems. Kafka Streams is a client library for building stream-processing applications on top of Kafka topics.

# Apache Airflow

11. **What is Apache Airflow, and why is it used?**

- **Answer**: Apache Airflow is a workflow orchestration tool used to automate and schedule tasks. It ensures task dependencies are respected and provides monitoring capabilities.

12. **Explain Directed Acyclic Graph (DAG) in Airflow.**

- **Answer**: A DAG is a collection of tasks with dependencies that do not form cycles, ensuring tasks execute in the correct order.

13. **What are Operators in Airflow?**

- **Answer**: Operators define individual tasks in a DAG, e.g., `PythonOperator` for Python callables or `BashOperator` for shell commands.
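
A minimal DAG sketch combining both operator types (Airflow 2.x import paths assumed; the DAG id, schedule, and commands are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract():
    print("pulling data...")


with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    pull = PythonOperator(task_id="extract", python_callable=extract)
    load = BashOperator(task_id="load", bash_command="echo 'loading...'")

    # The >> operator declares the dependency extract -> load,
    # so `load` only runs after `pull` succeeds.
    pull >> load
```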

14. **How does Airflow handle task retries?**

- **Answer**: Airflow allows configuring retries with parameters like `retries`, `retry_delay`, and `max_retry_delay`.
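
A sketch of those parameters as a `default_args` dict (the values are illustrative):

```python
from datetime import timedelta

# Applied per task (directly or via a DAG's default_args): a failed
# try is re-run up to `retries` times, waiting `retry_delay` between
# tries; with exponential backoff the wait grows but is capped at
# `max_retry_delay`.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=1),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(minutes=10),
}
```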

15. **What are XComs in Airflow?**

- **Answer**: XComs (Cross-Communications) enable data sharing between tasks within a DAG.
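
A minimal XCom sketch (DAG and task ids are placeholders): the downstream task pulls the value the upstream callable returned:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def produce():
    # A PythonOperator callable's return value is pushed to XCom
    # under the default key "return_value".
    return 42


def consume(ti):
    # xcom_pull fetches the value pushed by the upstream task.
    value = ti.xcom_pull(task_ids="produce_task")
    print(f"received {value}")


with DAG(
    dag_id="xcom_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    producer = PythonOperator(task_id="produce_task", python_callable=produce)
    consumer = PythonOperator(task_id="consume_task", python_callable=consume)
    producer >> consumer
```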

# Apache Druid

16. **What is Apache Druid?**

- **Answer**: Apache Druid is a real-time analytics database optimized for OLAP queries on event data. It supports high concurrency and low-latency data ingestion.

17. **How does Druid store data?**

- **Answer**: Druid organizes data into segments, which are immutable and optimized for fast access.

18. **What is the role of Druid's indexing service?**

- **Answer**: The indexing service ingests raw data and converts it into Druid's segment format for storage and querying.

19. **Explain Druid's architecture.**

- **Answer**: Druid has a distributed architecture with specialized node types: Historical nodes serve queries over stored segments, MiddleManager nodes handle data ingestion, Coordinator nodes manage segment availability and balancing, and Broker nodes route queries to the right nodes.

20. **What is a Druid query?**

- **Answer**: Druid's native queries are JSON documents submitted over HTTP and support aggregations, filters, and group-by operations; Druid also offers a SQL layer that translates to native queries.
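
A hedged example of posting a native timeseries query with Python's `requests`; the router address (localhost:8888), datasource name, and interval are assumptions:

```python
import requests

# A native timeseries query: count events per hour over one day.
query = {
    "queryType": "timeseries",
    "dataSource": "page_views",
    "granularity": "hour",
    "intervals": ["2024-01-01/2024-01-02"],
    "aggregations": [{"type": "count", "name": "views"}],
}

# Druid's native query endpoint accepts JSON over HTTP.
resp = requests.post("http://localhost:8888/druid/v2/", json=query)
resp.raise_for_status()
print(resp.json())
```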
