Interview Questions and Answers
Apache Kafka
Q: What is Kafka?
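A: Kafka is a distributed event streaming platform. It stores records in a replicated, partitioned log and is used for publish/subscribe messaging and real-time data pipelines.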
Q: What is a topic?
A: A topic is a named stream of records in Kafka. Each topic is split into partitions, and each partition is an ordered, append-only log.
Q: What is a producer and a consumer?
A: Producers send data; consumers read data from topics.
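As a rough illustration of how producers and consumers interact with a topic, the sketch below models a topic as an append-only list and a consumer as a reader that tracks its own offset. The names `ToyTopic` and `ToyConsumer` are hypothetical, and this is not the Kafka client API; real topics are partitioned and replicated across brokers.

```python
class ToyTopic:
    """Hypothetical in-memory stand-in for a single-partition topic."""

    def __init__(self, name):
        self.name = name
        self.log = []  # append-only list of records

    def append(self, record):
        """Producer side: add a record and return its offset."""
        self.log.append(record)
        return len(self.log) - 1


class ToyConsumer:
    """Hypothetical consumer that remembers the next offset to read."""

    def __init__(self, topic):
        self.topic = topic
        self.offset = 0

    def poll(self):
        """Return all records not yet read and advance the offset."""
        records = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return records


topic = ToyTopic("orders")
topic.append("order-1")
topic.append("order-2")

consumer = ToyConsumer(topic)
first_batch = consumer.poll()   # the two records written so far
topic.append("order-3")
second_batch = consumer.poll()  # only the record written since the last poll
```

Because the consumer, not the topic, holds the offset, several consumers could read the same log independently in this model.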
Q: What is a Kafka broker?
A: A broker is a Kafka server that stores topic partitions and serves read and write requests from clients.
Q: What is ISR?
A: The ISR (in-sync replica) set is the group of replicas for a partition, including the leader, that are fully caught up with the leader's log.
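A minimal sketch of the idea, assuming a time-based lag rule similar in spirit to Kafka's `replica.lag.time.max.ms` setting. The function name, the input dictionary, and the broker ids are hypothetical; the real broker logic is more involved.

```python
def in_sync_replicas(leader, last_caught_up_ago_s, max_lag_s):
    """Return the sorted ISR: the leader plus followers within max_lag_s.

    last_caught_up_ago_s maps follower id -> seconds since it last
    caught up with the leader (hypothetical input for this sketch).
    """
    isr = {leader}
    for replica, lag in last_caught_up_ago_s.items():
        if lag <= max_lag_s:
            isr.add(replica)
    return sorted(isr)


followers = {"broker-2": 1.5, "broker-3": 45.0}
isr = in_sync_replicas("broker-1", followers, max_lag_s=30.0)
# broker-3 has lagged for longer than the threshold, so it is left out
```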
Apache Spark
Q: What is Spark?
A: Spark is a distributed engine for large-scale data processing. It gets much of its speed from keeping intermediate data in memory.
Q: What is an RDD?
A: An RDD (Resilient Distributed Dataset) is Spark's basic data structure: an immutable collection of records partitioned across the cluster.
Q: What is Spark SQL?
A: Spark SQL is the Spark module for structured data; it lets you query data with SQL or the DataFrame API.
Q: What is lineage?
A: Lineage tracks the history of RDD transformations for fault recovery.
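To make lineage concrete, here is a toy sketch in plain Python (not the Spark API). The hypothetical `ToyRDD` class stores how a dataset was derived rather than the data itself, so a lost result can be rebuilt by replaying the recorded transformations from the source.

```python
class ToyRDD:
    """Hypothetical dataset that stores how it was derived, not its data."""

    def __init__(self, source=None, parent=None, fn=None, op="source"):
        self.source = source  # only set on the root dataset
        self.parent = parent
        self.fn = fn
        self.op = op

    def map(self, fn):
        return ToyRDD(parent=self, fn=fn, op="map")

    def filter(self, fn):
        return ToyRDD(parent=self, fn=fn, op="filter")

    def lineage(self):
        """List the operations from the root to this dataset."""
        chain = [] if self.parent is None else self.parent.lineage()
        return chain + [self.op]

    def compute(self):
        """Rebuild the data by replaying the lineage from the source."""
        if self.parent is None:
            return list(self.source)
        data = self.parent.compute()
        if self.op == "map":
            return [self.fn(x) for x in data]
        return [x for x in data if self.fn(x)]


numbers = ToyRDD(source=[1, 2, 3, 4])
evens_squared = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
```

Calling `evens_squared.compute()` a second time stands in for recovery: nothing was cached, so the result is recomputed from the source through the same chain.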
Q: What is shuffling?
A: Shuffling redistributes data across partitions and nodes. It is costly because of disk and network I/O, but operations such as groupByKey, reduceByKey, and join need it.
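The sketch below shows, under simplifying assumptions, why a shuffle is needed for a per-key sum: records with the same key start out on different nodes and must be regrouped into the same partition before they can be reduced. This is plain Python, not Spark code, and using `zlib.crc32` as the partitioner is a choice made for this example.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Deterministic hash partitioner (crc32 is a choice made for this sketch)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(map_outputs, num_partitions):
    """Regroup (key, value) pairs so each key lands in exactly one partition."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for node_output in map_outputs:
        for key, value in node_output:
            partitions[partition_for(key, num_partitions)][key].append(value)
    return partitions


# Pairs produced on two different nodes before the shuffle.
map_outputs = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
]
partitions = shuffle(map_outputs, num_partitions=2)

# After the shuffle, each partition can reduce its keys locally.
totals = {}
for part in partitions:
    for key, values in part.items():
        totals[key] = sum(values)
```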
Hadoop
Q: What is Hadoop?
A: Hadoop is a framework for storing and processing big data.
Q: What is HDFS?
A: HDFS (Hadoop Distributed File System) stores large files as blocks that are replicated across the machines in a cluster.
Q: What are NameNode and DataNode?
A: The NameNode manages file system metadata (the directory tree and the location of each block); DataNodes store the actual data blocks.
Q: What is YARN?
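A: YARN (Yet Another Resource Negotiator) is Hadoop's resource management layer; it allocates cluster resources to applications and schedules their tasks.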
Q: What is data locality?
A: Data locality means running computation on or near the node that stores the data, rather than moving the data across the network.
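A simplified sketch of a locality-aware placement decision follows. The function and node names are hypothetical, and real Hadoop schedulers also consider rack-level locality, which this example leaves out.

```python
def pick_node(block_replicas, free_nodes):
    """Choose where to run a task that reads one block.

    Prefers a free node that already holds a replica of the block;
    otherwise falls back to the first free node, which must read
    the block over the network. Returns (node, is_local).
    """
    for node in free_nodes:
        if node in block_replicas:
            return node, True
    return free_nodes[0], False


replicas = {"node-2", "node-3"}  # nodes storing the block (hypothetical)
local_choice = pick_node(replicas, ["node-1", "node-3"])
remote_choice = pick_node(replicas, ["node-1", "node-4"])
```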
Apache Hive
Q: What is Hive?
A: Hive is a data warehouse tool for querying big data using SQL-like syntax.
Q: What is schema-on-read?
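A: Schema-on-read means Hive applies the table schema when data is read, not when it is loaded, so files are not validated against the schema at write time.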
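The idea can be sketched in plain Python: the raw lines are stored untouched, and a schema is applied only when they are read. The sample data and the `read_with_schema` helper are hypothetical; turning values that cannot be cast into `None` is meant to resemble the way Hive generally returns NULL for fields that do not fit the declared type.

```python
raw_lines = ["1,alice,34", "2,bob,not-a-number"]

# The schema lives beside the data and is applied only when rows are read.
schema = [("id", int), ("name", str), ("age", int)]


def read_with_schema(lines, schema):
    """Parse each raw line using the schema; values that fail to cast become None."""
    rows = []
    for line in lines:
        row = {}
        for (column, cast), value in zip(schema, line.split(",")):
            try:
                row[column] = cast(value)
            except ValueError:
                row[column] = None
        rows.append(row)
    return rows


rows = read_with_schema(raw_lines, schema)
```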
Q: What is partitioning?
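A: Partitioning splits a table into directories by the values of one or more partition columns, so queries that filter on those columns scan only the matching directories.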
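A small sketch of partition pruning, assuming the common `column=value` directory layout. The file paths and the `prune` helper are hypothetical and only illustrate why a filter on the partition column reduces the data scanned.

```python
files = [
    "sales/dt=2024-01-01/part-0000",
    "sales/dt=2024-01-01/part-0001",
    "sales/dt=2024-01-02/part-0000",
]


def prune(files, column, value):
    """Keep only files that sit under the directory for column=value."""
    wanted = f"{column}={value}"
    return [path for path in files if wanted in path.split("/")]


# A query filtering on dt = '2024-01-02' only needs to read this subset.
to_scan = prune(files, "dt", "2024-01-02")
```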
Q: What are Hive tables?
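A: Hive tables are either managed (internal) or external. Hive owns the data of a managed table and deletes it when the table is dropped; dropping an external table removes only the metadata and leaves the files in place.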
Q: What is a UDF?
A: UDFs (User Defined Functions) let you create custom query functions in Hive.
ZooKeeper
Q: What is ZooKeeper?
A: ZooKeeper is a coordination service for distributed systems. It provides configuration management, naming, synchronization, and group membership.
Q: What is a ZNode?
A: A ZNode is a node in ZooKeeper's hierarchical namespace; it holds a small amount of data and can have child ZNodes.
Q: What is a Watch?
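A: A Watch is a trigger that a client sets on a ZNode; ZooKeeper notifies the client when that ZNode's data or children change. Standard watches fire once and must be set again to receive further notifications.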
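The one-time nature of a standard watch can be sketched with a toy class. `ToyZNode` is hypothetical and is not the ZooKeeper client API; as in ZooKeeper, the notification carries only an event type, not the new data.

```python
class ToyZNode:
    """Hypothetical node whose watches fire once and are then removed."""

    def __init__(self, data):
        self.data = data
        self._watchers = []

    def get(self, watch=None):
        """Read the data and optionally register a one-time watch."""
        if watch is not None:
            self._watchers.append(watch)
        return self.data

    def set(self, data):
        """Update the data and notify each registered watch exactly once."""
        self.data = data
        watchers, self._watchers = self._watchers, []  # clear before firing
        for callback in watchers:
            callback("NodeDataChanged")


events = []
node = ToyZNode("v1")
node.get(watch=events.append)
node.set("v2")  # the watch fires here
node.set("v3")  # no watch is registered any more, so nothing fires
```

To keep receiving notifications in this model, the client would call `get` with a watch again after each event.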
Q: What is leader election?
A: Leader election is the process by which a group of servers agrees on one server to act as leader (master) and coordinate the others.
Q: What is a quorum?
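A: A quorum is the minimum number of servers that must agree for a decision to take effect. In ZooKeeper it is a majority of the ensemble, for example 3 out of 5 servers.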
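The majority rule is simple arithmetic, sketched here with two hypothetical helper functions. It also shows why ensembles usually have an odd number of servers: going from 3 to 4 servers raises the quorum without tolerating any additional failure.

```python
def quorum_size(ensemble_size):
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """Servers that can fail while a quorum is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


sizes = [3, 4, 5]
quorums = [quorum_size(n) for n in sizes]          # [2, 3, 3]
failures = [tolerated_failures(n) for n in sizes]  # [1, 1, 2]
```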
Apache Oozie
Q: What is Oozie?
A: Oozie is a workflow scheduler for Hadoop jobs.
Q: What are Oozie jobs?
A: Oozie jobs are workflows (a sequence of actions), coordinators (scheduled workflows), or bundles (groups of coordinators).
Q: What is an Oozie coordinator?
A: It schedules workflows based on time or data availability.
Q: How does Oozie handle dependencies?
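A: An Oozie workflow is a directed acyclic graph (DAG) of actions; an action starts only after the actions it depends on have completed. Coordinators can also wait for input data to become available.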
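Dependency ordering of this kind can be sketched with Python's standard `graphlib` module (Python 3.9+). The action names are hypothetical, and real Oozie workflows are defined in XML rather than Python; the example only shows how a valid execution order follows from the dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the actions it depends on.
dependencies = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}

# static_order() yields every action after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
```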
Q: What are SLAs in Oozie?
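A: SLAs define the expected start time, end time, and duration of a job. Oozie tracks jobs against these expectations and can send alerts when they are missed; it does not itself guarantee on-time completion.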
Apache Flume
Q: What is Flume?
A: Flume is a distributed service for collecting, aggregating, and moving large amounts of log data.
Q: What are its components?
A: Source (data input), Channel (data storage), Sink (data output).
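How the three components fit together can be sketched in a few lines of plain Python. This is a toy model, not Flume configuration or the Flume API; the function names and the sample log lines are hypothetical.

```python
from collections import deque

channel = deque()  # buffers events between the source and the sink


def source(lines):
    """Wrap each incoming line in an event and put it on the channel."""
    for line in lines:
        channel.append({"body": line})


def sink(destination):
    """Drain the channel in arrival order and write events to the destination."""
    while channel:
        destination.append(channel.popleft()["body"])


log_lines = ["GET /index", "GET /about"]
delivered = []
source(log_lines)
sink(delivered)
```

The channel decouples the two sides: the source can keep accepting events while the sink is slow, as long as the buffer has room.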
Q: What is a Flume agent?
A: A Flume agent is a JVM process that hosts the sources, channels, and sinks through which events flow.
Q: What is a Memory Channel?
A: A Memory Channel buffers events in an in-memory queue. It is fast but volatile: buffered events are lost if the agent process stops.
Q: What is At-Least-Once delivery?
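A: At-least-once delivery means every event is delivered one or more times: events are not lost, but duplicates are possible, so downstream consumers should tolerate them.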
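The sketch below illustrates where duplicates come from under at-least-once delivery: the sender retries whenever it does not see an acknowledgement, even if the receiver already stored the event. All names are hypothetical and the lost acknowledgement is simulated.

```python
def deliver_at_least_once(events, send, max_attempts=5):
    """Resend each event until send() reports an acknowledgement."""
    for event in events:
        for _ in range(max_attempts):
            if send(event):
                break
        else:
            raise RuntimeError(f"gave up on {event!r}")


received = []
lost_acks = {"e2"}  # the ack for the first delivery of e2 is lost


def flaky_send(event):
    """Hypothetical receiver: always stores the event, but one ack goes missing."""
    received.append(event)
    if event in lost_acks:
        lost_acks.discard(event)
        return False  # the sender never sees the ack, so it will retry
    return True


deliver_at_least_once(["e1", "e2", "e3"], flaky_send)
# received now holds e2 twice; no event is missing
```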
Apache Samza
Q: What is Samza?
A: Samza is a distributed stream processing framework. It is commonly run with Kafka for messaging and YARN for resource management.
Q: What is stream processing?
A: Stream processing means processing records continuously as they arrive, rather than in periodic batches.
Q: What are Samza's components?
A: Streams (the input and output data), Jobs (the processing logic), and Tasks (the parallel units of a job, each processing some of the stream's partitions).
Q: What is stateful processing?
A: Stateful processing keeps state derived from earlier messages, such as counts or windows, and uses it when processing later ones.
Q: What is a checkpoint?
A: A checkpoint is a saved record of processing progress, such as the last processed offsets, that lets a task resume after a failure.
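Stateful processing and checkpointing can be sketched together with a toy word counter. The `CountingTask` class is hypothetical and is not the Samza API. For brevity the snapshot holds both the offset and the state; in Samza itself, checkpoints track input offsets while task state is kept in a separate store.

```python
class CountingTask:
    """Hypothetical task that counts words and checkpoints its progress."""

    def __init__(self):
        self.counts = {}  # state carried across messages
        self.offset = 0   # index of the next message to process

    def process(self, stream, upto):
        """Process messages from the current offset up to (not including) upto."""
        for word in stream[self.offset:upto]:
            self.counts[word] = self.counts.get(word, 0) + 1
        self.offset = upto

    def checkpoint(self):
        """Return a snapshot of the offset and the state."""
        return {"offset": self.offset, "counts": dict(self.counts)}

    @classmethod
    def restore(cls, snapshot):
        """Build a new task that resumes from a snapshot."""
        task = cls()
        task.offset = snapshot["offset"]
        task.counts = dict(snapshot["counts"])
        return task


stream = ["a", "b", "a", "c", "a"]
task = CountingTask()
task.process(stream, upto=3)
snapshot = task.checkpoint()

# Simulate a crash: a new task resumes from the snapshot, not from offset 0.
recovered = CountingTask.restore(snapshot)
recovered.process(stream, upto=len(stream))
```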