BigData Interview QnA

The document provides interview questions and answers related to various big data technologies including Apache Kafka, Spark, Hadoop, Hive, Zookeeper, Oozie, Flume, and Samza. Key concepts such as topics, producers, RDDs, HDFS, and workflow scheduling are explained. Each technology is summarized with essential definitions and functionalities.

Uploaded by

carley

Interview Questions and Answers

Apache Kafka

Q: What is Kafka?

A: Kafka is a distributed messaging system for real-time data streaming.

Q: What is a topic?

A: A topic is a named category to which messages are published and from which they are read; each topic is divided into partitions.

Q: What is a producer and a consumer?

A: Producers publish messages to topics; consumers subscribe to topics and read those messages.
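
As a rough illustration of how topics, producers, and consumers relate, here is a toy in-memory model. It is not the Kafka client API: the class `ToyTopic`, its method names, and the use of CRC32 to pick a partition are all assumptions made for this sketch.

```python
import zlib


class ToyTopic:
    """Toy stand-in for a topic: a list of append-only partitions."""

    def __init__(self, name, num_partitions=2):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        """Append a message; the key decides the partition (CRC32 is an assumption)."""
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def consume(self, partition, offset):
        """Return all messages in a partition starting at the given offset."""
        return self.partitions[partition][offset:]


topic = ToyTopic("clicks")
p1, o1 = topic.produce("user-1", "home")
p2, o2 = topic.produce("user-1", "cart")
messages = topic.consume(p1, o1)
```

Messages with the same key land in the same partition, so a consumer reading that partition sees them in the order they were produced.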

Q: What is a Kafka broker?

A: A broker is a server that stores and manages data in Kafka.

Q: What is ISR?

A: The ISR (In-Sync Replicas) is the set of replicas of a partition that are fully caught up with the leader.
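
The idea can be sketched with a small function that keeps only the followers that caught up with the leader recently. This is a toy model, not broker code: the function name, the broker names, and the timestamps are hypothetical, and the time threshold only loosely mirrors Kafka's `replica.lag.time.max.ms` setting.

```python
def in_sync_replicas(leader, last_caught_up_ms, now_ms, max_lag_ms):
    """Return the leader plus every follower that caught up within max_lag_ms."""
    isr = [leader]
    for replica, caught_up_at in sorted(last_caught_up_ms.items()):
        if now_ms - caught_up_at <= max_lag_ms:
            isr.append(replica)
    return isr


# Hypothetical followers and the time (ms) each one last matched the leader's log.
followers = {"broker-2": 9_500, "broker-3": 1_000}
isr = in_sync_replicas("broker-1", followers, now_ms=10_000, max_lag_ms=2_000)
```

Here `broker-3` has fallen too far behind, so it drops out of the set until it catches up again.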

Apache Spark

Q: What is Spark?

A: Spark is a big data framework for fast data processing.

Q: What is an RDD?

A: An RDD (Resilient Distributed Dataset) is Spark's basic data structure: an immutable collection of records partitioned across the cluster.

Q: What is Spark SQL?

A: Spark SQL is the Spark module for querying structured data with SQL or the DataFrame API.

Q: What is lineage?

A: Lineage tracks the history of RDD transformations for fault recovery.
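
A minimal sketch of the idea in plain Python (not the Spark API): the hypothetical `ToyRDD` class stores the list of transformations rather than their results, so a lost result can be rebuilt by replaying that list over the source data.

```python
class ToyRDD:
    """Toy dataset that stores its lineage instead of computed results."""

    def __init__(self, source, lineage=()):
        self.source = source           # original input data
        self.lineage = tuple(lineage)  # ordered (kind, function) steps

    def map(self, fn):
        return ToyRDD(self.source, self.lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self.source, self.lineage + (("filter", fn),))

    def collect(self):
        """Replay the lineage over the source; a lost result is rebuilt the same way."""
        data = list(self.source)
        for kind, fn in self.lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
first = rdd.collect()
recomputed = rdd.collect()  # same lineage, same result
```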

Q: What is shuffling?

A: Shuffling moves data across nodes; it's costly but necessary for some operations.
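
To see why some operations need a shuffle, consider counting values per key when the pairs start out on different nodes. The sketch below is a toy model in plain Python; the function name and the CRC32-based routing are assumptions, not Spark internals.

```python
import zlib
from collections import defaultdict


def shuffle(node_outputs, num_reducers):
    """Route every (key, value) pair to the reducer that owns its key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in node_outputs:
        for key, value in pairs:
            target = zlib.crc32(key.encode("utf-8")) % num_reducers
            reducers[target][key].append(value)
    return reducers


# Two hypothetical nodes, each holding part of the data.
node_outputs = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]
reducers = shuffle(node_outputs, num_reducers=2)
counts = {key: sum(values) for r in reducers for key, values in r.items()}
```

Both `("a", 1)` pairs must end up in the same place before they can be summed, which is the data movement that makes shuffles expensive.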

Hadoop

Q: What is Hadoop?

A: Hadoop is a framework for storing and processing big data.

Q: What is HDFS?

A: HDFS (Hadoop Distributed File System) is a distributed file system that stores large files as blocks spread across many machines.

Q: What are NameNode and DataNode?

A: NameNode manages metadata; DataNode stores actual data.
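
The split between metadata and data can be sketched as two dictionaries. This is a toy model: the function names are hypothetical, the round-robin placement is an assumption, and real HDFS also replicates each block across several DataNodes.

```python
def store_file(data, block_size, datanodes):
    """Split data into blocks, place them round-robin, and return both views."""
    namenode = {}                               # block id -> DataNode name (metadata only)
    storage = {name: {} for name in datanodes}  # DataNode name -> {block id: bytes}
    for i in range(0, len(data), block_size):
        block_id = i // block_size
        node = datanodes[block_id % len(datanodes)]
        namenode[block_id] = node
        storage[node][block_id] = data[i:i + block_size]
    return namenode, storage


def read_file(namenode, storage):
    """Ask the metadata map where each block lives, then fetch the bytes."""
    return b"".join(storage[namenode[b]][b] for b in sorted(namenode))


namenode, storage = store_file(b"hello hdfs!", block_size=4, datanodes=["dn1", "dn2"])
restored = read_file(namenode, storage)
```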

Q: What is YARN?

A: YARN (Yet Another Resource Negotiator) manages cluster resources and schedules the applications that run on Hadoop.

Q: What is data locality?

A: Data locality means processing data close to where it is stored.
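
A minimal sketch of a locality-aware placement decision, assuming a hypothetical `assign_task` helper and made-up node names; a real scheduler weighs more factors (for example rack locality) than this.

```python
def assign_task(block_locations, free_nodes):
    """Prefer a free node that already holds the block; otherwise use any free node."""
    for node in block_locations:
        if node in free_nodes:
            return node, "node-local"
    return sorted(free_nodes)[0], "remote"


local = assign_task(["dn2", "dn3"], free_nodes={"dn1", "dn3"})
remote = assign_task(["dn2"], free_nodes={"dn1", "dn4"})
```

In the first call the task runs where the block is stored, so no data crosses the network; in the second the block has to be read remotely.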

Apache Hive

Q: What is Hive?

A: Hive is a data warehouse tool for querying big data using SQL-like syntax.

Q: What is schema-on-read?

A: Hive applies a schema to data only when reading it.
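
The behavior can be imitated in plain Python: the raw lines are stored untouched, and the schema is applied only when they are read. The schema, the sample lines, and the choice to turn non-conforming values into `None` (similar in spirit to a NULL) are assumptions made for this sketch.

```python
RAW_LINES = ["1,alice,30", "2,bob,not_a_number"]  # stored as plain text, nothing enforced

SCHEMA = [("id", int), ("name", str), ("age", int)]  # hypothetical table definition


def read_with_schema(lines, schema):
    """Apply the schema while reading; values that do not fit become None."""
    rows = []
    for line in lines:
        row = {}
        for (column, cast), raw in zip(schema, line.split(",")):
            try:
                row[column] = cast(raw)
            except ValueError:
                row[column] = None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, SCHEMA)
```

The bad value in the second line causes no error at write time; it only shows up as a missing value when the data is read.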

Q: What is partitioning?

A: Partitioning divides data into smaller chunks for faster queries.
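
Why this speeds up queries can be shown with a toy example: if rows are grouped by a partition column, a query that filters on that column only has to scan one group. The column name `dt` and the helper functions are hypothetical.

```python
from collections import defaultdict


def partition_by(rows, column):
    """Group rows by the value of the partition column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return partitions


def query(partitions, partition_value):
    """Return matching rows and how many rows had to be scanned."""
    scanned = partitions.get(partition_value, [])
    return list(scanned), len(scanned)


rows = [
    {"dt": "2024-01-01", "amount": 5},
    {"dt": "2024-01-01", "amount": 7},
    {"dt": "2024-01-02", "amount": 9},
]
partitions = partition_by(rows, "dt")
result, rows_scanned = query(partitions, "2024-01-02")
```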

Q: What are Hive tables?

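A: Tables in Hive can be internal (managed): Hive owns both the metadata and the data, so dropping the table deletes the data. They can also be external: Hive tracks only the metadata, so dropping the table leaves the data files in place.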

Q: What is a UDF?

A: UDFs (User Defined Functions) let you create custom query functions in Hive.

Zookeeper

Q: What is Zookeeper?

A: Zookeeper is a coordination service for distributed systems; it provides configuration storage, naming, synchronization, and group membership.

Q: What is a ZNode?

A: A ZNode is a node in Zookeeper's hierarchical namespace; it can hold a small amount of data and have child ZNodes.

Q: What is a Watch?

A: A Watch is a one-time notification that a client registers on a ZNode to be told when its data or children change.
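
A toy model of the mechanism, assuming a hypothetical `ToyZNodeStore` class rather than a real Zookeeper client: a callback registered on a path fires on the next change and is then discarded.

```python
class ToyZNodeStore:
    """Toy key-value store with one-shot watches, loosely modeled on Zookeeper."""

    def __init__(self):
        self.data = {}
        self.watches = {}  # path -> callbacks waiting for the next change

    def get(self, path, watch=None):
        if watch is not None:
            self.watches.setdefault(path, []).append(watch)
        return self.data.get(path)

    def set(self, path, value):
        self.data[path] = value
        # Fire and clear the watches: each one triggers at most once.
        for callback in self.watches.pop(path, []):
            callback(path, value)


events = []
store = ToyZNodeStore()
store.get("/config", watch=lambda path, value: events.append((path, value)))
store.set("/config", "v1")  # triggers the watch
store.set("/config", "v2")  # no watch is registered any more
```

A client that wants to keep following a path has to register a new watch after each notification.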

Q: What is leader election?

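A: Leader election is the process by which a group of servers agrees on one of them to act as the leader (master); a new election is held if that leader fails.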
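
One common approach built on Zookeeper gives every candidate a sequence number and treats the lowest one as the leader. The sketch below imitates only that rule in plain Python; the server names and numbers are hypothetical.

```python
def elect_leader(candidates):
    """Pick the candidate holding the lowest sequence number."""
    return min(candidates, key=candidates.get)


# Hypothetical candidates and the sequence numbers they were assigned on joining.
candidates = {"server-a": 12, "server-b": 7, "server-c": 9}
leader = elect_leader(candidates)

del candidates[leader]  # the leader fails and its entry disappears
next_leader = elect_leader(candidates)
```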

Q: What is a quorum?

A: A quorum is the minimum number of servers that must agree for a decision to take effect, typically a majority of the ensemble.
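
Assuming a majority quorum, the arithmetic is short enough to write down. The function names are hypothetical.

```python
def quorum_size(ensemble_size):
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


sizes = {n: (quorum_size(n), tolerated_failures(n)) for n in (3, 4, 5, 6)}
```

Under this assumption, going from 5 to 6 servers raises the quorum from 3 to 4 without tolerating any additional failure, which is why odd ensemble sizes are usually preferred.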

Apache Oozie

Q: What is Oozie?

A: Oozie is a workflow scheduler for Hadoop jobs.

Q: What are Oozie jobs?

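A: Oozie jobs come in three types: workflows (a set of actions with dependencies), coordinators (workflows run on a schedule), and bundles (groups of coordinators managed together).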

Q: What is an Oozie coordinator?

A: It schedules workflows based on time or data availability.
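
The two trigger conditions can be combined in a small predicate. This is a sketch of the idea only, not Oozie configuration; the function name, dataset names, and times are hypothetical.

```python
from datetime import datetime


def ready_to_run(now, nominal_time, available_inputs, required_inputs):
    """Trigger only when the scheduled time has passed and all inputs exist."""
    return now >= nominal_time and set(required_inputs) <= set(available_inputs)


nominal = datetime(2024, 1, 2, 0, 0)
required = ["logs/2024-01-01"]

too_early = ready_to_run(datetime(2024, 1, 1, 23, 0), nominal, required, required)
missing_data = ready_to_run(datetime(2024, 1, 2, 1, 0), nominal, [], required)
runs = ready_to_run(datetime(2024, 1, 2, 1, 0), nominal, required, required)
```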

Q: How does Oozie handle dependencies?

A: An Oozie workflow is a directed acyclic graph (DAG) of actions; an action runs only after the actions it depends on have completed.
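
Ordering actions by their dependencies amounts to a topological sort. The sketch below uses Python's standard `graphlib` module on a hypothetical four-step workflow; it illustrates the ordering rule and is not how Oozie itself is configured.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
}

order = list(TopologicalSorter(workflow).static_order())
```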

Q: What are SLAs in Oozie?

A: SLAs let you define expected start times, end times, and durations for jobs; Oozie tracks them and raises alerts when a job misses them.

Apache Flume

Q: What is Flume?

A: Flume is a tool for collecting, aggregating, and moving large amounts of log data.

Q: What are Flume's components?

A: Source (data input), Channel (data storage), Sink (data output).
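
How the three components fit together can be sketched with a queue standing in for the channel. This is a toy pipeline in plain Python, not Flume configuration; the function name and the event layout are assumptions.

```python
from collections import deque


def run_agent(lines):
    """Move lines from a source, through a channel, to a sink."""
    channel = deque()   # buffer between source and sink
    delivered = []      # stands in for the destination
    for line in lines:  # source: turn each input line into an event
        channel.append({"body": line})
    while channel:      # sink: drain the channel in arrival order
        delivered.append(channel.popleft()["body"])
    return delivered


delivered = run_agent(["GET /", "GET /about"])
```

The channel decouples the two ends, so the source can keep accepting data even when the sink is temporarily slower.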

Q: What is a Flume agent?

A: A Flume agent is a JVM process that hosts sources, channels, and sinks; it is the basic unit of a data flow.

Q: What is a Memory Channel?

A: A Memory Channel buffers events in memory; it is fast, but buffered events are lost if the agent process dies.

Q: What is At-Least-Once delivery?

A: Each event is delivered at least once: an event is resent until its delivery is confirmed, so duplicates are possible but events are not lost.
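
The duplicates come from retries. In the sketch below, a sender repeats an event until it is acknowledged; when an acknowledgement is lost, the destination has already stored the event and receives it a second time. All names and the lost-ack scenario are hypothetical.

```python
def deliver_at_least_once(events, send, max_attempts=5):
    """Retry each event until the destination acknowledges it."""
    for event in events:
        for _ in range(max_attempts):
            if send(event):
                break
        else:
            raise RuntimeError(f"gave up on {event!r}")


received = []
lost_acks = {"b": 1}  # hypothetical: the first ack for "b" is lost


def flaky_send(event):
    received.append(event)           # the destination stores the event
    if lost_acks.get(event, 0) > 0:  # but the ack never reaches the sender
        lost_acks[event] -= 1
        return False
    return True


deliver_at_least_once(["a", "b", "c"], flaky_send)
```

Downstream consumers therefore need to tolerate or remove duplicates.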

Apache Samza

Q: What is Samza?

A: Samza is a tool for real-time stream processing.

Q: What is stream processing?

A: Processing data as it arrives in real time.

Q: What are Samza's components?

A: Streams (the input and output data), Jobs (the processing logic), and Tasks (the parallel units of a job, each of which processes part of the input streams).

Q: What is stateful processing?

A: Stateful processing keeps state (such as counts or aggregates) derived from earlier messages and uses it when handling new ones.
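
A running count per key is a small example: the output for each message depends on what was seen before. This is plain Python rather than the Samza API, and the function name is hypothetical.

```python
from collections import defaultdict


def running_counts(stream):
    """Emit (key, count so far) for every message in the stream."""
    state = defaultdict(int)  # state carried across messages
    outputs = []
    for key in stream:
        state[key] += 1
        outputs.append((key, state[key]))
    return outputs


outputs = running_counts(["a", "b", "a"])
```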

Q: What is a checkpoint?

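A: A checkpoint is a saved record of processing progress (for example, the last processed position in each stream) that is used to resume after a failure.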
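
Recovery from a checkpoint can be sketched as follows. The toy `process` function saves its position and running total after every message, and a restart picks up from the last saved values. The function, the `crash_at` parameter, and the checkpoint layout are assumptions made for this sketch, not the Samza API.

```python
def process(stream, checkpoint, crash_at=None):
    """Sum the stream from the checkpointed offset; optionally stop early to mimic a crash."""
    offset, total = checkpoint["offset"], checkpoint["total"]
    while offset < len(stream):
        if offset == crash_at:
            return checkpoint  # crash: only the last saved checkpoint survives
        total += stream[offset]
        offset += 1
        checkpoint = {"offset": offset, "total": total}  # save after each message
    return checkpoint


stream = [1, 2, 3, 4]
saved = process(stream, {"offset": 0, "total": 0}, crash_at=2)
final = process(stream, saved)  # restart from the saved checkpoint
```

The restarted run does not reprocess the first two messages, yet the final total matches an uninterrupted run.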
