Big Data One Shot
Unit-01+02+03+04+05

Unit-01
Ques→Explain the 5 Vs of Big Data in detail. How do they define the scope and complexity of modern data systems?
Ans:-) 1. Volume
→ Refers to the massive amount of data generated every second from various sources such as social media, sensors, transactions, machines, etc.
→ Organizations deal with terabytes to petabytes of data that traditional systems cannot manage efficiently.
→ High volume requires scalable storage and distributed processing systems like Hadoop and Spark.
2. Velocity
→ Describes the speed at which data is generated, collected, and processed.
→ Real-time or near-real-time processing is crucial for timely decision-making (e.g., fraud detection, recommendation engines).
→ Technologies like Kafka, Flink, and Storm are used to handle high-velocity data streams.
3. Variety
→ Represents the different forms of data: structured (databases), semi-structured (XML, JSON), and unstructured (videos, images, audio).
→ Handling diverse data types increases the complexity of storage, integration, and analysis.
→ Requires flexible systems capable of handling all formats.
4. Veracity
→ Relates to the trustworthiness and accuracy of data.
→ Data may be inconsistent, incomplete, or noisy, affecting the quality of insights.
→ Data cleansing and validation mechanisms are essential to improve reliability.
5. Value
→ Denotes the usefulness of data in making informed decisions.
→ The main goal is to extract meaningful insights that add business value.
→ Techniques like data mining, analytics, and machine learning help unlock this value.
Ques→What are the major types of digital data? Classify and explain with examples from real-world applications.
Ans:-) 1. Structured Data
→ Organized data that resides in fixed fields within records or files, typically in relational databases.
→ Easily searchable using SQL-based queries.
→ Examples:
– Bank transaction records (amount, date, account number)
– Employee details in an HR system (name, ID, department)
– Sensor data stored in rows and columns
2. Semi-Structured Data
→ Data that does not follow a strict tabular format but has a clear structure through tags or markers.
→ Can be stored in NoSQL databases and parsed using specific parsers.
→ Examples:
– JSON and XML files used in web APIs
– Email (with fields like To, From, Subject, Body)
– Logs from servers or IoT devices
3. Unstructured Data
→ Data without any predefined structure or format, making it difficult to process and analyze directly.
→ Requires advanced tools for storage, processing, and analysis (e.g., NLP, image recognition).
→ Examples:
– Social media posts (text, images, videos)
– Multimedia files (MP3, MP4, JPEG)
– Documents and PDFs containing mixed content
Ans:-)
Feature      | Conventional Data Systems  | Big Data Platforms
Data Volume  | Handles GBs to TBs of data | Capable of handling TBs to PBs and beyond
Data Variety | Primarily structured data  | Handles structured, semi-structured, and unstructured data
Ans:-)
Component | Role and Function
6. Analytical Data Store → Central repository for storing processed data from both batch and stream paths.
2. Informed Consent
→ Users are often unaware of how their data is being collected, stored, and analyzed.
→ Consent forms are either vague or buried in long privacy policies, violating ethical transparency.
2. Regulatory Compliance
→ Align data processing practices with laws like GDPR, HIPAA, CCPA, etc.
→ Include consent tracking, right-to-be-forgotten, and data minimization in system design.
1. Storage Advancements
→ Cheap and scalable storage (e.g., HDFS, Amazon S3) made it feasible to store massive datasets.
2. Distributed Computing
→ Technologies like Hadoop and Spark enabled parallel processing across clusters of machines.
3. Cloud Computing
→ Platforms like AWS, Google Cloud, and Azure made Big Data infrastructure more accessible and cost-effective.
4. NoSQL Databases
→ Support for semi-structured and unstructured data (e.g., MongoDB, Cassandra).
5. Open-source Ecosystem
→ Rapid development due to community-driven projects like Hadoop, Hive, Pig, and Flink.
6. Real-Time Processing
→ Tools like Apache Kafka and Storm allowed handling streaming data for real-time insights.
1. Customer-Centric Strategies
→ Companies use Big Data to understand consumer behavior, personalize marketing, and improve user experience.
2. Competitive Advantage
→ Data-driven insights help firms innovate faster and outperform competitors.
3. Operational Efficiency
→ Optimization of supply chains, manufacturing, and resource management using data analytics.
4. Risk Management
→ Fraud detection, predictive maintenance, and financial risk modeling using Big Data tools.
5. Monetization of Data
→ Businesses increasingly see data as a valuable asset that can be sold, traded, or used to generate revenue.
Category   | Analysis                                                    | Reporting
Definition | In-depth examination of data to discover patterns, trends. | Summarizing data to present facts in a readable format.
Purpose    | To extract insights, make predictions, and support decisions. |
Nature     | Exploratory, diagnostic, predictive, and prescriptive.     | Descriptive and historical in nature.
Output     | Actionable insights, recommendations, forecasts.           |

1. Decision-Making Support
4. Drives Personalization
→ Systems analyze user preferences and behaviors to provide tailored experiences.
➤ Apache Hadoop
➤ Apache Spark
➤ Apache Flink
➤ Google BigQuery
➤ Amazon EMR (Elastic MapReduce)
➤ Healthcare Industry
→ Big Data is used to analyze patient records, treatment histories, and diagnostic reports to improve healthcare outcomes and personalize treatment plans.
➤ Retail Industry
→ Big Data helps in customer behavior analysis, inventory management, and targeted marketing by analyzing large volumes of sales and customer interaction data.
Unit-02
Ques→What is Hadoop? Explain its history and the components of the Hadoop ecosystem.
Ans:-)
Hadoop:-
History of Hadoop
Core Components:
→ MapReduce: A programming model for processing large datasets in parallel across a Hadoop cluster.
→ Hadoop Common: Provides shared libraries and utilities used by other Hadoop components.
Supporting Ecosystem Tools:
→ Hive: Data warehousing tool that allows SQL-like queries on large datasets.
→ Pig: A high-level scripting language (Pig Latin) for analyzing large data sets.
→ HBase: A column-oriented NoSQL database built on top of HDFS.
→ Sqoop: Used for transferring data between Hadoop and relational databases.
→ Flume: Collects and transports large amounts of log or streaming data into HDFS.
→ Oozie: Workflow scheduler to manage Hadoop jobs.
→ Zookeeper: A coordination service for managing distributed applications.
→ Mahout: A library for scalable machine learning algorithms on Hadoop.
Ans:-)
→ Distributed Storage: Stores data across multiple machines to ensure scalability and parallel access.
→ Fault Tolerance: Data is automatically replicated across multiple nodes (default: 3 copies) to prevent data loss during hardware failure.
→ High Throughput: Optimized for large data sets with batch processing, providing high data access speed.
→ Scalability: Can scale up by adding more machines without changing the data or applications.
1. Architecture:
→ NameNode (Master): Stores metadata (file names, block locations, permissions) but not the actual data.
→ DataNodes (Slaves): Store the actual data blocks. They report to the NameNode periodically.
3. Accessing Data:
MapReduce
1. Input Splitting:
→ Input data is split into smaller chunks (blocks), which are processed in parallel.
2. Mapping Phase:
→ The Mapper function processes each split and produces intermediate key-value pairs.
3. Shuffling and Sorting:
→ The framework groups all values based on their keys and sends them to the Reducers.
4. Reducing Phase:
→ The Reducer processes each key and its list of values and generates the final output.
Input.txt:
Hello world
Hello Hadoop

Map Output:
("Hello", 1)
("world", 1)
("Hello", 1)
("Hadoop", 1)

Grouped:
("Hadoop", [1])
("Hello", [1, 1])
("world", [1])

Reduce Output:
("Hadoop", 1)
("Hello", 2)
("world", 1)

Final Result:
Hadoop 1
Hello 2
world 1
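
The phases above can be illustrated with a short, self-contained Scala sketch that simulates the map, shuffle/sort, and reduce steps in memory on the same Input.txt data (illustrative only; a real job would use the Hadoop MapReduce API rather than plain collections):

// Simulates the MapReduce word count phases in memory.
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val input = List("Hello world", "Hello Hadoop")          // Input.txt lines

    // Mapping phase: each line emits ("word", 1) pairs
    val mapped = input.flatMap(_.split("\\s+").map(w => (w, 1)))

    // Shuffle and sort: group all values by their key
    val grouped = mapped.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

    // Reducing phase: sum the list of values for each key
    val reduced = grouped.map { case (word, ones) => (word, ones.sum) }

    reduced.toSeq.sortBy(_._1).foreach { case (w, c) => println(s"$w\t$c") }
    // Prints: Hadoop 1, Hello 2, world 1
  }
}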
Ques→What is Shuffle and Sort in MapReduce?
Ans:-)
→ Ensures Correctness: The Reducer expects all values for a given key to arrive together; shuffle and sort guarantees this.
→ Optimizes Performance: Sorting reduces the complexity of the Reducer's job by providing already ordered data.
→ Enables Scalability: It allows Hadoop to process massive data sets across distributed nodes effectively.
Ans:-)
Feature          | Hadoop Streaming                            | Hadoop Pipes
Language Support | Any language that supports stdin and stdout | C++ only
Ease of Use      | Easy to write and debug                     | More complex to implement and debug

→ Scale Up: Upgrading a single machine with more powerful (high-end) hardware.
– Hardware Dependency: requires powerful (high-end) hardware.
– Cost: expensive due to high-end systems.
– Example: after upgrading the machine, the job finishes in 6 hours.
→ Scale Out: Adding more machines/nodes to the system.
– You add 3 more similar machines and distribute the task. Now the job finishes in 2.5 hours due to parallel processing, with tasks running simultaneously using the MapReduce model.
– As more data is added, instead of upgrading machines, Hadoop can add more nodes to the cluster to maintain performance.
– This parallel and distributed architecture enables Hadoop to handle petabytes of data efficiently, ensuring both scalability and fault tolerance.
Unit-03
Ques-Explain the design and architecture of HDFS.
Ans:-)
➤ Slave Nodes are the execution layer, where MapReduce jobs or other computation tasks run.
3. Communication
➤ The ResourceManager communicates with all NodeManagers to:
→ Assign jobs
→ Track task progress
→ Handle failures and reassign resources as needed
➤ NameNode (typically part of the Master) manages metadata about file storage.
➤ DataNodes (typically part of Slave Nodes) store the actual data blocks.
Ques-What are the challenges and benefits of using HDFS in big data environments?
Benefits:
➤ Scalability
→ HDFS can store and process petabytes of data by scaling across multiple machines.
➤ Fault Tolerance
→ Data is replicated across different nodes to ensure availability in case of failure.
➤ Cost-Effective Storage
→ Uses commodity hardware, reducing infrastructure costs.
➤ High Throughput
→ Optimized for batch processing and large datasets with sequential read/write operations.
➤ Data Locality Optimization
→ Moves computation close to where the data resides, reducing network traffic.
Challenges:
➤ Not Suitable for Small Files
→ HDFS is optimized for large files; managing a large number of small files is inefficient.
➤ Complexity in Management
→ Requires skilled administrators for setup, configuration, and monitoring.
➤ Security Concerns
→ By default, lacks strong authentication and encryption without additional configuration (like Kerberos).
➤ Single Point of Failure (NameNode)
→ If the NameNode fails (without HA configuration), the entire system is disrupted.
Ques-Discuss the role of Flume and Sqoop in data ingestion. How do they integrate with HDFS?
Apache Flume
➤ Purpose
→ Flume is designed for collecting, aggregating, and moving large volumes of log or event data from various sources to HDFS.
➤ How It Works
→ Flume uses a data flow structure with Source → Channel → Sink.
→ The Source collects data (e.g., from web servers),
→ The Channel temporarily stores it (like a buffer),
→ The Sink delivers the data to HDFS or other destinations.
➤ Integration with HDFS
→ The Flume sink writes data directly into HDFS using a specified directory path and format (e.g., text, sequence file).
Apache Sqoop
➤ Purpose
→ Sqoop is used for efficiently transferring structured data between relational databases (MySQL, Oracle, etc.) and HDFS.
➤ How It Works
→ Sqoop generates MapReduce jobs that import/export data in parallel.
→ For import, it extracts data from an RDBMS and writes into HDFS in formats like Avro, Parquet, or text.
→ For export, it takes HDFS data and pushes it back to a database table.
➤ Integration with HDFS
→ Sqoop directly writes the imported data into HDFS directories.
→ It also supports importing into Hive tables and HBase.
Ques-What are the various file-based data structures and serialization formats supported in Hadoop?
➤ Text Files
→ Simple flat files containing plain text data; used for small or basic datasets.
➤ Sequence Files
→ Binary files storing key-value pairs; useful for intermediate MapReduce data.
➤ MapFiles
→ Sorted SequenceFiles with an index for fast lookups.
➤ Avro Files
→ Row-based storage format; supports schema evolution and compact serialization.
➤ Parquet Files
→ Columnar storage format; optimized for complex data and analytical queries.
Serialization Formats:
➤ Writable
→ Default Hadoop serialization format for key-value pairs in MapReduce.
➤ Avro
→ Supports rich data structures, compact format, and schema definition in JSON.
➤ Protocol Buffers
→ Language-neutral format developed by Google; efficient and extensible.
➤ Thrift
→ Cross-language serialization and RPC framework used for data serialization.
➤ JSON and XML
→ Human-readable formats; used for configuration, logs, and lightweight data exchange.
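
To make Writable concrete, here is a minimal Scala sketch of a Writable serialization round-trip (a sketch assuming hadoop-common is on the classpath; the key-value pair is just an example):

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import org.apache.hadoop.io.{IntWritable, Text}

// Serializes a (Text, IntWritable) pair to bytes and reads it back,
// the same way MapReduce moves key-value pairs between phases.
object WritableSketch {
  def main(args: Array[String]): Unit = {
    val baos = new ByteArrayOutputStream()
    val out  = new DataOutputStream(baos)
    new Text("Hello").write(out)          // write key
    new IntWritable(2).write(out)         // write value
    out.flush()

    val in = new DataInputStream(new ByteArrayInputStream(baos.toByteArray))
    val k = new Text();        k.readFields(in)   // read key back
    val v = new IntWritable(); v.readFields(in)   // read value back
    println(s"${k.toString} -> ${v.get}")         // Hello -> 2
  }
}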
Hadoop Cluster
➤ Enable passwordless SSH between all nodes
➤ Configure hostnames and /etc/hosts for name resolution
➤ Test connectivity between nodes
➤ Enable file-level permissions in hdfs-site.xml
➤ Create Hadoop users and groups
➤ Set up directory ownership and access control
➤ Enable data encryption at rest using Transparent Data Encryption (TDE)
➤ Enable data encryption in transit using HTTPS or SASL
➤ Enable service-level authorization in core-site.xml
➤ Define policies in hadoop-policy.xml
➤ Enable audit logs for HDFS and other components
➤ Integrate with monitoring tools like Ambari or Prometheus
Write Operation:
➤ Step 1: Client Request
→ Client contacts the NameNode to request permission to write a file.
➤ Step 2: Block Allocation
→ NameNode checks the namespace and responds with the addresses of DataNodes for each block of the file.
➤ Step 3: Data Streaming
→ Client splits the file into blocks and starts sending the data to the first DataNode in the pipeline.
➤ Step 4: Data Pipelining
→ Each DataNode forwards the block to the next DataNode (replicas are created in the process).
➤ Step 5: Acknowledgment
→ Once all DataNodes have written the block, acknowledgments are sent back through the pipeline to the client.
Read Operation:
➤ Step 1: Client Request
→ Client sends a request to the NameNode for reading a file.
➤ Step 2: Metadata Response
→ NameNode responds with the locations of DataNodes containing the blocks of the requested file.
➤ Step 3: Data Fetching
→ Client directly contacts the nearest DataNode to read the data blocks.
➤ Step 4: Data Assembly
→ Blocks are fetched and reassembled by the client to form the complete file.
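
As a rough illustration of both flows, here is a minimal Scala sketch using the Hadoop FileSystem API (assumes hadoop-client on the classpath and fs.defaultFS pointing at the NameNode; the file path is hypothetical):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Writes a small file to HDFS and reads it back.
object HdfsReadWriteSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()        // picks up core-site.xml / hdfs-site.xml
    val fs   = FileSystem.get(conf)       // client asks the NameNode for metadata
    val file = new Path("/tmp/demo.txt")

    // Write: blocks are streamed to a DataNode pipeline behind the scenes
    val out = fs.create(file, true)
    out.write("Hello HDFS".getBytes("UTF-8"))
    out.close()                           // close waits for pipeline acknowledgments

    // Read: block locations come from the NameNode; data comes from DataNodes
    val in  = fs.open(file)
    val buf = new Array[Byte](1024)
    val n   = in.read(buf)
    in.close()
    println(new String(buf, 0, n, "UTF-8"))
    fs.close()
  }
}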
Unit-04
Ques→Explain the architecture of YARN and its role in Hadoop.
Ans:-)
YARN
➤ YARN stands for Yet Another Resource Negotiator.
➤ It is the resource management layer of the Hadoop ecosystem, introduced in Hadoop 2.x to improve scalability and performance.
➤ It decouples resource management and job scheduling from the MapReduce engine.
2. NodeManager (NM)
➤ Runs on each node in the cluster.
➤ Responsible for monitoring resources and container lifecycle management on the node.
➤ Reports node and container status to the ResourceManager.
3. ApplicationMaster (AM)
➤ Created for each application/job.
➤ Handles execution, task scheduling, and coordination of its specific job.
➤ Requests resources from the ResourceManager and interacts with NodeManagers.
4. Containers
➤ Logical units of resource allocation (e.g., memory + CPU).
➤ Actual tasks of an application run inside containers on different nodes.
Role in Hadoop:
➤ Resource Management
→ Efficient allocation of resources (CPU, memory) across the cluster.
➤ Multi-Framework Support
→ Allows Hadoop to run multiple processing engines (e.g., MapReduce, Apache Spark, Tez, Flink) on the same cluster.
➤ Improved Scalability
→ Decentralized application management supports thousands of concurrent applications.
➤ Better Utilization
→ Fine-grained and dynamic allocation improves cluster utilization.
➤ Fault Tolerance
→ Isolated application execution enables recovery without affecting others.
Ans:-)
NoSQL Databases
➤ NoSQL (Not Only SQL) databases are non-relational databases designed to handle large volumes of unstructured, semi-structured, or structured data.
➤ They provide high scalability, flexibility in data modeling, and are well-suited for distributed architectures and Big Data applications.
Ans:-)
MongoDB stores data in the form of documents, which are similar to JSON objects but use the BSON (Binary JSON) format internally. Document operations refer to actions that can be performed on these documents within a MongoDB collection.
➤ insertMany()
➤ Inserts multiple documents at once.
Example (sample documents, hypothetical values):
db.students.insertMany([
  { name: "Amit", marks: 85 },
  { name: "Riya", marks: 92 }
])
➤ find()
➤ Returns all documents that match a query.
Example:
db.students.find({ marks: 85 })
➤ findOne()
➤ Returns the first document that matches a query.
Example:
db.students.findOne({ name: "Amit" })
➤ Query Operators
➤ Use operators like $gt, $lt, $eq, etc.
Example:
db.students.find({ marks: { $gt: 80 } })
➤ updateOne()
➤ Updates the first document that matches the query.
Example:
db.students.updateOne(
  { name: "Amit" },
  { $set: { marks: 90 } }
)
➤ updateMany()
➤ Updates all matching documents.
Example:
db.students.updateMany(
  { marks: { $lt: 40 } },
  { $set: { status: "fail" } }
)
➤ replaceOne()
➤ Replaces the entire document with a new one.
Example:
db.students.replaceOne(
  { name: "Riya" },
  { name: "Riya Sharma", marks: 95 }
)
➤ deleteOne()
➤ Deletes the first document that matches the query.
Example:
db.students.deleteOne({ name: "Amit" })
➤ deleteMany()
➤ Deletes all documents matching the query.
Example (status value is hypothetical):
db.students.deleteMany({ status: "inactive" })
Other Operations
➤ countDocuments()
➤ Returns the count of documents that match a query.
Example:
db.students.countDocuments({ marks: { $gt: 80 } })
➤ sort()
➤ Sorts documents based on one or more fields.
Example:
db.students.find().sort({ marks: -1 })
➤ limit()
➤ Restricts the number of documents returned.
Example:
db.students.find().limit(2)
2. SparkContext Creation
➤ The Driver creates a SparkContext, which establishes a connection with the Cluster Manager (like YARN, Mesos, or Standalone).
8. Data Shuffling
➤ Between stages, data may need to be shuffled (redistributed across nodes) for operations like groupByKey or reduceByKey.
9. Job Completion
➤ After all tasks finish, the results are returned to the Driver or written to external storage (like HDFS, S3, DB).
10. Clean-Up
➤ Spark releases resources (executors) and the job ends.
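
A minimal Scala sketch of this flow, assuming spark-core is on the classpath (the local master and the sample pairs are illustrative only):

import org.apache.spark.{SparkConf, SparkContext}

// Runs a tiny job whose reduceByKey forces a shuffle between stages.
object SparkFlowSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("flow-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)               // Driver connects to the cluster manager

    val pairs  = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
    val counts = pairs.reduceByKey(_ + _)         // shuffle happens here
    counts.collect().foreach(println)             // results return to the Driver

    sc.stop()                                     // clean-up: release executors
  }
}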
Feature                  | MRv1                    | YARN (MRv2)
Support for Other Models | Only supports MapReduce | Supports MapReduce, Spark, Tez, Flink, etc.

➤ Better Scalability
→ MRv1's JobTracker became a bottleneck in large clusters. YARN decentralizes job scheduling, allowing thousands of jobs to run concurrently.
➤ Multi-Framework Support
→ YARN supports non-MapReduce processing models like Apache Spark, Tez, and Flink, enabling more flexibility and modern workloads.
➤ Improved Resource Utilization
→ YARN provides fine-grained control over resources (CPU, memory), allowing better utilization and reduced waste.
Ans:-)
➤ Hive
A data warehouse system built on Hadoop that allows querying and managing large datasets using a SQL-like language called HiveQL.
➤ Pig
A platform for analyzing large datasets using a high-level scripting language called Pig Latin.
➤ HBase
A NoSQL database built on HDFS that provides real-time read/write access to large datasets.
➤ Sqoop
Used to transfer bulk data between Hadoop and structured data stores like RDBMS.
➤ Flume
Used for collecting, aggregating, and moving large amounts of log data into HDFS.
➤ Oozie
A workflow scheduler system to manage Hadoop jobs.
➤ Zookeeper
A coordination service for distributed applications, ensuring synchronization across nodes.
➤ Mahout
A machine learning library for building scalable ML applications on top of Hadoop.
➤ Ambari
A web-based tool for provisioning, managing, and monitoring Hadoop clusters.
➤ First-Class and Higher-Order Functions
Functions are treated as values and can be assigned to variables, passed as parameters, or returned from other functions.
Example:
val square = (x: Int) => x * x
List(1, 2, 3).map(square)   // List(1, 4, 9)
➤ Immutable Data
Scala encourages using immutable variables (using val) and immutable collections to avoid side effects.
Example:
val nums = List(1, 2, 3)    // reassigning nums is a compile-time error
➤ Pure Functions
Functions that always produce the same output for the same input and have no side effects.
Example:
def add(a: Int, b: Int): Int = a + b
➤ Pattern Matching
A powerful feature to match data structures and decompose them. Similar to switch-case but more expressive.
Example:
x match { case 1 => "one"; case _ => "other" }
➤ Closures
Functions that capture variables from their surrounding scope.
Example:
var factor = 3
val multiplier = (i: Int) => i * factor   // captures factor
➤ Lazy Evaluation
Expressions are evaluated only when needed. Useful to improve performance and handle infinite data structures.
Example:
lazy val heavy = { println("computed"); 42 }   // evaluated on first use
Unit-05
Ques→Differentiate between MapReduce, Pig, and Hive.
Ans:-)
Feature    | MapReduce                                              | Pig                                           | Hive
Definition | Low-level programming model for processing large data | High-level scripting platform using Pig Latin | Data warehousing tool with SQL-like interface
1. Local Mode
➤ Execution Environment
→ Pig scripts run on a single local machine.
→ Data is read from and written to the local filesystem.
➤ Use Case
→ Ideal for testing, development, and small datasets.
2. MapReduce Mode
➤ Execution Environment
→ Pig scripts are translated into MapReduce jobs and run on a Hadoop cluster.
→ Data is processed using HDFS.
➤ Use Case
→ Best for large-scale production environments.
3. Tez Mode
➤ Execution Environment
→ Pig scripts are converted to tasks that run using the Apache Tez execution engine.
→ Tez is faster and more efficient than traditional MapReduce.
➤ Use Case
→ Used for improved performance in iterative and interactive Pig queries.
4. Spark Mode
➤ Execution Environment
→ Allows Pig scripts to execute on the Apache Spark engine.
➤ Use Case
→ Useful for users already using Spark and exploring faster execution.
➤ Status
→ Still experimental and less stable than MapReduce and Tez.
5. Grunt Shell (Interactive Mode)
➤ Execution Environment
→ Interactive command-line shell where users can write and execute Pig Latin statements step-by-step.
➤ Use Case
→ Helpful for debugging, testing, and learning Pig syntax.
➤ Command to Start → pig
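(As a quick reference, assuming a standard Pig installation, the execution mode is typically selected with the -x flag: pig -x local, pig -x mapreduce, pig -x tez, or pig -x spark.)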
Ans:-)
Architecture of Hive
➤ Thrift, JDBC, and ODBC Applications
→ Users interact with Hive using different interfaces.
→ These clients send queries to HiveServer2.
➤ JDBC/ODBC/Beeline Clients
→ Allow connections from Java/ODBC-based applications.
→ Beeline is the command-line interface for Hive.
➤ HiveServer2
→ Manages connections and handles requests from multiple clients.
→ Provides a secure and multi-user environment.
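
For instance, a minimal Scala sketch of a JDBC client talking to HiveServer2 (assumes hive-jdbc on the classpath; the host, port, and table name are hypothetical):

import java.sql.DriverManager

// Connects to HiveServer2 over JDBC and runs a simple query.
object HiveJdbcSketch {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
    val stmt = conn.createStatement()
    val rs   = stmt.executeQuery("SELECT COUNT(*) FROM students")
    while (rs.next()) println(rs.getLong(1))
    rs.close(); stmt.close(); conn.close()
  }
}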
➤ Driver
→ Acts like a controller, managing the execution flow of HiveQL statements.
➤ Compiler
→ Parses the query and converts it into a logical plan.
➤ Optimizer
→ Optimizes the query plan (e.g., rearranging joins, filtering early).
➤ Metastore
→ Stores metadata about tables, columns, data types, partitions, etc.
→ Uses a relational database like MySQL or Derby.
➤ MapReduce/YARN
→ Hive converts queries into MapReduce jobs (or Tez/Spark in newer setups).
→ YARN handles job scheduling and resource management.
➤ HDFS (Hadoop Distributed File System)
→ Actual data is stored in HDFS.
→ Hive only stores metadata; it queries the data in HDFS.
Working Flow
➤ Step 1: Query Submission
→ The user submits a HiveQL query through a client (Beeline/JDBC/ODBC).
➤ Step 2: HiveServer2 Receives the Query
→ Passes it to the Driver component for processing.
➤ Step 3: Compilation
→ The Compiler checks syntax, performs semantic analysis, and creates a query plan.
➤ Step 4: Optimization
→ The Optimizer improves the query plan (e.g., filters early, reduces shuffle).
➤ Step 5: Execution Plan Generation
→ The final physical plan is generated (as MapReduce, Tez, or Spark jobs).
➤ Step 6: Metadata Access
→ The Driver consults the Metastore for schema and partition information.
➤ Step 7: Job Execution via YARN
→ Jobs are submitted to the YARN resource manager for execution on the cluster.
➤ Step 8: Data Retrieval from HDFS
→ Input data is fetched from HDFS, processed, and results are written back to HDFS or sent to the client.
➤ Step 9: Result Return
→ The Driver collects the final output and sends it to the client.
Ques→What are the key features of HBase and how does it differ from a traditional RDBMS?
➤ Column-Oriented Storage
➤ Stores data in column families rather than rows, which is efficient for read/write operations on large datasets.
➤ Built on HDFS
➤ Uses the Hadoop Distributed File System (HDFS) as its storage layer, allowing it to handle massive volumes of data.
➤ Scalability
➤ Horizontally scalable across commodity hardware; new nodes can be added easily without downtime.
➤ Automatic Sharding
➤ Data is automatically split into regions, and each region is served by a RegionServer for load balancing.
➤ Strong Consistency
➤ Provides strong consistency on reads and writes per row.
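
To ground these features, here is a minimal Scala sketch using the HBase Java client API (assumes hbase-client on the classpath; the table and column names are hypothetical):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

// Writes one cell to a column family and reads the same row back.
object HBaseSketch {
  def main(args: Array[String]): Unit = {
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("students"))

    val put = new Put(Bytes.toBytes("row1"))
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Amit"))
    table.put(put)                          // strongly consistent write to the row

    val result = table.get(new Get(Bytes.toBytes("row1")))
    println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))))

    table.close(); conn.close()
  }
}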
HiveQL Examples:
1. Creating a Table
CREATE TABLE students (
  id INT,
  name STRING,
  marks INT
);
2. Loading Data into a Table
LOAD DATA LOCAL INPATH 'students.txt' INTO TABLE students;   -- path is illustrative
6. Group By (Grouping)
SELECT COUNT(*) FROM students;
SELECT AVG(marks) FROM students GROUP BY name;   -- grouping column is illustrative
7. Join Example
SELECT s.name, m.marks
FROM students s
JOIN marks_table m ON (s.id = m.student_id);
Feature          | HiveQL                          | Traditional SQL
Execution Engine | Converts to MapReduce/Tez/Spark | Direct execution on RDBMS
Speed            | Slower due to batch processing  | Fast due to indexing and transaction support
Ans:-)
➤ Apache Zookeeper is a centralized service used for maintaining configuration information, naming, synchronization, and group services in distributed systems.
➤ It is a coordination service that allows distributed applications to work together by providing reliable and consistent services.
Key Features
➤ High Availability
→ Ensures that configuration and coordination data are always available across nodes.
➤ Consistency
→ All clients see the same view of the system, even during updates.
➤ Reliability
→ Uses replication and logging to ensure data durability and recovery.
➤ Fast Reads
→ Read operations are very fast as they are served from memory.
Architecture
➤ Client-Server Model
→ Clients (applications) interact with the Zookeeper service to read or update data.
➤ Zookeeper Ensemble
→ A group of servers (usually odd in number, like 3, 5, or 7) forms the Zookeeper cluster.
→ One acts as the Leader, and the others are Followers.
➤ ZNodes
→ The data in Zookeeper is stored in a hierarchical tree-like structure, similar to a file system.
→ Each node is called a ZNode, and it can hold data and children.
Core Services
➤ Naming Service
→ Helps identify nodes in a cluster using unique names (ZNode paths).
➤ Configuration Management
→ Stores configuration files that can be accessed or updated by clients.
➤ Synchronization
→ Helps manage locks, barriers, and other synchronization primitives in distributed applications.
➤ Leader Election
→ Helps elect a master node among a group of nodes, which is critical for high availability.
➤ HBase
→ Uses Zookeeper for region server coordination, failover, and leader election.
➤ Kafka
→ Uses Zookeeper for broker coordination, topic configuration, and consumer offsets.
➤ YARN (Hadoop)
→ Can use Zookeeper for High Availability of the ResourceManager.
Limitations
➤ Not suitable for large-scale data storage.
➤ Designed for coordination and small configuration data, not heavy workloads.
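
A minimal Scala sketch of the client-server model using the Apache ZooKeeper Java client (assumes the zookeeper client library on the classpath; the ensemble address and ZNode path are hypothetical):

import org.apache.zookeeper.{CreateMode, ZooDefs, ZooKeeper}

// Creates a ZNode holding a small piece of configuration data and reads it back.
object ZkSketch {
  def main(args: Array[String]): Unit = {
    // Connect to the ensemble; the watcher ignores events for brevity
    val zk = new ZooKeeper("localhost:2181", 5000, _ => ())

    val path = zk.create("/demo-config", "v1".getBytes("UTF-8"),
      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)

    val data = zk.getData(path, false, null)  // fast read, served from memory
    println(new String(data, "UTF-8"))
    zk.close()
  }
}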
➤ Structured Data
→ Data that follows a strict schema (rows and columns).
→ Example: Tables from RDBMS like MySQL, Oracle.
→ Handled efficiently using HiveQL for querying and analysis.
➤ Unstructured Data
→ Data without a predefined model or organization.
→ Example: Text files, logs, images, videos.
→ Hive can process and query log/text files using custom input formats.
➤ Time-Stamped Data
→ Logs or records with time-stamped events (e.g., clickstreams, sensor data).
→ Suitable for partitioning by date/time fields to improve performance.
→ Common in web analytics and monitoring applications.
➤ Delimited Files
→ CSV, TSV, and other delimited files.
→ Hive can directly load and query such files using simple table definitions.
Ques→Describe schema.
Types of Schema
➤ 1. Structured Schema
→ Predefined schema with a fixed format (e.g., RDBMS).
→ Example: Tables with specific columns like ID (int), Name (string), Age (int).
➤ 3. Schema-on-Write
→ Schema is applied when data is written to storage (e.g., RDBMS, Hive).
→ Data must match the schema upfront.
➤ 4. Schema-on-Read
→ Schema is applied when data is read or queried (common in Hadoop).
→ Example: Hive or Pig accessing raw log files.
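
A minimal Scala sketch contrasting schema-on-read: raw lines are stored as-is, and the schema (here a case class) is applied only at read time (the fields are illustrative):

// Raw text is stored without a schema; structure is imposed when reading.
object SchemaOnReadSketch {
  case class Student(id: Int, name: String, age: Int)   // schema applied on read

  def main(args: Array[String]): Unit = {
    val rawLines = List("1,Amit,21", "2,Riya,22")        // stored as plain text
    val students = rawLines.map { line =>
      val Array(id, name, age) = line.split(",")
      Student(id.toInt, name, age.toInt)
    }
    students.foreach(println)
  }
}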
Importance of Schema
➤ Helps in data validation and consistency.
➤ Enables efficient querying and data processing.
➤ Assists in data interpretation and metadata management.