BIG DATA ANALYTICS WITH APACHE HADOOP
Dr. Emmanuel S. Pilli
Malaviya NIT Jaipur

Data 2020 / Data 2021 / Data 2023 (data-growth figure slides)
Gartner Hype Cycle 2014 – 2023 (figure slides)
3 V's of Big Data
• Volume
• Velocity
• Variety

5 V's of Big Data
• Veracity
• Value

Another 5?
• Validity
• Variability
• Venue
• Vocabulary
• Vagueness

DIKUW (figure slide)
Ordering Pizza in the Future (figure slide)
DIKW (figure slide)

Motivation
• Storage as NAS (Network Attached Storage) and SAN (Storage Area Network)
• Data is moved to compute nodes at compute time
Core Principles Motivation
• Scale-Out rather than Scale-Up • Modern systems have to deal with huge data
which has inherent value, and cannot be
• Bring code to data rather than data to
discarded
code
• Getting the data to the processors becomes
• Deal with failures – they are common the bottleneck
• Abstract complexity of distributed and • The system must support partial failure
concurrent applications • Adding load to the system should result in a
graceful decline in performance
Motivation
• Ability to scale out to Petabytes in size using commodity hardware
• Moore's Law can't keep up with data growth
• Processing (MapReduce) jobs are sent to the data versus shipping the data to be processed
• Hadoop doesn't impose a single data format, so it can easily handle structured, semi-structured and unstructured data
• Manages fault tolerance and data replication automatically

Scale-Out rather than Scale-Up
• It is harder and more expensive to Scale-Up
  • Add additional resources to an existing node (CPU, RAM)
  • New units must be purchased if the required resources cannot be added
  • Also known as scaling vertically
• Scale-Out
  • Add more nodes/machines to an existing distributed application
  • The software layer is designed for node additions or removal
  • Hadoop takes this approach – a set of nodes are bonded together as a single distributed system
  • Very easy to scale down as well
Bring Code to Data rather than Data to Code
• Hadoop co-locates processors and storage
• Code is moved to data (size is tiny, usually in KBs)
• Processors execute code and access underlying local storage

Hadoop Distributed File System
• Block Size = 64 MB
• Replication Factor = 3
• Cost is $400 – $500/TB
(figure: file blocks 1 – 5 replicated across DataNodes)
Hadoop is designed to cope with node failures
• If a node fails, the master will detect that failure and re-assign the work to a different node on the system
• Restarting a task does not require communication with nodes working on other portions of the data
• If a failed node restarts, it is automatically added back to the system and assigned new tasks
• If a node appears to be running slowly, the master can redundantly execute another instance of the same task
  • Results from the first to finish will be used

Abstract complexity
• Hadoop abstracts many complexities in distributed and concurrent applications
  • Defines a small number of components
  • Provides simple and well defined interfaces of interaction between these components
• Frees the developer from worrying about system-level challenges
  • race conditions, data starvation
  • processing pipelines, data partitioning, code distribution
• Allows developers to focus on application development and business logic
Apache Hadoop
• Open source (using the Apache license)
• Around 40 core Hadoop committers from companies like Yahoo!, Facebook, Apple, Cloudera, and more
• Hundreds of contributors writing features, fixing bugs and other ecosystem products

Hadoop Use Cases

Industry         Advanced Analytics Use Case      Data Processing Use Case
Web              Social Network Analysis          Clickstream Sessionization
Media            Content Optimization             Clickstream Sessionization
Telco            Network Analytics                Mediation
Retail           Loyalty & Promotions Analysis    Data Factory
Financial        Fraud Analysis                   Trade Reconciliation
Federal          Entity Analysis                  SIGINT
Bioinformatics   Sequencing Analysis              Genome Mapping
Three papers by Google
• Google File System
• MapReduce
• BigTable

Hadoop Ecosystem (figure slide)
Research Periodic Table (figure slide)

Hadoop Ecosystem
• Pig: A high-level data-flow language and execution framework for parallel computation of large datasets.
• Hive: A distributed data warehouse query language based on SQL that provides data summarization and ad hoc querying.
• Cassandra: A scalable multi-master database with no single points of failure.
• Mahout: A scalable machine learning and data mining library.
• Chukwa: A data collection system for managing large distributed systems.
Hadoop Ecosystem
• Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
• Hadoop MapReduce: A distributed parallel data processing model and execution environment for handling large data sets.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• HBase: A scalable, distributed, column-oriented database that supports structured data storage for large tables.

Hadoop Ecosystem
• Flume: A distributed, reliable, available service for efficiently moving large amounts of data as it is produced.
• Sqoop: A tool for efficiently moving data between relational databases and HDFS (SQL to Hadoop).
• Oozie: A tool that provides a way for developers to define an entire workflow.
• Hue: Graphical front-end to developer and administrator functionality.
• ZooKeeper: A distributed, highly available coordination service.
Hadoop Ecosystem
• Avro: A data serialization system.
• Spark: A fast and general compute engine for Hadoop data supporting machine learning, stream processing, and graph computation.
• Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HBase, ZooKeeper, Oozie, Pig and Sqoop.
• Hadoop Ecosystem Table on GitHub

Hadoop Core Components / Hadoop Ecosystem (figure slides)
• Component categories shown: High Level Data Processing, Data Analytic Components, NoSQL, Data Serialization Components, Data Transfer Components, Management Components, Monitoring Components
Scalable: Grows without changes (figure slide)

Benefit of Agility / Flexibility

Schema-on-Write (RDBMS)
• Schema must be created before any data can be loaded.
• An explicit load operation has to take place which transforms data to the DB-internal structure.
• New columns must be added explicitly before new data for such columns can be loaded into the database.
• Read is fast.
• Standards / Governance

Schema-on-Read (Hadoop)
• Data is simply copied to the file store; no transformation is needed.
• A SerDe (Serializer / Deserializer) is applied during read time to extract the required columns (late binding).
• New data can start flowing in anytime and will appear retroactively once the SerDe is updated to parse it.
• Load is fast.
• Flexibility / Agility
Data beats Algorithms / Original Raw Data / Keep All Data Alive Forever (figure slides)

Hadoop HDFS
• The Hadoop Distributed File System is responsible for storing data on the cluster
• Data files are split into blocks and distributed across multiple nodes in the cluster
• Each block is replicated multiple times
  • Default is to replicate each block three times
  • Replicas are stored on different nodes
  • This ensures both reliability and availability
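To make the block-and-replica model concrete, here is a minimal sketch (not from the slides) using the standard org.apache.hadoop.fs.FileSystem Java API to write a small file and print its block size, replication factor and replica locations. The path /user/demo/sample.txt and the assumption of a configured, reachable cluster are purely illustrative.

  // Minimal sketch: write a file to HDFS and inspect its blocks and replicas.
  // Assumes a cluster configured via core-site.xml / hdfs-site.xml on the classpath;
  // the path /user/demo/sample.txt is hypothetical.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.*;

  public class HdfsBlockInfo {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();          // picks up *-site.xml settings
          FileSystem fs = FileSystem.get(conf);

          Path file = new Path("/user/demo/sample.txt");
          try (FSDataOutputStream out = fs.create(file, true)) {
              out.writeUTF("hello hdfs");                    // HDFS chunks the data into blocks
          }

          FileStatus status = fs.getFileStatus(file);
          System.out.println("block size  = " + status.getBlockSize());
          System.out.println("replication = " + status.getReplication());

          // Each block reports the DataNodes holding its replicas
          for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
              System.out.println("block @" + loc.getOffset() + " on " + String.join(",", loc.getHosts()));
          }
          fs.close();
      }
  }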
Hadoop HDFS (figure slide)

Right Tool for the Right Job

Relational Databases – use when:
• Interactive OLAP Analytics (< 1 sec)
• Multistep ACID Transactions
• 100% SQL Compliance

Hadoop – use when:
• Structured or Not (Flexibility)
• Scalability of Storage / Compute
• Complex Data Processing
Hadoop HDFS (figure slide)

Five Hadoop Daemons (diagram)
• Client: sends the job .jar and .xml configuration
• Master Node: NameNode (file location info) and JobTracker (determines the execution plan)
• Slave Nodes: TaskTrackers run Map and Reduce tasks; DataNodes read and write HDFS files
HDFS Architecture (figure slide)

Hadoop Daemons
Hadoop comprises five separate daemons:
• NameNode: Holds the metadata for HDFS
• Secondary NameNode
  – Performs housekeeping functions for the NameNode
  – Is not a backup or hot standby for the NameNode!
• DataNode: Stores actual HDFS data blocks
• JobTracker: Manages MapReduce jobs, distributes individual tasks
• TaskTracker: Responsible for instantiating and monitoring individual Map and Reduce tasks
Functions of Name Node
• It is the master daemon that maintains and manages the DataNodes (slave nodes)
• It records the metadata of all the files stored in the cluster, e.g. the location of blocks stored, the size of the files, permissions, hierarchy, etc. There are two files associated with the metadata:
  • FsImage: Complete state of the file system namespace since the start of the NameNode.
  • EditLogs: All the recent modifications made to the file system with respect to the most recent FsImage.
• It records each change that takes place to the file system metadata.

Functions of Data Node
• These are slave daemons or processes which run on each slave machine.
• The actual data is stored on DataNodes.
• The DataNodes perform the low-level read and write requests from the file system's clients.
• They send heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds.
Functions of Name Node (continued)
• It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are alive.
• It keeps a record of all the blocks in HDFS and the nodes on which these blocks are located.
• The NameNode is also responsible for maintaining the replication factor.
• In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage and manages the communication traffic to the DataNodes.
Blocks Replication (figure slide)

Secondary Name Node
• The Secondary NameNode constantly reads the file system state and metadata from the RAM of the NameNode and writes it to the hard disk or the file system.
• It is responsible for combining the EditLogs with the FsImage from the NameNode.
• It downloads the EditLogs from the NameNode at regular intervals and applies them to the FsImage.
• The new FsImage is copied back to the NameNode and is used the next time the NameNode starts.
HDFS File Read / HDFS File Write / HDFS Write Pipeline / Setting up HDFS Pipeline / HDFS Write Acknowledgement / HDFS Read / Blocks Replication (figure slides)
Hadoop MapReduce
• MapReduce is the system used to process data in the Hadoop cluster
• Consists of two phases: Map, and then Reduce
• Each Map task operates on a discrete portion (one HDFS block) of the overall dataset
• The MapReduce system distributes the intermediate data to the nodes which perform the Reduce phase (see the word-count sketch below)

Hadoop MapReduce (figure slides)
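To illustrate the two phases described above, here is a minimal word-count sketch (not taken from the slides) written against the standard org.apache.hadoop.mapreduce API; the class names and input/output paths are illustrative. Each map task processes one input split (typically one HDFS block), and the framework shuffles the intermediate (word, 1) pairs to the reducers.

  // Minimal word-count sketch of the Map and Reduce phases (illustrative names).
  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

    // Map phase: each task sees one input split (typically one HDFS block)
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();
      @Override
      protected void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
          if (!token.isEmpty()) { word.set(token); ctx.write(word, ONE); }
        }
      }
    }

    // Reduce phase: receives all intermediate counts for a given word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context ctx) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        ctx.write(key, new IntWritable(sum));
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenMapper.class);
      job.setCombinerClass(SumReducer.class);
      job.setReducerClass(SumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. an HDFS input directory
      FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output path must not already exist
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }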
Sqoop
• SQL to Hadoop
• Tool to import/export any JDBC-supported database into Hadoop
• Transfers data between Hadoop and external databases or EDW
• High-performance connectors for some RDBMS
• Developed at Cloudera

Flume
• Distributed, reliable, available service for efficiently moving large amounts of data as it is produced
• Suited for gathering logs from multiple systems and inserting them into HDFS as they are generated
• Design goals: Reliability, Scalability, Manageability, Extensibility
• Developed at Cloudera

Flume Architecture (diagram: Master, Agents, Processors, Collectors)
• Master sends configuration to all Agents
• Deployable, centrally administered
• Configurable levels of reliability; guarantees delivery in the event of failure
• Optionally pre-process incoming data: perform transformations, suppressions, metadata enrichment
• Agents can compress, batch and encrypt data in flight
• Flexibly deploy decorators at any step to improve performance, reliability or security
• Writes to multiple HDFS file formats (text, sequence, JSON, Avro, others)
• Parallelized writes across many collectors – as much write throughput as required

HBase
• Column-family store, based on the design of Google BigTable
• Provides interactive access to information
• Holds extremely large datasets (multi-TB)
• Constrained access model
  • (key, value) lookup
  • Limited transactions (only one row)
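As a hedged illustration of HBase's row-keyed (key, value) access model and single-row atomicity, the sketch below uses the standard HBase Java client to write and read one row. It assumes a running cluster and a pre-created table 'users' with column family 'info'; both names are hypothetical, not from the slides.

  // Minimal sketch of HBase Put/Get via the standard Java client.
  // Assumes a reachable cluster and an existing table 'users' with family 'info' (illustrative names).
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.*;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HBaseRowDemo {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();           // reads hbase-site.xml
          try (Connection conn = ConnectionFactory.createConnection(conf);
               Table table = conn.getTable(TableName.valueOf("users"))) {

              // Put: single-row write keyed by row key (row-level atomicity only)
              Put put = new Put(Bytes.toBytes("user#1001"));
              put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
              table.put(put);

              // Get: (key, value) lookup by row key
              Result result = table.get(new Get(Bytes.toBytes("user#1001")));
              byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
              System.out.println("name = " + Bytes.toString(name));
          }
      }
  }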
HBase Architecture (figure slide)

Pig
• Data-flow oriented language – "Pig Latin"
• Datatypes include sets, associative arrays, tuples
• High-level language for routing data; allows easy integration of Java for complex tasks
• Example:

  emps = LOAD 'people.txt' AS (id, name, salary);
  rich = FILTER emps BY salary > 100000;
  srtd = ORDER rich BY salary DESC;
  STORE srtd INTO 'rich_people.txt';
Oozie
• Oozie is a workflow/coordination service to manage data processing jobs for Hadoop

Hive
• SQL-based data warehousing application
• Language is SQL-like
  • Supports SELECT, JOIN, GROUP BY, etc.
• Features for analyzing very large data sets
  • Partition columns, sampling, buckets
• Example (the second table, aliased k, is assumed to be a word-count table named kjv, as in the classic Cloudera training exercise):

  SELECT s.word, s.freq, k.freq
  FROM shakespeare s JOIN kjv k ON (s.word = k.word)
  WHERE s.freq >= 5;
ZooKeeper
• ZooKeeper is a distributed consensus engine
• Provides well-defined concurrent access semantics:
  • Leader election
  • Service discovery
  • Distributed locking / mutual exclusion
  • Message board / mailboxes
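As a sketch of how ZooKeeper's primitives support leader election, the example below uses the raw ZooKeeper Java client to create an ephemeral sequential znode and treat the lowest-numbered candidate as the leader. The ensemble address localhost:2181 and the pre-existing /election parent znode are assumptions made for illustration.

  // Minimal leader-election sketch with the raw ZooKeeper Java client.
  // Assumes an ensemble at localhost:2181 and an existing /election parent znode (illustrative).
  import java.util.Collections;
  import java.util.List;
  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.ZooDefs;
  import org.apache.zookeeper.ZooKeeper;

  public class LeaderElectionDemo {
      public static void main(String[] args) throws Exception {
          ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> { /* ignore watch events for brevity */ });

          // Each candidate creates an ephemeral, sequential znode; it disappears if the client dies.
          String me = zk.create("/election/candidate-", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

          // The candidate whose znode has the smallest sequence number is the leader.
          List<String> candidates = zk.getChildren("/election", false);
          Collections.sort(candidates);
          boolean leader = me.endsWith(candidates.get(0));
          System.out.println(me + (leader ? " -> I am the leader" : " -> following " + candidates.get(0)));

          Thread.sleep(5_000);   // hold the session briefly so the znode stays visible
          zk.close();            // the ephemeral znode is removed, letting the others re-elect
      }
  }

Because the znode is ephemeral, a crashed leader's entry vanishes with its session, so the remaining candidates can simply re-read the children of /election to pick a new leader.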
Any Questions and Thanks