BIG DATA ANALYTICS WITH APACHE HADOOP
Dr. Emmanuel S. Pilli
Malaviya NIT Jaipur

Data 2020 / Data 2021 / Data 2023 (data-growth figure slides)
Gartner Hype Cycle 2014 – 2023 (figure slides)
3 V's of Big Data
• Volume
• Velocity
• Variety

5 V's of Big Data
• Veracity
• Value

Another 5?
• Validity
• Variability
• Venue
• Vocabulary
• Vagueness

DIKUW (figure slide)
Ordering Pizza in the Future (figure slide)
DIKW (figure slide)

Motivation
• Storage as NAS (Network Attached Storage) and SAN (Storage Area Network)
• Data is moved to compute nodes at compute time
Core Principles Motivation
• Scale-Out rather than Scale-Up • Modern systems have to deal with huge data
which has inherent value, and cannot be
• Bring code to data rather than data to
discarded
code
• Getting the data to the processors becomes
• Deal with failures – they are common the bottleneck
• Abstract complexity of distributed and • The system must support partial failure
concurrent applications • Adding load to the system should result in a
graceful decline in performance
Motivation
• Ability to scale out to Petabytes in size using commodity hardware
• Moore's Law can't keep up with data growth
• Processing (MapReduce) jobs are sent to the data versus shipping the data to be processed
• Hadoop doesn't impose a single data format, so it can easily handle structured, semi-structured and unstructured data
• Manages fault tolerance and data replication automatically

Scale-Out rather than Scale-Up
• It is harder and more expensive to Scale-Up
  • Add additional resources to an existing node (CPU, RAM)
  • New units must be purchased if the required resources cannot be added
  • Also known as scaling vertically
• Scale-Out
  • Add more nodes/machines to an existing distributed application
  • The software layer is designed for node additions or removal
  • Hadoop takes this approach – a set of nodes are bonded together as a single distributed system
  • Very easy to scale down as well
Bring Code to Data rather than Data to Code
• Hadoop co-locates processors and storage
• Code is moved to data (size is tiny, usually in KBs)
• Processors execute code and access underlying local storage

Hadoop Distributed File System
• Block Size = 64 MB
• Replication Factor = 3
• Cost is $400 – $500/TB
(figure: file blocks 1 – 5 replicated across DataNodes)
Hadoop is designed to cope with node failures
• If a node fails, the master will detect that failure and re-assign the work to a different node on the system
• Restarting a task does not require communication with nodes working on other portions of the data
• If a failed node restarts, it is automatically added back to the system and assigned new tasks
• If a node appears to be running slowly, the master can redundantly execute another instance of the same task
  • Results from the first to finish will be used

Abstract complexity
• Hadoop abstracts many complexities in distributed and concurrent applications
  • Defines a small number of components
  • Provides simple and well defined interfaces of interaction between these components
• Frees the developer from worrying about system-level challenges
  • race conditions, data starvation
  • processing pipelines, data partitioning, code distribution
• Allows developers to focus on application development and business logic
Apache Hadoop
• Open source (using the Apache license)
• Around 40 core Hadoop committers from companies like Yahoo!, Facebook, Apple, Cloudera, and more
• Hundreds of contributors writing features, fixing bugs and other ecosystem products

Hadoop Use Cases

Industry         Advanced Analytics Use Case      Data Processing Use Case
Web              Social Network Analysis          Clickstream Sessionization
Media            Content Optimization             Clickstream Sessionization
Telco            Network Analytics                Mediation
Retail           Loyalty & Promotions Analysis    Data Factory
Financial        Fraud Analysis                   Trade Reconciliation
Federal          Entity Analysis                  SIGINT
Bioinformatics   Sequencing Analysis              Genome Mapping
Three papers by Google
• Google File System
• MapReduce
• BigTable

Hadoop Ecosystem (figure slide)
Research Periodic Table (figure slide)

Hadoop Ecosystem
• Pig: A high-level data-flow language and execution framework for parallel computation of large datasets.
• Hive: A distributed data warehouse query language based on SQL that provides data summarization and ad hoc querying.
• Cassandra: A scalable multi-master database with no single points of failure.
• Mahout: A scalable machine learning and data mining library.
• Chukwa: A data collection system for managing large distributed systems.
Hadoop Ecosystem
• Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
• Hadoop MapReduce: A distributed parallel data processing model and execution environment for handling large data sets.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• HBase: A scalable, distributed, column-oriented database that supports structured data storage for large tables.

Hadoop Ecosystem
• Flume: A distributed, reliable, available service for efficiently moving large amounts of data as it is produced.
• Sqoop: A tool for efficiently moving data between relational databases and HDFS (SQL to Hadoop).
• Oozie: A tool that provides a way for developers to define an entire workflow.
• Hue: Graphical front-end to developer and administrator functionality.
• ZooKeeper: A distributed, highly available coordination service.
Hadoop Ecosystem
• Avro: A data serialization system.
• Spark: A fast and general compute engine for Hadoop data supporting machine learning, stream processing, and graph computation.
• Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for HDFS, MapReduce, Hive, HBase, ZooKeeper, Oozie, Pig and Sqoop.
• Hadoop Ecosystem Table on GitHub

Hadoop Core Components / Hadoop Ecosystem (figure slides)
• Component categories shown: High Level Data Processing, Data Analytic Components, NoSQL, Data Serialization Components, Data Transfer Components, Management Components, Monitoring Components
Scalable: Grows without changes (figure slide)

Benefit of Agility / Flexibility

Schema-on-Write (RDBMS)
• Schema must be created before any data can be loaded.
• An explicit load operation has to take place which transforms data to the DB-internal structure.
• New columns must be added explicitly before new data for such columns can be loaded into the database.
• Read is fast.
• Standards / Governance

Schema-on-Read (Hadoop)
• Data is simply copied to the file store; no transformation is needed.
• A SerDe (Serializer / Deserializer) is applied during read time to extract the required columns (late binding).
• New data can start flowing in anytime and will appear retroactively once the SerDe is updated to parse it.
• Load is fast.
• Flexibility / Agility
Data beats Algorithms / Original Raw Data / Keep All Data Alive Forever (figure slides)

Hadoop HDFS
• The Hadoop Distributed File System is responsible for storing data on the cluster
• Data files are split into blocks and distributed across multiple nodes in the cluster
• Each block is replicated multiple times
  • Default is to replicate each block three times
  • Replicas are stored on different nodes
  • This ensures both reliability and availability
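To make the block-and-replica model concrete, here is a minimal sketch (not from the slides) using the standard org.apache.hadoop.fs.FileSystem Java API to write a small file and print its block size, replication factor and replica locations. The path /user/demo/sample.txt and the assumption of a configured, reachable cluster are purely illustrative.

  // Minimal sketch: write a file to HDFS and inspect its blocks and replicas.
  // Assumes a cluster configured via core-site.xml / hdfs-site.xml on the classpath;
  // the path /user/demo/sample.txt is hypothetical.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.*;

  public class HdfsBlockInfo {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();          // picks up *-site.xml settings
          FileSystem fs = FileSystem.get(conf);

          Path file = new Path("/user/demo/sample.txt");
          try (FSDataOutputStream out = fs.create(file, true)) {
              out.writeUTF("hello hdfs");                    // HDFS chunks the data into blocks
          }

          FileStatus status = fs.getFileStatus(file);
          System.out.println("block size  = " + status.getBlockSize());
          System.out.println("replication = " + status.getReplication());

          // Each block reports the DataNodes holding its replicas
          for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
              System.out.println("block @" + loc.getOffset() + " on " + String.join(",", loc.getHosts()));
          }
          fs.close();
      }
  }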
Hadoop HDFS (figure slide)

Right Tool for the Right Job

Relational Databases – use when:
• Interactive OLAP Analytics (< 1 sec)
• Multistep ACID Transactions
• 100% SQL Compliance

Hadoop – use when:
• Structured or Not (Flexibility)
• Scalability of Storage / Compute
• Complex Data Processing
Hadoop HDFS (figure slide)

Five Hadoop Daemons (diagram)
• Client: sends the job .jar and .xml configuration
• Master Node: NameNode (file location info) and JobTracker (determines the execution plan)
• Slave Nodes: TaskTrackers run Map and Reduce tasks; DataNodes read and write HDFS files
HDFS Architecture (figure slide)

Hadoop Daemons
Hadoop comprises five separate daemons:
• NameNode: Holds the metadata for HDFS
• Secondary NameNode
  – Performs housekeeping functions for the NameNode
  – Is not a backup or hot standby for the NameNode!
• DataNode: Stores actual HDFS data blocks
• JobTracker: Manages MapReduce jobs, distributes individual tasks
• TaskTracker: Responsible for instantiating and monitoring individual Map and Reduce tasks
Functions of Name Node
• It is the master daemon that maintains and manages the DataNodes (slave nodes)
• It records the metadata of all the files stored in the cluster, e.g. the location of blocks stored, the size of the files, permissions, hierarchy, etc. There are two files associated with the metadata:
  • FsImage: Complete state of the file system namespace since the start of the NameNode.
  • EditLogs: All the recent modifications made to the file system with respect to the most recent FsImage.
• It records each change that takes place to the file system metadata.

Functions of Data Node
• These are slave daemons or processes which run on each slave machine.
• The actual data is stored on DataNodes.
• The DataNodes perform the low-level read and write requests from the file system's clients.
• They send heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds.
Functions of Name Node (continued)
• It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are alive.
• It keeps a record of all the blocks in HDFS and the nodes on which these blocks are located.
• The NameNode is also responsible for maintaining the replication factor.
• In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage and manages the communication traffic to the DataNodes.
Blocks Replication (figure slide)

Secondary Name Node
• The Secondary NameNode constantly reads the file system state and metadata from the RAM of the NameNode and writes it to the hard disk or the file system.
• It is responsible for combining the EditLogs with the FsImage from the NameNode.
• It downloads the EditLogs from the NameNode at regular intervals and applies them to the FsImage.
• The new FsImage is copied back to the NameNode and is used the next time the NameNode starts.
HDFS File Read / HDFS File Write / HDFS Write Pipeline / Setting up HDFS Pipeline / HDFS Write Acknowledgement / HDFS Read / Blocks Replication (figure slides)
Hadoop MapReduce
• MapReduce is the system used to process data in the Hadoop cluster
• Consists of two phases: Map, and then Reduce
• Each Map task operates on a discrete portion (one HDFS block) of the overall dataset
• The MapReduce system distributes the intermediate data to the nodes which perform the Reduce phase (see the word-count sketch below)

Hadoop MapReduce (figure slides)
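To illustrate the two phases described above, here is a minimal word-count sketch (not taken from the slides) written against the standard org.apache.hadoop.mapreduce API; the class names and input/output paths are illustrative. Each map task processes one input split (typically one HDFS block), and the framework shuffles the intermediate (word, 1) pairs to the reducers.

  // Minimal word-count sketch of the Map and Reduce phases (illustrative names).
  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

    // Map phase: each task sees one input split (typically one HDFS block)
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();
      @Override
      protected void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
          if (!token.isEmpty()) { word.set(token); ctx.write(word, ONE); }
        }
      }
    }

    // Reduce phase: receives all intermediate counts for a given word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context ctx) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        ctx.write(key, new IntWritable(sum));
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenMapper.class);
      job.setCombinerClass(SumReducer.class);
      job.setReducerClass(SumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. an HDFS input directory
      FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output path must not already exist
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }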
Sqoop
• SQL to Hadoop
• Tool to import/export any JDBC-supported database into Hadoop
• Transfers data between Hadoop and external databases or EDW
• High-performance connectors for some RDBMS
• Developed at Cloudera

Flume
• Distributed, reliable, available service for efficiently moving large amounts of data as it is produced
• Suited for gathering logs from multiple systems and inserting them into HDFS as they are generated
• Design goals: Reliability, Scalability, Manageability, Extensibility
• Developed at Cloudera

Flume Architecture (diagram: Master, Agents, Processors, Collectors)
• Master sends configuration to all Agents
• Deployable, centrally administered
• Configurable levels of reliability; guarantees delivery in the event of failure
• Optionally pre-process incoming data: perform transformations, suppressions, metadata enrichment
• Agents can compress, batch and encrypt data in flight
• Flexibly deploy decorators at any step to improve performance, reliability or security
• Writes to multiple HDFS file formats (text, sequence, JSON, Avro, others)
• Parallelized writes across many collectors – as much write throughput as required

HBase
• Column-family store, based on the design of Google BigTable
• Provides interactive access to information
• Holds extremely large datasets (multi-TB)
• Constrained access model
  • (key, value) lookup
  • Limited transactions (only one row)
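As a hedged illustration of HBase's row-keyed (key, value) access model and single-row atomicity, the sketch below uses the standard HBase Java client to write and read one row. It assumes a running cluster and a pre-created table 'users' with column family 'info'; both names are hypothetical, not from the slides.

  // Minimal sketch of HBase Put/Get via the standard Java client.
  // Assumes a reachable cluster and an existing table 'users' with family 'info' (illustrative names).
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.*;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HBaseRowDemo {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();           // reads hbase-site.xml
          try (Connection conn = ConnectionFactory.createConnection(conf);
               Table table = conn.getTable(TableName.valueOf("users"))) {

              // Put: single-row write keyed by row key (row-level atomicity only)
              Put put = new Put(Bytes.toBytes("user#1001"));
              put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
              table.put(put);

              // Get: (key, value) lookup by row key
              Result result = table.get(new Get(Bytes.toBytes("user#1001")));
              byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
              System.out.println("name = " + Bytes.toString(name));
          }
      }
  }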
HBase Architecture (figure slide)

Pig
• Data-flow oriented language – "Pig Latin"
• Datatypes include sets, associative arrays, tuples
• High-level language for routing data; allows easy integration of Java for complex tasks
• Example:

  emps = LOAD 'people.txt' AS (id, name, salary);
  rich = FILTER emps BY salary > 100000;
  srtd = ORDER rich BY salary DESC;
  STORE srtd INTO 'rich_people.txt';
Oozie
• Oozie is a workflow/coordination service to manage data processing jobs for Hadoop

Hive
• SQL-based data warehousing application
• Language is SQL-like
  • Supports SELECT, JOIN, GROUP BY, etc.
• Features for analyzing very large data sets
  • Partition columns, sampling, buckets
• Example (the second table, aliased k, is assumed to be a word-count table named kjv, as in the classic Cloudera training exercise):

  SELECT s.word, s.freq, k.freq
  FROM shakespeare s JOIN kjv k ON (s.word = k.word)
  WHERE s.freq >= 5;
ZooKeeper
• ZooKeeper is a distributed consensus engine
• Provides well-defined concurrent access semantics:
  • Leader election
  • Service discovery
  • Distributed locking / mutual exclusion
  • Message board / mailboxes
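As a sketch of how ZooKeeper's primitives support leader election, the example below uses the raw ZooKeeper Java client to create an ephemeral sequential znode and treat the lowest-numbered candidate as the leader. The ensemble address localhost:2181 and the pre-existing /election parent znode are assumptions made for illustration.

  // Minimal leader-election sketch with the raw ZooKeeper Java client.
  // Assumes an ensemble at localhost:2181 and an existing /election parent znode (illustrative).
  import java.util.Collections;
  import java.util.List;
  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.ZooDefs;
  import org.apache.zookeeper.ZooKeeper;

  public class LeaderElectionDemo {
      public static void main(String[] args) throws Exception {
          ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> { /* ignore watch events for brevity */ });

          // Each candidate creates an ephemeral, sequential znode; it disappears if the client dies.
          String me = zk.create("/election/candidate-", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

          // The candidate whose znode has the smallest sequence number is the leader.
          List<String> candidates = zk.getChildren("/election", false);
          Collections.sort(candidates);
          boolean leader = me.endsWith(candidates.get(0));
          System.out.println(me + (leader ? " -> I am the leader" : " -> following " + candidates.get(0)));

          Thread.sleep(5_000);   // hold the session briefly so the znode stays visible
          zk.close();            // the ephemeral znode is removed, letting the others re-elect
      }
  }

Because the znode is ephemeral, a crashed leader's entry vanishes with its session, so the remaining candidates can simply re-read the children of /election to pick a new leader.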
Any Questions and Thanks