Bigdata Hadoop and Spark
by Sumit Mittal
Welcome to Bigdata Hadoop & Spark Demo
Trainer Introduction
Mr. Sumit Mittal is the CEO & founder of
TrendyTech. He has a Master’s degree in
Computer Applications from NIT Trichy and
a total of 7+ years of industry experience. He has
worked for top MNCs like Cisco & VMware.
Consistent 5 star Google rated Bigdata course
What is Bigdata ?
Bigdata is a term that describes
the large volume of data
There may be many definitions
of Bigdata
How to classify Bigdata ?
3v’s of Bigdata
3v’s of Bigdata ▶
Formal Definition
• Volume
• Variety
• Velocity
3v’s of Bigdata ▶
Volume: Scale of data
2.5 quintillion (2,500,000,000,000,000,000) bytes of data are created each day
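To put the 2.5 quintillion bytes figure into more familiar units, here is a small arithmetic sketch. It assumes decimal (SI) units and uses a 1 TB disk purely as a yardstick; the variable names are illustrative.

```python
# Figure from the slide: 2.5 quintillion bytes created each day.
BYTES_PER_DAY = 2_500_000_000_000_000_000  # 2.5 * 10**18

# Decimal (SI) storage units, an assumption for this sketch.
TB = 10**12  # terabyte
EB = 10**18  # exabyte

exabytes_per_day = BYTES_PER_DAY / EB
one_tb_disks_per_day = BYTES_PER_DAY // TB

print(f"{exabytes_per_day} EB per day")
print(f"{one_tb_disks_per_day:,} one-terabyte disks filled per day")
```

In other words, the daily volume would fill millions of 1 TB disks, which is why a single machine cannot hold it.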
3v’s of Bigdata ▶
Variety: Different forms of data
• Structured data: RDBMS Databases (Oracle & MySQL)
• Semi structured data: CSV, XML, JSON
• Unstructured data: Audio, Video, Image, Log files.
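The three forms of data can be illustrated with a short sketch. All records, field names and the log line below are hypothetical, and a list of tuples stands in for an RDBMS table.

```python
import csv
import io
import json

# Structured: rows with a fixed set of columns, as in an RDBMS table.
# (A list of tuples stands in for a database table here.)
columns = ("id", "name", "city")
table_rows = [(1, "Asha", "Pune"), (2, "Ravi", "Delhi")]

# Semi structured: the data carries its own field names (JSON keys, CSV header).
json_text = '{"id": 1, "name": "Asha", "tags": ["new", "mobile"]}'
csv_text = "id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n"

json_record = json.loads(json_text)
csv_records = list(csv.DictReader(io.StringIO(csv_text)))

# Unstructured: free text with no declared fields, e.g. a log line.
log_line = "2024-01-01 10:00:00 ERROR payment failed for user 1"
log_words = log_line.split()

print(json_record["tags"])
print(csv_records[1]["city"])
print(log_words[2])
```

The structured and semi structured records can be addressed by field name, while the log line has to be split and interpreted by position before it can be queried.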
3v’s of Bigdata ▶
Velocity: Speed of data
• 900 Million photos on Facebook
• 600 Million tweets on Twitter
• 0.5 Million hours of video on Youtube
• 3.5 Billion searches on Google
4th v of Bigdata
4th v of Bigdata ▶
Veracity: Uncertainty of data
• Poor Quality data
• Unclean data
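As a rough illustration of what unclean data looks like in practice, the sketch below filters out records that cannot be trusted. The records and the validity rules (a non-empty user, an age between 1 and 119) are assumptions made up for this example.

```python
# Hypothetical raw records: some have missing or invalid values.
raw_records = [
    {"user": "asha", "age": "31"},
    {"user": "", "age": "27"},        # missing user
    {"user": "ravi", "age": "abc"},   # age is not a number
    {"user": "meena", "age": "-4"},   # negative age
    {"user": "john", "age": "45"},
]

def is_clean(record):
    """Return True when the record has a user name and a plausible age."""
    if not record["user"]:
        return False
    if not record["age"].isdigit():  # rejects "abc" and "-4"
        return False
    return 0 < int(record["age"]) < 120

clean_records = [r for r in raw_records if is_clean(r)]
print(len(clean_records), "of", len(raw_records), "records are usable")
```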
Why Bigdata ?
Why Bigdata ?
Process: To process huge amounts of data that traditional
systems are not capable of processing
Why Bigdata ?
Store: To process a huge amount of data, we first need to
store it
Why Bigdata ?
Store: Are our traditional systems capable of storing such a
massive amount of data ?
Bigdata System Requirements ?
Bigdata System Requirements ▶
Store huge amounts of data
Traditional systems are NOT fit to store such huge amounts of data
Bigdata System Requirements ▶
Store: store massive amounts of data
? ?
Bigdata System Requirements ▶
Process huge amounts of data in an efficient and timely manner
Traditional systems are NOT capable of handling this
Bigdata System Requirements ▶
Store: store massive amounts of data
Process: process it in a timely manner
?
Bigdata System Requirements ▶
Scale easily to accommodate growing requirements
Traditional systems have serious limitations
Bigdata System Requirements ▶
Store: store massive amounts of data
Process: process it in a timely manner
Scale: scale easily as data grows
Two ways to build a system
• Monolithic
• Distributed
2 ways to build a system ▶
Monolithic: One powerful system with a lot of resources
2 ways to build a system ▶
Distributed: Many smaller systems come together
Monolithic or Distributed ?
Monolithic or Distributed ? ▶
Monolithic: A single powerful server
Hard to add resources after a certain limit
Monolithic or Distributed ? ▶
Monolithic: Resources
• RAM 8 GB (Memory)
• Hard Disk 1 TB (Storage)
• CPU Quad core (Compute)
Monolithic or Distributed ? ▶
Is Monolithic scalable ?
NO!
2x resources ≠ 2x performance
Monolithic or Distributed ? ▶
Distributed: 6 Node Cluster
(each machine in the cluster is a Node)
Monolithic or Distributed ? ▶
Distributed: Many small and cheap computers come together
to act as a single entity
Monolithic or Distributed ? ▶
Is a Distributed system scalable ?
Yes !
Distributed systems are linearly scalable
Monolithic or Distributed ? ▶
Distributed systems are scalable
2x resources = 2x Speed
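The idea behind horizontal scaling can be sketched in a few lines: the data is split into one partition per node, each node works only on its own partition, and the partial results are combined. This is a single-process simulation under ideal assumptions (perfectly even partitions, no network or coordination cost); the function names and the 1200-record dataset are hypothetical.

```python
def split_into_partitions(items, num_nodes):
    """Deal items round-robin into one partition per node."""
    return [items[i::num_nodes] for i in range(num_nodes)]

def node_work(partition):
    """Work done by one node: here, summing its own partition."""
    return sum(partition)

data = list(range(1, 1_201))  # 1200 hypothetical records

for num_nodes in (3, 6):
    partitions = split_into_partitions(data, num_nodes)
    # On a real cluster these calls would run in parallel, one per node.
    partial_results = [node_work(p) for p in partitions]
    total = sum(partial_results)
    largest_share = max(len(p) for p in partitions)
    print(num_nodes, "nodes:", largest_share, "records per node, total =", total)
```

Doubling the nodes from 3 to 6 halves the records each node has to handle while the combined result stays the same. Real clusters add some overhead for moving data and coordinating nodes, so the 2x figure is best read as the ideal case.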
Monolithic or Distributed ? ▶
Monolithic Architecture: Vertical Scaling (Not true scaling)
Distributed Architecture: Horizontal Scaling (True scaling)
Monolithic or Distributed
Monolithic
Distributed ✓
That is why all good big data
systems are based on Distributed
architecture
What is Hadoop ?
What is Hadoop ?
Hadoop is a framework to solve
Bigdata problems
Hadoop Evolution
Hadoop Evolution
2003: Google released a paper to describe how
to store large datasets
Hadoop Evolution
2003: This paper was called GFS (Google File
System)
Hadoop Evolution
2004: Google released another paper to describe how to
process large datasets
Hadoop Evolution
2004: This paper was called MapReduce
Hadoop Evolution
2006: Yahoo took these papers and implemented them
Hadoop Evolution
2006:
• The implementation of GFS was named HDFS (Hadoop
Distributed File System)
• The implementation of MapReduce was named
MapReduce (unchanged)
Hadoop Evolution
Hadoop 1.0
• HDFS: for distributed storage
• MapReduce: for distributed processing
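To give a feel for the MapReduce processing model, here is a minimal single-process sketch using word count, the customary introductory example. It is not Hadoop's API: the function names and input lines are made up for illustration, and on a real cluster the map and reduce steps would run on many nodes over data stored in HDFS.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

lines = ["big data is big", "spark and hadoop process big data"]

mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle(mapped)
word_counts = dict(reduce_phase(k, v) for k, v in grouped.items())

print(word_counts["big"], word_counts["data"])
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across the nodes of a cluster.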
Hadoop Evolution
2009: Hadoop came under the Apache Software
Foundation and became open source
Hadoop Evolution
2013: Apache released Hadoop 2.0 to provide
major performance enhancements
Hadoop Evolution
• 2003: Google GFS
• 2004: Google MR
• 2006: Yahoo implementations
• 2009: Hadoop under Apache
• 2013: Hadoop 2.0 released
Hadoop Evolution
Hadoop 1.0: MapReduce, HDFS
Hadoop 2.0: MapReduce, YARN, HDFS
Hadoop Evolution
What is YARN ?
Hadoop Evolution
YARN: Yet Another Resource Negotiator
Mainly responsible for Resource management
Hadoop Evolution
HADOOP CORE COMPONENTS
• HDFS: for distributed storage
• MR: for distributed processing
• YARN: for resource management
Hadoop Ecosystem
Hadoop Ecosystem ▶
Hadoop Core: HDFS, MR, YARN
Around the core: HIVE, PIG, SQOOP, HBASE, OOZIE
Hadoop Ecosystem ▶
HIVE: Data warehouse tool
built on top of Apache
Hadoop for providing
data query and analysis
Hadoop Ecosystem ▶
PIG: A scripting language
for data manipulation.
Transforms unstructured
data into structured format
Hadoop Ecosystem ▶
SQOOP: A command-line interface
application for transferring
data between relational
databases and Hadoop
Hadoop Ecosystem ▶
HBASE: A column-oriented
NoSQL database that
runs on top of HDFS
Hadoop Ecosystem ▶
OOZIE: A workflow scheduler
system to manage
Apache Hadoop jobs
Hadoop Ecosystem ▶
SPARK: A distributed general
purpose in-memory
compute engine
Introduction to Spark
Introduction to Spark ▶
Apache Spark is a
general purpose
in-memory compute
engine
Introduction to Spark ▶
In Hadoop Cluster
• HDFS: Storage Unit
• MapReduce: Compute Engine ✓
• YARN: Resource Manager
Introduction to Spark ▶
In Hadoop Cluster
HDFS | MapReduce | YARN
↓
HDFS | SPARK | YARN
SPARK takes the place of MapReduce as the Compute Engine
Introduction to Spark ▶
SPARK: A plug & play Compute Engine
Introduction to Spark ▶
SPARK
Plug it with any Storage System
LOCAL STORAGE / HDFS / AMAZON S3
Plug it with any Resource Manager
YARN / MESOS / KUBERNETES
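One way to picture the plug & play idea: the job logic stays the same, and only the master URL (which resource manager to use) and the input path (which storage system to read from) change. The sketch below is plain Python, not Spark code. The `yarn`, `mesos://` and `k8s://` master URL forms and the `file://`, `hdfs://` and `s3a://` path schemes follow Spark and Hadoop conventions, while every host name, port, bucket and file name is a placeholder and `spark_job_settings` is a hypothetical helper.

```python
# Master URL forms used to select a resource manager.
# Host names and ports below are placeholders, not real endpoints.
MASTER_URLS = {
    "yarn": "yarn",
    "mesos": "mesos://mesos-host:5050",
    "kubernetes": "k8s://https://k8s-host:6443",
}

# URI schemes commonly used to address each storage system.
STORAGE_PREFIXES = {
    "local": "file:///data",
    "hdfs": "hdfs://namenode-host:8020/data",
    "s3": "s3a://example-bucket/data",
}

def spark_job_settings(resource_manager, storage, file_name):
    """Return the master URL and input path for one storage/manager combination."""
    return {
        "master": MASTER_URLS[resource_manager],
        "input_path": f"{STORAGE_PREFIXES[storage]}/{file_name}",
    }

settings = spark_job_settings("yarn", "hdfs", "orders.csv")
print(settings)
```

In an actual deployment, values like these would typically be supplied through `spark-submit --master` and the path given to the read call, so swapping storage or resource manager does not require rewriting the job itself.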
Introduction to Spark ▶
SPARK CLUSTER
Compute ▶ SPARK
Storage ▶ Local / HDFS / Amazon S3
Resource Manager ▶ YARN / MESOS / KUBERNETES
Introduction to Spark ▶
Current Industry Trend
Compute ▶ SPARK
Storage ▶ HDFS
Resource Manager ▶ YARN
Introduction to Spark ▶
Spark is written in Scala
However, Spark officially
supports Java, Scala,
Python and R
Key Course Highlights
• 5 Star Google Rated Big Data Course
• All topics related to Bigdata Hadoop, Scala, Spark, Bigdata on AWS Cloud are covered in depth
• Hands on learning so that you get really confident
• 15 Weeks of online extended course designed for working professionals
• Live Capstone projects, regular assignments & assessments
• 150+ hours of Quality learning specially designed to crack top companies
• Wide range of interview questions covered along with resume preparation & career guidance
Trainer : Mr. Sumit Mittal
LinkedIn : [Link]
Website : [Link]
Call : 9108179578
email : [Link]@[Link]
YouTube channel : TrendyTech