Welcome to
Big Data with Hadoop & Spark

Please introduce yourself while others are joining.
Introduction to Hadoop & Spark
ReachUs@[Link]
Session 1 - Big Data with Hadoop & Spark
Duration: 3 hours

Agenda:
• Introduction to Big Data
• 10 min. break
• Spark & Hadoop Architecture

Notes:
• Please introduce yourself using the chat window while others are joining
• The session is being recorded; the recording & presentation will be shared
• This is Session 1 of 18 sessions in the Big Data with Hadoop & Spark specialization
• It suffices as an introduction to the Big Data technology stack
Asking Questions
• Everyone except the instructor is muted
• Please ask questions by typing in the Q&A window
• The instructor will read out the questions before answering
• To get better answers, keep your messages short and avoid chat language
Course Instructor
Sandeep Giri
• Founder
• Software Engineer
• Worked on large-scale computing
• Graduated from IIT Roorkee
• Loves explaining technologies
Course Objective
Learn to process Big Data with Hadoop, Spark & related technologies
Course Structure
• Videos
• Quizzes
• Hands-On Projects
• Case Studies
• Real-Life Use Cases
Automated Hands-on Assessments
Learn by doing
Automated Hands-on Assessments
Problem Statement → Hands-On Assessment
Automated Hands-on Assessments
Problem Statement → Evaluation
My Courses
My Course List
Topics or PlayLists
Learning Item
Automated Hands-on Assessments
Click when you are done!
Data Variety
Data Variety
ETL - Extract, Transform, Load
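A minimal sketch of the ETL flow in plain Python; the source data, field names, and target format here are made up purely for illustration:

```python
# A tiny illustrative ETL pipeline: extract rows from raw CSV text,
# transform each row, and load the results into a target store (a dict).
import csv
import io

raw = "user,amount\nkumar,100\ngiri,250\n"  # hypothetical raw source data

def etl(source: str) -> dict:
    rows = csv.DictReader(io.StringIO(source))                           # Extract
    transformed = ((r["user"].upper(), int(r["amount"])) for r in rows)  # Transform
    warehouse = dict(transformed)                                        # Load
    return warehouse

print(etl(raw))  # {'KUMAR': 100, 'GIRI': 250}
```

Real pipelines would of course use tools like Hive or Spark for the same three steps at scale.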
Distributed Systems
A group of networked computers, communicating with each other, to achieve a common goal.
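The "many workers, one goal" idea can be sketched on a single machine, with threads standing in for networked computers (an illustrative analogy only, not a real distributed system):

```python
# Each worker processes one slice of the data; the partial results are
# then combined into the common goal (here: the sum of 1..100).
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 101))
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]  # 4 slices

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, chunks))  # each worker sums one slice

total = sum(partial_sums)  # combine the partial results
print(total)  # 5050
```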
Question
How many bytes in one petabyte?
1.1259 × 10^15 (i.e. 2^50 bytes)
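The figure follows from the binary convention, 1 PB = 2^50 bytes; a quick check:

```python
# 1 petabyte in the binary convention is 2**50 bytes.
bytes_in_pb = 2 ** 50
print(bytes_in_pb)            # 1125899906842624
print(f"{bytes_in_pb:.4e}")   # 1.1259e+15
```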
Question
How much data does Facebook store in one day?
600 TB
What is Big Data?
• Simply: data of very big size
• Can’t be processed with the usual tools
• A distributed architecture is needed
• Structured / unstructured
Characteristics of Big Data

VOLUME - Data at Rest
Problems involving reliable storage of huge amounts of data.
e.g. storage of the logs of a website, storage of data by Gmail. FB: 300 PB.

VELOCITY - Data in Motion
Problems involving the handling of data coming in at a fast rate.
e.g. the number of requests being received by Facebook, YouTube streaming, Google Analytics. FB: 600 TB/day.

VARIETY - Data in Many Forms
Problems related to complex data structures.
e.g. maps, social graphs, recommendations.
Characteristics of Big Data - Variety
Problems involving complex data structures
e.g. Maps, Social Graphs, Recommendations
Question
Time taken to read 1 TB from an HDD?
Around 6 hours
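A back-of-the-envelope check, assuming a sequential read speed of roughly 50 MB/s for a single HDD (an assumed typical figure, not a benchmark):

```python
# Estimate how long one disk takes to read 1 TB sequentially.
tb_in_mb = 1_000_000           # 1 TB in MB (decimal convention)
speed_mb_per_s = 50            # assumed HDD read speed
hours = tb_in_mb / speed_mb_per_s / 3600
print(round(hours, 1))         # ~5.6 hours, i.e. "around 6 hours"
```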
Is One Petabyte Big Data?
If you have to count just the vowels in 1 petabyte of data every day, do you need a distributed system?
Is One Petabyte Big Data?
Yes. Most existing systems can’t handle it.
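Even a trivial job like vowel counting splits naturally into independent chunks, which is exactly what makes it distributable; a toy sketch of the split-then-combine pattern:

```python
# Vowel counting as map + reduce: count per chunk independently (map),
# then sum the partial counts (reduce). At petabyte scale, each chunk
# would be a file split processed on a different machine.
VOWELS = set("aeiouAEIOU")

def count_vowels(chunk: str) -> int:        # map: runs independently per chunk
    return sum(ch in VOWELS for ch in chunk)

chunks = ["Big Data with", " Hadoop", " and Spark"]  # stand-ins for file splits
partials = [count_vowels(c) for c in chunks]
total = sum(partials)                        # reduce: combine partial counts
print(total)  # 9
```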
Why Big Data?
Why Is It Important Now?

Devices x Connectivity => Applications
• Devices: smartphones, social networks. 4.6 billion mobile phones; 1-2 billion people accessing the internet.
• Connectivity: WiFi, 4G, NFC, GPS, the Internet of Things.

The devices became cheaper, faster and smaller. The connectivity improved. Result: many applications.
Computing Components
To process & store data we need:
1. CPU - speed
2. RAM - speed & size
3. HDD or SSD - size & speed
4. Network - speed
Which Components Impact the Speed of Computing?
A. CPU
B. Memory size
C. Memory read speed
D. Disk speed
E. Disk size
F. Network speed
G. All of the above
Example Big Data Customers
1. Ecommerce - Recommendations
Example Big Data Problems
Recommendations - How?

USER ID | MOVIE ID       | RATING
KUMAR   | matrix         | 4.0
KUMAR   | Ice age        | 3.5
KUMAR   | apocalypse now | 3.6
GIRI    | apocalypse now | 3.6
GIRI    | Ice age        | 3.5
GIRI    | matrix         | 4.0
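One simple starting point for recommendations is measuring how similar two users' rating vectors are. This sketch uses cosine similarity over the table's data; the metric choice is illustrative, not the course's prescribed method:

```python
# Cosine similarity between two users, computed over the movies both
# rated. Ratings are taken from the table above.
import math

ratings = {
    "KUMAR": {"matrix": 4.0, "Ice age": 3.5, "apocalypse now": 3.6},
    "GIRI":  {"matrix": 4.0, "Ice age": 3.5, "apocalypse now": 3.6},
}

def cosine_sim(a: dict, b: dict) -> float:
    common = set(a) & set(b)                       # movies rated by both users
    dot = sum(a[m] * b[m] for m in common)
    na = math.sqrt(sum(a[m] ** 2 for m in common))
    nb = math.sqrt(sum(b[m] ** 2 for m in common))
    return dot / (na * nb)

# KUMAR and GIRI gave identical ratings, so their similarity is 1.0:
# movies liked by one are good candidates to recommend to the other.
print(round(cosine_sim(ratings["KUMAR"], ratings["GIRI"]), 3))
```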
Example Big Data Customers
2. Ecommerce - A/B Testing
Big Data Customers
Government
• Fraud detection
• Cyber security
• Welfare
Telecommunications
• Customer churn prevention
• Network performance optimization
• Call Data Record (CDR) analysis
• Analyzing the network to predict failure
Example Big Data Customers
Healthcare & Life Sciences
• Health information exchange
• Gene sequencing
• Healthcare improvements
• Drug safety
Big Data Solutions
• Apache Hadoop
• Apache Spark
• [Link]
• [Link]
• Google Compute Engine
• [Link]
What is Hadoop?
A. Created by Doug Cutting (of Yahoo)
B. Built for the Nutch search engine project
C. Joined by Mike Cafarella
D. Based on Google's GFS, MapReduce (GMR) & BigTable
E. Named after a toy elephant
F. Open source - Apache
G. Powerful, popular & supported
H. A framework to handle Big Data
I. For distributed, scalable and reliable computing
J. Written in Java
Components (Hadoop ecosystem stack)
• Workflow
• SQL-like interface
• Machine learning / stats
• Compute engine (Spark)
• NoSQL datastore
• Resource manager
• File storage
Apache Spark
• Really fast MapReduce
• 100x faster than Hadoop MapReduce in memory
• 10x faster on disk
• Builds on similar paradigms to MapReduce
• Integrated with Hadoop

Spark Core - a fast and general engine for large-scale data processing.
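The MapReduce paradigm that Spark builds on can be sketched in plain Python as a word count: a map step emits (word, 1) pairs and a reduce step sums them per key. (Real Spark would express this with RDD or DataFrame operations; this is just the underlying idea.)

```python
# Word count in the MapReduce style, using plain Python.
from collections import defaultdict

lines = ["big data", "big spark", "hadoop and spark"]

# Map: emit a (word, 1) pair for every word in every line.
pairs = [(word, 1) for line in lines for word in line.split()]

# Reduce: sum the counts per word (the "shuffle" groups by key).
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # {'big': 2, 'data': 1, 'spark': 2, 'hadoop': 1, 'and': 1}
```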
Spark Architecture
• Data sources: HDFS, HBase, Hive, Tachyon, Cassandra
• Languages: Java, Python, Scala, R
• Libraries: Spark SQL, DataFrames, Streaming, MLlib, GraphX
• Engine: Spark Core
• Resource/cluster managers: Hadoop YARN, Apache Mesos, Amazon EC2, Standalone
Thank you. For the full course please enroll at
[Link]