Welcome to
Big Data
with
Hadoop & Spark
Introduction
Introduction to Hadoop & Spark
Data Variety
ETL
Extract, Transform, Load
Distributed Systems
1. Groups of networked computers
2. Interact with each other
3. To achieve a common goal
Question
How Many Bytes in One Petabyte?
1.1259 × 10^15 (= 2^50 bytes, taking 1 PB as 1,024 TB)
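The figure follows from treating 1 PB in binary units, i.e. 1,024 TB; a quick check:

```python
# 1 PB in binary units: 1024 TB = 1024**5 bytes = 2**50 bytes.
bytes_in_pb = 2 ** 50
print(bytes_in_pb)  # 1125899906842624, i.e. ~1.1259 x 10^15
# In decimal (SI) units, 1 PB is exactly 10**15 bytes.
```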
Question
How Much Data Does Facebook Store in One Day?
600 TB
What is Big Data?
• Simply: Data of Very Big Size
• Can’t be processed with the usual tools
• Distributed Architecture Needed
• Structured / Unstructured
Characteristics of Big Data
VOLUME - Data at Rest
Problems related to storing huge data reliably.
e.g. storage of logs of a website, storage of data by Gmail. FB: 300 PB.

VELOCITY - Data in Motion
Problems involving the handling of data coming at a fast rate.
e.g. number of requests received by Facebook (600 TB/day), YouTube streaming, Google Analytics.

VARIETY - Data in Many Forms
Problems involving complex data structures.
e.g. maps, social graphs, recommendations.
Characteristics of Big Data - Variety
Problems involving complex data structures
e.g. Maps, Social Graphs, Recommendations
Question
Time taken to read 1 TB from HDD?
Around 6 hours
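The "6 hours" is a back-of-envelope estimate; assuming a sustained sequential read speed of about 50 MB/s (a modest spinning disk, an assumption here):

```python
# Time to read 1 TB sequentially from one HDD at an assumed 50 MB/s.
terabyte = 10 ** 12            # bytes (SI)
read_speed = 50 * 10 ** 6      # bytes per second (assumed HDD speed)
hours = terabyte / read_speed / 3600
print(round(hours, 1))         # 5.6 -> "around 6 hours"
```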
Is One PetaByte Big Data?
If you have to count just the vowels in 1 petabyte of data
every day, do you need a distributed system?
Is One PetaByte Big Data?
Yes.
Most existing systems can’t handle it.
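Counting vowels is embarrassingly parallel, which is exactly why a distributed system helps: split the data into chunks, count each chunk independently, and sum the partial counts. A small single-machine sketch of that split/merge pattern (a thread pool stands in for a cluster; the sample text is made up):

```python
from concurrent.futures import ThreadPoolExecutor

VOWELS = set("aeiouAEIOU")

def count_vowels(chunk: str) -> int:
    # Each worker counts its own chunk independently.
    return sum(1 for ch in chunk if ch in VOWELS)

def parallel_vowel_count(text: str, workers: int = 4) -> int:
    # Split the input, farm the chunks out, then sum the partial counts.
    size = max(1, len(text) // workers)
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_vowels, chunks))

print(parallel_vowel_count("big data with hadoop and spark"))  # 9
```

On one petabyte the same pattern applies, but the chunks live on many machines and the counting runs where the data is stored.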
Why Big Data?
Why is It Important Now?
Devices × Connectivity => Applications

• Devices: smartphones; 4.6 billion mobile phones. The devices became cheaper, faster and smaller.
• Connectivity: WiFi, 4G, NFC, GPS; 1-2 billion people accessing the internet. The connectivity improved.
• Result: many applications, e.g. social networks and the Internet of Things.
Computing Components
To process & store data
we need
1. CPU - Speed
2. RAM - Speed & Size
3. HDD or SSD - Size & Speed
4. Network - Speed
Which Components Impact the Speed of Computing?
A. CPU
B. Memory Size
C. Memory Read Speed
D. Disk Speed
E. Disk Size
F. Network Speed
G. All of the Above
Which Components Impact the Speed of Computing?
Answer: G. All of the Above
Example Big Data Customers
1. Ecommerce - Recommendations
Example Big Data Problems
Recommendations - How?
USER ID   MOVIE ID         RATING
KUMAR     matrix           4.0
KUMAR     Ice age          3.5
KUMAR     apocalypse now   3.6
GIRI      apocalypse now   3.6
GIRI      Ice age          3.5
GIRI      matrix           4.0
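One common answer is user-based collaborative filtering: users whose past ratings agree are treated as neighbours, and each user is recommended movies their neighbours rated highly. A minimal sketch using the ratings above (cosine similarity is one standard choice of similarity measure, an assumption here; the data is the slide's example):

```python
from math import sqrt

ratings = {
    "KUMAR": {"matrix": 4.0, "Ice age": 3.5, "apocalypse now": 3.6},
    "GIRI":  {"matrix": 4.0, "Ice age": 3.5, "apocalypse now": 3.6},
}

def cosine_similarity(a: dict, b: dict) -> float:
    # Compare two users on the movies they have both rated.
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    norm_a = sqrt(sum(a[m] ** 2 for m in common))
    norm_b = sqrt(sum(b[m] ** 2 for m in common))
    return dot / (norm_a * norm_b)

# KUMAR and GIRI rated all three movies identically, so their
# similarity is maximal: what one of them likes next is a good
# recommendation for the other.
print(round(cosine_similarity(ratings["KUMAR"], ratings["GIRI"]), 6))
```

At e-commerce scale this pairwise comparison runs over millions of users and items, which is what pushes it onto a distributed system.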
Example Big Data Customers
2. Ecommerce - A/B Testing
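At its core, A/B testing shows two page variants to different visitor groups and checks whether the difference in conversion rates is larger than chance would explain. A hedged sketch with a hand-rolled two-proportion z-test (the visitor and conversion counts are made-up numbers):

```python
from math import sqrt

def ab_z_score(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    # Two-proportion z-test: how many standard errors apart are the
    # two conversion rates, under a pooled estimate of the rate?
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Variant A: 500/5000 visitors convert (10%); variant B: 600/5000 (12%).
z = ab_z_score(500, 5000, 600, 5000)
print(round(z, 2))  # ~3.2; |z| > 1.96 means significant at the 5% level
```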
Big Data Customers
Government
1. Fraud Detection
2. Cyber Security
3. Welfare
4. Justice

Telecommunications
1. Customer Churn Prevention
2. Network Performance Optimization
3. Calling Data Record (CDR) Analysis
4. Analyzing Network to Predict Failure
Example Big Data Customers
Healthcare & Life Sciences
1. Health Information Exchange
2. Gene Sequencing
3. Healthcare Improvements
4. Drug Safety
Big Data Solutions
1. Apache Hadoop
   ○ Apache Spark
2. Cassandra
3. MongoDB
4. Google Compute Engine
What is Hadoop?
A. Created by Doug Cutting (of Yahoo)
B. Built for Nutch search engine project
C. Joined by Mike Cafarella
D. Based on GFS, GMR & Google Big Table
E. Named after his son’s toy elephant
F. Open Source - Apache
G. Powerful, Popular & Supported
H. Framework to handle Big Data
I. For distributed, scalable and reliable computing
J. Written in Java
Components

• Workflow
• Spark
• SQL-like interface
• Machine learning / stats
• SQL interface
• Compute engine
• NoSQL datastore
• Resource manager
• File storage
Apache Spark
• Really fast MapReduce
• Up to 100x faster than Hadoop MapReduce in memory, 10x faster on disk
• Builds on similar paradigms as MapReduce
• Integrated with Hadoop
Spark Core - A fast and general engine for large-scale
data processing.
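The MapReduce paradigm Spark builds on has three phases: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase combines each group. A pure-Python word count illustrating those phases (no Spark installation needed; this sketches the model, not Spark's actual API):

```python
from collections import defaultdict

lines = ["big data with hadoop", "big data with spark"]

# Map: each line -> (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine each group into a final count.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts["big"], counts["spark"])  # 2 1
```

In a real cluster the map and reduce phases run on many machines and the shuffle moves data between them; Spark keeps the intermediate results in memory, which is where its speedup over disk-based MapReduce comes from.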
Spark Architecture

Data sources: HDFS, HBase, Hive, Tachyon, Cassandra
Languages: SQL, SparkR, Java, Python, Scala
Libraries: DataFrames, Streaming, MLlib, GraphX
Engine: Spark Core
Resource/cluster managers: Hadoop YARN, Amazon EC2, Standalone, Apache Mesos