Welcome to
Big Data
with
Hadoop & Spark
Introduction
Introduction to Hadoop & Spark
Data Variety
ETL
Extract, Transform, Load
Distributed Systems
1. Groups of networked computers
2. Interact with each other
3. To achieve a common goal
Question
How Many Bytes in One Petabyte?
1.1259 × 10^15 (= 2^50 bytes, taking 1 PB as 1,024 TB)
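The figure follows from treating 1 PB in binary units, i.e. 1,024 TB; a quick check:

```python
# 1 PB in binary units: 1024 TB = 1024**5 bytes = 2**50 bytes.
bytes_in_pb = 2 ** 50
print(bytes_in_pb)  # 1125899906842624, i.e. ~1.1259 x 10^15
# In decimal (SI) units, 1 PB is exactly 10**15 bytes.
```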
Question
How Much Data Does Facebook Store in One Day?
600 TB
What is Big Data?
• Simply: Data of Very Big Size
• Can’t be processed with the usual tools
• Distributed Architecture Needed
• Structured / Unstructured
Characteristics of Big Data
VOLUME - Data at Rest
Problems related to storing huge data reliably.
e.g. storage of logs of a website, storage of data by Gmail. FB: 300 PB.

VELOCITY - Data in Motion
Problems involving the handling of data coming at a fast rate.
e.g. number of requests received by Facebook (600 TB/day), YouTube streaming, Google Analytics.

VARIETY - Data in Many Forms
Problems involving complex data structures.
e.g. maps, social graphs, recommendations.
Characteristics of Big Data - Variety
Problems involving complex data structures
e.g. Maps, Social Graphs, Recommendations
Question
Time taken to read 1 TB from HDD?
Around 6 hours
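The "6 hours" is a back-of-envelope estimate; assuming a sustained sequential read speed of about 50 MB/s (a modest spinning disk, an assumption here):

```python
# Time to read 1 TB sequentially from one HDD at an assumed 50 MB/s.
terabyte = 10 ** 12            # bytes (SI)
read_speed = 50 * 10 ** 6      # bytes per second (assumed HDD speed)
hours = terabyte / read_speed / 3600
print(round(hours, 1))         # 5.6 -> "around 6 hours"
```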
Is One PetaByte Big Data?
If you have to count just the vowels in 1 petabyte of data
every day, do you need a distributed system?
Is One PetaByte Big Data?
Yes.
Most existing systems can’t handle it.
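Counting vowels is embarrassingly parallel, which is exactly why a distributed system helps: split the data into chunks, count each chunk independently, and sum the partial counts. A small single-machine sketch of that split/merge pattern (a thread pool stands in for a cluster; the sample text is made up):

```python
from concurrent.futures import ThreadPoolExecutor

VOWELS = set("aeiouAEIOU")

def count_vowels(chunk: str) -> int:
    # Each worker counts its own chunk independently.
    return sum(1 for ch in chunk if ch in VOWELS)

def parallel_vowel_count(text: str, workers: int = 4) -> int:
    # Split the input, farm the chunks out, then sum the partial counts.
    size = max(1, len(text) // workers)
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_vowels, chunks))

print(parallel_vowel_count("big data with hadoop and spark"))  # 9
```

On one petabyte the same pattern applies, but the chunks live on many machines and the counting runs where the data is stored.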
Why Big Data?
Why is It Important Now?
Devices × Connectivity => Applications

• Devices: smartphones; 4.6 billion mobile phones. The devices became cheaper, faster and smaller.
• Connectivity: WiFi, 4G, NFC, GPS; 1-2 billion people accessing the internet. The connectivity improved.
• Result: many applications, e.g. social networks and the Internet of Things.
Computing Components
To process & store data
we need
1. CPU - Speed
2. RAM - Speed & Size
3. HDD or SSD - Size & Speed
4. Network - Speed
Which Components Impact the Speed of Computing?
A. CPU
B. Memory Size
C. Memory Read Speed
D. Disk Speed
E. Disk Size
F. Network Speed
G. All of the Above
Which Components Impact the Speed of Computing?
Answer: G. All of the Above
Example Big Data Customers
1. Ecommerce - Recommendations
Example Big Data Problems
Recommendations - How?
USER ID   MOVIE ID         RATING
KUMAR     matrix           4.0
KUMAR     Ice age          3.5
KUMAR     apocalypse now   3.6
GIRI      apocalypse now   3.6
GIRI      Ice age          3.5
GIRI      matrix           4.0
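One common answer is user-based collaborative filtering: users whose past ratings agree are treated as neighbours, and each user is recommended movies their neighbours rated highly. A minimal sketch using the ratings above (cosine similarity is one standard choice of similarity measure, an assumption here; the data is the slide's example):

```python
from math import sqrt

ratings = {
    "KUMAR": {"matrix": 4.0, "Ice age": 3.5, "apocalypse now": 3.6},
    "GIRI":  {"matrix": 4.0, "Ice age": 3.5, "apocalypse now": 3.6},
}

def cosine_similarity(a: dict, b: dict) -> float:
    # Compare two users on the movies they have both rated.
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    norm_a = sqrt(sum(a[m] ** 2 for m in common))
    norm_b = sqrt(sum(b[m] ** 2 for m in common))
    return dot / (norm_a * norm_b)

# KUMAR and GIRI rated all three movies identically, so their
# similarity is maximal: what one of them likes next is a good
# recommendation for the other.
print(round(cosine_similarity(ratings["KUMAR"], ratings["GIRI"]), 6))
```

At e-commerce scale this pairwise comparison runs over millions of users and items, which is what pushes it onto a distributed system.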
Example Big Data Customers
2. Ecommerce - A/B Testing
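At its core, A/B testing shows two page variants to different visitor groups and checks whether the difference in conversion rates is larger than chance would explain. A hedged sketch with a hand-rolled two-proportion z-test (the visitor and conversion counts are made-up numbers):

```python
from math import sqrt

def ab_z_score(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    # Two-proportion z-test: how many standard errors apart are the
    # two conversion rates, under a pooled estimate of the rate?
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Variant A: 500/5000 visitors convert (10%); variant B: 600/5000 (12%).
z = ab_z_score(500, 5000, 600, 5000)
print(round(z, 2))  # ~3.2; |z| > 1.96 means significant at the 5% level
```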
Big Data Customers
Government
1. Fraud Detection
2. Cyber Security
3. Welfare
4. Justice

Telecommunications
1. Customer Churn Prevention
2. Network Performance Optimization
3. Calling Data Record (CDR) Analysis
4. Analyzing Network to Predict Failure
Example Big Data Customers
Healthcare & Life Sciences
1. Health Information Exchange
2. Gene Sequencing
3. Healthcare Improvements
4. Drug Safety
Big Data Solutions
1. Apache Hadoop
   ○ Apache Spark
2. Cassandra
3. MongoDB
4. Google Compute Engine
What is Hadoop?
A. Created by Doug Cutting (of Yahoo)
B. Built for Nutch search engine project
C. Joined by Mike Cafarella
D. Based on GFS, GMR & Google Big Table
E. Named after his son’s toy elephant
F. Open Source - Apache
G. Powerful, Popular & Supported
H. Framework to handle Big Data
I. For distributed, scalable and reliable computing
J. Written in Java
Components

• Workflow
• Spark
• SQL-like interface
• Machine learning / stats
• SQL interface
• Compute engine
• NoSQL datastore
• Resource manager
• File storage
Apache Spark
• Really fast MapReduce
• Up to 100x faster than Hadoop MapReduce in memory, 10x faster on disk
• Builds on similar paradigms as MapReduce
• Integrated with Hadoop
Spark Core - A fast and general engine for large-scale
data processing.
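The MapReduce paradigm Spark builds on has three phases: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase combines each group. A pure-Python word count illustrating those phases (no Spark installation needed; this sketches the model, not Spark's actual API):

```python
from collections import defaultdict

lines = ["big data with hadoop", "big data with spark"]

# Map: each line -> (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine each group into a final count.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts["big"], counts["spark"])  # 2 1
```

In a real cluster the map and reduce phases run on many machines and the shuffle moves data between them; Spark keeps the intermediate results in memory, which is where its speedup over disk-based MapReduce comes from.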
Spark Architecture

Data sources: HDFS, HBase, Hive, Tachyon, Cassandra
Languages: SQL, SparkR, Java, Python, Scala
Libraries: DataFrames, Streaming, MLlib, GraphX
Engine: Spark Core
Resource/cluster managers: Hadoop YARN, Amazon EC2, Standalone, Apache Mesos