
Bigdata Hadoop and Spark

by Sumit Mittal
Welcome
to
Bigdata Hadoop & Spark Demo
Trainer Introduction

Mr. Sumit Mittal is the CEO & founder of TrendyTech. He has a Master's degree in Computer Applications from NIT Trichy and a total of 7+ years of industry experience. He has worked for top MNCs like Cisco & VMware.

Consistently 5-star Google-rated Bigdata course


What is Bigdata ?

Bigdata is a term that describes very large volumes of data.
There are many definitions of Bigdata.
How to classify Bigdata ?

3v's of Bigdata

3v's of Bigdata ▶

Formal Definition
• Volume
• Variety
• Velocity
3v’s of Bigdata ▶

Volume (Scale of data)
2.5 quintillion (2,500,000,000,000,000,000) bytes of data are created each day.
3v’s of Bigdata ▶

Variety (Different forms of data)

Structured data: RDBMS databases (Oracle & MySQL)
Semi-structured data: CSV, XML, JSON
Unstructured data: Audio, Video, Image, Log files
3v’s of Bigdata ▶

Velocity (Speed of data)

900 Million photos on Facebook
600 Million tweets on Twitter
0.5 Million hours of video on YouTube
3.5 Billion searches on Google
4th v of Bigdata

4th v of Bigdata ▶

Veracity (Uncertainty of data)

Poor quality data
Unclean data
Why Bigdata ?
Why Bigdata ?

Process
To process huge amounts of data that traditional systems are not capable of processing.
Why Bigdata ?

Store
To process huge amounts of data, we first need to store it.
Why Bigdata ?

Store
Are our traditional systems capable of storing such massive amounts of data?
Bigdata System Requirements ?
Bigdata System Requirements ▶

Store huge amounts of data
Traditional systems are NOT fit to store such huge amounts of data.
Bigdata System Requirements ▶

Store: store massive amounts of data
?
Bigdata System Requirements ▶

Process huge amounts of data in an efficient and timely manner
Traditional systems are NOT capable of handling this.
Bigdata System Requirements ▶

Store: store massive amounts of data
Process: process it in a timely manner
?
Bigdata System Requirements ▶

Scale easily to accommodate growing requirements
Traditional systems have serious limitations.
Bigdata System Requirements ▶

Store: store massive amounts of data
Process: process it in a timely manner
Scale: scale easily as data grows
Two ways to build a system

Monolithic Distributed
2 ways to build a system ▶

Monolithic
One powerful system with a lot of resources
2 ways to build a system ▶

Distributed
Many smaller systems
come together
Monolithic or Distributed ?
Monolithic or Distributed ? ▶

Monolithic
A single powerful server.
Hard to add resources after a certain limit.
Monolithic or Distributed ? ▶

Monolithic: Resources

RAM: 8 GB (Memory) | Hard Disk: 1 TB (Storage) | CPU: Quad core (Compute)
Monolithic or Distributed ? ▶

Is Monolithic scalable?
NO! 2x resources ≠ 2x performance
Monolithic or Distributed ? ▶

Distributed

A 6 Node Cluster (each machine is a Node)
Monolithic or Distributed ? ▶

Distributed
Many small and cheap computers come together... to act as a single entity.
Monolithic or Distributed ? ▶

Is a Distributed system scalable?
Yes! Distributed systems are linearly scalable.
Monolithic or Distributed ? ▶

Distributed systems are scalable:
2x resources = 2x speed
Monolithic or Distributed ? ▶

Monolithic Architecture: Vertical Scaling (Not true scaling)
Distributed Architecture: Horizontal Scaling (True scaling)
Monolithic or Distributed

That is why all good Bigdata systems are based on Distributed architecture.
What is Hadoop ?

What is Hadoop ?

Hadoop is a framework to solve Bigdata problems.
Hadoop Evolution
Hadoop Evolution

2003: Google released a paper describing how to store large datasets.
Hadoop Evolution

2003: This paper was called GFS (Google File System).
Hadoop Evolution

2004: Google released another paper describing how to process large datasets.
Hadoop Evolution

2004: This paper was called MapReduce.
Hadoop Evolution

2006: Yahoo took these papers and implemented them.
Hadoop Evolution
2006: The implementation of GFS was named HDFS (Hadoop Distributed File System).
The implementation of MapReduce kept the name MapReduce (unchanged).
Hadoop Evolution
Hadoop 1.0

HDFS for distributed storage | MapReduce for distributed processing
Hadoop Evolution

2009: Hadoop came under the Apache Software Foundation and became open source.
Hadoop Evolution

2013: Apache released Hadoop 2.0, providing major performance enhancements.
Hadoop Evolution

2003: Google GFS paper
2004: Google MapReduce paper
2006: Yahoo implementations (HDFS & MapReduce)
2009: Hadoop under Apache
2013: Hadoop 2.0 released
Hadoop Evolution

Hadoop 1.0: MapReduce on HDFS
Hadoop 2.0: MapReduce and YARN on HDFS
Hadoop Evolution

What is YARN ?
Hadoop Evolution

YARN (Yet Another Resource Negotiator)
Mainly responsible for resource management.
Hadoop Evolution

HADOOP CORE COMPONENTS

HDFS for distributed storage | MR for distributed processing | YARN for resource management
Hadoop Ecosystem
Hadoop Ecosystem ▶

Hadoop Core (HDFS, MR, YARN) surrounded by ecosystem tools: HIVE, PIG, SQOOP, HBASE, OOZIE
Hadoop Ecosystem ▶

HIVE
A data warehouse tool built on top of Apache Hadoop, providing data query and analysis.
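To make "data query and analysis" a little more concrete, here is a minimal sketch (not from the slides) of running a SQL-style query over a Hive table from Spark's Scala API, which fits the course's Scala/Spark context; the table name `orders` and its columns are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object HiveQueryDemo {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() lets Spark read table definitions from the Hive metastore
    val spark = SparkSession.builder()
      .appName("hive-query-demo")
      .enableHiveSupport()
      .getOrCreate()

    // SQL-style analysis over a (hypothetical) Hive table stored in Hadoop
    spark.sql("SELECT country, COUNT(*) AS order_count FROM orders GROUP BY country").show()

    spark.stop()
  }
}
```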
Hadoop Ecosystem ▶

PIG
A scripting language for data manipulation. Transforms unstructured data into a structured format.
Hadoop Ecosystem ▶

SQOOP
A command-line interface application for transferring data between relational databases and Hadoop.
Hadoop Ecosystem ▶

HBASE
A column-oriented NoSQL database that runs on top of HDFS.
Hadoop Ecosystem ▶

OOZIE
A workflow scheduler system to manage Apache Hadoop jobs.
Hadoop Ecosystem ▶

SPARK
A distributed, general-purpose, in-memory compute engine.
Introduction to Spark
Introduction to Spark ▶

Apache Spark is a general-purpose, in-memory compute engine.
Introduction to Spark ▶

In a Hadoop Cluster:

HDFS (Storage Unit) | MapReduce (Compute Engine) | YARN (Resource Manager)
Introduction to Spark ▶

In a Hadoop Cluster:

HDFS | MapReduce | YARN
  ↓ (Spark replaces MapReduce as the Compute Engine)
HDFS | SPARK | YARN
Introduction to Spark ▶

SPARK
A plug & play compute engine.
Introduction to Spark ▶

Plug it with any Storage System: LOCAL STORAGE / HDFS / AMAZON S3
Plug it with any Resource Manager: YARN / MESOS / KUBERNETES
Introduction to Spark ▶

SPARK CLUSTER

Compute: SPARK
Storage: Local / HDFS / Amazon S3
Resource Manager: YARN / MESOS / KUBERNETES
Introduction to Spark ▶

Current Industry Trend

Compute: SPARK
Storage: HDFS
Resource Manager: YARN
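As a rough sketch of what this stack looks like from the application side (the app name and file path below are hypothetical, and this is only one way to wire it up): the resource manager is normally chosen at submit time, and the storage layer is just a path prefix, so the same Scala code could read from local disk or S3 instead of HDFS.

```scala
import org.apache.spark.sql.SparkSession

object TrendStackSketch {
  def main(args: Array[String]): Unit = {
    // The resource manager (YARN here, per the slide) is usually picked when
    // submitting the job, e.g.: spark-submit --master yarn --deploy-mode cluster ...
    val spark = SparkSession.builder()
      .appName("spark-on-yarn-over-hdfs")   // illustrative app name
      .getOrCreate()

    // Storage is pluggable: an hdfs://, s3a://, or plain local path all work here
    val lines = spark.read.textFile("hdfs:///data/demo/events.log")  // hypothetical path

    println(s"Total lines read: ${lines.count()}")
    spark.stop()
  }
}
```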
Introduction to Spark ▶

SPARK
Spark is written in Scala.
However, Spark officially supports Java, Scala, Python and R.
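Since the slides stop at what Spark is, here is a minimal word-count sketch in Scala (Spark's native language) to show what a small Spark job looks like; the input path is a placeholder, and local mode is used only so the demo runs on a single machine.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("word-count-demo")
      .master("local[*]")              // run locally for a demo; on a cluster, use YARN
      .getOrCreate()

    // Classic word count on the RDD API; "input.txt" is a placeholder path
    val counts = spark.sparkContext.textFile("input.txt")
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)   // print a small sample of (word, count) pairs
    spark.stop()
  }
}
```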
Key Course Highlights
• 5 Star Google Rated Big Data Course
• All topics related to Bigdata Hadoop, Scala, Spark, and Bigdata on AWS Cloud are covered in depth
• Hands-on learning so that you get really confident
• 15 weeks of online extended course designed for working professionals
• Live capstone projects, regular assignments & assessments
• 150+ hours of quality learning, specially designed to crack top companies
• Wide range of interview questions covered, along with resume preparation & career guidance
Trainer : Mr. Sumit Mittal
LinkedIn : [Link]
Website : [Link]
Call : 9108179578
email : [Link]@[Link]
YouTube channel : TrendyTech
