BIG DATA &
MANAGEMENT
1. WHAT IS BIG DATA?
When does data become BIG?
Large Volume of Data – Structured and Unstructured
2.5 quintillion bytes of data are generated every day [Discussed in Lesson 1]
2,500,000,000,000,000,000
2.5 quintillion pennies would, if laid out flat, cover the Earth five times
Bill Gates's projected fortune times 2.5 million, assuming he lives to see 2042 [$102 Billion ~ 419,985,000,000.00]
Can we process this data on traditional computing systems?
2. CHARACTERISTICS
How do you classify data as BIG Data?
Volume: Size
Velocity: High speed of accumulation
Variety: Nature [Structured, Semi-structured, Unstructured]
Veracity: Inconsistencies & uncertainties (quality)
Value: Information turned into knowledge
Volume / Velocity: ~2,500 exabytes generated per year
Variety: Excel files (structured), system logs (semi-structured), CT scans (unstructured)
Veracity: accuracy, trustworthiness, misdiagnosis
Value: disease detection, drug detection, reduced cost
“Huge amounts of complex, variously formatted data, generated at high speed, that cannot be handled or processed by traditional systems.”
3. MANAGEMENT FRAMEWORKS / TOOLS
Popular: Hadoop, Storm, Hive, and Spark
Promising: Flink and Heron
Most useful: Presto and MapReduce
Others: Kafka, Tez, Impala, Beam, Apex, etc.
STORAGE - HADOOP
HDFS – Hadoop Distributed File System
A 400 MB file is split into four 100 MB parts, each stored on a different machine:
PART A (100 MB) → Machine A | PART B (100 MB) → Machine B | PART C (100 MB) → Machine C | PART D (100 MB) → Machine D
STORAGE - HADOOP
HDFS – Hadoop Distributed File System
The same 400 MB file, with each 100 MB part replicated on a second machine for fault tolerance:
Machine A: PART A, PART D | Machine B: PART B, PART C | Machine C: PART C, PART B | Machine D: PART D, PART A
PROCESSING – HADOOP
MapReduce – Parallel Processing
TASK A is split into four subtasks, each run on a different machine:
TASK A1 → Machine A | TASK A2 → Machine B | TASK A3 → Machine C | TASK A4 → Machine D
RESULT = A1 + A2 + A3 + A4 = OUTPUT of TASK A
HADOOP - HDFS
Designed for storing huge datasets on commodity hardware
The Name Node [Master] keeps track of where each part lives; the 400 MB file's parts (PART A–D) are stored on the Data Nodes [Slaves] running on Machines A–D, with each part replicated on a second node (Machine A: PART A, PART D; Machine B: PART B, PART C; Machine C: PART C, PART B; Machine D: PART D, PART A). A toy sketch of this block placement follows.
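Below is a minimal Python sketch of the splitting-and-replication idea above. It is a toy model, not HDFS code: the 100 MB block size and the two copies per part come from the slides (real HDFS defaults are 128 MB blocks and 3 replicas).

# Toy model of HDFS-style block splitting and replica placement.
# Not the real HDFS implementation -- just the idea from the slides:
# cut a file into fixed-size parts and store each part on two machines.
BLOCK_MB = 100                                   # block size used in the slides
MACHINES = ["Machine A", "Machine B", "Machine C", "Machine D"]

def place_blocks(file_mb):
    """Split a file into blocks; put each block on a primary and a mirror node."""
    n_blocks = -(-file_mb // BLOCK_MB)           # ceiling division: 400 MB -> 4
    placement = {m: [] for m in MACHINES}
    for i in range(n_blocks):
        part = f"PART {chr(ord('A') + i)}"
        primary = MACHINES[i % len(MACHINES)]
        mirror = MACHINES[len(MACHINES) - 1 - (i % len(MACHINES))]
        placement[primary].append(part)
        if mirror != primary:                    # second copy on a different node
            placement[mirror].append(part)
    return placement

for machine, parts in place_blocks(400).items():
    print(machine, "->", parts)                  # reproduces the layout on the slide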
HADOOP – MAP REDUCE
Infrastructure
The Master Node receives TASK A and distributes subtasks to the Slave Nodes:
TASK A1 → Machine A | TASK A2 → Machine B | TASK A3 → Machine C | TASK A4 → Machine D
RESULT = A1 + A2 + A3 + A4 = OUTPUT of TASK A
HADOOP – MAP REDUCE PROCESSING/PROGRAMMING
Word count example, stage by stage:
INPUT: Malaysia, Saudi Arabia, Comoros. Bangladesh, Algeria, Malaysia. Comoros. Algeria, Saudi Arabia
SPLIT: [Malaysia, Saudi Arabia, Comoros] [Bangladesh, Algeria, Malaysia] [Comoros] [Algeria, Saudi Arabia]
MAP: (Malaysia, 1) (Saudi Arabia, 1) (Comoros, 1) | (Bangladesh, 1) (Algeria, 1) (Malaysia, 1) | (Comoros, 1) | (Algeria, 1) (Saudi Arabia, 1)
SHUFFLE & SORT: (Malaysia, 1) (Malaysia, 1) | (Saudi Arabia, 1) (Saudi Arabia, 1) | (Comoros, 1) (Comoros, 1) | (Bangladesh, 1) | (Algeria, 1) (Algeria, 1)
REDUCE: (Malaysia, 2) (Saudi Arabia, 2) (Comoros, 2) (Bangladesh, 1) (Algeria, 2)
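Here is a minimal Python sketch of those stages. It is not Hadoop itself (real jobs are typically written in Java or run via Hadoop Streaming); it just mirrors the map → shuffle & sort → reduce logic in one process.

# Minimal sketch of the MapReduce stages from the slide, in one process.
from itertools import groupby

splits = [
    ["Malaysia", "Saudi Arabia", "Comoros"],
    ["Bangladesh", "Algeria", "Malaysia"],
    ["Comoros"],
    ["Algeria", "Saudi Arabia"],
]

# MAP: each split independently emits (word, 1) pairs.
mapped = [(word, 1) for split in splits for word in split]

# SHUFFLE & SORT: order pairs by key so each word's pairs sit together.
mapped.sort(key=lambda kv: kv[0])

# REDUCE: sum the counts for each word.
counts = {word: sum(v for _, v in pairs)
          for word, pairs in groupby(mapped, key=lambda kv: kv[0])}
print(counts)
# {'Algeria': 2, 'Bangladesh': 1, 'Comoros': 2, 'Malaysia': 2, 'Saudi Arabia': 2}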
HADOOP - YARN
YARN – Yet Another Resource Negotiator
Clients (CLIENT A–D) submit applications to YARN, which allocates cluster resources. Each machine runs a Node Manager, and each application gets an App Master plus one or more Containers hosted by those Node Managers. A toy model of this flow follows.
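The Python sketch below is a toy model of that negotiation, not the real YARN API: a client submits a job, the resource manager launches an App Master in a container, and the App Master then obtains worker containers. Class names and slot counts are illustrative assumptions.

# Toy model of the YARN flow on the slide -- not the real YARN API.
class NodeManager:
    def __init__(self, name, slots):
        self.name = name
        self.free = slots                       # free container slots on this node

class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, n):
        """Grant up to n containers from whichever nodes have capacity."""
        granted = []
        for node in self.nodes:
            while node.free > 0 and len(granted) < n:
                node.free -= 1
                granted.append(f"container on {node.name}")
        return granted

def submit(client, rm, needed):
    # 1. The RM starts an App Master for the client in a container of its own.
    (am,) = rm.allocate(1)
    print(f"{client}: App Master runs in a {am}")
    # 2. The App Master negotiates worker containers with the RM.
    for c in rm.allocate(needed):
        print(f"{client}: task {c}")

rm = ResourceManager([NodeManager("Machine A", 2), NodeManager("Machine B", 2)])
submit("CLIENT A", rm, needed=2)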
4. HADOOP ECOSYSTEM
Core Hadoop
Query Engines
External Data Storage
CORE HADOOP
PIG [Procedural Language Platform]
High-level scripting language
Complex data transformation without Java
Simple SQL-like scripting language called Pig Latin
Works with data from many sources, including structured and unstructured data
Stores the results into the Hadoop Distributed File System
Pig scripts are translated into a series of MapReduce jobs before execution
Components
Pig Latin script language
A procedural data-flow language
Examples: LOAD, STORE, etc.
A runtime engine
Parses, validates, and compiles a script into a sequence of MapReduce jobs
Example
A = LOAD 'myfile' AS (x, y, z);          -- read tuples of three fields
B = FILTER A BY x > 0;                   -- keep rows with positive x
C = GROUP B BY x;                        -- one group per distinct x
D = FOREACH C GENERATE group, COUNT(B);  -- count rows in each group
STORE D INTO 'output';                   -- write results back to HDFS
Data Model
Nested model: fields can themselves be tuples, bags, or maps, nested within one another
HIVE
Data warehouse infrastructure tool for processing structured data in Hadoop
It stores the schema in a database and the processed data in HDFS
It is designed for OLAP
It provides an SQL-like query language called HiveQL or HQL
It is familiar, fast, scalable, and extensible
Architecture
[Figure: Hive architecture]
Data Flow
[Figure: Hive data flow]
Data Modeling
Tables
Same as in an RDBMS
Partitions
A table is divided into partitions by the values of a partition key
Buckets
Partitions are subdivided into buckets for more efficient querying
Example
create database office;
show databases;
drop database office;           -- works only if the database is empty
drop database office cascade;   -- drops it even if it contains tables
create database office;
use office;
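To see the partitions and buckets from the Data Modeling slide in practice, here is a minimal Python sketch using the PyHive client. The host, port, and the database/table names are illustrative assumptions, and it presumes a running HiveServer2.

# Sketch: create a partitioned, bucketed Hive table via PyHive (pip install pyhive).
# Host/port and names are assumptions; HiveServer2 listens on 10000 by default.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000)
cur = conn.cursor()

cur.execute("CREATE DATABASE IF NOT EXISTS office")
# Partitioned by department (one HDFS directory per dept value),
# and each partition split into 4 buckets by employee id.
cur.execute("""
    CREATE TABLE IF NOT EXISTS office.employees (
        id   INT,
        name STRING
    )
    PARTITIONED BY (dept STRING)
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
""")
cur.close()
conn.close()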
APACHE AMBARI
Provisioning, managing, and monitoring Apache Hadoop clusters
Intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs
Provisioning:
Step-by-step wizard for installing Hadoop services across any number of hosts
Handles configuration of Hadoop services for the cluster
Managing:
Central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster
Monitoring:
Dashboard for monitoring health and status of the Hadoop cluster
Leverages the Ambari Metrics System for metrics collection
Leverages the Ambari Alert Framework for system alerting; it notifies you when your attention is needed (e.g., a node goes down, remaining disk space is low, etc.)
The REST API behind the UI can also be scripted directly, as sketched below.
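A small Python sketch of querying those RESTful APIs. The host, credentials, and cluster name are assumptions, not values from the slides; Ambari's UI and API listen on port 8080 by default.

# Sketch: read cluster state through Ambari's REST API.
import requests

AMBARI = "http://ambari-host:8080/api/v1"      # assumed host
AUTH = ("admin", "admin")                      # Ambari's default credentials
HEADERS = {"X-Requested-By": "ambari"}         # required on modifying requests; harmless on reads

# List the clusters this Ambari server manages.
clusters = requests.get(f"{AMBARI}/clusters", auth=AUTH, headers=HEADERS).json()
name = clusters["items"][0]["Clusters"]["cluster_name"]

# Check the state of the HDFS service on that cluster.
hdfs = requests.get(f"{AMBARI}/clusters/{name}/services/HDFS",
                    auth=AUTH, headers=HEADERS).json()
print(hdfs["ServiceInfo"]["state"])            # e.g. "STARTED"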
Architecture
[Figure: Ambari architecture]
MESOS – Another Resource Negotiator
[Figure: Mesos architecture]
MESOS vs YARN

                  MESOS                     YARN
Language          C++                       Java
Scheduler         Non-monolithic            Monolithic
Scheduling        Memory & CPU              Memory
Scalability       Highly scalable           Less scalable
Management        Complete data centre      Hadoop jobs
Availability /
fault tolerance   Multiple masters          YARN only
Security          Trusted entities          Multiple layers
SPARK
Speed: run workloads 100x faster
Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine
Ease of use: program using Java, Scala, Python, R, and SQL
Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala, Python, R, and SQL shells
Generality: combine SQL, streaming, and complex analytics
Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming; you can combine these libraries seamlessly in the same application
Runs everywhere: Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, and can access diverse data sources
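The same country word count from the MapReduce slides, as a minimal PySpark sketch. The input path is an assumption, and "local[*]" runs Spark on your own machine rather than a cluster.

# The country word count from the MapReduce slides, in PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").master("local[*]").getOrCreate()

counts = (spark.sparkContext.textFile("hdfs:///data/countries.txt")  # assumed path
          .flatMap(lambda line: line.split(","))   # MAP: split lines into words
          .map(lambda w: (w.strip(), 1))           # emit (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))        # SHUFFLE + REDUCE: sum per word

for word, n in counts.collect():
    print(word, n)

spark.stop()

Submitted to a cluster, the same script runs unchanged; only the master changes (e.g., spark-submit --master yarn wordcount.py).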
Architecture
Spark can run under several cluster managers:
Standalone
Mesos
YARN
Kubernetes
BARAKALLAH FEEKUM! (May Allah bless you!)
Any questions?
Feel free to contact me via the designated channels