BIG DATA &
MANAGEMENT
1. WHAT IS BIG DATA?
When does data become BIG?
Large Volume of Data – Structured and Unstructured
2.5 quintillion bytes of data are generated every day [Discussed in Lesson 1]
2,500,000,000,000,000,000
2.5 quintillion pennies would, if laid out flat, cover the Earth five times
Bill Gates's projected fortune times 2.5 million, assuming he lives to see 2042 [$102 Billion ~ 419,985,000,000.00]
Can we process this data on traditional computing systems?
2. CHARACTERISTICS
How do you classify data as BIG Data?
Volume: Size
Velocity: High speed of accumulation
Variety: Nature [Structured, Semi-structured, Unstructured]
Veracity: Inconsistencies & uncertainties (quality)
Value: Information turned into knowledge
Volume / Velocity: ~2,500 exabytes generated per year
Variety: Excel files (structured), system logs (semi-structured), CT scans (unstructured)
Veracity: accuracy, trustworthiness, misdiagnosis
Value: disease detection, drug detection, reduced cost
“Huge amounts of complex, variously formatted data, generated at high speed, that cannot be handled or processed by traditional systems.”
3. MANAGEMENT FRAMEWORKS / TOOLS
Popular: Hadoop, Storm, Hive, and Spark
Promising: Flink and Heron
Most useful: Presto and MapReduce
Others: Kafka, Tez, Impala, Beam, Apex, etc.
STORAGE - HADOOP
HDFS – Hadoop Distributed File System
A 400 MB file is split into four 100 MB parts, each stored on a different machine:
PART A (100 MB) → Machine A | PART B (100 MB) → Machine B | PART C (100 MB) → Machine C | PART D (100 MB) → Machine D
STORAGE - HADOOP
HDFS – Hadoop Distributed File System
The same 400 MB file, with each 100 MB part replicated on a second machine for fault tolerance:
Machine A: PART A, PART D | Machine B: PART B, PART C | Machine C: PART C, PART B | Machine D: PART D, PART A
PROCESSING – HADOOP
MapReduce – Parallel Processing
TASK A is split into four subtasks, each run on a different machine:
TASK A1 → Machine A | TASK A2 → Machine B | TASK A3 → Machine C | TASK A4 → Machine D
RESULT = A1 + A2 + A3 + A4 = OUTPUT of TASK A
HADOOP - HDFS
Designed for storing huge datasets on commodity hardware
The Name Node [Master] keeps track of where each part lives; the 400 MB file's parts (PART A–D) are stored on the Data Nodes [Slaves] running on Machines A–D, with each part replicated on a second node (Machine A: PART A, PART D; Machine B: PART B, PART C; Machine C: PART C, PART B; Machine D: PART D, PART A). A toy sketch of this block placement follows.
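Below is a minimal Python sketch of the splitting-and-replication idea above. It is a toy model, not HDFS code: the 100 MB block size and the two copies per part come from the slides (real HDFS defaults are 128 MB blocks and 3 replicas).

# Toy model of HDFS-style block splitting and replica placement.
# Not the real HDFS implementation -- just the idea from the slides:
# cut a file into fixed-size parts and store each part on two machines.
BLOCK_MB = 100                                   # block size used in the slides
MACHINES = ["Machine A", "Machine B", "Machine C", "Machine D"]

def place_blocks(file_mb):
    """Split a file into blocks; put each block on a primary and a mirror node."""
    n_blocks = -(-file_mb // BLOCK_MB)           # ceiling division: 400 MB -> 4
    placement = {m: [] for m in MACHINES}
    for i in range(n_blocks):
        part = f"PART {chr(ord('A') + i)}"
        primary = MACHINES[i % len(MACHINES)]
        mirror = MACHINES[len(MACHINES) - 1 - (i % len(MACHINES))]
        placement[primary].append(part)
        if mirror != primary:                    # second copy on a different node
            placement[mirror].append(part)
    return placement

for machine, parts in place_blocks(400).items():
    print(machine, "->", parts)                  # reproduces the layout on the slide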
HADOOP – MAP REDUCE
Infrastructure
The Master Node receives TASK A and distributes subtasks to the Slave Nodes:
TASK A1 → Machine A | TASK A2 → Machine B | TASK A3 → Machine C | TASK A4 → Machine D
RESULT = A1 + A2 + A3 + A4 = OUTPUT of TASK A
HADOOP – MAP REDUCE PROCESSING/PROGRAMMING
Word count example, stage by stage:
INPUT: Malaysia, Saudi Arabia, Comoros. Bangladesh, Algeria, Malaysia. Comoros. Algeria, Saudi Arabia
SPLIT: [Malaysia, Saudi Arabia, Comoros] [Bangladesh, Algeria, Malaysia] [Comoros] [Algeria, Saudi Arabia]
MAP: (Malaysia, 1) (Saudi Arabia, 1) (Comoros, 1) | (Bangladesh, 1) (Algeria, 1) (Malaysia, 1) | (Comoros, 1) | (Algeria, 1) (Saudi Arabia, 1)
SHUFFLE & SORT: (Malaysia, 1) (Malaysia, 1) | (Saudi Arabia, 1) (Saudi Arabia, 1) | (Comoros, 1) (Comoros, 1) | (Bangladesh, 1) | (Algeria, 1) (Algeria, 1)
REDUCE: (Malaysia, 2) (Saudi Arabia, 2) (Comoros, 2) (Bangladesh, 1) (Algeria, 2)
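Here is a minimal Python sketch of those stages. It is not Hadoop itself (real jobs are typically written in Java or run via Hadoop Streaming); it just mirrors the map → shuffle & sort → reduce logic in one process.

# Minimal sketch of the MapReduce stages from the slide, in one process.
from itertools import groupby

splits = [
    ["Malaysia", "Saudi Arabia", "Comoros"],
    ["Bangladesh", "Algeria", "Malaysia"],
    ["Comoros"],
    ["Algeria", "Saudi Arabia"],
]

# MAP: each split independently emits (word, 1) pairs.
mapped = [(word, 1) for split in splits for word in split]

# SHUFFLE & SORT: order pairs by key so each word's pairs sit together.
mapped.sort(key=lambda kv: kv[0])

# REDUCE: sum the counts for each word.
counts = {word: sum(v for _, v in pairs)
          for word, pairs in groupby(mapped, key=lambda kv: kv[0])}
print(counts)
# {'Algeria': 2, 'Bangladesh': 1, 'Comoros': 2, 'Malaysia': 2, 'Saudi Arabia': 2}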
HADOOP - YARN
YARN – Yet Another Resource Negotiator
Clients (CLIENT A–D) submit applications to YARN, which allocates cluster resources. Each machine runs a Node Manager, and each application gets an App Master plus one or more Containers hosted by those Node Managers. A toy model of this flow follows.
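The Python sketch below is a toy model of that negotiation, not the real YARN API: a client submits a job, the resource manager launches an App Master in a container, and the App Master then obtains worker containers. Class names and slot counts are illustrative assumptions.

# Toy model of the YARN flow on the slide -- not the real YARN API.
class NodeManager:
    def __init__(self, name, slots):
        self.name = name
        self.free = slots                       # free container slots on this node

class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, n):
        """Grant up to n containers from whichever nodes have capacity."""
        granted = []
        for node in self.nodes:
            while node.free > 0 and len(granted) < n:
                node.free -= 1
                granted.append(f"container on {node.name}")
        return granted

def submit(client, rm, needed):
    # 1. The RM starts an App Master for the client in a container of its own.
    (am,) = rm.allocate(1)
    print(f"{client}: App Master runs in a {am}")
    # 2. The App Master negotiates worker containers with the RM.
    for c in rm.allocate(needed):
        print(f"{client}: task {c}")

rm = ResourceManager([NodeManager("Machine A", 2), NodeManager("Machine B", 2)])
submit("CLIENT A", rm, needed=2)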
4. HADOOP ECOSYSTEM
Core Hadoop
Query Engines
External Data Storage
CORE HADOOP
PIG [Procedural Language Platform]
High-level scripting language
Complex data transformation without Java
Simple SQL-like scripting language called Pig Latin
Works with data from many sources, including structured and unstructured data
Stores the results into the Hadoop Distributed File System
Pig scripts are translated into a series of MapReduce jobs before execution
Components
Pig Latin script language
A procedural data-flow language
Examples: LOAD, STORE, etc.
A runtime engine
Parses, validates, and compiles a script into a sequence of MapReduce jobs
Example
A = LOAD 'myfile' AS (x, y, z);          -- read tuples of three fields
B = FILTER A BY x > 0;                   -- keep rows with positive x
C = GROUP B BY x;                        -- one group per distinct x
D = FOREACH C GENERATE group, COUNT(B);  -- count rows in each group
STORE D INTO 'output';                   -- write results back to HDFS
Data Model
Nested model: fields can themselves be tuples, bags, or maps, nested within one another
HIVE
Data warehouse infrastructure tool for processing structured data in Hadoop
It stores the schema in a database and the processed data in HDFS
It is designed for OLAP
It provides an SQL-like query language called HiveQL or HQL
It is familiar, fast, scalable, and extensible
Architecture
[Figure: Hive architecture]
Data Flow
[Figure: Hive data flow]
Data Modeling
Tables
Same as in an RDBMS
Partitions
A table is divided into partitions by the values of a partition key
Buckets
Partitions are subdivided into buckets for more efficient querying
Example
create database office;
show databases;
drop database office;           -- works only if the database is empty
drop database office cascade;   -- drops it even if it contains tables
create database office;
use office;
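To see the partitions and buckets from the Data Modeling slide in practice, here is a minimal Python sketch using the PyHive client. The host, port, and the database/table names are illustrative assumptions, and it presumes a running HiveServer2.

# Sketch: create a partitioned, bucketed Hive table via PyHive (pip install pyhive).
# Host/port and names are assumptions; HiveServer2 listens on 10000 by default.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000)
cur = conn.cursor()

cur.execute("CREATE DATABASE IF NOT EXISTS office")
# Partitioned by department (one HDFS directory per dept value),
# and each partition split into 4 buckets by employee id.
cur.execute("""
    CREATE TABLE IF NOT EXISTS office.employees (
        id   INT,
        name STRING
    )
    PARTITIONED BY (dept STRING)
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
""")
cur.close()
conn.close()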
APACHE AMBARI
Provisioning, managing, and monitoring Apache Hadoop clusters
Intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs
Provisioning:
Step-by-step wizard for installing Hadoop services across any number of hosts
Handles configuration of Hadoop services for the cluster
Managing:
Central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster
Monitoring:
Dashboard for monitoring health and status of the Hadoop cluster
Leverages the Ambari Metrics System for metrics collection
Leverages the Ambari Alert Framework for system alerting; it notifies you when your attention is needed (e.g., a node goes down, remaining disk space is low, etc.)
The REST API behind the UI can also be scripted directly, as sketched below.
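A small Python sketch of querying those RESTful APIs. The host, credentials, and cluster name are assumptions, not values from the slides; Ambari's UI and API listen on port 8080 by default.

# Sketch: read cluster state through Ambari's REST API.
import requests

AMBARI = "http://ambari-host:8080/api/v1"      # assumed host
AUTH = ("admin", "admin")                      # Ambari's default credentials
HEADERS = {"X-Requested-By": "ambari"}         # required on modifying requests; harmless on reads

# List the clusters this Ambari server manages.
clusters = requests.get(f"{AMBARI}/clusters", auth=AUTH, headers=HEADERS).json()
name = clusters["items"][0]["Clusters"]["cluster_name"]

# Check the state of the HDFS service on that cluster.
hdfs = requests.get(f"{AMBARI}/clusters/{name}/services/HDFS",
                    auth=AUTH, headers=HEADERS).json()
print(hdfs["ServiceInfo"]["state"])            # e.g. "STARTED"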
Architecture
[Figure: Ambari architecture]
MESOS – Another Resource Negotiator
[Figure: Mesos architecture]
MESOS vs YARN

                  MESOS                     YARN
Language          C++                       Java
Scheduler         Non-monolithic            Monolithic
Scheduling        Memory & CPU              Memory
Scalability       Highly scalable           Less scalable
Management        Complete data centre      Hadoop jobs
Availability /
fault tolerance   Multiple masters          YARN only
Security          Trusted entities          Multiple layers
SPARK
Speed: run workloads 100x faster
Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine
Ease of use: program using Java, Scala, Python, R, and SQL
Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala, Python, R, and SQL shells
Generality: combine SQL, streaming, and complex analytics
Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming; you can combine these libraries seamlessly in the same application
Runs everywhere: Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, and can access diverse data sources
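The same country word count from the MapReduce slides, as a minimal PySpark sketch. The input path is an assumption, and "local[*]" runs Spark on your own machine rather than a cluster.

# The country word count from the MapReduce slides, in PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").master("local[*]").getOrCreate()

counts = (spark.sparkContext.textFile("hdfs:///data/countries.txt")  # assumed path
          .flatMap(lambda line: line.split(","))   # MAP: split lines into words
          .map(lambda w: (w.strip(), 1))           # emit (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))        # SHUFFLE + REDUCE: sum per word

for word, n in counts.collect():
    print(word, n)

spark.stop()

Submitted to a cluster, the same script runs unchanged; only the master changes (e.g., spark-submit --master yarn wordcount.py).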
Architecture
Spark can run under several cluster managers:
Standalone
Mesos
YARN
Kubernetes
BARAKALLAH FEEKUM! (May Allah bless you!)
Any questions?
Feel free to contact me via the designated channels