Introduction
Big Data Analytics
References & Grading
• References
• “Data Mining – Concepts and Techniques” by Jiawei Han, Micheline Kamber, and Jian Pei, third edition
• “Neural Networks and Deep Learning” (search for “neuralnetworksanddeeplearning” on the Web)
• Grading
• Mid-term exam: 46%
• Final exam: 50%
• Two homework assignments: 4%
This Class
• Basic Concepts of Big Data Processing
• Traditional way of data processing
• Hadoop
• Example: Word count, Page Rank, K-means clustering
• Spark
• Basic Concepts of Analytics
• Frequent Patterns
• Classification
• Neural Networks
• Real-world projects in the telco business
Let’s start with an example
Midnight bus routing in Seoul, Korea
Metric: (# of people present in a region) − (# of people living in the region)
[Maps: ① Existing Route vs. ② Improved Route]
Big Data Analytics Process
• Data Collection: each service feeds its own data into the pipeline (Service 1 → Service 1 Data, Service 2 → Service 2 Data, ..., Service 5 → Service 5 Data)
• Transforming raw data (summarization)
• Finding inter-related variables
• Applying various data analytics algorithms (trial & error)
• Consolidating results
• Verifying results with real data
Example raw data records:
{"S_id": "xx", "R_id": "yy", "S_time": "20201010150021", "E_time": "20201010150313", "S_cid": "aa", "R_cid": "bb"}
{"result": "", "jobid": "66072232606", "host": "", "logid": 2, "time": "", "c_msg_id": "758595208#!900222", "type": "xml-RCP-consume-REQSMS"}
Traditional Data Processing
Midnight bus routing in Seoul, Korea (revisited)
Cloud Computing - What is Cloud Computing?
Complicated data processing tasks that require massive computing power and data storage are executed on central servers (the cloud) rather than on personal PCs or local servers.
Cloud Computing - Google Motivation
◆ From its founding, Google built its service platforms on cloud computing with distributed file systems rather than on a commercial DBMS, which gave it big advantages in cost and scalability.
◆ But this had a disadvantage: parallel applications are very hard to develop.

Young Google's Concerns
⚫ It is a road to success if they can manipulate/analyze the whole data on the Web
⚫ The size of the Web data doubles every year (maybe faster now!)
⚫ They need entire-Web-scale data processing capabilities
⚫ Huge increase in HW/SW cost

DBMS Restrictions
⚫ Only scalable up to a certain point: OLTP, DW
⚫ Does not fit big data processing
⚫ Very expensive to acquire a good DB, and operating costs are also very high
⚫ If the DB is not powerful enough to process all the data, a new, more powerful server must be purchased

◆ Google entered the search engine business on cloud computing platforms, gaining a competitive edge by spending much less on HW/SW than Yahoo.
◆ The only problem: they needed to develop scalable, fault-tolerant, parallel software, which is very hard.
Cloud Computing - Google Cloud Computing Platform

Major Challenges in Parallel Computing
⚫ Transparency: it is very hard to build distributed programs; a consistent programming model is needed that makes it easy to write programs that process huge data
⚫ Fault Tolerance: lots of commodity devices cause lots of trouble
⚫ Scalability: more data needs more HW at low cost

The platform stack (top to bottom)
⚫ Google Search and other Google services (in 2009, 50 billion pages indexed)
⚫ Distributed Data Processing Engine with a Map/Reduce-based programming model, plus a Distributed DB System (BigTable)
⚫ Distributed File System: GFS (Google File System) (in 2008, 200+ GFS clusters with 1,000~5,000 PCs per cluster)
⚫ OS: Linux, optimized by Google
⚫ Commodity PC server clustering (in 2006, about 450,000 PCs)
Contents
1. Cloud Computing
2. Map-Reduce
• Motivation
• Key Idea of Map/Reduce
• Word Count example
• Sort example
• Inverted Index example
• Google PageRank example
• Matrix example
• Impact of Map/Reduce on Google
3. Hadoop
4. Big Data Mining
Map-Reduce and Hadoop - Key Idea of Map/Reduce
Data type: key-value records
Map function:
(K_in, V_in) ➔ list(K_inter, V_inter)
Reduce function:
(K_inter, list(V_inter)) ➔ list(K_out, V_out)
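As a minimal illustration of these two contracts (an assumed sketch, not Hadoop's actual API; the Pair helper and interface names are invented for this example):

import java.util.List;

// Hypothetical helper for this sketch: a minimal key-value pair.
class Pair<K, V> {
  final K key;
  final V value;
  Pair(K key, V value) { this.key = key; this.value = value; }
}

// Map: (K_in, V_in) ➔ list(K_inter, V_inter)
interface MapFn<KIn, VIn, KInter, VInter> {
  List<Pair<KInter, VInter>> map(KIn key, VIn value);
}

// Reduce: (K_inter, list(V_inter)) ➔ list(K_out, V_out)
interface ReduceFn<KInter, VInter, KOut, VOut> {
  List<Pair<KOut, VOut>> reduce(KInter key, List<VInter> values);
}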
◆ Hadoop Platform supports communications between nodes
⚫ Failure recovery
⚫ Scalability
⚫ Load balancing
Map-Reduce and Hadoop - Word Count example
Input ➔ Map ➔ Shuffle & Sort ➔ Reduce ➔ Output

Input splits:
• "the quick brown fox"
• "the fox ate the mouse"
• "how now brown cow"

Map output (one list per split):
• (the, 1), (quick, 1), (brown, 1), (fox, 1)
• (the, 1), (fox, 1), (ate, 1), (the, 1), (mouse, 1)
• (how, 1), (now, 1), (brown, 1), (cow, 1)

After Shuffle & Sort, the first reducer receives brown ➔ [1, 1], fox ➔ [1, 1], how ➔ [1], now ➔ [1], the ➔ [1, 1, 1]; the second receives ate ➔ [1], cow ➔ [1], mouse ➔ [1], quick ➔ [1].

Output: (brown, 2), (fox, 2), (how, 1), (now, 1), (the, 3), (ate, 1), (cow, 1), (mouse, 1), (quick, 1)
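To make the dataflow above concrete, the following small, self-contained Java program (an illustrative sketch, not Hadoop code; the actual Hadoop implementation appears later in this section) runs the same map, shuffle & sort, and reduce phases in memory:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountFlow {
  public static void main(String[] args) {
    List<String> splits = List.of(
        "the quick brown fox", "the fox ate the mouse", "how now brown cow");

    // Map phase: emit (word, 1) for every word in every split.
    List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
    for (String split : splits)
      for (String word : split.split(" "))
        mapped.add(Map.entry(word, 1));

    // Shuffle & sort phase: group the 1s by word (TreeMap keeps keys sorted).
    Map<String, List<Integer>> grouped = new TreeMap<>();
    for (Map.Entry<String, Integer> kv : mapped)
      grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());

    // Reduce phase: sum the grouped values for each word.
    for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
      int sum = e.getValue().stream().mapToInt(Integer::intValue).sum();
      System.out.println(e.getKey() + ", " + sum);   // e.g. "brown, 2"
    }
  }
}

Running it prints each (word, count) pair in sorted key order, matching the Output column above.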
Map-Reduce - Inverted Index example
◆ Input: (filename, text) records
◆ Output: for each word, the list of files containing it
◆ Map:
foreach word in text.split():
output(word, filename)
◆ Combine: uniquify filenames for each word
◆ Reduce:
def reduce(word, filenames):
output(word, sort(filenames))

Example: one input file ([Link]) contains "to be or not to be" and another ([Link]) contains "be not afraid of greatness". Map emits pairs such as (to, [Link]), (be, [Link]), (or, [Link]), (not, [Link]), (afraid, [Link]), (of, [Link]), (greatness, [Link]); Reduce then produces afraid ➔ ([Link]), be ➔ ([Link], [Link]), greatness ➔ ([Link]), not ➔ ([Link], [Link]), of ➔ ([Link]), or ➔ ([Link]), to ➔ ([Link]).
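A Hadoop version of this example, in the same old-style API as the word count code later in this section, might look as follows. This is a sketch under the assumption that the input key is the filename and the value is the file's text (which would require a suitable input format); class names are invented:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import java.util.TreeSet;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map: for each word in the text, emit (word, filename).
class IndexMap extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {
  public void map(Text filename, Text text,
                  OutputCollector<Text, Text> out,
                  Reporter reporter) throws IOException {
    StringTokenizer itr = new StringTokenizer(text.toString());
    while (itr.hasMoreTokens()) {
      out.collect(new Text(itr.nextToken()), filename);
    }
  }
}

// Reduce: collect the filenames for each word, uniquified and sorted.
class IndexReduce extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text word, Iterator<Text> filenames,
                     OutputCollector<Text, Text> out,
                     Reporter reporter) throws IOException {
    TreeSet<String> files = new TreeSet<String>();   // sorted, no duplicates
    while (filenames.hasNext()) {
      files.add(filenames.next().toString());
    }
    out.collect(word, new Text(files.toString()));
  }
}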
Map-Reduce - Google PageRank example
◆ Definition: Let D be the set of all Web pages, let I(p) be the set of pages that link to page p, and let c_j be the total number of links going out of page p_j. The PageRank of page p_i, denoted r_i, is then given by
r_i = Σ_{p_j ∈ I(p_i)} r_j / c_j
◆ PageRank corresponds to the probability distribution of a random walk on the Web graph: the Random Surfer Model
◆ Sometimes a random surfer gets bored and jumps to a different page; with a damping factor d this yields
r_i = (1 − d) / |D| + d · Σ_{p_j ∈ I(p_i)} r_j / c_j
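One PageRank iteration maps naturally onto Map/Reduce: the map phase spreads each page's current rank evenly over its out-links (the r_j / c_j terms), and the reduce phase sums the contributions arriving at each page and applies the damping factor. The sketch below is illustrative, not Google's implementation; the input line format ("pageId<TAB>rank<TAB>comma-separated out-links") and class names are assumptions:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// One iteration. Assumed input line: "pageId<TAB>rank<TAB>out1,out2,..."
class PageRankMap extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out,
                  Reporter reporter) throws IOException {
    String[] f = value.toString().split("\t");
    String page = f[0];
    double rank = Double.parseDouble(f[1]);
    String[] links = f.length > 2 ? f[2].split(",") : new String[0];
    // Pass the link structure through so the reducer can re-emit it.
    out.collect(new Text(page), new Text("LINKS\t" + (f.length > 2 ? f[2] : "")));
    // Distribute this page's rank over its out-links: each gets r_j / c_j.
    for (String link : links)
      out.collect(new Text(link), new Text(Double.toString(rank / links.length)));
  }
}

class PageRankReduce extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  private static final double D = 0.85;   // damping factor (random surfer)
  public void reduce(Text page, Iterator<Text> values,
                     OutputCollector<Text, Text> out,
                     Reporter reporter) throws IOException {
    double sum = 0.0;
    String links = "";
    while (values.hasNext()) {
      String v = values.next().toString();
      if (v.startsWith("LINKS\t")) links = v.substring(6);
      else sum += Double.parseDouble(v);   // incoming r_j / c_j contributions
    }
    // Un-normalized variant of the damped formula above.
    double rank = (1 - D) + D * sum;
    out.collect(page, new Text(rank + "\t" + links));
  }
}

Running this job repeatedly, feeding each iteration's output back in as input, makes the ranks converge to the PageRank values.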
Map-Reduce - Matrix example
[Figure: matrix Addition and Multiplication with Map/Reduce]
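Addition is the easy case: map each element a_{ij} and b_{ij} to the key (i, j) and let the reducer sum the (at most two) values per key. Multiplication C = A × B is more interesting: the map phase replicates each a_{ik} across all columns j of output row i, and each b_{kj} across all rows i of output column j, so the reduce phase can compute C(i, j) = Σ_k a_{ik} · b_{kj}. The sketch below is illustrative only; the input line formats ("A,i,k,value" and "B,k,j,value") and the fixed dimensions M and P are assumptions:

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

class MatMulMap extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private static final int M = 100, P = 100;   // assumed: A is MxN, B is NxP
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out,
                  Reporter reporter) throws IOException {
    String[] t = value.toString().split(",");
    if (t[0].equals("A")) {          // "A,i,k,value": a_{ik} feeds every C(i, j)
      for (int j = 0; j < P; j++)
        out.collect(new Text(t[1] + "," + j), new Text("A," + t[2] + "," + t[3]));
    } else {                          // "B,k,j,value": b_{kj} feeds every C(i, j)
      for (int i = 0; i < M; i++)
        out.collect(new Text(i + "," + t[2]), new Text("B," + t[1] + "," + t[3]));
    }
  }
}

class MatMulReduce extends MapReduceBase
    implements Reducer<Text, Text, Text, DoubleWritable> {
  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, DoubleWritable> out,
                     Reporter reporter) throws IOException {
    Map<Integer, Double> a = new HashMap<Integer, Double>();
    Map<Integer, Double> b = new HashMap<Integer, Double>();
    while (values.hasNext()) {
      String[] t = values.next().toString().split(",");
      (t[0].equals("A") ? a : b).put(Integer.parseInt(t[1]), Double.parseDouble(t[2]));
    }
    double sum = 0.0;                 // C(i, j) = sum over k of a_{ik} * b_{kj}
    for (Map.Entry<Integer, Double> e : a.entrySet()) {
      Double bv = b.get(e.getKey());
      if (bv != null) sum += e.getValue() * bv;
    }
    out.collect(key, new DoubleWritable(sum));
  }
}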
Contents
1. Cloud Computing
2. Map-Reduce
3. Hadoop
• What is Hadoop?
• HDFS
• Principle of Hadoop’s Map/Reduce Execution
• Implementation of Word Count
4. Big Data Mining
Hadoop - What is Hadoop?
Hadoop is an open-source platform based on Java. Hadoop has HDFS (Hadoop Distributed File System), which is similar to GFS. On top of HDFS, Hadoop provides Map/Reduce job execution on a cluster composed of thousands of PCs.

The Hadoop stack (top to bottom)
⚫ Tera-/peta-scale data processing applications
⚫ Hive (SQL-like query handling)
⚫ Distributed Data Processing Engine with a Map/Reduce-based programming model, plus a Distributed DB System (HBase)
⚫ Distributed File System: HDFS (Hadoop Distributed File System)
⚫ Linux OS: Red Hat, CentOS, Ubuntu, ...
⚫ Commodity PC server clustering

[Apache Hadoop Project]
◆ Hadoop Core
⚫ Distributed File System
⚫ Map/Reduce Framework
◆ Pig (initiated by Yahoo!)
⚫ Parallel programming language and runtime
◆ HBase (initiated by Powerset)
⚫ Table storage for semi-structured data
◆ Hive (initiated by Facebook)
⚫ SQL-like query language & meta-store

◆ Typical PC server
⚫ CPU: Intel quad-core × 2
⚫ Memory: 16 GB
⚫ HDD: 500 GB × 4
⚫ NIC: Gigabit Ethernet × 2 ports
Hadoop - Implementation of Word Count
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> out,
                  Reporter reporter) throws IOException {
    // Tokenize the input line and emit (word, 1) for every token.
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      out.collect(new Text(itr.nextToken()), ONE);
    }
  }
}
Hadoop - Implementation of Word Count (continued)
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class ReduceClass extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> out,
                     Reporter reporter) throws IOException {
    // Sum the counts for this word and emit (word, total).
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    out.collect(key, new IntWritable(sum));
  }
}
Hadoop - Implementation of Word Count (continued)
public static void main(String[] args) throws Exception {
  // Assumes this method lives in a driver class named WordCount.
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(ReduceClass.class);    // combine partial counts on each mapper
  conf.setReducerClass(ReduceClass.class);
  FileInputFormat.setInputPaths(conf, args[0]);
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  conf.setOutputKeyClass(Text.class);          // out keys are words (strings)
  conf.setOutputValueClass(IntWritable.class); // values are counts
  JobClient.runJob(conf);
}
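Assuming the three pieces above are compiled into a driver class named WordCount and packaged as wordcount.jar (both names hypothetical), the job would be launched with Hadoop's standard jar runner:

hadoop jar wordcount.jar WordCount <input-dir> <output-dir>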