BigDataAnalytics Week 01

Introduction

Big Data Analytics


References & Grading
• References
• “Data Mining – Concepts and Techniques” by Jiawei Han, Micheline Kamber, and Jian Pei, third edition
• “Neural Networks and Deep Learning” (search for neuralnetworksanddeeplearning on the Web)
• Grading
• Mid-term exam: 46%
• Final exam: 50%
• 2 homeworks: 4%
This Class

• Basic Concepts of Big Data Processing
  • Traditional way of data processing
  • Hadoop
  • Examples: word count, PageRank, k-means clustering
  • Spark

• Basic Concepts of Analytics
  • Frequent patterns
  • Classification
  • Neural networks

• Real-world projects in the telco business

5
Let’s start with an example
Midnight bus routing in Seoul, Korea
# of people in a region − # of people living in the region

[Figure: ① Existing Route ② Improved Route]


Big Data Analytics Process

Data Collection: data from each service (Service 1 … Service 5) is collected into a per-service data set (Service 1 Data … Service 5 Data).

Analytics Process (the steps are interrelated and involve trial & error):
• Transforming raw data
• Summary
• Finding variables
• Applying various data analytics algorithms
• Consolidating results
• Verifying results with real data

Sample raw records:
{“S_id”: “xx”, “R_id”: “yy”, “S_time”: “20201010150021”, “E_time”: “20201010150313”, “S_cid”: “aa”, “R_cid”: “bb”}
{"result":"","jobid":"66072232606","host":"","logid":2,"time":"","c_msg_id":"758595208#!900222","type":"xml-RCP-consume-REQSMS"}
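Records like the first sample above are what the “transforming raw data” step turns into analysis-ready features. Below is a hypothetical plain-Java sketch that derives one such feature, session duration, from the S_time / E_time fields; the field names follow the sample record, and the timestamp layout (yyyyMMddHHmmss) is an assumption read off the sample values:

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch of one "transforming raw data" step: compute a session
// duration in seconds from S_time / E_time strings as they appear in the
// sample record. The yyyyMMddHHmmss layout is an assumption, not confirmed
// by any schema in the slides.
public class RawRecordSketch {

    private static final DateTimeFormatter TS =
        DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    public static long durationSeconds(String sTime, String eTime) {
        return Duration.between(LocalDateTime.parse(sTime, TS),
                                LocalDateTime.parse(eTime, TS)).getSeconds();
    }
}
```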
Traditional Data Processing

11

Cloud Computing – What is Cloud Computing?

Complicated data processing tasks that require massive computing power and data storage are executed on central servers (the cloud) rather than on personal PCs or local servers.

13
Cloud Computing – Google’s Motivation

◆ From its foundation, Google adopted service platforms based on cloud computing with distributed file systems rather than a commercial DBMS, which brought big advantages in cost and scalability.
◆ The disadvantage: it is very hard to develop parallel applications.

Young Google’s Concern

◆ It’s a road to success if they can manipulate/analyze the whole data on the Web
◆ The size of the Web’s data doubles every year (maybe more than that now!)
  ⚫ They need entire-Web-scale data processing capabilities
  ⚫ Huge increase in HW/SW cost
◆ DBMS restrictions
  ⚫ Only scalable up to a certain point: OLTP, DW
  ⚫ Doesn’t fit big data processing
  ⚫ Very expensive to acquire a good DB, and the operating cost is also very high
  ⚫ If the DB is not powerful enough to process all the data, a new, more powerful server must be purchased

◆ Google entered the search engine business using cloud computing platforms to gain a competitive edge, spending much less on HW/SW than Yahoo.
◆ The only problem: they needed to develop scalable, fault-tolerant, parallel software, which is very hard.

14
Cloud Computing – Google Cloud Computing Platform

Major challenges of parallel computing:
⚫ Transparency: it’s very hard to build distributed programs; a consistent programming model is needed that is easy to program and can process huge data
⚫ Fault tolerance: lots of commodity devices cause lots of trouble
⚫ Scalability: more data needs more HW at low cost

Google’s platform stack (top to bottom):
⚫ Google Search and other Google services (in 2009, 50 billion pages indexed)
⚫ Distributed data processing engine: Map/Reduce-based programming model
⚫ Distributed DB system: BigTable
⚫ Distributed file system: GFS (Google File System) (in 2008, 200+ GFS clusters, 1,000–5,000 PCs per cluster)
⚫ OS: Linux, optimized by Google
⚫ Commodity PC server clustering (in 2006, about 450,000 PCs)

15
Contents

1. Cloud Computing
2. Map-Reduce
• Motivation
• Key Idea of Map/Reduce
• Word Count example
• Sort example
• Inverted Index example
• Google Page Rank example
• Matrix example
• Impact of Map/Reduce on Google
3. Hadoop
4. Big Data Mining

16
Map-Reduce and Hadoop – Key Idea of Map/Reduce

⚫ Data type: key-value records
⚫ Map function:
  (K_in, V_in) → list(K_inter, V_inter)
⚫ Reduce function:
  (K_inter, list(V_inter)) → list(K_out, V_out)

◆ The Hadoop platform handles the communication between nodes:
⚫ Failure recovery
⚫ Scalability
⚫ Load balancing

17
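The signatures above can be made concrete with a small in-memory sketch in plain Java. This is a toy, not the Hadoop API: `shuffle` plays the role of the framework’s shuffle & sort (group intermediate values by key), and word count is used as the instance of the map/reduce pair:

```java
import java.util.*;

// In-memory toy model of the slide's signatures, not the Hadoop API:
//   map:    (Kin, Vin)            -> list(Kinter, Vinter)
//   reduce: (Kinter, list(Vinter)) -> list(Kout, Vout)
public class MapReduceSketch {

    // "Shuffle & sort": group intermediate values by key, keys in sorted order.
    static <K extends Comparable<K>, V> SortedMap<K, List<V>> shuffle(
            List<Map.Entry<K, V>> pairs) {
        SortedMap<K, List<V>> groups = new TreeMap<>();
        for (Map.Entry<K, V> e : pairs)
            groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                  .add(e.getValue());
        return groups;
    }

    // Word count as an instance: map emits (word, 1); reduce sums each list.
    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> inter = new ArrayList<>();
        for (String line : lines)                       // map phase
            for (String w : line.split("\\s+"))
                inter.add(Map.entry(w, 1));
        Map<String, Integer> out = new LinkedHashMap<>();
        shuffle(inter).forEach((word, ones) ->          // reduce phase
            out.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }
}
```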
Map-Reduce and Hadoop – Word Count Example

Input → Map → Shuffle & Sort → Reduce → Output

Input splits (one per map task):
  “the quick brown fox”
  “the fox ate the mouse”
  “how now brown cow”

Map output, (word, 1) pairs: (the, 1), (quick, 1), (brown, 1), (fox, 1), (the, 1), (fox, 1), (ate, 1), (the, 1), (mouse, 1), (how, 1), (now, 1), (brown, 1), (cow, 1)

After shuffle & sort, each reducer receives all the counts for its keys and sums them.

Output: (ate, 1), (brown, 2), (cow, 1), (fox, 2), (how, 1), (mouse, 1), (now, 1), (quick, 1), (the, 3)
18
Map-Reduce – Inverted Index Example

◆ Input: (filename, text) records
◆ Output: list of files containing each word
◆ Map:
  foreach word in text.split():
    output(word, filename)
◆ Combine: uniquify filenames for each word
◆ Reduce:
  def reduce(word, filenames):
    output(word, sort(filenames))

Example input: hamlet.txt = “to be or not to be”, 12th.txt = “be not afraid of greatness”
Map output: (to, hamlet.txt), (be, hamlet.txt), (or, hamlet.txt), (not, hamlet.txt), (to, hamlet.txt), (be, hamlet.txt), (be, 12th.txt), (not, 12th.txt), (afraid, 12th.txt), (of, 12th.txt), (greatness, 12th.txt)
Reduce output: afraid → (12th.txt); be → (12th.txt, hamlet.txt); greatness → (12th.txt); not → (12th.txt, hamlet.txt); of → (12th.txt); or → (hamlet.txt); to → (hamlet.txt)

19
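The inverted-index pseudo-code above can be sketched as runnable plain Java (an in-memory toy, not the Hadoop API; the file names in the test are illustrative). The map phase emits (word, filename) pairs; the reduce phase uniquifies and sorts the file list for each word, here done in one step with a sorted set:

```java
import java.util.*;

// In-memory sketch of the inverted index as map/reduce (not the Hadoop API).
// Map emits (word, filename); reduce collects a sorted, de-duplicated file
// list per word. Using TreeMap/TreeSet folds the combine ("uniquify") and
// the sort into the grouping step.
public class InvertedIndexSketch {

    public static SortedMap<String, SortedSet<String>> invert(
            Map<String, String> files) {
        // map phase: one (word, filename) pair per word occurrence
        List<String[]> pairs = new ArrayList<>();
        files.forEach((name, text) -> {
            for (String w : text.split("\\s+"))
                pairs.add(new String[] {w, name});
        });
        // shuffle + reduce: group filenames by word, uniquify and sort
        SortedMap<String, SortedSet<String>> index = new TreeMap<>();
        for (String[] p : pairs)
            index.computeIfAbsent(p[0], k -> new TreeSet<>()).add(p[1]);
        return index;
    }
}
```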
Map-Reduce – Google PageRank Example

◆ Definition: Let D be the set of all Web pages, let I(p_i) be the set of pages that link to page p_i, and let c_j be the total number of links going out of page p_j. The PageRank of page p_i, denoted r_i, is then given by

  r_i = Σ_{p_j ∈ I(p_i)} r_j / c_j

◆ PageRank corresponds to the stationary probability distribution of a random walk on the Web graph: the Random Surfer Model.

◆ Sometimes the random surfer gets bored and jumps to a different page; with damping factor d and N = |D| pages this gives

  r_i = (1 − d)/N + d · Σ_{p_j ∈ I(p_i)} r_j / c_j
20
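The damped formula above can be iterated to a fixed point. Here is a hedged plain-Java sketch of that power iteration (not the map/reduce formulation on the next slide, and it ignores dangling pages for simplicity):

```java
import java.util.*;

// Sketch of the PageRank power iteration:
//   r_i = (1 - d)/N + d * sum over p_j linking to i of r_j / c_j
// with damping d modeling the bored random surfer's jump. links[j] lists
// the pages that page j links out to; dangling pages (no out-links) simply
// contribute nothing here, a simplification.
public class PageRankSketch {

    public static double[] pageRank(int[][] links, double d, int iters) {
        int n = links.length;
        double[] r = new double[n];
        Arrays.fill(r, 1.0 / n);                 // start from the uniform distribution
        for (int it = 0; it < iters; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - d) / n);      // random-jump term
            for (int j = 0; j < n; j++)
                for (int target : links[j])      // page j spreads r_j over its c_j out-links
                    next[target] += d * r[j] / links[j].length;
            r = next;
        }
        return r;
    }
}
```

On a 3-page cycle (0 → 1 → 2 → 0) every page keeps rank 1/3, which is a quick sanity check that the iteration preserves a probability distribution.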
Map-Reduce Google PageRank example

21
Map-Reduce – Matrix Example

[Figures: matrix addition and matrix multiplication with Map/Reduce]
22
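The multiplication side of the slide can be sketched with one common map/reduce scheme (an assumption, not necessarily the slide’s exact formulation): the map phase emits a partial product A[i][k]·B[k][j] keyed by the output cell (i, j), and the reduce phase sums the partial products for each cell. A plain-Java in-memory toy:

```java
import java.util.*;

// In-memory sketch of map/reduce matrix multiplication C = A * B (not the
// Hadoop API). "Map" emits partial products ((i, j), A[i][k] * B[k][j]);
// the grouping map plays the shuffle; Double::sum is the reduce.
public class MatrixSketch {

    public static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length, m = b[0].length, kDim = b.length;
        Map<List<Integer>, Double> cells = new HashMap<>();
        for (int i = 0; i < n; i++)
            for (int k = 0; k < kDim; k++)
                for (int j = 0; j < m; j++)
                    cells.merge(List.of(i, j), a[i][k] * b[k][j], Double::sum);
        double[][] c = new double[n][m];
        cells.forEach((ij, v) -> c[ij.get(0)][ij.get(1)] = v);
        return c;
    }
}
```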
Contents

1. Cloud Computing
2. Map-Reduce
3. Hadoop
• What is Hadoop?
• HDFS
• Principle of Hadoop’s Map/Reduce Execution
• Implementation of Word Count
4. Big Data Mining

23
Hadoop – What is Hadoop?

Hadoop is an open-source platform based on Java. Hadoop has HDFS (Hadoop Distributed File System), which is similar to GFS. On top of HDFS, Hadoop provides Map/Reduce job execution on a cluster composed of thousands of PCs.

The Hadoop stack (top to bottom), for tera- and peta-scale data processing applications:
⚫ Hive (SQL-like query handling)
⚫ Distributed data processing engine: Map/Reduce-based programming model
⚫ Distributed DB system: HBase
⚫ Distributed file system: HDFS (Hadoop Distributed File System)
⚫ Linux OS: Red Hat, CentOS, Ubuntu, …
⚫ Commodity PC server clustering

[Apache Hadoop Project]
◆ Hadoop Core
  ⚫ Distributed file system
  ⚫ Map/Reduce framework
◆ Pig (initiated by Yahoo!)
  ⚫ Parallel programming language and runtime
◆ HBase (initiated by Powerset)
  ⚫ Table storage for semi-structured data
◆ Hive (initiated by Facebook)
  ⚫ SQL-like query language & meta-store

◆ Typical PC server:
  ⚫ CPU: Intel quad-core × 2 CPUs
  ⚫ Memory: 16 GB
  ⚫ HDD: 500 GB × 4
  ⚫ NIC: Gigabit Ethernet × 2 ports
24
Hadoop – Implementation of Word Count

public class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable ONE = new IntWritable(1);

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> out,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      out.collect(new Text(itr.nextToken()), ONE);
    }
  }
}

26
Hadoop – Implementation of Word Count

public class ReduceClass extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> out,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    out.collect(key, new IntWritable(sum));
  }
}

27
Hadoop – Implementation of Word Count

public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(ReduceClass.class);
  conf.setReducerClass(ReduceClass.class);

  FileInputFormat.setInputPaths(conf, args[0]);
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));

  conf.setOutputKeyClass(Text.class);          // output keys are words (strings)
  conf.setOutputValueClass(IntWritable.class); // output values are counts

  JobClient.runJob(conf);
}

28
