Introduction
Big Data Analytics
References & Grading
• References
• “Data Mining – Concepts and Techniques” by Jiawei Han, Micheline Kamber, and Jian Pei, third edition
• “Neural Networks and Deep Learning” (search for “neuralnetworksanddeeplearning” on the Web)
• Grading
• Mid-term exam: 46%
• Final exam: 50%
• Two homework assignments: 4%
This Class
• Basic Concepts of Big Data Processing
• Traditional way of data processing
• Hadoop
• Example: Word count, Page Rank, K-means clustering
• Spark
• Basic Concepts of Analytics
• Frequent Patterns
• Classification
• Neural Networks
• Real-world projects in the telco business
Let’s start with an example
Midnight bus routing in Seoul, Korea
Metric: (# of people present in a region) − (# of people living in the region)
[Maps: ① Existing Route vs. ② Improved Route]
Big Data Analytics Process
• Data Collection: each service feeds its own data into the pipeline (Service 1 → Service 1 Data, Service 2 → Service 2 Data, ..., Service 5 → Service 5 Data)
• Transforming raw data (summarization)
• Finding inter-related variables
• Applying various data analytics algorithms (trial & error)
• Consolidating results
• Verifying results with real data
Example raw data records:
{"S_id": "xx", "R_id": "yy", "S_time": "20201010150021", "E_time": "20201010150313", "S_cid": "aa", "R_cid": "bb"}
{"result": "", "jobid": "66072232606", "host": "", "logid": 2, "time": "", "c_msg_id": "758595208#!900222", "type": "xml-RCP-consume-REQSMS"}
Traditional Data Processing
Midnight bus routing in Seoul, Korea (revisited)
Cloud Computing - What is Cloud Computing?
Complicated data processing tasks that require massive computing power and data storage are executed on central servers (the cloud) rather than on personal PCs or local servers.
Cloud Computing - Google Motivation
◆ From its founding, Google built its service platforms on cloud computing with distributed file systems rather than on a commercial DBMS, which gave it big advantages in cost and scalability.
◆ But this had a disadvantage: parallel applications are very hard to develop.

Young Google's Concerns
⚫ It is a road to success if they can manipulate/analyze the whole data on the Web
⚫ The size of the Web data doubles every year (maybe faster now!)
⚫ They need entire-Web-scale data processing capabilities
⚫ Huge increase in HW/SW cost

DBMS Restrictions
⚫ Only scalable up to a certain point: OLTP, DW
⚫ Does not fit big data processing
⚫ Very expensive to acquire a good DB, and operating costs are also very high
⚫ If the DB is not powerful enough to process all the data, a new, more powerful server must be purchased

◆ Google entered the search engine business on cloud computing platforms, gaining a competitive edge by spending much less on HW/SW than Yahoo.
◆ The only problem: they needed to develop scalable, fault-tolerant, parallel software, which is very hard.
Cloud Computing - Google Cloud Computing Platform

Major Challenges in Parallel Computing
⚫ Transparency: it is very hard to build distributed programs; a consistent programming model is needed that makes it easy to write programs that process huge data
⚫ Fault Tolerance: lots of commodity devices cause lots of trouble
⚫ Scalability: more data needs more HW at low cost

The platform stack (top to bottom)
⚫ Google Search and other Google services (in 2009, 50 billion pages indexed)
⚫ Distributed Data Processing Engine with a Map/Reduce-based programming model, plus a Distributed DB System (BigTable)
⚫ Distributed File System: GFS (Google File System) (in 2008, 200+ GFS clusters with 1,000~5,000 PCs per cluster)
⚫ OS: Linux, optimized by Google
⚫ Commodity PC server clustering (in 2006, about 450,000 PCs)
Contents
1. Cloud Computing
2. Map-Reduce
• Motivation
• Key Idea of Map/Reduce
• Word Count example
• Sort example
• Inverted Index example
• Google PageRank example
• Matrix example
• Impact of Map/Reduce on Google
3. Hadoop
4. Big Data Mining
Map-Reduce and Hadoop - Key Idea of Map/Reduce
Data type: key-value records
Map function:
(K_in, V_in) ➔ list(K_inter, V_inter)
Reduce function:
(K_inter, list(V_inter)) ➔ list(K_out, V_out)
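As a minimal illustration of these two contracts (an assumed sketch, not Hadoop's actual API; the Pair helper and interface names are invented for this example):

import java.util.List;

// Hypothetical helper for this sketch: a minimal key-value pair.
class Pair<K, V> {
  final K key;
  final V value;
  Pair(K key, V value) { this.key = key; this.value = value; }
}

// Map: (K_in, V_in) ➔ list(K_inter, V_inter)
interface MapFn<KIn, VIn, KInter, VInter> {
  List<Pair<KInter, VInter>> map(KIn key, VIn value);
}

// Reduce: (K_inter, list(V_inter)) ➔ list(K_out, V_out)
interface ReduceFn<KInter, VInter, KOut, VOut> {
  List<Pair<KOut, VOut>> reduce(KInter key, List<VInter> values);
}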
◆ Hadoop Platform supports communications between nodes
⚫ Failure recovery
⚫ Scalability
⚫ Load balancing
Map-Reduce and Hadoop - Word Count example
Input ➔ Map ➔ Shuffle & Sort ➔ Reduce ➔ Output

Input splits:
• "the quick brown fox"
• "the fox ate the mouse"
• "how now brown cow"

Map output (one list per split):
• (the, 1), (quick, 1), (brown, 1), (fox, 1)
• (the, 1), (fox, 1), (ate, 1), (the, 1), (mouse, 1)
• (how, 1), (now, 1), (brown, 1), (cow, 1)

After Shuffle & Sort, the first reducer receives brown ➔ [1, 1], fox ➔ [1, 1], how ➔ [1], now ➔ [1], the ➔ [1, 1, 1]; the second receives ate ➔ [1], cow ➔ [1], mouse ➔ [1], quick ➔ [1].

Output: (brown, 2), (fox, 2), (how, 1), (now, 1), (the, 3), (ate, 1), (cow, 1), (mouse, 1), (quick, 1)
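To make the dataflow above concrete, the following small, self-contained Java program (an illustrative sketch, not Hadoop code; the actual Hadoop implementation appears later in this section) runs the same map, shuffle & sort, and reduce phases in memory:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountFlow {
  public static void main(String[] args) {
    List<String> splits = List.of(
        "the quick brown fox", "the fox ate the mouse", "how now brown cow");

    // Map phase: emit (word, 1) for every word in every split.
    List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
    for (String split : splits)
      for (String word : split.split(" "))
        mapped.add(Map.entry(word, 1));

    // Shuffle & sort phase: group the 1s by word (TreeMap keeps keys sorted).
    Map<String, List<Integer>> grouped = new TreeMap<>();
    for (Map.Entry<String, Integer> kv : mapped)
      grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());

    // Reduce phase: sum the grouped values for each word.
    for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
      int sum = e.getValue().stream().mapToInt(Integer::intValue).sum();
      System.out.println(e.getKey() + ", " + sum);   // e.g. "brown, 2"
    }
  }
}

Running it prints each (word, count) pair in sorted key order, matching the Output column above.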
Map-Reduce - Inverted Index example
◆ Input: (filename, text) records
◆ Output: for each word, the list of files containing it
◆ Map:
foreach word in text.split():
output(word, filename)
◆ Combine: uniquify filenames for each word
◆ Reduce:
def reduce(word, filenames):
output(word, sort(filenames))

Example: one input file ([Link]) contains "to be or not to be" and another ([Link]) contains "be not afraid of greatness". Map emits pairs such as (to, [Link]), (be, [Link]), (or, [Link]), (not, [Link]), (afraid, [Link]), (of, [Link]), (greatness, [Link]); Reduce then produces afraid ➔ ([Link]), be ➔ ([Link], [Link]), greatness ➔ ([Link]), not ➔ ([Link], [Link]), of ➔ ([Link]), or ➔ ([Link]), to ➔ ([Link]).
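A Hadoop version of this example, in the same old-style API as the word count code later in this section, might look as follows. This is a sketch under the assumption that the input key is the filename and the value is the file's text (which would require a suitable input format); class names are invented:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import java.util.TreeSet;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map: for each word in the text, emit (word, filename).
class IndexMap extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {
  public void map(Text filename, Text text,
                  OutputCollector<Text, Text> out,
                  Reporter reporter) throws IOException {
    StringTokenizer itr = new StringTokenizer(text.toString());
    while (itr.hasMoreTokens()) {
      out.collect(new Text(itr.nextToken()), filename);
    }
  }
}

// Reduce: collect the filenames for each word, uniquified and sorted.
class IndexReduce extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text word, Iterator<Text> filenames,
                     OutputCollector<Text, Text> out,
                     Reporter reporter) throws IOException {
    TreeSet<String> files = new TreeSet<String>();   // sorted, no duplicates
    while (filenames.hasNext()) {
      files.add(filenames.next().toString());
    }
    out.collect(word, new Text(files.toString()));
  }
}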
Map-Reduce - Google PageRank example
◆ Definition: Let D be the set of all Web pages, let I(p) be the set of pages that link to page p, and let c_j be the total number of links going out of page p_j. The PageRank of page p_i, denoted r_i, is then given by
r_i = Σ_{p_j ∈ I(p_i)} r_j / c_j
◆ PageRank corresponds to the probability distribution of a random walk on the Web graph: the Random Surfer Model
◆ Sometimes a random surfer gets bored and jumps to a different page; with a damping factor d this yields
r_i = (1 − d) / |D| + d · Σ_{p_j ∈ I(p_i)} r_j / c_j
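One PageRank iteration maps naturally onto Map/Reduce: the map phase spreads each page's current rank evenly over its out-links (the r_j / c_j terms), and the reduce phase sums the contributions arriving at each page and applies the damping factor. The sketch below is illustrative, not Google's implementation; the input line format ("pageId<TAB>rank<TAB>comma-separated out-links") and class names are assumptions:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// One iteration. Assumed input line: "pageId<TAB>rank<TAB>out1,out2,..."
class PageRankMap extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out,
                  Reporter reporter) throws IOException {
    String[] f = value.toString().split("\t");
    String page = f[0];
    double rank = Double.parseDouble(f[1]);
    String[] links = f.length > 2 ? f[2].split(",") : new String[0];
    // Pass the link structure through so the reducer can re-emit it.
    out.collect(new Text(page), new Text("LINKS\t" + (f.length > 2 ? f[2] : "")));
    // Distribute this page's rank over its out-links: each gets r_j / c_j.
    for (String link : links)
      out.collect(new Text(link), new Text(Double.toString(rank / links.length)));
  }
}

class PageRankReduce extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  private static final double D = 0.85;   // damping factor (random surfer)
  public void reduce(Text page, Iterator<Text> values,
                     OutputCollector<Text, Text> out,
                     Reporter reporter) throws IOException {
    double sum = 0.0;
    String links = "";
    while (values.hasNext()) {
      String v = values.next().toString();
      if (v.startsWith("LINKS\t")) links = v.substring(6);
      else sum += Double.parseDouble(v);   // incoming r_j / c_j contributions
    }
    // Un-normalized variant of the damped formula above.
    double rank = (1 - D) + D * sum;
    out.collect(page, new Text(rank + "\t" + links));
  }
}

Running this job repeatedly, feeding each iteration's output back in as input, makes the ranks converge to the PageRank values.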
Map-Reduce - Matrix example
[Figure: matrix Addition and Multiplication with Map/Reduce]
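Addition is the easy case: map each element a_{ij} and b_{ij} to the key (i, j) and let the reducer sum the (at most two) values per key. Multiplication C = A × B is more interesting: the map phase replicates each a_{ik} across all columns j of output row i, and each b_{kj} across all rows i of output column j, so the reduce phase can compute C(i, j) = Σ_k a_{ik} · b_{kj}. The sketch below is illustrative only; the input line formats ("A,i,k,value" and "B,k,j,value") and the fixed dimensions M and P are assumptions:

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

class MatMulMap extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private static final int M = 100, P = 100;   // assumed: A is MxN, B is NxP
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out,
                  Reporter reporter) throws IOException {
    String[] t = value.toString().split(",");
    if (t[0].equals("A")) {          // "A,i,k,value": a_{ik} feeds every C(i, j)
      for (int j = 0; j < P; j++)
        out.collect(new Text(t[1] + "," + j), new Text("A," + t[2] + "," + t[3]));
    } else {                          // "B,k,j,value": b_{kj} feeds every C(i, j)
      for (int i = 0; i < M; i++)
        out.collect(new Text(i + "," + t[2]), new Text("B," + t[1] + "," + t[3]));
    }
  }
}

class MatMulReduce extends MapReduceBase
    implements Reducer<Text, Text, Text, DoubleWritable> {
  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, DoubleWritable> out,
                     Reporter reporter) throws IOException {
    Map<Integer, Double> a = new HashMap<Integer, Double>();
    Map<Integer, Double> b = new HashMap<Integer, Double>();
    while (values.hasNext()) {
      String[] t = values.next().toString().split(",");
      (t[0].equals("A") ? a : b).put(Integer.parseInt(t[1]), Double.parseDouble(t[2]));
    }
    double sum = 0.0;                 // C(i, j) = sum over k of a_{ik} * b_{kj}
    for (Map.Entry<Integer, Double> e : a.entrySet()) {
      Double bv = b.get(e.getKey());
      if (bv != null) sum += e.getValue() * bv;
    }
    out.collect(key, new DoubleWritable(sum));
  }
}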
Contents
1. Cloud Computing
2. Map-Reduce
3. Hadoop
• What is Hadoop?
• HDFS
• Principle of Hadoop’s Map/Reduce Execution
• Implementation of Word Count
4. Big Data Mining
Hadoop - What is Hadoop?
Hadoop is an open-source platform based on Java. Hadoop has HDFS (Hadoop Distributed File System), which is similar to GFS. On top of HDFS, Hadoop provides Map/Reduce job execution on a cluster composed of thousands of PCs.

The Hadoop stack (top to bottom)
⚫ Tera-/peta-scale data processing applications
⚫ Hive (SQL-like query handling)
⚫ Distributed Data Processing Engine with a Map/Reduce-based programming model, plus a Distributed DB System (HBase)
⚫ Distributed File System: HDFS (Hadoop Distributed File System)
⚫ Linux OS: Red Hat, CentOS, Ubuntu, ...
⚫ Commodity PC server clustering

[Apache Hadoop Project]
◆ Hadoop Core
⚫ Distributed File System
⚫ Map/Reduce Framework
◆ Pig (initiated by Yahoo!)
⚫ Parallel programming language and runtime
◆ HBase (initiated by Powerset)
⚫ Table storage for semi-structured data
◆ Hive (initiated by Facebook)
⚫ SQL-like query language & meta-store

◆ Typical PC server
⚫ CPU: Intel quad-core × 2
⚫ Memory: 16 GB
⚫ HDD: 500 GB × 4
⚫ NIC: Gigabit Ethernet × 2 ports
Hadoop - Implementation of Word Count
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> out,
                  Reporter reporter) throws IOException {
    // Tokenize the input line and emit (word, 1) for every token.
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      out.collect(new Text(itr.nextToken()), ONE);
    }
  }
}
Hadoop - Implementation of Word Count (continued)
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class ReduceClass extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> out,
                     Reporter reporter) throws IOException {
    // Sum the counts for this word and emit (word, total).
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    out.collect(key, new IntWritable(sum));
  }
}
Hadoop - Implementation of Word Count (continued)
public static void main(String[] args) throws Exception {
  // Assumes this method lives in a driver class named WordCount.
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(ReduceClass.class);    // combine partial counts on each mapper
  conf.setReducerClass(ReduceClass.class);
  FileInputFormat.setInputPaths(conf, args[0]);
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  conf.setOutputKeyClass(Text.class);          // out keys are words (strings)
  conf.setOutputValueClass(IntWritable.class); // values are counts
  JobClient.runJob(conf);
}
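Assuming the three pieces above are compiled into a driver class named WordCount and packaged as wordcount.jar (both names hypothetical), the job would be launched with Hadoop's standard jar runner:

hadoop jar wordcount.jar WordCount <input-dir> <output-dir>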