BIG DATA – DATA ANALYSIS
Lê Hồng Hải
UET-VNUH
Big Data Overview
1 Introduction
2 Big Data storage
3 Big Data processing
4 Streaming
Big Data
Big data is data that contains greater variety, arrives in increasing volumes, and moves with higher velocity. These three characteristics are known as the 3 Vs: volume, variety, and velocity.
Big data architecture components
• Data sources – relational databases, files (e.g., web server log files) produced by applications, and real-time data produced by IoT devices.
• Big data storage – stores high volumes of data of different types before filtering, aggregating, and preparing the data for analysis.
• Real-time message ingestion store – captures and stores real-time messages for stream processing.
• Analytical data store – relational databases for preparing and structuring big data for further analytical querying.
• Big data analytics and reporting, which may include OLAP cubes, ML tools, BI tools, etc. – provides big data insights to end users.
Big data architecture
Big Data Storage
1. Distributed file systems
2. Sharding across multiple databases
3. Key-value storage systems
4. Parallel and distributed databases
Distributed File Systems
A distributed file system stores data across
a large collection of machines, but provides
a single file-system view
Provides redundant storage of massive
amounts of data on cheap and unreliable
computers
◼ Google File System (GFS)
◼ Hadoop Distributed File System (HDFS)
Hadoop File System Architecture
▪ Single Namespace for entire
cluster
▪ Files are broken up into
blocks
• Typically 64 MB block size
• Each block replicated on
multiple DataNodes
▪ Client
• Finds the location of
blocks from NameNode
• Accesses data directly
from DataNode
Hadoop Distributed File System (HDFS)
Data Coherency
◼ Write-once-read-many access model
◼ Client can only append to existing files
Distributed file systems are a good fit for millions of large files
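To make the client's role concrete, here is a minimal sketch that reads a file through Hadoop's Java FileSystem API; the NameNode address and file path are hypothetical placeholders. Under the hood, the client obtains block locations from the NameNode and streams the data directly from DataNodes:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the (hypothetical) NameNode that manages the namespace
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        // Open a file; HDFS resolves which DataNodes hold each block
        try (FSDataInputStream in = fs.open(new Path("/logs/access.log"))) {
            byte[] buffer = new byte[4096];
            int bytesRead = in.read(buffer); // data is streamed from DataNodes
            System.out.println("Read " + bytesRead + " bytes");
        }
    }
}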
Big Data Storage
1. Distributed file systems
2. Sharding across multiple databases
3. Key-value storage systems
4. Parallel and distributed databases
Sharding
Sharding: partition data across multiple
databases
Partitioning is usually done on one or more partitioning attributes (also known as partitioning keys or shard keys), e.g., user ID
◼ E.g., records with key values from 1 to 100,000 on database 1, and records with key values from 100,001 to 200,000 on database 2, etc.
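A minimal sketch of range-based shard routing, assuming the key ranges above; the JDBC URLs are hypothetical placeholders:

import java.util.List;

public class ShardRouter {
    // Each shard covers a contiguous range of key values
    record Shard(long lowKey, long highKey, String jdbcUrl) {}

    private final List<Shard> shards = List.of(
        new Shard(1, 100_000, "jdbc:postgresql://db1/app"),      // hypothetical
        new Shard(100_001, 200_000, "jdbc:postgresql://db2/app") // hypothetical
    );

    // Route a record to the database holding its key range
    public String shardFor(long userId) {
        for (Shard s : shards) {
            if (userId >= s.lowKey() && userId <= s.highKey()) return s.jdbcUrl();
        }
        throw new IllegalArgumentException("No shard for key " + userId);
    }

    public static void main(String[] args) {
        System.out.println(new ShardRouter().shardFor(42)); // routes to db1
    }
}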
Key Value Storage Systems
Key-value storage systems store large
numbers (billions or even more) of small
(KB-MB) sized records
Records are partitioned across multiple machines, and queries are routed by the system to the appropriate machine
Records are also replicated across multiple machines, to ensure availability even if a machine fails
◼ Key-value stores ensure that updates are applied to all replicas, so that their values stay consistent
Key Value Storage Systems
Key-value stores may store
◼ uninterpreted bytes, with an associated key
E.g., Amazon S3, Amazon Dynamo
◼ wide tables (can have arbitrarily many attribute names) with an associated key
▪ Google BigTable, Apache Cassandra, Apache HBase, Amazon DynamoDB
◼ JSON
▪ MongoDB, CouchDB (document model)
Document stores store semi-structured data, typically JSON
Some key-value stores support multiple versions of data, with timestamps/version numbers
Data Representation
An example of a JSON object is:

{
    "ID": "22222",
    "name": {
        "firstname": "Albert",
        "lastname": "Einstein"
    },
    "deptname": "Physics",
    "children": [
        { "firstname": "Hans", "lastname": "Einstein" },
        { "firstname": "Eduard", "lastname": "Einstein" }
    ]
}
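As an illustration, such an object can be parsed and navigated with any JSON library; a minimal sketch using Jackson (the library choice is an assumption):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonExample {
    public static void main(String[] args) throws Exception {
        String json = """
            { "ID": "22222",
              "name": { "firstname": "Albert", "lastname": "Einstein" },
              "deptname": "Physics" }""";
        ObjectMapper mapper = new ObjectMapper();
        JsonNode root = mapper.readTree(json); // parse into a navigable tree
        String first = root.get("name").get("firstname").asText();
        System.out.println(first + " works in " + root.get("deptname").asText());
    }
}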
Key Value Storage Systems
Key-value stores support
◼ put(key, value): store a value with an associated key
◼ get(key): retrieve the stored value associated with the specified key
◼ delete(key): remove the key and its associated value
Some systems also support range queries on key values
Document stores also support queries on non-key attributes
◼ See book for MongoDB queries
◼ Also called NoSQL systems
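A minimal single-node sketch of this put/get/delete interface (a toy that ignores partitioning and replication):

import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class ToyKeyValueStore<K, V> {
    private final ConcurrentHashMap<K, V> data = new ConcurrentHashMap<>();

    public void put(K key, V value) { data.put(key, value); }  // store value under key
    public Optional<V> get(K key) { return Optional.ofNullable(data.get(key)); }
    public void delete(K key) { data.remove(key); }            // remove key and value

    public static void main(String[] args) {
        ToyKeyValueStore<String, String> store = new ToyKeyValueStore<>();
        store.put("user:42", "{\"name\": \"Alice\"}");
        System.out.println(store.get("user:42").orElse("missing"));
        store.delete("user:42");
    }
}

A real system would additionally hash the key to pick a machine and apply each put to every replica.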
Replication and Consistency
Availability (system can run even if parts have
failed) is essential for parallel/distributed
databases
◼ Via replication, so even if a node has failed, another copy
is available
Consistency is important for replicated data
◼ All live replicas have same value, and each read sees
latest version
Network partitions: the network can break into two or more parts, each with active systems that cannot talk to the other parts
In the presence of partitions, one cannot guarantee both availability and consistency
◼ Brewer's CAP "Theorem"
Big data architecture
Big Data Processing
Map-Reduce
Spark
Streaming
The MapReduce Paradigm
Platform for reliable, scalable parallel computing
Abstracts issues of distributed and parallel
environment from programmer
◼ Programmer provides core logic (via map() and
reduce() functions)
◼ System takes care of parallelization of
computation, coordination, etc
MapReduce - Dataflow
The MapReduce Paradigm
Paradigm dates back many decades
◼ But very large scale implementations
running on clusters with 10^3 to 10^4
machines are more recent
◼ Google MapReduce, Hadoop, ...
Data storage/access typically done using
distributed file systems or key-value stores
MapReduce Programming Model
Input: a set of key/value pairs
User supplies two functions:
◼ map(k,v) → list(k1,v1)
◼ reduce(k1, list(v1)) → v2
(k1,v1) is an intermediate key/value pair
Output is the set of (k1,v2) pairs
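Word count is the classic instance of this model: map emits (word, 1) pairs, and reduce sums the counts for each word. A minimal sketch using Hadoop's Java MapReduce API:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // map(k, v) -> list(k1, v1): emit (word, 1) for every word in the line
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // reduce(k1, list(v1)) -> v2: sum the 1s for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregate to cut shuffle traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}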
Flow of Keys and Values
Flow of keys and values in a MapReduce task
[Diagram: map inputs as (key, value) pairs (mk1, mv1), ..., (mkn, mvn); map outputs as (key, value) pairs such as (rk1, rv1), (rk7, rv2), ...; reduce inputs group the map-output values by key, e.g., rk1 → (rv1, rv7, ...), rk2 → (rv8, rvi, ...)]
https://www.geeksforgeeks.org/how-to-execute-wordcount-program-in-mapreduce-using-cloudera-distribution-hadoop-cdh/
Example
[Diagram: word count over the input "I am a tiger, you are also a tiger". Three map tasks emit (word, 1) pairs; Hadoop sorts the intermediate data; two reduce tasks sum the counts, producing part0 (a,2; also,1; am,1; are,1) and part1 (I,1; tiger,2; you,1). The JobTracker generates three TaskTrackers for the map tasks and two for the reduce tasks]
Parallel Processing of MapReduce Job
[Diagram: the user program is copied to a master and to workers; the master assigns map and reduce tasks. Map workers read partitions (Part 1 ... Part n) of the input file and write intermediate files to local disk; reduce workers remotely read and sort the intermediate partitions and write the output files (File 1 ... File m)]
Map Reduce vs. Databases
Map Reduce widely used for parallel
processing
◼ Google, Yahoo, and hundreds of other companies
◼ Example uses: compute PageRank, build
keyword indices, do data analysis of web click
logs, ….
Many real-world uses of MapReduce
cannot be expressed in SQL
But many computations are much easier to
express in SQL
Map Reduce vs. Databases (Cont.)
Relational operations (select, project, join,
aggregation, etc.) can be expressed using
Map Reduce
SQL queries can be translated into Map
Reduce infrastructure for execution
◼ Apache Hive (Hive SQL), Apache Pig (Pig Latin), Microsoft SCOPE
Where is MapReduce Inefficient?
Long pipelines sharing data
Interactive applications
Streaming applications
(MapReduce would need to write and read
from disk a lot)
Spark
The key idea of Spark is the Resilient Distributed Dataset (RDD)
It supports in-memory computation
RDD Spark
Resilient Distributed Dataset (RDD)
abstraction
◼ Collection of records that can be stored
across multiple machines
A read-only, partitioned collection of records (like a DFS), together with a record of how the dataset was created as a combination of transformations from other datasets
Word Count in Spark
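A minimal word-count sketch using Spark's Java RDD API; the input and output paths are placeholders:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs://.../input.txt"); // placeholder path
            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // words
                .mapToPair(word -> new Tuple2<>(word, 1)) // (word, 1) pairs
                .reduceByKey(Integer::sum);               // sum counts per word
            counts.saveAsTextFile("hdfs://.../output");   // placeholder path
        }
    }
}

Each RDD in the chain records its lineage, so a lost partition can be recomputed from its parent datasets.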
Spark DataFrames and Datasets
RDDs in Spark can be typed in programs, but not dynamically
The Dataset type allows types to be specified dynamically
Row is a generic row type, with attribute names
◼ In the code below, the attribute names/types of instructor and department are inferred from the files read
Spark DataFrames and Datasets
Operations such as filter, join, groupBy, and agg are defined on Datasets, and can execute in parallel

Dataset<Row> instructor = spark.read().parquet("...");
Dataset<Row> department = spark.read().parquet("...");

// count() is the aggregate function from org.apache.spark.sql.functions
instructor.filter(instructor.col("salary").gt(100000))
    .join(department, instructor.col("dept name")
        .equalTo(department.col("dept name")))
    .groupBy(department.col("building"))
    .agg(count(instructor.col("ID")));
Streaming Data
Streaming Data and Applications
Streaming data refers to data that
arrives in a continuous fashion
Applications include:
◼ Stock market: stream of trades
◼ Sensors: sensor readings
Internet of things
◼ Network monitoring data
◼ Social media: tweets and posts can be viewed
as a stream
Queries on streams can be very useful
◼ Monitoring, alerts, automated triggering of
actions
Publish Subscribe Systems
Publish-subscribe (pub-sub) systems
provide a convenient abstraction for
processing streams
◼ Tuples in a stream are published to a topic
◼ Consumers subscribe to topics
Apache Kafka
Apache Kafka is a popular parallel pub-sub
system widely used to manage streaming data
Parallel pub-sub systems allow tuples in a
topic to be partitioned across multiple
machines
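A minimal sketch of publishing a tuple to a Kafka topic with the Java producer client; the broker address and topic name are hypothetical:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SensorPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("sensor-17") determines which partition the tuple goes to
            producer.send(new ProducerRecord<>("sensor-readings", "sensor-17", "23.5"));
        }
    }
}

A consumer would subscribe to sensor-readings and receive the stream, partition by partition.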
Big data architecture
Data Analytics
1. Overview
2. Data Warehousing (DW)
3. Online Analytical Processing (OLAP)
4. Data Mining
Overview
Data analytics: the processing of data to
infer patterns, correlations, or models for
prediction
Primarily used to make business decisions
◼ E.g., what product to suggest for purchase
◼ E.g., what products to manufacture/stock, in
what quantity
Critical for businesses today
Common steps in data analytics
Gather data from multiple sources into one
location
Data warehouses also integrate data into a
common schema
Data often needs to be extracted from source formats, transformed into the common schema, and loaded into the data warehouse (ETL)
Data Analytics
Generate aggregates and reports
summarizing data
◼ Dashboards showing graphical charts/reports
◼ Online analytical processing (OLAP)
systems allow interactive querying
◼ Statistical analysis using tools such as
R/SAS/SPSS
Build predictive models and use the
models for decision making
Overview (Cont.)
Predictive models are widely used today
◼ E.g., use customer profile features and the
history of a customer to predict the likelihood
of default on a loan
◼ E.g., use history of sales to predict future sales
Other examples of business decisions:
◼ What items to stock?
◼ What insurance premium to charge?
◼ To whom to send advertisements?
Overview (Cont.)
Machine learning techniques are key to
finding patterns in data and making
predictions
Data mining extends techniques
developed by machine-learning
communities to run them on very large
datasets
The term business intelligence (BI) is a synonym for data analytics
Data Warehousing
A data warehouse is a repository (archive)
of information gathered from multiple
sources, stored under a unified schema, at a
single site
Warehouse Design issues
Data transformation and data
cleansing
◼E.g., correct mistakes in addresses
(misspellings, zip code errors)
How to propagate updates
What data to summarize
Multidimensional Data
Data in warehouses can usually be divided
into
◼ Fact tables, which are large
E.g., sales(item_id, store_id,
customer_id, date, number, price)
◼ Dimension tables, which are relatively
small
Store extra information about stores,
items, etc.
Fact Tables
Attributes of fact tables can usually be viewed as
◼ Measure attributes
measure some value, and can be
aggregated upon
e.g., the attributes number or price of
the sales relation
◼ Dimension attributes
dimensions on which measure attributes
are viewed
Data Warehouse Star Schema
More on Data Warehouse Star Schema
Multidimensional Data and Warehouse Schemas
More complicated schema structures
◼ Snowflake schema: multiple levels of
dimension tables
Data lakes
Some applications do not find it worthwhile
to bring data to a common schema
◼ Data lakes are repositories which allow data to
be stored in multiple formats, without schema
integration
◼ Less upfront effort, but more effort during
querying
Database Support for Data Warehouses
Data in warehouses is usually append-only, not updated, so concurrency control overheads can be avoided
Data warehouses often use column-
oriented storage
Column-oriented storage
Arrays are compressed, reducing storage,
IO and memory costs significantly
Queries can fetch only attributes that they
care about, reducing IO and memory cost
Data warehouses often use parallel storage
and query processing infrastructure
Data Analysis and OLAP
Online Analytical Processing (OLAP)
Interactive analysis of data, allowing data
to be summarized and viewed in different
ways in an online fashion (with negligible
delay)
Cross Tabulation
The table below is an example of a cross-
tabulation (cross-tab), also referred to as a
pivot-table
Data Cube
A data cube is a multidimensional
generalization of a cross-tab
Can have n dimensions; we show 3 below
Cross-tabs can be used as views on a data
cube
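As an illustration, Spark's Dataset API can compute cube-style aggregates directly via cube(), which aggregates over every combination of the listed dimensions, including subtotals and the grand total. The sales dataset below follows the fact-table example earlier; the input path is a placeholder:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SalesCube {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("cube").getOrCreate();
        Dataset<Row> sales = spark.read().parquet("..."); // placeholder path
        // Group by every combination of (item_id, store_id), including
        // per-item totals, per-store totals, and the grand total
        Dataset<Row> cube = sales
            .cube(col("item_id"), col("store_id"))
            .agg(sum(col("number")), sum(col("price")));
        cube.show();
    }
}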
Online Analytical Processing Operations
Pivoting: changing the dimensions used in a
cross-tab
Slicing: creating a cross-tab for fixed values
only
Rollup: moving from finer-granularity data to
a coarser granularity
Drill down: The opposite operation - that of
moving from coarser-granularity data to finer-
granularity data
Hierarchies on Dimensions
Hierarchy on dimension attributes: lets
dimensions be viewed at different levels of
detail
Cross Tabulation With Hierarchy
Cross-tabs can be easily extended to deal
with hierarchies
Can drill down or roll up on a hierarchy
E.g. hierarchy: item_name → category
Reporting and Visualization
Reporting tools help create formatted
reports with tabular/graphical
representation of data
Data visualization tools help create
interactive visualization of data
◼ E.g., PowerBI, Tableau, FusionChart, plotly,
Datawrapper, Google Charts, etc.
Data Mining
Data mining is the process of semi-
automatically analyzing large databases to
find useful patterns
Some types of knowledge can be represented
as rules
More generally, knowledge is discovered by
applying machine learning techniques to
past instances of data to form a model
Types of Data Mining Tasks
Prediction based on past history
◼ Predict if a credit card applicant poses a good
credit risk, based on some attributes (income,
job type, age, ..) and past history
Some examples of prediction mechanisms:
◼ Classification
Items (with associated attributes) belong to one of
several classes
Training instances have attribute values and classes
provided
◼ Regression formulae
Given a set of mappings for an unknown function,
predict the function result for a new parameter value
THANK YOU