BDS Session 4
Janardhanan PS
[email protected]
Topics for today
Big Data Analytics Lifecycle
• Explore a specific data analytics lifecycle that organizes and manages the tasks and
activities associated with the analysis of Big Data
Example
1. A Market Research (MR) firm creates and runs surveys to understand specific
market segments for their client, e.g. consumer electronics - LED TVs
2. These surveys contain questions with structured answers (numeric, boolean,
categorical, grade) as well as unstructured answers (free-form text)
3. A survey is rolled out to many users with various demographic attributes. The
list of survey users could be provided by the client and/or MR firm
4. The results are collected and analyzed for business insights often using multiple
tools and analysis techniques
5. The insights are curated and shared to create a presentation for the client of
the MR firm
6. The client makes critical business decisions about their product based on the
survey results
Big Data Analytics Lifecycle
Stages
1. Business Case Evaluation
• Based on business requirements, determine whether the business problem being addressed is
  really a Big Data problem
  ✓ A business problem needs to be directly related to one or more of the Big Data
    characteristics of Volume, Velocity, or Variety
    (MR example: high volume of responses, unstructured free-form answers)
• Must begin with a well-defined business case that presents a clear understanding of the
  ✓ justification
  ✓ motivation
  ✓ goals of carrying out the analysis
  (MR example: find the market fit for a new product; what are the business questions?)
• A business case should be created, assessed and approved prior to proceeding with the
  actual hands-on analysis tasks
• Helps decision-makers to
  ✓ Understand the business resources that will need to be utilized
  ✓ Identify which business challenges the analysis will tackle
  ✓ Identify KPIs that can help determine assessment criteria and provide guidance for the
    evaluation of the analytic results
    (MR example: define thresholds on survey stats)
2. Data Identification
• Main objective is to identify the datasets required for the analysis project and their sources
✓ Wider variety of data sources may increase the probability of finding hidden patterns and
correlations.
✓ Caution: too much data variety can also confuse the analysis and lead to overfitting.
✓ The required datasets and their sources can be internal and/or external to the enterprise.
3. Data Acquisition & Filtering
• The data is gathered from all of the data sources that were identified during the previous stage
• The acquired data is then subjected to
  ✓ filtering / removal of corrupt data
  ✓ removal of data that is unusable for the analysis
  (MR example: clean bad data such as empty responses and junk text inputs; filter down to a
  subset if we don't need to look at all attributes or all demographics)
• In many cases involving unstructured external data, some or most of the acquired data may be
  irrelevant (noise) and can be discarded as part of the filtering process
• "Corrupt" data can include records with missing or nonsensical values or invalid data types
  ✓ Advisable to store a verbatim copy of the original dataset before proceeding with the filtering
• Data needs to be persisted once it is generated or enters the enterprise boundary
  ✓ For batch analytics, this data is persisted to disk prior to analysis
  ✓ For real-time analytics, the data is analyzed first and then persisted to disk
  (a minimal filtering sketch follows below)
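A minimal sketch of the acquisition-and-filtering step for the MR survey example, using pandas;
the file names, column names and filtering rules are illustrative assumptions, not part of the
lifecycle definition:

import pandas as pd

# Load the acquired survey responses (hypothetical file and columns)
raw = pd.read_csv("survey_responses_raw.csv")

# Keep a verbatim copy of the original dataset before any filtering
raw.to_csv("survey_responses_verbatim_copy.csv", index=False)

df = raw.copy()

# Remove corrupt/unusable records: empty free-text answers, junk inputs, nonsensical ages
df = df.dropna(subset=["free_text_answer"])
df = df[df["free_text_answer"].str.strip().str.len() > 2]   # drop junk like "xx"
df = df[(df["age"] >= 18) & (df["age"] <= 99)]               # drop invalid ages

# Filter down to the attributes needed for this analysis (not all demographics)
df = df[["respondent_id", "age", "gender", "owns_led_tv",
         "picture_quality_rating", "free_text_answer"]]

df.to_csv("survey_responses_filtered.csv", index=False)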
4. Data Extraction
5. Data Validation & Cleansing
• Invalid data can skew and falsify analysis results
  (MR example: validate survey responses and flag contradictory answers; identify population
  skews, e.g. responses with an inherent gender bias make a gender-based analysis pointless;
  codify certain columns for easier analysis)
• Data input into Big Data analyses can be unstructured, without any indication of validity
  ✓ Complexity can further make it difficult to arrive at a set of suitable validation constraints
  ✓ A dedicated stage is required to establish complex validation rules and remove any known
    invalid data
• Big Data solutions often receive redundant data across different datasets
  ✓ This redundancy can be exploited to explore interconnected datasets in order to
    ▪ assemble validation parameters
    ▪ fill in missing valid data
• For batch analytics, data validation and cleansing can be achieved via an offline ETL operation
• For real-time analytics, a more complex in-memory system is required to validate and cleanse
  the data as it arrives from the source (a minimal validation sketch follows below)
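A minimal sketch of validation and cleansing for the same survey data, again with pandas; the
contradictory-answer rule, the column names and the gender codes are illustrative assumptions:

import pandas as pd

df = pd.read_csv("survey_responses_filtered.csv")

# Validation rule: a respondent who says they do not own a TV cannot rate picture quality
contradictory = (df["owns_led_tv"] == "no") & (df["picture_quality_rating"].notna())
df = df[~contradictory]

# Codify a categorical column for easier downstream analysis
gender_codes = {"male": 0, "female": 1, "other": 2}
df["gender_code"] = df["gender"].str.lower().map(gender_codes)

# Check for population skew before committing to a gender-based analysis
print(df["gender"].value_counts(normalize=True))

df.to_csv("survey_responses_clean.csv", index=False)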
6. Data Aggregation & Representation
• Dedicated to integrating multiple datasets together to arrive at a unified view
  ✓ Need to merge data spread across multiple datasets through a common field
  ✓ Needs reconciliation of data coming from different sources
  ✓ Needs to identify the dataset representing the correct values
  (MR example: a final joined dataset, e.g. the current survey joined with an old survey or with
  third-party demographics data, with certain aggregations done for downstream analysis)
• Can be complicated because of:
  ✓ Data Structure – Although the data format may be the same, the data model may be different
  ✓ Semantics – A value that is labeled differently in two different datasets may mean the same
    thing, for example "surname" and "last name"
• The large volumes make data aggregation a time- and effort-intensive operation
  ✓ Reconciling these differences can require complex logic that is executed automatically,
    without the need for human intervention
• Future data analysis requirements need to be considered during this stage to help foster data
  reusability (a minimal merge sketch follows below)
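A minimal sketch of aggregation and representation: joining the cleansed survey with
hypothetical third-party demographics through a common field and pre-aggregating for
downstream analysis (all file, column and key names are assumptions):

import pandas as pd

survey = pd.read_csv("survey_responses_clean.csv")
demographics = pd.read_csv("third_party_demographics.csv")

# Semantic reconciliation: the two sources label the same attribute differently
demographics = demographics.rename(columns={"surname": "last_name"})

# Merge the datasets through a common field
unified = survey.merge(demographics, on="respondent_id", how="left")

# Example aggregation prepared for downstream analysis
avg_rating_by_segment = (
    unified.groupby(["gender", "age_group"])["picture_quality_rating"]
    .mean()
    .reset_index()
)
avg_rating_by_segment.to_csv("avg_rating_by_segment.csv", index=False)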
7. Data Analysis
• Dedicated to carrying out the actual analysis task, which typically involves
one or more types of analytics
✓ Can be iterative in nature, especially if the data analysis is exploratory
✓ Analysis is repeated until the appropriate pattern or correlation is uncovered
8. Data Visualization
(MR example: visual results will need to be shared with the stakeholders for the new product
launch, e.g. show the top features that appeal to each segment of target product users, by
gender and age group)
• Dedicated to using data visualization techniques and tools to graphically communicate the analysis
results for effective interpretation by business users
✓ The ability to analyze massive amounts of data and find useful insights carries little value if the
only ones that can interpret the results are the analysts.
✓ Business users need to be able to understand the results in order to obtain value from the
analysis and subsequently have the ability to provide feedback
• Provide users with the ability to perform visual analysis, allowing for the discovery of answers to
questions that users have not yet even formulated
✓ A method of drilling down to comparatively simple statistics is crucial, in order for users to
understand how the rolled up or aggregated results were generated
• Important to use the most suitable visualization technique, keeping the business domain in context
  ✓ Interpretation of the results can vary based on the visualization chosen (a minimal charting
    sketch follows below)
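A minimal charting sketch with pandas and matplotlib, plotting the segment-level aggregate
produced earlier; the file and column names are illustrative assumptions:

import pandas as pd
import matplotlib.pyplot as plt

agg = pd.read_csv("avg_rating_by_segment.csv")

# One bar per age group, grouped by gender, so business users can compare segments at a glance
pivot = agg.pivot(index="age_group", columns="gender", values="picture_quality_rating")
pivot.plot(kind="bar")

plt.title("Average picture-quality rating by segment")
plt.xlabel("Age group")
plt.ylabel("Average rating")
plt.tight_layout()
plt.savefig("rating_by_segment.png")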
9. Utilization of Analysis Results
• Alerts
✓Results can be used as input for existing alerts or may form the basis of new alerts
✓Alerts may be created to inform users via email or SMS text about an event that requires
them to take corrective action
Topics for today
Consistency
ACID (2)
• Atomicity
  ✓ Ensures that all operations in a transaction either succeed completely or fail completely
  ✓ No partial transactions
• Consistency
  ✓ Ensures that only data that conforms to the constraints of the database schema can be
    written to the database
  ✓ A database that is in a consistent state will remain in a consistent state following a
    successful transaction
• Isolation
  ✓ Ensures that the results of a transaction are not visible to other operations until the
    transaction is complete
  ✓ Critical for concurrency control
• Durability
  ✓ Ensures that the results of a committed transaction are permanent
  ✓ Once a transaction is committed, it cannot be rolled back
  (a minimal transaction sketch follows below)
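A minimal sketch of atomicity, rollback and durability using Python's built-in sqlite3 module;
the accounts table and the balance-transfer scenario are illustrative assumptions:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")
conn.commit()

try:
    # Both updates form one transaction: either both apply or neither does (atomicity)
    conn.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 2")
    # Application-level consistency check before committing
    (balance,) = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()
    if balance < 0:
        raise ValueError("insufficient funds")
    conn.commit()        # durability: once committed, the change is permanent
except Exception:
    conn.rollback()      # atomicity: the partial transfer is undone

print(conn.execute("SELECT * FROM accounts").fetchall())   # [(1, 100), (2, 0)]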
CAP
• Stands for Consistency (C), Availability (A) and Partition tolerance (P)
• A triple constraint related to distributed database systems
• Consistency
  ✓ A read of a data item from any node returns the same, most recent value, i.e. all nodes
    appear to hold the same data
• Availability
  ✓ A read/write request will always be acknowledged in the form of a success or failure within
    a reasonable time
• Partition tolerance
  ✓ The system can continue to function when communication outages split the cluster into
    multiple silos (the "split brain" problem) and can still service read/write requests
Consistency levels in distributed systems
• What is consistency in distributed systems?
  • Each replica node has the same view of the data at a given time
  • Each read request gets the most recent value of a write
  • Refers to the rules for making a concurrent, distributed system appear like a single
    centralized system
Types of consistency
1. Eventual consistency
2. Causal consistency
3. Sequential consistency
4. Strict consistency / Linearizability
Refer: http://dbmsmusings.blogspot.com/2019/07/overview-of-consistency-levels-in.html
Levels of Consistency (1)
• Eventual consistency
  • Weakest form of consistency
  • All replicas will eventually return the same value for read requests - if there are no writes
    for "some time", then all nodes will eventually agree on the latest value of the data item
  • Ensures high availability
  • e.g. DNS and Cassandra use eventual consistency (a minimal sketch follows below)
• Causal consistency
  • Weak consistency, but stronger than eventual consistency
  • Preserves the order of causally related (dependent) operations
  • Does not ensure ordering of operations that are not causally related
  • Popular and useful model where only causally connected writes and reads need to be ordered:
    if a node writes a data item Y after reading a data item X, then all nodes must observe the
    write that produced X before the write to Y
  • e.g. comments and their replies on social media
• Sequential consistency
  • Stronger than causal consistency
  • All writes across nodes for all data items are globally ordered, and all nodes must see the
    same order; real-time ordering is not required
  • Preserves the ordering specified by each client's program
  • Does not ensure that writes are visible instantaneously or in the real-time order in which
    they occurred
  • e.g. Facebook posts
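A minimal, purely illustrative Python sketch of eventual consistency; the Replica and
EventuallyConsistentStore classes and the single replication queue are simplifying assumptions,
not how Cassandra or DNS actually work:

from collections import deque

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}          # local copy of key -> value

    def read(self, key):
        return self.data.get(key)

class EventuallyConsistentStore:
    def __init__(self, replicas):
        self.replicas = replicas
        self.pending = deque()  # replication log not yet applied everywhere

    def write(self, key, value):
        # Acknowledge after updating only one replica (favour availability),
        # queue the update for the others
        self.replicas[0].data[key] = value
        for r in self.replicas[1:]:
            self.pending.append((r, key, value))
        return "ok"

    def propagate(self):
        # "After some time": apply all pending updates so the replicas converge
        while self.pending:
            r, key, value = self.pending.popleft()
            r.data[key] = value

store = EventuallyConsistentStore([Replica("A"), Replica("B"), Replica("C")])
store.write("x", 50)
print(store.replicas[2].read("x"))   # may print None -> stale read
store.propagate()
print(store.replicas[2].read("x"))   # 50 -> replicas have converged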
Levels of Consistency (2)
• Strict consistency
  • Strongest consistency model
  • A read request from any replica gets the latest written value
  • Requires real-time ordering of all writes - assumes the actual write time can be known,
    so reads return the latest data in real time across processes
  • e.g. updates on LDAP
• Linearizability
  • Very similar to strict consistency
  • Acknowledges that there is a period of time between when an operation is submitted to the
    system and when the system responds with an acknowledgement that it was completed
  • Acknowledges that write requests take time to reach all copies, so it does not impose an
    ordering on reads/writes whose time periods overlap
  • A linearizability guarantee does not place any ordering constraints on operations that occur
    with overlapping start and end times
C: Consistency
When can it be inconsistent?
[Diagram: a client issues "write id=3, value=50" to one node; that node applies "update id=3 to
50" and propagates it to its peers. A read of id=3 from peer C is allowed before peer C is
updated, even though the write "happens before" the read.]
A: Availability
P: Partition tolerance (1)
P: Partition tolerance (2)
P: Partition tolerance (3)
Simple Illustration of CAP
[Diagram: two replicas of x connected by a link; a client writes with set(x, v) and reads with get(x)]
• CA (Consistency, Availability): no partition; set(x,1) is acknowledged with ok, both replicas
  hold x=1, and get(x) returns 1
• CP (Consistency, Partition tolerance): the link fails; set(x,2) fails (or waits until the link
  is restored), both replicas still hold x=1, and get(x) returns the consistent value 1
• AP (Availability, Partition tolerance): the link fails; set(x,2) is acknowledged with ok on one
  replica (x=2) while the other still holds x=1, so get(x) may return the stale value 1
Topics for today
CAP Theorem (Brewer’s Theorem)
• So a system can be
  • CA or AP or CP, but cannot support all three (CAP)
• In effect, when there is a partition, the system has to decide whether to pick consistency or
  availability
[Diagram: C, A and P as overlapping circles; the pairwise overlaps CA, CP and AP are achievable,
but the three-way overlap is crossed out]
CAP Theorem – Historical note
• Prof. Eric Brewer (UC Berkeley) presented it as the CAP principle in a 1999 article
• In 2002 a formal proof was given by Gilbert and Lynch, making CAP a theorem
– [Seth Gilbert, Nancy A. Lynch: Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services.
SIGACT News 33(2): 51-59 (2002)]
https://mwhittaker.github.io/blog/an_illustrated_proof_of_the_cap_theorem/
Consistency clarification
Database options
• Different design choices are made by Big Data DBs ("pick two")
  ✓ CA: Traditional DBMS for OLTP, e.g. PostgreSQL, MySQL
  ✓ AP: Workloads where eventual consistency is sufficient, e.g. Cassandra, CouchDB, Dynamo
  ✓ CP: Can provide consistency with scale, e.g. HBase, MongoDB, Redis
• Faults are likely to happen in large scale systems
• This provides flexibility, depending on the use case, to choose C/A/P behavior - mainly around
  consistency semantics when there are faults
Importance of CAP Theorem
BASE
• BASE is a database design principle based on the CAP theorem
• Leveraged by AP database systems that use distributed technology
• Stands for
  ✓ Basically Available (BA)
  ✓ Soft state (S)
  ✓ Eventual consistency (E)
• It favors availability over consistency
• Soft state - inconsistency (stale answers) is allowed
• Eventual consistency - if updates stop, then after some time consistency will be achieved
• The soft approach towards consistency allows the system to serve many clients with low
  latency, albeit sometimes with inconsistent (stale) results
• Not useful for transactional systems where lack of consistency is a concern
• Useful for write-heavy workloads where reads need not be consistent in real time, e.g. social
  media applications, monitoring data for non-real-time analysis etc. (a tunable-consistency
  sketch follows below)
• BASE philosophy: best effort, optimistic; staleness and approximation are allowed
• ACID to BASE Transformation - https://www.datasciencecentral.com/acid-to-base-transformation/
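A minimal sketch of how an AP/BASE-style store such as Cassandra lets the client tune
consistency per request, using the DataStax Python driver; the keyspace and table names are
illustrative assumptions and a running cluster on localhost is assumed:

from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # assumed local cluster
session = cluster.connect("survey_ks")    # hypothetical keyspace

# Write acknowledged by a single replica: fast and highly available (BASE)
write = SimpleStatement(
    "INSERT INTO responses (id, rating) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ONE,
)
session.execute(write, ("2014HW112220", 4))

# Read that insists on a quorum of replicas: stronger consistency, less availability
read = SimpleStatement(
    "SELECT rating FROM responses WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
print(session.execute(read, ("2014HW112220",)).one())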
BASE – Basically Available
BASE – Soft State
• The database may be in an inconsistent state when data is read, so results may change if the
  same data is requested again
  ✓ Because the data may be updated in the background (to attain consistency), even though no
    user has written to the database between the two reads
BASE – Eventual Consistency
• The state in which reads by different clients, immediately following a write to the database,
  may not return consistent results
• The database only attains consistency once the changes have been propagated to all nodes
• While the database is in the process of attaining consistency, it is in a soft state
Topics for today
Database Sphere
What is a NoSQL Database?
• NoSQL databases, also known as "Not Only SQL" databases, are a type of database that does not
  use traditional SQL (Structured Query Language) as the primary way of storing and manipulating
  data
• They are designed to handle large amounts of unstructured, semi-structured, or
polymorphic data and are often used for big data, real-time data processing, and
cloud-based applications
• NoSQL databases use a distributed architecture, allowing them to scale horizontally
across multiple servers or nodes, making them ideal for handling high levels of
concurrency and data volume
What is NoSQL ?
How to choose the right NoSQL database?
Why NoSQL (1)
• RDBMS meant for OLTP systems / Systems of Record
• Strict consistency and durability guarantees (ACID) over multiple data items involved
in a transaction
• But they have scale and cost issues with large volumes of data, distributed geo-scale
applications, very high transaction volumes
• Typical web scale systems do not need strict consistency and durability for every use case
• Social networking
• Real-time applications
• Log analysis
• Browsing retail catalogs
• Reviews and blogs
•…
Why NoSQL (2)
Choice between consistency and availability
• In a distributed database
  • Scalability and fault tolerance can be improved through additional nodes, although this
    puts challenges on maintaining consistency (C)
  • The addition of nodes can also cause availability (A) to suffer due to the latency caused
    by increased communication between nodes
    • The system may have to update all replicas before sending success to the client, so writes
      take longer and the system may not be available to service reads on the same data item
      during this period
  • Large scale distributed systems cannot avoid network partitions entirely
    • Although communication outages are rare and temporary, partition tolerance (P) must always
      be supported by a distributed database
• CA - implies a single-site cluster in which all nodes can always communicate with each other
• CP - implies all the available data is consistent/accurate, but some data may not be available
• AP - implies all data is available, but some data returned may be inconsistent
NoSQL Characteristics
• Auto sharding
  ✓ automatically spreads data across a number of servers (see the sketch below)
  ✓ applications need not be aware of it
  ✓ helps in data balancing and recovery from failure
• Replication
  ✓ good support for replication of data, which offers high availability and fault tolerance
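A minimal, illustrative sketch of hash-based sharding in Python; the server names and the
modulo placement scheme are assumptions, and real stores typically use richer schemes such as
consistent hashing or range partitioning:

import hashlib

SHARDS = ["server-0", "server-1", "server-2", "server-3"]  # hypothetical servers

def shard_for(key: str) -> str:
    # Use a stable hash (not Python's built-in hash(), which is randomized per process)
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

for user_id in ["2014HW112220", "2018HW123123", "2020HW999001"]:
    print(user_id, "->", shard_for(user_id))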
NoSQL Use Cases
• Big data: NoSQL databases are perfect for handling large amounts of data since they can
scale horizontally across multiple servers or nodes and handle high levels of concurrency
• Real-time data processing: They are often used for real-time data processing since they can
handle high levels of concurrency and support low latency
• Cloud-based applications: NoSQL databases are perfect for cloud-based applications since
they can easily scale and handle large amounts of data in a distributed environment
• Content management: NoSQL databases are often used for content management systems
since they can handle large amounts of data and support flexible data models
• Social media: NoSQL databases are often used for social media applications since they can
handle high levels of concurrency and support flexible data models
• Internet of Things (IoT): These databases are often used for IoT applications since they can
handle large amounts of data from a large number of devices and handle high levels of
concurrency
• E-commerce: They are often used for e-commerce applications since they can handle high
levels of concurrency and support flexible data models
NoSQL - Pros and Cons
Pros
• Cost effective for large data sets
• Easy to implement
• Easy to distribute, especially across DCs
• Easier to scale up/down
• Relaxes data consistency when required
• No pre-defined schema
• Easier to model semi-structured data or connectivity data
• Easy to support data replication

Cons
• No (or limited) joins between data sets / tables
• No (or limited) group-by operations
• No full ACID properties for transactions
• No standard SQL interface
• Lack of standardisation in this space
• Makes it difficult to port from SQL and across NoSQL stores
• Fewer skills available compared to SQL
• Fewer BI tools compared to the mature SQL BI space
SQL vs NoSQL
Topics for today
Classification of NoSQL DBs
• Key – value
✓ Maintains a big hash table of keys and values
✓ Example : DynamoDB, Redis, Riak etc
• Document
✓ Maintains data in collections of documents
✓ Example : MongoDB, CouchDB etc
• Column
✓ Each storage block has data from only one column
✓ Example : Cassandra, HBase
• Graph
  ✓ Network databases
  ✓ Stores data as nodes connected by relationships (edges)
  ✓ Example: Neo4j, GraphX, HyperGraphDB, Apache TinkerPop
Classification: Document-based
• Store data in the form of documents using well-known formats like JSON
• Documents are accessible via their id, but can also be accessed through other indexes
• Maintains data in collections of documents
• Example:
  • MongoDB, CouchDB, Couchbase
• Book document (a minimal access sketch follows below):
{
  "Book Title" : "Fundamentals of Database Systems",
  "Publisher" : "Addison-Wesley",
  "Authors" : "Elmasri & Navathe",
  "Year of Publication" : "2011"
}
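A minimal sketch of storing and retrieving the book document with pymongo; the connection URL
and the "library"/"books" names are illustrative assumptions, and a local MongoDB server is
assumed:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
books = client["library"]["books"]

# Store the book as a document
books.insert_one({
    "Book Title": "Fundamentals of Database Systems",
    "Publisher": "Addison-Wesley",
    "Authors": "Elmasri & Navathe",
    "Year of Publication": "2011",
})

# Retrieve it by a field other than its id (a secondary index could back this query)
print(books.find_one({"Publisher": "Addison-Wesley"}))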
Classification: Key-Value store
• Simple data model based on fast access by the key to the value associated with the key
• Value can be a record or object or document or even complex data structure
• Maintains a big hash table of keys and values
• For example,
✓ DynamoDB, Redis, Riak
Key              Value
2014HW112220     {Santosh, Sharma, Pilani}
2018HW123123     {Eshwar, Pillai, Hyd}
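A minimal sketch of the same key-value idea with redis-py; a local Redis server is assumed and
the keys mirror the table above:

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Simple key -> value (the value here is a JSON string)
r.set("2014HW112220", '{"first": "Santosh", "last": "Sharma", "city": "Pilani"}')
print(r.get("2014HW112220"))

# Redis can also hold structured values, e.g. a hash per key
r.hset("2018HW123123", mapping={"first": "Eshwar", "last": "Pillai", "city": "Hyd"})
print(r.hgetall("2018HW123123"))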
Classification: Column-based
Classification: Graph-based
NoSQL Offerings in Cloud
Vendors
• Amazon
• Facebook
• Google
• Oracle
Summary
• What process steps one should take in a Big Data Analytics project
• What is C, A, P and various options when partitions happen
• Why is this important for Big Data environment where faults may happen in
large distributed systems
• CAP Theorem
• NoSQL introduction
• Classification of NoSQL databases
Next Session:
NoSQL Database - MongoDB