Bda Unit-1
Unit – 1
Introduction to Big Data
By :- Urvi Dhamecha
What’s Big Data?
• Big data is the term for a collection of data sets so large
and complex that it becomes difficult to process using
on-hand database management tools or traditional data
processing applications.
• The challenges include capture, storage, search, sharing,
transfer, analysis, and visualization.
• The trend to larger data sets is due to the additional
information derivable from analysis of a single large set
of related data, as compared to separate smaller sets
with the same total amount of data, allowing
correlations to be found to “spot business trends,
determine quality of research, prevent diseases, link
legal citations, combat crime, and determine real-time
roadway traffic conditions.”
Distributed file system
• A distributed file system (DFS) is a file
system whose data is stored on one or more
servers and accessed by clients over a network
as if it were on local storage.
Types of Big Data
• Structured Data
• Semi-Structured Data
• Unstructured Data
• Structured Data:
– Information stored in databases is known as
structured data because it is represented in
a strict format.
– The DBMS checks that all data follows the
structures and constraints specified in the
schema.
• Semi-Structured Data:
– In some applications, data is collected in an ad-hoc
manner before it is known how it will be stored and
managed.
– This data may have a certain structure, but not all the
information collected will have identical structure.
This type of data is known as semi-structured data.
– In semi-structured data, the schema information is
mixed in with the data values, since each data object
can have different attributes that are not known in
advance. Hence, this type of data is sometimes
referred to as self-describing data.
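The contrast between a fixed schema and self-describing data can be sketched in a few lines of Python. This is only an illustration: the `SCHEMA` fields, the `conforms` helper and the sample records are hypothetical and not taken from any particular DBMS.

```python
import json

# Structured data: every row must match one fixed schema (hypothetical fields).
SCHEMA = {"id": int, "name": str, "age": int}

def conforms(row: dict) -> bool:
    """Return True if the row has exactly the schema's fields and types."""
    return row.keys() == SCHEMA.keys() and all(
        isinstance(row[field], ftype) for field, ftype in SCHEMA.items()
    )

# Semi-structured data: each object carries its own attribute names,
# so the "schema" is mixed in with the values.
records = json.loads("""
[
  {"id": 1, "name": "Asha", "age": 30},
  {"id": 2, "name": "Ravi", "email": "ravi@example.com"},
  {"id": 3, "tags": ["sensor", "gps"], "location": {"lat": 22.3, "lon": 70.8}}
]
""")

for record in records:
    print(sorted(record), "fits fixed schema:", conforms(record))
```

Only the first record would pass a strict schema check; the other two are still usable because their attribute names travel with the data.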
• Unstructured Data:
– A third category is known as unstructured data,
because there is very limited indication of the type
of data.
– A typical example would be a text document that
contains information embedded within it. Web
pages in HTML that contain some data are also
considered unstructured data.
Characteristics of big data
The FOUR V’s of Big Data
• Volume
• Velocity
• Variety
• Veracity
Volume
• Big data is always large in volume. It actually doesn't
have to be a certain number of petabytes to qualify.
• If your store of old data and new incoming data has
gotten so large that you are having difficulty handling
it, that's big data.
• Remember that it's going to keep getting bigger. Your
consultant needs to recommend a scalable
solution that can grow with your data.
Variety
• Variety points to the number of sources or incoming
vectors leading to your databases.
• That might be embedded sensor data, phone
conversations, documents, video uploads or feeds,
social media, and much more.
• Variety in data means variety in databases – you'll
almost certainly need to add a non-relational database
if you haven't already done so.
Velocity
• Velocity or speed refers to how fast the data is coming
in, but also to how fast you need to be able to analyze
and utilize it.
• If you have one or more business processes that
require real-time data analysis, you have a velocity
challenge.
• Solving this issue might mean expanding your private
cloud using a hybrid model that allows bursting for
additional compute power as needed for data analysis.
• Your consultant may need to offer suggestions for
hardware, software, and business process changes to
handle today's high-speed data.
Veracity
• Veracity is probably the toughest nut to crack.
• Veracity refers to the quality, accuracy, integrity and
credibility of data.
• Gathered data could have missing pieces, might be
inaccurate or might not be able to provide real,
valuable insight.
• Veracity, overall, refers to the level of trust there is in
the collected data.
• If you can't trust the data itself, the source of the data,
or the processes you are using to identify which data
points are important, you have a veracity problem.
• One of the biggest problems with big data is the
tendency for errors to snowball.
• User entry errors, redundancy and corruption all
affect the value of data.
• Your consulting firm needs to help you clean your
existing data and put processes in place to reduce
the accumulation of dirty data going forward.
Big Data vs. Relational Data
Application | Relation-Based Data | Big Data
Data processing | Single-computer platform that scales with better CPUs, centralized processing. | Cluster platforms that scale to thousands of nodes, distributed processing.
Data management | Relational database (SQL), centralized storage. | Non-relational databases that manage varied data types and formats (NoSQL), distributed storage.
Analytics | Batched, descriptive, centralized. | Real-time, predictive and prescriptive, distributed analytics.
Advantages of “Big Data” Analytics
• Scalability – nodes can be added to scale the system
with little administration.
• Unlike traditional RDBMS, no pre-processing is
required before storing.
• Any unstructured data such as text, images and
videos can be stored.
• There is no fixed limit on how much data can be
stored or for how long.
• Protection against hardware failure – if a node
fails, its work is redirected to other nodes. Multiple
copies of the data are automatically stored.
Big Data Analytics
• Big data analytics is the process of examining
large data sets to uncover hidden patterns,
unknown correlations, market trends,
customer preferences and other useful
business information.
Big data applications
• Understanding and targeting users
• Understanding and optimizing business processes
• Performance optimization
• Improving healthcare and public health
• Improving sports performance
• Improving science and research
• Optimizing machine and device performance
• Improving security and law enforcement
• Improving and optimizing cities and countries
• Financial trading
Big Data Architecture
• What is Big Data Architecture?
Big Data Architecture Layers
1. Data Ingestion
2. Data Processing
3. Data Storage
4. Data Visualization
A Big Data architecture has four main layers:
1. Data Ingestion
This layer is responsible for collecting and storing data
from various sources. In Big Data, data ingestion is the
process of extracting data from various sources and
loading it into a data repository. Data ingestion is a key
component of a Big Data architecture because it
determines how data will be ingested, transformed,
and stored.
2. Data Processing
Data processing is the second layer, responsible for
collecting, cleaning, and preparing the data for analysis.
This layer is critical for ensuring that the data is high
quality and ready to be used in the future.
3. Data Storage
Data storage is the third layer, responsible for storing
the data in a format that can be easily accessed and
analyzed. This layer is essential for ensuring that the
data is accessible and available to the other layers.
4. Data Visualization
Data visualization is the fourth layer and is responsible
for creating visualizations of the data that humans can
easily understand. This layer is important for making
the data accessible.
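The four layers can be pictured as a small pipeline in which each stage hands its output to the next. The sketch below is a toy under stated assumptions: the function names `ingest`, `process`, `store` and `visualize`, the in-memory "sources" and the JSON "storage" are all hypothetical stand-ins for real ingestion tools, processing engines, data stores and dashboards.

```python
import json
from collections import Counter

def ingest(sources):
    """Layer 1: collect raw records from every source into one list."""
    return [line for source in sources for line in source]

def process(raw_lines):
    """Layer 2: clean the data by dropping blanks and normalizing case."""
    return [line.strip().lower() for line in raw_lines if line.strip()]

def store(records):
    """Layer 3: keep the data in an easily accessed format (JSON text here)."""
    return json.dumps(records)

def visualize(stored):
    """Layer 4: render a text bar chart of how often each value occurs."""
    counts = Counter(json.loads(stored))
    return "\n".join(f"{value:<8}{'#' * n}" for value, n in sorted(counts.items()))

# Two hypothetical event sources with messy input.
sources = [["Click ", "view", ""], ["click", "BUY", "view ", "click"]]
chart = visualize(store(process(ingest(sources))))
print(chart)
```

Keeping each layer as a separate function mirrors the point of the layered architecture: any one stage can be replaced or scaled without rewriting the others.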
Big Data Storage
File System and Distributed File System
• Difference between the local file system (LFS) and the
distributed file system (DFS):
Local File System | Distributed File System
LFS stores the data in a single block. | DFS divides data into multiple blocks and stores them on different DataNodes.
LFS is not reliable because it does not replicate data files. | DFS is reliable because data blocks are replicated across different DataNodes.
LFS is cheaper because it does not need extra storage for copies of a data file. | DFS is more expensive because it needs extra storage to replicate the same data blocks.
LFS is not appropriate for analysis of very big data files because processing takes a long time. | DFS is appropriate for analysis of big data files because processing takes less time than on a local file system.
LFS is less complex than DFS. | DFS is more complex than LFS.
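The block splitting and replication described in the comparison can be imitated in memory. The following is a minimal sketch, not how any real DFS is implemented: the helper names, the tiny 8-byte block size, the round-robin placement and the DataNode names `dn1` to `dn3` are assumptions made for illustration.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Divide a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, datanodes, replication=2):
    """Assign each block to `replication` different DataNodes, round-robin."""
    placement = {node: {} for node in datanodes}
    for index, block in enumerate(blocks):
        for r in range(replication):
            node = datanodes[(index + r) % len(datanodes)]
            placement[node][index] = block
    return placement

def read_file(placement, block_count, failed=()):
    """Reassemble the file from any surviving replica of each block."""
    parts = []
    for index in range(block_count):
        replica = next(
            blocks[index]
            for node, blocks in placement.items()
            if node not in failed and index in blocks
        )
        parts.append(replica)
    return b"".join(parts)

data = b"big data needs distributed storage"
blocks = split_into_blocks(data, block_size=8)
placement = place_blocks(blocks, ["dn1", "dn2", "dn3"])

# The file is still readable after one DataNode fails.
assert read_file(placement, len(blocks), failed={"dn2"}) == data
```

The extra copies are exactly the added storage cost listed in the table, and they are also what lets the read succeed when a node is down.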
NoSQL
What is NoSQL?
• NoSQL is a class of database management systems that
provide a mechanism for storing and retrieving
massive amounts of unstructured data in a distributed
environment on virtual servers, with a focus on
high scalability, performance and availability.
• NoSQL was developed in response to the large volume
of data stored about users, objects and products that
needs to be frequently accessed and processed.
• Some say the term “NoSQL” stands for “non SQL”
while others say it stands for “not only SQL.”
Features of NoSQL
• NoSQL is a next-generation class of databases that
differs substantially from traditional relational databases.
• NoSQL is often read as “Not only SQL”: SQL as well as other
query languages can be used with NoSQL databases.
• NoSQL is non-relational and schema-free.
• NoSQL databases generally avoid JOINs.
• NoSQL uses a distributed architecture and works across
multiple processors to give high performance.
• NoSQL databases are horizontally scalable.
• Many open-source NoSQL databases are available.
• Data files can be easily replicated.
• NoSQL uses a simple API (Application Programming
Interface).
• NoSQL can manage huge amounts of data.
• NoSQL can be implemented on commodity hardware
in which each node has its own RAM and disk
(the shared-nothing concept).
• 24 × 7 Data availability
• Location transparency
• Schema-less data model
• Modern day transaction analysis
• Architecture that suits big data
• Analytics and business intelligence
Why NoSQL?
• A relational database product can deal with more
predictable, structured data.
• NoSQL is required because today’s industry needs a
very agile system that can process unstructured and
unpredictable data dynamically.
• NoSQL is known for its high performance with high
availability, rich query language, and easy scalability
as per the need.
• Relational (SQL) databases support the atomicity,
consistency, isolation, durability (ACID) properties.
• NoSQL databases are designed around the trade-offs
described by the CAP theorem.
CAP Theorem
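• The Consistency, Availability, Partition tolerance (CAP)
theorem, also called Brewer’s theorem, states that a
distributed data store can guarantee at most two of these
three properties at the same time.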
Sharding
• Database sharding is a technique for horizontal
scaling of databases, where the data is split across
multiple database instances, or shards, to improve
performance and reduce the impact of large
amounts of data on a single database.
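As a rough illustration of splitting data across shards, the sketch below hashes each key to pick one of several in-memory dictionaries standing in for database instances. The `ShardedStore` class, the shard count and the `user:N` keys are hypothetical; a real deployment would route requests to separate servers.

```python
import hashlib

class ShardedStore:
    """Toy key-value store that spreads keys across several in-memory shards."""

    def __init__(self, shard_count: int):
        self.shards = [{} for _ in range(shard_count)]

    def shard_for(self, key: str) -> int:
        # A stable hash (not Python's per-process randomized hash()) keeps a
        # key on the same shard across runs.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, value):
        self.shards[self.shard_for(key)][key] = value

    def get(self, key):
        return self.shards[self.shard_for(key)].get(key)

users = ShardedStore(shard_count=4)
for user_id in range(1000):
    users.put(f"user:{user_id}", {"id": user_id})

print([len(shard) for shard in users.shards])  # sizes of the four shards
print(users.get("user:42"))
```

Each shard ends up holding only part of the data, which is the sense in which sharding reduces the load that any single database instance has to carry.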
Sharding Architectures:
• Key Based Sharding
• Horizontal or Range Based Sharding
• Vertical Sharding
• Directory-Based Sharding
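Key-based sharding typically applies a hash function to the shard key. The other three architectures can be sketched as plain routing functions; this is a hedged illustration in which the range boundaries, the country-to-shard directory and the column names are invented for the example.

```python
import bisect

# Range-based sharding: each shard owns a contiguous range of key values.
# Hypothetical exclusive upper bounds for shards 0, 1 and 2; any key at or
# above the last bound goes to shard 3.
RANGE_BOUNDS = [1000, 2000, 3000]

def range_shard(customer_id: int) -> int:
    return bisect.bisect_right(RANGE_BOUNDS, customer_id)

# Directory-based sharding: a lookup table maps a shard key to a shard.
DIRECTORY = {"IN": "shard-asia", "DE": "shard-europe", "US": "shard-americas"}

def directory_shard(country_code: str) -> str:
    return DIRECTORY[country_code]

# Vertical sharding: the columns of one row are split across stores,
# with the key column kept in both parts so they can be matched up again.
def vertical_split(row: dict, key: str, profile_columns: set) -> tuple[dict, dict]:
    profile = {k: v for k, v in row.items() if k == key or k in profile_columns}
    rest = {k: v for k, v in row.items() if k == key or k not in profile_columns}
    return profile, rest

row = {"id": 7, "name": "Asha", "email": "asha@example.com", "last_order": "A-19"}
print(range_shard(1500), directory_shard("DE"))
print(vertical_split(row, key="id", profile_columns={"name", "email"}))
```

A directory is the most flexible of these, since entries can be remapped without changing any routing code, at the cost of one extra lookup per request.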
Replication
What is Replication?
• Data replication is the process of creating and
maintaining multiple copies of the same data in different
locations or on different storage devices.
• The goal of data replication is to improve data
availability, reliability, and fault tolerance.
• By having multiple copies of data, systems can continue
to function even if one copy becomes unavailable due to
hardware failure, network issues, or other reasons.
• Data replication is commonly used in distributed systems,
databases, and storage systems to ensure that data is
always accessible and to improve system performance
and scalability.
Benefits of data replication:
• Improve the availability of data
• Increase the speed of data access
• Enhance server performance
• Accomplish disaster recovery
Improve the availability of data
• When a particular system experiences a technical glitch due to
malware or a faulty hardware component, the data can still be
accessed from a different site or node.
• Data replication enhances the resilience and reliability of
systems by storing data at multiple nodes across the network.
Increase data access speed
• In organizations where there are multiple branch offices
spread across the globe, users may experience some latency
while accessing data from one country to another.
• Placing replicas on local servers provides users with faster
data access and query execution times.
Enhance server performance
• Database replication effectively reduces the load on the primary server by
dispersing it among other nodes in the distributed system, thereby
improving network performance.
• By routing all read-operations to a replica database, IT administrators can
save the primary server for write-operations that demand more
processing power.
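The idea of keeping the primary for writes and sending reads to replicas can be sketched as follows. The `ReplicatedDatabase` class is hypothetical, the replicas are plain dictionaries, and the copy to replicas is done synchronously only to keep the example short.

```python
import itertools

class ReplicatedDatabase:
    """Toy primary/replica setup: writes go to the primary, reads to replicas."""

    def __init__(self, replica_count: int):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))
        self.reads_served = [0] * replica_count

    def write(self, key, value):
        self.primary[key] = value
        # Synchronous copy for simplicity; real systems often replicate
        # asynchronously.
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        index = next(self._next_replica)  # round-robin over the replicas
        self.reads_served[index] += 1
        return self.replicas[index].get(key)

db = ReplicatedDatabase(replica_count=2)
db.write("order:1", "shipped")
results = [db.read("order:1") for _ in range(4)]
print(results, db.reads_served)
```

In this toy, none of the four reads touches the primary, which is the load reduction the bullets above describe.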
Accomplish Disaster recovery
• Businesses are often susceptible to data loss due to a data breach or
hardware malfunction.
• During such a catastrophe, the employees' valuable data, along with client
information can be compromised.
• Data replication facilitates the recovery of data which is lost or corrupted
by maintaining accurate backups at well-monitored locations, thereby
contributing to enhanced data protection.
Types of data replication
• Full table replication
• Transactional replication
• Snapshot replication
• Merge replication
• Key-based incremental replication
Full table replication
• Full table replication means that the entire data is
replicated. This includes new, updated as well as existing
data that is copied from source to the destination.
• This method of replication is generally associated with
higher costs since the processing power and network
bandwidth requirements are high.
• However, full table replication can be beneficial when it
comes to the recovery of hard-deleted data.
Transactional replication
• In this method, the data replication software makes full
initial copies of data from origin to destination, after
which the subscriber database receives updates
whenever data is modified.
• This is a more efficient mode of replication since fewer
rows are copied each time data is changed.
• Transactional replication is usually found in
server-to-server environments.
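A minimal sketch of this pattern, assuming a simple in-memory change log: the subscriber takes one full copy up front and afterwards applies only the logged changes. The `Publisher` and `Subscriber` classes are hypothetical and do not reflect any specific product's API.

```python
class Publisher:
    """Source database that records every change in an ordered log."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.log = []  # ordered (operation, key, value) entries

    def upsert(self, key, value):
        self.rows[key] = value
        self.log.append(("upsert", key, value))

    def delete(self, key):
        del self.rows[key]
        self.log.append(("delete", key, None))

class Subscriber:
    """Destination that starts from a full copy, then replays the log."""

    def __init__(self, publisher):
        self.rows = dict(publisher.rows)   # full initial copy
        self.applied = len(publisher.log)  # position reached in the log

    def sync(self, publisher):
        """Apply only the changes made since the last sync; return the count."""
        pending = publisher.log[self.applied:]
        for operation, key, value in pending:
            if operation == "upsert":
                self.rows[key] = value
            else:
                self.rows.pop(key, None)
        self.applied = len(publisher.log)
        return len(pending)

publisher = Publisher({1: "a", 2: "b", 3: "c"})
subscriber = Subscriber(publisher)
publisher.upsert(2, "B")
publisher.delete(3)
copied = subscriber.sync(publisher)
print(copied, subscriber.rows)
```

Only the two changed rows cross the wire on `sync`, rather than the whole table, which is why this mode is described as more efficient.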
Snapshot replication
• In Snapshot replication, data is replicated exactly as it
appears at any given time.
• Unlike other methods, Snapshot replication does not pay
attention to the changes made to data.
• This mode of replication is used when changes made to
data tend to be infrequent, for example when performing
initial synchronizations between publishers and
subscribers.
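In its simplest form, a snapshot is a point-in-time copy that ignores whatever happens to the source afterwards. The sketch below shows this with an in-memory dictionary; the `source` data is hypothetical.

```python
import copy

source = {"prices": {"tea": 10, "coffee": 15}}

# Take the snapshot: a complete copy of the data as it appears right now.
snapshot = copy.deepcopy(source)

# Later changes to the source are not tracked by the snapshot.
source["prices"]["tea"] = 12
source["prices"]["juice"] = 20

print(snapshot["prices"])  # still the state at snapshot time
print(source["prices"])
```

To pick up the later changes, a new snapshot has to be taken, which is why this mode suits data that changes infrequently.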
Merge replication
• This type of replication is commonly found in server-to-
client environments and allows both the publisher and
subscriber to make changes to data dynamically.
• In merge replication, data from two or more databases
are combined to form a single database thereby
contributing to the complexity of using this technique.
Key-based incremental replication
• Also called key-based incremental data capture, this
technique only copies data changed since the last
update.
• Keys can be looked at as elements that exist within
databases that trigger data replication.
• Since only a few rows are copied during each update, the
costs are significantly lower.
• However, the drawback is that this replication mode
cannot be used to recover hard-deleted data, since the
key value is deleted along with the record.
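The technique and its drawback can both be seen in a small sketch. It assumes a hypothetical `updated_at` column as the replication key and a stored "bookmark" holding the highest key value copied so far; real tools may use other columns or mechanisms.

```python
import copy

def incremental_sync(source, destination, bookmark):
    """Copy rows whose replication key is newer than the bookmark.

    Returns the highest replication key seen, to use as the next bookmark.
    """
    changed = {
        pk: row for pk, row in source.items() if row["updated_at"] > bookmark
    }
    destination.update(copy.deepcopy(changed))
    return max([bookmark] + [row["updated_at"] for row in changed.values()])

source = {
    1: {"name": "alpha", "updated_at": 1},
    2: {"name": "beta", "updated_at": 2},
}
destination = {}
bookmark = incremental_sync(source, destination, bookmark=0)  # copies both rows

source[1] = {"name": "alpha-v2", "updated_at": 3}  # update: key moves forward
del source[2]                                       # hard delete: no key left
bookmark = incremental_sync(source, destination, bookmark)    # copies row 1 only

print(bookmark, destination)
```

After the second sync the destination still contains row 2: because the deleted record took its key with it, nothing signals the deletion to the replica.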
ACID and BASE Properties
ACID Model:
The ACID model is easiest to understand by breaking down
the acronym:
• Atomicity: A transaction must be treated as an atomic
unit: either all of its operations are executed or none
are. The database must never be left in a state where a
transaction is partially completed; its state is defined
either before the transaction executes or after it
completes.
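Atomicity can be observed with Python's built-in `sqlite3` module. In this sketch the `accounts` table, the `transfer` helper and the balances are hypothetical; the `CHECK` constraint is there only to force the second update to fail so that the rollback is visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, sender, receiver, amount):
    """Move money between accounts; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

ok = transfer(conn, "alice", "bob", 30)       # both updates succeed
failed = transfer(conn, "alice", "bob", 500)  # debit violates the CHECK
print(ok, failed, balances(conn))
```

In the failed transfer the credit to `bob` had already run when the debit was rejected, yet it is undone together with the rest of the transaction, so no partially completed state is left behind.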
• Consistency: The database must remain in a consistent state
after any transaction. No transaction should have an adverse
effect on the data residing in the database: if the database
was in a consistent state before a transaction executed, it
must remain consistent afterwards.
• Isolation: In a database system where more than one
transaction is executed simultaneously and in parallel,
isolation means that each transaction is carried out as if it
were the only transaction in the system. No transaction
affects the existence of any other transaction.
• Durability: Once a transaction has been committed, its
changes are permanent and survive later failures such as a
crash or power loss.
BASE Model:
The acronym BASE stands for:
• Basically Available: Instead of enforcing immediate
consistency, BASE-modelled NoSQL databases ensure the
availability of data by spreading and replicating it
across the nodes of the database cluster.
• Soft State: Due to the lack of immediate consistency,
data values may change over time. The BASE model breaks
with the concept of a database that enforces its own
consistency, delegating that responsibility to developers.
• Eventually Consistent: BASE does not enforce immediate
consistency, but that does not mean consistency is never
achieved. Until it is, data reads are still possible (even
though they might not reflect the latest writes).
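The BASE behaviour can be simulated with a toy replicated store in which a write is acknowledged by one replica at once and reaches the others only later. The `EventuallyConsistentStore` class and its explicit `propagate` step are assumptions for illustration; real systems deliver updates in the background.

```python
class EventuallyConsistentStore:
    """Toy replicated store where writes reach other replicas only later."""

    def __init__(self, replica_count):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # (replica_index, key, value) updates not yet delivered

    def write(self, key, value):
        self.replicas[0][key] = value  # acknowledged right away
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica_index):
        # Basically available: every replica answers, even if it is stale.
        return self.replicas[replica_index].get(key)

    def propagate(self):
        """Deliver queued updates in order; afterwards all replicas agree."""
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()

cluster = EventuallyConsistentStore(replica_count=3)
cluster.write("stock", 5)
stale = cluster.read("stock", replica_index=2)  # update not delivered yet
cluster.propagate()
fresh = cluster.read("stock", replica_index=2)
print(stale, fresh)
```

Between the write and the propagation, the replicas disagree (soft state) but still answer reads; once propagation completes, every replica returns the same value.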
Difference between ACID and BASE:
S. No | Criteria | ACID | BASE
11. | Examples | Oracle, MySQL, SQL Server, etc. | DynamoDB, Cassandra, CouchDB, SimpleDB, etc.
End of Unit – 1