BDA Module 5 - Part1 (No SQL) 2023

The document provides an overview of NoSQL databases, detailing their characteristics, advantages, and types, including key-value, column-oriented, document, and graph databases. It discusses the benefits of NoSQL, such as scalability and flexibility, while also addressing challenges like lack of ACID compliance and expertise. Additionally, it covers distribution models, the CAP theorem, and the BASE consistency model as alternatives to traditional ACID transactions.

Uploaded by

shubyadav1010
Big Data Analytics

Module V-Part 1

Introduction to NoSQL Data Management

By

Dr. Jagadamba G
Dept. of ISE, SIT, Tumakuru
Learning Objectives and Learning Outcomes

Learning Objectives
The big data technology landscape:
1. What are NoSQL databases?
2. Why NoSQL?
3. Key advantages of NoSQL.

Learning Outcomes
a) To understand the significance of NoSQL databases.
b) To understand the need for NewSQL.

Big Data and Analytics by Seema Acharya and Subhashini Chellappan
Introduction to NoSQL (Not Only SQL)

• NoSQL provides a platform for schema-free databases that can handle
large amounts of data.
• These databases are scalable, highly available, support replication,
are distributed, and are often open source.
• Before we develop applications that interact with NoSQL databases,
we should understand the need to keep data management separate
from data storage in these databases.
• NoSQL focuses on high-performance, scalable data storage and
provides low-level access to the data management layer.
• This allows data management tasks to be written easily in any
programming language.
Why NoSQL?

NoSQL:
• Non-relational data storage systems
• No fixed table schema
• No joins
• No multi-document transactions
• Relaxes one or more ACID properties

Benefits of NoSQL Databases
• Scalable
• Simple data model
• Streaming/Volume
• Reliability
• Schema-less
• Rapid development
• Flexible
• Cheaper than RDBMS
• Creates a caching layer
• Wide data type variety
• Uses large binary objects for storing large data
• Bulk upload
• Graphs
• Lower administration
• Distributed storage
• Real-time analysis

Challenges against NoSQL
• ACID transactions
• Cannot use SQL
• Ecosystem/tools/add-ons
• Cannot perform searches
• Data loss
• No referential integrity
• Lack of availability of expertise
Characteristics of NoSQL
• More than rows in tables: NoSQL systems store and retrieve data from many formats: key-value
stores, graph databases, column-family (Bigtable) stores, document stores, and even
rows in tables.
• Free of joins: NoSQL systems allow you to extract your data using simple interfaces
without joins.
• Schema-free: NoSQL systems allow you to drag-and-drop your data into a folder and
then query it without creating an entity-relational model.
• Works on many processors: NoSQL systems allow you to store your database on
multiple processors and maintain high-speed performance.
• Uses shared-nothing commodity computers: Most (but not all) NoSQL systems
leverage low-cost commodity processors that have separate RAM and disk.
• Supports linear scalability: When you add more processors, you get a consistent
increase in performance.
• Innovative: NoSQL offers alternatives to a single way of storing, retrieving, and
manipulating data. NoSQL supporters (also known as NoSQLers) have an inclusive
attitude about NoSQL and recognize SQL solutions as viable options. To the NoSQL
community, NoSQL means “Not only SQL.”
History of NoSQL

• The term NoSQL was coined by Carlo Strozzi in 1998.
• It started as a mechanism for data retrieval and storage.
• Eric Evans reintroduced the term NoSQL in 2009.
• Examples of NoSQL databases are MongoDB, Cassandra, Redis, HBase, Splunk, Neo4j, CouchDB, etc.
Types of NoSQL

Key-value data model | Column-oriented data model | Document data model | Graph data model
Riak                 | Cassandra                  | MongoDB             | InfiniteGraph
Redis                | HBase                      | CouchDB             | Neo4j
Membase              | Hypertable                 | RavenDB             | AllegroGraph
1. Key-value data model
• A key-value database (also known as a key-value store or key-value store database) is a
type of NoSQL database that uses a simple key/value method to store data.
• The key-value part refers to the fact that the database stores data as a collection of
key/value pairs. This is a simple method of storing data, and it is known to scale well.
• The key-value pair is a well-established concept in many programming languages.
Programming languages typically refer to a key-value collection as an associative array or
data structure. It is also commonly referred to as a dictionary or hash.
• Example: Phone directory

Key     Value
Bob     (123) 456-7890
Jane    (234) 567-8901
Tara    (345) 678-9012
Tiara   (456) 789-0123
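
The phone directory above can be sketched with a Python dictionary, which is the "dictionary or hash" the slide mentions. This is only an in-memory illustration, not the API of any particular key-value DBMS; the helper names get, put and delete and the extra entry "Sam" are assumptions made for the example.

```python
# A plain dict stands in for the key-value store (illustrative only).
phone_directory = {
    "Bob": "(123) 456-7890",
    "Jane": "(234) 567-8901",
    "Tara": "(345) 678-9012",
    "Tiara": "(456) 789-0123",
}

def get(store, key, default=None):
    """Look up a value by its key; return default if the key is absent."""
    return store.get(key, default)

def put(store, key, value):
    """Insert a new pair or overwrite the value of an existing key."""
    store[key] = value

def delete(store, key):
    """Remove a pair if present; missing keys are ignored."""
    store.pop(key, None)

put(phone_directory, "Sam", "(567) 890-1234")  # hypothetical extra entry
delete(phone_directory, "Tiara")
print(get(phone_directory, "Jane"))
```

Every operation goes through the key, which is why this model is simple and tends to scale well: there is no query over the values themselves.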
The Key
• The key in a key-value pair must (or at least, should) be unique. This is the
unique identifier that allows you to access the value associated with that
key.
• In theory, the key could be anything. But this may depend on the DBMS.
One DBMS may impose limitations while another may impose none.
• However, for performance reasons, you should avoid having a key that’s
too long. But too short can cause readability issues too. In any case, the key
should follow an agreed convention in order to keep things consistent.
The Value

• The value in a key-value store can be anything, such as text (long
or short), a number, markup code such as HTML, programming code
such as PHP, an image, etc.
• The value could also be a list, or even another key-value pair
encapsulated in an object.
• Some key-value DBMSs allow you to specify a data type for the
value. For example, you could specify that the value should be an
integer. Other DBMSs don’t provide this functionality and therefore,
the value could be of any type.
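
As a rough sketch of the two slides above, the following stores values of several kinds under keys that follow one agreed convention. The entity:id:field key format, the make_key helper and the optional expected_type check are assumptions for illustration; real key-value DBMSs differ in which limits and type checks they impose.

```python
import json

store = {}  # in-memory stand-in for a key-value DBMS

def make_key(entity, entity_id, field):
    """Build keys with one agreed convention: entity:id:field."""
    return f"{entity}:{entity_id}:{field}"

def put(key, value, expected_type=None):
    """Store a value as JSON text; optionally enforce a declared type."""
    if expected_type is not None and not isinstance(value, expected_type):
        raise TypeError(f"{key} expects {expected_type.__name__}")
    store[key] = json.dumps(value)

def get(key):
    """Return the decoded value, or None when the key is absent."""
    raw = store.get(key)
    return None if raw is None else json.loads(raw)

put(make_key("user", 1001, "name"), "Bob")                    # short text
put(make_key("user", 1001, "visits"), 42, expected_type=int)  # typed number
put(make_key("user", 1001, "cart"), ["pen", "notebook"])      # a list
put(make_key("user", 1001, "profile"),                        # nested pairs
    {"city": "Tumakuru", "html": "<b>Bob</b>"})

print(get("user:1001:cart"))
```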
Examples of Key-Value Database Management Systems
• Redis
• Oracle NoSQL Database
• Voldemort
• Aerospike
• Oracle Berkeley DB
2. Column-oriented Data model
• In this, data is stored in cells grouped in columns of data rather than as rows of data.
• Columns are logically grouped into column families. Column families can contain a
virtually unlimited number of columns that can be created at runtime or while defining
the schema.
• Reads and writes are done using columns rather than rows.
• Column families are groups of similar data that are usually accessed together. As an
example, we often access customers’ names and profile information at the same time,
but not the information on their orders.
• The main advantages of storing data in columns, compared with a relational DBMS, are fast
search/access and data aggregation.
• Each column family can be compared to a container of rows in an RDBMS table, where
the key identifies the row and the row consists of multiple columns. The difference is
that various rows do not have to have the same columns, and columns can be added to
any row at any time without having to add them to other rows.
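
The column-family idea can be sketched with nested dictionaries: a family maps row keys to whatever columns each row happens to have. The family names "profile" and "orders", the customer keys and the helper functions are hypothetical, and the sketch ignores how real column stores lay data out on disk.

```python
from collections import defaultdict

# column family -> row key -> {column name: value}
column_families = defaultdict(dict)

def put(family, row_key, column, value):
    """Write one cell; the column is created on first use."""
    column_families[family].setdefault(row_key, {})[column] = value

def get_row(family, row_key):
    """Read all columns one row holds within a single column family."""
    return column_families[family].get(row_key, {})

def column_values(family, column):
    """Collect one column across all rows that define it."""
    return [cols[column]
            for cols in column_families[family].values()
            if column in cols]

put("profile", "cust1", "name", "Bob")
put("profile", "cust1", "city", "Pune")
put("profile", "cust2", "name", "Jane")   # cust2 has no "city" column
put("orders", "cust1", "order_count", 3)
put("orders", "cust2", "order_count", 5)

print(get_row("profile", "cust2"))
print(sum(column_values("orders", "order_count")))
```

Reading a customer's profile never touches the "orders" family, and aggregating order_count only scans that one column.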
Example use cases for the column-oriented data model

• Content management systems
• Blogging platforms
• Systems that maintain counters
• Services that have expiring usage
• Systems that require heavy write requests (like log
aggregators)
3. Document Data Model
• Documents can be stored in many formats, such as XML, JSON, BSON, etc.
• These are self-describing, hierarchical tree data structures that can contain maps,
collections and scalar values.
• Document databases store documents in the value part of the key/value store.
• To ease the transition from relational databases, document databases provide
indexing, searching, etc.
• They provide good performance and scalability, but often do not provide ACID guarantees
or enforce data integrity.
• A document database is not a replacement for a relational database, but an alternative
approach.
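
A minimal sketch of the document model, assuming JSON-like documents held in a Python dictionary keyed by document id. The order documents, the dotted-path lookup get_path and the find helper are made up for the example and do not correspond to the query API of MongoDB, CouchDB or any other product.

```python
import json

# Each document is a self-describing tree: maps, collections and scalars.
documents = {
    "order:1": {"customer": {"name": "Bob", "city": "Pune"},
                "items": [{"sku": "pen", "qty": 2}, {"sku": "ink", "qty": 1}],
                "paid": True},
    "order:2": {"customer": {"name": "Jane"},   # no "city" field here
                "items": [{"sku": "pen", "qty": 5}]},
}

def get_path(doc, path):
    """Follow a dotted path such as 'customer.name'; None if missing."""
    node = doc
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

def find(path, value):
    """Return ids of documents whose field at `path` equals `value`."""
    return [doc_id for doc_id, doc in documents.items()
            if get_path(doc, path) == value]

print(json.dumps(documents["order:2"], indent=2))
print(find("customer.name", "Jane"))
```

The two documents do not share the same fields, yet both can be stored and searched, which is the flexibility the slide describes.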
Examples of Document Data Model

• MongoDB
• CouchDB
• Terrastore
• OrientDB
• RavenDB
• Lotus Notes

Note: Couchbase now offers ACID Transactions.

4. Graph-based NoSQL database
• It is designed to handle very large sets of data and is capable of
integrating heterogeneous data from many sources and making links
between datasets.
• It focuses on the relationships between entities and is able to infer new
knowledge out of existing information.
• It is built upon the Entity – Attribute – Value model.
• Entities are also known as nodes, which have properties.
• It is a very flexible way to describe how data relates to other data.
• Nodes store data about each entity in the database, relationships
describe a relationship between nodes, and a property is simply the node
on the opposite end of the relationship.
• A traditional database stores a description of each possible
relationship in foreign key fields or junction tables, whereas
graph databases allow virtual relationships on any definition.
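
To make nodes, properties and relationships concrete, here is a small in-memory sketch. The node ids, the relationship types KNOWS, WORKS_AT and INVESTS_IN, and the helper functions are all hypothetical; a real graph database such as Neo4j exposes its own query language instead.

```python
from collections import deque

# Nodes carry properties about each entity.
nodes = {
    "alice": {"label": "Person", "age": 30},
    "bob": {"label": "Person", "age": 25},
    "acme": {"label": "Company"},
}

# Relationships are (source, relationship type, target) triples.
relationships = [
    ("alice", "KNOWS", "bob"),
    ("bob", "WORKS_AT", "acme"),
]

def neighbours(node_id, rel_type=None):
    """Targets of outgoing relationships, optionally filtered by type."""
    return [target for source, rel, target in relationships
            if source == node_id and (rel_type is None or rel == rel_type)]

def reachable(start):
    """Breadth-first traversal that follows relationships from `start`."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        current = queue.popleft()
        for nxt in neighbours(current):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order

# A new kind of relationship needs no schema change or junction table.
relationships.append(("alice", "INVESTS_IN", "acme"))

print(reachable("alice"))
print(neighbours("alice", "INVESTS_IN"))
```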
Examples of Graph-based NoSQL databases

• Neo4j
• InfoGrid
• InfiniteGraph

Note: A Fortune 500 financial services company uses Neo4j to more quickly identify potential fraud,
stopping millions of fraudulent transactions.
With the advent of the NoSQL movement, businesses of all sizes have a
variety of modern options from which to build solutions relevant to their use
cases.

• Calculating average income? Ask a relational database.

• Building a shopping cart? Use a key-value Store.

• Storing structured product information? Store as a document.

• Describing how a user got from point A to point B? Follow a graph.


Advantages of NoSQL

• Cheap, easy to implement
• Easy to distribute
• Can easily scale up and down
• Relaxes the data consistency requirement
• Doesn’t require a pre-defined schema
• Data can be replicated to multiple nodes and can be partitioned
NoSQL Vendors

Company  | Product   | Most widely used by
Amazon   | DynamoDB  | LinkedIn, Mozilla
Facebook | Cassandra | Netflix, Twitter, eBay
Google   | BigTable  | Adobe Photoshop

Materialized View
• A materialized view differs from a normal view in that its results are stored. It is used in some
environments where the source data is in a format that is not suitable for querying.
• These views are disk based and updated periodically as per the requirements of the query.
• It has a storage cost associated with it.
• It has an update cost associated with it.
• There is no SQL standard for defining a materialized view, and the functionality is provided by some
database systems as an extension.
• Materialized views are efficient when the view is accessed frequently, i.e., when response time
should be very fast, as storing the results beforehand saves computation time.
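
The trade-off above can be sketched as a precomputed summary that is refreshed from time to time. The events list, the per-user totals and the refresh_view function are assumptions chosen for illustration; actual database systems define and refresh materialized views through their own extensions.

```python
# Source data in a shape that is awkward to query directly (raw events).
events = [
    {"user": "bob", "amount": 20},
    {"user": "jane", "amount": 35},
    {"user": "bob", "amount": 5},
]

materialized_totals = {}  # the stored view (this is the storage cost)

def refresh_view():
    """Recompute the view from the source data (this is the update cost)."""
    totals = {}
    for event in events:
        totals[event["user"]] = totals.get(event["user"], 0) + event["amount"]
    materialized_totals.clear()
    materialized_totals.update(totals)

def total_for(user):
    """Answer from the stored result without rescanning the events."""
    return materialized_totals.get(user, 0)

refresh_view()
events.append({"user": "bob", "amount": 10})
stale = total_for("bob")   # the view has not been refreshed yet
refresh_view()
fresh = total_for("bob")   # reflects the new event after the refresh
print(stale, fresh)
```

Between refreshes, reads are fast but may be out of date, which is why the refresh period is chosen per query requirement.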
Distribution models
There are two styles of distributing data:
• Sharding provides horizontal scalability by allowing different sites to hold
different parts of the data. This helps reduce the workload of each server.
• Replication is the process of copying the same data across different sites, while
sharding is the process of distributing different datasets across different sites.
• In addition, sharding improves both read and write performance, while replication
improves read performance but not write performance.
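
A rough sketch of how the two styles combine, assuming hash-based routing of keys to shards and a fixed number of replicas per shard. The constants NUM_SHARDS and REPLICAS_PER_SHARD and the routing function are illustrative choices, not a description of any specific database.

```python
import hashlib

NUM_SHARDS = 3
REPLICAS_PER_SHARD = 2

# shards[i][j] is replica j of shard i; each replica is a plain dict.
shards = [[{} for _ in range(REPLICAS_PER_SHARD)] for _ in range(NUM_SHARDS)]

def shard_for(key):
    """Stable hash, so the same key always maps to the same shard."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def write(key, value):
    """A write goes to one shard only, but to every replica of it."""
    for replica in shards[shard_for(key)]:
        replica[key] = value

def read(key, replica_index=0):
    """A read can be served by any single replica of the owning shard."""
    return shards[shard_for(key)][replica_index].get(key)

for name in ["Bob", "Jane", "Tara", "Tiara"]:
    write(f"user:{name}", name)

print(read("user:Jane", replica_index=1))
print([len(shard[0]) for shard in shards])  # how the keys spread over shards
```

Each shard handles only its own keys, so both reads and writes are spread out. Replicas add places to read from, but every write still has to reach all replicas of its shard.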
CAP theorem
The CAP theorem states that a distributed system can deliver only two of
three desired characteristics: Consistency, Availability, and Partition
tolerance (the ‘C,’ ‘A’ and ‘P’ in CAP).

• Consistency: Consistency means that all clients see the same data at the same time, no matter
which node they connect to. For this to happen, whenever data is written to one node, it must
be instantly forwarded or replicated to all the other nodes in the system before the write is
deemed ‘successful.’
• Availability: Availability means that any client making a request for data gets a response,
even if one or more nodes are down. Put another way, all working nodes in the
distributed system return a valid response for any request, without exception.
• Partition tolerance: A partition is a communications break within a distributed system, that is,
a lost or temporarily delayed connection between two nodes. Partition tolerance means that the
cluster must continue to work despite any number of communication breakdowns between
nodes in the system.
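
The choice a system faces during a partition can be illustrated with a toy two-node cluster. The Cluster class and its "CP"/"AP" mode switch are hypothetical simplifications: a consistency-first cluster refuses writes it cannot replicate, while an availability-first cluster accepts them and lets the nodes diverge.

```python
class Node:
    def __init__(self):
        self.data = {}

class Cluster:
    def __init__(self, mode):
        self.mode = mode              # "CP" or "AP" (hypothetical switch)
        self.nodes = [Node(), Node()]
        self.partitioned = False      # True when the nodes cannot talk

    def write(self, node_index, key, value):
        """Return True if the write was accepted."""
        if self.partitioned:
            if self.mode == "CP":
                return False          # refuse: the other node is unreachable
            self.nodes[node_index].data[key] = value  # AP: accept locally
            return True
        for node in self.nodes:       # healthy network: replicate everywhere
            node.data[key] = value
        return True

    def read(self, node_index, key):
        return self.nodes[node_index].data.get(key)

cp, ap = Cluster("CP"), Cluster("AP")
for cluster in (cp, ap):
    cluster.write(0, "x", 1)
    cluster.partitioned = True        # simulate a communications break

cp_accepted = cp.write(0, "x", 2)
ap_accepted = ap.write(0, "x", 2)
print(cp_accepted, cp.read(0, "x"), cp.read(1, "x"))
print(ap_accepted, ap.read(0, "x"), ap.read(1, "x"))
```

The CP cluster stays consistent but rejects the request; the AP cluster answers every request but its two nodes temporarily disagree.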
ACID Property
• ACID transactions are a very important feature that most relational
databases have had for decades. They enable you to combine a series of
different database operations into one transaction that provides the
following four guarantees:
• Atomicity - that the operations will all either succeed or fail as a single
unit;
• Consistency - that they won’t violate certain constraints you defined for
the data as a whole;
• Isolation - that each operation is hidden from view until the whole
transaction is complete;
• Durability - that all changes to the data are safely persisted.
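
Atomicity and consistency can be demonstrated with Python's built-in sqlite3 module, used here only as a convenient relational database. The accounts table, the non-negative balance constraint and the transfer function are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"   # a consistency constraint
)
conn.execute("INSERT INTO accounts VALUES ('bob', 100), ('jane', 50)")
conn.commit()

def transfer(src, dst, amount):
    """Move money in one transaction; any failure rolls back both updates."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

def balance(name):
    row = conn.execute(
        "SELECT balance FROM accounts WHERE name = ?", (name,)).fetchone()
    return row[0]

ok = transfer("bob", "jane", 30)       # both updates are applied
failed = transfer("bob", "jane", 500)  # debit violates the constraint
print(ok, failed, balance("bob"), balance("jane"))
```

In the failed transfer the credit to jane had already been executed, yet it is undone together with the rejected debit: the two operations succeed or fail as a single unit.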
BASE: an alternative to ACID

• When it comes to NoSQL databases, data consistency models can
sometimes be strikingly different from those used by relational databases
(as well as quite different from those of other NoSQL stores).
• The two most common consistency models are known by the acronyms
ACID and BASE. They are often pitted against each other, but both
consistency models come with advantages and disadvantages, and
neither is always a perfect fit.
• In the NoSQL database world, ACID transactions are less fashionable, as
some databases have loosened the requirements for immediate
consistency, data freshness and accuracy in order to gain other benefits,
like scale and resilience.
• Here’s how the BASE acronym breaks down:
• Basic Availability: The database appears to work most of the time.
• Soft-state: Stores don’t have to be write-consistent, nor do different
replicas have to be mutually consistent all the time.
• Eventual consistency: Stores exhibit consistency at some later point (e.g.,
lazily at read time).
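
Soft state and eventual consistency can be sketched with three replicas that accept a write locally and are reconciled later. The version numbers and the anti_entropy function (a simple "highest version wins" sync) are assumptions for illustration; real stores use a variety of reconciliation schemes.

```python
class Replica:
    def __init__(self):
        self.data = {}  # key -> (version, value)

replicas = [Replica(), Replica(), Replica()]

def write(key, value, version, replica_index=0):
    """Accept the write on one replica only; the others are stale for now."""
    replicas[replica_index].data[key] = (version, value)

def read(key, replica_index):
    """Read whatever this replica currently holds (possibly stale)."""
    entry = replicas[replica_index].data.get(key)
    return None if entry is None else entry[1]

def anti_entropy():
    """Background sync: every replica adopts the highest version per key."""
    latest = {}
    for replica in replicas:
        for key, (version, value) in replica.data.items():
            if key not in latest or version > latest[key][0]:
                latest[key] = (version, value)
    for replica in replicas:
        replica.data.update(latest)

write("cart", ["pen"], version=1, replica_index=0)
before = read("cart", replica_index=2)   # replica 2 has not seen the write
anti_entropy()
after = read("cart", replica_index=2)    # now consistent with replica 0
print(before, after)
```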
Sharding

• Sharding is a partitioning pattern for the NoSQL age.
• Sharding is a method of splitting and storing a single logical dataset in
multiple databases.
• Sharding is also referred to as horizontal partitioning. The distinction of
horizontal vs vertical comes from the traditional tabular view of a
database.
• A database can be split vertically, by storing different tables and columns in
separate databases, or horizontally, by storing rows of the same table in
multiple database nodes.
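
The horizontal vs vertical distinction can be shown on a small table held as a list of rows. The customers table, the id-modulo placement rule and the column groups are hypothetical choices for this sketch.

```python
customers = [
    {"id": 1, "name": "Bob", "phone": "(123) 456-7890"},
    {"id": 2, "name": "Jane", "phone": "(234) 567-8901"},
    {"id": 3, "name": "Tara", "phone": "(345) 678-9012"},
    {"id": 4, "name": "Tiara", "phone": "(456) 789-0123"},
]

def split_horizontally(rows, num_nodes):
    """Each node gets whole rows; the row id decides the node."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[row["id"] % num_nodes].append(row)
    return nodes

def split_vertically(rows, column_groups):
    """Each database gets every row, but only some columns (plus the id)."""
    return [
        [{"id": row["id"], **{col: row[col] for col in group}} for row in rows]
        for group in column_groups
    ]

horizontal = split_horizontally(customers, 2)
vertical = split_vertically(customers, [["name"], ["phone"]])

print([[row["id"] for row in node] for node in horizontal])
print(vertical[0][0], vertical[1][0])
```

In the horizontal split each node holds complete rows for a subset of customers; in the vertical split each database holds all customers but only part of each row.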
