Module 7 NoSQL
Introduction to NoSQL
NoSQL (Not Only SQL) databases are designed to handle large volumes of unstructured and semi-structured
data. Unlike traditional relational databases that rely on fixed schemas and tables, NoSQL databases offer
flexible data models and support horizontal scaling. This makes them well suited to modern applications that
require high performance, scalability, and the ability to manage diverse data types efficiently.
Key Features of NoSQL Databases
● Dynamic schema: Data can be reshaped to meet new requirements without migrating or changing a
schema.
● Horizontal scalability: Capacity grows by adding more nodes to an existing cluster, which provides
storage for larger datasets and handles higher traffic by distributing the load across multiple
servers.
● Document-based: Data is stored in flexible, semi-structured formats such as JSON or BSON (e.g.,
MongoDB).
● Key-value-based: Data is stored as pairs of keys and values, giving a simple but fast access pattern
(e.g., Redis).
● Column-based: Data is organized into columns instead of rows (e.g., Cassandra).
● Distributed and high availability: They are designed to be highly available and to automatically
handle node failures and data replication across multiple nodes in a database cluster.
● Flexibility: Allow developers to store and retrieve data in a flexible and dynamic manner, with
support for multiple data types and changing data structures.
● Performance: Well suited to big data, real-time analytics, and high-volume applications.
Limitations of NoSQL Databases
● Lack of standardization: NoSQL systems can differ vastly from one another, which makes it harder
to choose the right one for a specific use case.
● Lack of ACID compliance: NoSQL databases may not provide full ACID guarantees, in particular
strong consistency, which is a disadvantage for applications that need strict data integrity.
● Narrow focus: They are great for storage but often lack functionality such as transaction
management, at which relational databases excel.
● Limited complex query support: Many are not designed to handle complex queries, so they can be a
poor fit for applications that require complex data analysis or reporting.
● Lack of maturity: Being relatively new, NoSQL databases may not have the reliability, security, and
feature set of traditional relational databases.
● Management complexity: For large datasets, maintaining a NoSQL database can be considerably
more complicated than managing a relational database.
● Limited GUI tools: While some NoSQL databases offer GUI tools (MongoDB, for example, has
MongoDB Compass), not all of them provide flexible or user-friendly GUI tools.
SQL vs. NoSQL:
Feature     SQL                          NoSQL
Examples    MySQL, PostgreSQL, Oracle    MongoDB, Cassandra, Redis
Popular NoSQL Databases & Their Use Cases
Database     Type                   Use Cases
Cassandra    Column-Family Store    Big data, high availability systems
Use of NoSQL
● Big Data Applications: Efficiently stores and processes massive amounts of unstructured and
semi-structured data.
● Real-Time Analytics: Supports fast queries and analysis for use cases like recommendation engines
or fraud detection.
● Scalable Web Applications: Handles high traffic and large user bases by scaling horizontally across
servers.
● Flexible Data Storage: Manages diverse data formats (JSON, key-value, documents, graphs) without
rigid schemas.
Reasons to Choose NoSQL
A growing business faces many challenges and opportunities, so it needs to plan for the future. Tools and
technologies that fit your application today may not fit it tomorrow. Choosing the right database is part of
designing the application, and it is a hard decision for organizations because it reflects how well the
application is architected. Suppose you choose a database based only on the current situation and a small
number of users. What happens after a couple of years? If the user base grows, you will start facing
scalability issues and other problems on your website. If the site cannot handle that growth, the business will
suffer and may even fail.
Developers have long argued about which database is best suited to a given application. Let us see why you
might choose NoSQL. Before that, let's look at how to choose between relational and non-relational
databases.
Relational or non-relational?
Both relational and non-relational databases can store information. The difference lies in how they are built,
the kind of information they store, and how they store it. In this article, we discuss some scenarios where you
might choose a non-relational database over a relational one for your application. We will cover some features
of NoSQL, but remember that no single technology or database fits every case. Each one has pros and cons.
To make a choice, you first need to ask a few questions about the needs of your application. Answering them
will help you decide whether NoSQL is the best fit for your application or not.
● Can your application support your projected user volume growth?
● Do you need to scale your application to cope with user demand and activity?
● How much money and time can be saved with 0% downtime?
● Would your application benefit from rapid development cycles? (flexible data model)
● Does your application generate huge amounts of data?
Why NoSQL?
For over forty years, relational databases have been the go-to solution for data storage. Data in a relational
database is highly organized, much like a phone book, is queried with Structured Query Language (SQL), and
is designed for reliable transactions. However, as applications grow and need to handle millions or even
billions of queries in real time, relational databases start to face scalability issues. Large websites like
Facebook, Google, and Amazon generate massive amounts of data and queries, and relational databases
struggle to keep up. This is where NoSQL comes into play.
NoSQL databases are designed to address two primary concerns:
● High speed operations
● Flexibility in data storage
These features make NoSQL ideal for handling massive amounts of unstructured data where the data
requirements aren't clearly defined at the start. Now, let's discuss five important features of NoSQL databases
that make them a great option for large-scale applications.
Why Choose NoSQL for Large-Scale Applications
1. Multi-Model
Relational databases require a fixed schema with tables and columns, and they need adjustments when
requirements change. This can result in the need for new columns, additional relationships, and coordination
with database administrators. NoSQL databases, on the other hand, provide much more flexibility. They don't
require predefined schemas and allow you to store different types of data together.
● Flexible Schema: NoSQL allows changes as your application evolves.
● Agile Development: Ideal for fast-paced, agile projects that need quick implementation.
● Handle Various Data Types: NoSQL lets you add new data types without restructuring the entire
database.
2. Easily Scalable
The primary reason to choose NoSQL is its scalability. Relational databases can be scaled, but doing so is
often complicated and costly: it typically means upgrading to a bigger server or "sharding" (splitting data
into smaller parts across multiple servers), which can lead to downtime.
● Effortless Scaling: Many NoSQL databases, such as Cassandra, use a masterless, peer-to-peer
architecture.
● Easy Expansion: New servers can be added to the cluster quickly with minimal downtime.
● Performance Boost: Spreading data across more nodes improves performance and read/write
throughput.
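To make the idea of spreading data across servers concrete, here is a minimal sketch of hash-based key placement. It is an illustration under stated assumptions, not how any particular database does it: the node names and the `shard_for` and `distribute` helpers are hypothetical, and simple modulo placement is used for brevity.

```python
import hashlib


def shard_for(key: str, nodes: list[str]) -> str:
    """Pick a node for a key by hashing the key (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(keys: list[str], nodes: list[str]) -> dict[str, list[str]]:
    """Group keys by the node that would store them."""
    placement: dict[str, list[str]] = {node: [] for node in nodes}
    for key in keys:
        placement[shard_for(key, nodes)].append(key)
    return placement


# Hypothetical keys and node names, used only for illustration.
keys = [f"user:{i}" for i in range(1000)]
three_nodes = distribute(keys, ["node-a", "node-b", "node-c"])
four_nodes = distribute(keys, ["node-a", "node-b", "node-c", "node-d"])

# Adding a node lowers the number of keys each node has to hold.
print({node: len(stored) for node, stored in three_nodes.items()})
print({node: len(stored) for node, stored in four_nodes.items()})
```

With modulo placement, adding a node moves most keys to a different node. Production systems usually avoid that reshuffling with techniques such as consistent hashing, which this sketch does not implement.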
3. Distributed
NoSQL databases are designed to work on a global scale by distributing data across multiple locations,
including various data centers or cloud regions. This distribution improves both write and read operations,
which is something relational databases struggle with since they are usually centralized.
● Global Distribution: Data is distributed across multiple locations for greater reliability and
accessibility.
● Continuous Availability: NoSQL databases are built to keep systems running smoothly, even if one
node fails.
4. Redundancy and Zero Downtime
One major concern in building applications is the risk of hardware failure. NoSQL databases are designed to
address this risk at the architectural level to provide high availability. For example, Cassandra uses a
gossip-based failure detector to notice failed nodes, and Riak keeps serving requests during network
partitions and repairs divergent replicas afterwards.
● Multiple Data Copies: Data is stored on several nodes, ensuring access even if one node fails.
● Zero Downtime: Replication lets your application remain available, even during hardware
failures.
● Built-In Redundancy: No need for developers to create custom redundant solutions.
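The effect of keeping multiple copies can be sketched with a toy in-memory store. This is a simplified illustration, not the replication protocol of Cassandra, Riak, or any other product: the `ReplicatedStore` class, its method names, and the node names are all hypothetical, and it assumes the replication factor does not exceed the number of nodes.

```python
class ReplicatedStore:
    """Toy store that writes each key to several nodes (hypothetical design)."""

    def __init__(self, node_names, replication_factor=3):
        self.nodes = {name: {} for name in node_names}
        self.alive = {name: True for name in node_names}
        self.replication_factor = replication_factor

    def _replicas(self, key):
        # Deterministic placement: start at a node derived from the key's
        # bytes, then take the next nodes in sorted order.
        names = sorted(self.nodes)
        start = sum(key.encode("utf-8")) % len(names)
        return [names[(start + i) % len(names)]
                for i in range(self.replication_factor)]

    def put(self, key, value):
        # Write to every replica that is currently reachable.
        for name in self._replicas(key):
            if self.alive[name]:
                self.nodes[name][key] = value

    def get(self, key):
        # Read from the first live replica that holds the key.
        for name in self._replicas(key):
            if self.alive[name] and key in self.nodes[name]:
                return self.nodes[name][key]
        raise KeyError(key)

    def fail(self, name):
        """Simulate a hardware failure of one node."""
        self.alive[name] = False


store = ReplicatedStore(["n1", "n2", "n3", "n4"], replication_factor=3)
store.put("cart:42", ["book", "pen"])

# Take down the first replica of the key; the data is still readable.
store.fail(store._replicas("cart:42")[0])
print(store.get("cart:42"))  # ['book', 'pen']
```

A real system also has to bring a recovered node back up to date and resolve conflicting writes, which this sketch leaves out.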
5. Big Data Applications
NoSQL databases are optimized for handling large volumes of data quickly, which makes them a strong
choice for big data applications. They help keep data access from becoming a bottleneck, so your application
can keep running smoothly even when dealing with vast amounts of data.
● Handles Massive Data: NoSQL is designed to scale and process large data sets.
● Optimized Performance: Helps avoid data bottlenecks in fast, high-volume environments.
While NoSQL offers many advantages, it isn't the right fit for every application. For certain types of data,
SQL may still be a better choice.
● Transactional Data: If your application is focused on transactional data, SQL is usually the better
choice. SQL databases are designed for processing and managing transactions reliably.
● Analytical Data: For applications that handle very large amounts of loosely structured analytical
data, NoSQL is often more suitable. A traditional relational database on a single server can
struggle with datasets of that size.
The inherent trade-offs in networked shared-data system design make it very difficult to create a dependable and
effective system. The CAP theorem, or CAP principle, is a central foundation for comprehending these
trade-offs in distributed systems. The CAP theorem emphasizes the limitations that system designers have while
addressing distributed data replication. It states that only two of the three properties—consistency, availability,
and partition tolerance—can be concurrently attained by a distributed system.
Developers must carefully balance these attributes according to their particular application demands because of
this underlying restriction. Designers may decide which qualities to prioritize to obtain the best performance
and reliability for their systems by knowing the CAP theorem. This article will provide a thorough analysis of
all the properties given in the CAP theorem, investigate the associated trade-offs, and talk about how these ideas
relate to distributed systems in the real world.
The CAP theorem is a fundamental concept in distributed systems theory that was first proposed by Eric Brewer
in 2000 and subsequently proved by Seth Gilbert and Nancy Lynch in 2002. It asserts that the following three
properties cannot all be guaranteed at the same time in any distributed data system:
1. Consistency
Consistency means that all the nodes (databases) in a network hold the same copy of a replicated data item, so
that it is visible to every transaction. It guarantees that every node in a distributed cluster returns the same,
most recent, successful write. In other words, every client has the same view of the data. There are various
consistency models. Consistency in CAP refers to linearizability (atomic consistency), a very strong form of
consistency.
Note that Consistency means different things in ACID and in CAP. In CAP, it refers to the consistency of the
values in different copies of the same data item in a replicated distributed system. In ACID, it refers to the
fact that a transaction will not violate the integrity constraints specified on the database schema.
For example, a user checks his account balance and sees that he has 500 rupees. He spends 200 rupees on
some products, so 200 must be deducted, changing his account balance to 300 rupees. This change must be
committed and communicated to all other databases that hold this user's details. Otherwise there will be an
inconsistency: another database might still show his account balance as 500 rupees, which is not true.
2. Availability
Availability means that each read or write request for a data item will either be processed successfully or will
receive a message that the operation cannot be completed. Every non-failing node returns a response to all
read and write requests in a reasonable amount of time. The key word here is "every". In simple terms, every
node (on either side of a network partition) must be able to respond in a reasonable amount of time.
For example, user A is a content creator with 1000 other users subscribed to his channel. Another user B, who
is far away from user A, tries to subscribe to user A's channel. Since the distance between the two users is
large, they are connected to different database nodes of the social media network. If the distributed system
follows the principle of availability, user B must be able to subscribe to user A's channel.
3. Partition Tolerance
Partition tolerance means that the system can continue operating even if the network connecting the nodes has a
fault that results in two or more partitions, where the nodes in each partition can only communicate with each
other. That is, the system continues to function and upholds its consistency guarantees in spite of network
partitions. Network partitions are a fact of life. Distributed systems that guarantee partition tolerance can
recover gracefully once the partition heals.
For example, consider the same social media network, where two users are trying to find the subscriber count
of a particular channel. Due to a technical fault, a network outage occurs and the second database, to which
user B is connected, loses its connection to the first database. The subscriber count is then shown to user B
from a replica of the data in database 1 that was backed up before the outage. Hence the distributed system
is partition tolerant.
The CAP theorem states that distributed databases can have at most two of the three properties: consistency,
availability, and partition tolerance. As a result, database systems prioritize only two properties at a time.
The Trade-Offs in the CAP Theorem
The CAP theorem implies that a distributed system can only provide two out of three properties, which gives
three possible combinations.
1. CA (Consistency and Availability)
These types of systems always accept a user's request to view or modify data, and they always respond with
data that is consistent across all the database nodes of a large distributed network.
However, such distributed systems are not realizable in the real world, because when a network failure occurs
there are only two options: either return old data that was replicated before the failure, or refuse access to
the existing data. If we choose the first option, the system remains available but gives up consistency. If we
choose the second, it remains consistent but gives up availability.
Combining consistency and availability is therefore not possible in a distributed system. To achieve CA, the
system has to be monolithic: when one user updates the state of the system, all other users accessing it see the
change, so consistency is maintained, and since all users are connected to a single system, it is also
available. Such systems are generally not preferred when distributed computing is required, which is only
possible when either consistency or availability is sacrificed for partition tolerance.
Example databases: MySQL
2. AP (Availability and Partition Tolerance)
These types of systems are distributed in nature. They ensure that a user's request to view or modify the data
held on the database nodes is not dropped and is still processed in the presence of a network partition.
The system prioritizes availability over consistency and may respond with stale data that was replicated from
other nodes before a technical failure created the partition. Such design choices are common when building
social media websites such as Facebook, Instagram, and Reddit, and online content websites such as YouTube,
blogs, and news sites, where strict consistency is usually not required. For these services, unavailability is
the bigger problem: it costs the company money, since users may move to another platform.
The system can be distributed across multiple nodes and is designed to operate reliably even in the face of
network partitions.
Example databases: Amazon DynamoDB, Apache Cassandra.
3. CP (Consistency and Partition Tolerance)
These types of systems are distributed in nature. In the presence of a network partition, they drop a user's
request to view or modify the data held on the database nodes rather than respond with inconsistent data.
The system prioritizes consistency over availability and does not allow users to read crucial data from a stored
replica that was backed up before the network partition occurred. Consistency is chosen over availability for
critical applications in which the latest data plays an important role, such as stock market, ticket booking,
and banking applications, where showing old data to users causes problems.
For example, suppose a train ticket booking application has one seat left to book. A replica of the database
is created and sent to other nodes of the distributed system. A network outage then occurs, so a user connected
to the partitioned node fetches details from this replica. Meanwhile, a user connected to the unpartitioned
part of the network books the last remaining seat. The user connected to the partitioned node will still see one
seat available, so the data shown is inconsistent. It would have been better to show that user an error, making
the system unavailable to them but keeping the data consistent. Hence consistency is chosen in such scenarios.
Example databases: Apache HBase, MongoDB, Redis.
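The difference between the AP and CP choices can be sketched with a toy two-replica model of the seat-booking scenario. This is a minimal illustration under simplifying assumptions, not how any real database handles partitions: the `Node` class, the `Unavailable` exception, and the `replicate` and `run` helpers are hypothetical names.

```python
class Unavailable(Exception):
    """Raised by a CP node that refuses to answer during a partition."""


class Node:
    def __init__(self, mode):
        self.mode = mode          # "AP" or "CP"
        self.data = {}
        self.partitioned = False  # True when cut off from the other replica

    def read(self, key):
        # A CP node prefers an error over a possibly stale answer.
        if self.partitioned and self.mode == "CP":
            raise Unavailable("cannot confirm the latest value during a partition")
        # An AP node answers from its local copy, which may be stale.
        return self.data[key]


def replicate(source, target):
    """Copy data to the other replica unless the link between them is down."""
    if not (source.partitioned or target.partitioned):
        target.data.update(source.data)


def run(mode):
    primary, remote = Node(mode), Node(mode)
    primary.data["seats_left"] = 1
    replicate(primary, remote)       # both replicas agree: 1 seat left

    remote.partitioned = True        # network outage isolates the remote node
    primary.data["seats_left"] = 0   # the last seat is booked on the primary
    replicate(primary, remote)       # has no effect while partitioned

    try:
        return remote.read("seats_left")
    except Unavailable:
        return "error"


print(run("AP"))  # 1      (available, but stale)
print(run("CP"))  # error  (consistent, but unavailable)
```

In the AP run the isolated replica still reports one seat, which is the inconsistency described in the booking example. In the CP run the same read fails instead.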
Types of NoSQL Databases
NoSQL databases can be classified into four main types, based on their data storage and retrieval methods:
1. Document-based databases
2. Key-value stores
3. Column-oriented databases
4. Graph-based databases
Each type has unique advantages and use cases, making NoSQL a preferred choice for big data applications,
real-time analytics, cloud computing and distributed systems.
1. Document-Based Database
A document-based database is a nonrelational database. Instead of storing data in rows and columns
(tables), it stores data in documents. A document database stores data as JSON, BSON, or XML documents.
Documents can be stored and retrieved in a form that is much closer to the data objects used in applications,
which means less translation is required to use the data in an application. In a document database,
particular elements can be indexed so that they can be queried faster.
A collection is a group of documents with similar contents. Documents in the same collection are not required
to share an identical schema, because document databases have a flexible schema.
● Flexible schema: Documents in the database have a flexible schema, which means they need not all
follow the same schema.
● Faster creation and maintenance: Documents are easy to create and need minimal maintenance once
created.
● No foreign keys: There is no enforced relationship between two documents, so documents can be
independent of one another and a document database does not require foreign keys.
● Open formats: Documents are built using open formats such as XML and JSON.
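A small in-memory sketch can illustrate a collection whose documents have different fields, plus an index on one field. It is a toy model and not the API of MongoDB or any other document database: the `Collection` class, its methods, and the sample documents are hypothetical, and indexed field values are assumed to be hashable.

```python
import json


class Collection:
    """Toy document collection with optional per-field indexes."""

    def __init__(self):
        self.documents = {}   # document id -> document (a dict)
        self.indexes = {}     # field name -> {value -> set of document ids}
        self._next_id = 1

    def insert(self, document):
        doc_id = self._next_id
        self._next_id += 1
        self.documents[doc_id] = dict(document)
        # Keep existing indexes up to date for the new document.
        for field, index in self.indexes.items():
            if field in document:
                index.setdefault(document[field], set()).add(doc_id)
        return doc_id

    def create_index(self, field):
        index = {}
        for doc_id, doc in self.documents.items():
            if field in doc:
                index.setdefault(doc[field], set()).add(doc_id)
        self.indexes[field] = index

    def find(self, field, value):
        if field in self.indexes:
            # Indexed lookup: jump straight to the matching ids.
            ids = sorted(self.indexes[field].get(value, set()))
            return [self.documents[i] for i in ids]
        # No index: scan every document.
        return [d for d in self.documents.values()
                if field in d and d[field] == value]


users = Collection()
# Documents in one collection do not need identical fields (made-up data).
users.insert({"name": "Anay", "city": "Pune"})
users.insert({"name": "Bhagya", "city": "Pune", "tags": ["admin"]})
users.insert({"name": "Erica", "email": "erica@example.com"})

users.create_index("city")
print(json.dumps(users.find("city", "Pune"), indent=2))
```

The third document has no `city` field at all, yet it sits in the same collection without any schema change.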
2. Key-Value Stores
A key-value store is the simplest form of NoSQL (nonrelational) database. Every data element in the database
is stored as a key-value pair, and the data is retrieved using the unique key assigned to each element. The
values can be simple data types such as strings and numbers, or complex objects. A key-value store is like a
relational table with only two columns: the key and the value.
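The access pattern of a key-value store can be sketched in a few lines on top of a Python dictionary. This is only an illustration of the put/get/delete model: the `KeyValueStore` class and the sample keys are hypothetical and do not mirror the command set of Redis or any other product.

```python
class KeyValueStore:
    """Toy key-value store: every value is reached only through its key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        """Remove a key and report whether it existed."""
        if key in self._data:
            del self._data[key]
            return True
        return False


kv = KeyValueStore()
kv.put("session:1001", "logged-in")                  # simple string value
kv.put("user:7", {"name": "Dilip", "visits": 12})    # complex object value

print(kv.get("session:1001"))     # logged-in
print(kv.get("user:7")["name"])   # Dilip
print(kv.delete("session:1001"))  # True
print(kv.get("session:1001"))     # None
```

There is no way to ask for "all users with more than 10 visits" without scanning every value, which is the trade-off for the simple and fast lookup by key.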
4. Graph-Based Databases
Graph-based databases focus on the relationships between elements. Data is stored as nodes, and the
connections between the nodes are called links or relationships (edges), making these databases ideal for
complex relationship-based queries.
● Data is represented as nodes (objects) and edges (connections).
● Fast graph traversal algorithms help retrieve relationships quickly.
● Used in scenarios where relationships are as important as the data itself.
● Relationship-Centric Storage: Perfect for social networks, fraud detection, recommendation engines.
● Real-Time Query Processing: Queries return results almost instantly.
● Schema Flexibility: Easily adapts to evolving relationship structures.
Popular Graph Databases & Use Cases
Feature     Document-Based      Key-Value Store    Column-Oriented     Graph-Based
Examples    MongoDB, CouchDB    Redis, DynamoDB    Cassandra, HBase    Neo4j, Amazon Neptune
A graph database (GDB) is a database that uses graph structures for storing data. It uses nodes, edges, and
properties instead of tables or documents to represent and store data. The edges represent relationships between
the nodes. This helps in retrieving data more easily and, in many cases, with one operation. Graph databases are
commonly classified as NoSQL databases. Examples include Neo4j, Amazon Neptune, and ArangoDB.
Representation:
The graph database is based on graph theory. The data is stored in the nodes of the graph and the relationship
between the data is represented by the edges between the nodes.
When do we need a Graph Database?
A graph database is useful when the data contains many-to-many relationships, such as friends of friends.
It is also useful when the equivalent query in a relational database would be very complex.
For example, a profile holds some specific information, but the major selling point is the relationships
between different profiles, that is, how people are connected within a network.
In the same way, a graph database may hold many user data elements, but the relationships between them are
the deciding factor for all the data elements stored in it.
When you add many relationships to a relational database, the data sets become huge and the queries over them
become much more complex than usual. Graph databases, however, are designed specifically for this purpose,
and relationships can be queried with ease.
Why do Graph Databases matter?
Because graphs are good at handling relationships, some databases store data in the form of a graph.
For example, we have a social network in which five friends are all connected. These friends are Anay, Bhagya,
Chaitanya, Dilip, and Erica. A relational database that stores their personal information may look something
like this:
id    first_name    last_name    email              phone
1     Anay          Agarwal      [email protected]  555-111-5555
2     Bhagya        Kumar        [email protected]  555-222-5555
3     Chaitanya     Nayak        [email protected]  555-333-5555
4     Dilip         Jain         [email protected]  555-444-5555
5     Erica         Emmanuel     [email protected]  555-555-5555
Now, we will also need another table to capture the friendship/relationship between users/friends. Our
friendship table will look something like this:
user_id friend_id
1 2
1 3
1 4
1 5
2 1
2 3
2 4
2 5
3 1
3 2
3 4
3 5
4 1
4 2
4 3
4 5
5 1
5 2
5 3
5 4
We will avoid going deep into database theory (primary keys and foreign keys). Instead, just assume that the
friendship table uses the ids of both friends. Assume that our social network has a feature that allows every
user to see the personal information of his or her friends. So, if Chaitanya requests this information, she
needs information about Anay, Bhagya, Dilip, and Erica. We will approach this problem the traditional way
(with a relational database). We must first identify Chaitanya's id in the users table:
3     Chaitanya     Nayak        [email protected]  555-333-5555
Now, we'd look for all tuples in the friendship table where the user_id is 3. Resulting relation would be
something like this:
user_id friend_id
3 1
3 2
3 4
3 5
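The relational lookup described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, assuming a simplified users table that keeps only the id and first name. The table names, the index name, and the first_name column are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT)")
conn.execute("CREATE TABLE friendship (user_id INTEGER, friend_id INTEGER)")

names = ["Anay", "Bhagya", "Chaitanya", "Dilip", "Erica"]
conn.executemany(
    "INSERT INTO users (id, first_name) VALUES (?, ?)",
    list(enumerate(names, start=1)),
)

# Everyone is friends with everyone else, as in the friendship table above.
pairs = [(u, f) for u in range(1, 6) for f in range(1, 6) if u != f]
conn.executemany("INSERT INTO friendship VALUES (?, ?)", pairs)

# An index on user_id lets the database search instead of scanning all rows.
conn.execute("CREATE INDEX idx_friendship_user ON friendship (user_id)")

rows = conn.execute(
    """
    SELECT f.first_name
    FROM users AS u
    JOIN friendship AS fr ON fr.user_id = u.id
    JOIN users AS f ON f.id = fr.friend_id
    WHERE u.first_name = ?
    ORDER BY f.id
    """,
    ("Chaitanya",),
).fetchall()

friends = [row[0] for row in rows]
print(friends)  # ['Anay', 'Bhagya', 'Dilip', 'Erica']
```

The query needs two joins: one to go from the user to the friendship rows, and one to go from each friend_id back to the users table.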
Now, let's analyse the time taken by this relational approach. Because the database keeps the rows ordered by
id, finding the matching rows takes roughly log(N) steps, where N is the number of tuples in the friendship
table (the number of relationships). So, in general, M such lookups have a time complexity of about M*log(N).
With a graph database, the cost no longer depends on searching the whole friendship table, because once we
have located Chaitanya in the database, each of her friends is only a single step away along an edge.
Here is how our query would be executed:
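As a rough illustration of this single-step lookup, the following sketch models the five friends as an in-memory adjacency list. It is a toy model, not the storage format or query language of Neo4j or any other graph database: the `graph` dictionary and the `friends` helper are hypothetical names chosen for this example.

```python
people = ["Anay", "Bhagya", "Chaitanya", "Dilip", "Erica"]

# Each node stores its own properties and direct references to its neighbours.
graph = {
    name: {"properties": {"first_name": name}, "friends": []}
    for name in people
}

# Add an edge between every pair of distinct people (all five are friends).
for a in people:
    for b in people:
        if a != b:
            graph[a]["friends"].append(b)


def friends(name):
    """Return a node's friends by following its edges directly."""
    node = graph[name]             # one lookup to locate the starting node
    return list(node["friends"])   # then one step per friend, no table search


print(friends("Chaitanya"))  # ['Anay', 'Bhagya', 'Dilip', 'Erica']
```

After the starting node is found, the work is proportional to the number of friends that node has, regardless of how many relationships exist in the rest of the network.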
Advantages: The graph model copes well with frequent schema changes, large volumes of data, real-time query
response requirements, and more intelligent data activation requirements.
Disadvantages: Graph databases aren't always the best solution for an application. The needs of the
application must be assessed before deciding on the architecture.
Limitations of Graph Databases:
● Graph databases may not be a better choice than other NoSQL variants.
● Scaling an application horizontally can lead to poor performance.
● They are not very efficient at updating all nodes that share a given parameter.