NoSQL Apache Cassandra
● Super columns were later deprecated because they were inefficient and hard to manage.
● Modern Cassandra uses composite (clustering) columns or collections (maps, sets, lists) to represent the same
hierarchical data in a cleaner way.
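As an illustration, here is a minimal CQL sketch (the table and column names are hypothetical) of how a clustering column plus a map collection can model the same kind of hierarchical data:

-- Hypothetical example: one partition per user, one row per event,
-- with a map collection holding nested attribute/value pairs.
CREATE TABLE user_activity (
    user_id       text,
    activity_time timestamp,
    attributes    map<text, text>,
    PRIMARY KEY (user_id, activity_time)
);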
Introduction to Apache Cassandra:
● Apache Cassandra is an open-source, distributed NoSQL database designed for handling large amounts of data
across many commodity servers without a single point of failure.
● Highly Scalable: Easily add more nodes without downtime.
● Fault Tolerant: Built to handle hardware failures with no downtime.
● Write-Optimized: Optimized for heavy write operations.
● Flexible Schema: Schema-less data model allows for flexible data representation.
● Eventual Consistency: Tunable consistency levels for applications that need high availability and fault
tolerance.
In essence, Cassandra is a great choice for applications that need to store vast amounts of data and ensure it’s always
available, even in the face of hardware failures, but can tolerate eventual consistency rather than strict consistency.
History of Apache Cassandra:
● Cassandra was initially designed at Facebook.
● The design of Cassandra was influenced by Amazon’s Dynamo, a highly available, distributed key-value
store used for Amazon’s retail operations. Dynamo provided concepts like eventual consistency,
replication, and partitioning, which Cassandra adopted and expanded on. While Dynamo was a key influence,
Cassandra also integrated ideas from Google’s Bigtable, notably its column-family data model and
SSTable-based storage.
Apache Cassandra was designed to meet these challenges with the following design objectives in mind:
● Full multi-primary database replication
● Global availability at low latency
● Scaling out on commodity hardware
● Linear throughput increase with each additional processor
● Online load balancing and cluster growth
● Partitioned key-oriented queries
● Flexible schema
Features of Apache Cassandra:
Write Optimized:
● Cassandra is optimized for write-heavy workloads. It handles massive write operations efficiently by
using memtables (in-memory structures) and SSTables (on-disk files) to store data in an append-only
fashion.
● Example:
○ Yahoo uses Cassandra to store time-series data related to user behavior and ad analytics. Their
system receives millions of events per second, and Cassandra is ideal for this because of its ability
to efficiently handle high write throughput, ensuring real-time analytics can occur.
Tunable Consistency:
● The Tunable Consistency feature in Apache Cassandra allows you to balance the trade-off between
consistency and availability depending on your application’s requirements. This flexibility is key to
Cassandra's ability to offer high availability and fault tolerance, as it enables users to control how strictly
data consistency is enforced in a distributed system.
Cassandra Query Language (CQL):
● Cassandra provides a SQL-like query language called CQL (Cassandra Query Language), which is familiar
to those who have worked with relational databases but is tailored to Cassandra's NoSQL model.
● Example:
○ Spotify relies on Cassandra to handle real-time user activity data, which includes listening history,
preferences, and user-generated playlists. Since the system needs to support millions of concurrent
users, performance and scalability are essential.
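A short sketch of what CQL looks like in practice; the listening_history table and its columns are hypothetical, loosely modeled on the use case above:

-- One partition per user, newest plays first.
CREATE TABLE listening_history (
    user_id   text,
    played_at timestamp,
    track_id  text,
    PRIMARY KEY (user_id, played_at)
) WITH CLUSTERING ORDER BY (played_at DESC);

INSERT INTO listening_history (user_id, played_at, track_id)
VALUES ('user42', toTimestamp(now()), 'track-9001');

SELECT track_id, played_at FROM listening_history
WHERE user_id = 'user42' LIMIT 10;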
Elastic Scalability:
● Elastic scalability is a core feature of Apache Cassandra that allows the database to scale dynamically
based on workload requirements, without requiring downtime or major changes to the system. This means
that you can add or remove nodes to the cluster as needed, and Cassandra will automatically rebalance
data to accommodate the new configuration.
Why use Apache Cassandra?
The following are some of the important reasons to use Apache Cassandra.
Keyspace
● A keyspace is a fundamental concept that represents a logical container for data. Keyspaces help in logically
organizing data into distinct namespaces. For example, a company might use different keyspaces for different
departments, such as sales_keyspace and marketing_keyspace, to keep their data isolated and manage it
separately.
Column Family
● A column family in Cassandra is a collection of sorted rows used to store related data; in CQL it is exposed as a table.
Column
● A column is the basic data structure in Cassandra. Each column consists of a name (key), a value, and a timestamp, and
the value can be of various types such as text, boolean, bigint, double, float, and so on.
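A minimal sketch tying these three concepts together; the sales_keyspace name comes from the example above, while the orders table and its columns are hypothetical:

-- Keyspace: a logical container (namespace) for tables.
CREATE KEYSPACE sales_keyspace
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- Column family (table): a collection of rows.
CREATE TABLE sales_keyspace.orders (
    order_id  uuid PRIMARY KEY,   -- column of type uuid
    customer  text,               -- column of type text
    amount    double,             -- column of type double
    placed_at timestamp           -- column of type timestamp
);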
Architecture of Cassandra:
The following are the important components of the Apache Cassandra architecture:
● Cluster
● Data center
● Node
● Commit log
● Mem-table
● SSTable
● Bloom filter
Components of Cassandra Architecture
There are the following components in the Cassandra architecture:
Node
A node is the place where data is stored. It is the basic component of Cassandra.
Data Center
A collection of nodes is called a data center. Many nodes are categorized as a data center.
Cluster
The cluster is the collection of many data centers.
Commit Log
Every write operation is written to the commit log. The commit log is used for crash recovery.
Mem-table
After data is written to the commit log, it is written to the mem-table. Data is held in the mem-table temporarily.
SSTable
When the mem-table reaches a certain threshold, data is flushed to an SSTable disk file.
Additional features of the Cassandra architecture are:
● Cassandra is designed such that it has no master or slave nodes.
● It has a ring-type architecture; that is, its nodes are logically distributed like a ring.
● Data is automatically distributed across all the nodes.
● Similar to HDFS, data is replicated across the nodes for redundancy.
● Data is kept in memory and lazily written to the disk.
● Hash values of the keys are used to distribute the data among nodes in the cluster.
● In a ring architecture, each node is assigned a token value.
● SSTable is a persistent, immutable data file format used to store data on disk. Once data is written to an
SSTable, it is not modified, ensuring efficient and reliable storage.
● SSTables store the data that has been flushed from the memtable, ensuring that data is not lost even if the
system restarts.
● During the flush operation, data from the memtable is sorted and written to an SSTable. This process involves
creating multiple files:
● Data File: Contains the actual data rows.
● Index File: Allows quick lookups by storing indexes that map keys to their positions in the data file.
● Bloom Filter: An auxiliary data structure used to quickly check if a key is present in the SSTable, reducing
unnecessary disk reads.
● Compaction: Over time, as more SSTables are created, they are periodically compacted. Compaction merges
multiple SSTables into a smaller number of larger SSTables, improving read performance and reclaiming disk
space.
Working of Bloom Filter:
● Bloom filter is stored as part of each SSTable (Sorted String Table) on disk.
● Each SSTable has its own Bloom filter. When an SSTable is created, a Bloom Filter is also created for that
SSTable.
● Bloom filter is used to quickly determine whether a data item is likely to be in a specific SSTable or not.
● It can return two types of results:
○ Definitely not present (no false negatives)
○ Possibly present (with a certain probability of false positives)
How Are Bloom Filters Used in Cassandra?
● Each element is hashed by multiple hash functions to generate several bit positions in the bit array.
Why are multiple hash functions used for the same record?
Bloom filters are a probabilistic data structure that allows Cassandra to determine one of two possible states: the
data definitely does not exist in the given file, or the data probably exists in the given file.
The primary goal of using multiple hash functions is to reduce the false positive rate, which is the probability that
the Bloom filter incorrectly indicates that an element is present when it is not.
● Single Hash Function: If only one hash function is used, each element is mapped to a single bit in the bit
array. This can lead to high false positive rates, especially if the bit array is relatively small compared to the
number of elements.
● Multiple Hash Functions: By using multiple hash functions, each element is mapped to several bits in the bit
array. This spreads the bits more evenly across the array, reducing the likelihood of collisions and thus the
false positive rate.
● For a given element (like key "A"), each hash function produces a different bit position in the bit array. These
positions are used to set the corresponding bits to 1.
Insertion into the Bloom Filter
Let’s insert the customer IDs "123", "456", and "789" into the Bloom Filter.
Hashing and Setting Bits:
1. Insert ID "123": Based on three different Hash functions, the key is hashed.
○ H1("123") -> Index 2
○ H2("123") -> Index 5
○ H3("123") -> Index 8
○ Set bits at indices 2, 5, and 8 to 1 in the Bloom Filter.
2. Insert ID "456":
○ H1("456") -> Index 1
○ H2("456") -> Index 4
○ H3("456") -> Index 7
○ Set bits at indices 1, 4, and 7 to 1 in the Bloom Filter.
3. Insert ID "789":
○ H1("789") -> Index 0
○ H2("789") -> Index 3
○ H3("789") -> Index 6
○ Set bits at indices 0, 3, and 6 to 1 in the Bloom Filter.
After inserting all three IDs, the Bloom Filter bit array looks like this: [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
Querying the Bloom Filter
Let’s check whether a new customer ID, say "1234", is in the Bloom Filter.
Hashing "1234" (suppose the three hash functions map it to indices 2, 5, and 8):
● Bit at Index 2: 1
● Bit at Index 5: 1
● Bit at Index 8: 1
Since all these bits are set to 1, the Bloom Filter indicates that "1234" might be in the set. However, it’s possible
that "1234" is not actually in the set, so a disk read might still be necessary to confirm its presence.
False Positive Example
To illustrate a false positive, suppose we check for an ID "0000" that was never added. The hash functions might
produce indices 2, 5, and 8.
The Bloom Filter indicates indices 2, 5, and 8 are all set to 1, so it will incorrectly tell us that "0000" might be in
the set, even though it was not. This is a false positive.
Exercise:
Assume we have a Bloom filter with a bit array of size 10 (i.e., 10 bits, all initialized to 0). It uses the following two hash
functions:
● h1(word) = (sum of the ASCII values of its characters) mod 10
● h2(word) = (product of the ASCII values of its characters) mod 10
Insert the word "cat" into the filter, then query the Bloom filter for the word "rat".
Solution:
Insertion Process
1. Insert "cat"
● Hash Function 1:
○ ASCII values: c = 99, a = 97, t = 116
○ Sum: 99 + 97 + 116 = 312
○ h1("cat") = 312 % 10 = 2
● Set bit at index 2 to 1.
● Hash Function 2:
○ Product: 99 * 97 * 116 = 1113948
○ h2("cat") = 1113948 % 10 = 8
● Set bit at index 8 to 1.
2. Query "rat"
● Hash Function 1:
○ Sum of ASCII values: 114 + 97 + 116 = 327
○ h1("rat") = 327 % 10 = 7
○ Check bit at index 7: It is 0.
● Hash Function 2:
○ Product of ASCII values: 114 * 97 * 116 = 1282728
○ h2("rat") = 1282728 % 10 = 8
○ Check bit at index 8: It is 1.
● Result: Since one of the bits (index 7) is 0, the Bloom filter correctly identifies that "rat" is definitely
not in the filter (true negative).
Easy Data Distribution:
Transparent Distribution of Data:
● Cassandra ensures that data is evenly distributed across all nodes without the user having to worry about node
locations. This is achieved through the following mechanisms:
○ Partitioning
○ Partition Key
○ Hashing
○ Token ring and Virtual Nodes (VNodes)
○ Replication
Data Partitioning and Hashing:
Partitioning:
Data is divided into partitions based on a partition key. Each partition is placed on nodes in the cluster.
Partition Key
The partition key decides where the data lives in the cluster. Rows with the same partition key go to the same node.
Hashing
● Cassandra applies a hash function (Murmur3 by default) on the partition key to convert it into a token.
This token determines the position of the data in the token ring.
Token Ring & Virtual Nodes (VNodes)
Each node (or virtual node) is assigned one or more token ranges on a logical ring; the token obtained by hashing the
partition key determines which node owns a given row. Virtual nodes are described in more detail below.
Replication
To ensure high availability, data is replicated across multiple nodes. The number of copies is controlled by the
Replication Factor (RF).
Example: With RF = 3, each piece of data is stored on 3 different nodes.
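In CQL, the replication factor is set per keyspace; a sketch with hypothetical keyspace and data-center names:

-- RF = 3 in a single data center:
CREATE KEYSPACE app_data
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- Or a per-data-center RF with NetworkTopologyStrategy:
CREATE KEYSPACE app_data_multidc
WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2};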
Example: Social Media Application
Imagine you’re building a social media application where users can post status updates. You need to design a Cassandra table to store
these status updates efficiently.
You create a table called status_updates to store the status updates for each user:
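A minimal sketch of such a table; only the table name and the use of user_id as the partition key come from this example, the remaining columns are hypothetical:

CREATE TABLE status_updates (
    user_id     text,        -- partition key: hashed to a token to choose the node
    posted_at   timestamp,   -- clustering column: orders a user's posts
    status_text text,
    PRIMARY KEY (user_id, posted_at)
) WITH CLUSTERING ORDER BY (posted_at DESC);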
1. Data Distribution: When you insert a new status update, Cassandra uses the user_id to determine where to store the data. The
user_id is hashed to produce a token, and Cassandra uses this token to map the data to a specific node in the cluster.
2. Grouping Data: All status updates for a given user_id will be stored on the same node. This means that when you query status
updates for a specific user, Cassandra can efficiently fetch all posts from the node where they are stored.
Virtual Nodes:
● Virtual nodes in a Cassandra cluster are also called vnodes.
● Vnodes can be defined for each physical node in the cluster.
● Each node in the ring can hold multiple virtual nodes.
● By default, each node has 256 virtual nodes.
● Virtual nodes help achieve finer granularity in the partitioning of
data, and data gets partitioned into each virtual node using the
hash value of the key.
● On adding a new node to the cluster, the virtual nodes on it get
equal portions of the existing data. So there is no need to
separately balance the data by running a balancer.
● The image depicts a cluster with four physical nodes. Each physical node in the cluster has four virtual nodes,
so there are 16 vnodes in the cluster.
Why do we need Virtual Nodes?
Without VNodes (Single Large Range)
● Each node owns one large contiguous token range. If that node needs to be repaired or rebuilt, by Cassandra’s
replication rules usually only one neighbor (say Node 4) has the replica for that entire range.
● So Node 4 must do all the repair work → slow + heavy load.
With VNodes (Many Small Ranges per Node)
● Each node owns many small token ranges (e.g., 256 random ranges across the ring).
● Example: Node 3 might own ranges like 100–200, 800–900, 1500–1600, etc.
● These small ranges overlap with different neighbors, not just one.
Instead of one node doing 100% of the repair, the workload is spread across multiple neighbors.
Token Generator:
● The token generator is used in Cassandra versions earlier than version 1.2 to assign a token to each node in the cluster.
In these versions, there was no concept of virtual nodes and only physical nodes were considered for distribution of
data.
● The token generator tool is used to generate a token for each node in the cluster based on the data centers and number
of nodes in each data center.
● A token in Cassandra is a 127-bit integer assigned to a node. Data partitioning is done based on the token of the nodes
as described earlier in this lesson.
● Starting from version 1.2 of Cassandra, vnodes are also assigned tokens and this assignment is done automatically so
that the use of the token generator tool is not required.
Example of Token Generator:
A token generator is an interactive tool which generates tokens for the topology specified. Let us now look at an
example in which the token generator is run for a cluster with 2 data centers.
[Token assignments generated for the nodes in DataCenter 1 and DataCenter 2]
The following diagram depicts a four-node cluster with token values of 0, 25, 50 and 75.
For a given key, a hash value is generated in the range of 1 to 100. Keys with hash values in the range 1 to 25 are stored on
the first node, 26 to 50 are stored on the second node, 51 to 75 are stored on the third node, and 76 to 100 are stored on the
fourth node. Please note that actual tokens and hash values in Cassandra are 127-bit positive integers.
Fault Tolerance:
Replication in Cassandra:
● One piece of data can be replicated to multiple
(replica) nodes, ensuring reliability and fault
tolerance. Cassandra supports the notion of a
replication factor (RF), which describes how many
copies of your data should exist in the database. So
far, our data has only been replicated to one replica
(RF = 1).
● Cassandra allows replication based on nodes, racks,
and data centers, unlike HDFS that allows
replication based on only nodes and racks.
● Replication across data centers guarantees data
availability even when a data center is down.
SimpleStrategy in Cassandra
SimpleStrategy places the first replica on the node selected by the partitioner and the remaining replicas on the next
nodes clockwise around the ring, without considering data center or rack topology. For multi-data-center deployments,
NetworkTopologyStrategy is used instead; it relies on a snitch to learn the cluster topology.
PropertyFileSnitch:
Configuration File: PropertyFileSnitch relies on a properties file (cassandra-topology.properties) to define the data
center and rack for each node. This file maps IP addresses or hostnames of nodes to their respective data centers and
racks.
File Location and Format: The properties file is typically located in the Cassandra configuration directory (e.g.,
/etc/cassandra/). It has a simple key-value format where each line specifies the data center and rack information for
a node.
Example:
# cassandra-topology.properties
192.168.1.1=dc1:rack1
192.168.1.2=dc1:rack1
192.168.2.1=dc2:rack2
In this example:
● 192.168.1.1 and 192.168.1.2 are in dc1 (data center 1) and rack1 (rack 1).
● 192.168.2.1 is in dc2 (data center 2) and rack2 (rack 2).
Determine Node Location:
● Snitches help Cassandra determine which data center and rack a node belongs to.
Data Replication:
● By understanding the topology, snitches enable Cassandra to place replicas of data in different data centers and
racks, enhancing data availability and fault tolerance.
● Snitches help route read and write requests to the most appropriate nodes based on their location. This
minimizes latency and balances the load across the cluster.
Failure Detection:
● Snitches provide information that helps Cassandra detect and handle node and data center failures effectively.
This ensures that data remains available even if some nodes or data centers fail.
Gossip Protocol:
Cassandra has no master node. Instead, every node periodically exchanges state information about itself and about the
other nodes it knows of with a few peer nodes; this peer-to-peer exchange is the gossip protocol.
Seed nodes bootstrap this process:
Initial Contact: The seed node helps in the initial discovery phase. The new node contacts the seed node to get
information about the rest of the cluster. The seed node then provides information about other nodes in the cluster,
allowing the new node to connect to them.
Not a Unique Role: It’s worth noting that seed nodes are not special or unique in terms of cluster operations; they
don't have any additional responsibilities beyond this initial discovery process. Once the new node has connected to
other nodes in the cluster, it doesn’t need to rely on the seed node anymore for cluster operations.
Configuration: In the cassandra.yaml configuration file, you specify the list of seed nodes under the
seed_provider configuration. It’s a good practice to list more than one seed node to provide redundancy. This way, if
one seed node is unavailable, the new node can still discover and join the cluster via another seed node.
Best Practices: To ensure high availability and reliability, you should configure multiple seed nodes (typically 2-3)
across different machines or data centers if possible. This way, the failure of a single node will not prevent new
nodes from joining the cluster.
Seed Nodes:
Seed nodes are used to bootstrap the gossip protocol. The features of
seed nodes are:
In the example shown, on startup two nodes connect to the two other nodes that are specified as seed nodes. Once all
four nodes are connected, seed node information is no longer required, as steady state is achieved.
Configurations:
● The main configuration file in Cassandra is the cassandra.yaml file. We will look at this file in more detail in
the lesson on installation.
● For now, remember that this file contains the name of the cluster, the seed nodes for this node, topology
file information, and the data file locations.
● This file is located in /etc/cassandra in some installations and in /etc/cassandra/conf in others.
Efficient Writes in Cassandra:
Cassandra Write Process:
The Cassandra write process ensures fast writes. How many replicas must acknowledge a write before it is reported
successful is controlled by the write consistency level:
● ONE: A write is successful if it is acknowledged by at least one replica node. This level offers a good balance
between availability and consistency but does not guarantee that all replicas will have the latest data.
● QUORUM: A write is successful if a majority (quorum) of the replica nodes acknowledge the write.
For a replication factor (RF) of N, the quorum is ⌊N/2⌋ + 1. For instance, if RF is 3, then a quorum would be
2. This level provides a strong consistency guarantee while still maintaining good availability.
● ALL: A write is considered successful only if all replica nodes acknowledge it. This level provides the
highest consistency but at the cost of availability, as the write will fail if even a single node is down or
unreachable.
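A small sketch of how a client can request these levels from cqlsh (CONSISTENCY is a cqlsh session command; the status_updates table is the hypothetical one sketched earlier):

-- Require a majority of replicas to acknowledge the writes that follow.
CONSISTENCY QUORUM;

INSERT INTO status_updates (user_id, posted_at, status_text)
VALUES ('user42', toTimestamp(now()), 'hello world');

-- Relax back to ONE for higher availability and lower latency.
CONSISTENCY ONE;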
How are write requests accomplished?
● The coordinator sends a write request to all replicas that own the row being written.
● As long as all replica nodes are up and available, they will get the write regardless of the consistency level
specified by the client.
● The write consistency level determines how many replica nodes must respond with a success acknowledgment
in order for the write to be considered successful.
● Success means that the data was written to the commit log and the memtable as described in how data is
written.
● The coordinator node forwards the write to replicas of that row, and responds to the client once it receives write
acknowledgments from the number of nodes specified by the consistency level.
Exceptions:
○ If the coordinator cannot write to enough replicas to meet the requested consistency level, it throws an
Unavailable Exception and does not perform any writes.
○ If there are enough replicas available but the required writes don't finish within the timeout window, the
coordinator throws a Timeout Exception.
● For example, in a single datacenter
10-node cluster with a replication
factor of 3, an incoming write will go
to all 3 nodes that own the requested
row.
● If the write consistency level specified
by the client is ONE, the first node to
complete the write responds back to
the coordinator, which then proxies the
success message back to the client.
● A consistency level of ONE means
that it is possible that 2 of the 3
replicas can miss the write if they
happen to be down at the time the
request is made.
● If a replica misses a write, the row is made consistent later using one of the built-in repair mechanisms: hinted
handoff, read repair, or anti-entropy node repair.
[Figure: Single datacenter cluster with 3 replica nodes and consistency set to ONE]
Hinted Handoff:
● Hinted Handoff is a mechanism in Apache Cassandra designed to handle temporary failures or
unavailability of nodes during write operations.
● It helps maintain data consistency and availability by ensuring that writes are eventually
propagated to all replicas, even if some of them are temporarily down.
How Hinted Handoff Works?
● On occasion, a node becomes unresponsive while data is being written. Reasons for unresponsiveness are
hardware problems, network issues, or overloaded nodes that experience long garbage collection (GC)
pauses.
● By design, hinted handoff inherently allows Cassandra to continue performing the same number of writes
even when the cluster is operating at reduced capacity.
● After the failure detector marks a node as down, missed writes are stored by the coordinator for a period of
time, if hinted handoff is enabled in the cassandra.yaml file.
● In Cassandra 3.0 and later, the hint is stored in a local hints directory on each node for improved replay.
● The hint consists of a target ID for the downed node, a hint ID that is a time UUID for the data, a message
ID that identifies the Cassandra version, and the data itself as a blob.
● Hints are flushed to disk every 10 seconds, reducing the staleness of the hints.
● When gossip discovers that a node has come back online, the coordinator replays each remaining hint to
write the data to the newly returned node, then deletes the hint file.
● If a node is down for longer than max_hint_window_in_ms (3 hours by default), the coordinator stops
writing new hints.
Hinted Handoff, Step by Step
1. Write Operation: When a write request is made, Cassandra writes the data to the appropriate nodes based on
the replication strategy.
2. Node Failure: If a node is down or unreachable, the other nodes will store a hint for the write operation that
was meant for the unavailable node.
Hint Logs are files or data structures in Cassandra where hints about write operations are stored. These hints
are kept temporarily to ensure that write operations are not lost when a node is down or unreachable during
the write.
3. Hint Storage: These hints are stored on the nodes that successfully received the write request.
4. Node Recovery: When the previously unavailable node comes back online, the nodes that stored the hints will
send these hints to the recovered node.
5. Replay Hint: The recovered node then replays the hints and applies the write operations, ensuring it
eventually gets all the updates.
● For example, consider a cluster
consisting of three nodes, A, B,
and C, with a replication factor of
2.
● When a row K is written to the
coordinator (node A in this case),
even if node C is down, the
consistency level of ONE or
QUORUM can be met.
● Why? Both nodes A and B will
receive the data, so the
consistency level requirement is
met.
● A hint is stored for node C and
written when node C comes up. In
the meantime, the coordinator can
acknowledge that the write
succeeded.
Cassandra Read Process:
● The Cassandra read process ensures fast reads. Read happens across all nodes in parallel.
● If a node is down, data is read from the replica of the data.
● Priority for the replica is assigned on the basis of distance.
Data in the memtable is checked first so that it can be returned directly from memory; if it is not found there, the SSTables on disk are consulted (Bloom filters help skip SSTables that cannot contain the row).
Consider an example cluster with two data centers:
● data center 1
● data center 2
Row 1 is replicated on nodes 3, 5, and 7, all in data center 1; the fourth copy is stored on node 13 of data center 2.
If a client process running on node 7 wants to access data row 1, node 7 is given the highest preference, as the data is local
there. The next preference is node 5, where the data is rack-local. The next preference is node 3, where the data is on a different
rack but within the same data center.
The least preference is given to node 13, which is in a different data center. So the read preference order in this example is node 7,
node 5, node 3, and node 13.
Read Consistency Levels:
● ONE: A read is considered successful if at least one replica node returns the data. This level is quick but does
not guarantee the most recent data, as it might not be synchronized with all replicas.
● QUORUM: A read is successful if a majority of replica nodes return the data. This level ensures that the data
is recent and consistent across the majority of nodes, providing a good balance between consistency and
performance.
● ALL: A read is successful only if all replica nodes return the data. This level ensures the highest consistency
but can be slower and less available if some nodes are down.
● LOCAL_QUORUM: This is similar to the QUORUM consistency level but applies only to replicas in the
same data center. It helps reduce cross-data-center latency while still ensuring consistency.
● EACH_QUORUM: A read or write operation is successful if a quorum of nodes in each data center (in a
multi-data-center setup) acknowledges the operation. This ensures consistency across all data centers.
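As with writes, the read consistency level can be set per cqlsh session; a short sketch reusing the hypothetical status_updates table:

-- Require a quorum of replicas in the local data center for the reads that follow.
CONSISTENCY LOCAL_QUORUM;

SELECT status_text, posted_at
FROM status_updates
WHERE user_id = 'user42';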
SERIAL and LOCAL_SERIAL Consistency Levels:
SERIAL = Cassandra checks across the whole cluster (all datacenters) to make sure the read/write you’re doing
with an IF condition (a lightweight transaction) is the most up-to-date and correct.
Global correctness, but slower.
LOCAL_SERIAL = Cassandra only checks inside your local datacenter for the most up-to-date value.
Faster, but might not include very recent changes made in another datacenter.
Analogy: you want to borrow a book, but only if nobody else has borrowed it; the "only if" check is the IF condition.
Example 1: Serial Consistency
Objective: Ensure that a user profile update is consistent across all data centers and appears in a strict order.
1. Application Requirement: You have a user profile with an account_balance field. When a user makes a
purchase, you need to update their account balance.
2. Operation: You want to ensure that the update to account_balance happens in a strictly serialized manner
across all replicas, irrespective of which data center the request hits.
Example 2: Local Serial Consistency
Objective: Ensure that a user profile update is consistent within a single data center, but not necessarily across
multiple data centers.
1. Application Requirement: The same e-commerce application with user profiles. Now, you want to ensure
strong consistency for profile updates, but only within a single data center.
2. Operation: You are fine with eventual consistency across data centers but need strong consistency within a
single data center to avoid conflicts during high traffic.
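A sketch of the account_balance update from Example 1 written as a lightweight transaction (IF condition), with the serial consistency level chosen in cqlsh; the user_profiles table and the literal values are hypothetical:

-- Check the IF condition across all data centers (Example 1) ...
SERIAL CONSISTENCY SERIAL;
-- ... or only within the local data center (Example 2):
-- SERIAL CONSISTENCY LOCAL_SERIAL;

UPDATE user_profiles
SET account_balance = 70
WHERE user_id = 'user42'
IF account_balance = 100;   -- applied only if the balance is still 100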
Examples of read consistency levels:
[Figure: Single datacenter cluster with 3 replica nodes and consistency set to ONE]
[Figure: A two datacenter cluster with a consistency level of QUORUM]
● "Anti-entropy"is a process used to ensure consistency across different nodes in the cluster by resolving
discrepancies in data.
● Specifically, it’s a way to repair differences (or "conflicts") that may occur due to the eventual consistency
model that Cassandra uses.
● Since Cassandra allows data to be written to different nodes (replicas) and doesn't require an immediate sync
between them, sometimes inconsistencies can arise. This might happen if one replica has stale or missing data,
or if a network partition occurred and nodes couldn't communicate properly.
● Anti-entropy repair in Cassandra is the mechanism that tries to fix these inconsistencies. It uses a process
known as Merkle Trees to compare the data between two replicas and identify differences.
Read Repair in Anti Entropy:
● Read Repair in Cassandra is an automatic process that ensures data consistency across replicas during a read
operation.
● When a client queries the database, Cassandra may detect that different replicas have inconsistent data for the
same piece of information.
● Read Repair is triggered to fix this inconsistency.
● Cassandra compares the data returned from different replicas for the same row. If discrepancies are detected
between these replicas (i.e., they have different versions of the same data), Cassandra will repair the
inconsistency by updating the outdated replicas with the most recent version of the data.
● It ensures that all replicas are consistent with the most recent write and helps bring the system closer to
consistency over time.
How does Read Repair Work?
1. Client Issues a Read Request: A client requests data (for example, a row with a specific row_id) from the
Cassandra cluster.
2. Coordinator Queries Replicas: The coordinator node (which receives the read request) queries multiple
replicas (e.g., nodes in the cluster) for the data.
3. Replicas Return Data: The replicas return their version of the requested data. Each replica stores data with a
timestamp indicating when the data was last updated.
4. Detecting Inconsistent Data: The coordinator compares the data returned from the replicas. If any
discrepancies are found (e.g., replicas have different values for the same data), the coordinator identifies that
the replicas are inconsistent.
5. Repairing the Inconsistent Data: The coordinator will then use the data from the replica with the most
recent timestamp (the most up-to-date version) and propagate it to the other replicas that have outdated data.
This ensures that all replicas are consistent with the latest value.
6. Returning Data to the Client: After the read repair is performed, the client receives the most recent (and
consistent) data from the coordinator.
Role of Merkle Tree in Read Repair:
● Merkle Trees are used as an efficient way to detect data inconsistencies between replicas during the read
repair process.
● They enable Cassandra to quickly identify which part of the data needs to be repaired without having to
compare the entire dataset on each node.
● A Merkle tree, also known as a Hash tree, is a data structure that efficiently verifies the integrity of large
datasets by hashing data into a tree-like structure, where each leaf node represents a hash of a data block and
non-leaf nodes represent hashes of their child nodes.
Merkle Tree Example:
Insert the following data blocks into a Merkle tree: D1=8, D2=16, D3=32, D4=64, using the simple hash function
H(N) = (sum of the digits of N) mod 10.
Step 1: Hash each data block to form the leaf nodes: H(8)=8, H(16)=7, H(32)=5, H(64)=0.
Step 2: Combine adjacent hashes and hash them again. We combine the adjacent hashes and apply the simple hash
function to the concatenated result, repeating level by level until a single root hash remains.
Repair Types:
● Full Repair: Compares and repairs all SSTables (Sorted String Tables) in a node, ensuring complete consistency.
● Incremental Repair: Focuses on SSTables that have not been previously repaired, using metadata to track repaired
versus unrepaired data. It’s more efficient if repairs are run frequently.
Compaction and Anti-Compaction:
● Compaction: The process of merging multiple SSTables into a single SSTable to improve read performance and
reclaim disk space.
● Anti-Compaction: The splitting of an SSTable into repaired and unrepaired portions; it is typically triggered as
part of an incremental repair operation.
Repair Modes:
● Sequential Repair: Repairs nodes one at a time. It creates snapshots of SSTables, minimizing data changes during
repair.
● Parallel Repair: Repairs multiple nodes simultaneously, improving repair speed but increasing network load.
Full Repair vs Incremental Repair:
1. Full Repair:
● Scope: Compares and repairs all SSTables on a node.
● Process: Generates Merkle trees for the entire dataset and identifies discrepancies.
● Resource Usage: High, as it involves complete data processing and comparison.
● Duration: Generally longer due to the comprehensive nature of the repair.
● Use Case: Best for initial consistency checks or when a thorough repair is needed.
2. Incremental Repair:
● Scope: Focuses on SSTables that have not been repaired before, using metadata to track repaired and
unrepaired data.
● Process: Builds Merkle trees only for unrepaired SSTables, compares them, and corrects discrepancies.
● Resource Usage: Lower, as it deals with a subset of data.
● Duration: Shorter and more efficient due to its targeted approach.
● Use Case: Suitable for routine maintenance and frequent repairs to minimize performance impact.
Why is Read Repair Needed?
Cassandra is designed for eventual consistency, which means that nodes might have different versions of the same
data due to various reasons, such as:
● Network partitions: When nodes are temporarily unable to communicate, they may diverge and store
different versions of the same data.
● Node failures: If a node is down during writes, it might miss some updates.
● Writes happening at different times on different nodes, causing temporary inconsistency.
Read Repair ensures that even if data is inconsistent between replicas, it gets fixed during subsequent read
operations, ensuring that the system eventually converges to consistency.
Tunable Consistency:
In Apache Cassandra, "tunable consistency" refers to the ability to configure the consistency level of read and write
operations to balance between data consistency and system performance. This flexibility allows you to make
trade-offs based on your application's needs.
Trade-offs
● Higher Consistency: Requires more replicas to acknowledge a request, which means stronger guarantees that
the data is accurate but can affect performance and latency.
● Lower Consistency: Faster and can handle more load, but offers weaker guarantees that the data is accurate.
Examples of Tunable Consistency in Practice
Example 1: Balancing Consistency and Performance
Suppose you have a Cassandra cluster with a replication factor of 3 (RF=3). For a write operation, you might
choose:
● Write Consistency Level: QUORUM (which requires 2 out of 3 replicas to acknowledge the write)
● Read Consistency Level: QUORUM (which also requires 2 out of 3 replicas to respond)
This configuration balances consistency and performance. It ensures that most of the replicas have the latest data
and reduces the risk of stale reads while maintaining a reasonable level of availability.
Example 2: Maximizing Availability
● Write Consistency Level: ONE (only one replica needs to acknowledge the write)
● Read Consistency Level: ONE (only one replica needs to respond to the read request)
This configuration maximizes availability but at the cost of potentially higher risk of data inconsistency if some
replicas are out of sync.
Example 3: Ensuring Strong Consistency
For applications requiring strong consistency, such as financial systems, you might use:
● Write Consistency Level: ALL (all replicas must acknowledge the write)
● Read Consistency Level: ALL (all replicas must respond to the read request)
This setup ensures that all replicas are in sync and provides the highest level of consistency but may impact performance
and availability, particularly if some replicas are down.
What happens if Consistency is not maintained?
Failure Scenarios: Node Failure
Cassandra is highly fault tolerant. If a node fails, requests are served by the remaining replicas, the coordinator stores hints for the writes the failed node missed, and the node is brought back up to date via hinted handoff, read repair, or anti-entropy repair once it rejoins the cluster.
Schema Flexibility
Using collections can make the schema more flexible and dynamic. Instead of creating separate tables to manage
related data, you can use collections to embed these relationships directly within a single row. This allows you to:
● Avoid Joins: Reduce the need for complex queries and joins by embedding related data within a single
column.
● Adapt to Data Changes: Easily adjust the schema as your data requirements evolve without needing to
redesign your database structure.
Cassandra List Collection
Lists are ordered collections of elements where each element can appear multiple times. They maintain insertion
order, meaning that the order in which elements are added is preserved.
Cassandra Map Collection
Maps are collections of key-value pairs where each key is unique, and each key maps to a single value. Maps are
useful for scenarios where you need to associate a specific value with a unique key.
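A short sketch of a list and a map column in CQL; the user_preferences table and its columns are hypothetical:

CREATE TABLE user_preferences (
    user_id   text PRIMARY KEY,
    playlists list<text>,        -- ordered, duplicates allowed
    settings  map<text, text>    -- unique keys, each mapping to one value
);

-- Append to the list and upsert a map entry in place.
UPDATE user_preferences
SET playlists = playlists + ['road trip mix'],
    settings['theme'] = 'dark'
WHERE user_id = 'user42';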
Cassandra JMX (Java Management Extensions) Authentication:
● JMX authentication ensures that only authorized users can access management and
monitoring interfaces.
● Without authentication, anyone with network access could potentially monitor or
manipulate your Cassandra nodes, which could lead to security breaches or
unauthorized data access.
Configure Authentication and Authorization:
● In Cassandra, authentication and authorization are disabled by default.
● You have to edit the cassandra.yaml file to enable authentication and authorization.
● In cassandra.yaml, set the options that deal with internal authentication and authorization
(authenticator: PasswordAuthenticator and authorizer: CassandraAuthorizer).
Logging in
● Now that authentication is enabled, if you try to access any keyspace without valid credentials, Cassandra returns an error.
● By default, Cassandra provides a superuser account with username ‘cassandra’ and password ‘cassandra’.
Logged in to the ‘cassandra’ account, you can perform any operation.
● cqlsh will not allow you to log in unless you supply valid credentials; with the default ‘cassandra’ username and
password, the login succeeds.
● You can also create other users from this account. It is recommended to change the password from the default.
● Here is an example of logging in as the cassandra user and changing the default password (see the sketch below).
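A minimal sketch, assuming cqlsh is available on the local node (the new password shown is just a placeholder):

-- From the shell: start cqlsh with the default superuser credentials
cqlsh -u cassandra -p cassandra
-- Inside cqlsh: replace the default password
ALTER USER cassandra WITH PASSWORD 'a_much_stronger_password';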
Create New User
New accounts can be created with the ‘cassandra’ superuser account.
To create a new user, log in as a superuser and specify the new user's password along with whether or not the user is a
superuser. Only a superuser can create new users.
You can get a list of all users with the following syntax: LIST USERS;
Authorization
Authorization is the process of assigning permissions to users, which determines what actions a particular user can perform.
Here is the generic syntax for assigning permission to users.
GRANT permission ON resource TO user
There are following types of permission that can be granted to the user.
1. ALL
2. ALTER
3. AUTHORIZE
4. CREATE
5. DROP
6. MODIFY
7. SELECT
Here are examples of assigning permission to the user.
CREATE USER laura WITH PASSWORD 'newhire';
GRANT ALL ON dev.emp TO laura;
REVOKE ALL ON dev.emp FROM laura;
GRANT SELECT ON dev.emp TO laura;
You can also list all the permission on the resource. Here is the example of getting permission from a table.
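For example, for the dev.emp table used above:

LIST ALL PERMISSIONS ON dev.emp;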
Enabling JMX Authentication
With the default settings of Cassandra, JMX can only be accessed from localhost. If you want to
access JMX remotely, change the LOCAL_JMX setting in cassandra-env.sh and enable authentication
or SSL.
After enabling JMX authentication, make sure OpsCenter and nodetool are configured to use
authentication.
Procedure
The steps for enabling JMX authentication are:
1. In the cassandra-env.sh file, add or update the following lines.
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.authenticate=true"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.password.file=/etc/cassandra/jmxremote.password"