Cloud Unit III
The evolution of data storage technology has moved from bulky and slow
mechanical methods to compact and efficient digital solutions. Early forms like
punch cards and magnetic tapes gave way to hard disk drives (HDDs), which were
then challenged by the faster and more durable solid-state drives (SSDs). Cloud
storage represents a more recent shift, offering accessibility and scalability, while
also presenting new challenges in terms of data security and privacy.
Several different types of data storage have appeared throughout history, including punch cards, magnetic tape, hard disk drives, and cloud storage.
Punch Cards:
The earliest form of digital data storage, invented in the 19th century and used
extensively in early computers for storing and processing information.
Magnetic Tape:
Introduced in the 1950s, magnetic tape offered higher storage capacity and was
used for data archiving and backup.
Magnetic Storage:
Hard Disk Drives (HDDs):
Developed in the 1950s, HDDs provided faster access to data compared to tape,
becoming a standard for personal computers.
Floppy Disks:
Introduced in the 1970s, floppy disks offered portable storage, though with limited
capacity.
Zip Drives:
A later iteration of removable storage with higher capacity than floppy disks, but
ultimately not as successful as hard drives, according to Platinum Data Recovery.
Optical Storage:
CDs, DVDs, Blu-ray: These technologies used lasers to read and write data onto
optical discs, offering higher storage capacities than previous media.
Solid-State Storage:
Flash Memory and Solid-State Drives (SSDs): Flash memory, like that found in
SSDs, offers faster data access speeds and greater durability than HDDs, becoming
increasingly popular for laptops and desktops.
Cloud Storage:
Cloud computing offers remote storage solutions, providing accessibility and
scalability while potentially reducing the need for local storage devices.
Cloud Storage:
Cloud storage refers to the practice of saving data on remote
servers accessible via the internet, instead of on local storage devices or personal
computers.
The primary benefits of cloud storage include scalability, cost-effectiveness, and
accessibility, as it allows users to store vast amounts of data while offering the
flexibility to access it from any location with internet connectivity.
STORAGE MODELS:
Storage models define how data is organized and accessed in a data store. Two related views are the data model (the logical aspects) and the storage model (the physical layout). Common types include file storage, block storage, and object
storage, each suited for different needs and applications. Cloud storage adds
another layer with models like public, private, and hybrid clouds, offering various
levels of security and control.
● Data Model:
Focuses on the logical structure of data within a database, defining how data
is organized and related.
● Storage Model:
Describes the physical layout of data on storage media (e.g., hard drive,
cloud).
● Instance Storage:
Virtual disks in the cloud, often used in traditional virtualization
environments.
● Volume Storage:
Storage accessed via a network, similar to a SAN (Storage Area Network)
but in the cloud.
● Object Storage:
Web-scale storage optimized for handling large amounts of unstructured
data.
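To make the contrast between block storage and object storage above more concrete, the toy classes below compare block-style access (fixed-size blocks addressed by number) with object-style access (whole objects addressed by a key plus metadata). This is a minimal in-memory sketch; every class, method, and key name is invented for the example and does not correspond to any real cloud API.

```python
# Toy contrast of block storage vs. object storage (illustrative only).

class BlockDevice:
    """Block storage: fixed-size blocks addressed by block number."""
    def __init__(self, block_size=4096, num_blocks=1024):
        self.block_size = block_size
        self.blocks = [bytes(block_size)] * num_blocks

    def write_block(self, block_no, data):
        # A real block device would be used underneath a file system or database.
        self.blocks[block_no] = data.ljust(self.block_size, b"\0")

    def read_block(self, block_no):
        return self.blocks[block_no]


class ObjectStore:
    """Object storage: whole objects addressed by a flat key, with metadata."""
    def __init__(self):
        self.objects = {}

    def put(self, key, data, metadata=None):
        self.objects[key] = {"data": data, "metadata": metadata or {}}

    def get(self, key):
        return self.objects[key]["data"]


disk = BlockDevice()
disk.write_block(0, b"superblock")            # file systems are built on top of blocks
store = ObjectStore()
store.put("backups/2024/db.dump", b"...", {"content-type": "application/octet-stream"})
print(disk.read_block(0)[:10], store.get("backups/2024/db.dump"))
```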
Solid State Drives (SSDs) are storage devices that use flash memory to store data,
unlike traditional Hard Disk Drives (HDDs) which use spinning disks and moving
parts. This difference in technology gives SSDs several advantages, including
faster performance, greater durability, and lower power consumption.
Speed:
SSDs offer significantly faster read and write speeds compared to HDDs, leading
to quicker boot times, faster application loading, and improved overall system
responsiveness.
Durability:
Without moving parts, SSDs are more resistant to physical shock and vibration,
making them more reliable for portable devices like laptops.
Power Consumption:
SSDs consume less power than HDDs, which can translate to longer battery life in
laptops and reduced energy costs in data centers.
Noise:
SSDs are silent, as they have no moving parts that can create noise, unlike HDDs.
Drawbacks of SSDs:
Cost: SSDs are generally more expensive per gigabyte compared to HDDs.
Storage Capacity: While SSD capacities are increasing, they may not yet match the
vast storage options available with HDDs.
Limited Write Cycles: Flash memory used in SSDs has a finite number of write
cycles before it begins to degrade, though this is less of a concern with modern
SSDs.
A file system and a DBMS are two kinds of data management systems that are
used in different capacities and possess different characteristics.
A File System is a way of organizing files into groups and folders and then storing
them on a storage device. It manages the storage medium and enables users to perform operations such as reading, writing, and deleting files.
On the other hand, a DBMS is a more elaborate software application dedicated to managing large amounts of structured data. It provides facilities such as querying, indexing, transactions, and data integrity.
Although a file system serves well for applications where data is stored simply and does not require much organization, a DBMS is more appropriate for applications where data must be organized, structured, secured, and optimized.
File System
The file system is basically a way of arranging the files in a storage medium like a
hard disk. The file system organizes the files and helps in the retrieval of files
when they are required. File systems consist of different files which are grouped
into directories. The directories further contain other folders and files. The file
system performs basic operations such as file management, file naming, and setting access rules.
Examples: NTFS (New Technology File System), EXT (Extended File System).
Differences between a File System and a DBMS:
● Query Processing: There is no efficient query processing in the file system, whereas a DBMS provides efficient query processing.
● Consistency: There is less data consistency in the file system, whereas a DBMS offers more data consistency because of the process of normalization.
● Cost: A file system is less expensive than a DBMS, whereas a DBMS has a comparatively higher cost than a file system.
● Data Independence: There is no data independence in a file system, whereas in a DBMS data independence exists, mainly of two types: 1) Logical Data Independence and 2) Physical Data Independence.
● User Access: In a file system, only one user can access data at a time, whereas in a DBMS multiple users can access data at a time.
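As a small illustration of the query-processing difference listed above, the sketch below stores the same records first as plain comma-separated lines in a file (file-system style, where the application must parse and filter everything itself) and then in an SQLite table, where the DBMS executes the query. It uses only the Python standard library; the file, table, and column names are chosen just for this example.

```python
import os
import sqlite3
import tempfile

records = [("alice", 30), ("bob", 25), ("carol", 35)]

# File-system style: the application parses and filters the raw lines itself.
path = os.path.join(tempfile.mkdtemp(), "users.txt")
with open(path, "w") as f:
    for name, age in records:
        f.write(f"{name},{age}\n")
with open(path) as f:
    over_28_file = [line.split(",")[0] for line in f if int(line.split(",")[1]) > 28]

# DBMS style: SQLite plans and executes the declarative query for us.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", records)
over_28_db = [row[0] for row in conn.execute("SELECT name FROM users WHERE age > 28")]

print(over_28_file, over_28_db)   # both print ['alice', 'carol']
```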
A general parallel file system in cloud computing allows multiple computing nodes to access the
same file system concurrently, enabling high-performance data access for large-scale
applications.
Examples include GPFS (now IBM Storage Scale) and Lustre, which are used to manage vast
amounts of data in distributed environments.
● Scalability:
They can handle growing data volumes and increasing performance demands by adding
more storage nodes.
● High Performance:
Parallel access to data across multiple nodes maximizes throughput and minimizes
latency, crucial for applications like scientific computing and data analytics.
● Data Distribution:
Files are typically broken into smaller blocks that are distributed across multiple storage devices, enabling parallel access (a small striping sketch follows this list).
● High Availability:
Features like data replication and redundancy ensure data durability and availability even
if some nodes fail.
● Compatibility and Integration:
They are designed to integrate with various cloud platforms and applications, supporting
diverse workloads and access patterns.
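A minimal sketch of the data-distribution point above, assuming nothing beyond the Python standard library: a byte string stands in for a file, it is cut into fixed-size blocks, and the blocks are striped round-robin across several "storage nodes" (plain in-memory lists here). A real parallel file system would place the blocks on separate servers and read them concurrently.

```python
# Toy round-robin striping of a file across several storage nodes.
BLOCK_SIZE = 4     # deliberately tiny so the result is easy to inspect
NUM_NODES = 3

nodes = [[] for _ in range(NUM_NODES)]     # nodes[i] holds (block_no, block_bytes)

data = b"The quick brown fox jumps over the lazy dog"
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

for block_no, block in enumerate(blocks):
    nodes[block_no % NUM_NODES].append((block_no, block))   # round-robin placement

# Reassembly: gather blocks from every node and order them by block number.
gathered = sorted((b for node in nodes for b in node), key=lambda pair: pair[0])
assert b"".join(block for _, block in gathered) == data
print([len(node) for node in nodes])       # blocks held per node, e.g. [4, 4, 3]
```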
Google Inc. developed the Google File System (GFS), a scalable distributed file
system (DFS), to meet the company's growing data processing needs. GFS offers
fault tolerance, reliability, scalability, availability, and performance to large networks of connected nodes. GFS is made up of a number of storage systems constructed from inexpensive commodity hardware components. The search engine, which generates enormous volumes of data that must be stored, is only one example of how it is customized to meet Google's various data use and storage requirements. The Google File System was designed to tolerate hardware failures while still taking advantage of inexpensive, commercially available servers.
GoogleFS is another name for GFS. It manages two types of data, namely file metadata and file data.
The GFS node cluster consists of a single master and several chunk servers that
various client systems regularly access. On local disks, chunk servers keep data in the form of Linux files. The stored data is split into large (64 MB) chunks, which are replicated at least three times across the network; the large chunk size reduces network overhead.
The largest GFS clusters are made up of more than 1,000 nodes with 300 TB of disk storage capacity, which hundreds of clients can access continuously.
Components of GFS
GFS Clients: They can be computer programs or applications which may be used
to request files. Requests may be made to access and modify already-existing files
or add new files to the system.
GFS Master Server: It serves as the cluster's coordinator. It preserves a record of
the cluster's actions in an operation log. Additionally, it keeps track of the data that
describes chunks, or metadata. The metadata tells the master server which file each chunk belongs to and where it fits within that file.
GFS Chunk Servers: They are the GFS's workhorses. They keep 64 MB-sized file
chunks. Chunk data does not pass through the master server; instead, the chunk servers deliver the requested chunks directly to the clients. To ensure reliability, the GFS makes numerous copies of each chunk and stores them on different chunk servers; the default is three copies. Each copy is referred to as a replica.
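To make the master/chunk-server split concrete, here is a toy model of the metadata a master might keep: a file is divided into 64 MB chunks and each chunk is assigned three replicas on different chunk servers, while the chunk contents themselves would live only on those servers. This is not Google's actual code or API; the server names, file name, and helper function are invented for the example.

```python
import itertools

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks, as in GFS
REPLICAS = 3                    # default number of copies per chunk

chunk_servers = itertools.cycle(["cs-01", "cs-02", "cs-03", "cs-04", "cs-05"])

def register_file(name, size_bytes, metadata):
    """Toy master: record which chunks a file has and where the replicas live.

    The master stores only this metadata; chunk data stays on the chunk servers.
    """
    num_chunks = -(-size_bytes // CHUNK_SIZE)        # ceiling division
    metadata[name] = [
        {"chunk_id": f"{name}#{i}",
         "replicas": [next(chunk_servers) for _ in range(REPLICAS)]}
        for i in range(num_chunks)
    ]

master_metadata = {}
register_file("/logs/crawl-2024.log", 200 * 1024 * 1024, master_metadata)   # ~200 MB file
for chunk in master_metadata["/logs/crawl-2024.log"]:
    print(chunk["chunk_id"], "->", chunk["replicas"])
```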
Features of GFS
● Namespace management and locking.
● Fault tolerance.
● Reduced client and master interaction because of the large chunk size.
● High availability.
● Critical data replication.
● Automatic and efficient data recovery.
● High aggregate throughput.
Advantages of GFS
Disadvantages of GFS
LOCKS:
Locks are essential for maintaining data consistency and preventing race conditions
in distributed systems.
Chubby provides a robust mechanism for implementing distributed locks, ensuring
that only one client can hold a lock on a resource at a time.
Distributed systems often need to synchronize access to shared resources (e.g.,
files, databases, caches) to prevent data corruption or inconsistencies.
Types of locks:
● Advisory locks: These locks don't prevent access to resources but rather indicate that a client is using the resource (see the sketch after this list).
● Enforced locks: These locks prevent access to resources based on the lock
status.
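The advisory-lock idea can be seen with an ordinary POSIX file lock: processes that request the lock are serialized, but a process that never asks for the lock can still read or write the file. The sketch below uses Python's standard fcntl module (Unix-only); the file path is arbitrary.

```python
import fcntl

# Advisory locking with flock: cooperating processes that request the lock
# take turns, but the lock does not physically prevent access by a process
# that simply ignores it -- that is what makes it "advisory".
with open("/tmp/shared-resource.txt", "a") as f:
    fcntl.flock(f, fcntl.LOCK_EX)       # block until we hold the exclusive lock
    try:
        f.write("only one cooperating writer at a time\n")
        f.flush()
    finally:
        fcntl.flock(f, fcntl.LOCK_UN)   # release so other cooperating processes proceed
```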
Chubby is a robust and reliable distributed lock service that plays a crucial role in the
operation of many Google services and other large-scale distributed systems by
providing a foundation for coordination and consistency.
Features:
● Fault Tolerance:
Chubby is designed to be highly available and reliable, with the ability to elect
a new master if the current one fails.
● Coarse-grained Locking:
Chubby's locks are relatively large, meaning they control access to entire files
or directories rather than smaller pieces of data.
● Event Notifications:
Clients can subscribe to events related to files and locks, such as modification
of file contents or changes in lock status.
● Low-volume Storage:
Chubby can also be used as a repository for configuration data and other small
amounts of information.
How it Works:
● Master Election:
A distributed consensus protocol (like Paxos) is used to elect a master server
within the Chubby cell.
● Client Interaction:
Clients interact with the master server to acquire and release locks, read and
write data, and subscribe to events.
● Event Handling:
When a client subscribes to an event, the Chubby server notifies the client
when the event occurs, allowing the client to react accordingly.
● Cache Invalidation:
Clients cache file data and metadata, and updates invalidate the cache,
requiring clients to refresh their data.
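The toy class below is an in-process stand-in for the client-interaction and event-handling steps above: clients try to acquire a coarse-grained lock on a named path, and subscribers are notified whenever the lock changes hands. It is not Chubby's real interface (Chubby is a replicated RPC service with sessions, leases, and Paxos-based master election); every name here is invented for illustration.

```python
import threading
from collections import defaultdict

class ToyLockService:
    """In-process stand-in for a coarse-grained lock service with events."""

    def __init__(self):
        self._holders = {}                      # path -> name of the lock holder
        self._subscribers = defaultdict(list)   # path -> event callbacks
        self._mutex = threading.Lock()

    def try_acquire(self, path, client):
        """Grant the lock on `path` to `client` only if nobody else holds it."""
        with self._mutex:
            if path in self._holders:
                return False
            self._holders[path] = client
            self._notify(path, f"lock acquired by {client}")
            return True

    def release(self, path, client):
        with self._mutex:
            if self._holders.get(path) == client:
                del self._holders[path]
                self._notify(path, f"lock released by {client}")

    def subscribe(self, path, callback):
        self._subscribers[path].append(callback)

    def _notify(self, path, event):
        for callback in self._subscribers[path]:
            callback(path, event)


service = ToyLockService()
service.subscribe("/ls/cell/master", lambda path, event: print(path, "->", event))
print(service.try_acquire("/ls/cell/master", "client-A"))   # True, and the event fires
print(service.try_acquire("/ls/cell/master", "client-B"))   # False: client-A holds it
service.release("/ls/cell/master", "client-A")
```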
NoSQL DATABASES:
A NoSQL (Not Only SQL) database is a type of database that deviates from the
traditional relational database model. It's designed to handle large volumes of
diverse, unstructured, and rapidly changing data, offering flexibility and scalability
not always found in SQL databases.
● Non-relational:
NoSQL databases do not store data in the fixed tables of rows and columns used by relational databases; instead they use flexible structures such as documents, key-value pairs, wide columns, or graphs (see the sketch after this list).
● Scalability:
NoSQL databases are designed to handle large volumes of data and high
traffic loads by distributing data across multiple servers (horizontal scaling).
● BASE compliance:
NoSQL databases often prioritize availability, eventual consistency, and soft
state (temporary inconsistencies) over strict ACID (Atomicity, Consistency,
Isolation, Durability) compliance found in SQL databases.
● Query languages:
NoSQL databases can support SQL-like query languages or have their own query languages optimized for their specific data models.
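As a tiny illustration of the non-relational, flexible-schema point above: the documents below sit in one collection even though they do not share the same fields, and a query simply inspects whatever fields happen to be present. This is a plain in-memory sketch, not a real NoSQL client; all field values are made up.

```python
# Document-style storage: records in one collection need not share a schema.
collection = [
    {"_id": 1, "name": "alice", "email": "alice@example.com"},
    {"_id": 2, "name": "bob", "tags": ["admin", "beta"]},            # no email field
    {"_id": 3, "name": "carol", "address": {"city": "Pune"}},        # nested document
]

def find(coll, **criteria):
    """Return documents whose fields match every given criterion."""
    return [doc for doc in coll
            if all(doc.get(field) == value for field, value in criteria.items())]

print(find(collection, name="bob"))                   # works despite differing schemas
print(find(collection, email="alice@example.com"))
```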
Characteristics of OLTP (Online Transaction Processing):
● Enable multi-user access to the same data, while ensuring data integrity:
OLTP systems rely on concurrency algorithms to ensure that no two users can change the same data at the same time and that all transactions are carried out in the proper order. This prevents users of online reservation systems from double-booking the same room and protects holders of jointly held bank accounts from accidental overdrafts (a minimal sketch follows this list).
Databases commonly used for OLTP:
● Relational Databases:
Traditional relational databases like MySQL, PostgreSQL, and Oracle are
commonly used for OLTP due to their proven reliability and support for
complex transactions.
● NoSQL Databases:
Some NoSQL databases, like those using key-value or document models, are
also suitable for specific OLTP workloads, particularly those requiring high
scalability and flexible schemas.
● In-Memory Databases:
In-memory OLTP solutions leverage system memory to store and process
transaction data, further reducing latency and improving performance.
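A minimal sketch of the no-double-booking behaviour described in the first point above, assuming only the Python standard library: SQLite runs the insert inside a single transaction and a uniqueness constraint rejects a second booking for the same room and night, so two clients cannot both succeed. The table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reservations ("
             "room TEXT, night TEXT, guest TEXT, UNIQUE(room, night))")

def book(conn, room, night, guest):
    """Insert a reservation inside one transaction; fail if the slot is taken."""
    try:
        with conn:   # BEGIN ... COMMIT, or ROLLBACK if an exception is raised
            conn.execute("INSERT INTO reservations VALUES (?, ?, ?)",
                         (room, night, guest))
        return True
    except sqlite3.IntegrityError:        # UNIQUE constraint: already booked
        return False

print(book(conn, "101", "2024-07-01", "alice"))   # True
print(book(conn, "101", "2024-07-01", "bob"))     # False: double booking rejected
```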
BIGTABLE:
Bigtable is a fully managed, scalable NoSQL database service offered by Google
Cloud Platform (GCP). It's designed for handling massive amounts of structured
and semi-structured data with low latency and high throughput, making it suitable
for a wide range of applications such as machine learning, operational analytics, and user-facing applications.
Characteristics of Bigtable:
● NoSQL Database:
Bigtable is a NoSQL database, meaning it doesn't use the traditional
relational database model (tables with rows and columns).
● Wide-Column Store:
It's a wide-column store, where data is organized into columns that are
grouped into column families.
● Key-Value Store:
Data is accessed using a key-value pair structure, with the row key serving as the primary index (a small sketch of this layout follows this list).
● Scalability:
Bigtable is designed to scale from terabytes to petabytes of data, handling
billions of rows and thousands of columns.
● Low Latency:
It offers low latency access to data, making it suitable for real-time
applications and high-speed data processing.
● Fully Managed:
Google handles the infrastructure, management, and maintenance of
Bigtable, allowing users to focus on their applications.
● Replication:
Bigtable supports data replication across multiple regions for high
availability and disaster recovery.
● Integration:
It integrates with other Google Cloud services like BigQuery, Dataflow, and
Dataproc, as well as the open-source HBase API.
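The wide-column, key-value layout described above can be pictured as a nested map: row key -> column family -> column qualifier -> timestamped values. The snippet below is just a plain-Python picture of that shape, not the Cloud Bigtable client library; every identifier and value is made up for the example.

```python
# A wide-column table as nested maps:
#   row key -> column family -> column qualifier -> list of (timestamp, value)
table = {
    "user#alice": {
        "profile":  {"name": [(1700000000, "Alice")]},
        "activity": {"last_login": [(1700500000, "2023-11-20"),
                                    (1700000000, "2023-11-14")]},   # older version kept
    },
    "user#bob": {
        "profile": {"name": [(1700100000, "Bob")]},   # rows need not share columns
    },
}

def read_cell(table, row_key, family, qualifier):
    """Return the most recent value of one cell, or None if the cell is absent."""
    versions = table.get(row_key, {}).get(family, {}).get(qualifier, [])
    return max(versions)[1] if versions else None

print(read_cell(table, "user#alice", "activity", "last_login"))   # '2023-11-20'
print(read_cell(table, "user#bob", "activity", "last_login"))     # None
```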
Use Cases:
● Machine Learning:
Bigtable can be used as a storage engine for large datasets used in machine
learning models.
● Operational Analytics:
It's suitable for analyzing real-time data streams and generating insights for
operational monitoring and decision-making.
● Data Warehousing:
Bigtable can be used for data warehousing, especially when dealing with
large volumes of non-relational data and requiring real-time insights.
MEGASTORE:
Megastore is a storage system developed by Google that combines the scalability of a NoSQL datastore with the strong consistency guarantees of a traditional relational database; it is built on Bigtable and has supported many of Google's interactive production services.
● Scalability:
Megastore leverages data partitioning and replication to handle large
volumes of data and high traffic.
● High Availability:
It achieves high availability through synchronous replication across data
centers and seamless failover mechanisms.
● Strong Consistency:
Megastore provides ACID (Atomicity, Consistency, Isolation, Durability)
transactions, ensuring data consistency across multiple data centers.
● Entity Groups:
Data is organized into entity groups, allowing for efficient transaction
management and strong consistency guarantees within these groups.
● Modified Paxos:
Megastore uses a modified Paxos algorithm for consensus and replication
across data centers, ensuring fault tolerance and strong consistency.
● Built on Bigtable and Chubby:
Megastore is built on top of Google's Bigtable and Chubby infrastructure,
leveraging their scalability and reliability.
● Wide Range of Production Services:
It has been used to support a variety of Google's production services.
Storage reliability at scale refers to the ability of a storage system to consistently and
accurately store and retrieve data, even when dealing with a large volume of data and
a large number of storage components. It's crucial for large-scale systems to ensure
data integrity and availability despite the potential for component failures.