Big Data Storage Concepts
Lecture 5: Chapter 5 Part 1
Big Data Storage and Data Models
• Data and storage models are the basis for big data ecosystems
• While the storage model captures the physical aspects and features of data storage, the data model captures the logical representation and structures used for data processing and management
• Understanding the storage and data models together is essential for understanding big data ecosystems
• In this chapter we are going to investigate and compare the key
storage and data models in the spectrum of big data frameworks
Big Data Storage Models
• A storage model is at the core of any big-data-related system
• It affects the scalability, data structures, and the programming and computational models of the systems built on top of it
Big Data Main Storage Models
• Block-based storage
• File-based Storage
• Object-based Storage
Block-based storage
• Data is stored as blocks that normally have a fixed size and carry no additional information (metadata)
• Each block has a unique identifier, stored in a data lookup table
• Block-based storage focuses on performance and scalability for storing and accessing very large-scale data
• When data needs to be retrieved, the data lookup table is used to find
the required blocks, which are then reassembled into their original
form
• Block-based storage is usually used as a low-level storage paradigm underpinning higher-level storage systems such as file-based systems, object-based systems and transactional databases
Block-based storage Architecture
A simple model of block-based storage is shown in Figure 1
• Basically, data is stored as blocks that normally have a fixed size and carry no additional information (metadata)
• A unique identifier is used to access each
block
• The identifier is mapped to the exact
location of actual data blocks through
access interfaces
• Traditionally, block-based storage is bound to physical storage protocols, such as SCSI, iSCSI, ATA and SATA
Figure 1: Block-based storage model
Block-based storage Architecture Cont.
• With the development of distributed computing and big data, block-based storage models have also been developed to support distributed and cloud-based environments
• As shown in Figure 2, the architecture of a
distributed block-storage system is composed
of the block server and a group of block nodes
• The block server is responsible for maintaining
the mapping or indexing from block IDs to the
actual data blocks in the block nodes
• The block nodes are responsible for storing the actual data in fixed-size partitions, each of which is considered a block
Figure 2: Architecture of distributed block-based storage
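The block server and block nodes described above can be sketched in a few lines of Python. This is a toy model, not a real storage API: all class and method names are illustrative, and placement is a naive round-robin.

```python
# Minimal sketch of a distributed block store: data is split into
# fixed-size blocks, the "block server" keeps the block-ID -> node
# lookup table, and "block nodes" hold the raw bytes.
import uuid

BLOCK_SIZE = 4  # tiny fixed block size, for illustration only


class BlockNode:
    def __init__(self):
        self.blocks = {}          # block_id -> raw bytes


class BlockServer:
    def __init__(self, nodes):
        self.nodes = nodes
        self.lookup = {}          # block_id -> node index (the lookup table)

    def write(self, data):
        """Split data into fixed-size blocks; return the ordered block IDs."""
        block_ids = []
        for i in range(0, len(data), BLOCK_SIZE):
            block_id = uuid.uuid4().hex               # unique identifier
            node_idx = len(block_ids) % len(self.nodes)  # naive placement
            self.nodes[node_idx].blocks[block_id] = data[i:i + BLOCK_SIZE]
            self.lookup[block_id] = node_idx
            block_ids.append(block_id)
        return block_ids

    def read(self, block_ids):
        """Use the lookup table to find the blocks and reassemble the data."""
        return b"".join(self.nodes[self.lookup[b]].blocks[b] for b in block_ids)


nodes = [BlockNode() for _ in range(3)]
server = BlockServer(nodes)
ids = server.write(b"hello block storage")
```

Note that the caller only sees opaque block IDs; reassembly into the original byte sequence happens through the lookup table, as in Figure 2.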
File-Based Storage
• File-based storage inherits the traditional file system architecture and considers data as files that are maintained in a hierarchical structure
• It is the most common storage model and is relatively
easy to implement and use
• In a big data scenario, a file-based storage system can be built on top of another low-level abstraction to improve its performance and scalability
File-Based Storage Architecture
• The file-based storage
paradigm is shown in Figure 3
• File paths are organized in a
hierarchy and are used as the
entries for accessing data in
the physical storage
Figure 3: File-based storage model
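As a toy illustration of the model in Figure 3, hierarchical file paths can serve as the entries that map to data in the physical storage. The dictionary below is an illustrative stand-in for the physical store, not a real file-system API.

```python
# Sketch of the file-based model: hierarchical paths are the entries
# for accessing data in a (simulated) physical storage layer.
storage = {}   # path -> file contents ("physical" storage stand-in)


def write_file(path, data):
    storage[path] = data


def list_dir(prefix):
    """Return the immediate children of a directory prefix."""
    children = set()
    for path in storage:
        if path.startswith(prefix + "/"):
            children.add(path[len(prefix) + 1:].split("/")[0])
    return sorted(children)


write_file("/logs/2024/app.log", b"...")
write_file("/logs/2024/db.log", b"...")
write_file("/data/input.csv", b"...")
```

Contrast this with the flat namespace of object-based storage later in the chapter: here the hierarchy itself is part of how data is addressed.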
File-Based Storage Architecture Cont.
• For big data scenarios, Distributed File Systems (DFS) are commonly used as the basic storage systems
• Figure 4 shows a typical architecture of a
distributed file system which normally
contains one or several name nodes and a
bunch of data nodes
• The name node is responsible for maintaining the file entry hierarchy for the entire system, while the data nodes are responsible for the persistence of file data
Figure 4: Architecture of distributed file systems
File-Based Storage Architecture Cont.
• For a distributed infrastructure,
replication is very important for
providing fault tolerance in file-
based systems
• Normally, every file has multiple copies stored on the underlying storage nodes. If one of the copies is lost or its node fails, the name node automatically finds the next available copy, making the failure transparent to users
Figure 5: Architecture of Hadoop distributed file systems
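The replica-failover behavior described above can be sketched as follows. The `NameNode` and `DataNode` classes are illustrative stand-ins for the roles in Figure 5, not HDFS's actual API, and placement is simplified to the first N nodes.

```python
# Sketch of replica failover in a distributed file system: the name
# node records several copies per file and transparently falls back
# to the next available copy when a data node fails.
class DataNode:
    def __init__(self):
        self.files = {}       # file name -> bytes
        self.alive = True


class NameNode:
    def __init__(self, data_nodes, replication=3):
        self.data_nodes = data_nodes
        self.replication = replication
        self.replicas = {}    # file name -> list of node indices

    def write(self, name, data):
        targets = list(range(len(self.data_nodes)))[:self.replication]
        for idx in targets:              # simplified placement
            self.data_nodes[idx].files[name] = data
        self.replicas[name] = targets

    def read(self, name):
        # Try each replica in turn; a dead node is skipped transparently.
        for idx in self.replicas[name]:
            node = self.data_nodes[idx]
            if node.alive:
                return node.files[name]
        raise IOError("all replicas unavailable")


nodes = [DataNode() for _ in range(3)]
nn = NameNode(nodes)
nn.write("part-0", b"payload")
nodes[0].alive = False            # simulate a data-node failure
```

From the client's perspective a read still succeeds after the failure, which is exactly the transparency the slide describes.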
HDFS: Hadoop Distributed File System
• As shown in Figure 5, the architecture of HDFS consists of a name
node and a set of data nodes
• The name node manages the file system namespace, regulates access to files, and executes file system operations such as renaming and closing files
• Each data node performs read-write operations on the data it stores, and performs block creation, deletion, and replication according to the instructions of the name node
HDFS: Hadoop Distributed File System Cont.
• Data in HDFS is seen as files and automatically partitioned and
replicated within the cluster
• The capacity of storage for HDFS grows almost linearly by adding
new data nodes into the cluster
• HDFS also provides an automated balancer to improve the
utilization of cluster storage
• In addition, recent versions of HDFS have introduced a backup
node to solve the problem caused by single-node failure of the
primary name node
HDFS: Hadoop Distributed File System Cont.
• HDFS is an open-source distributed file system written in Java
• HDFS is an open-source implementation of the design behind the Google File System (GFS)
• HDFS is the core storage for the Hadoop ecosystem and many existing big data platforms
• HDFS inherits the design principles from GFS to provide highly scalable
and reliable data storage across a large set of commodity server nodes
• HDFS has demonstrated production scalability of up to 200 PB of
storage and a single cluster of 4500 servers, supporting close to a
billion files and blocks
HDFS: Hadoop Distributed File System Cont.
HDFS is designed to serve the following goals:
• Fault detection and recovery: since HDFS runs on a large number of commodity hardware components, component failure is expected to be frequent. Therefore, HDFS has mechanisms for quick, automatic fault detection and recovery
• Huge datasets: HDFS should scale to hundreds of nodes per cluster to manage applications with huge datasets
• Hardware at data: a requested task can be completed efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput
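The "hardware at data" goal can be illustrated with a toy scheduler that prefers a node already holding the needed block. The node names, block IDs, and `schedule` function are hypothetical, not the real Hadoop scheduler.

```python
# Sketch of data-local task scheduling: run the computation on a node
# that already stores a replica of the block, so no data crosses the
# network; fall back to a remote read only when no local node is free.
block_locations = {            # block ID -> nodes holding a replica
    "blk_001": {"node-a", "node-c"},
    "blk_002": {"node-b"},
}


def schedule(block_id, free_nodes):
    """Prefer a free node that stores the block; otherwise pick any free node."""
    local = block_locations[block_id] & free_nodes
    if local:
        return sorted(local)[0]       # data-local: no network transfer
    return sorted(free_nodes)[0]      # remote read as a fallback
```

The set intersection is the whole idea: locality is just "replica holders ∩ free nodes", and the fallback is what generates network traffic.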
Object-Based Storage
• In the object-based storage model, data
is managed as objects. As shown in
Figure 6, every object includes the data
itself, some meta-data, attributes and a
globally unique object identifier (OID)
• The object-based storage model abstracts the lower layers of storage away from administrators and applications
Figure 6: Object-based storage model
Object-Based Storage Architecture
The typical architecture of an object-based storage system is shown in Figure 7
Figure 7: Architecture of object-based storage
Object-Based Storage Architecture Cont.
• An object-based storage system normally uses a flat namespace, in which the identifiers of data and their locations are maintained as key-value pairs in the object server
• The object server provides location-independent addressing and
constant lookup latency for reading every object
• Meta-data is separated from the data and is itself maintained as objects in a meta-data server
• As a result, the meta-data can be processed, analyzed and manipulated in a standard and easier way without affecting the data itself
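A minimal sketch of the object model, assuming in-memory dictionaries as stand-ins for the object server and the separate meta-data server (all names are illustrative):

```python
# Sketch of object-based storage: a flat namespace maps a globally
# unique OID to the object's data, while meta-data lives in a separate
# meta-data server and can change without touching the data itself.
import uuid

object_server = {}     # OID -> raw data (flat key-value namespace)
metadata_server = {}   # OID -> meta-data dict, kept separately


def put_object(data, **metadata):
    oid = uuid.uuid4().hex         # globally unique object identifier
    object_server[oid] = data
    metadata_server[oid] = dict(metadata)
    return oid


def get_object(oid):
    return object_server[oid]      # location-independent lookup by key


def tag_object(oid, **metadata):
    """Update meta-data without rewriting the object's data."""
    metadata_server[oid].update(metadata)


oid = put_object(b"report bytes", owner="alice")
tag_object(oid, retention="30d")   # data is untouched by this call
```

Because the namespace is flat, adding capacity never requires reorganizing a hierarchy: new OIDs simply land in the same key-value space.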
Object-Based Storage Architecture Cont.
• Due to the flat architecture, object-based storage systems are very easy to scale out by adding additional storage nodes
• Besides, added storage is automatically incorporated as capacity available to all users
• Drawing on the object containers and the maintained meta-data, such systems can also provide much more flexible and fine-grained data policies at different levels