Big Data Analytics Unit 3
Source: https://www.geeksforgeeks.org/what-is-dfsdistributed-file-system/
A Distributed File System (DFS) is a file system that is spread across multiple file servers or multiple
locations. It allows programs to access and store files as if they were local, so that users can access
files from any computer on the network.
The main purpose of the Distributed File System (DFS) is to allow users of physically distributed
systems to share their data and resources by using a Common File System. A collection of
workstations and mainframes connected by a Local Area Network (LAN) is a typical configuration for a
Distributed File System. A DFS is implemented as a part of the operating system. In DFS, a namespace is
created, and this process is transparent to the clients.
A Windows DFS has two components, the namespace component and the file replication component. In the
case of failure or heavy load, these components together improve data availability by allowing data shared
in different locations to be logically grouped under one folder, which is known as the “DFS root”.
It is not necessary to use both the components of DFS together, it is possible to use the namespace
component without using the file replication component and it is perfectly possible to use the file
replication component without using the namespace component between servers.
Early iterations of DFS made use of Microsoft’s File Replication Service (FRS), which allowed for
straightforward file replication between servers. FRS recognises new or updated files and distributes
the most recent version of the whole file to all servers.
Windows Server 2003 R2 introduced “DFS Replication” (DFSR). It improves on FRS by copying only the
portions of files that have changed and by minimising network traffic with data compression.
Additionally, it provides users with flexible configuration options to manage network traffic on a
configurable schedule.
Desirable features of a DFS:
• Transparency:
o Structure transparency – The client does not need to know the number or locations of
the file servers and storage devices. Multiple file servers should be provided for
performance, adaptability, and dependability.
o Access transparency – Both local and remote files should be accessible in the same
manner. The file system should automatically locate the accessed file and send it to
the client’s side.
o Naming transparency – The name of a file should give no hint of its location. Once a
name is given to a file, it should not change while the file is transferred from one
node to another.
• User mobility: The user can log in from any location. The system automatically brings the
user’s home directory to the node where the user logs in.
• Simplicity and ease of use: The user interface of a file system should be simple and the
number of commands needed to work with files should be minimal.
• High availability: A Distributed File System should be able to continue operations in case of
any partial failures like a link failure, a node failure, or a storage drive crash.
• Scalability: A DFS needs to enable scaling by adding new machines or combining two
networks to meet the scalability requirements over time. Meanwhile, it needs to ensure
availability and performance.
• High Reliability: The likelihood of data loss should be minimized to permissible levels in a
distributed file system. Users should not feel the need to make backup copies of their files in
order to handle the risk of data loss. Rather, a file system should create backup copies of key
files that can be used if the originals are lost. Many file systems employ stable storage as a
high-reliability strategy.
• Data Integrity: Multiple users share and access a file system. The integrity of data saved in a
shared file must be guaranteed by the file system. That is, concurrent access requests from
many users who are competing for access to the same file must be correctly synchronized
using a concurrency control method.
• Security: A distributed file system should be secure. It should safeguard the data from
unwanted & unauthorized access, and incorporate adequate data security mechanisms.
Unix: Network File System (NFS) is a distributed file system protocol originally developed by Sun
Microsystems (Sun) in 1984, allowing a user on a client computer to access files over a computer
network much like local storage is accessed.
Windows: The server component of the Distributed File System was initially introduced as an add-on
feature. It was added to Windows NT 4.0 Server and was known as “DFS 4.1”. Then later on it was
included as a standard component for all editions of Windows 2000 Server. Client-side support has
been included in Windows NT 4.0 and also in later versions of Windows.
Linux: Linux kernels 2.6.14 and later come with an SMB (Server Message Block) client in the VFS
(Virtual File System), known as “cifs” (Common Internet File System), which supports DFS.
Mac OS: Mac OS X 10.7 (Lion) and onwards supports DFS.
A DFS typically provides the following capabilities:
• File Transparency - users can access files without knowing where they are physically stored
on the network.
• Load Balancing - the file system can distribute file access requests across multiple computers
to improve performance and reliability.
• Data Replication - the file system can store copies of files on multiple computers to ensure
that the files are available even if one of the computers fails.
• Security - the file system can enforce access control policies to ensure that only authorized
users can access files.
• Scalability - the file system can support a large number of users and a large number of files.
• Concurrent Access - multiple users can access and modify the same file at the same time.
• Fault Tolerance - the file system can continue to operate even if one or more of its
components fail.
• Data Integrity - the file system can ensure that the data stored in the files is accurate and has
not been corrupted.
• File Migration - the file system can move files from one location to another without
interrupting access to the files.
• Data Consistency - changes made to a file by one user are immediately visible to all other
users.
• Support for different file types - the file system can support a wide range of file types,
including text files, image files, and video files.
Additional Information:
• NFS – NFS stands for Network File System. It is a client-server architecture that allows a
computer user to view, store, and update files remotely. The protocol of NFS is one of the
several distributed file system standards for Network-Attached Storage (NAS).
• CIFS – CIFS stands for Common Internet File System. CIFS is a dialect (implementation) of the
SMB (Server Message Block) protocol, designed by Microsoft.
• SMB – SMB stands for Server Message Block. It is a file-sharing protocol invented by IBM. The
SMB protocol was created to allow computers to perform read and write operations on files on
a remote host over a Local Area Network (LAN). The directories on the remote host that can be
accessed via SMB are called “shares”.
Working of DFS:
Advantages:
Disadvantages:
• In a Distributed File System, the nodes and the connections between them need to be secured,
which is very challenging; security is therefore a constant concern.
• Messages and data may be lost in the network while moving from one node to another.
• Database connectivity in a Distributed File System is complicated.
• Handling a database is also harder in a Distributed File System than in a single-user system.
• The network can become overloaded when all nodes try to send data at once.
One of the challenges of working with big data is that it is too big to manage on a single server
(irrespective of the storage capacity or computing power of the server). This is because it is not
feasible (economically as well as technically) to keep adding capacity to a single server. Instead, the
data needs to be distributed across multiple nodes in a cluster by scaling out, to make use of the
computing power of each node.
A distributed file system (DFS) enables businesses to manage the accessing of big data across
multiple clusters or nodes, allowing them to read big data quickly and perform multiple parallel reads
and writes.
• Transparent Local Access — Data is accessed as if it’s on a user’s own device or computer.
• Location Independence — Users may have no idea where file data physically resides.
• Massive Scaling — Teams can add as many machines as they want to a DFS to scale out.
• Fault Tolerance — A DFS will continue to operate even if some of its servers or disks fail
because machines are connected and the DFS can gracefully failover.
3.2 Google File System
Google File System (GFS or GoogleFS) is a scalable distributed file system (DFS) developed at Google
Inc., to meet the company’s growing data processing needs. GFS offers fault tolerance, dependability,
scalability, availability, and performance across the network and all connected nodes. GFS is made up
of a number of storage systems constructed from inexpensive commodity hardware parts.
GFS was originally designed for Google’s use case of searching and indexing the web. GFS addresses
the following concerns:
• Fault Tolerance: Google uses commodity machines because they are cheap and easy to
acquire, but the software behind GFS needs to be robust to handle failures of machines,
disks, and networks.
• Large Files: It’s assumed most files are large (i.e. ≥ 100 MB). Small files are supported, but
not optimized for.
• Optimized for Reads + Appends: The system is optimized for reading (specifically large
streaming reads) or appending because web crawling and indexing heavily relied on these
operations.
• High and Consistent Bandwidth: It’s acceptable to have slow operations now and then, but
the overall amount of data flowing through the system should be consistent. This aligns with
Google’s crawling and indexing purposes.
GFS manages two types of data namely File Metadata and File Data.
The GFS node cluster consists of a single master and several chunk servers that various client systems
regularly access. Chunk servers store data as Linux files on local disks. The stored data is split into
large (64 MB) chunks, each of which is replicated at least three times across the network. The larger
chunk size reduces network overhead.
GFS meets Google’s huge cluster requirements. Hierarchical directories with path names are used to
store files. The master is responsible for managing metadata, including the namespace, access control
information, and the mapping from files to chunks. The master communicates with each chunk server
through periodic heartbeat messages and keeps track of its status.
The largest GFS clusters comprise more than 1,000 nodes with over 300 TB of disk storage capacity,
available for constant access by hundreds of clients.
Components of GFS
A group of connected computers (also known as a cluster) is what constitutes GFS. There could be
hundreds or even thousands of computers in each cluster. There are three basic entities included in
any GFS cluster as follows:
1. GFS Clients: They are computer programs or applications which are used to request files.
Requests are made to access and modify already-existing files or add new files to the system.
2. GFS Master Server: It serves as the cluster’s coordinator. It preserves a record of the cluster’s
actions in an operation log and also keeps track of the metadata that describes the chunks. This
metadata tells the master which file each chunk belongs to and where it fits within that file.
3. GFS Chunk Servers: They are the GFS’s workhorses, responsible for storing and serving
chunks of data. They store file chunks of 64 MB each. Chunks are never routed through the master
server; instead, chunk servers deliver the requested chunks directly to the client. To ensure
reliability, GFS makes several copies of each chunk (three by default) and stores them on different
chunk servers.
Features of GFS
• Namespace management and locking (applying read or write lock at a node or file or
directory level)
• Fault tolerance
• Optimal network traffic - Reduced client-master interaction
• High availability
• Critical data replication
• Automatic and efficient data recovery
• High aggregate throughput.
Advantages of GFS
1. High accessibility - Data is still accessible even if a few nodes fail (because of the replication
component).
2. High throughput - Many nodes operate concurrently, so the aggregate throughput is very high.
3. Dependability - Corrupted data is detected and re-replicated from good copies, so data retrieval
is highly dependable.
Disadvantages of GFS
1. Not the best fit for small files; performs poorly as a general-purpose file system.
2. The master is a single point of failure: if the master fails, the whole service fails.
3. Performs well with sequential access (I/O) but has very limited support for random access.
4. Suitable mainly for data that is written once and later only read or appended.
Apache Nutch is a highly extensible and scalable open-source web crawler software project. Nutch is
coded entirely in the Java programming language. It stores data in a language-independent format. It
has a highly modular architecture, allowing developers to create plug-ins for parsing, data retrieval,
querying and clustering.
Apache Nutch was initiated or founded by Doug Cutting (creator of both Apache Lucene and Apache
Hadoop) and Mike Cafarella. They wrote the web crawler (also known as “fetcher” or "robot") from
scratch for the Apache Nutch project. In June 2003, they used it to crawl 100 million web pages.
In January, 2005, Nutch joined the Apache Incubator, from which it graduated to become a
subproject of Apache Lucene in June of that same year. Since April, 2010, Nutch has been considered
an independent, top level project of the Apache Software Foundation.
To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project
implemented a MapReduce processing model and a distributed file system. These two were later
grouped under a subproject called Hadoop.
Hadoop is an open-source framework based on Java that manages the storage and processing of
large amounts of data for applications. Hadoop uses distributed storage and parallel processing to
handle big data and analytics jobs, breaking workloads down into smaller workloads that can be run
at the same time.
Four modules comprise the primary Hadoop framework and work collectively to form the Hadoop
ecosystem. These are 1) Hadoop Distributed File System (HDFS), 2) Yet Another Resource Negotiator
(YARN), 3) Hadoop Common, and 4) MapReduce. Hadoop Common is a core component of the
Apache Hadoop framework that provides the shared Java libraries and utilities needed by the other
Hadoop modules. The wider ecosystem builds on these modules with additional tools and applications
such as HBase, Hive, Apache Spark, Sqoop, and Flume.
The design of HDFS is based on the Google File System. It was originally built as infrastructure for the
Apache Nutch web search engine project but has since become a member of the Hadoop Ecosystem.
In the early years of the internet, web crawlers started to appear as a way for people to search for
information on web pages. During this era, search engines such as Yahoo and Google emerged.
Nutch, a web crawler, was developed to distribute its data and processing across multiple
computers simultaneously. Work on Nutch then moved to Yahoo, where its distributed storage and
processing parts were split out and grew into Apache Hadoop. Today, Hadoop is typically used for
batch processing, while Apache Spark is used for real-time and in-memory processing.
Nowadays, Hadoop's structure and framework are managed by the Apache Software Foundation,
a global community of software developers and contributors.
HDFS was born from this and is designed to replace hardware storage solutions with a better, more
efficient method - a virtual filing system. When it first came onto the scene, MapReduce was the only
distributed processing engine that could use HDFS. More recently, alternative Hadoop data services
components like HBase and Solr also utilize HDFS to store data.
What is Hadoop Distributed File System (HDFS) in the world of big data?
The term "big data" refers to all the data that's very huge in volume and very challenging to store,
process and analyze. HDFS can be used to store, process and analyze big data.
As we now know, Hadoop is a framework that works by using parallel processing and distributed
storage. This can be used to sort and store big data, as it can't be stored in traditional ways.
In fact, HDFS is one of the most commonly used systems for handling big data, and is used by companies
such as Netflix, Expedia, and British Airways.
HDFS provides the following core capabilities for organizing and handling big data:
1. Fault tolerance - HDFS has been designed to detect faults and recover quickly and automatically,
ensuring continuity and reliability.
2. Speed – Thanks to its cluster architecture, HDFS can sustain transfer rates of around 2 GB per second.
3. Access to more types of data (specifically, streaming data) - Because of its design to handle
large amounts of data it allows for high data throughput rates making it ideal to support
streaming data (through stream processing) and stored data (through batch processing).
5. Scalable - HDFS enables scaling of resources according to the size of the file system. HDFS
includes vertical and horizontal scalability mechanisms.
6. Data locality – In case of HDFS, the data resides in data nodes. There is no need for the data
to move to where the central processing facility or computational unit is. Computing process
happens where the data is. By shortening the distance between the data and the computing
process, HDFS decreases network congestion and improves performance.
7. Cost efficiency – The data is stored in inexpensive commodity hardware that can be virtual.
HDFS drastically reduces file system metadata and file system namespace data storage costs.
There is no license fee because HDFS is an open-source software.
8. Stores large amounts of data - HDFS can store data of all varieties and sizes. This includes
structured, semi-structured and unstructured data.
9. Flexible – Unlike traditional databases, there is no need to process collected data before
storing it. It is possible to store as much data as required and decide how to use it later.
The four primary Hadoop modules are described below.
1. Hadoop Distributed File System (HDFS): HDFS is the primary component of the Hadoop
ecosystem. It is a distributed file system in which individual Hadoop nodes operate on data
that resides in their local storage. This reduces network latency, providing high-throughput
access to application data. In addition, administrators don’t need to define schemas up front.
2. Yet Another Resource Negotiator (YARN): YARN performs scheduling and resource allocation
across the Hadoop system. It is a resource-management platform responsible for managing
compute resources in clusters and using them to schedule users’ applications.
3. MapReduce: MapReduce is a programming model for large-scale data processing. In the
MapReduce model, subsets of larger datasets and instructions for processing the subsets are
dispatched to multiple different nodes, where each subset is processed by a node in parallel
with other processing jobs. After processing, the results from the individual subsets are
combined (reduced) into a smaller, more manageable dataset.
4. Hadoop Common: Hadoop Common includes the libraries and utilities used and shared by
other Hadoop modules.
Beyond HDFS, YARN, and MapReduce, the entire Hadoop open-source ecosystem continues to grow
and includes many tools and applications to help collect, store, process, analyze, and manage big
data. These include Apache Pig, Apache Hive, Apache HBase, Apache Spark, Presto, and Apache
Zeppelin.
Hadoop allows for the distribution of datasets across a cluster of commodity hardware. Processing is
performed in parallel on multiple servers simultaneously.
Software clients input data into Hadoop. HDFS handles metadata and the distributed file system.
MapReduce then processes and converts the data. YARN divides the jobs across the computing
cluster.
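As a small illustration of this flow, the sketch below runs the stock WordCount example that ships with
Hadoop. It assumes a working installation with HADOOP_HOME set; the local ./books directory and the
HDFS paths /input and /output are hypothetical.
# Copy some local text files into HDFS (illustrative paths)
hdfs dfs -mkdir -p /input
hdfs dfs -put ./books/*.txt /input
# Submit a MapReduce job; YARN schedules it across the cluster
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output
# Inspect the combined (reduced) result stored back in HDFS
hdfs dfs -cat /output/part-r-00000 | head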
All Hadoop modules are designed with a fundamental assumption that hardware failures of
individual machines or racks of machines are common and should be automatically handled in
software by the framework.
A cluster is a collection of interconnected machines. A simple computer cluster is a group of
computers connected to each other through a LAN (Local Area Network). The nodes in a cluster share the
data pertaining to a common task, and all nodes in the cluster work together as a single unit.
1. Scalability: Hadoop clusters allow scaling up and scaling down the number of nodes by adding or
removing commodity hardware. This ability to scale the number of servers in the Hadoop cluster up or
down enables Hadoop to store and process different volumes of data on a need basis.
2. Flexibility: A Hadoop cluster is flexible in that it can handle data irrespective of its type and
structure. With this property, Hadoop can process any type of data from online web platforms.
3. Speed: Hadoop clusters are very efficient in processing data distributed among the clusters. This is
because of the architecture of MapReduce that ensures speed of processing.
4. No data loss: The risk of losing data from any node in a Hadoop cluster is very low, because
Hadoop clusters replicate data. The failure of a single node does not result in loss of data because,
with the default replication factor of 3, the same data is also stored as a backup on two other nodes
(a sketch of inspecting and changing the replication factor follows this list).
5. Economical: Hadoop clusters are cost-efficient and economical. This is because Hadoop uses
commodity hardware and the data is distributed in the cluster among all the nodes. To increase
storage, we only need to add more commodity hardware.
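A minimal sketch of inspecting and tuning replication, assuming a running cluster and a hypothetical
file /data/sample.csv already stored in HDFS:
# Show how the file's blocks are replicated across DataNodes
hdfs fsck /data/sample.csv -files -blocks -locations
# Set the replication factor of the file to 3 (-w waits until replication completes)
hdfs dfs -setrep -w 3 /data/sample.csv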
There are two types of Hadoop clusters – 1) Single Node Hadoop Cluster and 2) Multiple Node
Hadoop Cluster.
1. Single Node Hadoop Cluster: Single Node Hadoop Cluster has a single node. All Hadoop Daemons
such as Name Node, Data Node, Secondary Name Node, Resource Manager, and Node Manager run
on the same node. All corresponding Hadoop processes are handled by a single JVM (Java Virtual
Machine) instance.
2. Multiple Node Hadoop Cluster: Multiple Node Hadoop Cluster contains multiple nodes. In this
kind of cluster, Hadoop Daemons can run on different nodes in the same cluster setup. In a multiple
node Hadoop cluster, it is general practice to use nodes with higher processing capability for the
master daemons, i.e. the NameNode and ResourceManager, and nodes with lower processing capability for
the slave daemons, i.e. the NodeManager and DataNode.
Source: https://www.geeksforgeeks.org/difference-between-rdbms-and-hadoop/
RDBMS (Relational Database Management System): RDBMS is a data management system, which is
based on a data model. RDBMS uses tables for data storage. Each row of the table represents a
record and column represents an attribute of data. The objective of RDBMS is to store, manage, and
retrieve structured data as quickly and reliably as possible. The way data is organized and
manipulated in an RDBMS differs from the way data is organized and manipulated in other types of
databases. Transactions in an RDBMS need to adhere to the ACID (atomicity, consistency, isolation,
durability) properties.
Hadoop: It is an open-source software framework used for storing big data. Data processing
applications run on a group of commodity hardware. Hadoop offers capabilities to ensure large
storage capacity, scalability and high processing power. It can manage multiple concurrent processes.
It is used in processing big data for predictive analysis, data mining and machine learning. It can
handle structured, semi-structured and unstructured data. It is more flexible than RDBMS in storing,
processing, and managing data.
RDBMS vs Hadoop:
1. RDBMS: Traditional row-column based databases, basically used for data storage, manipulation and
retrieval. Hadoop: An open-source software framework used for storing big data and running applications
or processes concurrently.
3. RDBMS: Best suited for an OLTP environment. Hadoop: Best suited for a Big Data environment.
6. RDBMS: SQL is used to store and retrieve data. Hadoop: Both SQL and NoSQL are used.
9. RDBMS: The data schema is static. Hadoop: The data schema is dynamic.
10. RDBMS: A high level of data integrity is essential. Hadoop: Data integrity can be at a lower level.
11. RDBMS: Cost applies for the licensed software. Hadoop: Free of cost, as it is open-source software.
3.6 Data integrity in Hadoop
Data integrity refers to the accuracy, completeness, and quality of data. Preserving the integrity of
data is an essential and continuous process.
Data integrity is a broad discipline that influences how data is collected, stored, accessed and used.
The idea of integrity is a central element of many regulatory compliance frameworks, such as the
General Data Protection Regulation (GDPR).
Data integrity is preserved in data that's kept complete, accurate, consistent and safe throughout its
entire lifecycle. Data that has data integrity needs to have the following characteristics.
Completeness – To ensure completeness, data is maintained in its full form and no data elements are
filtered, truncated or lost. For example, if 100 students attend an entrance examination, the
examination results data needs to reflect the results of all 100 students. The results of students who
failed or scored low marks should not be omitted or removed from the results.
Accuracy - Data should not be altered or aggregated in any way that affects the ability to perform
data analytics. For example, examination results should not be rounded up or down at random. Any
criteria or special conditions need to be well-documented and understood. Repeating the same type
of analysis at different points in time should return the same results.
Consistency – Data is consistent when it remains unchanged regardless of how, or how often, it's
accessed and no matter how long it's stored. For example, data accessed a year from now will have
to be the same data that's generated or accessed today. The same holds good for data accessed
through different systems or devices.
Safety - Data should be maintained in a secure manner and can only be accessed and used by
authorized applications or individuals. Further, data should be protected from malicious actors or
hackers. Data security involves considerations such as authentication, authorization, encryption,
backup or other data protection, and access logging.
Data Corruption & Data Integrity - Data corruption can be caused by hardware failures, human error,
a malicious action or failure of data security. Data corruption occurs when any unwanted or
unexpected changes to data take place during storage, access or processing. All such occurrences
can lead to failure or loss of data integrity.
1) Missing or inaccurate data might result in incorrect analysis and ineffective business
decisions or delayed actions.
2) Any issue in data integrity can lead to mishandling of customer requests or customer
interactions. This is because, data integrity ensures that customers are serviced
correctly (E.g., through timely complaint or issue resolution and reporting). Sensitive
data of customers must be protected from data loss or theft.
3) According to certain legal and other compliance requirements, businesses are
required to retain data for a period of time (say five years or ten years) to ensure
that business processes are followed in accordance with prevailing industry
standards and government regulations. Data integrity is vital for complete, accurate
and consistent reporting for all compliance purposes; otherwise, the business may
be out of compliance and subject to huge penalties and other legal actions.
Types of Data Integrity:
Data integrity falls into two broad categories – Physical Integrity and Logical Integrity.
Physical integrity - Physical integrity relates to the storing and retrieving of data -- primarily in
relation to the storage devices, memory components and any associated hardware. For example, if a
hard drive or memory device is damaged, then the stored data is affected as well. Events such as power
failures, hardware faults, natural disasters and physical damage can affect data integrity by damaging
the physical storage hardware.
Organizations can enhance the physical integrity of data by implementing hardware infrastructure,
including redundant storage subsystems such as RAID, with battery-protected write cache, using
advanced error-correcting memory devices, implementing clustered and distributed file systems, and
using error-detecting algorithms to detect data changes in transit. Organizations often adopt a
variety of hardware devices and techniques to enhance data's physical integrity.
Logical integrity - Logical integrity can be affected by poor software design, software bugs, or human
error. There are four major types of logical integrity:
1. Entity integrity - This ensures that no data element is repeated and that no critical data entry
is blank or null. This is a common logical integrity consideration in relational
database systems. For example, there can be only one student with a specific student id, and
student name cannot be null or blank.
2. Referential integrity - These rules ensure that relationships between data in different tables
remain valid, so that only authorized and consistent changes, additions or deletions can occur.
For example, one cannot delete a department from the department table while there are one or
more employees working in that department.
3. Domain integrity - This reflects the format, type, amount and value range or scope of
acceptable data values within a database. For example, marks scored by a student in an
examination are supposed to be numerical. A non-numeric value is rejected because it violates
domain integrity. Such integrity checks can be pre-defined in relational databases when we
create tables.
4. User-defined integrity - These are additional rules and constraints that are implemented in
accordance with the organization's specific needs and aren't otherwise covered by the first
three integrity types. For example, the marks scored by a student have to be between 0 and 100.
Such integrity checks or validations are performed at application level.
An issue in physical integrity can impact logical integrity. For example, a null data stream captured
from an IoT sensor might violate logical or entity integrity when we attempt to store the data in a
database table. In that case, the root cause of the null data can be traced to a failed or
defective IoT sensor.
Data Integrity Risks
• Human errors - Data may be accidentally deleted, entered or altered inaccurately -- such as a
wrong customer address -- or left incomplete.
• Transfer errors - Data may be damaged or lost -- or even stolen -- in transit between two
systems, such as a network failure or incorrect storage destination, or between two physical
locations, such as transporting a storage device filled with data to another location.
• Malicious acts - Malware, hacking and other cyber threats can result in a loss of both data
integrity and data security.
Proper, well-documented and strongly enforced configuration standards are essential in all data
integrity and data security situations.
The consequences of data integrity loss can range from a minor annoyance to a major business
catastrophe -- depending on the amount of loss and the nature of the data involved. Business and
technology leaders invest considerable time and resources to understand and prevent data integrity
loss.
Data integrity is a broad area that involves people, processes, rules and various tools to provide
guardrails and support. While there's no single universal solution for data integrity, there are
numerous tactics that can help to build an environment that supports data integrity. Common tactics
include the following:
2. Establish an integrity culture - Data integrity is more meaningful when business leaders and
managers care about it as a business goal. Collaboration and top-down buy-in to integrity
concepts can drive better data integrity efforts.
3. Validate the data – Data needs to be checked or validated. This can be especially important
when data is acquired from third-party sources. For example, if a business is running
analytics on several different data sets, it is worth checking that all the data sets show
consistent – or at least sensible -- data.
4. Process data sensibly - Check and remove duplicate entries in data sets and ensure that any
data pre-processing -- such as data normalization or aggregation -- doesn't affect the base
data.
5. Protect data - Regular data backups can help to ensure data integrity by copying data to a
second location. This ensures that data is always available in the event of hardware,
infrastructure or security violations.
6. Implement strong security - Ensure that data access is protected using strong authentication
and authorization controls. Log all data access and retain logs to audit activity. Encryption for
data at rest and in-flight can help to protect it from unauthorized access or theft.
The terms integrity, security and quality are sometimes improperly used as interchangeable terms.
Although the three ideas are closely related, they possess unique attributes that distinguish them
from their companion terms.
Data quality refers to the reliability of data. Good quality data must be accurate, complete, unique
with no duplicates and timely enough to be useful.
Data security is the infrastructure, tools and rules used to ensure that only authorized applications
and users can access data; that the data is used in a business-compliant manner; and that data is
preserved or backed up against loss, theft or malfeasance.
Data integrity then provides a broader umbrella that embraces aspects of data quality and security,
ensuring proper retention, appropriate destruction, and adequate compliance with relevant industry
and government regulations.
With reference to Hadoop, data integrity is about making sure that no data is lost or corrupted
during storage or processing of big data.
The occurrence of data corruption is highly likely in Hadoop because the amount of data being
written or read is very large in volume.
To detect and resolve data corruption, a checksum is computed when data is written to disk for the
first time and checked again when the data is read back from disk. A checksum is a small block of data
derived from another block of digital data, used to detect errors that may have been introduced
during transmission or storage.
HDFS computes checksums for each data block and stores them in a separate hidden file in the same
HDFS namespace. HDFS uses the 32-bit Cyclic Redundancy Check (CRC-32) as the default checksum
algorithm because it is only 4 bytes long and adds less than 1% storage overhead.
If the newly computed checksum matches the stored checksum, the data is not corrupted; otherwise, the
data is considered corrupted. It is possible that it is the checksum, rather than the data, that is
corrupt, but this is very unlikely because the checksum is much smaller than the data.
DataNodes are responsible for verifying the data they receive before storing the data and its
checksum. A checksum is computed for the data they receive from clients and from other DataNodes
during replication.
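As a small, hedged illustration (the file /data/sample.csv is a hypothetical path assumed to exist in
HDFS), the checksum that HDFS maintains for a file can be displayed from the command line:
# Ask HDFS for the stored checksum of a file (derived from its per-block CRC checksums)
hdfs dfs -checksum /data/sample.csv
# Download a copy, upload it under a new name, and compare; with the same block size and
# checksum settings, identical contents should normally print identical checksums
hdfs dfs -get /data/sample.csv ./sample_copy.csv
hdfs dfs -put ./sample_copy.csv /data/sample_copy.csv
hdfs dfs -checksum /data/sample_copy.csv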
Hadoop can heal corrupted data by copying one of the good replicas to produce a new, uncorrupted
replica. This is possible because HDFS keeps multiple replicas of each block.
HDFS uses NameNodes and DataNodes. NameNode holds the meta data and DataNodes hold the
actual data. We will learn more about NameNode and DataNode in the next unit.
When a client detects an error while reading a block, it reports the bad block and the DataNodes it
was trying to read from. This information is reported to the NameNode before throwing a Checksum
Exception. The NameNode marks the block replica as corrupt so that it does not direct any more clients
to it or copy this replica to other DataNodes. The NameNode then schedules a copy of a good replica of
the block to another DataNode, to make sure that the replication factor is maintained at the expected
level.
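A minimal sketch, assuming administrator access to the cluster, of how corrupt or missing blocks can be
inspected from the command line:
# Summarise the health of the file system, including under-replicated and corrupt blocks
hdfs fsck /
# List only the files that currently have corrupt blocks
hdfs fsck / -list-corruptfileblocks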
To use HDFS you need to install and set up a Hadoop cluster. This can be a single node set up which is
more appropriate for first-time users, or a cluster set up for large, distributed clusters. You then need
to familiarize yourself with HDFS commands, such as the below, to operate and manage your system.
Command Description
-rm Removes file or directory
-ls Lists files with permissions and other details
-mkdir Creates a directory named path in HDFS
-cat Shows contents of the file
-rmdir Deletes a directory
-put Uploads a file or folder from a local disk to HDFS
-rmr Deletes the file identified by path or folder and subfolders
-get Copies a file or folder from HDFS to the local file system
-count Counts number of files, number of directories, and file size
-df Shows free space
-getmerge Merges multiple files in HDFS into a single local file
-chmod Changes file permissions
-copyToLocal Copies files to the local system
-stat Prints statistics about the file or directory
-head Displays the first kilobyte of a file
-usage Returns the help for an individual command
-chown Changes the owner and group of a file
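A short, hedged usage sketch of a few of these options. The paths and file names are hypothetical, and
the older "hadoop fs" prefix generally works in place of "hdfs dfs":
hdfs dfs -mkdir /user/yourUserName/demo                          # create a directory
hdfs dfs -put localfile.txt /user/yourUserName/demo              # upload a local file
hdfs dfs -ls /user/yourUserName/demo                             # list it with permissions
hdfs dfs -cat /user/yourUserName/demo/localfile.txt              # print its contents
hdfs dfs -get /user/yourUserName/demo/localfile.txt ./copy.txt   # download a copy
hdfs dfs -df -h /                                                # show free space, human readable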
How does HDFS work?
HDFS uses NameNodes and DataNodes. NameNode holds the meta data and DataNodes hold the
actual data. HDFS allows the quick transfer of data between computers or nodes. When HDFS takes
in data, it's able to break down the information into blocks, distributing them to different nodes in a
cluster.
Data is broken down into blocks and distributed among the DataNodes for storage; these blocks are also
replicated across nodes, which allows for efficient parallel processing. You can access, move
around, and view data through various commands. HDFS DFS options such as "-get" and "-put" allow
you to retrieve and move data around as necessary.
HDFS is designed to be highly alert and can detect faults quickly. The file system uses data replication
to ensure every piece of data is saved multiple times and then assigns it across individual nodes,
ensuring at least one copy is on a different rack than the other copies.
When a DataNode stops sending heartbeat signals to the NameNode, the NameNode removes the DataNode from
the cluster and operates without it. If that DataNode later comes back online, it can be allocated to
the cluster again. Plus, since the data blocks are replicated across several DataNodes, removing one
will not lead to file corruption of any kind.
Installation of HDFS
There are two types of clusters - single-node and multi-node. Depending on what you require, you
can either use a single-node or multi-node cluster.
A single node cluster means only one DataNode is running. It will include the NameNode, DataNode,
resource manager, and node manager on one machine.
For some industries, a single node cluster is good enough. For example, in the medical field, if you're
conducting studies and need to collect, sort, and process data in a sequence for specific research
that requires limited storage, you can use a single-node cluster. This can easily handle the data on a
smaller scale, as compared to data distributed across many hundreds of machines. To install a single-
node cluster, follow these steps:
1. Download the Java 8 Package. Save this file in your home directory.
5. Add the Hadoop and Java paths in the bash file (.bashrc).
10. Edit yarn-site.xml and set the required property (an illustrative value is included in the sketch below).
When you complete all these, you should now have a successfully installed HDFS.
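As an illustration of step 5, the entries added to ~/.bashrc typically look like the following. The
exact paths depend on where Java and Hadoop were extracted, so these values are assumptions:
# Assumed install locations; adjust to your own environment
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=$HOME/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# For step 10, yarn-site.xml is usually given the property
#   yarn.nodemanager.aux-services = mapreduce_shuffle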
To access HDFS files you can download them from HDFS to your local file system. You can also access
HDFS using its web user interface. Simply open your browser and type "localhost:50070" (on Hadoop 3.x
and later the NameNode web UI defaults to "localhost:9870") into the address bar. From there, you can
see the web user interface of HDFS and move to the utilities tab on the right-hand side. Then click
"browse file system"; this shows you a full list of the files located on your HDFS.
Example A
To delete a directory, you can use the "-rmdir" option (note: this only works if the directory is empty).
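A minimal sketch, assuming a hypothetical empty directory named testDir under your HDFS home directory:
hdfs dfs -rmdir testDir
# or, using the older file system shell prefix
hadoop fs -rmdir testDir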
Example B
When you have multiple files in an HDFS directory, you can use the "-getmerge" command. This merges
the files into a single file, which it writes to your local file system.
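A sketch, assuming a hypothetical HDFS directory logs/ containing several part files to be combined:
hdfs dfs -getmerge logs/ ./merged-logs.txt
# or
hadoop fs -getmerge logs/ ./merged-logs.txt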
Example C
When you want to upload a file from your local file system to HDFS, you can use the "-put" command.
You specify the local file you want to copy and the HDFS destination you want to copy it to.
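A sketch, assuming a hypothetical local file report.txt and your HDFS home directory as the destination:
hdfs dfs -put report.txt /user/yourUserName/
# or
hadoop fs -put report.txt /user/yourUserName/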
Example D
The count command is used to report the number of directories, number of files, and content size under
a given path in HDFS.
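A sketch, counting everything under your (assumed) HDFS home directory:
hdfs dfs -count /user/yourUserName
# or, with human-readable sizes
hadoop fs -count -h /user/yourUserName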
Example E
The "chown" command can be used to change the owner and group of a file. To activate this, use the
below:
Or
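A sketch, assuming a hypothetical file report.txt and a new owner and group newuser:newgroup
(superuser rights are normally required to change ownership):
hdfs dfs -chown newuser:newgroup /user/yourUserName/report.txt
# or
hadoop fs -chown newuser:newgroup /user/yourUserName/report.txt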
Your HDFS home directory should be /user/yourUserName. To view the contents of your HDFS home
directory, use the "-ls" option. As you're just getting started, you won't see anything listed at this
stage. To view the contents of a non-empty directory, list a higher-level path such as /user; you can
then see the names of the home directories of all the other Hadoop users. You can now create a test
directory called 'testHDFS' with the "-mkdir" option and verify that it exists by listing your home
directory again.
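A minimal sketch of this sequence (paths are relative to the assumed home directory /user/yourUserName):
hdfs dfs -ls                      # list your home directory (empty at first)
hdfs dfs -ls /user                # list the home directories of all Hadoop users
hdfs dfs -mkdir testHDFS          # create the test directory
hdfs dfs -ls                      # verify that testHDFS now exists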
Copying a file
To copy a file from your local file system to HDFS, first create a small file you wish to copy (for
example, testFile) and check its contents with:
cat testFile
You will then need to copy the file to HDFS. To copy files from Linux into HDFS, use the
"-copyFromLocal" command; notice that you cannot use "-cp" here, because "-cp" is used to copy files
within HDFS.
Now you just need to confirm that the file has been copied over correctly, by listing your HDFS home
directory.
When you copied testFile, it was put into your HDFS home directory. Now you can move it into the
testHDFS directory you've already created using the "-mv" command, followed by two listings: the first
command moves testFile from the HDFS home directory into testHDFS, the second shows that it is no
longer in the HDFS home directory, and the third confirms that it has been moved into the testHDFS
directory.
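A hedged sketch of the whole sequence (the file contents and yourUserName are illustrative):
echo "This is a test file" > testFile              # create a small local file
cat testFile                                       # check its contents
hdfs dfs -copyFromLocal testFile /user/yourUserName/   # copy it from Linux into HDFS
hdfs dfs -ls                                       # confirm it arrived in the home directory
hdfs dfs -mv testFile testHDFS/                    # move it into testHDFS
hdfs dfs -ls                                       # no longer in the home directory
hdfs dfs -ls testHDFS/                             # now inside testHDFS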
Checking disk usage
The "-du" option shows how much space you are using in your HDFS home directory, while "-df" shows how
much space is used and available across the whole cluster.
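A sketch (the -h flag prints human-readable sizes; the home path is an assumption):
hdfs dfs -du -h /user/yourUserName    # space used by the contents of your home directory
hdfs dfs -df -h /                     # space used and available across the whole cluster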
Removing a file/directory
To remove a file, use the "-rm" option and then list the testHDFS directory to confirm; this will show
that you still have the testHDFS directory and testFile2. If you then try "-rmdir" on testHDFS, you
will see "rmdir: testHDFS: Directory is not empty"; the directory needs to be empty before it can be
deleted. You can use the "-rm -r" command to bypass this and do a recursive delete.
----