Module 1
Digital data can be broadly classified into structured, semi-structured, and unstructured data:
i. Unstructured data: This is data which does not conform to a data model or is
not in a form that can be used easily by a computer program. About 80-90%
of an organization's data is in this format; for example, memos, chat rooms,
PowerPoint presentations, images, videos, letters, research reports, white papers,
the body of an email, etc.
ii. Semi-structured data: This is data which does not conform to a data model
but has some structure. However, it is not in a form that can be used easily by a
computer program; for example, emails, XML, markup languages like HTML,
etc. Metadata for this data is available but is not sufficient.
iii. Structured data: This is the data which is in an organized form (e.g., in rows and
columns) and can be easily used by a computer program. Relationships exist
between entities of data, such as classes and their objects. Data stored in databases
is an example of structured data.
• The data held in RDBMS is typically structured data.
• With the Internet connecting the world, data that existed beyond one's enterprise
started to become an integral part of daily transactions.
• This data grew by leaps and bounds so much so that it became difficult for the
enterprises to ignore it.
• All of this data was not structured. A lot of it was unstructured.
• In fact, Gartner estimates that almost 80% of the data generated in any enterprise today is
unstructured; the remainder is split roughly 10% each between the structured and
semi-structured categories.
Approximate percentage distribution of digital data
Structured Data
Most of the structured data is held in RDBMS.
An RDBMS conforms to the relational data model wherein the data is stored in
rows/columns.
The number of rows/records/tuples in a relation is called the cardinality of a relation and the
number of columns is referred to as the degree of a relation.
The first step is to design the relation/table: the fields/columns that will store the data, the type
of data each field will hold, and the constraints that we would like the data to conform to.
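As a minimal sketch of this design step (the table name, columns, data types, and constraints below are illustrative assumptions, not taken from the text), the following Python snippet uses the standard sqlite3 module to define a small relation and check its cardinality and degree:

```python
import sqlite3

# A hypothetical "employee" relation: the columns, types, and constraints
# are illustrative assumptions chosen to show the design step.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        emp_id   INTEGER PRIMARY KEY,           -- uniquely identifies each tuple
        emp_name TEXT    NOT NULL,              -- a value must be supplied
        dept     TEXT    CHECK (dept IN ('HR', 'IT', 'Sales')),
        salary   REAL    CHECK (salary > 0)     -- domain constraint
    )
""")
conn.execute("INSERT INTO employee VALUES (1, 'Asha', 'IT', 55000.0)")

# Cardinality = number of rows/tuples; degree = number of columns.
rows = conn.execute("SELECT * FROM employee").fetchall()
print("cardinality:", len(rows))
print("degree:", len(rows[0]))
```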
• If data is highly structured, one can look at leveraging any of the available RDBMS
[Oracle Corp. - Oracle, IBM - DB2, Microsoft - Microsoft SQL Server, EMC-
Greenplum, Teradata - Teradata, MySQL (open source), PostgreSQL (advanced open
source), etc.] to house it.
• These databases are typically used to hold transaction/operational data generated and
collected by day-to-day business activities.
• In other words, the data of the On-Line Transaction Processing (OLTP) systems are
generally quite structured.
Semi-Structured Data
• Semi-structured data is also referred to as having a self-describing structure.
Characteristics of semi-structured data
• Amongst the sources for semi-structured data, the front runners are "XML" and "JSON".
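To illustrate the self-describing nature of XML and JSON, here is a minimal sketch (the customer record and its fields are made up for this example) that parses the same information in both formats with Python's standard library:

```python
import json
import xml.etree.ElementTree as ET

# The keys/tags describe the data themselves, but no rigid schema is enforced.
json_record = '{"name": "Ravi", "email": "ravi@example.com", "orders": [101, 102]}'
xml_record = "<customer><name>Ravi</name><email>ravi@example.com</email></customer>"

data = json.loads(json_record)
print(data["name"], data["orders"])        # Ravi [101, 102]

root = ET.fromstring(xml_record)
print(root.find("name").text)              # Ravi
```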
• Today, unstructured data constitutes approximately 80% of the data being generated in any
enterprise.
• The balance is clearly shifting in favor of unstructured data.
i. Data mining: First, we deal with large data sets. Second, we use methods at the
intersection of artificial intelligence, machine learning, statistics, and database
systems to unearth consistent patterns in large data sets and/or systematic
relationships between variables. It is the analysis step of the "knowledge
discovery in databases" process. A few popular data mining algorithms are as
follows:
➢ Association rule mining: It is also called "market basket analysis" or "affinity
analysis". It is used to determine "what goes with what?", that is, when you buy a
product, which other product are you likely to purchase with it. For example, if you
pick up bread at the grocery, are you likely to pick up eggs or cheese to go with it?
(A small sketch of the support/confidence computation follows this list.)
➢ Regression analysis: It helps to model and predict the relationship between variables. The
variable whose value needs to be predicted is called the dependent variable, and the
variables used to predict it are referred to as the independent variables. (A least-squares
sketch follows this list.)
➢ Collaborative filtering: It is about predicting a user's preference or preferences based
on the preferences of a group of similar users; for example, recommending an item to a user
because other users with similar tastes rated it highly (refer Table).
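To make "what goes with what" concrete, here is a minimal sketch of association rule metrics (the market baskets are invented) that computes the support and confidence of the rule bread => eggs:

```python
# Hypothetical market baskets; the rule evaluated is bread => eggs.
baskets = [
    {"bread", "eggs", "milk"},
    {"bread", "cheese"},
    {"bread", "eggs"},
    {"milk", "cheese"},
]

n = len(baskets)
with_bread = [b for b in baskets if "bread" in b]
with_both = [b for b in with_bread if "eggs" in b]

support = len(with_both) / n                    # baskets containing bread AND eggs, out of all baskets
confidence = len(with_both) / len(with_bread)   # of the baskets with bread, how many also have eggs

print(f"support(bread => eggs) = {support:.2f}")       # 0.50
print(f"confidence(bread => eggs) = {confidence:.2f}")  # 0.67
```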
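Similarly, a minimal ordinary least-squares sketch (the data points are invented) that predicts a dependent variable y from a single independent variable x:

```python
# Invented sample data: x is the independent variable, y the dependent variable.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares fit for y = slope * x + intercept.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

print(f"y ~ {slope:.2f} * x + {intercept:.2f}")   # roughly y ~ 2.0 * x
print("prediction for x = 6.0:", slope * 6.0 + intercept)
```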
Characteristics of data
i. Composition: The composition of data deals with the structure of data, that is, the
sources of data, the granularity, the types, and the nature of data as to whether it is
static or real-time streaming.
ii. Condition: The condition of data deals with the state of data, that is, "Can one use
this data as is for analysis?" or "Does it require cleansing for further enhancement
and enrichment?"
iii. Context: The context of data deals with "Where has this data been generated?"
"Why was this data generated?" "How sensitive is this data?" "What are the events
associated with this data?" and so on.
Big data is high-volume, high-velocity, and high-variety information assets that demand cost-
effective, innovative forms of information processing for enhanced insight and decision
making.
i. Data today is growing at an exponential rate. Most of the data that we have today
has been generated in the last 2-3 years. This high tide of data will continue to
rise incessantly. The key questions here are: "Will all this data be useful for
analysis?", "Do we work with all this data or a subset of it?", "How will we
separate the knowledge from the noise?", etc.
ii. Cloud computing and virtualization are here to stay. Cloud computing is the
answer to managing infrastructure for big data as far as cost-efficiency, elasticity,
and easy upgrading/downgrading is concerned. This further complicates the
decision to host big data solutions outside the enterprise.
iii. The other challenge is to decide on the period of retention of big data. Just how
long should one retain this data? A tricky question indeed, as some data is useful
for making long-term decisions, whereas in a few cases the data may become
irrelevant and obsolete just a few hours after having been generated.
iv. There is a dearth of skilled professionals who possess a high level of proficiency
in data sciences that is vital in implementing big data solutions.
v. Then, of course, there are other challenges with respect to capture, storage,
preparation, search, analysis, transfer, security, and visualization of big data. Big
data refers to datasets whose size is typically beyond the storage capacity of
traditional database software tools. There is no explicit definition of how big a
dataset should be for it to be considered "big data." Here we deal with data
that is just too big, moves way too fast, and does not fit the structures of typical
database systems. The data changes are highly dynamic and therefore there is a
need to ingest it as quickly as possible.
vi. Data visualization is becoming popular as a separate discipline, and there is a
considerable shortage of business visualization experts.
Volume
A mountain of data
Where Does This Data get Generated?
There are a multitude of sources for big data. An XLS, a DOC, a PDF, etc. is unstructured data; a video on
YouTube, a chat conversation on an Internet messenger, a customer feedback form on an online retail website is
unstructured data; CCTV footage and a weather forecast report are unstructured data too.
• Typical internal data sources: Data present within an organization's firewall. It is as follows:
   Data storage: File systems, SQL (RDBMSs - Oracle, MS SQL Server, DB2, MySQL,
   PostgreSQL, etc.), NoSQL (MongoDB, Cassandra, etc.), and so on.
   Archives: Archives of scanned documents, paper archives, customer correspondence records,
   patients' health records, students' admission records, students' assessment records, and so on.
• External data sources: Data residing outside an organization's firewall. It is as follows:
   Public Web: Wikipedia, weather, regulatory, compliance, census, etc.
• Both (internal + external data sources):
   Sensor data: Car sensors, smart electric meters, office buildings, air conditioning units,
   refrigerators, and so on.
   Machine log data: Event logs, application logs, business process logs, audit logs, clickstream
   data, etc.
   Social media: Twitter, blogs, Facebook, LinkedIn, YouTube, Instagram, etc.
   Business apps: ERP, CRM, HR, Google Docs, and so on.
   Media: Audio, video, image, podcast, etc.
   Docs: Comma separated values (CSV), Word documents, PDF, XLS, PPT, and so on.
Velocity
Variety
Variety deals with a wide range of data types and sources of data. We will study this under
three categories: Structured data, semi-structured data and unstructured data.
Classification of Analytics
There are basically two schools of thought:
i. Those that classify analytics into basic, operationalized, advanced, and monetized.
ii. Those that classify analytics into analytics 1.0, analytics 2.0, and analytics 3.0.
Let us take a closer look at analytics 1.0, analytics 2.0, and analytics 3.0. Refer Table.
Analytics 1.0, 2.0, and 3.0
Figure shows the subtle growth of analytics from Descriptive → Diagnostic → Predictive →
Prescriptive analytics.
i. Reactive - Business Intelligence: It allows the businesses to make faster and better
decisions by providing the right information to the right person at the right time in the
right format. It is about analysis of the past or historical data and then displaying the
findings of the analysis or reports in the form of enterprise dashboards, alerts,
notifications, etc. It has support for both pre-specified reports as well as ad hoc
querying.
ii. Reactive - Big Data Analytics: Here the analysis is done on huge datasets but the
approach is still reactive as it is still based on static data.
iii. Proactive - Analytics: This is to support futuristic decision making by the use of data
mining, predictive modeling, text mining, and statistical analysis. This analysis, however,
is not performed on big data; it still uses traditional database management practices
and therefore has severe limitations on storage capacity and processing
capability.
iv. Proactive - Big Data Analytics: This is sieving through terabytes, petabytes,
exabytes of information to filter out the relevant data to analyze. This also includes
high performance analytics to gain rapid insights from big data and the ability to solve
complex problems using more data.
• Data access from non-volatile storage such as hard disk is a slow process.
• The more the data is required to be fetched from hard disk or secondary storage, the
slower the process gets.
• One way to combat this challenge is to pre-process and store data (cubes, aggregate
tables, query sets, etc.) so that the CPU has to fetch a small subset of records.
• But this requires thinking in advance as to what data will be required for analysis.
• If there is a need for different or more data, it is back to the initial process of pre-
computing and storing data or fetching it from secondary storage.
• This problem has been addressed using in-memory analytics (a small sketch contrasting the
two approaches follows this list).
• Here all the relevant data is stored in Random Access Memory (RAM) or primary
storage thus eliminating the need to access the data from hard disk.
• The advantage is faster access, rapid deployment, better insights, and minimal IT
involvement.
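A small sketch of the trade-off described above, assuming nothing beyond Python's built-in sqlite3 module (the table names and figures are made up): pre-computing an aggregate table versus holding the detail rows in an in-memory database for ad hoc queries:

```python
import sqlite3

# Stand-in for the on-disk warehouse (a real system would read from secondary storage).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (region TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)",
                      [("North", 100.0), ("North", 250.0), ("South", 75.0)])

# Pre-processing approach: build an aggregate table in advance so queries touch
# only a small subset of records -- but only for the questions anticipated upfront.
warehouse.execute("""CREATE TABLE sales_by_region AS
                     SELECT region, SUM(amount) AS total FROM sales GROUP BY region""")
print(warehouse.execute("SELECT * FROM sales_by_region").fetchall())

# In-memory approach: keep the detail rows in RAM and answer any ad hoc question
# without another round of pre-computation.
rows = warehouse.execute("SELECT region, amount FROM sales").fetchall()
in_memory = sqlite3.connect(":memory:")
in_memory.execute("CREATE TABLE sales (region TEXT, amount REAL)")
in_memory.executemany("INSERT INTO sales VALUES (?, ?)", rows)
print(in_memory.execute("SELECT region, AVG(amount) FROM sales GROUP BY region").fetchall())
```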
In-Database Processing
Symmetric Multiprocessor (SMP) System
• In SMP, there is a single common main memory that is shared by two or more identical
processors.
• The processors have full access to all I/O devices and are controlled by a single
operating system instance.
• SMP systems are tightly coupled multiprocessor systems.
• Each processor has its own high-speed cache memory, and the processors are
connected using a system bus.
Parallel system
• A parallel database system is a tightly coupled system.
• The processors co-operate for query processing.
• The user is unaware of the parallelism since he/she has no access to a specific
processor of the system.
• Either the processors have access to a common memory or make use of message
passing for communication.
Distributed system
• Distributed database systems are known to be loosely coupled and are composed of
individual machines.
• Each of the machines can run their individual application and serve their own
respective user.
• The data is usually distributed across several machines, thereby necessitating quite a
number of machines to be accessed to answer a user query.
Advantages of a “Shared Nothing Architecture”
i. Fault isolation: A "Shared Nothing Architecture" provides the benefit of isolating faults. A fault in
a single node is contained and confined to that node exclusively and is exposed only through
messages (or the lack of them).
ii. Scalability: Assume that the disk is a shared resource. It implies that the controller and the disk
bandwidth are also shared. Synchronization will have to be implemented to maintain a consistent
shared state. This would mean that different nodes will have to take turns to access the critical
data. This imposes a limit on how many nodes can be added to the distributed shared disk system,
thus compromising on scalability.
Brewer’s CAP
• At best you can have two of the following three; one must be sacrificed.
i. Consistency
ii. Availability
iii. Partition tolerance
CAP Theorem
i. Choose availability over consistency when your business requirements allow some flexibility
around when the data in the system synchronizes.
ii. Choose consistency over availability when your business requirements demand atomic reads and
writes.
Figure to get a glimpse of databases that adhere to two of the three characteristics of CAP theorem
Where is it Used?
• NoSQL databases are widely used in big data and other real-time web applications.
Where to use NoSQL?
• NoSQL databases are used to store log data, which can then be pulled for analysis.
• Likewise, they are used to store social media data and all such data that cannot be stored and
analyzed comfortably in an RDBMS.
What is it?
What is NoSQL?
• Additional features of NoSQL databases:
i. Are non-relational: They do not adhere to the relational data model. In fact, they are
either key-value pairs or document-oriented or column-oriented or graph-based
databases.
ii. Are distributed: They are distributed, meaning the data is distributed across
several nodes in a cluster built from low-cost commodity hardware.
iii. Offer no support for ACID properties (Atomicity, Consistency, Isolation, and
Durability): They do not offer support for ACID properties of transactions. On
the contrary, they have adherence to Brewer's CAP (Consistency, Availability, and
Partition tolerance) theorem and are often seen compromising on consistency in
favor of availability and partition tolerance.
iv. Provide no fixed table schema: NoSQL databases are becoming increasingly
popular owing to their schema flexibility. They do not mandate that the data
strictly adhere to any schema structure at the time of storage.
Types of NoSQL Databases
Let us take a closer look at key-value and a few other types of schema-less databases:
i. Key-value: It maintains a big hash table of keys and values; for example,
Dynamo, Redis, Riak, etc. (A minimal sketch follows below.)
Sample Key-Value Pair in Key-Value Database
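A minimal sketch of the key-value model using the redis-py client (the connection details, key names, and values are assumptions; this presumes a Redis server running locally, and other key-value stores follow the same get/put pattern):

```python
import redis  # pip install redis; assumes a Redis server on localhost:6379

r = redis.Redis(host="localhost", port=6379, db=0)

# The database is essentially one big hash table: opaque values looked up by key.
r.set("user:1001:name", "Asha")
r.set("user:1001:cart", '["bread", "eggs"]')   # the value is just a blob to the store

print(r.get("user:1001:name"))   # b'Asha'
print(r.get("user:1001:cart"))
```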
Why NoSQL?
Advantages of NoSQL
i. Can easily scale up and down: NoSQL databases support scaling rapidly and
elastically, and even allow scaling to the cloud.
(a) Cluster scale: It allows distribution of database across 100+ nodes often in
multiple data centers.
(b) Performance scale: It sustains over 100,000+ database reads and writes per
second.
(c) Data scale: It supports housing of 1 billion+ documents in the database.
ii. Doesn't require a pre-defined schema: NoSQL does not require any adherence
to a pre-defined schema. It is pretty flexible. For example, MongoDB allows
documents in the same collection to have different sets of fields (see the sketch
after this list).
iii. Cheap, easy to implement: Deploying NoSQL properly allows for all of the
benefits of scale, high availability, fault tolerance, etc. while also lowering
operational costs.
iv. Relaxes the data consistency requirement: NoSQL databases adhere to the
CAP theorem (Consistency, Availability, and Partition tolerance). Most
NoSQL databases compromise on consistency in favor of availability and partition
tolerance. However, they do provide eventual consistency.
v. Data can be replicated to multiple nodes and can be partitioned: There are
two terms that we will discuss here:
(a) Sharding: Sharding is when different pieces of data are distributed across
multiple servers. NoSQL databases support auto-sharding; this means that they
can natively and automatically spread data across an arbitrary number of servers,
without requiring the application to even be aware of the composition of the
server pool. Servers can be added or removed from the data layer without
application downtime. This means that data and query load are automatically balanced
across servers, and when a server goes down, it can be quickly and transparently
replaced with no application disruption (a minimal hash-based sharding sketch follows
this list).
(b) Replication: Replication is when multiple copies of data are stored across the
cluster and even across data centers. This promises high availability and fault
tolerance.
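As a sketch of the flexible-schema point above (the database name, collection, fields, and local connection string are assumptions), two documents with different sets of fields can be stored in the same MongoDB collection using pymongo:

```python
from pymongo import MongoClient  # pip install pymongo; assumes a local mongod

client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]          # hypothetical database/collection

# No table schema is declared in advance; each document carries its own fields.
products.insert_one({"name": "bread", "price": 40})
products.insert_one({"name": "phone", "price": 9999,
                     "specs": {"ram_gb": 4, "color": "black"}})  # extra nested field

for doc in products.find({"price": {"$gt": 100}}):
    print(doc["name"])
```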
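And a purely illustrative hash-sharding sketch in plain Python (no real NoSQL client involved) showing how a key can be mapped deterministically to one of several servers; real NoSQL systems automate this, typically with consistent hashing or range partitioning, and rebalance data when servers join or leave:

```python
import hashlib

SERVERS = ["node-a", "node-b", "node-c"]   # hypothetical server pool

def shard_for(key: str) -> str:
    """Map a key to a shard by hashing it; the application never hard-codes a server."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

for key in ["user:1001", "user:1002", "order:77"]:
    print(key, "->", shard_for(key))
```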
• With NoSQL around, we have been able to counter the problem of scale (NoSQL scales
out).
• There is also the flexibility with respect to schema design.
NoSQL Vendors
Few popular NoSQL vendors
NewSQL
• We need a database that has the same scalable performance as NoSQL systems for On-Line
Transaction Processing (OLTP) while still maintaining the ACID guarantees of a
traditional database.
• This new modern RDBMS is called NewSQL.
• It supports the relational data model and uses SQL as its primary interface.
Characteristics of NewSQL
• NewSQL is based on the shared nothing architecture with a SQL interface for application interaction.
Comparison of SQL, NoSQL, and NewSQL
Hadoop
• Hadoop is an open-source project of the Apache Software Foundation.
• It is a framework written in Java, originally developed by Doug Cutting in 2005 who
named it after his son's toy elephant.
• He was working with Yahoo then.
• It was created to support distribution for "Nutch", the text search engine.
• Hadoop uses Google's MapReduce and Google File System technologies as its
foundation.
• Hadoop is now a core part of the computing infrastructure for companies such as Yahoo,
Facebook, LinkedIn, Twitter, etc.
Features of Hadoop
i. Stores data in its native format: Hadoop's data storage framework (HDFS -
Hadoop Distributed File System) can store data in its native format. There is no
structure that is imposed while keying in data or storing data. HDFS is pretty
much schema-less. It is only later when the data needs to be processed that
structure is imposed on the raw data.
ii. Scalable: Hadoop can store and distribute very large datasets (involving
thousands of terabytes of data) across hundreds of inexpensive servers that operate
in parallel.
iii. Cost-effective: Owing to its scale-out architecture, Hadoop has a much reduced
cost/terabyte of storage and processing.
iv. Resilient to failure: Hadoop is fault-tolerant. It practices replication of data
diligently which means whenever data is sent to any node, the same data also gets
replicated to other nodes in the cluster, thereby ensuring that in the event of a node
failure, there will always be another copy of data available for use.
v. Flexibility: One of the key advantages of Hadoop is its ability to work with all
kinds of data: structured, semi-structured, and unstructured data. It can help derive
meaningful business insights from email conversations, social media data, click-
stream data, etc. It can be put to several purposes such as log analysis, data
mining, recommendation systems, market campaign analysis, etc.
vi. Fast: Processing is extremely fast in Hadoop as compared to other conventional
systems owing to the "move code to data" paradigm.
Versions of Hadoop
There are two versions of Hadoop available:
i. Hadoop 1.0
ii. Hadoop 2.0
Hadoop 1.0
Hadoop ecosystem
• There are components available in the Hadoop ecosystem for data ingestion,
processing, and analysis.
Data Ingestion → Data Processing → Data Analysis
• Components that help with Data Ingestion are:
i. Sqoop
ii. Flume
• Components that help with Data Processing are:
i. MapReduce
ii. Spark
• Components that help with Data Analysis are:
i. Pig
ii. Hive
iii. Impala
HDFS versus HBase
i. HDFS is the file system whereas HBase is a Hadoop database; the relationship is like that between NTFS and MySQL.
ii. HDFS is WORM (Write once and read multiple times or many times). Latest versions support
appending of data but this feature is rarely used. However, HBase supports real-time random read
and write.
iii. HDFS is based on Google File System (GFS) whereas HBase is based on Google Big Table.
iv. HDFS supports only full table scan or partition table scan. HBase supports random small range
scan or table scan.
v. Performance of Hive over HDFS is relatively good, but Hive over HBase becomes 4-5 times slower.
vi. In HDFS, data is accessed via MapReduce jobs only, whereas in HBase data is accessed via Java
APIs and REST, Avro, or Thrift APIs (a small sketch of this random access follows this list).
vii. HDFS does not support dynamic storage owing to its rigid structure whereas HBase supports
dynamic storage.
viii. HDFS has high latency operations whereas HBase has low latency operations.
ix. HDFS is most suitable for batch analytics whereas HBase is for real-time analytics.
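To illustrate HBase's random read/write access mentioned above, here is a small sketch using the happybase client (the table name, column family, and values are assumptions; it presumes an HBase Thrift gateway reachable on localhost):

```python
import happybase  # pip install happybase; talks to HBase's Thrift gateway

connection = happybase.Connection("localhost")
table = connection.table("web_metrics")        # hypothetical pre-created table

# Random write: put a single cell addressed by row key and column family:qualifier.
table.put(b"page#home", {b"stats:views": b"1024"})

# Random read: fetch just that row back, with no full table scan required.
row = table.row(b"page#home")
print(row[b"stats:views"])                     # b'1024'
```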
i. Sqoop: Sqoop stands for SQL to Hadoop. Its main functions are:
a) Importing data from an RDBMS such as MySQL, Oracle, DB2, etc. into the Hadoop file system
(HDFS, HBase, Hive).
b) Exporting data from the Hadoop file system (HDFS, HBase, Hive) to an RDBMS (MySQL, Oracle,
DB2).
Uses of Sqoop
i. MapReduce: It is a programming paradigm that allows distributed and parallel processing of huge
datasets. It is based on Google MapReduce; Google released a paper on the MapReduce programming
paradigm in 2004, and that became the genesis of the Hadoop processing model. The MapReduce
framework gets its input data from HDFS. There are two main phases: the Map phase and the Reduce
phase. The map phase converts the input data into another set of data (key-value pairs). This new
intermediate dataset then serves as the input to the reduce phase. The reduce phase acts on the
datasets to combine (aggregate and consolidate) and reduce them to a smaller set of tuples. The
result is then stored back in HDFS. (A minimal word-count sketch of the two phases follows this list.)
ii. Spark: It is both a programming model as well as a computing model. It is an open-source big
data processing framework. It was originally developed in 2009 at UC Berkeley's AMPLab and
became an open-source project in 2010. It is written in Scala. It provides in-memory computing for
Hadoop. In Spark, workloads execute in memory rather than on disk owing to which it is much
faster (10 to 100 times) than when the workload is executed on disk. However, if the datasets are
too large to fit into the available system memory, it can perform conventional disk-based
processing. It serves as a potentially faster and more flexible alternative to MapReduce. It accesses
data from HDFS (Spark does not have its own distributed file system) but bypasses the
MapReduce processing.
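To make the Map and Reduce phases concrete, here is a minimal, self-contained sketch of the model in plain Python; it only simulates the paradigm on a local list of lines (the input lines are invented), whereas the real framework would distribute the same map and reduce functions across the cluster and read from and write to HDFS:

```python
from collections import defaultdict

# Hypothetical input split, e.g. lines read from an HDFS block.
lines = ["big data needs hadoop", "hadoop processes big data"]

# Map phase: each input record is turned into intermediate (key, value) pairs.
def map_phase(line):
    return [(word, 1) for word in line.split()]

# Shuffle: group intermediate values by key (handled by the framework in Hadoop).
grouped = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):
        grouped[key].append(value)

# Reduce phase: combine the values of each key into a smaller set of results.
def reduce_phase(key, values):
    return key, sum(values)

counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)   # {'big': 2, 'data': 2, 'needs': 1, 'hadoop': 2, 'processes': 1}
```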
• Spark can be used with Hadoop coexisting smoothly with MapReduce (sitting on top of Hadoop
YARN) or used independently of Hadoop (standalone).
• As a programming model, it works with Scala, Java, Python, and R (a small PySpark sketch
follows below).
• The following are the Spark libraries:
a) Spark SQL: Spark also has support for SQL. Spark SQL uses SQL to help query data stored in
disparate applications.
b) Spark streaming: It helps to analyze and present data in real time.
c) MLlib: It supports machine learning, such as applying advanced statistical operations on data in a
Spark cluster.
d) GraphX: It helps in graph parallel computation.
• Spark and Hadoop are usually used together by several companies.
• Hadoop was primarily designed to house unstructured data and run batch processing operations on it.
• Spark is used extensively for its high speed in memory computing and ability to run advanced real-time
analytics.
• The two together have been giving very good results.
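A minimal PySpark sketch (the application name and the tiny in-memory dataset are assumptions; in practice the data would typically be read from HDFS) showing the RDD-style API and Spark SQL side by side:

```python
from pyspark.sql import SparkSession  # pip install pyspark

spark = SparkSession.builder.appName("demo").getOrCreate()

# RDD-style word count, executed in memory across the cluster (or locally here).
lines = spark.sparkContext.parallelize(["big data needs spark", "spark is fast"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.collect())

# Spark SQL over a small DataFrame.
df = spark.createDataFrame([("bread", 40), ("phone", 9999)], ["name", "price"])
df.createOrReplaceTempView("products")
spark.sql("SELECT name FROM products WHERE price > 100").show()

spark.stop()
```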
• Both Hive and traditional databases such as MySQL, MS SQL Server, and PostgreSQL support an SQL interface.
• However, Hive is better known as a data warehouse (D/W) rather than a database.
• Differences between Hive and traditional databases as regards the schema:
i. Hive enforces schema on read time whereas an RDBMS enforces schema on write time. In an
RDBMS, at the time of loading/inserting data, the table's schema is enforced. If the data being
loaded does not conform to the schema, it is rejected. Thus, the schema is enforced on write
(loading the data into the database). Schema on write takes longer to load the data into the
database; however, it makes up for it during data retrieval with good query-time performance.
Hive, on the other hand, does not enforce the schema when the data is being loaded into the D/W;
it is enforced only when the data is being read/retrieved. This is called schema on read. It
definitely makes for a fast initial load, as the data load or insertion operation is just a file copy
or move. (A minimal sketch contrasting the two appears at the end of this list.)
ii. Hive is based on the notion of write once and read many times whereas the RDBMS is designed
for read and write many times.
iii. Hadoop is a batch-oriented system. Hive, therefore, is not suitable for OLTP (Online Transaction
Processing) but, although not ideal, seems closer to OLAP (Online Analytical Processing). The
reason being that there is quite a latency between issuing a query and receiving a reply as the query
written in HiveQL will be converted to MapReduce jobs which are then executed on the Hadoop
cluster. RDBMS is suitable for housing day-to-day transaction data and supports all OLTP
operations with frequent insertions, modifications (updates), deletions of the data.
iv. Hive handles static data analysis which is non-real-time data. Hive is the data warehouse of
Hadoop. There are no frequent updates to the data and the query response time is not fast. RDBMS
is suited for handling dynamic data which is real time.
v. Hive can be easily scaled at a very low cost when compared to an RDBMS. Hive uses HDFS to store
data and thus cannot be considered the owner of the data, while on the other hand an RDBMS is the
owner of the data, responsible for storing, managing, and manipulating it in the database.
vi. Hive uses the concept of parallel computing, whereas RDBMS uses serial computing.
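A purely illustrative contrast of the two approaches in plain Python (no Hive or RDBMS involved; the file contents and field names are made up): schema on write rejects a non-conforming row at load time, while schema on read keeps the load cheap and imposes structure only when the data is queried:

```python
import sqlite3

raw_lines = ["1,Asha,55000", "2,Ravi,not_a_number", "3,Meena,61000"]

# Schema on write (RDBMS-like): types are enforced while loading, so the bad row
# is rejected at insert time and never enters the table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE emp (id INTEGER, name TEXT, salary REAL CHECK (salary > 0))")
for line in raw_lines:
    emp_id, name, salary = line.split(",")
    try:
        db.execute("INSERT INTO emp VALUES (?, ?, ?)", (int(emp_id), name, float(salary)))
    except ValueError:
        print("rejected at write time:", line)

# Schema on read (Hive-like): "loading" is just keeping the raw lines; the
# structure is imposed only when the data is read for a query.
def read_with_schema(lines):
    for line in lines:
        emp_id, name, salary = line.split(",")
        try:
            yield int(emp_id), name, float(salary)
        except ValueError:
            continue  # malformed rows surface only at query time

print([name for _, name, _ in read_with_schema(raw_lines)])
```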
Hive versus RDBMS
Hive versus HBase
i. Hive is a MapReduce-based SQL engine that runs on top of Hadoop. HBase is a key-value
NoSQL database that runs on top of HDFS.
ii. Hive is for batch processing of big data. HBase is for real-time data streaming.
Impala
• It is a massively parallel processing SQL query engine that queries data stored in HDFS and HBase directly, without requiring data movement.
ZooKeeper
• It is a centralized coordination service for distributed applications, maintaining configuration information and providing distributed synchronization.
Oozie
• It is a workflow scheduler system used to manage and chain together Apache Hadoop jobs.
Mahout
• It is a scalable machine learning and data mining library.
Chukwa
• It is a data collection system for monitoring large distributed systems, built on top of HDFS and MapReduce.
Ambari
• It is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters.
Hadoop Distributions
• Figure to get a glimpse of the leading market vendors offering integrated Hadoop
systems.
Cloud-based solutions