
MODULE-1

Big Data Analytics


Classification of Digital Data

Digital data can be broadly classified into structured, semi-structured, and unstructured data:
i. Unstructured data: This is data which does not conform to a data model or is
not in a form which can be used easily by a computer program. About 80-90%
of an organization's data is in this format; for example, memos, chat-room
conversations, PowerPoint presentations, images, videos, letters, research papers,
white papers, the body of an email, etc.
ii. Semi-structured data: This is the data which does not conform to a data model
but has some structure. However, it is not in a form which can be used easily by a
computer program; for example, emails, XML, markup languages like HTML,
etc. Metadata for this data is available but is not sufficient.
iii. Structured data: This is the data which is in an organized form (e.g., in rows and
columns) and can be easily used by a computer program. Relationships exist
between entities of data, such as classes and their objects. Data stored in databases
is an example of structured data.
• The data held in RDBMS is typically structured data.
• With the Internet connecting the world, data that existed beyond one's enterprise
started to become an integral part of daily transactions.
• This data grew by leaps and bounds so much so that it became difficult for the
enterprises to ignore it.
• All of this data was not structured. A lot of it was unstructured.
• In fact, Gartner estimates that almost 80% of the data generated in any enterprise today is
unstructured data; roughly 10% each falls into the structured and semi-structured
categories.
Approximate percentage distribution of digital data

Structured Data
Most of the structured data is held in RDBMS.

An RDBMS conforms to the relational data model wherein the data is stored in
rows/columns.

A relation/table with rows and columns

When data conforms to a pre-defined schema/structure, it is structured data.

The number of rows/records/tuples in a relation is called the cardinality of a relation and the
number of columns is referred to as the degree of a relation.

The first step is the design of the relation/table: the fields/columns that will store the data and the type of
data that will be stored.
Next, think of the constraints that we would like the data to conform to.

Design a table/relation structure to store the details of the employees of an enterprise.

Schema of an "Employee" table in a RDBMS such as Oracle

• The table shows the structure/schema of an "Employee" table in an RDBMS such as
Oracle. It is an example of a well-structured table (complete with a table name,
meaningful column names with data types, data lengths, and the relevant constraints)
with absolute adherence to the relational data model.
• Each record in the table will have exactly the same structure.

Sample records in the "Employee" table


• The tables in an RDBMS can also be related. For example, the above "Employee"
table is related to the "Department" table on the basis of the common column,
"DeptNo".

Relationship between "Employee" and "Department" tables
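
As an illustration, the sketch below declares a pared-down version of the "Employee"/"Department" schema described above, using SQLite from Python purely for convenience (the text uses Oracle as its example). Apart from DeptNo, the column names and sample values are illustrative assumptions.

import sqlite3

# Minimal sketch of the Employee/Department schema; SQLite is used only for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE Department (
        DeptNo   INTEGER PRIMARY KEY,   -- constraint: unique department number
        DeptName TEXT NOT NULL
    )""")

cur.execute("""
    CREATE TABLE Employee (
        EmpNo   INTEGER PRIMARY KEY,    -- every record has exactly the same structure
        EmpName TEXT NOT NULL,
        Salary  REAL CHECK (Salary >= 0),              -- a simple domain constraint
        DeptNo  INTEGER REFERENCES Department(DeptNo)  -- common column relating the two tables
    )""")

cur.execute("INSERT INTO Department VALUES (10, 'Accounts')")
cur.execute("INSERT INTO Employee VALUES (1001, 'A. Kumar', 55000.0, 10)")
conn.commit()

# Cardinality = number of rows; degree = number of columns of the relation.
rows = cur.execute("SELECT * FROM Employee").fetchall()
print("cardinality:", len(rows), "degree:", len(rows[0]))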

Sources of Structured Data

• If data is highly structured, one can look at leveraging any of the available RDBMS
[Oracle Corp. - Oracle, IBM - DB2, Microsoft - Microsoft SQL Server, EMC-
Greenplum, Teradata - Teradata, MySQL (open source), PostgreSQL (advanced open
source), etc.] to house it.
• These databases are typically used to hold transaction/operational data generated and
collected by day-to-day business activities.
• In other words, the data of the On-Line Transaction Processing (OLTP) systems are
generally quite structured.

Sources of structured data

Ease of Working with Structured Data

• Structured data provides the ease of working with it.


Ease of working with structured data
• The ease is with respect to the following:
i. Insert/update/delete: The Data Manipulation Language (DML) operations
provide the required ease of data input, storage, access, processing, analysis, etc.
ii. Security: Staunch encryption and tokenization solutions are available to
warrant the security of information throughout its lifecycle. Organizations are able
to retain control and maintain compliance adherence by ensuring that only
authorized individuals are able to decrypt and view sensitive information.
iii. Indexing: An index is a data structure that speeds up data retrieval
operations (primarily the SELECT statement) at the cost of additional writes and
storage space, but the benefits that ensue in search operations are worth the
additional writes and storage space.
iv. Scalability:The storage and processing capabilities of the traditional RDBMS can
be easily scaled up by increasing the horsepower of the database server
(increasing the primary and secondary or peripheral storage capacity, processing
capacity of the processor, etc.).
v. Transaction processing: RDBMS has support for the Atomicity, Consistency,
Isolation, and Durability (ACID) properties of transactions. A brief description of the
ACID properties:
Atomicity: A transaction is atomic, means that either it happens in its entirety or
none of it at all.
Consistency: The database moves from one consistent state to another consistent
state. In other words, if the same piece of information is stored at two or more
places, they are in complete agreement.
Isolation: The resource allocation to the transaction happens such that the
transaction gets the impression that it is the only transaction happening in
isolation.
Durability: All changes made to the database during a transaction are permanent
and that accounts for the durability of the transaction.
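
A minimal sketch tying the points above together, again using SQLite from Python; the table and the values are illustrative. It shows DML operations, an index, and the all-or-nothing (atomic) behaviour of a transaction.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Employee (EmpNo INTEGER PRIMARY KEY, EmpName TEXT, DeptNo INTEGER)")

# i. DML: insert/update/delete on structured data is straightforward.
cur.execute("INSERT INTO Employee VALUES (1001, 'A. Kumar', 10)")
cur.execute("UPDATE Employee SET DeptNo = 20 WHERE EmpNo = 1001")

# iii. Indexing: speeds up retrieval at the cost of extra writes and storage.
cur.execute("CREATE INDEX idx_emp_dept ON Employee(DeptNo)")
conn.commit()

# v. Transaction processing: atomicity means the transaction happens entirely or not at all.
try:
    with conn:  # the connection commits on success and rolls back on error
        conn.execute("INSERT INTO Employee VALUES (1002, 'B. Singh', 20)")
        conn.execute("INSERT INTO Employee VALUES (1002, 'duplicate', 30)")  # violates the primary key
except sqlite3.IntegrityError:
    pass  # the whole transaction is rolled back; neither of the two rows persists

print(conn.execute("SELECT COUNT(*) FROM Employee").fetchone()[0])  # -> 1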

Semi-Structured Data
• Semi-structured data is also referred to as self-describing structure.
Characteristics of semi-structured data

• It has the following features:


i. It does not conform to the data models that one typically associates with relational
databases or any other form of data tables.
ii. It uses tags to segregate semantic elements.
iii. Tags are also used to enforce hierarchies of records and fields within data.
iv. There is no separation between the data and the schema. The amount of structure
used is dictated by the purpose at hand.
v. In semi-structured data, entities belonging to the same class, even when grouped
together, need not necessarily have the same set of attributes. And even if they do
have the same set of attributes, the order of the attributes may differ, which for
all practical purposes is not important.

Sources of Semi-Structured Data

• Amongst the sources for semi-structured data, the front runners are "XML" and "JSON".

Sources of semi-structured data

i. XML: eXtensible Markup Language (XML) is hugely popularized by web
services developed utilizing the Simple Object Access Protocol (SOAP)
principles.
ii. JSON: JavaScript Object Notation (JSON) is used to transmit data between a
server and a web application. JSON is popularized by web services developed
utilizing the Representational State Transfer (REST) principles - an architectural
style for creating scalable web services. MongoDB (an open-source, distributed,
NoSQL, document-oriented database) and Couchbase (originally known as Membase;
an open-source, distributed, NoSQL, document-oriented database) store data natively
in JSON format.
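
A small sketch of what such semi-structured data looks like in practice, using Python's json module; the record contents are illustrative. Note how the tags (keys) carry the structure and how two records of the same class need not share the same attributes or the same order.

import json

# Two records of the same class: keys (tags) describe the data, yet the
# attribute sets and their order differ from record to record.
records = [
    {"name": "A. Kumar", "dept": "Accounts", "email": "a.kumar@example.com"},
    {"dept": "Sales", "name": "B. Singh", "phone": "+91-9800000000"},
]

payload = json.dumps(records, indent=2)   # what a REST web service might transmit
print(payload)

for rec in json.loads(payload):
    # Schema and data travel together, so each record is inspected on its own.
    print(rec["name"], "->", rec.get("email", "no email on record"))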
Unstructured Data
• Unstructured data does not conform to any pre-defined data model.
• As can be seen from the examples, the structure is quite unpredictable.

Few examples of disparate unstructured data


• Other sources of unstructured data are shown in the figure below.

Sources of unstructured data

Issues with "Unstructured" Data

• Although unstructured data is known NOT to conform to a pre-defined data model or be organized
in a pre-defined manner, there are instances wherein the structure of the data (placed in
the unstructured category) can still be implied.
• There could be a few other reasons behind placing data in the unstructured category despite
it having some structure or even being highly structured.

Issues with terminology of unstructured data


• There are situations where people argue that a text file should be in the category of semi-
structured data and not unstructured data.
• The text file does have a name, and one can easily look at its properties to get information
such as the owner of the file, the date on which the file was created, the size of the file,
etc.

How to Deal with Unstructured Data?

• Today, unstructured data constitutes approximately 80% of the data that is being
generated in any enterprise.
• The balance is clearly shifting in favor of unstructured data

Unstructured data clearly constitutes a major percentage of enterprise data


• A few ways of dealing with unstructured data.

Dealing with unstructured data


The following techniques are used to find patterns in or interpret unstructured data:

i. Data mining: First, we deal with large data sets. Second, we use methods at the
intersection of artificial intelligence, machine learning, statistics, and database
systems to unearth consistent patterns in large data sets and/or systematic
relationships between variables. It is the analysis step of the "knowledge
discovery in databases" process. A few popular data mining algorithms are as
follows:
➢ Association rule mining: It is also called "market basket analysis" or "affinity
analysis". It is used to determine "What goes with what?", that is, when you buy a
product, which other product are you likely to purchase with it? For example,
if you pick up bread from the grocery store, are you likely to pick up eggs or cheese
to go with it?
➢ Regression analysis: It helps to predict the relationship between two variables. The
variable whose value needs to be predicted is called the dependent variable and the
variables which are used to predict the value are referred to as the independent
variables.
➢ Collaborative filtering: It is about predicting a user's preference or preferences based
on the preferences of a group of users (a minimal sketch appears after this list). For example, take a look at Table.

Sample records depicting learners' preferences for modes of learning


We are looking at predicting whether User 4 will prefer to learn using videos or is a
textual learner depending on one or a couple of his or her known preferences. We
analyze the preferences of similar user profiles and on the basis of it, predict that User
4 will also like to learn using videos and is not a textual learner.
ii. Text analytics or text mining: Compared to the structured data stored in
relational databases, text is largely unstructured, amorphous, and difficult to deal
with algorithmically. Text mining is the process of gleaning high quality and
meaningful information (through devising of patterns and trends by means of
statistical pattern learning) from text. It includes tasks such as text categorization,
text clustering, sentiment analysis, concept/entity extraction, etc.
iii. Natural language processing (NLP): It is related to the area of human computer
interaction. It is about enabling computers to understand human or natural
language input.
iv. Noisy text analytics: It is the process of extracting structured or semi-structured
information from noisy unstructured data such as chats, blogs, wikis, emails,
message-boards, text messages, etc. The noisy unstructured data usually
comprises one or more of the following: Spelling mistakes, abbreviations,
acronyms, non-standard words, missing punctuation, missing letter case, filler
words such as "uh", "um", etc.
v. Manual tagging with metadata: This is about manually tagging the data with adequate
metadata to provide the requisite semantics to understand unstructured data.
vi. Part-of-speech tagging: It is also called POS or POST or grammatical tagging. It
is the process of reading text and tagging each word in the sentence as belonging
to a particular part of speech such as "noun", "verb", "adjective", etc.
vii. Unstructured Information Management Architecture (UIMA): It is an open
source platform from IBM. It is used for real-time content analytics. It is about
processing text and other unstructured data to find latent meaning and relevant
relationship buried therein.
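
Below is the minimal collaborative-filtering sketch promised above, written in plain Python; the user names and preference values are illustrative, and the similarity measure (cosine similarity over the known preferences) is one common choice rather than the only one.

from math import sqrt

# Known preferences (1 = likes, 0 = does not like); User 4's preference for
# "video" is unknown and is predicted from similar users.
prefs = {
    "User 1": {"video": 1, "text": 0, "audio": 1},
    "User 2": {"video": 1, "text": 0, "audio": 0},
    "User 3": {"video": 0, "text": 1, "audio": 0},
}
user4 = {"text": 0, "audio": 1}   # the known preferences of User 4

def cosine(a, b, keys):
    dot = sum(a[k] * b[k] for k in keys)
    na = sqrt(sum(a[k] ** 2 for k in keys))
    nb = sqrt(sum(b[k] ** 2 for k in keys))
    return dot / (na * nb) if na and nb else 0.0

shared = list(user4)   # compare users only on the attributes User 4 has filled in
sims = {u: cosine(p, user4, shared) for u, p in prefs.items()}

# Weighted vote of the similar users' "video" preference.
score = sum(sims[u] * prefs[u]["video"] for u in prefs) / (sum(sims.values()) or 1)
print("similarities:", sims)
print("predicted preference of User 4 for video:", round(score, 2))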
Characteristics of Data

Characteristics of data

Data has three key characteristics:

i. Composition: The composition of data deals with the structure of data, that is, the
sources of data, the granularity, the types, and the nature of data as to whether it is
static or real-time streaming.
ii. Condition: The condition of data deals with the state of data, that is, "Can one use
this data as is for analysis?" or "Does it require cleansing for further enhancement
and enrichment?"
iii. Context: The context of data deals with "Where has this data been generated?"
"Why was this data generated?" "How sensitive is this data?" "What are the events
associated with this data?" and so on.

Evolution of Big Data

The evolution of big data

• 1970s and before was the era of mainframes.


• The data was essentially primitive and structured.
• Relational databases evolved in 1980s and 1990s.
• The era was of data intensive applications.
• The World Wide Web (WWW) and the Internet of Things (IoT) have led to an onslaught
of structured, unstructured, and multimedia data.
Definition of Big Data

Definition of big data

Big data is high-volume, high-velocity, and high-variety information assets that demand cost
effective, innovative forms of information processing for enhanced insight and decision
making.

Definition of big data


Look at the definition in three parts:

• Part I of the definition, "big data is high-volume, high-velocity, and high-variety
information assets", talks about voluminous data (humongous data) that may have great
variety (a good mix of structured, semi-structured, and unstructured data) and will require
a good speed/pace for storage, preparation, processing, and analysis.
• Part II of the definition, "cost effective, innovative forms of information processing",
talks about embracing new techniques and technologies to capture (ingest), store,
process, persist, integrate, and visualize the high-volume, high-velocity, and high-variety
data.
• Part III of the definition "enhanced insight and decision making" talks about deriving
deeper, richer, and meaningful insights and then using these insights to make faster
and better decisions to gain business value and thus a competitive edge.

Challenges with Big Data

Challenges with big data


Following are a few challenges with big data:

i. Data today is growing at an exponential rate. Most of the data that we have today
has been generated in the last 2-3 years. This high tide of data will continue to
rise incessantly. The key questions here are: "Will all this data be useful for
analysis?", "Do we work with all this data or a subset of it?", "How will we
separate the knowledge from the noise?", etc.
ii. Cloud computing and virtualization are here to stay. Cloud computing is the
answer to managing infrastructure for big data as far as cost-efficiency, elasticity,
and easy upgrading/downgrading is concerned. This further complicates the
decision to host big data solutions outside the enterprise.
iii. The other challenge is to decide on the period of retention of big data. Just how
long should one retain this data? A tricky question indeed, as some data is useful
for making long-term decisions, whereas in a few cases the data may quickly
become irrelevant and obsolete just a few hours after having been generated.
iv. There is a dearth of skilled professionals who possess a high level of proficiency
in data sciences that is vital in implementing big data solutions.
v. Then, of course, there are other challenges with respect to capture, storage,
preparation, search, analysis, transfer, security, and visualization of big data. Big
data refers to datasets whose size is typically beyond the storage capacity of
traditional database software tools. There is no explicit definition of how big the
dataset should be for it to be considered "big data." Here we have to deal with data
that is just too big, moves way too fast, and does not fit the structures of typical
database systems. The data changes are highly dynamic and therefore there is a
need to ingest it as quickly as possible.
vi. Data visualization is becoming popular as a separate discipline, and we are short of
business visualization experts by quite a number.

What is Big Data?


Big data is data that is big in volume, velocity, and variety.

Data: Big in volume, variety, and velocity

Volume

We have seen it grow from bits to bytes to petabytes and exabytes.


Growth of data

A mountain of data
Where Does This Data get Generated?

There are a multitude of sources for big data. An XLS, a DOC, a PDF, etc. is unstructured data; a video on You
Tube, a chat conversation on Internet Messenger, a customer feedback form on an online retail website in
unstructured data; a CCTV coverage, a weather forecast report is unstructured data too.

• Typical internal data sources: Data present within an organization's firewall, such as:
Data storage: File systems, SQL (RDBMSs - Oracle, MS SQL Server, DB2, MySQL,
PostgreSQL, etc.), NoSQL (MongoDB, Cassandra, etc.), and so on.
Archives: Archives of scanned documents, paper archives, customer correspondence records,
patients' health records, students' admission records, students' assessment records, and so on.
• External data sources: Data residing outside an organization's firewall, such as:
Public Web: Wikipedia, weather, regulatory, compliance, census data, etc.
• Both (internal + external data sources):
Sensor data: Car sensors, smart electric meters, office buildings, air conditioning units,
refrigerators, and so on.
Machine log data: Event logs, application logs, business process logs, audit logs, clickstream
data, etc.
Social media: Twitter, blogs, Facebook, LinkedIn, YouTube, Instagram, etc.
Business apps: ERP, CRM, HR, Google Docs, and so on.
Media: Audio, video, images, podcasts, etc.
Docs: Comma-separated values (CSV), Word documents, PDF, XLS, PPT, and so on.
Velocity

We have moved from the days of batch processing to real-time processing.

Variety

Variety deals with a wide range of data types and sources of data. We will study this under
three categories: Structured data, semi-structured data and unstructured data.

i. Structured data: From traditional transaction processing systems, RDBMS, etc.
ii. Semi-structured data: For example, Hyper Text Markup Language (HTML),
eXtensible Markup Language (XML).
iii. Unstructured data: For example, unstructured text documents, audio, video,
emails, photos, PDFs, social media, etc.

Why Big Data?


• The more data we have for analysis, the greater will be the analytical accuracy and
also the greater would be the confidence in our decisions based on these analytical
findings.
• This will entail a greater positive impact in terms of enhancing operational
efficiencies, reducing cost and time, innovating new products and services, and
optimizing existing services.

Why big data?

Traditional Business Intelligence(BI) Versus Big Data


i. In a traditional BI environment, all the enterprise's data is housed in a central server
whereas in a big data environment data resides in a distributed file system. The
distributed file system scales by scaling out horizontally, as compared to a
typical database server that scales vertically.
ii. In traditional BI, data is generally analyzed in an offline mode whereas in big
data, it is analyzed in both real time as well as in offline mode.
iii. Traditional BI is about structured data and it is here that data is taken to
processing functions (move data to code) whereas big data is about variety:
Structured, semi-structured, and unstructured data and here the processing
functions are taken to the data (move code to data).

A Typical Data Warehouse Environment


• Operational or transactional or day-to-day business data is gathered from Enterprise
Resource Planning (ERP) systems, Customer Relationship Management (CRM),
legacy systems, and several third party applications.
• The data from these sources may differ in format [the data could have been housed in an
RDBMS such as Oracle, MS SQL Server, DB2, MySQL, Teradata, and so on, or in
spreadsheets (.xls, .xlsx, etc.), .csv, or .txt files].
• Data may come from data sources located in the same geography or different
geographies.
• This data is then integrated, cleaned up, transformed, and standardized through the
process of Extraction, Transformation, and Loading (ETL).
• The transformed data is then loaded into the enterprise data warehouse (available at
the enterprise level) or data marts (available at the business unit/ functional unit or
business process level).
• A host of market leading business intelligence and analytics tools are then used to
enable decision making from the use of ad-hoc queries, SQL, enterprise dashboards,
data mining, etc.

A typical data warehouse environment
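
A minimal sketch of the ETL flow described above, using pandas and SQLite as stand-ins for the source extract and the warehouse respectively; the file name, column names, and transformations are illustrative assumptions.

import sqlite3
import pandas as pd

# Extract: operational data exported from a source system (file and columns are illustrative).
orders = pd.read_csv("crm_orders.csv")   # e.g. columns: order_id, cust_name, amount, order_dt

# Transform: clean up and standardize before loading.
orders = orders.drop_duplicates(subset="order_id")
orders["cust_name"] = orders["cust_name"].str.strip().str.title()
orders["order_dt"] = pd.to_datetime(orders["order_dt"])

# Load: append the standardized data into a warehouse/data-mart table.
warehouse = sqlite3.connect("enterprise_dw.db")
orders.to_sql("fact_orders", warehouse, if_exists="append", index=False)
warehouse.close()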

A Typical Hadoop Environment


• The data sources are quite disparate from web logs to images, audios, and videos to
social media data to the various docs, pdfs, etc.
• Here the data in focus is not just the data within the company's firewall but also data
residing outside the company's firewall.
• This data is placed in Hadoop Distributed File System (HDFS).
• If need be, this can be repopulated back to operational systems or fed to the enterprise
data warehouse or data marts or Operational Data Store (ODS) to be picked for
further processing and analysis.

A typical Hadoop environment

What is Big Data Analytics?


Big Data Analytics is...

i. Technology-enabled analytics: Quite a few data analytics and visualization tools
are available in the market today from leading vendors such as IBM, Tableau,
SAS, R Analytics, Statistica, World Programming Systems (WPS), etc. to help
process and analyze your big data.
ii. About gaining a meaningful, deeper, and richer insight into your business to steer
it in the right direction: understanding the customer's demographics to cross-sell
and up-sell to them, better leveraging the services of your vendors and suppliers,
etc.
iii. About a competitive edge over your competitors by enabling you with findings
that allow quicker and better decision-making.
iv. A tight handshake between three communities: IT, business users, and data
scientists. Refer Figure.
v. Working with datasets whose volume and variety exceed the current storage and
processing capabilities and infrastructure of your enterprise.
vi. About moving code to data. This makes perfect sense as the program for
distributed processing is tiny (just a few KBs) compared to the data (terabytes or
petabytes today, and likely to be exabytes or zettabytes in the near future).
What is big data analytics?

Classification of Analytics
There are basically two schools of thought:
i. Those that classify analytics into basic, operationalized, advanced, and monetized.
ii. Those that classify analytics into analytics 1.0, analytics 2.0, and analytics 3.0.

First School of Thought


i. Basic analytics: This primarily is slicing and dicing of data to help with basic
business insights. This is about reporting on historical data, basic visualization,
etc.
ii. Operationalized analytics: It is operationalized analytics if it gets woven into the
enterprise's business processes.
iii. Advanced analytics: This largely is about forecasting for the future by way of
predictive and prescriptive modeling.
iv. Monetized analytics: This is analytics in use to derive direct business revenue.

Second School of Thought

Let us take a closer look at analytics 1.0, analytics 2.0, and analytics 3.0. Refer Table.
Analytics 1.0, 2.0, and 3.0

Figure shows the subtle growth of analytics from Descriptive → Diagnostic → Predictive →
Prescriptive analytics.

Analytics 1.0, 2.0, and 3.0


Why is Big Data Analytics Important?
The various approaches to the analysis of data lead to the following:

i. Reactive - Business Intelligence: It allows the businesses to make faster and better
decisions by providing the right information to the right person at the right time in the
right format. It is about analysis of the past or historical data and then displaying the
findings of the analysis or reports in the form of enterprise dashboards, alerts,
notifications, etc. It has support for both pre-specified reports as well as ad hoc
querying.
ii. Reactive - Big Data Analytics: Here the analysis is done on huge datasets but the
approach is still reactive as it is still based on static data.
iii. Proactive - Analytics: This is to support futuristic decision making by the use of data
mining, predictive modeling, text mining, and statistical analysis. This analysis is not
on big data as it still uses the traditional database management practices and
therefore has severe limitations on the storage capacity and the processing
capability.
iv. Proactive - Big Data Analytics: This is sieving through terabytes, petabytes,
exabytes of information to filter out the relevant data to analyze. This also includes
high performance analytics to gain rapid insights from big data and the ability to solve
complex problems using more data.

Terminologies Used in Big Data Environments


In-Memory Analytics

• Data access from non-volatile storage such as hard disk is a slow process.
• The more the data is required to be fetched from hard disk or secondary storage, the
slower the process gets.
• One way to combat this challenge is to pre-process and store data (cubes, aggregate
tables, query sets, etc.) so that the CPU has to fetch a small subset of records.
• But this requires thinking in advance as to what data will be required for analysis.
• If there is a need for different or more data, it is back to the initial process of pre-
computing and storing data or fetching it from secondary storage.
• This problem has been addressed using in-memory analytics.
• Here all the relevant data is stored in Random Access Memory (RAM) or primary
storage thus eliminating the need to access the data from hard disk.
• The advantage is faster access, rapid deployment, better insights, and minimal IT
involvement.

In-Database Processing

• In-database processing is also called in-database analytics.


• It works by fusing data warehouses with analytical systems.
• Typically the data from various enterprise On Line Transaction Processing (OLTP)
systems after cleaning up (de-duplication, scrubbing, etc.) through the process of ETL
is stored in the Enterprise Data Warehouse (EDW) or data marts.
• The huge datasets are then exported to analytical programs for complex and extensive
computations.
• With in-database processing, the database program itself can run the computations
eliminating the need for export and thereby saving on time.
• Leading database vendors are offering this feature to large businesses.
Symmetric Multiprocessor System (SMP)

• In SMP, there is a single common main memory that is shared by two or more identical
processors.
• The processors have full access to all I/O devices and are controlled by a single
operating system instance.
• SMP systems are tightly coupled multiprocessor systems.
• Each processor has its own high-speed cache memory, and the processors are
connected using a system bus.

Symmetric Multiprocessor System

Massively Parallel Processing

• Massively Parallel Processing (MPP) refers to the coordinated processing of programs
by a number of processors working in parallel.
• Each processor has its own operating system and dedicated memory.
• They work on different parts of the same program.
• The MPP processors communicate using some sort of messaging interface.
• MPP systems are more difficult to program as the application must be divided in
such a way that all the executing segments can communicate with each other.
• MPP is different from Symmetric Multiprocessing (SMP) in that SMP works with
the processors sharing the same operating system and the same memory.
• SMP is also referred to as tightly-coupled multiprocessing.
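
As a loose illustration of the MPP idea (independent workers, each with its own memory, processing different parts of the same job and sending results back over a message-passing step), here is a small Python multiprocessing sketch; it is not a real MPP database, just the pattern in miniature.

from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker process has its own memory and works on its own part of the job.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]   # divide the work four ways

    with Pool(processes=4) as pool:
        # Results come back over inter-process communication (a message-passing step).
        partials = pool.map(partial_sum, chunks)

    print("total:", sum(partials))            # combine the partial results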
Difference Between Parallel and Distributed Systems

Parallel system
• A parallel database system is a tightly coupled system.
• The processors co-operate for query processing.
• The user is unaware of the parallelism since he/she has no access to a specific
processor of the system.
• Either the processors have access to a common memory or make use of message
passing for communication.

Parallel system

Distributed system
• Distributed database systems are known to be loosely coupled and are composed of
individual machines.
• Each of the machines can run their individual application and serve their own
respective user.
• The data is usually distributed across several machines, thereby necessitating quite a
number of machines to be accessed to answer a user query.
Distributed system
Advantages of a “Shared Nothing Architecture”

i. Fault isolation: A "Shared Nothing Architecture" provides the benefit of fault isolation. A fault in
a single node is contained and confined to that node exclusively and is exposed only through
messages (or the lack of them).
ii. Scalability: Assume that the disk is a shared resource. It implies that the controller and the disk
bandwidth are also shared. Synchronization will have to be implemented to maintain a consistent
shared state. This would mean that different nodes will have to take turns to access the critical
data. This imposes a limit on how many nodes can be added to the distributed shared disk system,
thus compromising on scalability.

CAP Theorem Explained

• The CAP theorem is also called Brewer's Theorem.
• It states that in a distributed computing environment (a collection of interconnected nodes
that share data), it is impossible to simultaneously provide all three of the following guarantees.

Brewer’s CAP
• At best you can have two of the following three - one must be sacrificed.
i. Consistency
ii. Availability
iii. Partition tolerance
CAP Theorem

i. Consistency implies that every read fetches the last write.


ii. Availability implies that reads and writes always succeed. In other words, each non-failing node
will return a response in a reasonable amount of time.
iii. Partition tolerance implies that the system will continue to function when network partition occurs.
When to choose consistency over availability and vice-versa...

i. Choose availability over consistency when your business requirements allow some flexibility
around when the data in the system synchronizes.
ii. Choose consistency over availability when your business requirements demand atomic reads and
writes.

Examples of databases that follow one of the possible three combinations:

i. Availability and Partition Tolerance (AP)


ii. Consistency and Partition Tolerance (CP)
iii. Consistency and Availability (CA)

Refer to the figure for a glimpse of databases that adhere to two of the three characteristics of the CAP theorem.

Databases and CAP

NoSQL(Not Only SQL)


• The term NoSQL was first coined by Carlo Strozzi in 1998 to name his lightweight, open-
source, relational database that did not expose the standard SQL interface.
• Johan Oskarsson, who was then a developer at Last.fm, reintroduced the term
NoSQL in 2009 at an event called to discuss open-source distributed databases.
• The hashtag #NoSQL was coined by Eric Evans, and the other database people at the event found
it suitable to describe these non-relational databases.
• Few features of NoSQL databases are as follows:
i. They are open source.
ii. They are non-relational.
iii. They are distributed.
iv. They are schema-less.
v. They are cluster friendly.
vi. They are born out of 21" century web applications

Where is it Used?

• NoSQL databases are widely used in big data and other real-time web applications.
Where to use NoSQL?
• NoSQL databases are used to store log data which can then be pulled for analysis.
• Likewise, they are used to store social media data and all such data which cannot be stored and
analyzed comfortably in an RDBMS.

What is it?

• NoSQL stands for Not Only SQL.


• These are non-relational, open source, distributed databases.
• They are hugely popular today owing to their ability to scale out or scale horizontally and
the adeptness at dealing with a rich variety of data: structured, semi-structured and
unstructured data.

What is NoSQL?
• Additional features of NoSQL. NoSQL databases:
i. Are non-relational: They do not adhere to the relational data model. In fact, they are
either key-value, document-oriented, column-oriented, or graph-based
databases.
ii. Are distributed: They are distributed meaning the data is distributed across
several nodes in a cluster constituted of low-cost commodity hardware.
iii. Offer no support for ACID properties (Atomicity, Consistency, Isolation, and
Durability): They do not offer support for ACID properties of transactions. On
the contrary, they have adherence to Brewer's CAP (Consistency, Availability, and
Partition tolerance) theorem and are often seen compromising on consistency in
favor of availability and partition tolerance.
iv. Provide no fixed table schema: NoSQL databases are becoming increasingly
popular owing to the flexibility they offer with respect to the schema. They do not
mandate that the data strictly adhere to any schema structure at the time of storage.
Types of NoSQL Databases

• NoSQL databases are non-relational.


• They can be broadly classified into the following:
i. Key-value or the big hash table.
ii. Schema-less.
Types of NoSQL databases

Let us take a closer look at key-value and few other types of schema-less databases:

i. Key-value: It maintains a big hash table of keys and values (see the sketch after
this list). For example, Dynamo, Redis, Riak, etc.
Sample Key-Value Pair in Key-Value Database

ii. Document: It maintains data in collections constituted of documents. For
example, MongoDB, Apache CouchDB, Couchbase, MarkLogic, etc.
Sample Document in Document Database
{
"Book Name": "Fundamentals of Business Analytics",
"Publisher": "Wiley India",
"Year of Publication": "2011"
}
iii. Column: Each storage block has data from only one column. For example:
Cassandra, HBase, etc.
iv. Graph: They are also called network databases. A graph database stores data in nodes. For
example, Neo4j, HyperGraphDB, etc.
Sample Graph in Graph Database
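
The sketch below shows the key-value style referred to in item i, assuming the redis Python client and a Redis server running locally on the default port; the keys and values are illustrative.

import redis  # third-party client: pip install redis

# Assumes a Redis server running locally on the default port.
r = redis.Redis(host="localhost", port=6379, db=0)

# A key-value store is essentially a big hash table: put and get by key.
r.set("user:1001:name", "A. Kumar")
r.set("user:1001:last_login", "2024-01-15")

print(r.get("user:1001:name"))   # b'A. Kumar'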

Refer Table for popular schema-less databases.


Table: Popular schema-less databases

Why NoSQL?

i. It has a scale-out architecture instead of the monolithic architecture of relational
databases.
ii. It can house large volumes of structured, semi-structured, and unstructured data.
iii. Dynamic schema: NoSQL database allows insertion of data without a pre-defined
schema. In other words, it facilitates application changes in real time, which thus
supports faster development, easy code integration, and requires less database
administration.
iv. Auto-sharding: It automatically spreads data across an arbitrary number of
servers. More often than not, the application in question is not even aware of the
composition of the server pool. It balances the data and query load across the
available servers; and if and when a server goes down, it is quickly replaced
without any major activity disruption.
v. Replication: It offers good support for replication which in turn guarantees high
availability, fault tolerance, and disaster recovery.

Advantages of NoSQL

Advantages of NoSQL

i. Can easily scale up and down: NoSQL databases support scaling rapidly and
elastically and even allow scaling to the cloud.
(a) Cluster scale: It allows distribution of database across 100+ nodes often in
multiple data centers.
(b) Performance scale: It sustains over 100,000+ database reads and writes per
second.
(c) Data scale: It supports housing of 1 billion+ documents in the database.

ii. Doesn't require a pre-defined schema: NoSQL does not require any adherence
to a pre-defined schema. It is pretty flexible. For example, if we look at MongoDB,
the documents (equivalent of records in an RDBMS) in a collection (equivalent of
a table in an RDBMS) can have different sets of key-value pairs (see the sketch after
this list).

{_id: 101, "BookName": "Fundamentals of Business Analytics", "AuthorName": "Seema Acharya", "Publisher": "Wiley India"}

{_id: 102, "BookName": "Big Data and Analytics"}

iii. Cheap, easy to implement: Deploying NoSQL properly allows for all of the
benefits of scale, high availability, fault tolerance, etc. while also lowering
operational costs.
iv. Relaxes the data consistency requirement: NoSQL databases adhere to the
CAP theorem (Consistency, Availability, and Partition tolerance). Most
NoSQL databases compromise on consistency in favor of availability and partition
tolerance. However, they do go for eventual consistency.
v. Data can be replicated to multiple nodes and can be partitioned: There are
two terms that we will discuss here:
(a) Sharding: Sharding is when different pieces of data are distributed across
multiple servers. NoSQL databases support auto-sharding; this means that they
can natively and automatically spread data across an arbitrary number of servers,
without requiring the application to even be aware of the composition of the
server pool. Servers can be added or removed from the data layer without
application downtime. This would mean that data and query load are
automatically balanced across servers, and when a server goes down, it can be
quickly and transparently replaced with no application disruption.
(b) Replication: Replication is when multiple copies of data are stored across the
cluster and even across data centers. This promises high availability and fault
tolerance.
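
The sketch below illustrates the flexible-schema point from item ii, assuming the pymongo client and a MongoDB server on localhost; the database, collection, and document contents mirror the examples above but are otherwise illustrative.

from pymongo import MongoClient  # third-party client: pip install pymongo

# Assumes a MongoDB server on localhost; database/collection names are illustrative.
client = MongoClient("mongodb://localhost:27017/")
books = client["testdb"]["books"]

# Two documents in the same collection with different sets of key-value pairs:
books.insert_one({"_id": 101, "BookName": "Fundamentals of Business Analytics",
                  "AuthorName": "Seema Acharya", "Publisher": "Wiley India"})
books.insert_one({"_id": 102, "BookName": "Big Data and Analytics"})

# No schema is enforced at write time; query by whatever keys exist.
print(books.find_one({"_id": 102}))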

What we Miss With NoSQL?

• With NoSQL around, we have been able to counter the problem of scale (NoSQL scales
out).
• There is also the flexibility with respect to schema design.

What we miss with NoSQL?


• NoSQL does not support joins.
• However, it compensates for it by allowing embedded documents as in MongoDB.
• It does not have provision for ACID properties of transactions.
• However, it obeys Eric Brewer's CAP theorem.
• NoSQL does not have a standard SQL interface but NoSQL databases such as MongoDB
and Cassandra have their own rich query language [MongoDB query language and
Cassandra query language (CQL)] to compensate for the lack of it.
• One thing which is dearly missed is the easy integration with other applications that
support SQL.

Use of NoSQL in Industry

• NoSQL is being put to use in varied industries.


• They are used to support analysis for applications such as web user data analysis, log
analysis, sensor feed analysis, making recommendations for upsell and cross-sell, etc.

Use of NoSQL in industry

NoSQL Vendors
Few popular NoSQL vendors

SQL versus NoSQL

Few salient differences between SQL and NoSQL.


SQL versus NoSQL

NewSQL

• We need a database that has the same scalable performance of NoSQL systems for On
Line Transaction Processing (OLTP) while still maintaining the ACID guarantees of a
traditional database.
• This new modern RDBMS is called NewSQL.
• It supports the relational data model and uses SQL as its primary interface.
Characteristics of NewSQL

• NewSQL is based on the shared nothing architecture with a SQL interface for application interaction.

Characteristics of NewSQL
Comparison of SQL, NoSQL, and NewSQL

Comparative study of SQL, NoSQL and NewSQL

Hadoop
• Hadoop is an open-source project of the Apache foundation.
• It is a framework written in Java, originally developed by Doug Cutting in 2005 who
named it after his son's toy elephant.
• He was working with Yahoo then.
• It was created to support distribution for "Nutch", the text search engine.
• Hadoop uses Google's MapReduce and Google File System technologies as its
foundation.
• Hadoop is now a core part of the computing infrastructure for companies such as Yahoo,
Facebook, LinkedIn, Twitter, etc.

Hadoop

Features of Hadoop

i. It is optimized to handle massive quantities of structured, semi-structured, and
unstructured data, using commodity hardware, that is, relatively inexpensive
computers.
ii. Hadoop has a shared nothing architecture.
iii. It replicates its data across multiple computers so that if one goes down, the data
can still be processed from another machine that stores its replica.
iv. Hadoop is for high throughput rather than low latency. It is a batch operation
handling massive quantities of data; therefore the response time is not immediate.
v. It complements On-Line Transaction Processing (OLTP) and On-Line Analytical
Processing (OLAP). However, it is not a replacement for a relational database
management system.
vi. It is NOT good when work cannot be parallelized or when there are dependencies
within the data.
vii. It is NOT good for processing small files. It works best with huge data files and
datasets.
Key Advantages of Hadoop

Key advantages of Hadoop

i. Stores data in its native format: Hadoop's data storage framework (HDFS -
Hadoop Distributed File System) can store data in its native format. There is no
structure that is imposed while keying in data or storing data. HDFS is pretty
much schema-less. It is only later when the data needs to be processed that
structure is imposed on the raw data.
ii. Scalable: Hadoop can store and distribute very large datasets (involving
thousands of terabytes of data) across hundreds of inexpensive servers that operate
in parallel.
iii. Cost-effective: Owing to its scale-out architecture, Hadoop has a much reduced
cost/terabyte of storage and processing.
iv. Resilient to failure: Hadoop is fault-tolerant. It practices replication of data
diligently which means whenever data is sent to any node, the same data also gets
replicated to other nodes in the cluster, thereby ensuring that in the event of a node
failure, there will always be another copy of data available for use.
v. Flexibility: One of the key advantages of Hadoop is its ability to work with all
kinds of data: structured, semi-structured, and unstructured data. It can help derive
meaningful business insights from email conversations, social media data, click-
stream data, etc. It can be put to several purposes such as log analysis, data
mining, recommendation systems, market campaign analysis, etc.
vi. Fast: Processing is extremely fast in Hadoop as compared to other conventional
systems owing to the "move code to data" paradigm.

Hadoop has a shared-nothing architecture.

Versions of Hadoop
There are two versions of Hadoop available:

i. Hadoop 1.0
ii. Hadoop 2.0

Versions of Hadoop
Hadoop 1.0

• It has two main parts:


i. Data storage framework: It is a general-purpose file system called Hadoop Distributed File
System (HDFS). HDFS is schema-less. It simply stores data files. These data files can be in just
about any format. The idea is to store files as close to their original form as possible. This in turn
provides the business units and the organization the much-needed flexibility and agility without
being overly worried about what it can implement.
ii. Data processing framework: This is a simple functional programming model initially
popularized by Google as MapReduce. It essentially uses two functions: the MAP and the
REDUCE functions to process data. The "Mappers" take in a set of key-value pairs and generate
intermediate data (which is another list of key-value pairs). The "Reducers" then act on this input
to produce the output data. The two functions seemingly work in isolation from one another, thus
enabling the processing to be highly distributed in a highly-parallel, fault-tolerant, and scalable
way.
• There were, however, a few limitations of Hadoop 1.0. They are as follows:
i. The first limitation was the requirement for MapReduce programming expertise along with
proficiency required in other programming languages, notably Java.
ii. It supported only batch processing, which, although suitable for tasks such as log analysis and
large-scale data mining projects, is pretty much unsuitable for other kinds of projects.
iii. One major limitation was that Hadoop 1.0 was tightly computationally coupled with MapReduce,
which meant that the established data management vendors were left with two options: Either
rewrite their functionality in MapReduce so that it could be executed in Hadoop or extract the data
from HDFS and process it outside of Hadoop. Neither of the options was viable, as both led to process
inefficiencies caused by the data being moved in and out of the Hadoop cluster.
Hadoop 2.0

• In Hadoop 2.0, HDFS continues to be the data storage framework.


• However, a new and separate resource management framework called Yet Another Resource Negotiator
(YARN) has been added.
• Any application capable of dividing itself into parallel tasks is supported by YARN.
• YARN coordinates the allocation of subtasks of the submitted application, thereby further enhancing the
flexibility, scalability, and efficiency of the applications.
• It works by having an ApplicationMaster in place of the erstwhile Job Tracker, running applications on
resources governed by a new NodeManager (in place of the erstwhile TaskTracker).
• ApplicationMaster is able to run any application and not just MapReduce.
• This, in other words, means that the MapReduce Programming expertise is no longer required.
• Furthermore, it not only supports batch processing but also real-time processing.
• MapReduce is no longer the only data processing option; other data processing functions, such as
data standardization and master data management, can now be performed natively on the data in HDFS.

Overview of Hadoop Ecosystems

• The components of the Hadoop ecosystem are shown in Figure

Hadoop ecosystem
• There are components available in the Hadoop ecosystem for data ingestion,
processing, and analysis.
Data Ingestion → Data Processing → Data Analysis
• Components that help with Data Ingestion are:
i. Sqoop
ii. Flume
• Components that help with Data Processing are:
i. MapReduce
ii. Spark
• Components that help with Data Analysis are:
i. Pig
ii. Hive
iii. Impala
HDFS

• It is the distributed storage unit of Hadoop.


• It provides streaming access to file system data as well as file permissions and authentication.
• It is based on GFS (Google File System).
• It is used to scale a single Hadoop cluster to hundreds and even thousands of nodes.
• It handles large datasets running on commodity hardware.
• HDFS is highly fault-tolerant.
• It stores files across multiple machines.
• These files are stored in redundant fashion to allow for data recovery in case of failure.
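
A minimal sketch of interacting with HDFS from Python, assuming the third-party hdfs (WebHDFS) client package and a reachable NameNode; the host, port, user, and paths are illustrative assumptions.

from hdfs import InsecureClient  # third-party WebHDFS client: pip install hdfs

# Assumes a NameNode exposing WebHDFS; host, port, user, and paths are illustrative.
client = InsecureClient("http://namenode-host:9870", user="hadoop")

# Copy a local file into HDFS; it is split into blocks and replicated across DataNodes.
client.upload("/data/raw/weblogs.txt", "weblogs.txt")

# Streaming read access to file system data.
with client.read("/data/raw/weblogs.txt") as reader:
    first_bytes = reader.read(1024)

print(client.list("/data/raw"))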

HBase

• It stores data in HDFS.


• It is the first non-batch component of the Hadoop Ecosystem.
• It is a database on top of HDFS.
• It provides a quick random access to the stored data.
• It has very low latency compared to HDFS.
• It is a NoSQL database, is non-relational and is a column-oriented database.
• A table can have thousands of columns.
• A table can have multiple rows.
• Each row can have several column families.
• Each column family can have several columns.
• Each column can have several key values.
• It is based on Google Big Table.
• This is widely used by Facebook, Twitter, Yahoo, etc.
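
A minimal sketch of random reads and writes against HBase, assuming the happybase Python client, a running HBase Thrift gateway, and an existing table with a column family named cf; the host, table, row keys, and values are illustrative assumptions.

import happybase  # third-party client that talks to the HBase Thrift gateway: pip install happybase

# Assumes an HBase Thrift server and a pre-created table "user_actions" with column family "cf".
connection = happybase.Connection("hbase-host")
table = connection.table("user_actions")

# Real-time random write and read by row key (unlike batch-oriented HDFS access).
table.put(b"user1001", {b"cf:last_page": b"/checkout", b"cf:last_seen": b"2024-01-15"})
print(table.row(b"user1001")[b"cf:last_page"])

# A small range scan over row keys.
for key, data in table.scan(row_prefix=b"user10"):
    print(key, data)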

Difference between HBase and Hadoop/HDFS

i. HDFS is the file system whereas HBase is a Hadoop database; the relationship is analogous to that between NTFS and MySQL.
ii. HDFS is WORM (Write once and read multiple times or many times). Latest versions support
appending of data but this feature is rarely used. However, HBase supports real-time random read
and write.
iii. HDFS is based on Google File System (GFS) whereas HBase is based on Google Big Table.
iv. HDFS supports only full table scan or partition table scan. HBase supports random small range
scan or table scan.
v. The performance of Hive on HDFS is relatively good, but on HBase it becomes 4-5 times slower.
vi. Data in HDFS is accessed via MapReduce jobs only, whereas in HBase the access is via Java
APIs, REST, Avro, or Thrift APIs.
vii. HDFS does not support dynamic storage owing to its rigid structure whereas HBase supports
dynamic storage.
viii. HDFS has high latency operations whereas HBase has low latency operations.
ix. HDFS is most suitable for batch analytics whereas HBase is for real-time analytics.

Hadoop Ecosystem Components for Data Ingestion

i. Sqoop: Sqoop stands for SQL to Hadoop. Its main functions are:
a) Importing data from an RDBMS such as MySQL, Oracle, DB2, etc. into the Hadoop file system
(HDFS, HBase, Hive).
b) Exporting data from the Hadoop file system (HDFS, HBase, Hive) to an RDBMS (MySQL, Oracle,
DB2).

Uses of Sqoop

a) It has a connector-based architecture to allow plug-ins to connect to external systems such as
MySQL, Oracle, DB2, etc.
b) It can provision the data from external system on to HDFS and populate tables in Hive and
HBase.
c) It integrates with Oozie allowing you to schedule and automate import and export tasks.
ii. Flume: Flume is an important log aggregator (it aggregates logs from different machines and places
them in HDFS) component in the Hadoop ecosystem. Flume was developed by Cloudera. It is
designed for high-volume ingestion of event-based data into Hadoop. The default destination in
Flume (called a sink in Flume parlance) is HDFS. However, it can also write to HBase or Solr.

Hadoop Ecosystem Components for Data Processing

i. MapReduce: It is a programming paradigm that allows distributed and parallel processing of huge
datasets. It is based on Google MapReduce. Google released a paper on the MapReduce programming
paradigm in 2004 and that became the genesis of the Hadoop processing model. The MapReduce
framework gets its input data from HDFS. There are two main phases: the Map phase and the Reduce
phase. The map phase converts the input data into another set of data (key-value pairs). This new
intermediate dataset then serves as the input to the reduce phase. The reduce phase acts on the
datasets to combine (aggregate and consolidate) and reduce them to a smaller set of tuples. The
result is then stored back in HDFS. (A minimal word-count sketch of this flow appears after this list.)
ii. Spark: It is both a programming model as well as a computing model. It is an open-source big
data processing framework. It was originally developed in 2009 at UC Berkeley's AmpLab and
became an open-source project in 2010. It is written in Scala. It provides in-memory computing for
Hadoop. In Spark, workloads execute in memory rather than on disk owing to which it is much
faster (10 to 100 times) than when the workload is executed on disk. However, if the datasets are
too large to fit into the available system memory, it can perform conventional disk-based
processing. It serves as a potentially faster and more flexible alternative to MapReduce. It accesses
data from HDFS (Spark does not have its own distributed file system) but bypasses the
MapReduce processing.
• Spark can be used with Hadoop coexisting smoothly with MapReduce (sitting on top of Hadoop
YARN) or used independently of Hadoop (standalone).
• As a programming model, it works well with Scala, Python (it has API connectors for using it with
Java or Python), or the R programming language.
• The following are the Spark libraries:
a) Spark SQL: Spark also has support for SQL. Spark SQL uses SQL to help query data stored in
disparate applications.
b) Spark streaming: It helps to analyze and present data in real time.
c) MLlib: It supports machine learning, such as applying advanced statistical operations on data in
a Spark cluster.
d) GraphX: It helps in graph parallel computation.
• Spark and Hadoop are usually used together by several companies.
• Hadoop was primarily designed to house unstructured data and run batch processing operations on it.
• Spark is used extensively for its high-speed in-memory computing and its ability to run advanced real-time
analytics.
• The two together have been giving very good results.
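
Here is the word-count sketch promised in the MapReduce item above, written in plain Python to show the map, shuffle, and reduce steps in miniature; in a real Hadoop job the framework distributes these phases across the cluster, and in PySpark the same flow would be expressed as flatMap, map, and reduceByKey on an RDD.

from collections import defaultdict

# Input records: here simply lines of text (in Hadoop, read from HDFS).
lines = ["big data is big", "data moves fast"]

# Map phase: each input record is turned into intermediate (key, value) pairs.
def map_phase(line):
    return [(word, 1) for word in line.split()]

intermediate = []
for line in lines:
    intermediate.extend(map_phase(line))

# Shuffle: group the intermediate pairs by key (done by the framework in Hadoop).
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# Reduce phase: aggregate each group into a smaller set of tuples.
def reduce_phase(word, counts):
    return (word, sum(counts))

result = [reduce_phase(word, counts) for word, counts in groups.items()]
print(result)   # [('big', 2), ('data', 2), ('is', 1), ('moves', 1), ('fast', 1)]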

Hadoop Ecosystem Components for Data Analysis

i. Pig: It is a high-level scripting language used with Hadoop. It serves as an alternative to
MapReduce. It has two parts:
(a) Pig Latin: It is a SQL-like scripting language. Pig Latin scripts are translated into MapReduce
jobs which can then run on YARN and process data in the HDFS cluster. It was initially
developed by Yahoo. It is immensely popular with developers who are not comfortable with
MapReduce. However, SQL developers may have a preference for Hive.
There is a "Load" command available to load the data from "HDFS" into Pig. Then one can
perform functions such as grouping, filtering, sorting, joining etc. The processed or computed
data can then be either displayed on screen or placed back into HDFS.
It gives you a platform for building data flow for ETL (Extract, Transform and Load),
processing and analyzing huge data sets.
(b) Pig runtime: It is the runtime environment.
ii. Hive: Hive is a data warehouse software project built on top of Hadoop. Three main tasks
performed by Hive are summarization, querying and analysis. It supports queries written in a
language called HQL or HiveQL which is a declarative SQL-like language. It converts the SQL-
style queries into MapReduce jobs which are then executed on the Hadoop platform.

Difference between Hive and RDBMS

• Both Hive and traditional databases such as MySQL, MS SQL Server, and PostgreSQL support a SQL interface.
• However, Hive is better known as a data warehouse (D/W) rather than a database.
• The differences between Hive and traditional databases, beginning with the schema, are as follows:
i. Hive enforces schema on Read Time whereas RDBMS enforces schema on Write Time. In
RDBMS, at the time of loading/inserting data, the table's schema is enforced. If the data being
loaded does not conform to the schema then it is rejected. Thus, the schema is enforced on write
(loading the data into the database). Schema on write takes longer to load the data into the
database; however it makes up for it during data retrieval with a good query time performance.
However, Hive does not enforce the schema when the data is being loaded into the D/W. It is
enforced only when the data is being read/retrieved. This is called schema on read. It definitely
makes for fast initial load as the data load or insertion operation is just a file copy or move.
ii. Hive is based on the notion of write once and read many times whereas the RDBMS is designed
for read and write many times.
iii. Hadoop is a batch-oriented system. Hive, therefore, is not suitable for OLTP (Online Transaction
Processing) but, although not ideal, seems closer to OLAP (Online Analytical Processing). The
reason being that there is quite a latency between issuing a query and receiving a reply as the query
written in HiveQL will be converted to MapReduce jobs which are then executed on the Hadoop
cluster. RDBMS is suitable for housing day-to-day transaction data and supports all OLTP
operations with frequent insertions, modifications (updates), deletions of the data.
iv. Hive handles static data analysis which is non-real-time data. Hive is the data warehouse of
Hadoop. There are no frequent updates to the data and the query response time is not fast. RDBMS
is suited for handling dynamic data which is real time.
v. Hive can be easily scaled at a very low cost when compared to an RDBMS. Hive uses HDFS to store
data, thus it cannot be considered as the owner of the data, while on the other hand RDBMS is the
owner of the data responsible for storing, managing and manipulating it in the database.
vi. Hive uses the concept of parallel computing, whereas RDBMS uses serial computing.
Hive versus RDBMS

Difference between Hive and HBase

i. Hive is a MapReduce-based SQL engine that runs on top of Hadoop. HBase is a key-value
NoSQL database that runs on top of HDFS.
ii. Hive is for batch processing of big data. HBase is for real-time data streaming.

Impala

• It is a high performance SQL engine that runs on Hadoop cluster.


• It is ideal for interactive analysis.
• It has very low latency measured in milliseconds.
• It supports a dialect of SQL called Impala SQL.

ZooKeeper

• It is a coordination service for distributed applications.

Oozie

• It is a workflow scheduler system to manage Apache Hadoop jobs.

Mahout

• It is a scalable machine learning and data mining library.

Chukwa

• It is a data collection system for managing large distributed systems.

Ambari

• It is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters.

Hadoop Distributions

• Hadoop is an open-source Apache project.


• Anyone can freely download the core aspects of Hadoop.
• The core aspects of Hadoop include the following:
i. Hadoop Common
ii. Hadoop Distributed File System (HDFS)
iii. Hadoop YARN (Yet Another Resource Negotiator)
iv. Hadoop MapReduce
• There are a few companies, such as IBM, Amazon Web Services, Microsoft, Teradata,
Hortonworks, Cloudera, etc., that have packaged Hadoop into more easily consumable
distributions or services.
• Although each of these companies has a slightly different strategy, the key essence
remains the ability to distribute data and workloads across potentially thousands of servers,
thus making big data manageable.
• A few Hadoop distributions are given in Figure.
Hadoop distributions
Hadoop versus SQL

• Table lists the differences between Hadoop and SQL.


Hadoop versus SQL

Integrated Hadoop Systems Offered by Leading Market Vendors

• Refer to the figure for a glimpse of the leading market vendors offering integrated Hadoop
systems.

Integrated Hadoop systems

Cloud-Based Hadoop Solutions

• Amazon Web Services holds out a comprehensive, end-to-end portfolio of cloud
computing services to help manage big data.
• The aim is to achieve this and more along with retaining the emphasis on reducing costs,
scaling to meet demand, and accelerating the speed of innovation.
• The Google Cloud Storage connector for Hadoop empowers one to perform MapReduce
jobs directly on data in Google Cloud Storage, without the need to copy it to local disk
and run it in the Hadoop Distributed File System (HDFS).
• The connector simplifies Hadoop deployment, and at the same time reduces cost and
provides performance comparable to HDFS, all this while increasing reliability by
eliminating the single point of failure of the name node.

Cloud-based solutions
