Module 1 Notes

MODULE-1

CHAPTER-1

Topics

Classification of data, Characteristics, Evolution and definition of Big data,


What is Big data, Why Big data, Traditional Business Intelligence Vs Big
Data, Typical data warehouse and Hadoop environment. Big Data Analytics:
What is Big data Analytics, Classification of Analytics, Importance of Big
Data Analytics, Technologies used in Big data Environments, Few Top
Analytical Tools, NoSQL, Hadoop.

INTRODUCTION:
Today, data is very important for all kinds of businesses — big or small. It is found both inside and outside the
company, and it comes in different forms from many sources. To make good decisions, we need to:
• Collect the data

• Understand it

• Use it properly to get useful information

1.CLASSIFICATION OF DATA
Digital data can be broadly classified into structured, semi-structured, and unstructured data.

1. Unstructured data:

• This type of data has no specific format or structure.


• Computers can't easily understand or process it directly.



• Examples: Word files, emails, images, videos, PDFs, research papers, presentations.
• Around 80–90% of data in companies is unstructured.

2. Semi-Structured Data

• This data has some structure, but not as organized as structured data.
• It's not very easy for computers to use directly.
• Examples: Emails, XML files, HTML pages.
• Has metadata (data about data), but that’s not enough to fully structure it.

3. Structured Data

• This data is well-organized in tables (like rows and columns).


• Computers can easily read and use it.
• Example: Data stored in databases like student records, sales data, etc.
• Relationships between data are also defined (e.g., student and their marks).

Since the 1980s, companies have stored most of their data in something called relational databases (RDBMS).
These databases organize data in tables (like in Excel), using rows and columns. They use tools like primary
keys and foreign keys to manage the data.

Over time, Relational Database Management Systems (RDBMS) became better, cheaper, and easier to use.
People got used to working with them because they made storing and using data much simpler. These systems
mainly handle structured data — data that’s neatly organized.

But when the Internet became popular, companies started dealing with a lot more data — and much of it was
unstructured (like emails, videos, social media posts). It was hard to ignore because it was growing so fast.

In fact, a company called Gartner says that today, about 80% of the data in companies is unstructured, and
only about 10% is structured or semi-structured.

1.1 Structured Data

Data can be structured, semi-structured, or unstructured. Structured data makes up only about 10% of digital data and is
stored in organized formats like tables with rows and columns, usually in databases.
Relational Databases (RDBMS)
• Most structured data is stored in databases called RDBMS (Relational Database Management Systems),
such as Oracle, MySQL, and SQL Server.
• Data in these databases is kept in tables. Each table has rows (each row is one record or entry)
and columns (each column stores a particular type of information, such as employee name or number).



Terms:
Cardinality: Number of rows in a table.
Degree: Number of columns in a table.

Creating a Table
The Employee example described below shows how to design a structured table for storing employee information in an RDBMS.

How Tables Work

• Each table represents a type of object or entity, for example, “Employee”.


• Every column has a specific meaning and data type (like a number or text) and can have rules, such as
“cannot be empty” or “must be unique”.
• Example schema for an Employee table:
• EmpNo (Employee Number): unique for each person
• EmpName (Employee Name)
• Designation (Job Title)
• DeptNo (Department Number)
• ContactNo (Phone Number)

• EmpNo must be unique for each person (PRIMARY KEY)


• Designation and ContactNo cannot be left blank (NOT NULL)
• Each piece of data has a set size or type (for example, Varchar(10) is text up to 10 characters)

5. Table Relationships
• Tables can be linked to each other. For example, each Employee record has a DeptNo, which connects
to the Department table. This shows which department the employee works in. This relationship keeps
data organized and avoids duplication.
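As an illustration only (not from the notes), here is a minimal sketch of how the Employee and Department tables described above could be declared and linked, using Python's built-in sqlite3 module; the column sizes and sample rows are assumed for the example.

import sqlite3

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
conn.execute("PRAGMA foreign_keys = ON")    # SQLite enforces foreign keys only when asked

# Department table: one row per department
conn.execute("""
    CREATE TABLE Department (
        DeptNo   INTEGER PRIMARY KEY,
        DeptName VARCHAR(30) NOT NULL
    )
""")

# Employee table: EmpNo is the primary key, DeptNo links to Department
conn.execute("""
    CREATE TABLE Employee (
        EmpNo       INTEGER PRIMARY KEY,      -- unique for each person
        EmpName     VARCHAR(30),
        Designation VARCHAR(20) NOT NULL,     -- cannot be left blank
        DeptNo      INTEGER REFERENCES Department(DeptNo),
        ContactNo   VARCHAR(10) NOT NULL      -- text up to 10 characters
    )
""")

conn.execute("INSERT INTO Department VALUES (10, 'Sales')")
conn.execute("INSERT INTO Employee VALUES (1, 'Asha', 'Manager', 10, '9876543210')")

# Cardinality = number of rows, Degree = number of columns
rows = conn.execute("SELECT * FROM Employee").fetchall()
print("Cardinality:", len(rows), "Degree:", len(rows[0]))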



6. Where Structured Data Comes From
• Structured data is stored in:
• Databases (like Oracle, MySQL)
• Spreadsheets (like Excel)
• OLTP (Online Transaction Processing) systems

1.1.1 Sources of Structured Data

Structured data is information that is organized in a clear and orderly way, so it’s easy to find and use.
Here’s where it usually comes from:
1. Databases
These are special computer systems designed to store lots of organized data. Some common examples are:
• Oracle
• MySQL
• Microsoft SQL Server
• IBM DB2
• PostgreSQL
Companies use databases for things like keeping track of employees, products, or sales.
2. Spreadsheets
Tools like Microsoft Excel or Google Sheets are also common sources. In a spreadsheet, information is put into
rows and columns, just like in a table. For example, you might have a list of students with their grades.
3. OLTP Systems (Online Transaction Processing)
These are systems businesses use for everyday work, like processing orders, recording payments, or managing
bookings. The data they use and store is highly organized, which makes it easy to update and search quickly.

1.1.2 Ease of working with Structured data



Structured data is easy to handle because it is stored in an organized way (like tables).
1. Insert / Update / Delete
You can easily add, change, or remove data using commands (called DML – Data Manipulation Language).
2. Security
Structured data systems have strong security features like encryption. Only authorized people can see or change
the data.
3. Indexing (Searching Made Faster)
Indexes help find data quickly, just like an index in a book. It uses extra space but makes searching faster.
4. Scalability
You can increase the size and performance of the system easily when you have more data or need faster access.
5. Transaction Processing (ACID Properties)
This ensures that data remains correct and safe during updates:
• Atomicity: A task either happens completely or not at all.
• Consistency: Data stays correct even after changes.
• Isolation: Each task works independently.
• Durability: Once saved, data stays even if the system crashes.
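As a small, made-up illustration of atomicity (the Account table and the transfer amount are assumed, not from the notes), the sqlite3 snippet below wraps two updates in one transaction: either both succeed or both are rolled back.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Account (AccNo INTEGER PRIMARY KEY, Balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO Account VALUES (?, ?)", [(1, 500), (2, 100)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on any error
        conn.execute("UPDATE Account SET Balance = Balance - 700 WHERE AccNo = 1")
        (bal,) = conn.execute("SELECT Balance FROM Account WHERE AccNo = 1").fetchone()
        if bal < 0:                       # simulated business rule: no negative balances
            raise ValueError("insufficient funds")
        conn.execute("UPDATE Account SET Balance = Balance + 700 WHERE AccNo = 2")
except ValueError:
    pass  # the whole transfer was undone (atomicity)

print(conn.execute("SELECT * FROM Account").fetchall())  # balances are unchanged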

1.2 Semi-Structured Data


Semi-structured data is between structured and unstructured. It’s more flexible and doesn’t follow strict
table rules.
Key Features:
1. No Fixed Tables: It doesn’t follow regular database format.
2. Uses Tags: Data is stored with tags (like in XML or JSON), making it easier to understand.



3. Tags and Hierarchies
• Tags (like <name>) are used to show the structure and hierarchy of data.
4. Schema Mixed with Data
• Information about the structure (schema) is mixed along with the actual data values.
5. Unknown or Varying Attributes
• Data may have different attributes, and we might not know these in advance.
• Items in the same group don't need to have the same properties.

1.2.1 Sources of Semi-structured Data

The most common formats for semi-structured data are:


1. XML (eXtensible Markup Language)
• Used by web services (especially with SOAP).
• Stores data with opening and closing tags.
• Example:
<name>John</name>

2. JSON (JavaScript Object Notation)


• Used to transfer data between a web server and a browser.
• Common in modern web applications (using REST).
• Also used in NoSQL databases like MongoDB and Couchbase.
• Example:
{ "name": "John" }



AN EXAMPLE OF HTML IS AS FOLLOWS:

<HTML> : Enclose the entire HTML document. This indicates the start and end of the HTML code.
<HEAD> : Contain meta-information about the document, such as its title and links to stylesheets or scripts.
<TITLE>Place your title here</TITLE>: Sets the page title, which appears in the browser’s title bar or tab.
<BODY BGCOLOR="FFFFFF">: Defines the main content area of the webpage.
The BGCOLOR="FFFFFF" attribute sets the background colour of the page to white using the hexadecimal
colour code FFFFFF.

<CENTER><IMG SRC="clouds.jpg" ALIGN="BOTTOM"></CENTER>


<HR>
<a href="http://bigdatauniversity.com">Link Name</a>
<H1>this is a Header</H1>
<H2>this is a sub Header</H2>
Send me mail at <a href="mailto:[email protected]">[email protected]</a>.
<P>a new paragraph!
<P><B>a new paragraph!</B>
<BR><B><I>this is a new sentence without a paragraph break, in bold italics.</I></B>
<HR>
</BODY>
</HTML>

MEANING
• <CENTER><IMG SRC="clouds.jpg" ALIGN="BOTTOM"></CENTER>
Displays an image "clouds.jpg" centered horizontally on the page, aligned to the bottom of the line.
(Note: <CENTER> and ALIGN attribute are outdated; CSS is currently recommended.)
• <HR>
Renders a horizontal line to separate content sections visually.
• <a href="http://bigdatauniversity.com">Link Name</a>
Creates a clickable hyperlink with the text "Link Name" that, when clicked, leads to the
website http://bigdatauniversity.com.
• <H1>this is a Header</H1>
Displays a large, prominent header (Heading 1 level) that reads: "this is a Header". Usually used for
main titles.
• <H2>this is a sub Header</H2>
Displays a smaller heading (Heading 2 level), typically a subtitle or section title: "this is a sub Header".
• Send me mail at <a href="mailto:[email protected]">[email protected]</a>.
Shows the text "Send me mail at" followed by an email address that is a clickable link. Clicking it opens
the user's email client with the address populated.
• <P>a new paragraph!
Starts a new paragraph with the text "a new paragraph!".
• <P><B>a new paragraph!</B>
Starts another new paragraph where the text "a new paragraph!" is displayed in bold.
• <BR><B><I>this is a new sentence without a paragraph break, in bold italics.</I></B>
The <BR> tag inserts a line break (new line but within the same paragraph).
The text "this is a new sentence without a paragraph break, in bold italics." is
both bold and italicized and appears right after the line break.
• <HR>
Another horizontal line to mark the end or separate further content visually.
• </BODY> and </HTML>
These close the main content section and the entire HTML document respectively.



SAMPLE JSON DOCUMENT

{
"_id": 9,
"BookTitle": "Fundamentals of Business Analytics",
"AuthorName": "Seema Acharya",
"Publisher": "Wiley India",
"YearofPublication": "2011"
}

MEANING

• _id: 9
This is a unique identifier for this record, often used in databases to tell entries apart.
• BookTitle: "Fundamentals of Business Analytics"
This is the title of the book.
• AuthorName: "Seema Acharya"
This specifies the author of the book.
• Publisher: "Wiley India"
This shows which company published the book.
• YearofPublication: "2011"
This gives the year in which the book was published.
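Because the field names travel with the data, a semi-structured record like the one above can be read directly. A quick sketch with Python's json module; the second record is hypothetical and deliberately carries different fields to show the flexible schema.

import json

record = """
{
  "_id": 9,
  "BookTitle": "Fundamentals of Business Analytics",
  "AuthorName": "Seema Acharya",
  "Publisher": "Wiley India",
  "YearofPublication": "2011"
}
"""

book = json.loads(record)                  # the keys (tags) describe each value
print(book["BookTitle"], "-", book["AuthorName"])

# another document may have extra or missing attributes -- no fixed schema
another = json.loads('{"_id": 10, "BookTitle": "Another Title", "Edition": 2}')
print(another.get("AuthorName", "author not recorded"))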

1.3 Unstructured Data


• Unstructured data refers to information that does not conform to a predefined model or structure.
• It’s unpredictable, free-form, and often varies widely from one instance to another.
• Examples include social media posts, emails, and logs.
• Sometimes, patterns exist in unstructured data, leading to debates about whether some of it is actually
"semi-structured."

Issues with "Unstructured" Data


• Not Completely Random:
Although we call it "unstructured," sometimes certain patterns or structure can be implied even in such
data. For example, a log entry might always start with a date and an IP address.
• Semi-Structured Data:
Some argue that certain file types—like plain text files (for example, logs)—have some structure,
putting them somewhere between structured and unstructured data (often called "semi-structured").
• Categorization Debate:
There are debates about when something is truly unstructured, semi-structured, or even structured. The
table tries to give a sense of this "gray area".
Examples (from Table 1.4)

• Twitter message: "Feeling miffed. Victim of twishing."
• Facebook post: "LOL. C ya. BFN"
• Log file entry: 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 ...
• Email: "Hey Joan, possible to send across the first cut on the Hadoop chapter by Friday EOD or maybe we can meet up over a cup of coffee. Best regards, Tom"

The table shows typical unstructured data examples:


• Twitter Message:
"Feeling miffed®. Victim of twishing."
(Short, freeform, unpredictable content)
• Facebook Post:
"LOL. C ya. BFN"
(Abbreviated, informal text—again, unpredictable)
• Log Files:
E.g.,
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 ...
(While logs have repeated structure (format), the contents can be variable; often classified as semi-
structured.)
• Email:
Hey Joan, possible to send across the first cut on the Hadoop chapter by Friday EOD or maybe we can meet up
over a cup of coffee. Best regards, Tom
(Long, open-format, human message; content and structure varies each time.)

1.3.1 Sources of Unstructured Data


This diagram lists typical origins of unstructured data, showing that it comes from a wide variety of places:
• Web pages: The actual content of web pages, which is often complex and not neatly organized.



• Images: Photographs, diagrams, and pictures.
• Free form text: Any text that isn’t organized into records or tables, such as essays or reports.
• Audios: Recordings and voice files.
• Videos: Multimedia files combining images and sound.
• Body of Email: The main content area of emails—not the sender/recipient or time fields, but what
people actually write.
• Text messages: SMS or instant messaging text.
• Chats: Conversations from online chat applications.
• Social media data: Posts, comments, reactions on platforms like Facebook, Twitter, Instagram, etc.
• Word documents: Files created with word processors, often with varying structure.
Key point:
Unstructured data covers a vast and varied spectrum and generally lacks a fixed schema or format.

Issues with Terminology of Unstructured Data


This chart highlights how "unstructured data" is not a precise term:
1. Implied Structure:
Sometimes, there's structure present (e.g., date at the start of a log entry) even if it wasn't pre-defined.
2. Structure Not Helpful:
Data might have some internal structure, but if that structure isn't useful for a given task, it's still treated
as unstructured.
3. Unexpected/Unstated Structure:
Data may be more structured than we realize, but if it's not anticipated or announced, it's called
unstructured.



This image shows a seesaw with “Unstructured data” on one side and “Structured data” on the other, tilting
heavily towards unstructured.
Meaning: Unstructured data makes up the majority of enterprise (business/organizational) information,
outweighing structured data.

Techniques to Find Patterns in Unstructured Data


1.Data Mining:
Data mining is the analysis of large data sets to identify consistent patterns or relationships between variables. It
draws upon artificial intelligence, machine learning, statistics, and database systems.
Think of it as the "analysis step" in the process called "knowledge discovery in databases."
Popular Data Mining Algorithms:
• Association Rule Mining:
• Also called: Market basket analysis or affinity analysis
• Purpose: Determines "What goes with what?"
For example, if someone buys bread, do they also tend to buy eggs or cheese?
• Use: Helps stores recommend or place products together based on previous purchases.
• Regression Analysis:
• Purpose: Predicts the relationship between two variables.
• How: One variable (dependent variable) is predicted using other variables (independent
variables).
• Use: Estimate outcomes, trends, or values based on related data.
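A toy sketch of the "what goes with what" idea behind association rule mining, counting how often items appear together in a few made-up purchase baskets (real algorithms such as Apriori add support and confidence thresholds on top of this counting):

from itertools import combinations
from collections import Counter

# hypothetical purchase baskets
baskets = [
    {"bread", "eggs", "cheese"},
    {"bread", "eggs"},
    {"bread", "butter"},
    {"eggs", "cheese"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):   # every pair bought together
        pair_counts[pair] += 1

# the most frequent pairs are candidates for "goes with" recommendations
for pair, count in pair_counts.most_common(3):
    print(pair, count)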

The table below shows user preferences for learning modes (audio, video, text):

User     Learning using Audios   Learning using Videos   Textual Learners
User 1   Yes                     Yes                     No
User 2   Yes                     Yes                     Yes
User 3   Yes                     Yes                     No
User 4   Yes                     ?                       ?

The goal is to predict whether User 4 likes videos or texts, using known preferences and user similarities.
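A minimal sketch of that prediction, assuming the preference table above: each known user is scored by how closely they match User 4 on the attributes User 4 has already answered, and the similar users then "vote" on the missing ones.

# known preferences (True = Yes, False = No); None = unknown
users = {
    "User 1": {"audio": True, "video": True, "text": False},
    "User 2": {"audio": True, "video": True, "text": True},
    "User 3": {"audio": True, "video": True, "text": False},
}
target = {"audio": True, "video": None, "text": None}   # User 4

def similarity(known, candidate):
    """Count matching answers on the attributes the target has filled in."""
    shared = [k for k, v in candidate.items() if v is not None]
    return sum(known[k] == candidate[k] for k in shared)

for attr in ("video", "text"):
    votes = [(similarity(prefs, target), prefs[attr]) for prefs in users.values()]
    yes = sum(sim for sim, val in votes if val)
    no = sum(sim for sim, val in votes if not val)
    print(f"User 4 likely likes {attr}: {yes > no} (yes={yes}, no={no})")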
Other Pattern-Finding Techniques (Summarized):
• Collaborative Filtering:
Predicts what a user will like based on the preferences of similar users. (E.g., Netflix suggestions)



• Text Analytics/Text Mining:
Extracts meaningful information from unstructured text (like social media or emails). Tasks include
categorization, clustering, sentiment, or entity extraction.
• Natural Language Processing (NLP):
Enables computers to understand and interpret human language.
• Noisy Text Analytics:
Deals with messy data (chats, messages) that may have errors or informal language.
• Manual Tagging with Metadata:
Attaching manual tags/labels to data to add meaning/structure.
• Part-of-Speech Tagging:
Tagging text with its grammatical parts (noun, verb, adjective, etc.)
• Unstructured Information Management Architecture (UIMA):
A platform to process unstructured content (text, audio) in real time for extracting relevant meaning.

2.CHARACTERISTICS OF DATA
Data has three key characteristics.
1. Composition
This refers to the structure of data: its sources, granularity, types, and whether it is static or involves real-time
streaming.
It answers questions like: What is the origin of the data? Is it organized as batches or streams? Is it highly
granular or summarized?

2. Condition
• This addresses the state or quality of the data.
• It focuses on whether the data is ready for analysis or if it needs to be cleansed or improved through
enrichment. Typical questions include: Is the data suitable for immediate use? Does it need
preprocessing?

3.Context
o Context covers the background in which data was generated or is being used.
o It helps answer where, why, and how the data came about, its sensitivity, and associated events.
For example: Where did this data originate? Why was it created? What events are linked to it?

Traditional ("small") data largely involved certainty—known sources and little change in composition or
context. In contrast, modern "big data" deals with greater complexity, with much more focus on understanding
why, how, and in what circumstances data was generated, making these three characteristics crucial for
effective data usage.



3.EVOLUTION OF BIG DATA

• Before 1980: Simple, structured data stored in mainframes, limited usage.


• 1980s-1990s: Relational databases enable more sophisticated, relational data storage and some data
analysis.
• 2000s-now: Boom of the internet and new technologies generates vast, varied data (structured,
unstructured, multimedia); data is now a strategic asset driving decisions and innovations.

4.DEFINITION OF BIG DATA

Key Points on Defining Big Data


Flexible Definitions:
• Beyond Human/Technical Limits: Anything exceeding current human or technical infrastructure for
storage, processing, and analysis.
• Relativity: What is considered "big" today may be normal tomorrow, showing how fast the landscape
evolves.
• Massive Scale: Sometimes simply defined as terabytes, petabytes, or even zettabytes of data.
• Three Vs: Most commonly, Big Data is described using the "3 Vs": Volume, Velocity, and Variety.



Aspect Description

Volume Enormous amounts of data, both structured and unstructured.

Velocity Speed at which new data is generated and must be processed.

Variety Diversity in data types: text, images, videos, logs, streams, etc.

Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective,
innovative forms of information processing for enhanced insight and decision making.
— Gartner IT Glossary



Gartner's definition of big data can be shown as a simple flow diagram.
• High-volume, high-velocity, high-variety data
→ Need for cost-effective, innovative information processing
→ Leads to enhanced insight and better decision making
• Cost-effective and Innovative Processing:
Big Data requires new technologies and approaches to ingest, store, and analyze huge, fast-flowing, and
diverse data sets.
• Enhanced Insight and Decision-Making:
The ultimate goal is to derive deeper, richer, and more actionable insights—turning data into
information, then actionable intelligence, leading to better decisions and greater business value.
This chain is summarized as:
Data → Information → Actionable intelligence → Better decisions → Enhanced business value
Big Data isn't just about size; it's about complexity, speed, diversity, and the ability to draw deeper insights to
achieve a competitive edge in decision making.

5.CHALLENGES WITH BIG DATA

Main Challenges with Big Data


1.Exponential Data Growth
• Data is growing at an exponential rate, with most existing data generated in just the last few years.
• Key questions include: Will all this data be useful? Should we analyze all data or just a subset? How do
we distinguish valuable insights from noise?
Meaning:
Data is increasing very fast, with most of it created in the last few years.
• The challenge is:
o Do we really need all of it?
o Should we analyze everything or only the important parts?
o How do we separate useful insights from unnecessary noise?

2.Cloud Computing and Virtualization


• Managing big data infrastructure often involves cloud computing, which provides cost-efficiency and
flexibility.



• However, deciding whether to host data solutions inside or outside the enterprise adds complexity due to
concerns about control and security.
Meaning:
• Companies usually use cloud platforms to store and process big data because it’s cheaper and
flexible.
• The issue is deciding where to keep the data:
o Inside the company (more control, but costly), or
o Outside in the cloud (cheaper, but security concerns).

3.Retention Decisions
• Determining how long to keep data is challenging. Some data may only be relevant for a short period,
while other data could have long-term value.
• There is always a balance between useful retention and quickly obsolete information

Meaning:
Some data is only useful for a short time, while some may be valuable for years.
The problem is finding the right balance:
• Keep too much → storage cost goes up.
• Delete too quickly → risk losing important information.

4.Lack of Skilled Professionals


There is a shortage of experts in data science, which is crucial for implementing effective big data solutions.
5.Technical Issues
Big data involves datasets too large for traditional databases.
No clear limit defines when data becomes "big,” and new methods are needed as data changes rapidly and in
unpredictable ways.
Challenges include not only storage, but also capturing, preparing, transferring, securing, and visualizing the
data.
Meaning:
Big data is so large and complex that normal databases can’t handle it.
There’s no exact point where data becomes “big” — it depends on how fast it grows and how
unpredictable it is.
The problems are not just about storing data, but also about:
• Collecting it (from many sources)



• Cleaning and preparing it (removing errors, formatting)
• Moving it (fast transfer across systems)
• Protecting it (security)
• Presenting it (easy to understand).

6.Need for Data Visualization


Clear and effective visualization is essential to making sense of vast datasets.
There aren’t enough specialists in data or business visualization to meet demand.
Meaning:
Big data is useless if people can’t understand it.
Charts, graphs, and dashboards make large datasets easier to analyze and use.
The challenge: there aren’t enough experts who can create good visualizations to meet the growing
demand.

Figure 2.4: Visualizing Big Data Challenges


The diagram highlights the following core challenges in handling big data:
• Capture (gathering data from multiple sources)
• Storage (handling massive volumes of information)
• Curation (organizing and maintaining data quality)
• Search (efficiently finding relevant information)
• Analysis (extracting insights)
• Transfer (moving data across locations or systems)
• Visualization (presenting data in understandable formats)
• Privacy (preventing privacy violations and ensuring data security)

6.WHAT IS BIG DATA?


Big data refers to data that is extremely large in volume, moves at a high velocity, and comes in a wide variety
of forms. The concept of big data is usually captured by the "3 Vs":
6.1 Volume:
Massive amounts of data, ranging from terabytes to yottabytes.
Growth of Data (Volume)
Data grows from small units (bits, bytes) to massive scales, as shown in the growth path:
Bits → Bytes → Kilobytes → Megabytes → Gigabytes → Terabytes → Petabytes → Exabytes → Zettabytes →
Yottabytes
Velocity: The speed at which data is generated and processed, from batch to real-time streams.
Variety: Diversity in data sources and formats (structured, unstructured—like text, video, databases, etc.).



Unit Size (in bytes)

Bits 0 or 1

Bytes 8 bits

Kilobytes 1,024 bytes

Megabytes 1,024² bytes

Gigabytes 1,024³ bytes

Terabytes 1,024⁴ bytes

Petabytes 1,024⁵ bytes

Exabytes 1,024⁶ bytes

Zettabytes 1,024⁷ bytes

Yottabytes 1,024⁸ bytes
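Since each unit is 1,024 times the previous one, the sizes can be checked with a couple of lines of Python:

units = ["Kilobyte", "Megabyte", "Gigabyte", "Terabyte", "Petabyte",
         "Exabyte", "Zettabyte", "Yottabyte"]

for power, name in enumerate(units, start=1):
    print(f"1 {name} = 1024^{power} bytes = {1024 ** power:,} bytes")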



6.1.1 Where is Big Data Generated?
Big data can be generated from a multitude of sources, both internal and external:
• Files and Documents: XLS, DOC, PDF files (often unstructured).
• Multimedia: Video (YouTube), audio, social media.
• Communication: Chat messages, customer feedback forms.
• Other Examples: CCTV footage, weather forecasts, mobile data.
Internal Data Sources in Organizations
• Data Storage: File systems, relational databases (RDBMS like Oracle, MS SQL Server, MySQL,
PostgreSQL), NoSQL databases (MongoDB, Cassandra).
• Archives: Scanned documents, customer records, patient health records, student records, etc.

Visual Summary of the "3 Vs" of Big Data


The second image illustrates that big data is not just about sheer volume, but also about its velocity (e.g., real-
time vs. batch) and variety (e.g., from mobile devices, social networks, audio, video, databases).

SOURCES OF BIG DATA


Big data is therefore characterized by its large volume, high velocity, and multiple varieties, and is generated
from numerous sources—ranging from structured databases to unstructured social media, videos, and
organizational records.

6.2 Velocity
Velocity describes the speed at which data is generated, collected, and needs to be processed. In the past, data
used to be processed in batches—meaning all data was collected over a period and then analyzed together (for
example, payroll calculations). Today, data increasingly needs to be processed in real time, or near real time, as
it arrives. This evolution is summarized as:
Batch → Periodic → Near real time → Real-time processing
Modern organizations now expect their systems to process and respond to data instantly or within seconds,
rather than waiting for slow, scheduled processing.



• Velocity means how fast data is created, collected, and processed.
• In the past, companies used batch processing → they collected data for some time and then
analyzed it all at once (example: payroll done at the end of the month).
• Now, data comes in very fast (like social media posts, online transactions, stock market updates,
IoT sensor data).
• Businesses need to process this data immediately (real-time) or within a few seconds (near real-
time).

Evolution of Data Processing


1. Batch → Data processed after a long gap (hours/days).
2. Periodic → Processed at shorter fixed intervals (like every hour).
3. Near real-time → Data processed almost instantly but with a tiny delay (a few seconds/minutes).
4. Real-time → Processed instantly, as soon as data arrives.

6.3 Variety
Variety refers to the diversity of data types and sources that organizations must handle. It is categorized into
three types:
1. Structured Data:
• Highly organized and easily searchable (e.g., data stored in relational databases like RDBMS,
traditional transaction processing systems).
2. Semi-Structured Data:
• Not as rigidly organized, but contains tags or markers to separate elements (e.g., HTML, XML).
3. Unstructured Data:
• No predefined structure. Examples include text documents, emails, audios, videos, social media
posts, PDFs, and photos. Unstructured data is the most challenging but also the biggest source of
insights.
Variety means organizations must be able to manage everything from traditional database records and
spreadsheets to social media posts, sensor logs, images, and more—all of which may require specialized
processing techniques.
7.WHY BIG DATA?

Big data is important because the more data we have for analysis, the more accurate our analytical results
become. This increased accuracy boosts our confidence in the decisions we make based on these analyses. With
this greater confidence, organizations can realize significant positive outcomes, namely:
• Enhanced operational efficiency
• Reduced costs
• Less time spent on processes
• Increased innovation in developing new products and services
• Optimization of existing offerings
• The process can be visualized as a sequence:
More data → More accurate analysis → Greater confidence in decision making → Greater operational
efficiencies, cost reduction, time reduction, new product development, and optimized offerings



8. TRADITIONAL BUSINESS INTELLIGENCE VS BIG DATA
1. Data Storage and Architecture
• Traditional BI:
All enterprise data is stored on a central server (usually on a single or a few large database servers).
• Big Data:
Data is stored in a distributed file system (spread across many servers or nodes). Distributed systems
can scale “horizontally” by adding more servers (nodes), rather than making a single server bigger
(“vertical” scaling).
2. Data Analysis Mode
• Traditional BI:
Data analysis usually happens in offline mode, meaning data is collected and then analyzed at a later
time (batch processing).
• Big Data:
Analysis can happen both in real time and in offline (batch) mode.
3. Data Type and Processing Method
• Traditional BI:
Deals mostly with structured data (data that fits neatly into tables, like databases). The typical
approach is to move data to the processing function (“move data to code”).
• Big Data:
Handles all types of data: structured, semi-structured, and unstructured (such as logs, images, social
media text, etc.). In Big Data systems, it is more common to move the processing function to where
the data is (“move code to data”).

9. TYPICAL DATA WAREHOUSE ENVIRONMENT



A Data Warehouse is a central place where a business collects and manages its data from different sources.
Here's how it works, step by step:
It’s like a big storage room where a company keeps all its important data (from many sources) in one
organized place, so it can be used for analysis and decision-making.

Step 1: Data Collection (Sources)


• Data comes from different systems inside and outside the company, such as:
o ERP systems (finance, HR, inventory)
o CRM systems (customer details, sales info)
o Old legacy systems (still in use)
o Third-party apps (external software)
• This data can be in many formats:
o Databases (Oracle, SQL Server, MySQL)
o Excel sheets
o Text/CSV files

Step 2: Data Integration (ETL Process)


• Since data comes in different formats, it must be:
o Extracted (taken out from sources)
o Transformed (cleaned and converted into a common format)
o Loaded (sent into the warehouse)
• This process is called ETL.
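A toy ETL pass, sketched in Python under simple assumptions (the CSV content is made up and an in-memory SQLite table stands in for the warehouse), just to show the extract → transform → load shape:

import csv
import io
import sqlite3

# Extract: read raw rows from a source file (hypothetical CSV content shown inline)
raw = io.StringIO("order_id,amount,region\n1, 250 ,south\n2,99,NORTH\n")
rows = list(csv.DictReader(raw))

# Transform: clean the values into one common format
cleaned = [
    (int(r["order_id"]), float(r["amount"]), r["region"].strip().title())
    for r in rows
]

# Load: insert the cleaned rows into the warehouse table
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (order_id INTEGER, amount REAL, region TEXT)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)
warehouse.commit()

print(warehouse.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall())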

Step 3: Loading Data


• After ETL, the cleaned data goes into the Data Warehouse.
• It’s stored at the enterprise level (for the whole company).
• Sometimes, smaller warehouses called Data Marts are made for specific teams (like sales, HR).

Step 4: Business Intelligence & Analytics


• Once data is ready in the warehouse, companies can use tools to:
o Run quick queries (questions to the database)
o Create dashboards (visual summaries)
o Do data mining (find patterns and trends)
o Generate reports
• This helps managers make smarter, faster, data-driven decisions.



• In short:
A Data Warehouse is like a super-organized library of business data.
• Collect from different sources → Clean & combine (ETL) → Store in warehouse → Analyze using BI
tools.

10. TYPICAL HADOOP ENVIRONMENT


Key Differences Between Hadoop and Data Warehousing
1.Source and Type of Data
• Hadoop:
Collects data from a wide and diverse set of sources—web logs, images, videos, social media content
(Twitter, Facebook, etc.), documents, PDFs, and more. It is designed to handle not just structured data,
but also semi-structured and unstructured data. This includes data both within and outside the company's
firewall.
• Data Warehouse:
Traditionally focuses on structured data from well-defined business applications like ERP, CRM, or
legacy systems, typically within the organization's boundaries.
2. Storage Mechanism
• Hadoop:
Uses the Hadoop Distributed File System (HDFS) to store data reliably across many servers. Data of
various types and sizes is kept in this distributed file system, which is highly scalable and fault tolerant.
• Data Warehouse:
Uses relational databases or similar systems where data is stored in tables with fixed schemas (rows and
columns).

3. Processing and Output


• Hadoop:
Processing is done via MapReduce, a programming model that allows massive scalability by dividing
tasks across multiple nodes. After processing, data can be sent to different destinations: back to
operational systems, to data warehouses, data marts, or operational data stores (ODS) for further
analysis.
• Data Warehouse:
Processing is done using SQL queries, and data is mostly kept in place for analytics and reporting.
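The MapReduce model that Hadoop uses can be sketched in plain Python: a map step emits key/value pairs, a shuffle step groups them by key, and a reduce step aggregates each group. (In a real Hadoop job these steps run distributed across many nodes; this single-machine word count only illustrates the idea.)

from collections import defaultdict

lines = ["big data is big", "hadoop processes big data"]

# Map: emit (word, 1) for every word in every line
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the emitted values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each group
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}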
4. Integration and Use
• In Hadoop environments, data can flow from many types of source systems (logs, media, social
platforms, documents) into Hadoop, where it is processed and then routed to the relevant business
destinations (operational systems, warehouses, marts, ODS) for final use or reporting

• Hadoop is built for handling massive volumes and varieties of data (structured and
unstructured, internal and external), storing it in a distributed fashion with flexible processing
pipelines.
• Traditional Data Warehousing excels at managing structured, business-critical data in
centralized, organized storage.

CHAPTER-2
Big Data Analytics:
1.What is Big data Analytics?

Big Data Analytics refers to techniques and technologies for analyzing extremely large and complex
data sets to extract meaningful information and support better business decisions.

Big Data Analytics is:

1. Technology-enabled analytics
Modern tools such as Tableau, R, SAS, and IBM analytics platforms are used to help process and analyze big data.

2. Deeper insights into business


Big data can tell companies what customers like, dislike, and need.
Example: Knowing which age group buys sports shoes most often → so companies can target ads better.

3. Example of personalization
Online stores (like Amazon/Flipkart) recommend products because they remember your past purchases and
browsing history.
If you bought a phone, the site may suggest a cover or earphones next.

4. Competitive edge
Businesses that use data wisely make faster and smarter decisions than competitors.
Example: A supermarket adjusting stock immediately when it sees that bread sales are rising.

5. Collaboration across teams


To use big data well, people from different fields need to work together:
• IT experts → manage systems
• Data scientists → analyze data
• Business users → make decisions from results

6. Handling huge and complex data


Big data is not only huge in size (terabytes, petabytes, etc.), but also comes in many types (text, videos, social
media, transactions).
Normal software cannot handle it → so we need special storage and tools.

7. Distributed processing
When data grows too big, a single computer cannot process it.
Instead, the work is split across many computers working together (like dividing a big task among friends).
This is how companies handle exabytes or zettabytes of data efficiently.



2. CLASSIFICATION OF ANALYTICS
There are two major schools of thought for classifying analytics:
2.1 First School of Thought
This school divides analytics based on maturity and business value:
Basic analytics:
Focus: Simple data slicing and dicing for basic business insights.
Methods: Historical reports and basic visualizations.
Operationalized analytics:
Focus: Analytics integrated into business processes for routine use.
Advanced analytics:
Focus: Using predictive and prescriptive models to forecast the future.
Monetized analytics:
Focus: Directly generating business revenue with analytics.

2.2 Second School of Thought


This school classifies analytics by historical and technological evolution: Analytics 1.0, 2.0, and 3.0.
Table Summary:

Analytics 1.0 (Era: 1950s to 2009)
• Focus: Descriptive statistics
• Key Questions: What happened? Why did it happen?
• Data: Legacy, structured, internal
• Technology: Relational DBs

Analytics 2.0 (Era: 2005 to 2012)
• Focus: Descriptive + Predictive statistics
• Key Questions: What will happen? Why will it happen?
• Data: Big data (often unstructured, external)
• Technology: Hadoop, distributed clusters

Analytics 3.0 (Era: 2012 to present)
• Focus: Descriptive + Predictive + Prescriptive
• Key Questions: What will happen? When will it happen? Why? What should we do?
• Data: Blend of big data, legacy, internal & external sources
• Technology: Advanced in-memory, machine learning, cloud, NoSQL



Growth of Analytics Types (Figure 3.6)
Analytics evolves from hindsight to foresight:
• Descriptive Analytics: What happened? (Past)
• Diagnostic Analytics: Why did it happen? (Past)
• Predictive Analytics: What will happen? (Future)
• Prescriptive Analytics: How can we make it happen? (Actionable future)
Each step adds more value, moving from simple reporting to actionable insights that drive decisions.

• First school: Stages of business adoption (basic → monetized)


• Second school: Analytical maturity (1.0 → 3.0) and questions increasing in complexity (from what/why
happened to what to do next)
• Big data growth is a cycle: more data produced → more stored → more analyzed → better predictions
→ more insights → steady growth.

3.IMPORTANCE OF BIG DATA ANALYTICS

1. Reactive – Business Intelligence:


• Focuses on analyzing past or historical data.
• Delivers reports and dashboards to help businesses make better decisions by providing the right
information at the right time.
• Primarily supports pre-defined and ad hoc reporting but stays limited to static data and trends from
the past.
2. Reactive – Big Data Analytics:
• Uses large datasets (big data), but the approach remains reactive, meaning it's still based on
analyzing historical (static) data rather than anticipating future outcomes.
• The scale is larger, but the mindset is similar to traditional business intelligence—focused on what
has already happened.
3.Proactive – Analytics:
• Moves beyond the past, supporting decision-making about the future.
• Utilizes data mining, predictive modeling, text mining, and statistical analysis.
• Still faces limitations if traditional databases are used on big data, as they may struggle with
storage and processing at this scale.
4.Proactive – Big Data Analytics:



• Focuses on filtering and analyzing huge volumes of data—from terabytes to exabytes.
• Enables rapid, advanced insights by leveraging modern tools to identify relevant data and solve
complex, large-scale business problems.
• Supports real-time or near-real-time analysis, empowering businesses to act quickly and
proactively, not just react after the fact.

Summary of Evolution:
• Reactive approaches ask: "What happened?" and "Why did it happen?"
• Proactive approaches ask: "What is likely to happen next?" and "How can we use the data to improve
outcomes?"
• Big data analytics scales these approaches, enabling deeper, faster, and more actionable insights using
enormous and complex datasets.

4.TERMINOLOGIES USED IN BIG DATA ENVIRONMENTS

4.1 In-Memory Analytics


The Problem with Old Method (Traditional Storage)
Data is stored on hard disks.
Hard disks are slow compared to memory.
So when the CPU needs data, it takes a long time to fetch it.
To save time, companies sometimes pre-calculate summaries or reports in advance (like totals or averages).
But if later you need different details, you must go back to the slow disk again → wastes time.
Example: Imagine reading a big book from a library shelf every time you need an answer → it’s slow.

The In-Memory Solution
In-memory analytics keeps the data in the computer’s main memory (RAM) instead of on the hard disk, so queries and calculations run much faster and reports do not have to be pre-calculated in advance. It is like keeping the book open on your desk instead of walking to the library shelf for every question.
4.2 In-Database Processing
In-database processing (also called in-database analytics) is a technique where analytics and computation are
performed directly within the database where the data is stored, rather than exporting the data to a separate
analytics tool or environment.
How Does It Work?
1. Traditional Approach:
• Data from enterprise systems (like OLTP—Online Transaction Processing systems) is first
cleaned up (removing duplicates, scrubbing, etc.) using processes like ETL (Extract, Transform,
Load).
• The cleaned data is then loaded into data warehouses (EDW—Enterprise Data Warehouses) or
data marts.
• For advanced analysis, the data is exported out of the database into external analytical tools or
programs.
• Problem: Exporting large datasets is time-consuming and often inefficient.
2. In-Database Processing:
• Instead of exporting, the database itself performs the analytics.
• Analytical programs and computations run inside the database engine.
• This eliminates the need for data export and significantly saves time, especially with huge
datasets.
• As a result, data is kept secure and analytics are faster.
Benefits
• Speed: Dramatically faster analytics since data doesn’t move out of the database.
• Efficiency: Reduces data movement and duplication.
• Scalability: Can handle very large datasets efficiently.
• Security: Data stays within the secure database environment.
Who uses it?
Leading database vendors (like Oracle, Microsoft SQL Server, IBM, etc.) now include in-database analytics
features, especially for large businesses that need fast and secure analytics on big data.



In summary, in-database processing allows analytical computations to be performed right where the
data lives, making analytics faster, more efficient, and secure—ideal for handling big data in modern
organizations.
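The contrast can be sketched with SQLite standing in for the database (a toy illustration, not the actual in-database analytics features of Oracle, SQL Server, or IBM): the same summary is computed once by exporting every row to the application, and once inside the database engine itself.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [("South", 250.0), ("North", 99.0), ("South", 120.0)])

# Export-then-analyze: pull every row out, then compute in the application
rows = db.execute("SELECT region, amount FROM sales").fetchall()
totals = {}
for region, amount in rows:
    totals[region] = totals.get(region, 0) + amount

# In-database processing: the same computation runs inside the database engine
in_db = dict(db.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall())

print(totals, in_db)   # same result, but no bulk data export in the second case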
4.3 Symmetric Multiprocessor System (SMP)
A Symmetric Multiprocessor System (SMP) is a computer architecture where two or more identical processors
share a single, common main memory. In this setup:
• All processors have equal access to all input/output (I/O) devices.
• The entire system is controlled by a single operating system instance.
• These processors are tightly connected, meaning each processor has its own high-speed cache memory,
but all are linked to the common main memory via a system bus.
• SMP systems are often used for tasks requiring high performance and reliability, as multiple processors
can work simultaneously on different parts of a task or multiple tasks

How it works (step by step)


• Processors work in parallel → all can do tasks at the same time.
• When a processor needs data:
o It first checks its cache (fast memory).
o If not found, it goes to the main memory through the system bus.
• All processors share the same memory and I/O devices, so the bus arbiter manages access.
• This setup makes the system faster (multiple processors working together) and reliable (if one processor fails, others still work).

4.4 Massively Parallel Processing (MPP)

MPP is a computer setup where many processors work at the same time on different parts of a task.
Key Features
• Each processor has its own memory + operating system → Unlike SMP (where all processors share one memory), in MPP each processor has its own private memory.
• Parallel work → A big task is broken into smaller tasks, and each processor works on its piece at the same time.
• Communication → Since processors are separate, they talk to each other through a messaging interface.
• Programming is harder → Because tasks must be split correctly, and processors need to share results properly.
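A small single-machine illustration of the divide-and-combine idea, using Python's multiprocessing module: each worker process has its own memory, handles one chunk of the task, and sends its partial result back as a message. (This is only an analogy for how an MPP system splits work across processors, not an actual MPP setup.)

from multiprocessing import Pool

def partial_sum(chunk):
    """Each worker processes its own chunk in its own memory space."""
    return sum(chunk)

if __name__ == "__main__":
    numbers = list(range(1, 1001))
    chunks = [numbers[i:i + 250] for i in range(0, len(numbers), 250)]

    with Pool(processes=4) as pool:
        partials = pool.map(partial_sum, chunks)   # partial results come back as messages

    print(sum(partials))   # 500500, combined from four independent workers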



4.5 Difference between Parallel and Distributed Systems

Parallel Systems
• Tightly Coupled: All processors are closely connected, either sharing the same physical memory or
using high-speed links to coordinate.
• Single System Image: Users see the entire setup as one system and aren’t aware of which processor is
handling their request.
• Cooperation: Processors (like P1, P2, P3 in Figure 3.10) work together on the same task—such as
processing a database query—at the same time, dividing up the workload for faster results.
• Shared Memory: As shown in Figure 3.11, processors can access a common memory area, allowing for
very fast communication.
• Use Case: Speeding up large tasks by breaking them into smaller parts that are solved at the same time.
Distributed Systems
• Loosely Coupled: The system is made up of multiple separate machines (computers) connected over a
network.
• Individual Operation: Each machine can run its own applications and serve its own users independently.
• Distribution: Data and work are spread across many machines. When a user makes a request, multiple
machines may need to be contacted to provide a complete answer.
• Communication: Machines communicate with each other over a network, often using messages.
• Use Case: Handling tasks that are naturally separated (like different departments of a company working
on different parts of a project), or when resources are spread geographically.

In short:
• Parallel systems: Many processors working closely together as one unit.
• Distributed systems: Many independent machines working together over a network.



• Users → People who give input/queries.
• Front-end computer → Acts like a “manager” that takes requests from users and distributes work.
• Back-end parallel system (P1, P2, P3) → Multiple processors, each with its own storage.
• Each processor can work independently on a part of the task.
• Together, they process faster because tasks are divided among them.
• Example: Imagine students (P1, P2, P3) in a classroom, each with their own notebook. The teacher (front-end) gives different parts of a big problem to each student. All solve their parts simultaneously → work finishes quickly.

Figure 3.12: Distributed System (Database Focus)

• This figure shows three separate machines (P1, P2, and P3).
• Each machine:
o Has its own users.
o Has its own storage/database.
o Is connected via a network.
• How it works: Each machine can serve its own users independently. If a query needs data from more than one machine, the system communicates through the network to gather results from multiple destinations.



• Each box has a processor (CPU) and its own memory.
• Unlike parallel systems with shared memory, here each processor has its own private memory.
• The processors are connected to each other through a network (links shown with arrows).
They work together by sending messages to share data or coordinate tasks

How it Works
Think of a group project:
• Each student (processor) has their own notebook (memory).
• They cannot directly use each other’s notebooks.
• Instead, if one student needs information, they must ask (send a message) and the other student replies.
• Together, by communicating, they complete the big project.

4.6 Shared Nothing Architecture (SNA)


In large multiprocessor or distributed systems, there are three typical architectures:
1. Shared Memory (SM): All processors share a central memory.
2. Shared Disk (SD): All processors have their own private memory but share one or more disks for
storage.
3. Shared Nothing (SN): Neither memory nor disk is shared among processors. Each processor (or
node) has its own memory and storage.
Shared Nothing Architecture means each node is independent. It does not share memory or disk with other
nodes. Nodes communicate only via high-level messages (such as over a network).
4.6.1 Advantages of Shared Nothing Architecture
1. Fault Isolation
• If a node fails, the problem and its effects are contained only within that node.
• Other nodes are unaffected.
• Faults do not spread (because nodes don't share hardware resources), making the system more robust
and easier to recover.
2. Scalability
• Because nodes don’t have to share memory or disks, adding more nodes is simple.
• Each new node brings its own resources.
• There's no single resource bottleneck (unlike, say, a shared disk or shared memory, where eventually
too many nodes will be fighting for access).



• This makes scaling up to many nodes much easier. Systems can grow by just adding more
independent nodes.
Architecture           Memory     Disk            Scalability    Fault Isolation
Shared Memory (SM)     Shared     May be shared   Limited        Poor
Shared Disk (SD)       Separate   Shared          Some limits    Moderate
Shared Nothing (SN)    Separate   Separate        Excellent      Excellent

In Simple Terms
• Shared Nothing = Each node is an island!
It brings its own everything (CPU, memory, disk) and talks to other nodes via messages only.
This keeps the system simple to scale and safe from ripple effects when one node fails.

4.7 CAP Theorem Explained


The CAP theorem states that in a distributed system (a group of computers or nodes that share data), it is impossible to guarantee all three of the following at the same time:
1. Consistency – Every read receives the most recent write or an error.
2. Availability – Every request receives a (non-error) response, without guarantee that it contains the most recent write.
3. Partition Tolerance – The system continues to operate even if messages are lost or delayed between parts of the system.
You can choose only two out of these three in any real distributed system. One must be given up, especially during network failures or partitions.

1.Consistency – Means: All users see the same data at the same time.
2.Availability – Means: The system always responds to requests (it doesn’t hang or fail).
3.Partition Tolerance – Means: The system still works even if the network breaks between some
computers (nodes can’t talk to each other properly).

Real-life Analogy (Training Institute Example)


Suppose you work at a training institute with many instructors. There is an office administrator (Amey) who
keeps schedules. Sometimes another person (Joey) is added to share the job as requests increase.
• If only Amey handles updates: Consistency is easy—all information is in one place.
• Once Joey is added:
• If both don’t frequently synchronize changes, their records may differ. This leads
to inconsistency—because a training update with one admin may not be available with the
other.
• If both try to keep schedules updated but are sometimes out of sync, you get
either Availability or Consistency—but not both, especially if there's a communication
breakdown (network partition).
Example Situation
• You call Amey for your schedule, he says nothing is at 3pm.



• You know the schedule was updated for you at 3pm, but maybe Joey got the update instead.
• When they check, Joey indeed has you scheduled at 3pm—inconsistent views between the two admins.
To fix this, they agree:
Whenever one updates a schedule, both must update their respective files. But if one is on leave, all updates
must be shared via email and applied upon return.
• This tries to provide both Consistency and Availability.
• But if the two admins are not talking (network partition), either updates stop flowing (partition
tolerance, but not available), or they must work independently (and get out of sync—
sacrificing consistency).
Key Lesson from CAP Theorem
• In any distributed setup:
• You can have Consistency and Availability as long as there’s no network partition.
• If a partition occurs (e.g., a network split), you must sacrifice either Consistency (let responses
go out even if they may be outdated) or Availability (refuse to respond until you’re sure of a
consistent view).
Summary:
• Consistency ↔ Availability ↔ Partition tolerance
(Pick any two during a failure, not all three.)
Bottom Line
The CAP Theorem is CRUCIAL for understanding trade-offs in designing distributed databases and systems
(like NoSQL databases, cloud storage, etc.).
• You have to think: “Which property are you willing to sacrifice in case of a network problem—
immediate consistency, high availability, or working through partitions?”

There's a triangle diagram with points labeled A, C, and P and categories of existing systems shown:
• The triangle says: “Pick any two!!” You must sacrifice one.
Positions on Triangle (and example databases)
1. CA (Consistency + Availability, NOT Partition Tolerant)
• Examples: Traditional RDBMS like PostgreSQL, MySQL.
• Explanation: They offer strict consistency and availability as long as there’s no network failure.
If a partition (network split) occurs, they cannot function correctly.
2. CP (Consistency + Partition Tolerance, NOT Always Available)
• Examples: HBase, MongoDB, Redis, MemcacheDB, BigTable.



• Explanation: Always consistent and partition-tolerant. In case of a partition, the system might
become unavailable rather than return incorrect data.
3. AP (Availability + Partition Tolerance, NOT Always Consistent)
• Examples: Riak, Cassandra, CouchDB, Dynamo.
• Explanation: The system remains available and tolerates partitions, but consistency may
sometimes be sacrificed (eventually consistent).
When to Choose Consistency or Availability?
Pick availability over consistency if:
• Your business can tolerate some temporary inconsistency
• It’s fine if data takes a little time to synchronize
• For example, user profile updates that can take a few seconds to show everywhere
Pick consistency over availability if:
• Your business needs immediate, atomic updates
• Data must be 100% correct at all times
• For example, bank account transfers (double withdrawal must not happen!)
Summary Table from the Text
Combo What you get Examples (from diagram)

AP Availability & Partition Tolerance Cassandra, Riak, CouchDB, Dynamo

CP Consistency & Partition Tolerance MongoDB, HBase, Redis, BigTable

CA Consistency & Availability MySQL, PostgreSQL, traditional RDBMS


Key Point
No distributed system can guarantee all three (C, A, P) at once during a network partition.
You must choose the two most important for your use case, and be aware of the trade-offs.
In Simple Terms:
• Want system always available, even if network fails? Must tolerate temporary inconsistency.
• Want perfect consistency and survival in network failures? Must sometimes sacrifice availability.
• Want consistency and always-on responses? Can't survive network splits.

5.Few Top Analytical Tools


5.1 NoSQL (NOT ONLY SQL)

NoSQL databases refer to a class of database management systems distinct from traditional relational (SQL-
based) databases. The term "NoSQL" was first used by Carlo Strozzi in 1998 and later popularized in 2009.
NoSQL databases are designed to handle large volumes of varied data types, especially for modern web and big
data applications.
Key Features of NoSQL Databases
• Open source: Most NoSQL databases are open source, promoting community involvement and
flexibility.
• Non-relational: They don't use the traditional tabular relational data model. Instead, they store data in
formats like key-value pairs, document-oriented, column-oriented, or graph-based structures.
• Distributed: Data is distributed across several nodes in a cluster, allowing high availability and fault
tolerance with commodity hardware.
• Schema-less: NoSQL databases offer flexibility by not enforcing a fixed schema; the structure of data
can be dynamic and change over time.
• Cluster friendly: They are designed to scale out horizontally, which means adding more servers to
handle increased load.
• Born from 21st-century web applications: These databases cater to the demands of modern, big data-
driven, cloud-native applications that require scalability, flexibility, and high availability.
5.1.1 Where is NoSQL Used? (Figure 4.1)
NoSQL databases are widely used in:



• Log analysis: Storing and analyzing system logs.
Every system (like servers, apps, or websites) automatically records what happens in the background—
called logs.
• Social networking feeds: Handling continuous streams of social media posts.
• Time-based data: Data collected over time at regular intervals, such as event logs and sensor readings, which is not efficiently analyzed in traditional RDBMS.
Traditional RDBMS (like MySQL) aren’t efficient for this because the data is huge, arrives fast, and needs time-based queries (like “find all errors in the last 5 minutes” or “temperature trends over a week”).
Specialized time-series databases (InfluxDB, TimescaleDB) handle this better.

5.1.2 What is NoSQL? (Figure 4.2)

NoSQL databases are non-relational databases designed to handle large, unstructured, and fast-changing data
that traditional relational databases (like MySQL, Oracle) struggle with.

Features shown in the diagram:


1. Non-relational data storage systems
o Unlike SQL databases (which use tables, rows, and columns), NoSQL stores data in other
formats:
▪ Key-Value pairs (like a dictionary) → Redis
▪ Documents (like JSON) → MongoDB
▪ Graphs (nodes & edges) → Neo4j
▪ Column-based → Cassandra

2. No fixed table schema


o In SQL, you must define a strict table structure (columns, data types) before storing data.
o In NoSQL, data can be stored flexibly without a predefined schema.
o Example: In MongoDB, one document can have {name, age}, another can have {name, email,
address} — no problem.

3. No joins
o In SQL, data from multiple tables is combined using JOIN queries.
o NoSQL avoids joins to improve speed and scalability.
o Instead, data is often denormalized (kept together in one place), which makes queries faster.

4. No multi-document transactions
o In SQL, you can update multiple rows/tables in a single transaction (all-or-nothing).
o Many NoSQL databases don’t support this fully, or they provide limited transactions to keep
things simple and faster.
o Instead, they often use eventual consistency rather than strict consistency.

5. Relaxes one or more ACID properties


o SQL databases follow ACID (Atomicity, Consistency, Isolation, Durability).
o NoSQL databases often relax these to achieve better scalability and speed, especially in
distributed systems.
o They are designed around the CAP theorem trade-offs (Consistency, Availability, Partition tolerance).
o Example: Cassandra prioritizes availability & partition tolerance over strict consistency, while MongoDB and HBase favor consistency and partition tolerance (as in the CAP triangle above).



5.1.3 Types of NoSQL Database

NoSQL databases are broadly classified into four main types, based on how they store and organize data:
1. Key-Value Stores
• How they work: Data is stored as a big hash table of unique keys and their corresponding values, like a
dictionary.
• Use cases: Fast retrieval of values using a specific key.
• Examples: Dynamo, Redis, Riak, Amazon S3 (Dynamo), Scalaris.
• Sample:

Key Value
First Name Simmonds
Last Name David
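To make the key-value model concrete, here is a minimal sketch using the Redis client for Python (assumptions for illustration only: a Redis server running on localhost and the redis-py package installed; the keys simply mirror the sample above).
python
# Key-value access with Redis (assumes a local Redis server and `pip install redis`).
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store values under unique keys, like entries in a big dictionary
r.set("first_name", "Simmonds")
r.set("last_name", "David")

# Fast retrieval of a value by its key
print(r.get("first_name"))   # -> "Simmonds"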
2. Document Stores
• How they work: Data is stored in documents (usually JSON, BSON, or XML) instead of rows and
columns. Each document can have a different structure.



• Use cases: Applications needing flexible, evolving data structures, like content management systems.
• Examples: MongoDB, CouchDB, Couchbase, MarkLogic.
• Sample:
json
{
"Book Name": "Fundamentals of Business Analytics",
"Publisher": "Wiley India",
"Year of Publication": "2011"
}
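A minimal sketch of the document model using MongoDB's Python driver (assumptions for illustration: a local MongoDB instance, the pymongo package, and made-up database/collection names). Note how the two documents have different fields, which also shows the schema-less property described earlier.
python
# Document store example with MongoDB (assumes `pip install pymongo` and a local server).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
books = client["library"]["books"]   # database and collection names are illustrative

# Two documents with different structures can live in the same collection
books.insert_one({"Book Name": "Fundamentals of Business Analytics",
                  "Publisher": "Wiley India",
                  "Year of Publication": "2011"})
books.insert_one({"Book Name": "Another Title", "Format": "eBook"})

# Query by any field, with no joins needed
print(books.find_one({"Publisher": "Wiley India"}))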
3. Column Stores (Column-Family Databases)
• How they work: Data is stored in columns instead of rows. Each storage block contains data from only
one column.
• Use cases: Good for analytical queries over large datasets.
• Examples: Cassandra, HBase.
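A small illustrative sketch in plain Python (no real column store involved) of the difference between row-oriented and column-oriented layouts: an analytical query such as a total over one column only has to touch that column's block.
python
# Row-oriented vs column-oriented layout of the same three records (illustration only).

# Row store: each record is kept together (good for fetching whole rows)
rows = [
    {"id": 1, "city": "Mysore", "sales": 120},
    {"id": 2, "city": "Hassan", "sales": 340},
    {"id": 3, "city": "Mandya", "sales": 210},
]

# Column store: each column is kept together (good for analytics over one column)
columns = {
    "id":    [1, 2, 3],
    "city":  ["Mysore", "Hassan", "Mandya"],
    "sales": [120, 340, 210],
}

# Analytical query: total sales only needs to read the "sales" block
print(sum(columns["sales"]))           # 670
# The row layout would force a scan over every full record instead
print(sum(r["sales"] for r in rows))   # 670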
4. Graph Databases
• How they work: Data is stored as nodes (entities) and edges (relationships), creating a network or
graph structure.
• Use cases: Perfect for scenarios with complex relationships, like social networks, fraud detection,
recommendation engines.
• Examples: Neo4j, HyperGraphDB.
• Sample:
Imagine three nodes:
• John (ID: 1001, Age: 28)
• Joe (ID: 1002, Age: 32)
• Group (ID: 1003, Name: AAA)
And the relationships:
• John knows Joe since 2002
• John and Joe are both members of Group since 2002/2003

What the diagram shows:

• Circles (Nodes): Each circle is an entity (a person, group, or object).
o Node 1: John (ID: 1001, Age: 28)
o Node 2: Joe (ID: 1002, Age: 32)
o Node 3: Group (ID: 1003, Name: AAA)
• Arrows (Edges): Each arrow represents a relationship between nodes. Relationships also have labels (extra information).
o John knows Joe since 2002
o Joe knows John since 2002
o John is a member of Group AAA since 2003
o Joe is a member of Group AAA since 2002

In Simple Terms:
• Key-Value: Like a dictionary.
• Document: Like a folder with documents of different formats.
• Column: Like a spreadsheet where each column is stored separately.
• Graph: Like a network diagram showing connections between entities.
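Returning to the graph sample above (John, Joe, and Group AAA), here is a minimal Python sketch using the networkx library (a general-purpose graph library used purely for illustration; it is not Neo4j itself).
python
# Building the John / Joe / Group graph with networkx (`pip install networkx`).
import networkx as nx

g = nx.DiGraph()   # directed graph: edges have a direction

# Nodes (entities) with properties
g.add_node(1001, label="Person", name="John", age=28)
g.add_node(1002, label="Person", name="Joe", age=32)
g.add_node(1003, label="Group", name="AAA")

# Edges (relationships) with properties
g.add_edge(1001, 1002, rel="knows", since=2002)
g.add_edge(1002, 1001, rel="knows", since=2002)
g.add_edge(1001, 1003, rel="member_of", since=2003)
g.add_edge(1002, 1003, rel="member_of", since=2002)

# Traverse relationships: whom is John connected to, and since when?
for _, target, data in g.out_edges(1001, data=True):
    print(g.nodes[target]["name"], data["rel"], "since", data["since"])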
5.1.4 Why NoSQL
This section explains the main reasons for using NoSQL databases instead of traditional relational (SQL) ones.
1. Scalability (Scale-Out Architecture)
• What it means:
SQL databases usually scale up → buy a bigger, more powerful server.
NoSQL scales out → just add more normal servers, and the database spreads across them.
• Why it matters:
Easier and cheaper to handle millions of users and huge data.
• Example:
Facebook or Amazon can’t run on a single giant server—they use thousands of smaller servers with
NoSQL.

2. Handling Large and Varied Data


• What it means:
SQL is good only for structured data (tables, rows, columns).
NoSQL can handle all kinds of data:
o Structured (tables)
o Semi-structured (JSON, XML)
o Unstructured (images, videos, social media posts)
• Why it matters:
Modern apps (social media, IoT, e-commerce) deal with all types of data, not just tables.
• Example:
YouTube stores videos, comments, likes, metadata → perfect for NoSQL.

3. Dynamic Schema
• What it means:
In SQL, you must define a fixed schema (columns, types) before adding data.
In NoSQL, you can add data without a fixed structure.
• Why it matters:
Flexible → developers can update the app without changing the whole database.
• Example:
In MongoDB:
• { "name": "John", "age": 25 }
and later:
{ "name": "John", "email": "[email protected]" }
→ Both are valid, no need to redesign.

4. Auto-sharding
• What it means:
NoSQL automatically splits data across multiple servers (shards).
• Why it matters:
o Balances load (no single server is overloaded)
o Easy to replace or add servers
o Applications don’t need to worry about “where” the data is stored



• Example:
A social app with millions of users → some users’ data is stored on Server A, some on Server B, etc.
The app still sees it as one database.
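A minimal sketch of the routing idea behind auto-sharding: a hash of the key decides which shard (server) holds a record, so the application can treat the cluster as one database. Real systems also handle rebalancing and replication; the shard names below are hypothetical.
python
# Hash-based shard routing (illustration of the idea, not a real database driver).
import hashlib

SHARDS = ["server-A", "server-B", "server-C"]   # hypothetical shard names

def shard_for(user_id: str) -> str:
    # md5 gives a stable hash across runs (Python's built-in hash() is randomized)
    digest = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

for uid in ["alice", "bob", "carol", "dave"]:
    print(uid, "->", shard_for(uid))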

5. Replication
• What it means:
Data is copied to multiple servers.
• Why it matters:
o If one server crashes, another takes over (High Availability).
o Ensures fault tolerance and disaster recovery.
• Example:
In Amazon or Netflix, even if one data center goes down, your account and history are still available
from another copy.

In short:
• Scalable → easily grow with users.
• Handles all kinds of data → structured, semi-structured, unstructured.
• Flexible schema → no rigid tables.
• Auto-sharding → data spread automatically.
• Replication → always available, fault-tolerant

5.1.5 Advantages of NoSQL

NoSQL databases offer several important benefits over traditional relational databases, especially for modern, large-scale, and distributed applications. Here’s what the book covers:
1. Easy Scalability (Scale Up & Down)
• NoSQL databases can easily expand (scale out) or shrink as needed.
• This makes handling increased load or changing needs simple—just add or remove servers (nodes).
• Supports "cloud scaling": can grow with your needs.
2. No Pre-defined Schema Needed
• You don’t need to define your data structure ahead of time.
• NoSQL databases are "schema-less", allowing you to store data with any fields and structure.
• You can add new fields at any time—great for fast-evolving projects.
3. Cheap, Easy to Implement
• Deploying NoSQL is generally more cost-effective.
• Uses commodity (regular, inexpensive) hardware.
• Built to be fault-tolerant and highly available, which lowers operational costs.
4. Easy to Distribute
• Data can be spread across servers automatically (sharding).
• Most NoSQL databases support built-in distribution and automatic failover.
• This makes them well-suited for large, geographically distributed applications.
5. Data Replication & Partitioning
• Data can be replicated (copied) to multiple nodes for safety.
• Supports partitioning (splitting data across clusters) for managing very large datasets.
• This means higher fault tolerance and availability.
6. Relaxes Data Consistency Requirement
• NoSQL may favor availability and partition tolerance over strict consistency (per the CAP
theorem).
• Most go for "eventual consistency," which is often okay for web-scale apps (e.g., social media
timelines).
Diagram Reference (Figure 4.4):
The diagram summarizes these points visually:



Here are the advantages of NoSQL:
• Cheap and easy to start: It doesn’t cost much and is simple to use.
• Easy to share data: You can spread the data out over many computers easily.
• Can grow or shrink easily: You can add more storage or remove it without problems.
• Not strict about matching data: It lets you store data even if it's not exactly the same every time.
• No need to set up a plan for the data at the start: You don’t have to design a strict structure for your data before using it.
• Can copy data and split it up: You can have copies of your data in many places and break it into parts to make it faster and safer.

Table Reference: see Table 4.1 in the textbook.

In Simple Terms:
NoSQL databases are flexible, easy to grow, and ideal for today’s big, distributed data needs. They don’t
force fixed data structures, spread data for performance and safety, and are built with modern
applications in mind.

5.1.6 What We Miss With NoSQL?


• Joins: NoSQL can’t easily connect data from different places like traditional databases can.
• Group by: It’s harder to get summaries (like totals or averages) from the data.
• ACID properties: NoSQL doesn’t always guarantee that transactions are atomic, consistent, isolated, and durable, unlike traditional databases that enforce these properties.
• SQL: You can't use the familiar SQL language to work with the data.
• Easy integration: It’s not as simple to connect NoSQL with other tools or software that expect
traditional databases.



5.1.7 Use of NoSQL in industry

NoSQL databases are widely used in many modern industries due to their flexibility, scalability, and ability to
handle large amounts of varied data. Here’s what the images and paragraph convey:

• Key–Value Pairs:
Used for things like shopping carts and tracking what users do on websites. Big companies like Amazon
and LinkedIn use it for fast lookups and simple data storage.
• Column-oriented:
Helps analyze a lot of web user actions and data coming from sensors. Companies like Facebook,
Twitter, eBay, and Netflix use this to handle and quickly search through massive amounts of activity
data.
• Document-based:
Good for real-time analytics, keeping logs, and managing lots of stored documents. Useful for searching
and updating lots of text-based information as it changes.
• Graph-based:
Used for connecting things—like building social networks, recommending products, and figuring out
relationships between different items (such as in Walmart for upselling or cross-selling).



5.1.8 NoSQL vendors.

Table 4.2: Few Popular NoSQL Vendors


This table gives an overview of some of the world’s leading NoSQL database products, their creating
companies, and who uses them.
Company      Product      Most Widely Used by
Amazon       DynamoDB     LinkedIn, Mozilla
Facebook     Cassandra    Netflix, Twitter, eBay
Google       BigTable     Adobe Photoshop


What does this table show?

• Real-world trust: The world’s biggest tech companies build and run their essential processes on NoSQL databases from these vendors.
• Scalability & reliability: NoSQL databases are proven to work at massive scale, supporting millions (or even billions) of users and vast amounts of data.

5.1.9 SQL versus NoSQL

SQL and NoSQL systems differ mainly in data model, schema rigidity, scalability approach, and consistency guarantees; the comparison table under 5.2.2 below summarizes these differences alongside NewSQL.



5.2.0 What is NewSQL
NewSQL is a modern type of Relational Database Management System (RDBMS) designed to provide the best
features of both traditional SQL databases and NoSQL systems. It offers the scalability and performance found
in NoSQL databases—ideal for large-scale, online transaction processing (OLTP)—while still maintaining
the ACID guarantees (Atomicity, Consistency, Isolation, Durability) of traditional databases and supporting
SQL as the primary query interface

5.2.1 Characteristic of NewSQL

• Uses SQL language: You can talk to it using the familiar SQL commands.
• Keeps data safe (ACID support): Makes sure all transactions are safe, reliable, and correct.
• Fast performance: Designed to be quicker than old-school databases, especially on each computer.
• Grows easily (scalable): Can add more computers or servers as needed, with each working separately.
• No waiting for writes (non-locking): Reading and writing data happens at the same time without
causing problems—real-time reads won't get slowed down by writes.

5.2.2 Comparison: SQL vs. NoSQL vs. NewSQL

Feature                    SQL                   NoSQL                    NewSQL
Adherence to ACID          Yes                   No                       Yes
OLTP/OLAP                  Yes                   No                       Yes
Schema Rigidity            Yes                   No                       Maybe
Adherence to Data Model    Relational            No                       Maybe
Data Format Flexibility    No                    Yes                      Maybe
Scalability                Scale up, Vertical    Scale out, Horizontal    Scale out
Distributed Computing      Yes                   Yes                      Yes
Community Support          Huge                  Growing                  Slowly growing

6.HADOOP

What is Hadoop?
• Open-source project from the Apache Foundation.
• Written in Java.
• Created by Doug Cutting in 2005.
• Name "Hadoop" was inspired by his son's toy elephant.
• Initially developed to support a search engine called "Nutch."
• He worked at Yahoo during its development.
Why was Hadoop created?
Hadoop was built to solve problems of storing and processing extremely large data sets in
a distributed and fault-tolerant way, which was required for search engines and large internet companies.
Core Technologies used/inspired
Hadoop was inspired by:
• Google MapReduce: A programming model for processing large data sets with a parallel, distributed
algorithm.
• Google File System (GFS): A way of storing large files across machines.
Hadoop implemented its own versions:
• Hadoop Distributed File System (HDFS): Distributed storage.
• MapReduce: Distributed data processing.
Usage
Today, Hadoop is a key part of the backend infrastructure for major companies (Yahoo, Facebook, LinkedIn,
Twitter, etc.) because it allows them to:
• Store massive amounts of data across many computers
• Process that data efficiently and reliably

Figure 4.8
• At the top: Hadoop (an Apache open-source framework)



• Inspired by Google MapReduce and Google File System.
• Below: Hadoop Distributed File System and MapReduce form the core components.
In summary:
Hadoop is a powerful framework that allows big organizations to store and process huge amounts of
data reliably using clusters of ordinary computers, inspired by Google's own solutions.

6.1 Features of Hadoop


1. Handles Massive Data (Cost-Effectively):
• Hadoop is designed to work with huge amounts of data—structured, semi-structured, or
unstructured—using regular, inexpensive computers (commodity hardware).
2. Shared-Nothing Architecture:
• Each computer in the Hadoop cluster works independently and doesn’t share storage or memory
with others. This makes the system scalable and reliable.
3. Data Replication and Fault Tolerance:
• When Hadoop stores data, it keeps copies (replicas) on multiple machines. If one machine fails,
Hadoop can still access the data from another machine, ensuring reliability and preventing data
loss.
4. High Throughput, Not Low Latency:
• Hadoop is designed for efficient processing of large amounts of data in “batches.” It focuses on
getting lots of work done (high throughput), rather than responding instantly (low latency). So,
it’s great for big jobs, but not for tasks that need a quick response.
5. Complements OLTP & OLAP:
• Hadoop can be used alongside traditional databases for transaction processing (OLTP) and
analytical processing (OLAP), but it does not replace a regular relational database system.
6. Not Good When Data Can’t Be Parallelized:
• Hadoop works best when a task can be split into smaller parts to be processed simultaneously. If
there are lots of dependencies between parts of the job, Hadoop isn’t a good fit.
7. Not Suitable for Small Jobs:
• Hadoop isn’t efficient for small files or small data processing tasks. It shines with huge files and
very big datasets.
In Short
Hadoop is a cost-effective, scalable, and fault-tolerant system for processing massive data in batches. It’s best
when jobs can be parallelized, but isn’t the right tool for quick responses, single-machine tasks, or small
datasets.
6.2 Key Advantages of Hadoop
1. Stores Data in Its Native Format
• What it means: Hadoop (specifically, its storage layer HDFS) can keep data exactly as it comes in,
without needing to convert or fit it into a pre-defined structure (schema).
• Why it’s good: You don’t lose any information during storage, and you can process the data into
structured form only when you actually need it.
2. Scalability
• What it means: Hadoop can handle and process very large amounts of data, scaling up to thousands of
terabytes (petabytes) across many machines.
• Why it’s good: Companies like Facebook and Yahoo have successfully used Hadoop to manage truly
massive datasets, proving its ability to scale.
3. Cost-effective
• What it means: Hadoop runs on regular, cheap (commodity) hardware, so storing and processing huge
amounts of data is much less expensive per terabyte compared to traditional systems.
• Why it’s good: You can handle “big data” without huge investment in specialized, expensive servers.
4. Resilient to Failure (Fault Tolerant)
• What it means: Hadoop automatically makes copies (replicas) of your data and stores them on different
machines in the cluster.
• Why it’s good: If one computer fails, another still has the data—so you don’t lose anything and the
system keeps working.

Prof. Babitha P K 46 CIT, Ponnampet


5. Flexibility
• What it means: Hadoop can store and process any kind of data—structured (like databases), semi-
structured (like logs), or unstructured (like text or images).
• Why it’s good: Useful for many different tasks such as:
• Log analysis
• Market analysis
• Recommendation systems
• Social network data mining

6. Fast
• What it means: Hadoop is designed to process data where it lives (moves code to the data), instead of
moving large datasets around the network.
• Why it’s good: This approach is much faster than older systems that required moving data to a central
processing location.
7. No Information Loss, No Forced Translation
• What it means: Because you store data as-is, you don’t have to worry about losing information from
converting or squeezing the data into a specific format.
• Why it’s good: You keep all your raw data for future processing.
8. Hardware Agility
• What it means: You can add or remove hardware (servers) from your Hadoop cluster without a major
disruption.
• Why it’s good: Makes it easier to grow over time or replace parts when hardware fails.
In summary:
Hadoop is powerful because it is flexible, scalable, fault-tolerant, fast, cost-effective, and can handle all types
of data as-is—making it ideal for the modern “big data” world.

6.3 Versions of Hadoop

There are two main versions:
1. Hadoop 1.0



2. Hadoop 2.0
Hadoop 1.0 Architecture
Hadoop 1.0 had two primary components:
1. HDFS (Hadoop Distributed File System):
• Stores data across multiple machines in a redundant and reliable way.
• It is schema-less, simply storing files as-is ("native format").
• Provides fault tolerance—if a machine fails, data is still safe and accessible.
2. MapReduce:
• A programming model for processing large data sets in parallel.
• Consists of two main steps:
• Mappers: process input data and generate intermediate key-value pairs.
• Reducers: process intermediate data to generate final results.
• Handles data processing and also acts as the cluster resource manager (controlling how
computation resources are allocated).

Diagram of Hadoop 1.0 (from the image)


• Two layers:
• HDFS (data storage, redundant, reliable)
• MapReduce (cluster resource manager and data processing)
Limitations of Hadoop 1.0
1. Requires MapReduce Expertise (Java):
• You must know MapReduce (and often Java) to process data.
2. Only Batch Processing:
• Only supports batch jobs, not real-time or other types of processing (like streaming or graph
processing).
• Good for large-scale tasks such as log analysis but unsuitable for interactive or low-latency jobs.
3. Tightly Coupled Storage and Processing:
• Data processing (MapReduce) is tightly connected to storage (HDFS).
• Other data management and processing tools cannot easily integrate.
• If you need different types of processing, you must either re-implement them using MapReduce or
move your data out of Hadoop—both inefficient and problematic.
What Changed in Hadoop 2.0?
Hadoop 2.0 introduced a key new component—YARN (Yet Another Resource Negotiator):
• YARN becomes the resource manager, decoupling resource management from MapReduce.
• MapReduce becomes just one of many possible data processing engines; others can be plugged in (like
Spark, machine learning, graph processing, etc.).
• HDFS remains the core storage system.

Hadoop 2.0
Hadoop 2.0: Key Points
1. HDFS Continues as Storage



• Hadoop 2.0 still uses HDFS (Hadoop Distributed File System) for storing data reliably across
clusters.
2. YARN: The Big New Addition
• YARN stands for Yet Another Resource Negotiator.
• It is a resource management framework, added to improve Hadoop’s flexibility and scalability.
3. How YARN Works
• YARN allows applications to be split into parallel tasks and manages the allocation of resources
(CPU, memory, etc.) for these tasks.
• Instead of just a single kind of job (MapReduce), YARN can support many different job types by
providing:
• ApplicationMaster: Manages each application’s execution (like job scheduling and
monitoring).
• NodeManager: Handles resource management and task execution on individual
machines.
• This design replaces the older JobTracker and TaskTracker components from Hadoop 1.0.
4. No Longer Just MapReduce
• YARN is not limited to MapReduce!
• The ApplicationMaster can run many kinds of applications (e.g., Spark, streaming,
graph processing).
• This means you don’t need MapReduce programming skills anymore just to run jobs on Hadoop.
5. Batch and Real-Time Processing
• Hadoop 2.0 (via YARN) can support both traditional batch processing and real-time data
processing.
• MapReduce becomes only one of many possible ways to process data.
6. Better Data Management in HDFS
• Alternative data processing and data management tasks (like data standardization or master data
management) can now be done natively within HDFS, thanks to the flexibility YARN provides.
In Simple Words
• Hadoop 1.0: Only supported batch jobs with MapReduce.
• Hadoop 2.0: Introduces YARN, which allows multiple processing frameworks (not just MapReduce) to
run on the same Hadoop cluster, making Hadoop much more powerful, flexible, and easier to use for all
kinds of big data jobs.
Bottom Line:
Hadoop 2.0 separates resource management (YARN) from data processing (MapReduce and others), allowing
you to run a wide range of applications efficiently on your Hadoop cluster—not just MapReduce batch jobs!

Diagram of Hadoop 2.0 (from the image):


• HDFS at the bottom (storage)
• YARN in the middle (resource management)
• At the top: multiple processing engines (MapReduce, Others)
SUMMARY TABLE
                    Hadoop 1.0                 Hadoop 2.0
Storage             HDFS                       HDFS
Processing          MapReduce only (batch)     MapReduce + Others (Spark, etc.)
Resource Manager    MapReduce itself           YARN (general purpose)
Flexibility         Low (tightly coupled)      High (supports pluggable frameworks)


In Simple Words
• Hadoop 1.0:
You store files in HDFS and process them with MapReduce. Everything is tightly coupled, and you
need to use MapReduce for any processing.
• Hadoop 2.0:
You still store files in HDFS, but now you can use different processing frameworks (not just
MapReduce) thanks to YARN, which manages resources for all jobs.
Bottom line:
Hadoop 2.0 is much more flexible and powerful. It overcomes the main limitations of Hadoop 1.0 by allowing
more processing models, better scalability, and more efficient resource use.
6.4 Hadoop Ecosystem
Major Steps in Hadoop Data Flow:
1. Data Ingestion: Getting data into Hadoop from external sources (databases, logs, etc.)
2. Data Processing: Parsing, cleaning, transforming, or crunching that data.
3. Data Analysis: Running queries/analytics to extract insights.
4. Data Storage & Management: Core file system and indexing.
Main Components :
1. Data Ingestion
Getting data into Hadoop
• Sqoop:
Moves relational data (MySQL, Oracle, DB2, etc.) in and out of Hadoop (HDFS, HBase, Hive)
• Import: Data from RDBMS → Hadoop (HDFS, HBase, Hive)
• Export: Data from Hadoop → RDBMS
• Connector-based architecture: Supports many different systems.
• Flume/Chukwa:
Collects and aggregates large amounts of log data from many sources into HDFS
2. Data Processing
How Hadoop processes/analyzes data.
• MapReduce:
Distributed computation framework (crunches data in parallel across the cluster)
• Spark:
Not shown in this specific image but mentioned in text. Faster, general-purpose computation engine for
both batch and real-time analytics.
3. Data Analysis
Querying and analyzing processed data.
• Pig:
High-level scripting for data flows (good for ETL, data transformation). Uses a language called Pig
Latin.
• Hive:
Data warehouse system on Hadoop. Lets you query big data sets using SQL-like syntax (HiveQL). Good
for analysts.
• Impala:
Not shown in that image, but mentioned in the text. Supports fast, interactive SQL queries on Hadoop
data.
• R:
Statistical analysis language, integrated in Hadoop for analytics.
4. Data Storage & Table Store
• HDFS (Hadoop Distributed File System):
• Core storage layer—stores all data in a distributed, redundant way.
• HBase:
• NoSQL, column-oriented database built on top of HDFS (for random, real-time read/write)
• Used for tasks needing quick access to individual records, as opposed to scanning full
tables/files.
5. Cluster Coordination, Management, and Scheduling



• YARN: Not shown in the diagram, but newer versions of Hadoop use YARN for cluster resource management (see the Hadoop 2.0 section above).
• Zookeeper:
Coordination and synchronization for distributed applications (helps manage configuration, leader
election, etc.)
• Oozie:
Workflow scheduler—a way to automate sequences of Hadoop jobs (like MapReduce, Hive, Sqoop, etc.)
• Ambari:
Provisioning, managing, and monitoring Hadoop clusters (user interface and tools for administrators).
Key Points (based on the difference between HDFS and HBase section):
• HDFS:
• Storage for large files, designed for batch processing (high throughput, high latency).
• Great for storing lots of raw/unstructured data.
• HBase:
• Database on HDFS for real-time, random reads/writes (low latency). Good for use cases where
you need to look up or update data quickly.
• HBase is good for “real-time analytics” whereas HDFS is for persistent, high-throughput storage.
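To illustrate the HBase side of this difference, here is a hedged sketch of random reads and writes from Python using the happybase client (assumptions: an HBase Thrift server on localhost and an existing table named "users" with a column family "info"; all names are placeholders).
python
# Random, real-time read/write against HBase via happybase (`pip install happybase`).
import happybase

connection = happybase.Connection("localhost")   # connects to the HBase Thrift service
table = connection.table("users")                # assumes this table already exists

# Write a single row: row key plus column_family:qualifier -> value
table.put(b"user-1001", {b"info:name": b"John", b"info:age": b"28"})

# Read one row back by key (a low-latency lookup, unlike an HDFS batch scan)
print(table.row(b"user-1001"))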
In Simple Language
• Hadoop ecosystem is made up of many tools—each specializing in some part of the big data workflow.
• Data can come in from databases/log files (Sqoop, Flume), be stored in HDFS, processed
with MapReduce or Spark, queried/analyzed with Hive, Pig, or Impala, managed and scheduled
with Oozie, Ambari, and made fault-tolerant and coordinated with Zookeeper.
• HDFS stores the data, HBase provides fast database-like access, and various tools let you move,
process, and query the data as needed.

6.5 Hadoop Ecosystem – Data Processing Components


MapReduce
• What it is: A programming model for processing big data in a distributed and parallel way.
• How it works: Data flows through two main phases:
• Map phase: Converts input data into key-value pairs.
• Reduce phase: Aggregates and summarizes the mapped data to produce final results.
• Stores results back into HDFS.
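A minimal word-count sketch of the two phases in Python. On a real cluster the mapper and reducer would run as separate programs (for example via Hadoop Streaming); here a tiny local driver wires them together so the data flow is visible.
python
# Word count in the MapReduce style (local illustration of the programming model).
from collections import defaultdict

def mapper(line):
    # Map phase: emit (key, value) pairs -> (word, 1)
    for word in line.strip().lower().split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: aggregate all values for one key
    return word, sum(counts)

lines = ["big data is big", "hadoop processes big data"]

# Shuffle/sort step: group intermediate values by key
grouped = defaultdict(list)
for line in lines:
    for word, one in mapper(line):
        grouped[word].append(one)

results = [reducer(word, counts) for word, counts in grouped.items()]
print(results)   # e.g. [('big', 3), ('data', 2), ('is', 1), ('hadoop', 1), ('processes', 1)]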

Spark
• What it is: A flexible, in-memory big data processing framework (faster than MapReduce for many
tasks).
• Strengths:
• Executes most workloads in memory (not on disk), making it 10–100x faster.
• Can fall back to disk when memory is insufficient.



• Supports multiple languages (Scala, Java, Python, R).
• Can run on top of Hadoop/YARN or standalone.
• Spark Libraries:
• Spark SQL: SQL querying of large, distributed datasets.
• Spark Streaming: Real-time data analysis.
• MLlib: Distributed machine learning.
• GraphX: Graph and network computation.
• Why use Spark? For advanced analytics, machine learning, streaming, and faster performance
compared to classic MapReduce.
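A small PySpark sketch of the DataFrame and Spark SQL styles mentioned above (assumptions for illustration: the pyspark package is installed; the job can run locally or on YARN, and the tiny in-memory dataset stands in for data that would normally be read from HDFS).
python
# Basic Spark SQL / DataFrame example (assumes `pip install pyspark`).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

# Small in-memory dataset; in practice this would come from HDFS, e.g. spark.read.csv(...)
df = spark.createDataFrame(
    [("Mysore", 120), ("Hassan", 340), ("Mysore", 90)],
    ["city", "sales"],
)

df.groupBy("city").sum("sales").show()   # DataFrame API

df.createOrReplaceTempView("sales")      # query the same data with SQL
spark.sql("SELECT city, SUM(sales) AS total FROM sales GROUP BY city").show()

spark.stop()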
2. Hadoop Ecosystem – Data Analysis Components
Pig
• What it is: High-level scripting platform, alternative to writing MapReduce code.
• Pig Latin: Scripting language that translates your work to MapReduce jobs. Easy for ETL (extract,
transform, load) tasks and analyzing big datasets.
• Pig Runtime: Environment to run Pig scripts.
Hive
• What it is: Data warehouse on Hadoop for big data analysis, similar to SQL databases.
• HiveQL: SQL-like query language.
• What Hive does: Converts SQL-like queries into MapReduce jobs underneath.
• Best for: People familiar with SQL who want to analyze large datasets in Hadoop but don’t want to
write MapReduce code.
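A hedged sketch of querying Hive from Python with the PyHive library (assumptions: a running HiveServer2 endpoint and an existing table; the host, port, username, and table name are placeholders, not from the textbook).
python
# Running a HiveQL query from Python via PyHive (`pip install pyhive`).
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="hadoop")  # placeholder details
cursor = conn.cursor()

# HiveQL looks like SQL, but Hive turns the query into batch jobs on the cluster
cursor.execute("SELECT city, COUNT(*) FROM web_logs GROUP BY city")
for row in cursor.fetchall():
    print(row)

conn.close()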
3. Hive vs Traditional RDBMS – Key Differences
1. Schema-on-Read vs Schema-on-Write
• Hive: Checks/enforces table schema only when you read/query the data (schema-on-read),
making data loading very fast.
• RDBMS: Checks schema when loading/inserting (schema-on-write), ensuring only “correct” data
is stored.
2. Usage Pattern
• Hive: Designed for "write once, read many" (append-only, batch processing).
• RDBMS: Designed for frequent read and write (OLTP).
3. Workload Type
• Hive: Closer to OLAP (analytical/batch queries), not good for OLTP (transactions).
• RDBMS: Good for day-to-day transactions, data updates, deletions, insertions.
4. Real-time vs Batch
• Hive: Not for real-time; it handles static (non-changing) data and analysis.
• RDBMS: Good for real-time, dynamic data.
5. Scalability and Ownership
• Hive: Highly scalable at low cost, uses HDFS to store data (does not “own” it).
• RDBMS: Owns and manages its data; scaling can be more complex/costly.
6. Processing Paradigm
• Hive: Parallel processing (jobs run in parallel over clusters).
• RDBMS: Typically serial (one operation at a time per table/record).
In Simple Words:
• MapReduce, Spark: For processing and transforming big data.
• Pig, Hive: For analyzing/querying big data (without needing to write complex MapReduce code).
• Hive vs RDBMS: Hive is optimized for huge, append-only datasets and batch analytics, not for everyday transactional operations. RDBMS is the opposite: great for transactions, less suited for huge-scale analytics.

1. Hive (on Hadoop) vs RDBMS: Table Summary


a) Data Variety
• Hadoop/Hive: Handles structured, semi-structured, and unstructured data (supports XML, JSON, flat
files, etc.)



• RDBMS: Only supports structured data in predefined schemas (tables, rows, columns).
b) Data Storage
• Hadoop/Hive: Designed for extremely large datasets—terabytes to petabytes.
• RDBMS: Suitable for sizes up to a few gigabytes or maybe terabytes.
c) Querying Language
• Hadoop/Hive: Uses HiveQL (similar to SQL).
• RDBMS: Uses standard SQL.
d) Query Response/Speed
• Hadoop/Hive: Query execution is slow (batch processing, higher latency). Good for analytics, not real-
time responses.
• RDBMS: Very fast, immediate query responses are possible (thanks to indexing).
e) Schema Management
• Hadoop/Hive: Schema is enforced at read time (schema-on-read), so loading data is easy and flexible.
• RDBMS: Schema is enforced at write time (schema-on-write)—data must fit the schema before being
inserted.
f) Read/Write Optimization
• Hadoop/Hive: Optimized for writing once and reading many times. Not good for random, frequent
updates.
• RDBMS: Optimized for both frequent reads and writes (great for transactional work).
g) Cost
• Hadoop/Hive: Open-source, uses regular commodity servers—very cost-effective for big data.
• RDBMS: Often proprietary, expensive software and high-end hardware for performance.
h) Use Cases
• Hadoop/Hive: Analytics, data discovery, processing of massive datasets.
• RDBMS: Online Transaction Processing (OLTP): day-to-day business transactions with real-time
requirements.
i) Scalability & Throughput
• Hadoop/Hive: Scales horizontally (add more servers, called nodes).
• RDBMS: Scales vertically (add more power to a single machine).
j) Integrity
• Hadoop/Hive: Lower data integrity; no ACID enforcement.
• RDBMS: Strong integrity, follows full ACID properties (Atomicity, Consistency, Isolation, Durability).
2. When to Use Each?
• Hive/Hadoop: When you have huge amounts of data (size of web logs, sensor data, social media, etc.),
don’t need real-time query speed, and often want to analyze/aggregate historical data.
• RDBMS: When you need fast transactional updates, strong consistency, and your data fits well into
tables.
3. Other Hadoop Ecosystem Components (Quick Definitions)
• Hive: SQL-like interface for Hadoop (big-data warehouse; converts queries to MapReduce).
• HBase: NoSQL database for fast random reads/writes on HDFS; best for real-time needs.
• Impala: Provides very fast SQL queries (interactive analytics) on Hadoop.
• ZooKeeper: Helps coordinate tasks and resources in a distributed (multi-server) environment.
• Oozie: Job scheduler; automates running Hadoop jobs in sequence.
• Mahout: Machine learning library for scalable data mining over Hadoop.
• Chukwa: System for collecting and managing huge log files.
• Ambari: Web-based management tool for Hadoop clusters.
4. Difference Between Hive and HBase
• Hive: For batch analytics, SQL-like queries, not real-time.
• HBase: For real-time lookups, insertions, updates (like a NoSQL database on HDFS).
5. Summary Table (Key Differences)

Feature           Hadoop/Hive                                RDBMS
Data Structure    Any format (structured + unstructured)     Structured only
Best For          Batch analytics, large-scale processing    Fast, reliable transactions
Query Language    HiveQL (like SQL)                          Standard SQL
Scalability       Horizontal (add more cheap servers)        Vertical (more powerful servers)
Integrity/ACID    Low                                        High (full ACID)
Use Case          Analytics, discovery, big data             Online transactions, business data
IN SIMPLE WORDS:
• Use Hadoop/Hive when you have mountains of data, mostly want to run big summary reports, and don’t
care about instant answers or complex real-time updates.
• Use RDBMS when you have daily business data, need strong guarantees, and require updates and
queries in real time.

Hadoop distributions

What is a "Hadoop Distribution"?


• Just as there are different versions of Linux by different companies (like Ubuntu, Red Hat, etc.), there are various companies and organizations that package Hadoop along with compatible tools, support, and enhancements. These are called Hadoop distributions.
Major Hadoop Distributions (from the image):
• Intel distribution for Apache Hadoop
• Hortonworks
• Cloudera’s distribution (includes CDH)
• EMC Greenplum HD
• IBM InfoSphere BigInsights
• MapR M5 Edition
• Microsoft Big Data Solution
In simple words: Each of these vendors offers its own flavor of Hadoop, sometimes tailored for better performance, easier management, and extra features or support.
2. Hadoop vs SQL (Comparison Table)
Feature              Hadoop                          SQL (relational databases)
Scaling              Scale out (add more servers)    Scale up (bigger/faster server)
Data Model           Key-value pairs                 Relational tables
Programming Model    Functional programming          Declarative queries (SQL)
Processing Type      Offline batch                   Online transaction processing


• Hadoop “scales out” by simply adding more cheap servers; SQL databases usually “scale up” by
making one server more powerful.
• Data is stored as key-value pairs in Hadoop, vs. strictly defined tables in SQL.
• Functional programming is used to write transformations in Hadoop (e.g., MapReduce); SQL uses
straightforward queries.
• Hadoop processes huge amounts of data in batches; SQL systems are designed for fast, real-time
transactions.
3. Integrated Hadoop Systems (by Vendors)

• Hadoop is an open-source framework, but companies often need ready-made solutions (hardware + software combined) instead of building everything from scratch.
• So, big IT companies created their own integrated packages of Hadoop + additional tools to make it easier for businesses to use.
• EMC Greenplum → A big data platform from EMC Corporation that combines Hadoop with their Greenplum database for handling big data.
• Oracle Big Data Appliance → Oracle’s pre-built hardware + software for big data processing using Hadoop.
• Microsoft Big Data Solution → Microsoft’s version of Hadoop integrated with their tools (like SQL Server, Azure).
• IBM InfoSphere → IBM’s big data platform, combining Hadoop with IBM’s analytics tools.
• HP Big Data Solutions → Hewlett-Packard’s solution combining Hadoop with their servers and storage.



4. Cloud-based Hadoop Solutions

Instead of setting up your own servers, you can run Hadoop using cloud providers.
• Amazon Web Services (AWS): Provides scalable, on-demand Hadoop clusters (Elastic MapReduce, EMR).
• Google BigQuery and Cloud Storage for Hadoop: Lets you process data with Hadoop tools, running in Google’s cloud, without managing infrastructure.
Advantages: No hardware to maintain, rapid scaling, pay-as-you-go, reduced management overhead.
In summary:
• There are multiple packaged flavors of Hadoop (distributions) from big tech companies to make big
data processing easier and more reliable.
• Hadoop is very different from classic SQL systems: It’s best for large-scale, batch analytics and scales
by adding cheap servers.
• You can deploy Hadoop using vendor-integrated systems, or use cloud providers like AWS and Google
for even more convenience.
