
BIG DATA ANALYTICS

UNIT 1
Evolution of Big data — Best Practices for Big data Analytics — Big data characteristics —
Validating —The Promotion of the Value of Big Data — Big Data Use Cases- Characteristics of
Big Data Applications —Perception and Quantification of Value -Understanding Big Data
Storage —A General Overview of High-Performance Architecture— HDFS—MapReduce and
YARN—Map Reduce Programming Model.

INTRODUCTION:

Define What is Data:

Data is a collection of information, and this can be in various forms such as numbers,
text, sound, images, or any other format.

Sources of Data:

1. Human-Generated Data

o Social Media (Posts, Likes, Shares)

o Web Clickstreams & Interactions (Browsing, Purchases)

o Emails & Documents

o Customer Feedback (Surveys, Reviews, Call Recordings)

o Mobile Device Usage

o Manual Enterprise Inputs (CRM, ERP)

2. Machine-Generated Data

o Internet of Things (IoT) Devices (Sensors from homes, industries, wearables,


smart cities)

o Server & Application Logs

o Satellite & Drone Imagery

o Medical Devices (Scanners, Monitors)

o Automotive Telemetry (Car data)

3. Transactional Data

o E-commerce & Retail Purchases (POS)

o Financial Transactions (Banking, Trades)


o Telecommunication Records (Call logs)

4. Public & Open Data

o Government Data (Census, Economic, Health)

o Scientific Research Data

o Digital Archives & Libraries

Defining Big Data

Big Data is simply extremely large and complex collections of information that are so
big, arrive so fast, and come in so many different forms, that regular computer programs and
traditional ways of looking at data can't handle them.

It's about having so much data, so quickly, and in so many different types (like texts, pictures,
videos, and numbers) that you need special tools and methods to understand it and find
useful patterns.

Need for Big Data:

 Too Much Information: We simply have too much data now for old methods to
handle.

 Find Hidden Value: It helps us discover hidden patterns and insights within that huge
amount of data.

 Make Better Choices: This leads to smarter and quicker decisions for businesses and
organizations.

 Improve Everything: From understanding customers better to making operations


more efficient and even predicting future problems.

Define - What is Analytics? (From Data to Decisions)

 We now have massive, fast, and varied data. How do we make sense of it? That's
where 'Analytics' comes in.

 Define Analytics: Analytics is the process of examining raw data to draw conclusions
about that information. It involves applying statistical, mathematical, and
computational techniques.

 Introduce the Levels of Analytics:

o Descriptive Analytics: "What happened? (e.g., Sales reports, website traffic


statistics). This is looking backward."

 Example: "Our sales were up by 10% last quarter."


o Diagnostic Analytics: "Why did it happen? (e.g., Root cause analysis). Digging
deeper into the 'why'."

 Example: "Sales were up because of a successful marketing campaign


and a new product launch."

o Predictive Analytics: "What will happen? (e.g., Forecasting, fraud detection).


Using past data to predict future outcomes."

 Example: "Based on current trends, we predict a 15% increase in sales


next quarter."

o Prescriptive Analytics: "What should I do? (e.g., Recommendation systems,


optimization). Providing recommendations for action."

 Example: "To achieve a 20% sales increase, we should allocate more


budget to digital marketing and offer a loyalty program."

 Illustrate with a simple, common example: Think about a car's GPS:

 Descriptive: It shows where you've been.

 Diagnostic: It tells you why you hit traffic (e.g., accident ahead).

 Predictive: It estimates your arrival time.

 Prescriptive: It recommends the best route to avoid traffic.

Define Big Data Analytics:
The process of examining large and complex datasets (Big Data) to find hidden
patterns, trends, market insights and other valuable information.

The main goal is to help organizations turn data into smart decisions and gain a
competitive advantage.

Why is it Important?
Businesses: Make better decisions, find new ideas, improve how they work, understand
customers, prevent problems.

Society: Helps healthcare, smart cities, science, and many other fields.


1.1 . Evolution of Big data:
The Evolution of Big Data means tracing the historical journey of how data grew in size,
speed, and variety, and how technologies and methods developed to effectively manage,
process, and extract value from these increasingly complex datasets.

i) Early Data Management (1960s-1980s)


 Key Idea: Companies had data, but it was mostly structured data. Think of it like neatly
organized forms or spreadsheets.

 Traditional databases appeared. They handled organized data.

 This was the start of data handling, but it wasn't ready for today's huge amounts of
"big data."

ii) The Rise of the Internet (1990s)


 The internet grew, creating much more data.

 The term "big data" became popular because companies struggled with the volume
(how much), velocity (how fast), and variety (different types) of data.

iii) Web 2.0 and Social Media (2000s)


 Social media platforms generated huge amounts of unorganized (unstructured) data.

 Open-source tools like Apache Hadoop were created to handle big, spread-out
datasets.

 NoSQL databases became popular because they were flexible with different kinds of
data.

iv) Smart Analytics & Cloud (2010s):


 Cloud computing offered flexible systems for storing and processing big data.
 AI (Artificial Intelligence) and machine learning became key for big data analysis,
helping to get deep insights and make predictions.
 The Internet of Things (IoT) added even more complexity with data from connected
devices.

v) AI & IoT Today (2020s and beyond)


 AI and machine learning are now more deeply integrated into big data work, allowing
for real-time analysis and automation.

 Edge computing is becoming important for processing data closer to where it's
created.
 There's a growing focus on data ethics and governance, with more attention on data
privacy and security.

1.2 Best Practices for Big data Analytics:


To make the most of big data, organizations should follow several best practices. Here's a
breakdown of key areas:

I) Strategize and Plan Well

 Define clear Objectives: Before diving into data, clearly articulate what business
problems you're trying to solve or what questions you want to answer. This guides
data collection, analysis, and interpretation.

 Start small and Scale Up: Don't try to analyze every piece of data you have all at once.
Begin with a small, manageable project that can show quick results. This helps you
learn, fix issues, and build confidence before tackling bigger challenges. It's like testing
the waters before diving in.

 Rules for data (Data Governance): You need clear rules about who owns the data,
how good the data needs to be, and how it's kept safe and private. This ensures
everyone uses data consistently and responsibly, avoiding confusion and security risks.

 Work together: Big data projects aren't just for tech people. They need input from
everyone involved: the IT team, the sales team, the marketing team, and even legal
experts.

II)Data Management:

 Good data is key (Data Quality): If your data is messy, incomplete, or wrong, any
analysis you do will also be flawed. Investing time in cleaning and validating your data
upfront saves huge headaches later.

 Bring data together (Data Integration): Businesses collect data from many different
places – your website, customer service, sales, social media, etc. To get a full picture,
you need to combine all this data into one usable format. This often involves special
tools that can connect different systems.

 Choose Right storage: You can't store massive amounts of diverse data in a traditional
spreadsheet. You need specialized storage solutions like data lakes or cloud storage
that can handle huge volumes and different types of data (like videos, text, and
numbers) efficiently.

 Data Security & Privacy (Keep it safe): Data, especially customer information, is
valuable and sensitive. You must protect it from unauthorized access, theft, or
misuse. This involves strong passwords, encryption, and following data privacy laws
like GDPR or India's Digital Personal Data Protection Act.

 Data Lifecycle Management (Know how long to keep it): You don't need to keep every
piece of data forever. Have a plan for how long to store data, when to move it to
cheaper archive storage, and when it's safe to delete it, considering both business
needs and legal requirements.

III) Use Right Tools & Tech

 Grow with data (Scalable Infrastructure): As your data grows, your system needs to
grow with it without breaking down. This means using flexible and scalable
technology, often found in cloud computing (like AWS, Azure, Google Cloud). It's like
having a house that can add more rooms as your family grows.

 Pick the best tools: There are many tools for big data analytics. You need to choose
the right ones that fit your specific tasks. For example, some tools are great for
processing huge amounts of data quickly (like Apache Spark), while others are better
for building predictive models (like Python with its machine learning libraries).

 Automate tasks: Many steps in big data analytics, like collecting data or preparing it,
can be repetitive. Automating these tasks with software not only saves time but also
reduces human errors, making your processes more efficient and reliable.

 Use the cloud: Cloud platforms offer massive computing power and storage that you
can rent. This means you don't have to buy and maintain expensive servers yourself,
making it more cost-effective and flexible for handling big data workloads.

IV) Analyze & Understand:

 Ask questions first (Hypothesis-Driven): Instead of just looking at data aimlessly, form
a "guess" or "question" first. For example, "Does running ads on Instagram lead to
more sales than Facebook ads?" Then, use the data to prove or disprove your guess.
This makes your analysis focused and productive.

 Be accurate (Statistical Rigor): To ensure your findings are trustworthy, you need to
use correct mathematical and statistical methods. This means applying the right
formulas and algorithms to avoid drawing wrong conclusions from your data.

 Show clearly (Data Visualization): Numbers and tables can be hard to understand.
Visualizing your data with charts, graphs, and dashboards makes complex insights
easy to grasp for everyone, even those without a technical background. A picture truly
is worth a thousand words.

 Keep trying (Iterative Process): Big data analysis isn't usually a one-time thing. It's a
continuous cycle. You analyze, get insights, make changes based on those insights, and
then analyze again to see if your changes worked. It's about constant learning and
improvement.

 Explain why (Contextual Understanding): Just stating "sales went up by 10%" isn't
enough. You need to explain why sales went up and what that means for the business.
Was it due to a new marketing campaign, a seasonal trend, or something else?
Providing context makes the insight actionable.

 Make it useful (Actionable Insights): The ultimate goal is not just to find interesting
patterns, but to find insights that you can act upon. These insights should help your
business make better decisions, improve operations, or develop new strategies. If an
insight doesn't lead to action, its value is limited.

V) People & Culture

 Skilled people: You need a team with the right skills: data scientists (who can analyze
data and build models), data engineers (who build systems to manage data), and
business analysts (who understand both data and business needs). Investing in these
roles or training existing staff is crucial.

 Everyone understands data (Data Literacy): For a business to be truly data-driven,


everyone, not just the experts, needs to understand basic data concepts and how to
interpret common charts and reports. This helps foster a culture where decisions are
based on facts, not just gut feelings.

 Keep learning: The world of technology, especially big data, is constantly evolving. Your
team needs to be open to learning new tools, techniques, and trends to stay
competitive and effective.

 Leaders must support: For any big data initiative to succeed, top management must
fully support it. Their buy-in ensures resources are allocated, teams are empowered,
and a data-driven mindset is adopted throughout the organization.

1.3. Characteristics of Big Data (The "Vs")


Introduction:
Big Data refers to extremely large and complex datasets that cannot be effectively
managed, processed, or analyzed using traditional data processing applications. It is defined
by a set of key characteristics, often called the "Vs."

I. The Core 3 Vs:

Volume (The Size of Data)


* Definition: Refers to the immense amount of data generated, collected, and stored. Data
sizes are measured in petabytes (10^15 bytes), exabytes (10^18 bytes), and even
zettabytes (10^21 bytes).

* Sources: Billions of IoT devices, social media feeds, online transactions, sensor data,
scientific experiments.

* Challenge: Traditional databases struggle with sheer size for storage and processing.

* Example: Facebook stores over 300 petabytes of user data.

Velocity (The Speed of Data)

* Definition: Refers to the speed at which data is generated, captured, and, crucially, needs
to be processed. Data is constantly flowing. Often demands real-time or near real-time
analysis to be valuable.

* Challenge: Processing data quickly enough to act on timely insights.

* Examples: High-frequency stock trading (milliseconds), website clickstream analysis, fraud


detection systems, real-time sensor monitoring.

Variety (The Data Types)

* Definition: Refers to the different types and formats of data found in Big Data.

* Types:

* Structured Data: Highly organized, fits into fixed fields/schemas (e.g., relational
databases, spreadsheets).

* Semi-Structured Data: Has some organizational properties but no rigid schema (e.g.,
JSON, XML files, log files).

* Unstructured Data: No predefined format (e.g., text documents, images, audio, video,
social media posts). This is the most prevalent type in Big Data.

* Challenge: Integrating, parsing, and analyzing heterogeneous data types.

II. The Expanded 5 Vs:

Veracity (The Trustworthiness of Data)

* Definition: Refers to the accuracy, quality, consistency, and reliability of the data.

* Challenge: Big Data often comes from numerous, unverified, or inconsistent sources,
leading to noise, errors, and biases.

* Importance: Low veracity can lead to flawed insights and poor decision-making ("Garbage
In, Garbage Out"). Data cleansing and validation are critical.
* Example: Analyzing social media sentiment, where sarcasm or slang can affect accuracy;
sensor data with calibration errors.

Value (The Purpose and Insight from Data)

* Definition: The ultimate goal of Big Data – the ability to extract meaningful insights,
patterns, and actionable knowledge from the data.

* Purpose: To drive better business decisions, optimize operations, create new


products/services, or solve complex problems.

* Key Idea: Without extracting value, the other "Vs" are merely a burden. This requires
advanced analytics, machine learning, and data science expertise.

* Example: Using customer purchase history (Volume, Velocity, Variety, Veracity) to predict
future buying behaviour and offer personalized recommendations (Value).

III. Other "Vs" (Sometimes Mentioned):

Variability: Refers to the inconsistency that can be shown by data, especially in the speed at
which it arrives or the way its meaning can change.

Visualization: The ability to present complex Big Data insights in an easily understandable
and actionable graphical format.

Why are these "Vs" important?


Understanding these characteristics helps organizations recognize why traditional data
management tools are insufficient and why new Big Data technologies (like Hadoop, Spark,
NoSQL databases, AI/ML) are necessary to handle the scale, speed, diversity, quality, and
ultimately, extract the value from modern datasets.

1.4. Validating —The Promotion of the Value of Big Data

Purpose:
 Distinguish between big data hype and real-world applications.

 Hype: When new products come to market, people get excited about them.

 Reality: The real value is understood only by examining the actual benefits and performance of the products.

Key Value Areas :


o Marketing Optimization: Better-targeted ads lead to optimized consumer spending.
o Product Development: Enhanced research/analytics in manufacturing drives
new product innovation.

o Strategic Planning & Innovation: Data-driven insights improve business


strategy, fostering innovation and new startups.

o Supply Chain Efficiency: Predictive analytics optimizes stock, replenishment,


and forecasting.

o Fraud Detection: Improved accuracy and scope in identifying fraudulent


activities.

1.5 Big Data Use Cases:


This section details how different sectors leverage Big Data to gain insights and improve
operations:

1. Financial Services and Banking (with a focus on Fraud Detection):

o Challenge: Financial institutions process an enormous volume of transactions


daily. Identifying fraudulent activities quickly is crucial to prevent significant
losses and maintain customer trust.

o Big Data Solution: By analyzing real-time transaction data using Big Data
platforms, banks can identify patterns that deviate from normal behavior.

o Specific Application: Machine Learning (ML) models are trained on historical


data (both legitimate and fraudulent) to detect unusual or suspicious outgoing
transactions as they occur, enabling immediate alerts and actions.

2. Telecommunication (Network Optimization):

o Challenge: Telecom companies generate vast amounts of data from call


records, network traffic, device usage, and more. Ensuring optimal network
performance and customer satisfaction is a continuous effort.

o Big Data Solution: They analyze call detail records (CDRs) and various network
logs in real-time or near real-time.

o Specific Application: This analysis helps in understanding network congestion


points, identifying service quality issues, predicting demand, and ultimately
improving the overall quality of service (QoS) for subscribers. It also aids in
developing new, data-driven services.

3. Smart Cities (Traffic Management):


o Challenge: Urban areas face increasing traffic congestion, leading to delays,
pollution, and accidents. Efficient traffic flow is vital for city functionality.

o Big Data Solution: Smart cities deploy numerous sensors (e.g., in traffic lights,
on roads, cameras) that continuously generate massive streams of data.

o Specific Application: Big Data analytics is used to process this sensor data to
monitor traffic conditions dynamically, identify bottlenecks, predict
congestion, and even optimize traffic light timings to improve flow and reduce
travel times.

4. Healthcare:

o Challenge: Healthcare generates immense amounts of data, from electronic


health records (EHRs) to medical images, genomic data, and wearable device
data.

o Big Data Solution: Analyzing this diverse data can uncover hidden patterns and
correlations.

o Specific Application: One key application is identifying patterns related to disease
progression, treatment effectiveness, patient responses, and even patterns in medical
research and education. This can lead to more personalized medicine and improved
patient outcomes.

5. Manufacturing Industry (Leveraging IoT for Predictive Maintenance):

o Challenge: Equipment downtime in manufacturing leads to significant


production losses and maintenance costs.

o Big Data Solution: The Internet of Things (IoT) plays a crucial role here. Sensors
are embedded in machinery, continuously collecting data on temperature,
vibration, pressure, performance metrics, etc.

o Specific Application: This "Machine Sensor data" is fed into Big Data analytics
platforms. Advanced analytical models (often Machine Learning) are then
applied to predict when a piece of equipment is likely to fail before it actually
happens. This allows for proactive maintenance, reducing unexpected
downtime and optimizing maintenance schedules.

6. E-Learning (Learning Analytics):

o Challenge: Understanding how students interact with online learning


platforms, what content they engage with, and where they struggle is vital for
effective education.

o Big Data Solution: E-learning platforms collect vast amounts of data on student
activity, performance, interactions, and progress.
o Specific Application: "Learning analytics" uses Big Data to analyze this
information. The goal is to generate performance insights that can then be used
to personalize learning paths, identify at-risk students, optimize course
content, and generally enhance the overall learning experience and outcomes.

1.6 Characteristics of Big Data Applications:


Applications built for Big Data environments must possess specific attributes to
effectively handle the "Vs."

 Scalability (Horizontal Scaling):

o Characteristic: The ability to handle increasing data volumes and user loads
by adding more commodity machines (nodes) to the cluster, rather than
relying on more powerful (and expensive) single servers.

o Implication: Allows systems to grow seamlessly with data generation,


ensuring performance doesn't degrade as data scales.

 Fault Tolerance:

o Characteristic: The system's ability to continue operating correctly even if


individual components (hardware, software processes, network connections)
fail.

o Implication: Achieved through data replication (multiple copies of data


blocks) and automatic task re-execution on healthy nodes. Crucial in
distributed environments where failures are inevitable.

 Distributed Processing:

o Characteristic: Workloads are broken down into smaller, independent tasks


that are executed in parallel across multiple machines in a cluster.

o Implication: Leverages the combined computational power of many nodes,


dramatically reducing the time required to process massive datasets.

 High Throughput vs. Low Latency:

o Characteristic: Big Data applications often prioritize processing large volumes


of data efficiently (high throughput) over providing immediate, low-latency
responses for single queries. However, a growing number of applications
demand both.
o Implication: Different architectural choices are made depending on whether
the primary need is batch processing (high throughput) or real-time analytics
(low latency).

 Ability to Handle Data Diversity:

o Characteristic: Must be able to ingest, store, process, and analyze various


data types: structured, semi-structured, and unstructured data, often from
disparate sources.

o Implication: Requires flexible data models (e.g., schema-on-read in data


lakes, NoSQL databases) and versatile processing frameworks.

 Cost-Effectiveness:

o Characteristic: Often built on clusters of inexpensive, commodity hardware


and leverage open-source software.

o Implication: Makes Big Data solutions financially viable for many


organizations, avoiding the prohibitive costs of proprietary, specialized
hardware/software.

 Data Locality:

o Characteristic: A crucial optimization where computation is moved to the


data's location (or the node storing it) rather than moving large amounts of
data across the network to a central processing unit.

o Implication: Minimizes network I/O, which is a significant bottleneck in


distributed Big Data systems, leading to faster processing.

1.7 Perception and Quantification of Value


Perception of Value:

 Initial Perception: Big Data initially gained attention as a technological buzzword,


often seen as complex, expensive, and difficult to implement. Businesses might
perceive it as a "nice-to-have" or solely as an IT challenge.

 Evolving Perception: As success stories emerge, the perception shifts towards Big
Data being a strategic asset, a source of competitive advantage, and a necessity for
modern business.

 Challenges in Perception:

o Lack of understanding: Business leaders may not fully grasp how data
translates into business value.
o Fear of complexity: Overwhelmed by the technical requirements.

o Unrealistic expectations: Expecting instant, magical solutions without


investment.

Quantification of Value:

 Moving Beyond Buzzwords: It's critical to move from a qualitative "belief" in Big
Data's value to a quantitative "proof" of its impact.

 Key Strategies for Quantification:

o Define Clear KPIs (Key Performance Indicators):

 Financial Metrics: Revenue increase, cost reduction (e.g., fraud losses


prevented, operational efficiency gains), profit margin improvement.

 Operational Metrics: Reduced downtime, improved supply chain


efficiency, faster time-to-market for new products/features.

 Customer Metrics: Increased customer satisfaction scores (CSAT),


reduced churn rate, higher customer lifetime value (CLTV), improved
conversion rates.

 Risk Metrics: Reduction in fraudulent transactions, improved security


breach detection time.

o Establish Baselines: Measure the current state of KPIs before implementing


Big Data solutions to create a clear benchmark for comparison.

o Pilot Projects with Measurable Goals: Start with small-scale, focused projects
that have defined success metrics. This provides concrete evidence of value
and builds internal champions.

o Attribution Modeling: Where possible, isolate the impact of Big Data


analytics from other business initiatives to accurately attribute value.

o Regular Reporting and Communication: Present the results and quantified


value to key stakeholders (executives, department heads) in business terms,
highlighting ROI and strategic impact.

o Case Studies: Document successful internal projects and external industry


examples to build a compelling narrative around Big Data's value.

o Cost-Benefit Analysis: Perform a thorough analysis of the costs involved


(infrastructure, personnel, tools) versus the projected and actual benefits.
1.8 Understanding Big Data Storage
 Big Data Applications: Achieve performance and scalability by deploying on a
collection of storage and computing resources in a runtime environment.

 Core Components: Big data applications depend on the underlying computing


platform's architecture (hardware and software) which leverage four key computing
resources:

1. Processing Capability (CPU/Processor/Node): Often involves multiple cores


per individual CPU, sharing the node's memory, and managing/scheduling
tasks simultaneously (multithreading).

2. Memory: Holds data actively being worked on by the processing node. Single
nodes have limited memory.

3. Storage: Provides data persistence; where datasets are loaded from and
processed in memory.

4. Network: Facilitates data exchange ("pipes") between different processing


and storage nodes.

 High-Performance Platforms: Are collections of computers designed to handle


massive amounts of data and processing by distributing resources among a pool.
Single-node computers are insufficient.

1.9 A General Overview of High-Performance Architecture


High-performance platforms are built by connecting multiple nodes via various network
topologies. The architecture separates the management of computing resources (task
allocation) from the management of data across the network of storage nodes.

Configuration:

 Master Job Manager: Oversees processing nodes, assigns tasks, and monitors
activity.

 Storage Manager: Oversees the data storage pool and distributes datasets across
storage resources.

 Data Locality: To minimize access latency, it's beneficial for data and processing tasks
to be colocated (threads process local or close data).

Example (Apache Hadoop): Used to understand layering and interactions within a big data
platform due to its open-source and published architecture.
However, the general architecture distinguishes the management of computing resources
(and the corresponding allocation of tasks) from the management of the data across the
network of storage nodes, as is seen in the figure below.

In this configuration, a master job manager oversees the pool of processing nodes,
assigns tasks, and monitors the activity. At the same time, a storage manager
oversees the data storage pool and distributes datasets across the collection of
storage resources. While there is no a priori requirement that data and processing
tasks be colocated, it is beneficial from a performance perspective to ensure that
the threads process data that is local or close by, to minimize the cost of
data access latency.

Figure: Typical Organization of resources in a Big Data Platform

1.10 HDFS
Purpose: Enables storage of large files by distributing data among a pool of data nodes.

 Architecture:

o NameNode (Single): Runs in a cluster, associated with one or more data nodes.
Manages the typical hierarchical file organization and namespace. Coordinates
interaction with distributed data nodes.

o File Handling: A single file appears as "chunks" distributed across individual


data nodes.

o Metadata: The NameNode maintains metadata for each file, including history
of changes, managed files, file properties, and mapping of blocks to files at data
nodes.

o Data Node Responsibility: Manages data blocks as separate files and shares
critical information with the NameNode.
 Data Writing Process:

o Initially cached in a temporary file.

o When enough data for a block is available, the NameNode is alerted.

o Temporary block is committed to a permanent data node, then incorporated


into the file management scheme.

 Fault Tolerance: Provided through data replication.

o Replication Factor: An application can specify the number of copies made


when a file is created.

o Management: The NameNode manages replication, optimizing marshalling


and communication of replicated data within the cluster to ensure efficient
network bandwidth. Crucial in large environments with multiple racks of data
servers.

Core Idea: HDFS provides performance through the distribution of data and fault tolerance
through replication. This makes it robust for storing large files reliably.

 Key Features for Reliability/Management: HDFS uses several mechanisms:

o Monitoring (Heartbeats):

 Data nodes send regular "heartbeat" signals to the NameNode.

 If the NameNode doesn't hear a heartbeat, it considers the data node


failed.

 A copy (replica) of the data on the failed node is then created elsewhere
to ensure data availability.

o Rebalancing:

 HDFS automatically moves data blocks from one data node to another.

 This happens when there's free space on a node or when there's high
demand for specific data (e.g., moving data from a slow disk to a faster
one, handling many simultaneous accesses).

 It also helps spread data to react to potential node failures.

o Managing Integrity (Checksums):

 HDFS uses "digital signatures" (checksums) for data blocks.

 Checksums are calculated based on the data's values.


 When a data block is retrieved, its checksum is recalculated and
compared to the stored checksum.

 If they don't match, the data is corrupted, and a replica of that block is
retrieved instead (a small R illustration of checksum comparison follows this list).

o Metadata Replication:

 The NameNode's metadata files (information about the files and their
locations) are also replicated.

 This protects the crucial metadata from corruption or loss.

o Snapshots:

 HDFS allows creating "snapshots."

 These are incremental copies of data at a specific point in time.

 They allow the system to be rolled back to an earlier state if needed


(e.g., in case of accidental deletions or corruption).
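
To make the checksum idea concrete, here is a minimal R sketch of the compare-on-read logic described above. It uses a toy checksum (a simple byte sum) purely for illustration; real HDFS uses CRC-based checksums on each block, so this is a sketch of the principle, not HDFS's actual implementation.

# toy checksum: sum of the raw bytes of a data block (illustrative only)
checksum <- function(block) sum(as.integer(charToRaw(block)))

stored_block    <- "user,amount,timestamp"
stored_checksum <- checksum(stored_block)      # computed when the block was written

retrieved_block <- "user,amount,timestanp"     # simulate a corrupted read (one byte changed)
if (checksum(retrieved_block) != stored_checksum) {
  # in HDFS, the client would fetch a healthy replica of this block instead
  message("Block corrupted: fetching a replica")
}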

1.11 MapReduce and YARN


In Hadoop, MapReduce was originally designed to handle both job management and the
programming model for execution. It follows a master/slave execution model:

 JobTracker (master):

o Manages and monitors all TaskTrackers (slaves)

o Handles job scheduling, task assignment, progress tracking, failure recovery,


and fault tolerance

o Acts as a central point for resource allocation, responding to requests from


multiple clients

 TaskTracker (slave):

o Waits for task assignments from the JobTracker

o Executes tasks and periodically sends status updates

Limitations of MapReduce:

1. Network Latency: It works best when data is local. If data needs to move across the
network, performance suffers.

2. Limited Flexibility: Not all applications fit easily into the MapReduce paradigm.
Alternative programming models may be better suited for some use cases.
Further limitations of the original MapReduce model:

3. Programming Dependency:

o Even if other programming models are used, they still need MapReduce for job
management.

4. Fixed Slot Allocation:

o Cluster nodes are divided into map slots and reduce slots.

o If the workload is heavier in one phase, the other slots go underutilized,


causing inefficiency.

YARN: The Improvement Over MapReduce

YARN = Yet Another Resource Negotiator


YARN separates resource management from job execution, solving many MapReduce
limitations.

Key Components of YARN:

1. ResourceManager:

o Manages cluster-wide resource allocation centrally.

2. NodeManager:

o Runs on each node to manage local resources.

3. ApplicationMaster:

o One per application.

o Negotiates with ResourceManager for resources.

o Monitors task progress and tracks job status.


Advantages of YARN

 Better resource utilization:

o No fixed map/reduce slots — resources can be allocated more flexibly.

 Improved performance:

o Applications are aware of data location across the cluster.

o Allows compute and data colocation, reducing data transfer time.

 Scalability:

o More applications can be efficiently run in parallel.

1.12 Map Reduce Programming Model.


MapReduce Programming Model

MapReduce is a programming model for processing large datasets (hundreds of TB to PB) in a


parallel and distributed way. Originally introduced by Google, it’s widely used in Hadoop.

Core Concepts

 Map Function: Processes input key/value pairs to produce intermediate key/value pairs.

 Reduce Function: Aggregates intermediate values by key to produce the final output.

Execution Flow

1. Input data is split and distributed across worker nodes.

2. Each node applies the Map function to its data.

3. Intermediate key/value pairs are shuffled and grouped by key.

4. The Reduce function aggregates values and writes final results.
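
The flow above can be simulated in a few lines of R. This is only a single-machine sketch of the Map, Shuffle, and Reduce phases (the sample text and function names are made up); a real MapReduce job would run these phases in parallel across the nodes of a Hadoop cluster.

# toy input: each element stands for one input split
splits <- list("big data needs big tools", "data tools for big data")

# MAP: emit (word, 1) pairs for every word in a split
map_fn <- function(text) {
  words <- unlist(strsplit(text, " "))
  lapply(words, function(w) list(key = w, value = 1))
}
mapped <- unlist(lapply(splits, map_fn), recursive = FALSE)

# SHUFFLE: group the intermediate values by key
keys    <- sapply(mapped, function(p) p$key)
grouped <- split(sapply(mapped, function(p) p$value), keys)

# REDUCE: aggregate the values for each key
reduce_fn <- function(key, values) sum(values)
result <- mapply(reduce_fn, names(grouped), grouped)
result   # word counts, e.g. big = 3, data = 3, tools = 2

A word count like this is the standard introductory MapReduce example; the same key/value structure carries over directly to Hadoop implementations.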

Advantages

 Scalable and parallelizable

 Works on commodity hardware

 Efficient for big data processing

 Supports fault tolerance

MapReduce Process Diagram

Figure: the MapReduce process (Map, Shuffle, and Reduce phases).

Three key points explaining the MapReduce diagram:

1. Map Phase: Input data is split and processed in parallel by Map functions, which
generate intermediate key/value pairs.
2. Shuffle Phase: Intermediate data is grouped by keys and shuffled so that all related
values go to the same reducer.

3. Reduce Phase: Reducers aggregate the grouped data and produce the final output
results, which are stored for further use.
UNIT II

Advanced Analytical Theory and Methods: Overview of Clustering — K-means — Use Cases
— Overview of the Method— Determining the Number of Clusters — Diagnostics — Reasons
to Choose and Cautions. - Classification: Decision Trees—Overview of a Decision Tree — The
General Algorithm — Decision Tree Algorithms—Evaluating a Decision Tree— Decision Trees
in R — Naïve Bayes — Bayes Theorem—Naïve Bayes Classifier.

1. Overview of Clustering

 Definition: Clustering is an unsupervised machine learning technique that aims to


group a set of data points into subsets (clusters) such that data points within the
same cluster are more similar to each other than to those in other clusters.

 Unsupervised Learning: Crucially, unlike classification, clustering does not require


labeled data. The algorithm finds inherent structures or groupings in the data on its
own.

 Goal: To discover natural groupings or hidden patterns in datasets where the class
labels are unknown.

 Similarity/Distance Metrics: Clustering algorithms rely on a measure of similarity or


dissimilarity (distance) between data points to form clusters. Common metrics
include Euclidean distance, Manhattan distance, and cosine similarity (a short R
illustration appears at the end of this section).

 Key Concepts:

o Cluster: A collection of data points that are grouped together because of their
similarity.

o Centroid: The center point of a cluster, often the mean of all data points in
that cluster (especially in K-means).

o Inter-cluster Distance: Distance between different clusters (should be


maximized for good clustering).

o Intra-cluster Distance: Distance between points within the same cluster


(should be minimized for good clustering).

 Applications:

o Customer Segmentation: Grouping customers with similar purchasing


behaviors or demographics.

o Anomaly Detection: Identifying outliers that do not belong to any significant


cluster.
o Document Clustering: Grouping similar articles, web pages, or research
papers.

o Image Segmentation: Dividing an image into regions of similar pixels.

o Bioinformatics: Grouping genes with similar expression patterns.

 Types of Clustering Algorithms:

o Partitioning Methods: Divide data into a specified number of clusters (e.g., K-


means, K-medoids).

o Hierarchical Methods: Build a hierarchy of clusters (agglomerative or


divisive).

o Density-Based Methods: Discover clusters of arbitrary shape based on data


point density (e.g., DBSCAN).

o Model-Based Methods: Assume a statistical model for each cluster (e.g.,


Gaussian Mixture Models).
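
As a small illustration of the distance metrics mentioned earlier in this section, the following R lines compute Euclidean and Manhattan distances with base R's dist() and cosine similarity by hand (the two example points are made up).

a <- c(1, 2, 3)
b <- c(4, 6, 3)
dist(rbind(a, b), method = "euclidean")   # sqrt(9 + 16 + 0) = 5
dist(rbind(a, b), method = "manhattan")   # 3 + 4 + 0 = 7
cosine_sim <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine_sim                                # about 0.86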

2. K-means

 Type: K-means is a popular partitioning-based clustering algorithm.

 Objective: To partition 'n' observations into 'k' clusters, where 'k' is a pre-specified
number. Each observation belongs to the cluster with the nearest mean (centroid).

 Algorithm Steps (Iterative Process; a short R sketch appears at the end of this section):

1. Initialization: Randomly select 'k' data points from the dataset to be the
initial cluster centroids. (The choice of initial centroids can influence the final
result).

2. Assignment Step (E-step - Expectation):

 For each data point in the dataset, calculate its distance (e.g.,
Euclidean distance) to all 'k' centroids.

 Assign each data point to the cluster whose centroid is closest.

3. Update Step (M-step - Maximization):

 After all data points are assigned, recalculate the new centroids for
each cluster. The new centroid is typically the mean of all data points
currently assigned to that cluster.

4. Iteration: Repeat steps 2 and 3. The algorithm converges when:


 The cluster assignments no longer change.

 The centroids no longer change significantly.

 A maximum number of iterations is reached.

 Goal Function (Minimization): K-means aims to minimize the Within-Cluster Sum of


Squares (WCSS), also known as inertia. This is the sum of the squared distances
between each data point and its assigned cluster centroid.

 Advantages:

o Relatively simple to understand and implement.

o Computationally efficient for large datasets.

o Guaranteed to converge (though not necessarily to the global optimum).

 Disadvantages:

o Requires specifying the number of clusters ('k') in advance.

o Sensitive to the initial placement of centroids (can lead to different local


optima). Often run multiple times with different initializations.

o Sensitive to outliers, which can heavily influence cluster centroids.

o Assumes clusters are spherical (or globular) and of similar size and density.

o Struggles with clusters of arbitrary shapes or varying densities.

o Scales linearly with the number of data points, but the iterative nature can be
slow for extremely high dimensions.
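
A minimal R sketch of the procedure described above, using the built-in iris measurements purely as example data. Base R's kmeans() carries out this kind of iterative assign-and-update clustering (its default algorithm is a refinement of the basic loop above), and the nstart argument reruns it from several random initializations to reduce sensitivity to the starting centroids.

data(iris)
X <- scale(iris[, 1:4])               # feature scaling: K-means is distance based
set.seed(42)                          # make the random initialization reproducible
km <- kmeans(X, centers = 3, nstart = 25)
km$centers                            # final centroids, one row per cluster
km$tot.withinss                       # the WCSS (inertia) that K-means minimizes
table(Cluster = km$cluster, Species = iris$Species)   # compare clusters with known labels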

3. Use Cases (K-means)

K-means is a versatile algorithm applicable in various domains due to its simplicity and
efficiency.

 Customer Segmentation:

o Application: Grouping customers based on demographics, purchase history,


browsing behavior, or loyalty program data.

o Benefit: Enables targeted marketing campaigns, personalized product


recommendations, and improved customer relationship management.

 Document Classification/Clustering:
o Application: Grouping articles, research papers, news reports, or emails
based on their textual content.

o Benefit: Helps in organizing large document collections, information retrieval,


and topic discovery.

 Image Compression/Quantization:

o Application: Reducing the number of distinct colors in an image by grouping


similar colors together. The centroid of each cluster becomes the
representative color.

o Benefit: Reduces file size while maintaining visual quality to an acceptable


degree.

 Anomaly Detection:

o Application: Identifying data points that are far from any cluster centroid or
form very small, isolated clusters. These can be considered outliers.

o Benefit: Useful in fraud detection (e.g., unusual transaction patterns),


network intrusion detection, or identifying defective products.

 Geospatial Analysis:

o Application: Grouping geographical locations with similar characteristics (e.g.,


crime hotspots, areas with similar demographic profiles, optimal placement
for new retail stores).

o Benefit: Aids urban planning, resource allocation, and business expansion


strategies.

 Scientific Data Analysis:

o Application: Grouping biological samples, astronomical objects, or


experimental results with similar properties.

o Benefit: Helps in discovering patterns, classifying new observations, and


formulating hypotheses.

 Recommender Systems:

o Application: Grouping users with similar preferences or items with similar


attributes to generate personalized recommendations.

o Benefit: Enhances user experience and drives engagement with products or


content.
4. Overview of the Method – Determining the Number of Clusters

This topic refers specifically to the methods used to find the optimal 'k' for K-means
clustering.

 The "K" Problem: One of the main challenges with K-means is that the number of
clusters ('k') must be specified by the user before running the algorithm. An incorrect
'k' can lead to misleading or suboptimal clustering results.

 Goal: To find a 'k' that represents a good balance between having too few (broad,
less meaningful) and too many (over-specific, potentially overfitting) clusters.

 Common Methods (the first two are sketched in R after this list):

1. Elbow Method:

 Concept: Plots the Within-Cluster Sum of Squares (WCSS) (also


known as inertia or distortion) against different values of 'k'. WCSS
generally decreases as 'k' increases (because points are closer to their
own cluster's centroid).

 Identification: The "elbow" point on the plot is where the rate of


decrease in WCSS sharply changes. This point is often considered the
optimal 'k', as adding more clusters beyond this point provides
diminishing returns in terms of reducing distortion.

 Limitation: The "elbow" can sometimes be ambiguous or non-


existent.

2. Silhouette Analysis (or Silhouette Score):

 Concept: For each data point, it calculates a silhouette coefficient that


measures how similar it is to its own cluster (cohesion) compared to
other clusters (separation).

 Interpretation:

 Score close to +1: The data point is well-matched to its own


cluster and poorly matched to neighboring clusters.

 Score close to 0: The data point is on or very close to the


decision boundary between two clusters.

 Score close to -1: The data point is likely assigned to the wrong
cluster.

 Identification: The optimal 'k' is typically the one that yields the
highest average silhouette score across all data points, indicating well-
separated and compact clusters.
3. Gap Statistic:

 Concept: Compares the WCSS of your data for various 'k' values to the
WCSS of a reference (randomly distributed) dataset.

 Identification: The optimal 'k' is where the gap between the actual
WCSS curve and the reference WCSS curve is maximized. This suggests
that the clustering found is significantly better than random.

4. Domain Knowledge/Business Understanding:

 Concept: Sometimes, expert knowledge of the data or the business


context can provide a reasonable estimate for 'k'.

 Benefit: If you know you're looking for, say, "3 distinct customer
segments," you might start with k=3 and validate.

5. Dunn Index / Davies-Bouldin Index: Other internal validation metrics that


can be used to evaluate cluster quality for different 'k' values.
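
Here is a short R sketch of the first two methods, again using the iris measurements as stand-in data; silhouette() is assumed to come from the CRAN cluster package.

library(cluster)                      # provides silhouette()
X <- scale(iris[, 1:4])
d <- dist(X)
ks <- 2:8
set.seed(42)

# Elbow method: plot WCSS against k and look for the bend
wcss <- sapply(ks, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
plot(ks, wcss, type = "b", xlab = "k", ylab = "WCSS")

# Silhouette analysis: pick the k with the highest average silhouette width
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(X, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
ks[which.max(avg_sil)]                # candidate optimal k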

5. Diagnostics – Reasons to Choose and Cautions (for Clustering/K-means)

This section covers the practical considerations when deciding to use clustering, particularly
K-means, and what pitfalls to watch out for.

5.1 Reasons to Choose K-means (or Clustering in General):

 Unlabeled Data: When you have a large dataset but no predefined categories or
labels, and you want to discover inherent structures or groups.

 Exploratory Data Analysis (EDA): To gain initial insights into the underlying
distribution and patterns within your data.

 Data Reduction/Summarization: To represent large datasets by a smaller number of


cluster centroids, simplifying subsequent analysis.

 Feature Engineering: Cluster assignments can be used as new features for other
supervised learning models.

 Targeted Actions: For applications like customer segmentation, where identifying


distinct groups allows for customized strategies (e.g., marketing, product
development).

 Scalability (K-means specifically): K-means is relatively efficient and scales well to


large datasets compared to some other clustering algorithms.

 Simplicity: Easy to understand, implement, and interpret the results.


5.2 Cautions (Limitations & Pitfalls of K-means):

 Pre-specifying 'k': The biggest drawback. Choosing the wrong 'k' leads to inaccurate
or uninformative clusters. Requires external methods (Elbow, Silhouette) or domain
knowledge.

 Sensitivity to Initial Centroids:

o Random initialization can lead to different final clusterings (local optima) if


the algorithm is not run multiple times with different starting points (e.g.,
using k-means++ for smarter initialization).

 Assumption of Spherical/Globular Clusters:

o K-means works best when clusters are compact, roughly spherical, and of
similar size and density.

o It struggles with irregularly shaped clusters (e.g., crescent-shaped,


intertwined) or clusters with varying densities.

 Sensitivity to Outliers:

o Outliers (extreme data points) can significantly skew the cluster centroids,
leading to poor clustering results.

o Preprocessing steps like outlier detection and removal/treatment might be


necessary.

 Assumes Equal Variance: Implicitly assumes that clusters have similar variance
(spread).

 Impact of Feature Scaling: K-means uses distance metrics, so features with larger
ranges can disproportionately influence the clustering. Feature scaling
(normalization/standardization) is almost always required.

 Not Suitable for Categorical Data (directly): K-means is designed for numerical data.
Categorical features need to be appropriately encoded (e.g., one-hot encoding)
before use, which can increase dimensionality.

 "Curse of Dimensionality": In very high-dimensional spaces, distance metrics


become less meaningful, making clustering more challenging.

6. Classification: Decision Trees

 Definition: A supervised machine learning algorithm used for both classification


(predicting categorical outcomes) and regression (predicting continuous outcomes).
 Analogy: It mimics human decision-making by creating a tree-like model of decisions
and their possible consequences.

 How it Works (High-Level): The algorithm learns simple "if-then-else" decision rules
from the training data. It recursively splits the dataset into smaller subsets based on
feature values until the subsets are relatively "pure" (i.e., contain mostly instances of
a single class).

 Interpretability: Decision trees are often called "white box" models because their
decision-making process is transparent and easy to visualize and understand, unlike
"black box" models (e.g., neural networks).

 Applications: Customer churn prediction, medical diagnosis, credit risk assessment,


spam detection, loan default prediction.

7. Overview of a Decision Tree (Structure and Terminology)

Understanding the components of a decision tree:

 Root Node:

o The topmost node in the tree.

o Represents the initial decision or the entire dataset before any splits.

 Internal Node (Decision Node):

o A node that has two or more branches.

o Represents a test on a specific attribute or feature (e.g., "Age > 30?", "Income
= High?").

o The outcome of this test determines which branch to follow.

 Branch (Edge):

o Represents the outcome of a test condition from an internal node.

o Leads to another internal node or a leaf node.

 Leaf Node (Terminal Node):

o A node that does not split further.

o Represents the final class prediction or a specific outcome. Each leaf node
ideally contains data points predominantly belonging to one class.

 Splitting Criterion (Impurity Measure):


o A metric used to determine the "best" attribute to split on at each node. The
goal is to choose a split that creates the purest possible child nodes (i.e.,
nodes where instances mostly belong to one class).

o Common impurity measures for classification (both are computed in a short R sketch at the end of this section):

 Gini Impurity: Measures the probability of incorrectly classifying a


randomly chosen element in the dataset if it were randomly labeled
according to the distribution of labels in the subset. Lower Gini
impurity is better.

 Entropy/Information Gain: Entropy measures the impurity or


randomness of a set of samples. Information Gain is the reduction in
entropy achieved by a split; the algorithm seeks to maximize
information gain.

 Tree Depth: The longest path from the root node to a leaf node. Deeper trees can
capture more complex relationships but are prone to overfitting.
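
As a quick illustration of the two impurity measures above, the following R lines compute Gini impurity and entropy for a toy node that contains three 'yes' and two 'no' labels.

y <- c("yes", "yes", "yes", "no", "no")   # class labels reaching this node
p <- table(y) / length(y)                 # class proportions: 0.4 no, 0.6 yes
gini    <- 1 - sum(p^2)                   # Gini impurity = 0.48
entropy <- -sum(p * log2(p))              # entropy is about 0.971 bits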

8. The General Algorithm (Decision Tree Construction)

The process of building a decision tree is typically a recursive, top-down, greedy approach.

1. Start at the Root Node: Begin with the entire training dataset at the root of the tree.

2. Select the "Best" Splitting Attribute (illustrated in the R sketch at the end of this section):

o For each attribute (feature) in the dataset, calculate its impurity (e.g., Gini
Impurity, Entropy) if a split were made on that attribute.

o Determine the Information Gain (or reduction in Gini Impurity) achieved by


splitting on each attribute.

o Choose the attribute that yields the highest Information Gain (or lowest Gini
Impurity) as the best splitting criterion for the current node. This attribute is
typically the most discriminative.

3. Create Child Nodes:

o Based on the chosen splitting attribute and its values/conditions, create new
child nodes.

o Partition the dataset into subsets, with each subset corresponding to a


branch leading to a new child node.

4. Recurse:
o Repeat steps 2 and 3 for each new child node using its corresponding subset
of the data.

o This process continues recursively until a stopping criterion is met.

5. Stopping Criteria (When to Stop Splitting):

o Pure Nodes: All instances in a node belong to the same class (perfect
classification at that node).

o No More Attributes: No attributes left to split on.

o Minimum Number of Samples per Leaf: The number of data points in a node
falls below a predefined threshold (prevents creating overly specific leaves).

o Maximum Depth: The tree reaches a predefined maximum depth (prevents


overfitting).

o Minimum Information Gain: No split yields a significant improvement in


impurity reduction.

6. Assign Leaf Class: Once a stopping criterion is met, the node becomes a leaf node.
The class label for this leaf is assigned based on the majority class of the data points
within that node.

 Pruning (Post-processing): After the tree is fully grown, pruning techniques are often
applied to remove branches that might be due to noise or overfitting, improving the
tree's generalization ability. This can be done by evaluating the tree's performance
on a validation set.
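
To make step 2 concrete, here is a self-contained R sketch that computes the information gain of two toy attributes and picks the better one for the root split. The tiny play/outlook/windy vectors are invented for illustration.

entropy <- function(y) { p <- table(y) / length(y); -sum(p * log2(p)) }

info_gain <- function(y, x) {
  splits   <- split(y, x)                  # partition the labels by attribute value
  weighted <- sum(sapply(splits, function(s) length(s) / length(y) * entropy(s)))
  entropy(y) - weighted                    # reduction in entropy achieved by the split
}

play    <- c("no", "no", "yes", "yes", "yes", "no")
outlook <- c("sunny", "sunny", "overcast", "rain", "rain", "rain")
windy   <- c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)

gains <- c(outlook = info_gain(play, outlook), windy = info_gain(play, windy))
gains                       # outlook has the higher gain here
names(which.max(gains))     # attribute chosen for the root split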

9. Decision Tree Algorithms

Several algorithms implement the general decision tree construction process, each with
slight variations.

 ID3 (Iterative Dichotomiser 3):

o Developer: Ross Quinlan (1986).

o Splitting Criterion: Uses Information Gain.

o Limitations:

 Only handles categorical attributes. Numerical attributes must be


binned/discretized.

 Does not handle missing values.

 Prone to overfitting if the tree is grown too deep.


 Prefers attributes with many values (can bias towards them).

 C4.5:

o Developer: Ross Quinlan (1993), successor to ID3.

o Improvements over ID3:

 Handles both categorical and numerical attributes (by dynamically


creating thresholds for numerical ones).

 Handles missing values by assigning a probability to each possible


outcome.

 Employs pruning to avoid overfitting.

 Uses Gain Ratio (a modification of Information Gain) to overcome the


bias of favoring attributes with many outcomes.

o Output: Generates a decision tree and can also extract "if-then" rules.

 C5.0:

o Developer: Ross Quinlan, proprietary successor to C4.5.

o Improvements over C4.5:

 Faster and more memory-efficient.

 Produces smaller rule sets.

 Supports boosting (combining multiple models for better accuracy).

 CART (Classification and Regression Trees):

o Developer: Leo Breiman, Jerome Friedman, Richard Olshen, Charles Stone


(1984).

o Versatility: Can be used for both classification (predicting a category) and


regression (predicting a continuous value).

o Splitting Criterion:

 For classification trees: Uses Gini Impurity.

 For regression trees: Uses least squares deviation (minimizes sum of


squared errors).

o Binary Splits: CART trees are strictly binary, meaning each internal node has
exactly two branches.

o Pruning: Supports cost-complexity pruning.


o Widespread Use: The CART algorithm forms the basis for many modern
ensemble methods like Random Forests and Gradient Boosting Machines
(GBMs).

11. Evaluating a Decision Tree

Evaluating a decision tree (or any classification model) involves assessing its performance on
unseen data.

 Train-Test Split:

o Concept: Divide the dataset into two parts: a training set (e.g., 70-80%) used
to build the model, and a testing set (e.g., 20-30%) used to evaluate its
performance on new, unseen data.

o Importance: Prevents overfitting, where a model performs well on training


data but poorly on new data.

 Confusion Matrix:

o Concept: A table that summarizes the performance of a classification model,


showing the number of correct and incorrect predictions for each class.

o Components for Binary Classification:

 True Positive (TP): Actual positive, predicted positive.

 True Negative (TN): Actual negative, predicted negative.

 False Positive (FP): Actual negative, predicted positive (Type I error).

 False Negative (FN): Actual positive, predicted negative (Type II error).

 Key Performance Metrics (derived from the Confusion Matrix; these formulas are evaluated in a short R sketch at the end of this section):

o Accuracy = (TP + TN) / (TP + TN + FP + FN). Overall correctness; can be misleading for imbalanced datasets.

o Precision (Positive Predictive Value) = TP / (TP + FP). Proportion of positive predictions that were actually correct. Important when minimizing false positives.

o Recall (Sensitivity or True Positive Rate) = TP / (TP + FN). Proportion of actual positives that were correctly identified. Important when minimizing false negatives.

o F1-Score = 2 × (Precision × Recall) / (Precision + Recall). The harmonic mean of Precision and Recall, useful for imbalanced datasets where both precision and recall are important.

o Specificity (True Negative Rate) = TN / (TN + FP). Proportion of actual negatives that were correctly identified.

 ROC Curve (Receiver Operating Characteristic Curve):

o Concept: A graphical plot that illustrates the diagnostic ability of a binary


classifier system as its discrimination threshold is varied.

o Plots: True Positive Rate (Recall) against False Positive Rate.

o Interpretation: A curve closer to the top-left corner indicates better


performance.

 AUC (Area Under the ROC Curve):

o Concept: A single scalar value that summarizes the overall performance of a


classifier across all possible thresholds.

o Interpretation: An AUC of 1 indicates a perfect classifier; 0.5 indicates a


random classifier.

 Cross-Validation:

o Concept: A robust technique to estimate model performance and prevent


overfitting. The dataset is split into 'k' folds. The model is trained 'k' times,
each time using 'k-1' folds for training and the remaining fold for testing. The
results are then averaged.

o Benefits: Provides a more reliable estimate of model generalization error


than a single train-test split.
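
A minimal sketch of k-fold cross-validation for a decision tree using the caret package (assuming caret and rpart are installed; the built-in iris dataset is used only as an example):

    library(caret)

    set.seed(42)
    ctrl <- trainControl(method = "cv", number = 10)       # 10-fold cross-validation
    cv_model <- train(Species ~ ., data = iris,
                      method = "rpart", trControl = ctrl)  # CART-style tree via rpart
    print(cv_model)   # accuracy averaged over the 10 folds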

12. Decision Trees in R

R is a powerful statistical programming language with excellent libraries for implementing


and analyzing decision trees.

 Key Packages:

o rpart (Recursive Partitioning and Regression Trees): A highly popular package


for building CART-like decision trees for both classification and regression.

o tree: Another package for constructing classification and regression trees.


o caret: Provides a unified interface for training and evaluating various machine
learning models, including decision trees.

o party / partykit: For conditional inference trees (which address some biases of
traditional trees) and pretty visualizations.

Basic Workflow in R (using rpart):

1. Load Data: Read your dataset into R.

2. Data Preparation:

 Handle missing values (impute or remove).

 Convert relevant columns to factors (for categorical variables) or


numeric (for numerical variables).

 Split data into training and testing sets (e.g., using createDataPartition
from caret or sample).

3. Train the Decision Tree Model:

 Use the rpart() function.

 Syntax: model <- rpart(TargetVariable ~ Feature1 + Feature2 + ..., data


= training_data, method = "class" OR "anova")

 method = "class" for classification trees.

 method = "anova" for regression trees.

 You can use . instead of listing all features to include all others.

4. Visualize the Tree (Optional but Recommended):

 plot(model): Plots the tree structure.

 text(model): Adds text labels to the nodes.

 For prettier plots, use prp() from the rpart.plot package, or build a conditional inference tree with ctree() from partykit and call plot() on it.

5. Make Predictions:

 Use the predict() function on your trained model with the test data.

 predictions <- predict(model, newdata = test_data, type = "class") (for


class labels)

 predictions_prob <- predict(model, newdata = test_data, type =


"prob") (for class probabilities)
6. Evaluate Performance:

 Compare predictions with actual test_data$TargetVariable.

 Use functions for confusion matrix (table(), confusionMatrix() from


caret), accuracy, precision, recall, F1-score.

 Plot ROC curves (ROCR package).

 Pruning in R:

o Decision trees can overfit. rpart automatically handles some complexity, but
explicit pruning is often needed.

o The printcp() function shows the complexity parameter table.

o The prune() function can be used to prune the tree based on a desired
complexity parameter (cp) value.

o pruned_model <- prune(model, cp =


model$cptable[which.min(model$cptable[,"xerror"]),"CP"])
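
Putting the workflow together, a minimal end-to-end sketch on the built-in iris dataset (assuming the rpart and rpart.plot packages are installed):

    library(rpart)
    library(rpart.plot)

    set.seed(123)
    idx        <- sample(seq_len(nrow(iris)), size = 0.7 * nrow(iris))  # 70/30 split
    train_data <- iris[idx, ]
    test_data  <- iris[-idx, ]

    model <- rpart(Species ~ ., data = train_data, method = "class")    # classification tree
    prp(model)                                                          # plot the tree

    pred <- predict(model, newdata = test_data, type = "class")
    table(Predicted = pred, Actual = test_data$Species)                 # confusion matrix
    mean(pred == test_data$Species)                                     # accuracy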

13. Naïve Bayes

 Definition: A family of simple, yet powerful, probabilistic classifiers based on Bayes'


Theorem with a strong (or "naive") assumption of conditional independence
between features.

 Probabilistic Model: It calculates the probability of a data point belonging to a


certain class given its features, and then predicts the class with the highest
probability.

 "Naive" Assumption: The core assumption is that the presence or absence of a


particular feature (attribute) of a class is unrelated to the presence or absence of any
other feature, given the class variable. This is a simplification that rarely holds true in
real-world data, but the algorithm often performs surprisingly well nonetheless.

14. Bayes Theorem

 Foundation: A fundamental theorem in probability theory that describes the


probability of an event, based on prior knowledge of conditions that might be related
to the event.

 Formula: P(A∣B) = P(B∣A) ⋅ P(A) / P(B) Where:


o P(A∣B): Posterior Probability – The probability of hypothesis A being true,
given that evidence B has occurred. (What we want to find: the probability of
a class given the observed features).

o P(B∣A): Likelihood – The probability of evidence B occurring, given that


hypothesis A is true. (The probability of observing the features given a
particular class).

o P(A): Prior Probability – The initial probability of hypothesis A being true,


before any evidence is considered. (The probability of a class occurring
independently of any features).

o P(B): Marginal Probability of Evidence – The probability of evidence B


occurring, regardless of the hypothesis A. (The probability of observing the
given features across all classes). This term acts as a normalizing constant.

15. Naïve Bayes Classifier

 Application of Bayes' Theorem to Classification:

o The goal of a classifier is to predict the class Ck (where k is one of the possible
classes) for a given data point with features x=(x1,x2,...,xn).

o We want to find P(Ck∣x1,x2,...,xn).

o Using Bayes' Theorem: P(Ck∣x1,...,xn) = P(x1,...,xn∣Ck) ⋅ P(Ck) / P(x1,...,xn)

 Applying the "Naive" Independence Assumption:

o The crucial simplification: The features are assumed to be conditionally


independent given the class. This means: P(x1,...,xn∣Ck)≈P(x1∣Ck)⋅P(x2∣Ck
)⋅...⋅P(xn∣Ck)

o Substituting this into Bayes' Theorem: P(Ck∣x1,...,xn) ∝ P(Ck) ⋅ ∏i=1..n P(xi∣Ck)


(The denominator P(x1,...,xn) is the same for all classes, so we can ignore it
for classification, as we only need to compare relative probabilities).

 Decision Rule: The Naive Bayes classifier predicts the class Ck that maximizes this
posterior probability for a given set of features.

 How Probabilities are Estimated from Data:

o P(Ck) (Prior Probability of Class): Calculated as the frequency of class Ck in


the training dataset.

o P(xi∣Ck) (Likelihood of Feature given Class):


 For Categorical Features: Calculated as the frequency of feature value
xi appearing in instances belonging to class Ck.

 For Numerical Features: Often modeled using a probability


distribution, most commonly the Gaussian (Normal) distribution. This
leads to Gaussian Naive Bayes.

 Types of Naive Bayes Classifiers:

o Gaussian Naive Bayes: Assumes continuous features follow a Gaussian


distribution.

o Multinomial Naive Bayes: Suitable for discrete counts, particularly in text


classification (e.g., word counts in documents).

o Bernoulli Naive Bayes: For binary features (e.g., presence or absence of a


word in a document).

 Advantages:

o Simplicity and Speed: Easy to implement and very fast to train and predict,
even with large datasets.

o Scalability: Performs well on high-dimensional data.

o Efficiency: Requires a relatively small amount of training data to estimate


parameters.

o Good Performance: Often performs surprisingly well in practice, especially in


text classification.

 Disadvantages:

o Strong Independence Assumption: The "naive" assumption often does not


hold true in real-world data, which can limit accuracy in some cases.

o Zero-Frequency Problem: If a particular feature value does not appear in the


training data for a certain class, its conditional probability will be zero, causing
the entire product for that class to become zero. This is commonly addressed
using Laplace Smoothing (add-one smoothing).

o Output Probabilities: While it's a probabilistic classifier, the raw probabilities


outputted by Naive Bayes might not be perfectly calibrated due to the
independence assumption.

 Common Use Cases:

o Spam Filtering: Classifying emails as spam or not spam.


o Text Classification: Sentiment analysis, topic classification, categorizing
documents.

o Recommendation Systems: Suggesting items based on user preferences.

o Medical Diagnosis: Simple diagnostic tasks (though often combined with


other methods).
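
As a hedged illustration of the ideas above, a Naive Bayes classifier can be fit in R with the e1071 package (assuming it is installed); iris is used only as an example, and since its features are numeric the model behaves as Gaussian Naive Bayes (the laplace argument applies the Laplace smoothing mentioned above to categorical features and has no effect on purely numeric data):

    library(e1071)

    set.seed(1)
    idx        <- sample(seq_len(nrow(iris)), size = 0.7 * nrow(iris))
    train_data <- iris[idx, ]
    test_data  <- iris[-idx, ]

    nb_model <- naiveBayes(Species ~ ., data = train_data, laplace = 1)  # priors and likelihoods estimated from the training data
    pred     <- predict(nb_model, newdata = test_data)                   # class with the highest posterior probability
    table(Predicted = pred, Actual = test_data$Species)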
UNIT III

Advanced Analytical Theory and Methods: Association Rules—Overview—Apriori


Algorithm— Evaluation of Candidate Rules— Applications of Association Rules— Finding
Association & finding similarity — Recommendation System: Collaborative Recommendation-
Content Based Recommendation — Knowledge Based Recommendation- Hybrid
Recommendation Approaches.

1. Association Rules – Overview

 What is it? Association rule mining is an unsupervised data mining technique that
helps discover interesting relationships or co-occurrences among items in large
datasets. Think of it as finding "if-then" patterns.

 Purpose: To identify strong rules discovered in databases using different measures


of "interestingness." It's often used for Market Basket Analysis.

 Analogy (Market Basket Analysis): Imagine a supermarket wanting to know which


products customers often buy together. If a customer buys bread, do they also tend
to buy milk? This is what association rules help figure out.

 Key Concepts:

o Itemset: A collection of one or more items (e.g., {Bread, Milk}).

o Transaction: A single event where a set of items are bought together (e.g., a
customer's shopping cart).

o Rule: An implication of the form X→Y, where X and Y are disjoint itemsets.

 X is the antecedent (the "if" part).

 Y is the consequent (the "then" part).

 Example: {Diapers} → {Beer} (If a customer buys diapers, they also tend to buy beer).

 Measures of Rule Strength (Important for Evaluation):

o Support: How frequently an itemset appears in the dataset.

 Support(X) = (Number of transactions containing X) / (Total number of transactions)

 Support(X→Y) = (Number of transactions containing both X and Y) / (Total number of transactions)
 Meaning: Indicates the popularity of an itemset or a rule. Low support
means the rule is rare.

o Confidence: How often items in Y appear in transactions that also contain X.

 Confidence(X→Y) = Support(X∪Y) / Support(X)

 Meaning: Indicates the reliability of the rule. A high confidence means


that if X is present, Y is very likely to also be present.

o Lift: Measures how much more likely Y is purchased when X is purchased,


relative to the general probability of purchasing Y.

 Lift(X→Y) = Support(X∪Y) / (Support(X) ⋅ Support(Y))

 Meaning:

 Lift=1: X and Y are independent (no association).

 Lift>1: Positive association (buying X increases the likelihood of


buying Y).

 Lift<1: Negative association (buying X decreases the likelihood


of buying Y).

 Lift is crucial because high support and confidence alone can be


misleading if the items are simply very popular individually.
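
A small worked example with made-up numbers: suppose there are 100 transactions, 20 contain Bread, 25 contain Milk, and 10 contain both.

    Support({Bread, Milk})        = 10 / 100 = 0.10
    Confidence({Bread} → {Milk})  = 0.10 / 0.20 = 0.50
    Lift({Bread} → {Milk})        = 0.10 / (0.20 × 0.25) = 2.0

A lift of 2.0 means customers who buy Bread are twice as likely to buy Milk as customers in general, so the rule reflects a genuine positive association rather than the items simply being popular on their own.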

2. Apriori Algorithm

 Purpose: The most classic and widely used algorithm for discovering association rules.
It efficiently finds frequent itemsets (item sets that meet a minimum support
threshold). Once frequent item sets are found, association rules can be generated from
them.

 Core Principle: The Apriori Property (Downward Closure Property):

o "If an itemset is frequent, then all of its subsets must also be frequent."

o Conversely, "If an itemset is infrequent (does not meet minimum support),


then all of its supersets must also be infrequent."

o Why this is important: This property allows the algorithm to prune (cut off)
many candidate itemsets early, drastically reducing the search space and
making the process efficient.

 Algorithm Steps (Iterative, Level-wise Search):

1. Generate Frequent 1-Itemsets (L1):


 Scan the entire dataset once.

 Count the occurrences of each individual item.

 Keep only those items whose count meets the minimum support
threshold. These are your frequent 1-itemsets.

 Example: If Min_Support = 2 transactions, and 'A' appears in 3


transactions, 'B' in 1, then 'A' is frequent, 'B' is not.

2. Generate Candidate k-Itemsets (Ck):

 Use the frequent (k-1)-itemsets (Lk−1) found in the previous step to


generate candidate k-itemsets. This is done by joining Lk−1 with itself
(e.g., if {A,B} and {A,C} are frequent 2-itemsets, candidate {A,B,C} might
be generated).

 Pruning Step: Apply the Apriori property: If any (k-1)-subset of a


candidate k-itemset is not frequent (i.e., not in Lk−1), then prune
(remove) that candidate k-itemset. It cannot be frequent.

3. Generate Frequent k-Itemsets (Lk):

 Scan the dataset again.

 Count the occurrences of the pruned candidate k-itemsets (Ck).

 Keep only those that meet the minimum support threshold. These are
your frequent k-itemsets.

4. Repeat: Continue steps 2 and 3, incrementing 'k' each time, until no more frequent
itemsets can be found (i.e., Lk is empty).

5. Generate Association Rules: Once all frequent itemsets are found, generate high-
confidence association rules from them. For every frequent itemset M, all non-empty subsets
of M are considered as antecedents, and the remaining items as consequents. Rules must
meet a minimum confidence threshold.
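
A minimal sketch of running Apriori in R with the arules package (assuming it is installed), using the Groceries transaction data bundled with the package; the thresholds are only illustrative:

    library(arules)

    data("Groceries")                                            # example transaction dataset shipped with arules
    rules <- apriori(Groceries,
                     parameter = list(supp = 0.01, conf = 0.5))  # minimum support and confidence thresholds
    rules <- subset(rules, lift > 1.2)                           # keep only rules with a meaningful lift
    inspect(head(sort(rules, by = "lift"), 5))                   # show the 5 strongest rules by lift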

3. Evaluation of Candidate Rules

After the Apriori algorithm (or any frequent itemset mining algorithm) finds potential rules,
we need to evaluate them using the metrics:

 Minimum Support Threshold:

o Purpose: To filter out infrequent itemsets. Items or itemsets that don't occur
often enough are not interesting to analyze.
o How it works: You set a percentage (e.g., 5%). Any itemset appearing in less
than 5% of transactions is discarded immediately.

o Impact: A higher minimum support reduces the number of frequent itemsets


and rules, making the analysis faster and focusing on more common patterns.
Too high, and you might miss interesting niche rules.

 Minimum Confidence Threshold:

o Purpose: To filter out unreliable rules. A rule might have high support but low
confidence if the consequent item is very common even without the
antecedent.

o How it works: You set a percentage (e.g., 70%). A rule X→Y must have at least
70% confidence to be considered valid.

o Impact: A higher minimum confidence ensures stronger implications. Too high,


and you might miss rules that are statistically significant even if they don't hold
every time.

 Minimum Lift Threshold:

o Purpose: To identify truly interesting rules that are not just coincidences due
to high individual popularity of items.

o How it works: You usually set Lift > 1.

o Impact: Ensures that the association between X and Y is genuinely stronger


than what would be expected if X and Y were independent. This is crucial for
actionable insights.

 Other Metrics (Less Common but Useful):

o Conviction: Measures how much X implies Y, while taking into account the
rule's confidence.

o Leverage: Measures the difference between the observed frequency of X and


Y appearing together and what would be expected if X and Y were
independent.

Evaluation Process:

1. Generate all frequent itemsets based on minimum support.

2. For each frequent itemset, generate all possible association rules (X→Y).

3. Calculate the confidence for each candidate rule.

4. Filter out rules that fall below the minimum confidence threshold.
5. (Optional but Recommended) Calculate lift for the remaining rules and filter based on
a minimum lift threshold (typically > 1).

6. The remaining rules are considered "strong" or "interesting" association rules.

4. Applications of Association Rules

Association rule mining has numerous practical applications beyond just market baskets:

 Market Basket Analysis (Retail):

o Application: Identifying products that are frequently purchased together.

o Benefit: Helps in store layout optimization (placing associated items near each
other), cross-selling strategies, product bundling, promotional campaigns, and
inventory management.

o Example: If the rule {Chips} → {Soda} holds, promote the two items together or place them near each other.

 Recommendation Systems (Simple Form):

o Application: Suggesting items to users based on their current purchase or browsing history.

o Benefit: Enhances user experience, increases sales.

o Example: "Customers who bought X also bought Y."

 Catalog Design & Website Layout:

o Application: Arranging products in a physical catalog or the layout of an e-


commerce website based on discovered associations to encourage purchases.

o Benefit: Improves navigation and potential sales.

 Fraud Detection:

o Application: Identifying unusual combinations of activities or transactions that


might indicate fraudulent behavior.

o Benefit: Helps flag suspicious patterns for further investigation.

o Example: A rule like {Large withdrawal, New account, Foreign ATM} → {Fraud}.

 Medical Diagnosis:

o Application: Discovering correlations between symptoms, diseases, patient


characteristics, and treatment outcomes.
o Benefit: Aids in diagnosis, treatment planning, and understanding disease
progression.

o Example: {Fever, Cough, Sore Throat} → {Flu}.

 Web Usage Mining:

o Application: Analyzing user navigation patterns on websites (clickstream data).

o Benefit: Helps in optimizing website design, improving user experience, and


personalizing content.

o Example: {Visited Product Page A, Added to Cart A} → {Visited Checkout Page}.

 Telecommunications:

o Application: Identifying frequently co-occurring call patterns or services used


together.

o Benefit: Helps in designing new service bundles and understanding customer


behavior.

5. Finding Association & Finding Similarity

This topic connects association rules to the broader concept of similarity.

5.1 Finding Association (via Association Rules):

 Focus: Discovering rules or relationships between items based on their co-occurrence


within transactions. It's about "if A happens, then B is likely to happen."

 Output: Explicit rules (X→Y) with associated support, confidence, and lift metrics.

 Mechanism: Algorithms like Apriori are used to find frequent item sets and then
generate rules from them.

 Nature: Primarily focused on concomitant patterns (what occurs together) and


implicative patterns (what implies what).

5.2 Finding Similarity (via Clustering or other Distance Metrics):

 Focus: Identifying objects or users that are similar to each other based on their
attributes or behaviors. It's about grouping "like with like."

 Output: Clusters of similar data points, or a similarity score between any two
points/users.

 Mechanism:
o Clustering Algorithms (e.g., K-means): Group data points based on their
proximity in a multi-dimensional space defined by their features.

o Distance Metrics: Mathematical functions (e.g., Euclidean distance, Cosine


similarity, Jaccard similarity) are used to quantify how close or far apart two
data points are.

o Collaborative Filtering (in Recommendation Systems): Finds users who are


similar in their preferences or items that are similar based on how users rated
them.

 Nature: Primarily focused on grouping and proximity in a feature space.

Relationship/Overlap:

 While distinct, these two concepts often complement each other, especially in
recommendation systems.

 Association rules can be seen as a way to find item-item similarity based on co-occurrence in transactions. For example, if {Item A} → {Item B} is a strong rule, it suggests Item A and Item B are "similar" in the sense that they are often bought together.

 Clustering can find user-user similarity (grouping users with similar tastes) or item-item
similarity (grouping products with similar features/ratings).

 Hybrid approaches in recommendation systems often combine both association rules


(for item-item links) and similarity measures (for user or content profiles).
UNIT 3 (PART 2)
Recommendation System: Collaborative Recommendation- Content Based
Recommendation — Knowledge Based Recommendation- Hybrid Recommendation
Approaches.

3. Recommendation System:

3.1. Collaborative Recommendation

Goal: To recommend items to a user based on the opinions or behaviours of other users, or
by finding items that are similar to what the user has interacted with previously. It leverages
the "wisdom of the crowd."

Core Idea: People who agreed in the past (e.g., liked the same movies) will agree again in
the future. Or, if you like item X, and others who liked item X also liked item Y, then you
might like item Y.

Example: if you and another user both rated the same five movies highly, and that user also liked a sixth movie you have not yet seen, the system recommends that sixth movie to you.


Types of Collaborative Filtering:

1. User-Based Collaborative Filtering (User-User CF):

 How it works:

1. Find users who are "similar" to the active user (the one for
whom we want to make a recommendation). Similarity is
based on their past ratings or interactions (e.g., both liked the
same movies, bought the same products).

2. Identify items that these similar users liked or interacted with,


but the active user has not yet seen.

3. Recommend these items to the active user.

 Pros: Can recommend novel items outside the user's past experience; intuitive (a small R sketch of this idea appears at the end of this section).

 Cons: Scales poorly with a large number of users (computationally


intensive to find neighbors); "cold start" problem for new users (no
history); "sparsity" problem (most users rate only a few items).

2. Item-Based Collaborative Filtering (Item-Item CF):

 How it works:

1. Find items that are "similar" to items the active user has
already liked or interacted with. Similarity between items is
based on how users rated them (e.g., users who liked Movie A
also liked Movie B).

2. Identify the items most similar to the user's liked items.

3. Recommend these similar items.

 Pros: Scales better than user-based for large user bases (item
similarity is often more stable than user similarity); less prone to
sparsity issues for popular items.

 Cons: Can be slow if there are many items; struggles with new items
("cold start" for items).

Challenges:

o Cold Start Problem:

 New Users: If a user has no history, it's hard to find similar users or
items.
 New Items: If an item has no ratings, it won't be recommended by
collaborative filtering.

o Sparsity: Most users only interact with a small fraction of available items,
leading to very sparse rating matrices.

o Scalability: For very large datasets, finding neighbors or calculating all item
similarities can be computationally expensive.

o Shilling Attacks: Malicious users try to manipulate recommendations by


giving fake ratings.
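
To make the user-based idea sketched above concrete, here is a minimal R example on a made-up rating matrix (0 means "not rated"); it computes cosine similarity between users and scores the items the active user has not yet rated:

    # Rows = users, columns = items (made-up ratings; 0 means "not rated")
    ratings <- matrix(c(5, 4, 0, 1,
                        4, 5, 1, 0,
                        1, 0, 5, 4), nrow = 3, byrow = TRUE,
                      dimnames = list(c("u1", "u2", "u3"),
                                      c("itemA", "itemB", "itemC", "itemD")))

    cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

    # Similarity of every user to the active user u1
    sims <- apply(ratings, 1, cosine, b = ratings["u1", ])

    # Score u1's unrated items by a similarity-weighted average of the neighbours' ratings
    neighbours <- setdiff(rownames(ratings), "u1")
    unseen     <- colnames(ratings)[ratings["u1", ] == 0]
    scores     <- sapply(unseen, function(item)
      sum(sims[neighbours] * ratings[neighbours, item]) / sum(sims[neighbours]))
    scores   # higher score = stronger recommendation for u1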

3.2. Content-Based Recommendation

Goal: To recommend items to a user that are similar to items the user has liked or interacted
with in the past, based on the attributes or features of the items themselves.

Core Idea: If a user liked a movie with "sci-fi" and "action" genres, recommend other movies
with "sci-fi" and "action" genres.

How it Works:

1. Item Representation: Items are described by their features/attributes (e.g.,


for a movie: genre, actors, director, keywords; for a product: category, brand,
color). This creates an "item profile."

2. User Profile Creation: A "user profile" is built based on the features of items
the user has liked, bought, or interacted with. This profile represents the
user's preferences.

 It can be a simple aggregate (e.g., average of liked movie genres).

 It can use machine learning to learn preference weights for features.

3. Recommendation Generation:

 Compare the user's profile with the profiles of unseen items.

 Recommend items that have a high similarity score with the user's profile (see the sketch at the end of this section).
Advantages:

o Handles Cold Start for New Users (partially): If a new user rates just one
item, a profile can start to be built.

o Handles Cold Start for New Items (partially): New items can be
recommended as long as their content features are known.

o Transparency: Recommendations are easily explainable ("You liked this movie


because it's a sci-fi action film, and you've liked many sci-fi action films
before").

o Novelty: Can recommend items that are highly niche but perfectly fit a user's
specific preferences.

Disadvantages:

o Limited Serendipity: Tends to recommend items very similar to what the user
already knows, leading to less discovery of diverse interests. It won't suggest
something completely new the user might like.

o Feature Engineering: Requires detailed item features/metadata, which can be


hard to obtain or extract (especially for complex items like books or music).

o Overspecialization: Users might get recommendations that are too narrow if


their profile becomes very specific.
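
A minimal content-based sketch in R with made-up binary genre features: the user profile is the average of the feature vectors of liked items, and unseen items are ranked by cosine similarity to that profile:

    # Item feature matrix: rows = movies, columns = binary genre flags (made-up data)
    features <- matrix(c(1, 1, 0,    # m1: sci-fi, action
                         1, 0, 0,    # m2: sci-fi
                         0, 1, 1,    # m3: action, comedy
                         0, 0, 1),   # m4: comedy
                       nrow = 4, byrow = TRUE,
                       dimnames = list(c("m1", "m2", "m3", "m4"),
                                       c("scifi", "action", "comedy")))

    liked   <- c("m1", "m2")                  # items the user liked
    profile <- colMeans(features[liked, ])    # user profile = average of liked items' features

    cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
    unseen <- setdiff(rownames(features), liked)
    scores <- apply(features[unseen, , drop = FALSE], 1, cosine, b = profile)
    sort(scores, decreasing = TRUE)           # recommend the highest-scoring unseen items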
3.3 Knowledge-Based Recommendation

Goal: To recommend items based on explicit knowledge about items and user preferences,
often involving a dialogue with the user. It's useful when ratings data is sparse or items are
complex.

Core Idea: Recommendations are generated by explicitly reasoning about item properties,
user needs, and domain knowledge, rather than implicit patterns in ratings.

Types:

1. Constraint-Based (or Query-Based) Systems:

 How it works: The user specifies their requirements through a set of


constraints or filters (e.g., "Find a laptop with at least 16GB RAM, price
under $1000, and screen size 13-15 inches").

 Recommendation: The system returns items that perfectly match all


specified constraints.

 Pros: Highly controllable by the user; good for high-involvement


purchases (cars, real estate, electronics) where users know their
needs.

 Cons: Can lead to "no results found" if constraints are too restrictive;
requires users to know exactly what they want.

2. Case-Based Reasoning (CBR) Systems:

 How it works: The user describes their desired item by giving an


example (a "case") of an item they liked or want. The system then
finds similar items (cases) from its knowledge base.

 Recommendation: Provides items similar to the example, often along


with explanations of the similarities/differences.

 Pros: Easier for users who might not know exact specifications but can
provide an example; good for explaining recommendations.

 Cons: Requires a well-structured knowledge base of items and their


attributes.

Advantages:

o Handles Cold Start Well: Can recommend new items or to new users as long
as their attributes are known and a user's needs can be articulated.

o Transparency: Easy to explain why an item was recommended ("This car


meets all your specified criteria for fuel efficiency and budget").

o User Control: High degree of user interaction and control over


recommendations.

Disadvantages:

o Knowledge Acquisition: Building and maintaining the knowledge base can be


costly and time-consuming.

o Requires Explicit Input: Relies on users providing clear and specific


requirements.

o Limited Scope: Recommendations are confined to the defined knowledge


rules; less serendipitous.
3.4 Hybrid Recommendation Approaches

Goal: To combine two or more recommendation techniques (e.g., collaborative filtering,


content-based, knowledge-based) to overcome the limitations of individual approaches and
leverage their strengths.


Why Hybrid?

o Address Cold Start: Content-based or knowledge-based methods can help


with new users/items where collaborative filtering struggles.

o Improve Serendipity: Collaborative filtering can introduce novel items that


content-based might miss.

o Mitigate Sparsity: Combining information can lead to more robust


recommendations.

o Enhance Accuracy: Different models might capture different aspects of user


preferences.

Common Hybridization Techniques:

1. Weighted Hybrid:

 How it works: Combine the scores from different recommendation


models by assigning weights to each.

 Example: Score = (Weight_CF * Score_CF) + (Weight_CB * Score_CB).


 Benefit: Simple to implement, balances different influences (see the short sketch at the end of this section).

2. Mixed Hybrid:

 How it works: Present recommendations from different


recommenders side-by-side or in separate lists.

 Example: "Because you watched this..." (Content-based) and "People


like you also watched..." (Collaborative).

 Benefit: Offers variety to the user, allows the user to choose.

3. Cascade/Switching Hybrid:

 How it works: Use one recommender first, and if it fails (e.g., no


recommendations, cold start), switch to another. Or, use one
recommender to pre-filter items for another.

 Example: If a new user has no ratings, use a content-based system.


Once they have enough ratings, switch to collaborative filtering.

 Benefit: Effectively handles specific problems like cold start.

4. Feature Combination/Feature Augmentation:

 How it works: Integrate features from one technique into another.

 Example: Add content-based features (e.g., genre, actors) into a


collaborative filtering algorithm (e.g., in matrix factorization models).
Or, add user-item interaction features to content-based user profiles.

 Benefit: Enriches the input data for a single, more powerful model.

5. Meta-level Hybrid:

 How it works: One recommender learns from the predictions of


another. The output of one recommender becomes the input for a
second-level recommender.

 Example: Train a machine learning model to combine the outputs of a


collaborative filter and a content-based filter to make the final
prediction.

 Benefit: Can achieve highly sophisticated combinations.

 Overall Advantage of Hybrids: Hybrids generally provide more robust, accurate, and diverse recommendations than standalone approaches, leading to improved user satisfaction and engagement.
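
A toy sketch of the weighted hybrid formula above (hypothetical scores and weights; in practice the component scores would come from real collaborative and content-based models):

    # Hypothetical per-item scores from two recommenders, already scaled to [0, 1]
    score_cf <- c(itemA = 0.80, itemB = 0.30, itemC = 0.60)   # collaborative filtering
    score_cb <- c(itemA = 0.40, itemB = 0.90, itemC = 0.50)   # content-based

    w_cf   <- 0.7
    w_cb   <- 0.3
    hybrid <- w_cf * score_cf + w_cb * score_cb

    sort(hybrid, decreasing = TRUE)   # final ranking used for recommendation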
UNIT 4

Introduction to Streams Concepts— Stream Data Model and Architecture— Stream


Computing ,Sampling Data in a Stream — Filtering Streams —Counting Distinct Elements in a
Stream — Estimating moments— Counting oneness in a Window—Decaying Window—
Realtime Analytics Platform(RTAP) applications — Case Studies — RealTime Sentiment
Analysis, Stock Market Predictions. Using Graph Analytics for Big Data: Graph Analytics.

4.1. Introduction to Streams Concepts

What is a Stream?

o A stream is a continuous, ordered sequence of data elements (or tuples) that


are generated and processed in real-time. Unlike traditional databases where
data is stored and then queried, stream data is in motion; it's constantly
flowing.
o Think of it like a river: the water is always flowing; you can't stop it to analyze
the whole river at once. You have to analyze it as it passes by.

Key Characteristics of Data Streams:

o Continuous Flow: Data arrives constantly, without end.


o High Volume: Often generated at very high rates (e.g., millions of events per
second).
o Unbounded: The total size of the data stream is unknown and potentially
infinite. You can't store it all.
o Real-time or Near Real-time Processing: Data needs to be processed as it
arrives or very soon after, to derive timely insights.
o Changing Distributions (Concept Drift): The distribution of data values might change over time, so analyses must adapt.
o Volatility: Once a data element is processed, it might be discarded or
archived, as it's often not feasible to store everything.

Why Streams are Important (Traditional vs. Stream):

o Traditional Batch Processing: Collects all data, stores it, then processes it in
large batches. Good for historical analysis.
 Analogy: Filling a bathtub, then analyzing the water.
o Stream Processing: Processes data as it arrives. Good for immediate insights
and reacting to live events.
 Analogy: Analyzing water directly from the tap as it flows.
 Use Cases: Fraud detection, real-time dashboards, IoT monitoring, stock market
analysis, social media sentiment analysis.
4.2. Stream Data Model and Architecture

4.2.1 Stream Data Model

 Data Elements (Tuples/Records): Each piece of data in a stream is typically a small,


self-contained record or "tuple."
o Example: For stock market data: (timestamp, stock_symbol, price, volume).
o Example: For IoT sensor data: (device_ID, sensor_type, temperature,
humidity, timestamp).
 Timestamps: A crucial component of stream data. Each record usually has a
timestamp (either when it was generated or when it was ingested) to allow for time-
based analysis (e.g., processing data from the last 5 minutes).
 Key-Value Pairs: Often, stream data can be seen as key-value pairs, where the key
identifies the data source or type, and the value is the actual data.
 Schema (Flexible): While individual tuples might have a schema, the overall stream
system needs to be flexible enough to handle evolving schemas or different types of
events in the same stream.

4.2.2 Stream Architecture

Stream processing architectures are designed to handle the continuous flow of data. A
common pattern involves:

1. Data Sources (Producers):


o Where the stream data originates. These are often distributed and diverse.
o Examples: IoT devices, web servers (logs), social media APIs, financial market
data feeds, mobile apps.
2. Stream Ingestion Layer (Message Queue/Broker):
o Acts as a highly scalable, fault-tolerant buffer that ingests data from sources
and makes it available to processors.
o Decouples data producers from consumers.
o Ensures data is not lost if processing engines are temporarily down or slow.
o Key Technologies: Apache Kafka, RabbitMQ, Amazon Kinesis.
3. Stream Processing Engine:
o The core component that consumes data from the ingestion layer, performs
real-time computations, transformations, and analytics.
o Can perform simple operations (filtering, aggregation) or complex ones
(pattern matching, machine learning on streams).
o Key Technologies: Apache Flink, Apache Spark Streaming, Apache Storm.
4. Data Sinks (Consumers/Output Destinations):
o Where the processed results are sent.
o Examples: Real-time dashboards, alerts systems, data warehouses (for
historical analysis), other applications, machine learning models.

4.3. Stream Computing, Sampling Data in a Stream

4.3.1 Stream Computing (Stream Processing)

Definition: The practice of performing computations and analytics on data as it arrives in a


continuous flow, rather than storing it first and then querying it in batches.

Goals:

o Real-time Insights: Provide immediate answers to questions as events unfold.


o Fast Reaction: Enable systems to react instantly to anomalies or important
events (e.g., fraud alert, system failure).
o Operational Intelligence: Continuously monitor business operations for
efficiency and performance.
Key Challenges:

o Unbounded Data: Cannot store the entire stream, so algorithms must


operate on limited memory or use approximations.
o Single-Pass Algorithms: Algorithms usually need to process data elements
only once as they arrive.
o Time-Varying Data: Data distributions can change over time (concept drift),
requiring adaptive algorithms.
o Fault Tolerance: Ensuring that processing continues even if parts of the system fail.
 Common Operations in Stream Computing: Filtering, aggregation (counting,
summing), joining streams, pattern matching, windowing (processing data over a
time-bound or count-bound segment of the stream).

4.3.2 Sampling Data in a Stream

 Problem: Streams can be too massive to process every single element, especially if
resources are limited or exact counts aren't needed.
 Solution: Sampling: Selecting a subset of data from the stream to perform analysis,
saving computational resources and storage.
 Challenges of Stream Sampling: You don't know the total size of the stream in
advance, and the distribution of data might change.

Types of Stream Sampling:

o Random Sampling (Reservoir Sampling):


 Concept: Maintains a fixed-size sample of 'k' elements. For each new
element in the stream, it's included in the sample with a decreasing
probability, replacing a randomly chosen element from the current
sample.
 Benefit: Ensures that each element seen so far has an equal probability of being in the final sample, making it representative (see the sketch after this list).
o Window-Based Sampling:
 Concept: Take a sample from the most recent 'N' elements (a
window).
 Benefit: Focuses on recent data, useful for analyzing trends.
o Systematic Sampling:
 Concept: Select every Nth element.
 Benefit: Simple to implement.
 Caution: Can be biased if the stream has periodic patterns that align
with N.
o Count-Based Sampling:
 Concept: Sample the first 'N' elements, or elements until a certain
count is reached.
 Caution: May not be representative of the entire stream if patterns
change later.
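
A minimal R sketch of reservoir sampling; the "stream" here is just a vector standing in for elements arriving one at a time:

    reservoir_sample <- function(stream, k) {
      reservoir <- stream[seq_len(min(k, length(stream)))]     # keep the first k elements
      if (length(stream) > k) {
        for (i in (k + 1):length(stream)) {
          j <- sample.int(i, 1)                  # pick a position from 1..i uniformly at random
          if (j <= k) reservoir[j] <- stream[i]  # replace an old element with probability k / i
        }
      }
      reservoir
    }

    set.seed(7)
    reservoir_sample(1:1000, k = 10)   # a 10-element sample; every element had probability 10/1000 of being kept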

4.4. Filtering Streams

Definition: The process of selecting specific data elements from a stream that meet certain
criteria, while discarding the rest. It's often the first step in stream processing to reduce the
volume of data for subsequent analysis.

Purpose:

o Reduce Volume: Minimize the amount of data that needs to be processed


downstream, saving resources.
o Focus Analysis: Isolate relevant data points for a particular analysis or
application.
o Security/Privacy: Remove sensitive or irrelevant information before it reaches
further processing stages.

How it Works:

o A predicate (condition) is applied to each incoming data element.


o If the element satisfies the predicate, it is passed through; otherwise, it is
dropped.

Examples:

o Filtering Stock Ticker Data: Only keep records for a specific stock symbol (e.g.,
symbol == 'GOOG') or only trades above a certain volume (e.g., volume >
1000).
o Filtering Social Media Feeds: Only keep tweets containing specific keywords
(e.g., text.contains('AI') or text.contains('Machine Learning')).
o Filtering Sensor Data: Only keep temperature readings above a critical
threshold (e.g., temperature > 100).
o Filtering Log Data: Only keep error messages (e.g., log_level == 'ERROR').
 Implementation: Stream processing frameworks (Flink, Spark Streaming) provide
operators or APIs specifically for filtering streams based on user-defined functions or
lambda expressions.
4.5. Counting Distinct Elements in a Stream

 Problem: How to count the number of unique items (distinct elements) in a massive,
continuous data stream, given that you cannot store all elements due to memory
constraints.
 Challenges: The stream is unbounded, so a simple hash set won't work for very large
streams.
 Approximate Counting Algorithms: Since exact counting is often impossible or too
expensive, approximate algorithms are used. They provide a count within a certain
error bound.
 Key Algorithms:
o Flajolet-Martin Algorithm (FM Algorithm):
 Concept: Uses hashing. Each distinct element is hashed to a bit string.
The algorithm observes the position of the rightmost zero in these
hashed bit strings. A higher maximum position of the rightmost zero
suggests more distinct elements.
 Benefit: Very memory-efficient.
 Output: Provides an estimate of the distinct count (a toy sketch appears at the end of this section).
o HyperLogLog (HLL):
 Concept: An advanced version of Flajolet-Martin. It divides the hash
space into "registers" and uses the maximum position of the rightmost
zero within each register. It then combines these estimates using a
harmonic mean.
 Benefit: Extremely memory-efficient, providing good accuracy for very
large distinct counts (e.g., counting billions of unique users with just a
few kilobytes of memory).
 Output: An estimate of the distinct count with a configurable error
margin.
o Count-Min Sketch (CMS):
 Concept: A probabilistic data structure that estimates frequencies of
elements in a stream. It uses multiple hash functions and a 2D array of
counters. While primarily for frequency, it can be adapted to estimate
distinct counts.
 Benefit: Provides estimates for both frequencies and distinct counts.
 Use Cases:
o Counting unique visitors to a website in real-time.
o Counting unique IP addresses accessing a server.
o Counting unique products viewed by customers.
o Estimating the number of unique active users in an application.
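
A toy, single-hash sketch of the Flajolet-Martin idea in R (a real implementation averages many hash functions and applies a correction factor; the hash below is a simple stand-in, not a production hash):

    # Count trailing zero bits in a positive integer
    trailing_zeros <- function(x) {
      n <- 0
      while (x > 0 && x %% 2 == 0) { x <- x %/% 2; n <- n + 1 }
      n
    }

    # Simple stand-in hash mapping an item to an integer in [0, 2^31)
    hash_item <- function(item) {
      (2654435761 * sum(utf8ToInt(as.character(item)))) %% 2^31
    }

    fm_estimate <- function(stream) {
      max_tz <- 0
      for (x in stream) {
        h <- hash_item(x)
        if (h > 0) max_tz <- max(max_tz, trailing_zeros(h))
      }
      2^max_tz   # rough estimate of the number of distinct elements
    }

    fm_estimate(c("a", "b", "a", "c", "b", "d"))   # order-of-magnitude estimate of 4 distinct items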
4.6. Estimating Moments

What are Moments? In statistics, moments describe the shape and characteristics of a
probability distribution.

o 0th Moment: Total count (sum of probabilities, usually 1).


o 1st Moment (Mean): Average value of the data.
o 2nd Moment (Variance): Measure of data spread (squared deviation from the mean).
o 3rd Moment (Skewness): Measure of asymmetry of the distribution.
o 4th Moment (Kurtosis): Measure of the "tailedness" of the distribution.

Estimating Moments in Streams:

o Challenge: Calculating exact moments for a continuous, unbounded stream is


difficult because it requires summing/averaging over potentially infinite data.
o Solution: Use algorithms that provide approximate estimates of moments
using limited memory and a single pass over the data.
 Alon-Matias-Szegedy (AMS) Algorithm (for 2nd Moment):
o Concept: A famous algorithm that estimates the second moment (which is
related to the sum of squares of frequencies, and thus helps measure data
variability or "hot spots") in a stream using a randomized approach. It creates
"random variables" (sketches) and uses their properties to estimate the sum
of squares.
o Benefit: Provides a bounded error estimate for the second moment using
much less memory than required for an exact calculation.
 Use Cases:
o Detecting Skew: Identifying if the distribution of data values is becoming
highly skewed (e.g., a few items dominating transactions).
o Identifying Hot Items: A high second moment suggests that a few items have
very high frequencies compared to others.
o Network Traffic Analysis: Detecting Denial of Service (DoS) attacks (which
would increase the second moment of source IP addresses).
o Quality Control: Monitoring the variance of sensor readings in real-time.

4.7. Counting Ones in a Window

Problem: To count the number of '1's (or any specific event) within a defined segment
(window) of a binary or event stream.

Why "Window"? Because streams are unbounded, you can't count events over the entire
infinite stream. You're usually interested in recent activity. A window defines this "recent"
segment.
4.7.1 Types of Windows:
o Tumbling (Fixed) Windows:
 Concept: Non-overlapping, fixed-size time intervals. Once a window
closes, the count is emitted, and a new window starts.
 Example: Count events in every 5-minute interval: (0-5 min), (5-10
min), (10-15 min).
o Sliding Windows:
 Concept: Fixed-size time intervals that overlap. They "slide" forward
by a smaller step size.
 Example: Count events in a 5-minute window, sliding every 1 minute:
(0-5 min), (1-6 min), (2-7 min).
o Session Windows:
 Concept: Based on user activity. A window starts with an event and
closes after a period of inactivity.
 Example: A user's browsing session ends after 30 minutes of no clicks.
o Count-Based Windows:
 Concept: Defined by a fixed number of events, not time.
 Example: Count the last 100 events, or events per batch of 50.
 Counting Mechanism:
o Simple Counters: For basic tumbling windows, just maintain a counter that
increments for '1's and resets at the end of the window.
o Data Structures for Sliding Windows (e.g., using a queue or specialized
algorithms): To avoid re-scanning the entire window for each slide, efficient
data structures are used to add new elements and remove old ones as the
window slides.
 Use Cases:
o Real-time aggregation: Number of website clicks in the last minute.
o Alerting: Number of failed logins in the last 5 minutes exceeds a threshold.
o Trend analysis: How many users are active in the last 10-minute period.

4.8. Decaying Window

 Problem: In many real-time scenarios, older data points in a window are less relevant
than newer ones. A simple count gives equal weight to all elements within the
window.
 Concept: A "decaying window" (or exponential decay window) gives more weight to
recent data points and less weight to older ones. As data ages within the window, its
influence on the aggregated result decays.
 How it Works (Exponential Decay):
o Instead of just counting '1's, each '1' (or event) is assigned a weight that
decreases exponentially with time since its arrival.
o A decay factor (α, between 0 and 1) is used.
o The sum of these decaying weights gives the "count" for the window.
o Formula: if a new event arrives Δt time units after the previous one, the running total is updated as C_new = C_old ⋅ α^Δt + 1, where α (the decay factor, between 0 and 1) discounts the old total for the elapsed time. If decay is applied once per arriving event rather than per time unit, this reduces to C_new = C_old ⋅ α + 1.
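
A minimal R sketch of such a decaying counter (timestamps and the decay factor are made-up):

    decayed_count <- function(timestamps, alpha = 0.9) {
      count  <- 0
      last_t <- timestamps[1]
      for (t in timestamps) {
        count  <- count * alpha^(t - last_t) + 1   # decay for the elapsed time, then add this event
        last_t <- t
      }
      count
    }

    # Events at these (made-up) times; recent activity dominates the value
    decayed_count(c(0, 1, 2, 10, 10.5, 11), alpha = 0.8)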

Benefits:

o Reflects Recency: Provides a more accurate view of current activity or trends.


o Adaptive to Change: Responds quickly to shifts in data patterns, as older, less
relevant data fades out.
o Memory Efficiency: Can be implemented without storing all data points in the
window explicitly.

Use Cases:

o Trending Topics: Identifying what's "hot" right now on social media (older
mentions contribute less).
o Anomaly Detection: Detecting sudden spikes in recent activity while ignoring
older, benign fluctuations.
o Real-time Metrics: Calculating a "live" average or count where the most
recent data has the most impact.
o Fraud Scores: A transaction's contribution to a fraud score might decay over
time.

4.9. Realtime Analytics Platform (RTAP) Applications

Definition: A Real-time Analytics Platform (RTAP) is an end-to-end system designed to ingest,


process, and analyze streaming data continuously, providing immediate insights and
enabling real-time decision-making.

Components (as described in "Stream Data Model and Architecture"):

o Data Ingestion Layer: High-throughput, fault-tolerant message brokers (e.g.,


Kafka, Kinesis).
o Stream Processing Layer: Distributed processing engines (e.g., Flink, Spark
Streaming) for continuous computations.
o Real-time Storage/Databases: Low-latency databases optimized for quick
reads and writes (e.g., Apache Cassandra, Redis, specialized time-series
databases).
o Real-time Dashboards/Visualization: Tools for immediate display of metrics
and alerts (e.g., Grafana, custom web apps).
o Action/Alerting Layer: Systems to trigger automated responses or
notifications based on insights (e.g., send SMS, update another system).
Characteristics of RTAP Applications:

o Low Latency: Insights are available within milliseconds or seconds of data


generation.
o High Throughput: Can handle massive volumes of incoming data.
o Scalability: Can scale horizontally to meet growing data demands.
o Fault Tolerance: Designed to be resilient to failures.
o Continuous Operation: Always on and processing data.

Key RTAP Applications :

4.9.1 Real-Time Sentiment Analysis

Concept: Analyzing text data (e.g., social media posts, customer reviews, news articles) as it
streams in, to determine the emotional tone (positive, negative, neutral) towards a specific
topic, product, or brand.

How it Works:

1. Ingestion: Stream social media feeds (Twitter, Facebook, review sites)


into an RTAP.
2. Processing: Use Natural Language Processing (NLP) techniques (text
tokenization, sentiment dictionaries, machine learning models) in the
stream processor to classify the sentiment of each incoming text.
3. Aggregation: Aggregate sentiment scores over windows (e.g., positive
vs. negative mentions in the last 5 minutes).
4. Output: Display real-time sentiment dashboards, trigger alerts for
sudden negative shifts, or route customer complaints based on
sentiment.

Benefit: Enables rapid response to brand crises, real-time product feedback, and
understanding public opinion as it evolves.

4.9.2 Stock Market Predictions

Concept: Analyzing vast streams of financial data (stock prices, trading volumes, news
headlines, social media chatter) in real-time to identify patterns, predict price movements,
and execute trades or generate alerts.

How it Works:

1.Ingestion: Stream real-time stock quotes, order book data, financial news feeds, and
relevant social media data.

2.Processing:

 Calculate real-time indicators (e.g., moving averages,


volatility).
 Run machine learning models trained on historical data to
predict short-term price movements.
 Perform sentiment analysis on news and social media related
to specific stocks.

3.Complex Event Processing (CEP): Detect patterns across multiple streams (e.g., a sudden
price drop correlated with negative news and high sell volume).

4.Output: Display real-time trading dashboards, generate buy/sell signals for automated
trading systems, issue alerts to traders.

Benefit: Enables high-frequency trading, real-time risk management, and quicker reactions
to market events.

4.10. Using Graph Analytics for Big Data: Graph Analytics

What is a Graph?

o In the context of data, a graph is a data structure consisting of nodes (or


vertices) and edges (or links) that connect them.
o Nodes: Represent entities (e.g., people, organizations, products, cities).
o Edges: Represent relationships or interactions between entities (e.g., "friend
of," "works for," "buys," "travels to"). Edges can be directed (e.g., A follows B)
or undirected (e.g., A is friends with B). They can also have weights (e.g.,
strength of friendship, number of transactions).

What is Graph Analytics?

o The process of analyzing relationships and connections between entities in a


graph structure. It focuses on how entities are linked and the properties of
these connections, rather than just the individual attributes of the entities.

Why use Graph Analytics for Big Data?

o Relationship-Centric Insights: Many Big Data problems are inherently about


relationships (e.g., social networks, supply chains, financial transactions,
disease spread). Traditional relational databases struggle to efficiently query
complex, multi-hop relationships.
o Pattern Discovery: Helps uncover hidden connections, communities, central
figures, and influential paths that are difficult to find with other analytical
methods.
o Performance: Graph databases and processing engines are optimized for
traversing relationships, leading to much faster queries for connected data
compared to SQL joins on large relational tables.
Key Graph Analytics Concepts and Algorithms:

Centrality Measures: Identify the most important or influential nodes in a network.

 Degree Centrality: Number of direct connections a node has.


 Betweenness Centrality: Measures how often a node lies on the
shortest path between other nodes.
 PageRank (Influence): Measures the importance of a node based on
the importance of nodes linking to it (originally for web page ranking).

Pathfinding Algorithms: Finding shortest paths or all paths between nodes.

 Dijkstra's Algorithm: Finds the shortest path between two nodes.

Community Detection Algorithms: Grouping nodes that are more densely connected to
each other than to the rest of the network.

 Louvain Method, Girvan-Newman: Identify clusters or communities


within a graph.

Pattern Matching: Searching for specific substructures or patterns within a graph.

 Graph Databases (NoSQL): Specialized databases optimized for storing and querying
graph data (e.g., Neo4j, Amazon Neptune, ArangoDB).
 Graph Processing Frameworks: For large-scale offline graph computations (e.g.,
Apache Giraph, GraphX on Apache Spark).

Use Cases for Big Data:

o Social Network Analysis: Identifying influencers, community detection, friend


recommendations.
o Fraud Detection: Uncovering complex fraud rings by analyzing relationships
between accounts, transactions, and individuals.
o Recommendation Systems: "People who bought X also bought Y" (item-item
relationships), "Friends of friends" recommendations.
o Cybersecurity: Mapping network dependencies, identifying attack paths,
anomaly detection.
o Knowledge Graphs: Representing complex real-world entities and their
relationships for AI and semantic search.
o Supply Chain Optimization: Analyzing dependencies and identifying single
points of failure.
o Drug Discovery: Modeling interactions between proteins, genes, and drugs.
UNIT V

NoSQL Databases : Schema-less Models: Increasing Flexibility for Data Manipulation-


Key Value Stores-Document Stores — Tabular Stores — Object Data Stores—Graph
Databases Hive—Sharding—Hbase — Analyzing big data with twitter — Big data for
ECommerce Big data for blogs — Review of Basic Data Analytic Methods using R.

5.1. NoSQL Databases: Schema-less Models

 What is NoSQL?

o NoSQL stands for "Not Only SQL." It's a broad category of databases designed
to handle vast volumes of rapidly changing, diverse data (the "3 Vs" of Big Data)
that traditional Relational Database Management Systems (RDBMS) struggle
with.

o Unlike traditional SQL databases, NoSQL databases often do not enforce a fixed
schema.

 Why NoSQL? (Limitations of RDBMS for Big Data):

o Scalability: RDBMS primarily scale "up" (more powerful single server), which
becomes very expensive and eventually hits limits. NoSQL scales "out"
(distributes data across many commodity servers).

o Schema Rigidity: RDBMS require a predefined schema, which makes it difficult


to store rapidly changing or diverse data types (like social media posts, sensor
data).

o Performance for Non-Relational Data: RDBMS are optimized for structured,


tabular data. For highly interconnected, hierarchical, or key-value data, they
can be slow.

5.2. Schema-less Models: Increasing Flexibility for Data Manipulation

Core Concept:

In the context of databases, "schema-less" (or more accurately, "schema-flexible" or "schema-


on-read") means that you do not need to define the structure of your data (like columns and
their data types) before you store it.

 Unlike traditional relational databases (like MySQL, PostgreSQL,


Oracle), where every row in a table must conform to a predefined set
of columns and data types (this is called "schema-on-write"), schema-
less databases allow each record (document, key-value pair, row in a
column family) to have its own unique structure.
How it Works (The "Schema-on-Read" Philosophy):

 With schema-less databases, you write data into the database in


whatever format is convenient (e.g., a JSON document).

 The schema is then applied when you read the data or when your
application code interprets the data. The database itself doesn't
enforce it.

 This shifts the responsibility of data structure from the database to the
application or the query.

Key Advantages for Data Manipulation:

Agile Development and Iteration:

 Problem with RDBMS: In a relational database, if you need to add a


new field to your data (e.g., a new user preference, a new product
attribute), you typically need to perform a ALTER TABLE command. For
large tables, this can be a very slow operation, potentially requiring
downtime, and can be complex to manage in production.

 Schema-less Solution: You simply start adding the new field to new
records. Existing records that don't have that field are unaffected. This
allows developers to iterate quickly, test new features, and deploy
changes without complex database migrations.

Handling Diverse and Evolving Data:

 Problem with RDBMS: When dealing with data sources that have
inconsistent structures (e.g., IoT devices from different manufacturers
sending different metrics, user profiles where users provide varying
levels of detail), forcing them into a rigid relational schema is very
difficult and can lead to many NULL values.

 Schema-less Solution: Each incoming data point can have its own
unique set of attributes. For example, one IoT sensor might send
(temperature, humidity), while another sends (temperature, pressure,
battery_level). Both can be stored directly without issue. Similarly, user
A's profile might have (name, age, email), while user B's also includes
(name, email, address, phone_number).

Simplified Data Ingestion for Unstructured/Semi-structured Data:

 Problem with RDBMS: Unstructured data (like raw text, images, video)
or semi-structured data (like log files, XML, JSON without a fixed
schema) must be parsed, transformed, and fitted into a tabular
structure before storage.

 Schema-less Solution: These data types can often be stored directly as


values (in key-value stores), documents (in document stores), or large
columns (in column-family stores) without prior transformation. This
makes building "data lakes" much easier.

Support for Hierarchical and Nested Data:

 Problem with RDBMS: Representing hierarchical or nested data (like a


complex JSON object with arrays and nested objects) in a relational
model requires multiple tables and complex joins.

 Schema-less Solution (especially Document Stores): Documents can


natively store nested structures and arrays within a single document,
making it easier to represent and query complex real-world entities.
This eliminates the need for many joins.

Scaling Out (Horizontal Scalability):

 While not directly a flexibility for data manipulation, the schema-less nature often complements the distributed architecture of NoSQL databases. Without rigid schema enforcement, it's easier to distribute data across many servers (sharding) because each server doesn't need to strictly conform to a global schema definition. This provides high availability and fault tolerance.

Trade-offs (What you gain in flexibility, you might lose elsewhere):

 Less Data Consistency/Integrity Enforcement: Without a schema, the database can't automatically ensure that all records have certain fields or that data types are consistent. This responsibility falls to the application developer.

 More Complex Querying/Application Logic: While flexible, querying can sometimes be more complex if you don't know the exact structure of a document or if different documents have different fields. The application code needs to be more robust in handling variations.

 Increased Storage Space (potentially): If fields are not present, they don't take space. But if many records have slightly different structures, the metadata overhead per record might increase.

5.3. Key-Value Stores

 Concept: The simplest form of NoSQL database. Data is stored as a collection of key-
value pairs, similar to a dictionary or a hash map.

o Key: A unique identifier (e.g., user_id, product_SKU).

o Value: Can be anything – a string, an integer, a JSON object, a binary blob (image, video), etc. The database doesn't interpret the value; it just stores it.

 Operations: Primarily supports basic CRUD (Create, Read, Update, Delete) operations
based on the key.

o PUT(key, value): Stores a value with a given key.

o GET(key): Retrieves the value associated with a key.

o DELETE(key): Removes a key-value pair.

 Advantages:

o Extremely Fast: Optimized for rapid read/write operations when accessing data by its key.

o Highly Scalable: Easy to distribute data across many servers (sharding) because
data is largely independent.

o Simple Data Model: Easy to understand and implement.

 Disadvantages:

o Limited Querying: Cannot perform complex queries based on value content (e.g., "find all users whose age is > 30"). You must know the key to retrieve data.

o No Relationships: Doesn't inherently support relationships between different data items.

 Use Cases:

o Session Management: Storing user session data for web applications.

o Caching: Storing frequently accessed data for quick retrieval.

o User Profiles: Simple user profiles where the user_id is the key.

o Configuration Storage: Storing application configurations.

 Examples: Redis, Amazon DynamoDB, Riak, Memcached.
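
A minimal sketch of the PUT/GET/DELETE pattern described above, using the Python redis client against an assumed local Redis server (key names and values are illustrative):

import redis

# Connect to a (hypothetical) local Redis instance.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# PUT(key, value): store a session record under a unique key.
r.set("session:user_42", '{"cart_items": 3, "last_page": "/checkout"}')

# GET(key): retrieve the value by its key (the only supported lookup).
print(r.get("session:user_42"))

# DELETE(key): remove the pair, e.g. when the session expires.
r.delete("session:user_42")

Note that there is no way here to ask "find all sessions with more than 2 cart items" – the value is opaque to the store, so lookups always go through the key.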


5.4. Document Stores

 Concept: Stores data in flexible, semi-structured "documents," typically in formats like JSON (JavaScript Object Notation), BSON (Binary JSON), or XML. Each document is self-contained and can have a different structure.

o Think of a document as a flexible record, similar to a row in a relational database, but without a rigid schema.

 Collections: Documents are grouped into "collections," which are similar to tables in
RDBMS, but without enforcing a schema for their documents.

 Operations: Supports CRUD operations on documents, and often allows querying


based on fields within the documents.

 Advantages:

o Flexible Schema: Documents can have varying structures, making it ideal for
evolving data or diverse data.

o Rich Querying: Can query based on fields within the document, and sometimes
supports more complex queries like aggregation or full-text search.

o Scalable: Easy to scale horizontally across multiple servers.

o Developer Friendly: JSON is a common format for web developers.

 Disadvantages:

o Limited Transactions: Complex, multi-document transactions can be challenging or not fully supported.

o Joins are Hard: No native support for joins across documents/collections. Relationships are often handled by embedding related data or by performing multiple queries at the application layer.

 Use Cases:

o Content Management Systems: Storing articles, blog posts, product catalogs.

o User Profiles: More complex user profiles with varying attributes.

o E-commerce: Product data, order information.

o Mobile Applications: Backend for mobile apps.

 Examples: MongoDB, Couchbase, Apache CouchDB.
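
A small sketch with the PyMongo client (the database, collection, and field names are assumptions for illustration), showing that documents in one collection can carry different fields and that queries can filter on fields inside the documents:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical local server
users = client["shop"]["users"]                    # a schema-less collection

# Two documents with different structures live side by side.
users.insert_one({"name": "Asha", "age": 29, "email": "asha@example.com"})
users.insert_one({"name": "Ben", "email": "ben@example.com",
                  "address": {"city": "Chennai", "pin": "600001"}})

# Query on a field inside the documents, not just on a key.
for doc in users.find({"age": {"$gt": 25}}):
    print(doc["name"])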

5.5. Tabular Stores (Column-Family Stores / Wide-Column Stores)


 Concept: Designed for very large datasets where data is organized into "column
families." While they conceptually look like tables (rows and columns), they are highly
flexible and sparse.

o Rows: Identified by a unique row key.

o Column Families: Groups of related columns. You define column families beforehand, but within a column family, you can have a variable number of columns for each row.

o Sparse: If a row doesn't have data for a specific column, it simply doesn't store
that column, saving space.

 Flexibility: While there's a concept of columns and rows, new columns can be added
to any row at any time without altering a rigid schema.

 Operations: Optimized for rapid read/write access to specific columns or column families within a row. Excellent for time-series data.

 Advantages:

o Massive Scalability: Designed to handle petabytes of data distributed across thousands of servers.

o High Write Throughput: Excellent for applications with high write volumes
(e.g., sensor data, log data).

o Flexible Schema: Can add new columns dynamically.

o Efficient for Columnar Access: If you often access specific sets of columns for
many rows, it's very efficient.

 Disadvantages:

o Complex Data Modeling: Can be harder to design the optimal column families
and row keys compared to relational databases.

o No Joins: No native support for joins; relationships are handled by application logic.

o No Transactions (ACID): Typically sacrifices full ACID compliance for scalability.

 Use Cases:

o Time-Series Data: Storing sensor data, stock market data, log data.

o Big Analytics: Powering big data analytics platforms.

o Messaging Systems: Storing message queues.

o Large-Scale Event Logging: Storing massive volumes of events.


 Examples: Apache Cassandra, HBase, Google Bigtable.
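
A sketch using the DataStax Python driver for Apache Cassandra; the keyspace, table, and column names are assumptions, and the table would have been created with CQL beforehand:

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])     # hypothetical single-node cluster
session = cluster.connect("iot")     # assumed keyspace

# High write throughput path: one row per (sensor_id, reading timestamp).
session.execute(
    "INSERT INTO readings (sensor_id, ts, temperature, humidity) "
    "VALUES (%s, toTimestamp(now()), %s, %s)",
    ("sensor-7", 24.3, 61.0),
)

# Read only the columns you need for one row key (the partition key).
rows = session.execute(
    "SELECT ts, temperature FROM readings WHERE sensor_id = %s", ("sensor-7",)
)
for row in rows:
    print(row.ts, row.temperature)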

5.6. Object Data Stores

 Concept: A type of storage system where data is managed as objects, not as files or
blocks. Each object is a self-contained unit comprising the data itself, associated
metadata (e.g., content type, creation date, custom tags), and a globally unique
identifier.

o Think of it like storing files on the internet, accessible via a unique URL.

 Flat Structure: Object stores typically have a flat structure, not hierarchical folders.
Objects are stored in "buckets," and then retrieved using their unique identifier.

 Focus on Metadata: Metadata is crucial for managing and retrieving objects, allowing
for flexible querying beyond just the identifier.

 Operations: RESTful APIs (HTTP GET, PUT, DELETE) are commonly used to interact with
object stores.

 Advantages:

o Massive Scalability: Designed for virtually limitless storage capacity.

o High Durability & Availability: Data is typically replicated across multiple devices and locations.

o Cost-Effective: Often cheaper per gigabyte for large-scale storage compared to block or file storage.

o Flexible Access: Data is accessible over HTTP from anywhere.

 Disadvantages:

o No Incremental Updates: If you want to change a small part of an object, you usually have to upload the entire new object. Not suitable for frequently updated, small data.

o Not for Transactional Data: No support for complex transactions or real-time database-like operations.

o Latency for Small Reads: While good for large objects, retrieving many small
objects can incur overhead.

 Use Cases:

o Data Lakes: Storing raw, unstructured, and semi-structured data for Big Data
analytics.
o Archiving and Backup: Long-term storage of large datasets.

o Media Storage: Storing images, videos, audio for websites and applications.

o Cloud Storage: Powering cloud storage services.

 Examples: Amazon S3, Google Cloud Storage, Azure Blob Storage.
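
A sketch of the object-store access pattern using boto3 with Amazon S3 (the bucket name, object key, and metadata are assumptions; credentials are taken from the environment):

import boto3

s3 = boto3.client("s3")
bucket, key = "my-data-lake", "raw/tweets/2024-01-01.json"  # assumed names

# PUT: upload the whole object, with custom metadata attached to it.
s3.put_object(Bucket=bucket, Key=key,
              Body=b'{"text": "hello"}',
              Metadata={"source": "twitter", "ingest-day": "2024-01-01"})

# GET: fetch the object over HTTP and read its bytes.
obj = s3.get_object(Bucket=bucket, Key=key)
print(obj["Body"].read())

# There is no partial update: changing the object means re-uploading it.
s3.delete_object(Bucket=bucket, Key=key)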

5.7. Graph Databases

 Concept: A NoSQL database that uses graph structures (nodes, edges, and properties)
to store and query highly interconnected data. It explicitly models relationships
between entities.

o Nodes (Vertices): Represent entities (e.g., a Person, a Product, a Location).

o Edges (Relationships): Represent the connections between nodes (e.g., FRIENDS_WITH, BOUGHT, LIVES_IN, FOLLOWS). Edges can be directed and have properties.

o Properties: Key-value pairs associated with both nodes and edges (e.g., a 'Person' node might have properties like name, age; a BOUGHT edge might have a date property).

 Strength: Excels at handling complex relationships and performing "traversal" queries (finding paths, connections, and neighborhoods).

 Operations: Uses specialized query languages optimized for graph traversals (e.g.,
Cypher for Neo4j, Gremlin for Apache TinkerPop).

 Advantages:

o Efficient Relationship Traversal: Orders of magnitude faster for multi-hop relationship queries compared to SQL joins.

o Intuitive Data Modeling: Relationships are a first-class citizen, making complex data models easier to understand and represent.

o Flexible Schema: Can evolve the graph schema easily by adding new node
types, relationship types, or properties.

 Disadvantages:

o Not for Simple Data: Overkill for simple data structures without complex
relationships.

o Scalability for Massively Large Graphs: While good for relationships, scaling
for extremely massive graphs (trillions of nodes/edges) can still be challenging.
o Specialized Skills: Requires learning new query languages and concepts.

 Use Cases:

o Social Networks: Friend recommendations, community detection, influence analysis.

o Fraud Detection: Uncovering complex fraud rings by linking suspicious activities, accounts, and individuals.

o Recommendation Engines: Beyond simple collaborative filtering, finding complex user-item-user relationships.

o Knowledge Graphs: Building semantic networks for AI, search, and intelligent
assistants.

o Network/IT Operations: Mapping network dependencies, impact analysis.

o Supply Chain Management: Visualizing and optimizing complex supply chains.

 Examples: Neo4j, Amazon Neptune, ArangoDB.
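
A sketch using the official Neo4j Python driver with Cypher; the connection details, labels, and relationship type are assumptions for illustration:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as session:
    # Create two Person nodes and a FRIENDS_WITH edge between them.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FRIENDS_WITH]->(b)",
        a="Alice", b="Bob",
    )

    # Traverse up to two hops: friends and friends-of-friends of Alice.
    result = session.run(
        "MATCH (p:Person {name: $name})-[:FRIENDS_WITH*1..2]-(f:Person) "
        "RETURN DISTINCT f.name AS friend",
        name="Alice",
    )
    for record in result:
        print(record["friend"])

driver.close()

The same friends-of-friends question in SQL would need self-joins on a relationships table; here the traversal is expressed directly in the query.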

5.8. Hive

 What is Apache Hive?

o Apache Hive is a data warehouse software that sits on top of Apache Hadoop.
It provides a way to query and manage large datasets residing in distributed
storage (like HDFS) using a SQL-like language called HiveQL.

 SQL-on-Hadoop: Hive translates HiveQL queries into executable MapReduce, Spark, or Tez jobs that run on the Hadoop cluster. It allows data analysts familiar with SQL to work with Big Data without learning complex Java MapReduce programming.

 Schema-on-Read (vs. Schema-on-Write):

o Concept: Hive enforces a schema when data is read (at query time) rather than
when it's written.

o Benefit: Allows users to load unstructured or semi-structured data into HDFS first, and then define a schema later based on how they want to analyze it. This provides flexibility for raw data lakes.

 Key Features:

o HiveQL: SQL-like language for querying.

o Data Definition Language (DDL): For creating, altering, and dropping tables
and databases.
o Data Manipulation Language (DML): For loading data into tables (inserts,
updates, deletes are limited or append-only).

o Partitioning and Bucketing: Techniques to optimize query performance by


organizing data on HDFS.

o User-Defined Functions (UDFs): Extend HiveQL with custom logic.

 Advantages:

o SQL Familiarity: Lowers the barrier to entry for Big Data analysis for SQL-savvy
users.

o Scalability: Leverages Hadoop's scalability for processing massive datasets.

o Cost-Effective: Runs on commodity hardware.

o Batch Processing: Ideal for batch processing and long-running analytical


queries.

 Disadvantages:

o High Latency: Not suitable for real-time queries or low-latency lookups. Queries can take minutes to hours.

o No Row-Level Updates/Deletes (Historically): Primarily designed for append-only data. ACID support for updates/deletes has been added but with limitations.

o Not a Relational Database: Does not enforce ACID properties or support complex transactions like a traditional RDBMS.

 Use Cases:

o Batch ETL (Extract, Transform, Load): Processing raw data for reporting.

o Data Warehousing on Hadoop: Building large-scale data warehouses for


analytical purposes.

o Log File Analysis: Analyzing vast amounts of log data.

o Data Exploration: Ad-hoc querying of large datasets.
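
As a sketch, HiveQL can be submitted from Python through the PyHive library (the HiveServer2 host, database, table, and partition column are assumptions); the point is that the query is ordinary SQL even though it compiles to a batch job on the cluster:

from pyhive import hive

# Connect to an assumed HiveServer2 endpoint.
conn = hive.Connection(host="hive-server.example.com", port=10000,
                       database="weblogs")
cursor = conn.cursor()

# A typical batch-analytics query over a partitioned log table.
cursor.execute("""
    SELECT status_code, COUNT(*) AS hits
    FROM access_logs
    WHERE log_date = '2024-01-01'   -- partition pruning keeps the scan small
    GROUP BY status_code
""")
for status_code, hits in cursor.fetchall():
    print(status_code, hits)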

5.9. Sharding

 Concept: A database partitioning technique that divides a large database into smaller,
more manageable, independent units called shards. Each shard is a separate database
(or a cluster of servers) that contains a subset of the total data.
 Purpose: To enable horizontal scalability (scaling out) and improve performance for
very large datasets and high traffic.

 How it Works:

1. Sharding Key: A specific column or set of columns in your data is chosen as the
"sharding key." This key determines which shard a particular
row/document/record will reside on.

2. Partitioning Scheme: A rule or algorithm defines how the sharding key maps
to a specific shard. Common schemes:

 Range-based Sharding: Data within a specific range of the sharding key goes to one shard (e.g., users A-M go to Shard 1, N-Z go to Shard 2).

 Hash-based Sharding: The sharding key is hashed, and the hash value determines the shard (e.g., hash(user_id) % num_shards). This tends to distribute data more evenly.

 Directory-based Sharding: A lookup table (directory) stores which sharding key range belongs to which shard.

 Advantages:

o Horizontal Scalability: Add more shards (servers) to increase capacity and


handle more data/traffic.

o Improved Performance: Queries only hit a subset of the data on a single shard,
reducing I/O and CPU load.

o Increased Availability: If one shard fails, only a portion of the data is affected,
not the entire database.

 Disadvantages:

o Increased Complexity: Sharding adds significant complexity to database design, application development, and operational management.

o No Cross-Shard Joins: Queries that require joining data across multiple shards
are difficult or impossible without complex application logic.

o Resharding Difficulty: Rebalancing data or adding new shards (if the initial
partitioning isn't optimal) can be a very challenging and time-consuming
operation.

o Hot Shards: If the sharding key leads to uneven data distribution, some shards
can become "hot" (overloaded) while others are underutilized.
 Applicability: Used in many NoSQL databases (Document stores, Column-family
stores) and sometimes with very large relational databases. It's a fundamental concept
for Big Data scale.
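
A minimal sketch of hash-based sharding as it might appear in application code; the shard count and record layout are assumptions, and the in-memory dictionaries stand in for separate database servers:

import hashlib

NUM_SHARDS = 4  # assumed fixed shard count

def shard_for(user_id: str) -> int:
    """Map a sharding key to a shard with a stable hash (not Python's randomized hash())."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Each entry stands in for a separate database/server in a real deployment.
shards = {i: {} for i in range(NUM_SHARDS)}

def put_user(user_id, profile):
    shards[shard_for(user_id)][user_id] = profile

def get_user(user_id):
    return shards[shard_for(user_id)].get(user_id)

put_user("user-1001", {"name": "Asha"})
print(shard_for("user-1001"), get_user("user-1001"))

Note that with the modulo scheme, changing NUM_SHARDS remaps most keys – one reason resharding is hard and why consistent hashing is often used instead.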

5.10. HBase

 What is Apache HBase?

o An open-source, non-relational (NoSQL) distributed column-oriented database designed to run on top of HDFS (Hadoop Distributed File System).

o It's part of the Hadoop ecosystem and provides real-time random read/write access to Big Data.

 Based on Google's Bigtable: HBase is often described as the open-source equivalent of Google's Bigtable.

 Data Model:

o Tables: Similar to relational tables, but columns are organized into column
families.

o Row Key: Each row has a unique, primary row key, which is always indexed.
Rows are stored lexicographically by row key.

o Column Families: Logical groupings of columns. All columns within a family are
stored together on disk.

o Columns: Identified by a column family prefix and a qualifier (e.g., personal_data:name, personal_data:age).

o Cells: Each cell stores a value and has a timestamp (enabling versioning).

 Key Characteristics:

o Schema-less (Flexible Schema): You define column families, but new columns
can be added to any row at any time without pre-defining them.

o Sparse: Only stores data for columns that have values, making it efficient for
sparse datasets.

o Versioning: Automatically stores multiple versions of a cell's value, indexed by


timestamp.

o Random Access: Optimized for quick lookups by row key, and range scans.

o Scalable: Horizontal scalability by adding more RegionServers.

o Fault Tolerant: Leverages HDFS for data replication and fault tolerance.
 Advantages:

o Real-time Read/Write: Provides low-latency access to individual records


(unlike Hive).

o Handles Massive Scale: Can store and manage petabytes of data.

o High Write Throughput: Excellent for ingesting large volumes of data quickly.

o Good for Sparse Data: Efficiently stores tables with many empty cells.

 Disadvantages:

o No SQL: Requires using its own API (Java, Python clients) or wrappers like
Apache Phoenix for SQL-like access.

o No Joins/Complex Transactions: Designed for atomic row-level operations, not complex multi-row transactions or joins.

o Complex Setup/Management: Can be challenging to set up, tune, and manage.

o Not a Replacement for RDBMS: Not suitable for highly normalized, transactional data that requires strong ACID properties.

 Use Cases:

o Time-Series Data: Storing sensor data, financial tick data.

o Large User Profiles: Storing user data for web applications.

o Real-time Analytics: Serving as a data store for dashboards where sub-second


query latency is required.

o Ad Serving Platforms: Storing user behavior data for real-time ad targeting.
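
A sketch using the HappyBase client, which talks to HBase through its Thrift gateway (the table, column family, and row keys are assumptions for illustration):

import happybase

connection = happybase.Connection("localhost")  # assumed Thrift server
table = connection.table("users")               # table with family 'personal_data'

# Write: columns are addressed as b'family:qualifier'; new qualifiers can
# appear at any time without a schema change.
table.put(b"user#1001", {
    b"personal_data:name": b"Asha",
    b"personal_data:age": b"29",
})

# Random read by row key (low latency); only stored cells come back.
row = table.row(b"user#1001")
print(row[b"personal_data:name"])

# Range scan over a block of lexicographically ordered row keys.
for key, data in table.scan(row_prefix=b"user#"):
    print(key, data)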

5.11. Analyzing Big Data with Twitter

 Twitter as a Big Data Source:

o Twitter generates an immense volume of real-time, unstructured, and semi-structured data (tweets, retweets, likes, user profiles, hashtags, mentions).

o It's a rich source of public opinion, trends, events, and sentiment.

 Why Analyze Twitter Data for Big Data Applications?

o Real-time Insights: Understand public reaction to events, products, or


campaigns as they unfold.
o Sentiment Analysis: Gauge public sentiment towards brands, political
candidates, or current affairs.

o Trend Detection: Identify emerging trends, popular topics, and viral content.

o Customer Feedback: Monitor customer service issues, product reviews, and


brand mentions.

o Event Detection: Identify breaking news, disasters, or significant events based


on tweet patterns.

o Influence Analysis: Find key influencers and opinion leaders.

o Market Research: Understand consumer preferences and market dynamics.

 Challenges of Analyzing Twitter Big Data:

o Volume & Velocity: Millions of tweets per day require scalable ingestion and
processing.

o Noise & Irrelevance: Much of the data can be irrelevant, spam, or contain
sarcasm/ambiguity.

o Data Cleaning & Preprocessing: Requires significant effort to remove noise,


normalize text, handle emojis, etc.

o Lack of Structure: Tweets are unstructured text, requiring NLP techniques.

o Data Access: APIs often have rate limits, and obtaining historical data can be
challenging.

 Typical Big Data Architecture for Twitter Analysis:

1. Data Ingestion:

 Twitter API: Use Twitter's Streaming API (for real-time samples) or REST
API (for historical data, user profiles).

 Apache Kafka/Kinesis: Ingest the raw tweet stream into a message


queue for buffering and decoupling.

2. Stream Processing:

 Apache Spark Streaming/Flink: Process tweets in real-time.

 Operations:

 Filtering: Select tweets based on keywords, hashtags, user


mentions.
 Parsing: Extract relevant fields (tweet text, user ID, timestamp,
location).

 Sentiment Analysis: Apply NLP models to classify sentiment.

 Topic Modeling: Identify main topics discussed.

 Trend Detection: Count frequencies of hashtags or keywords


over time windows.

3. Storage:

 HDFS/Object Storage (S3): For long-term storage of raw or processed


tweets (Data Lake).

 NoSQL Databases (e.g., MongoDB, Elasticsearch): For rapid querying


of processed tweets or real-time dashboards.

4. Analytics & Visualization:

 Dashboards: Real-time dashboards (e.g., Grafana, custom web apps)


showing sentiment trends, popular topics.

 Reporting: Generate reports for deeper analysis.

 Alerting: Trigger alerts for unusual activity or sentiment spikes.
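
A toy sketch of the filtering and trend-detection steps above, run over an in-memory list of tweet dictionaries instead of the live streaming pipeline (the field names and keywords are assumptions):

from collections import Counter

# Stand-in for tweets arriving from Kafka / the Streaming API.
tweet_stream = [
    {"user": "a", "text": "Loving the new phone launch! #gadgets"},
    {"user": "b", "text": "Terrible service today #fail"},
    {"user": "c", "text": "Launch event was amazing #gadgets #launch"},
]

KEYWORDS = {"launch", "service"}
hashtag_counts = Counter()

for tweet in tweet_stream:
    text = tweet["text"].lower()
    # Filtering: keep only tweets that mention a tracked keyword.
    if not any(k in text for k in KEYWORDS):
        continue
    # Trend detection: count hashtags within the (toy) time window.
    hashtag_counts.update(w for w in text.split() if w.startswith("#"))

print(hashtag_counts.most_common(3))

In a real deployment the same per-tweet logic would run inside a Spark Streaming or Flink job, with the counts written out to a NoSQL store for dashboards.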

5.11.1. Big Data for E-Commerce

 Problem: E-commerce companies generate and consume vast amounts of data from
customer interactions, product catalogs, sales, logistics, marketing, and external
sources.

 Why Big Data is Crucial for E-commerce:

o Understanding Customers: Gain a 360-degree view of customer behavior (browsing activity, purchase history, demographics, preferences).

o Personalization: Tailor experiences, recommendations, and marketing messages to individual users.

o Operational Efficiency: Optimize inventory, supply chain, and logistics.

o Fraud Prevention: Detect and prevent fraudulent transactions.

o Competitive Advantage: React quickly to market changes, analyze competitor


strategies.

 Key Data Sources:


o Clickstream Data: Website visits, page views, clicks, search queries, time spent.

o Transaction Data: Purchase history, order details, payment information.

o Customer Data: Demographics, contact information, loyalty program data.

o Product Data: Product descriptions, images, reviews, ratings, pricing.

o Marketing Data: Campaign performance, ad impressions, email interactions.

o Social Media Data: Mentions, reviews, sentiment.

o Supply Chain Data: Inventory levels, shipping information, supplier


performance.

 Big Data Applications in E-commerce:

o Recommendation Systems:

 Collaborative Filtering: "Customers who bought this also bought..."

 Content-Based: "Because you viewed this category..."

 Personalized Homepage/Product Listings: Dynamically arranged


content.

o Personalized Marketing & Promotions:

 Targeted emails, ads, and discounts based on browsing history and purchase patterns.

 Predicting customer churn and offering retention incentives.

o Dynamic Pricing:

 Adjusting product prices in real-time based on demand, competitor


prices, inventory levels, and customer segments.

o Inventory Optimization & Demand Forecasting:

 Predicting future demand based on historical sales, seasonality, trends,


and external factors (weather, events).

 Optimizing warehouse stock levels to reduce waste and avoid


stockouts.

o Fraud Detection:

 Analyzing transaction patterns, user behavior, and IP addresses to identify suspicious activities (e.g., multiple orders from the same card with different shipping addresses).
o Customer Service Enhancement:

 Analyzing customer interaction data to identify common issues,


improve chatbot responses, and route queries efficiently.

o Website Optimization (A/B Testing):

 Analyzing user behavior data to test different website layouts, product


placements, and conversion funnels.

 Technologies: Data Lakes (HDFS, S3), Spark, Hive, Kafka, NoSQL databases (Cassandra,
MongoDB), Machine Learning platforms.
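
A toy sketch of the "customers who bought this also bought..." idea mentioned above: counting co-purchases from order baskets (the order data is invented for illustration):

from collections import Counter, defaultdict
from itertools import combinations

# Each order is the set of product IDs bought together (invented data).
orders = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"laptop", "mouse"},
    {"phone", "charger"},
]

# Count how often each pair of products appears in the same basket.
co_purchases = defaultdict(Counter)
for basket in orders:
    for a, b in combinations(sorted(basket), 2):
        co_purchases[a][b] += 1
        co_purchases[b][a] += 1

# "Customers who bought 'phone' also bought..."
print(co_purchases["phone"].most_common(2))  # e.g. [('case', 2), ('charger', 2)]

At real e-commerce scale the same counting would be done with Spark over the full transaction history, but the logic is identical.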

5.11.2. Big Data for Blogs

 Problem: Blog platforms (e.g., WordPress, Medium, personal blogs) generate various
types of data: content, user interactions, and external engagement. Managing and
analyzing this data can be complex for large-scale blogs or blogging platforms.

 Key Data Sources:

o Content Data: Blog post text, categories, tags, author information, publication
dates.

o User Interaction Data (Clickstream): Page views, time on page, bounce rate,
clicks on internal/external links, search queries.

o Comments & Engagements: User comments, likes, shares on social media.

o Traffic Sources: Referrers (Google, social media, other blogs), demographics of


visitors.

o Advertising Data: Ad impressions, clicks, conversions (if monetized).

 Why Big Data is Useful for Blogs:

o Content Optimization: Understand what content resonates with the audience.

o Audience Engagement: Identify popular topics, optimal posting times, and


reader preferences.

o Monetization Optimization: Improve ad placement, identify high-value


content.

o Personalization: Recommend relevant articles to users.

o Author Performance: Track performance of individual authors or categories.

 Big Data Applications in Blogging:


o Content Performance Analysis:

 Analyzing page views, time on page, and bounce rates for individual
articles to identify top-performing content.

 Understanding which topics or categories drive the most engagement.

o Audience Segmentation:

 Grouping readers based on their browsing behavior, interests (from articles read), and demographics.

 Tailoring newsletters or content recommendations to specific


segments.

o Personalized Content Recommendations:

 Suggesting "related articles" to users based on their reading history


using collaborative filtering or content-based methods.

o Sentiment Analysis of Comments:

 Understanding the overall sentiment in comments to gauge reader


reaction and identify potential issues.

o A/B Testing of Layouts/Features:

 Analyzing user behavior data to test different blog layouts, call-to-action placements, or subscription forms.

o Traffic Source Optimization:

 Understanding which channels (social media, search engines, direct)


drive the most valuable traffic.

o Advertising/Monetization Insights:

 Analyzing ad impression data, click-through rates, and conversion data


to optimize ad placements and strategies.

 Technologies: Web analytics tools often provide initial insights. For deeper, large-scale
analysis, Big Data technologies like Spark, Hadoop, Kafka, and appropriate NoSQL
databases would be used to store and process the raw log and interaction data.

5.12. Review of Basic Data Analytic Methods using R

R is a powerful statistical programming language widely used for data analysis, visualization,
and machine learning. Here's a review of basic methods.
Data Import and Export:

o Concept: Reading data into R from various formats (CSV, Excel, databases) and
writing results back out.

o R Functions: read.csv(), read.table(), readxl::read_excel(), the DBI package for databases, write.csv().

o Example: my_data <- read.csv("data.csv")

Data Cleaning and Preprocessing:

o Concept: Handling missing values, outliers, inconsistent formats, and


transforming data into a suitable format for analysis.

o R Functions/Packages:

 is.na() for checking missing values.

 na.omit(), na.exclude() for removing missing values.

 dplyr::mutate() for creating/modifying columns.

 dplyr::filter() for subsetting rows.

 tidyr::pivot_longer(), pivot_wider() for reshaping data.

 scale() for standardization/normalization.

 Factor conversion: as.factor().

o Example: my_data$Age[is.na(my_data$Age)] <- mean(my_data$Age, na.rm = TRUE) (impute missing age with the mean)

Descriptive Statistics:

o Concept: Summarizing and describing the main features of a dataset.

o R Functions:

 summary(): Provides min, max, quartiles, mean, and median for numeric variables; counts for factors.

 mean(), median(), sd() (standard deviation), var() (variance).

 table(): Frequency counts for categorical variables.

 hist(): Histograms for distribution.

 boxplot(): Box plots for distribution and outliers.

o Example: summary(my_data$Sales), hist(my_data$Age)


Data Visualization (Exploratory Data Analysis - EDA):

o Concept: Creating graphical representations to understand data patterns,


relationships, and distributions.

o R Functions/Packages:

 plot(): Generic plotting.

 ggplot2: A powerful and flexible package for creating high-quality,


customizable plots.

 line plots, scatter plots, bar charts, box plots, histograms.

o Example: library(ggplot2); ggplot(my_data, aes(x = Age, y = Sales)) + geom_point()

Inferential Statistics (Basic):

o Concept: Making inferences about a population based on a sample of data.

o R Functions:

 t.test(): T-tests for comparing means of two groups.

 cor(): Correlation coefficient between two variables.

 lm(): Linear regression for modeling relationships.

 aov(): ANOVA for comparing means of more than two groups.

o Example: cor(my_data$Age, my_data$Sales); model <- lm(Sales ~ Age + Income, data = my_data)

Basic Machine Learning (Introduction):

o Concept: Building models that can learn from data and make predictions or
decisions without being explicitly programmed.

o R Packages:

 caret: For general machine learning workflows (splitting data, training,


evaluation).

 rpart: For Decision Trees (as covered previously).

 e1071: For Naive Bayes (as covered previously), SVM, etc.

 stats::kmeans: For K-means clustering.

o Workflow Example (Classification):


1. Split Data: library(caret); set.seed(123); trainIndex <-
createDataPartition(my_data$Target, p = 0.7, list = FALSE)

2. Train Model: model <- train(Target ~ ., data = my_data[trainIndex,], method = "rpart")

3. Predict: predictions <- predict(model, newdata = my_data[-trainIndex,])

4. Evaluate: confusionMatrix(predictions, my_data[-trainIndex,]$Target)
