LePic
🔥Data Analytics🔥
Complete Syllabus in One Shot
💯Unit-01+02+03+04+05💯
Ques-Describe the characteristics of data that are relevant in the field of
data analytics. How do these characteristics impact the analysis process?
(AKTU)
Characteristics of Data in Data Analytics and Their Impact
1. Volume
Refers to the amount of data.
Data can be huge (terabytes or more).
Impact: Large data needs more storage, better tools, and powerful computers for
analysis.
2. Variety
Refers to different types of data (text, images, videos, etc.).
Structured, semi-structured, and unstructured data.
Impact: Different tools are needed to process each type. It increases complexity.
3. Velocity
Refers to the speed at which data is generated and processed.
Example: Social media updates or stock market data.
Impact: Fast data requires real-time or quick analysis tools.
4. Veracity
Refers to the accuracy and trustworthiness of data.
Data can be incomplete, wrong, or biased.
Impact: Poor quality data gives wrong results. Data cleaning is necessary.
5. Value
Refers to the usefulness of data.
Not all data is helpful. Only meaningful data gives insights.
Impact: Analysts focus only on valuable data to save time and get better results.
6. Variability
Refers to changes in data over time.
Meaning and format of data can vary.
Impact: Makes analysis difficult. Consistency must be maintained.
7. Data Quality
Good data should be complete, correct, and consistent.
Impact: High-quality data leads to better decisions and outcomes.
Ques-Explain the concept of generalization in neural networks. How does it
relate to the trade-off between bias and variance, and what strategies can
be employed to enhance generalization performance? (AKTU)
Concept of Generalization in Neural Networks
→ Generalization means how well a neural network works on new or unseen data (not
the data it trained on).
→ A model that generalizes well gives correct answers not just on training data but also
on real-world or test data.
→ This is important because we usually use the model on new data.
Relation to Bias-Variance Trade-off
→ Bias is the error when a model is too simple and cannot learn the data properly (called
underfitting).
→ Variance is the error when a model is too complex and learns the training data too
perfectly, including noise (called overfitting).
→ A model with high bias misses patterns.
→ A model with high variance cannot handle new data well.
→ Good generalization happens when there is a balance between bias and variance.
Strategies to Improve Generalization
→ Use more training data to help the model learn better.
→ Early stopping: stop training before the model starts overfitting.
→ Regularization (like L1 or L2): adds a penalty to the model to keep it simple.
→ Dropout: randomly turn off some neurons while training to avoid overfitting.
→ Cross-validation: test the model on different parts of the data to check if it's learning
correctly.
→ Data augmentation: slightly change the data (like flipping images) to teach the
model better.
→ Use a simpler model if the problem is not very complex.
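Below is a small illustrative sketch (not from the original notes) of two of these strategies, L2 regularization and early stopping, using scikit-learn's MLPClassifier; the synthetic dataset and hyperparameter values are assumptions chosen only for demonstration.

```python
# Sketch: L2 regularization (alpha) and early stopping in a small neural network.
# Assumes scikit-learn is installed; dataset and hyperparameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = MLPClassifier(
    hidden_layer_sizes=(64, 32),
    alpha=0.01,                # L2 penalty keeps weights small (regularization)
    early_stopping=True,       # stop when the validation score stops improving
    validation_fraction=0.1,   # part of the training data held out for early stopping
    max_iter=500,
    random_state=42,
)
model.fit(X_train, y_train)

# Generalization is judged on unseen data, not the training set.
print("Train accuracy:", model.score(X_train, y_train))
print("Test accuracy :", model.score(X_test, y_test))
```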
Ques-Provide a detailed explanation of how fuzzy logic is used to extract
models from data. Discuss the advantages of fuzzy modeling in capturing
uncertainty and handling imprecise information in comparison to
traditional crisp models. (AKTU)
How Fuzzy Logic is Used to Extract Models from Data
→ Fuzzy logic is a way of thinking that allows partial truth values (like 0.2, 0.5, 0.9), instead
of only true or false (0 or 1) like in traditional logic.
→ In fuzzy modeling, we use "if-then" rules to describe systems, like:
If temperature is high, then fan speed is fast.
→ These rules use fuzzy sets (e.g., “high temperature” is not a fixed number but a range
with a degree of membership).
→ When we have real-world data, fuzzy logic helps to create these rules automatically
by looking at the patterns in the data.
→ A fuzzy model is built by:
1. Fuzzifying the input – converting real data into fuzzy values (like "low", "medium", "high").
2. Creating rules based on patterns in the data.
3. Applying fuzzy inference to process the rules.
4. Defuzzifying the output – converting fuzzy result back into a crisp number.
→ This process helps to model systems where it’s hard to define exact rules due to
complexity or unclear boundaries.
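To make these four steps concrete, here is a minimal pure-Python sketch of the temperature/fan-speed rule above. The membership ranges and rule outputs are assumed values for illustration, not a standard.

```python
# Sketch of fuzzify -> apply rules -> defuzzify for "if temperature is high, fan speed is fast".
# Membership ranges and rule outputs are illustrative assumptions.

def tri(x, a, b, c):
    """Triangular membership: rises from a to b, falls from b to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzify(temp_c):
    # Step 1: convert a crisp temperature into fuzzy degrees of membership.
    return {
        "low":    tri(temp_c, -10, 10, 25),
        "medium": tri(temp_c, 15, 25, 35),
        "high":   tri(temp_c, 25, 40, 60),
    }

def infer_fan_speed(temp_c):
    m = fuzzify(temp_c)
    # Steps 2 and 3: if-then rules, each mapping a fuzzy label to a typical fan speed (RPM).
    rule_outputs = {"low": 400, "medium": 900, "high": 1600}
    # Step 4: defuzzify with a weighted average of the rule outputs.
    total = sum(m.values())
    if total == 0:
        return 0.0
    return sum(m[label] * rule_outputs[label] for label in m) / total

print(fuzzify(30))           # partly "medium" and partly "high"
print(infer_fan_speed(30))   # crisp fan speed between 900 and 1600
```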
Advantages of Fuzzy Modeling Over Traditional Crisp Models
→ Handles Uncertainty:
Fuzzy models can manage uncertainty in data, such as noisy, incomplete, or vague
information.
→ Works with Imprecise Inputs:
Instead of needing exact numbers, fuzzy logic allows inputs like “almost high” or
“somewhat low.”
→ Human-like Reasoning:
Fuzzy logic models behave more like humans who use terms like "warm" or "fast" instead
of exact values.
→ Simple Rule-Based Approach:
Fuzzy systems use understandable “if-then” rules which are easy to interpret and explain.
→ Better for Real-World Problems:
Real-life problems often have gray areas, not black-and-white situations. Fuzzy models
handle these better than crisp ones.
→ No Need for Precise Mathematical Models:
Fuzzy modeling can work even when we don’t fully know the equations of the system.
Comparison with Traditional Crisp Models
| Feature | Fuzzy Model | Traditional Crisp Model |
| --- | --- | --- |
| Input Type | Imprecise, vague | Precise, exact |
| Logic Used | Partial (0 to 1) | Binary (0 or 1) |
| Flexibility | High | Low |
| Real-life Use | More natural | Less realistic in uncertain cases |
| Handling Noise | Strong | Weak |
Ques-In the context of stream data, explain different approaches for
counting distinct elements. How do these methods address challenges
associated with continuously changing data? (AKTU)
What is Stream Data?
→ Stream data means data that is continuously coming in, like messages on WhatsApp,
live sensors, or social media feeds.
→ We cannot store all the data because it’s too fast and too big.
→ So, we need smart ways to count distinct elements (like how many different users sent
messages) without storing everything.
Approaches for Counting Distinct Elements in Stream Data
→ 1. Exact Counting (using Hash Sets or Hash Tables)
→ Store each unique element in a hash set.
→ At the end, count the number of items in the set.
Problem:
→ Needs a lot of memory when the data is large.
→ Not good for fast and continuous data streams.
→ 2. Sampling-Based Methods
→ Take a small sample from the stream instead of using the whole data.
→ Estimate the number of distinct elements based on the sample.
Advantage:
→ Saves memory and time.
Limitation:
→ Only gives an approximate answer, not 100% correct.
→ 3. Hashing with Bitmaps (Flajolet-Martin Algorithm)
→ Use a hash function to map each element to a binary pattern (like 0001, 0100).
→ Find the position of the rightmost 1-bit (i.e., the number of trailing zeros) in each hashed value.
→ Use the maximum such position seen across the stream to estimate the number of distinct elements (roughly 2 raised to that maximum).
Advantage:
→ Uses very little memory.
→ Works well on large data streams.
Limitation:
→ Approximate result, not exact.
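A minimal sketch of the Flajolet-Martin idea, under illustrative assumptions (one MD5-based hash truncated to 32 bits, no averaging): track the largest number of trailing zeros R and report roughly 2^R. Real implementations combine many hash functions to reduce variance.

```python
# Flajolet-Martin sketch: estimate distinct elements with one hash function (illustrative).
import hashlib

def trailing_zeros(n: int) -> int:
    if n == 0:
        return 32                      # treat an all-zero hash as the maximum
    count = 0
    while n & 1 == 0:
        n >>= 1
        count += 1
    return count

def fm_estimate(stream) -> int:
    max_r = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16) & 0xFFFFFFFF
        max_r = max(max_r, trailing_zeros(h))
    return 2 ** max_r                  # rough estimate of the distinct count

stream = ["user%d" % (i % 500) for i in range(10_000)]   # 500 distinct users
print("True distinct: 500, FM estimate:", fm_estimate(stream))
```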
→ 4. HyperLogLog Algorithm
→ An improved version of Flajolet-Martin.
→ Uses many hash functions and registers to get a better estimate.
→ Combines the results for high accuracy.
Advantage:
→Very accurate with small memory usage.
→Used in real systems (like by Google and Facebook).
Limitation:
→More complex to implement.
How These Methods Handle Challenges of Stream Data
→ Limited Memory:
Approximate methods (like HyperLogLog) use very little space.
→ Speed:
These algorithms are fast and don’t need to store all elements.
→ Changing Data:
They update estimates as new data comes in, so they handle live changes easily.
→ Scalability:
They work well even if the data grows huge, like millions of users.
Ques-Describe the concept of counting uniqueness in a window in the
context of stream processing. How does this relate to measuring the
frequency and uniqueness of elements within a specified time frame?
(AKTU)
Concept of Counting Uniqueness in a Window (Stream Processing)
→ Stream processing means analyzing data that comes in continuously, like messages,
clicks, or sensor readings.
→ Window means a limited time frame or range (for example, last 1 minute, or last 100
elements).
→ In this window, we only look at data that falls within that specific time or size.
→ Counting uniqueness means finding out how many different (unique) elements
appeared in that window.
For example:
If the window contains: [A, B, A, C] → unique elements are A, B, and C → count = 3.
Types of Windows
→ Tumbling Window:
Fixed size, non-overlapping. Example: every 1 minute.
→ Sliding Window:
Fixed size, but moves forward in steps (overlapping). Example: every 30 seconds, check
last 1 minute.
→ Count-based Window:
Instead of time, based on number of elements. Example: check every 100 messages.
Relation to Frequency and Uniqueness
→ Frequency = How many times an element appears in a window.
Example: In [A, A, B, C], frequency of A = 2.
→ Uniqueness = Count of different elements in the window.
In the same example: unique elements = A, B, C → count = 3.
→ By measuring both, we understand:
- Which elements are common (high frequency)
- How diverse the data is (high uniqueness)
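A small sketch (with assumed example data) that measures both frequency and uniqueness over a count-based sliding window, using Python's deque and Counter:

```python
# Frequency and uniqueness over a count-based sliding window (illustrative).
from collections import Counter, deque

WINDOW_SIZE = 4
window = deque()
counts = Counter()

def add(element):
    """Add a new element; evict the oldest one once the window is full."""
    window.append(element)
    counts[element] += 1
    if len(window) > WINDOW_SIZE:
        old = window.popleft()
        counts[old] -= 1
        if counts[old] == 0:
            del counts[old]

for x in ["A", "B", "A", "C", "D"]:
    add(x)

print("Window:", list(window))         # ['B', 'A', 'C', 'D']
print("Frequency of A:", counts["A"])  # 1
print("Uniqueness:", len(counts))      # 4 distinct elements in the window
```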
Why It’s Useful in Real-Time Analysis
→ Helps detect trends or anomalies.
Example: sudden drop in uniqueness may mean spam or attack.
→ Helps in user behavior analysis, like how many different users visited a site in the last 5
minutes.
→ Useful in network monitoring, fraud detection, and recommendation systems.
Ques-Explain the process model and computation model for Big data
platform. (AKTU)
→ Process Model in Big Data Platform
The process model shows how big data is handled step by step — from collection to final
result.
→ 1. Data Collection
Data comes from different sources like websites, sensors, social media, etc.
Can be structured (tables), semi-structured (XML/JSON), or unstructured (text, images).
→ 2. Data Storage
Data is stored in big storage systems like HDFS (Hadoop Distributed File System) or
cloud storage.
Data is stored across many machines for fault tolerance.
→ 3. Data Processing
The collected data is processed using tools like MapReduce, Apache Spark, etc.
Processing can be batch (large data at once) or real-time (as it comes in).
→ 4. Data Analysis
Data is analyzed using statistical methods, machine learning, or data mining.
Tools like Hive, Pig, Spark MLlib are used.
→ 5. Visualization & Reporting
Results are shown using dashboards, graphs, or reports for decision-making.
Tools like Tableau, Power BI, or Kibana are used.
→ Computation Model in Big Data Platform
The computation model explains how data is processed internally across multiple
systems.
→ 1. Batch Processing Model
Processes large blocks of data at once.
Example: Hadoop MapReduce.
→ Good for processing big historical data.
→ Slower, not for real-time use.
→ 2. Stream (Real-Time) Processing Model
Processes data as it arrives (event-by-event).
Examples: Apache Storm, Apache Flink, Apache Spark Streaming.
→ Good for live data like stock prices, logs, etc.
→ 3. DAG-Based (Directed Acyclic Graph) Model
Used by tools like Apache Spark.
Each task is a node in a graph.
→ Allows better optimization and fault recovery.
→ 4. In-Memory Computation
Keeps data in RAM instead of reading from disk.
Example: Apache Spark.
→ Much faster than disk-based systems like MapReduce.
→ 5. Parallel and Distributed Computing
Big data is split into parts and processed on many machines at the same time.
→ Helps to handle very large data quickly and efficiently.
Ques-Explain the use and advantages of decision trees. (AKTU)
Uses and Advantages of Decision Trees
| Aspect | Details |
| --- | --- |
| Use 1: Classification | Used to categorize data (e.g., spam vs not spam). |
| Use 2: Regression | Predicts continuous values (e.g., house prices). |
| Use 3: Feature Selection | Identifies most important input features. |
| Use 4: Rule Generation | Creates simple "if-then" rules that are easy to interpret. |
| Use 5: Versatility | Used in medicine, finance, marketing, etc. |

| Advantage | Explanation |
| --- | --- |
| Easy to Understand | Tree structure is like a flowchart, simple to follow. |
| No Need for Normalization | Works without scaling or transforming input data. |
| Handles Categorical & Numeric Data | Works well with both types of inputs. |
| Less Data Cleaning Required | Can manage missing or imperfect data. |
| Fast and Efficient | Quick training and predictions. |
| Shows Feature Importance | Clearly shows which variables affect the output most. |
| Good for Ensemble Learning | Used in Random Forest and Boosting methods to improve accuracy. |
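A brief sketch of classification, rule generation, and feature importance with scikit-learn's DecisionTreeClassifier; the Iris dataset and max_depth value are illustrative choices, not part of the original notes.

```python
# Decision tree for classification plus feature importances (illustrative sketch).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)   # shallow tree stays interpretable
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
print("Feature importances:", tree.feature_importances_)     # which inputs matter most
print(export_text(tree))                                      # readable if-then rules
```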
Ques-Explain the architecture of data stream model. (AKTU)
Architecture of Data Stream Model
This architecture is used to process continuous, fast, and large volumes of data in real
time.
1. Streams Entering
These are continuous data inputs coming into the system.
Examples of input streams:
Numeric stream: 1, 5, 2, 7, 4, 0, 3, 5
Character stream: q, w, e, r, t, y, u, i, o
Binary stream: 0, 1, 1, 0, 1, 0, 0, 0
These streams keep arriving over time and cannot be fully stored before processing.
2. Stream Processor
The core component that processes the incoming data.
Responsibilities:
Continuously accepts and processes multiple data streams.
Applies queries (both standing and ad-hoc).
Generates output streams based on the processed data.
Works with two types of memory:
Limited Working Storage
Archival Storage
3. Standing Queries
These are predefined queries that are always active inside the system.
They automatically process the incoming data in real time.
Used for ongoing tasks such as counting, filtering, or aggregating data.
For example, a standing query might track how many times a specific number
appears.
4. Ad-hoc Queries
These are queries that are added by users manually when needed.
Not always running like standing queries.
Used for specific analysis tasks, often involving recent or historical data.
For example, a user might ask: "What was the average of the last 100 values?"
5. Output Streams
The results generated by the stream processor after processing the input.
These can be real-time summaries, alerts, filtered data, or analytics.
The output is continuously updated as new data arrives.
6. Limited Working Storage
Temporary, small storage space used for keeping recent data.
Because the stream is infinite, the system cannot store all of it.
Stores only the data that is needed for immediate processing.
Useful for responding quickly to queries about recent events.
7. Archival Storage
Permanent, large storage for saving historical data.
Stores data that is no longer in the working memory.
Helps in answering queries about past events or trends.
Used when the user runs ad-hoc queries that need access to older data.
How These Components Work Together
Input data streams enter the stream processor.
The processor uses standing queries to analyze the data in real time.
Users can submit ad-hoc queries to get specific information.
Recent data is stored in limited working storage for fast access.
Old data is moved to archival storage for long-term use.
The results are sent out as output streams.
Summary Table
| Component | Function |
| --- | --- |
| Streams Entering | Real-time continuous input data |
| Stream Processor | Main unit that processes and applies queries to the streams |
| Standing Queries | Always-on queries for automatic real-time results |
| Ad-hoc Queries | User-created queries for specific analysis |
| Output Streams | Final results of processed data |
| Limited Working Storage | Temporary storage for recent data |
| Archival Storage | Long-term storage for historical data |
Ques-Illustrate the K-means algorithm in detail with its advantages
(AKTU)
K-means Clustering
Introduction
K-means is a popular unsupervised machine learning algorithm used for clustering
data into K groups based on their similarities. It is widely used in data mining, pattern
recognition, and image analysis.
The goal of K-means is to partition the dataset into K clusters such that:
Each data point belongs to the cluster with the nearest mean.
The intra-cluster variance (within the same group) is minimized.
The inter-cluster variance (between different groups) is maximized.
Features of K-means Clustering
→ Unsupervised learning
It does not require labeled data; it groups data based solely on patterns.
→ Centroid-based algorithm
Each cluster is defined by the centroid (mean) of the data points in the cluster.
→ Iterative refinement
K-means repeatedly updates cluster assignments and centroids until convergence.
→ Distance measure used
Usually uses Euclidean distance to determine similarity between data points and
centroids.
→ Scalability
K-means is highly scalable and works well with large datasets.
→ Speed and efficiency
The algorithm is computationally efficient due to its simple steps and linear complexity.
→ Applicability
Used in a wide variety of domains such as marketing, biology, image segmentation, and
more.
Important Points to Remember
→ K must be defined beforehand
The number of clusters (K) needs to be chosen in advance. This can be done using
techniques like the Elbow Method or Silhouette Score.
→ Sensitive to initial centroids
Different initial centroids can lead to different final clusters (local minima issue).
→ Clusters formed are convex
K-means assumes clusters are spherical and equally sized, which may not always be
true.
→ Not suitable for non-linear data
K-means cannot handle complex cluster shapes or outliers effectively.
→ Assumes numerical data
The algorithm assumes features are numerical and comparable using Euclidean
distance.
Working of the K-means Algorithm
The K-means algorithm follows these steps:
→ Step 1: Choose the number of clusters (K)
Decide how many groups the data should be divided into.
→ Step 2: Initialize centroids
Randomly select K data points from the dataset as the initial cluster centers.
→ Step 3: Assign each point to the nearest centroid
Measure the distance from each data point to all centroids, and assign it to the nearest
one.
→ Step 4: Update the centroids
Recalculate the centroids by taking the mean of all data points assigned to each cluster.
→ Step 5: Repeat Steps 3 and 4
Continue reassigning points and updating centroids until:
The centroids no longer move (i.e., converge), or
A maximum number of iterations is reached.
→ Step 6: Return the final clusters
Output the final cluster assignments and the centroid positions.
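The steps above can be sketched in a few lines of NumPy. This is an illustrative toy implementation (random blob data, fixed k), not a production version of K-means.

```python
# Minimal K-means sketch with NumPy, following the six steps above (illustrative).
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # Step 2: random initialization
    for _ in range(max_iter):                                   # Step 5: repeat until convergence
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids                                    # Step 6: final clusters

# Two well-separated blobs as toy data
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
print(centroids)
```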
Advantages of K-means Algorithm
→ Simple and easy to understand
Its concept is straightforward and intuitive for beginners.
→ Fast and efficient
It performs well on large datasets due to its linear time complexity.
→ Guaranteed convergence
Although it may converge to a local minimum, it will always converge in finite steps.
→ Works well when clusters are well-separated
It performs best when the data naturally forms distinct clusters.
→ Useful in many real-world applications
It is widely used for market segmentation, social network analysis, image compression,
etc.
→ Can be improved using K-means++
A better initialization strategy that reduces the chance of poor clustering.
Ques-Differentiate between NoSQL and RDBMS databases (AKTU)
Difference Between NoSQL and RDBMS
| Criteria | RDBMS (Relational DBMS) | NoSQL (Non-relational DB) |
| --- | --- | --- |
| Data Model | Tables with rows and columns | Key-Value, Document, Column, Graph-based |
| Schema | Fixed schema (predefined structure) | Dynamic or flexible schema |
| Data Storage Format | Structured data stored in tabular form | Unstructured, semi-structured, or structured data |
| Scalability | Vertical scalability (scale-up: add more power to a server) | Horizontal scalability (scale-out: add more servers) |
| ACID Compliance | Fully ACID-compliant (Atomicity, Consistency, Isolation, Durability) | Often supports BASE (Basically Available, Soft state, Eventually consistent) |
| Query Language | SQL (Structured Query Language) | Various: MongoDB uses MQL, Cassandra uses CQL, etc. |
| Joins | Supports complex joins | Typically does not support joins (denormalized data) |
| Best For | Applications requiring complex transactions, data integrity | Applications with large volumes of data, high scalability needs |
| Examples | MySQL, PostgreSQL, Oracle, SQL Server | MongoDB, Cassandra, Redis, CouchDB, Neo4j |
| Data Integrity | High data integrity with strong relationships | Less emphasis on relationships; focuses on performance and flexibility |
| Performance | Efficient for structured data with complex queries | Efficient for big data and real-time applications |
Ques-Explain multivariate analysis and Bayesian network. (AKTU)
Multivariate Analysis
Multivariate Analysis refers to a set of statistical techniques used to analyze data that
involves multiple variables simultaneously. It aims to understand the relationships
between variables and how they interact with each other.
Key Features:
→ Involves more than one dependent or independent variable
Multivariate analysis is used when the data has multiple dimensions or variables.
→ Reveals patterns and relationships
Helps in identifying correlations, trends, clusters, and dependencies in data.
→ Reduces dimensionality
Techniques like Principal Component Analysis (PCA) help in reducing the number of
variables while retaining important information.
→ Improves decision-making
By analyzing multiple factors together, it supports better, data-driven decisions.
Common Multivariate Analysis Techniques:
→ Multiple Regression Analysis
Examines the relationship between one dependent variable and several independent
variables.
→ Principal Component Analysis (PCA)
Reduces dimensionality by transforming variables into a set of uncorrelated
components.
→ Factor Analysis
Identifies underlying factors that explain the correlations among variables.
→ Cluster Analysis
Groups similar observations into clusters based on their characteristics.
→ Discriminant Analysis
Classifies data into categories based on predictor variables.
Applications:
→ Market research
→ Medical diagnosis
→ Financial modeling
→ Image and speech recognition
→ Social science research
Bayesian Network
A Bayesian Network (also known as a Belief Network) is a probabilistic graphical model
that represents a set of variables and their conditional dependencies using a directed
acyclic graph (DAG).
Key Features
→ Nodes represent random variables
Each node stands for a variable (e.g., weather, disease, test result).
→ Edges represent dependencies
Directed edges show the probabilistic influence of one variable on another.
→ Uses Bayes' Theorem
It calculates posterior probabilities using prior knowledge and observed data.
→ Captures uncertainty
Bayesian networks handle uncertain or incomplete data effectively.
How It Works:
→ Each variable (node) has a Conditional Probability Table (CPT)
It defines the probability of that variable given its parent nodes.
→ Inference is made by updating beliefs
When new evidence is observed, probabilities are updated using Bayes' theorem.
Example:
A simple Bayesian network to diagnose flu:
Nodes: Fever, Cough, Flu
Edges: Flu →Fever, Flu →Cough
This shows that the presence of flu influences the probability of fever and cough.
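A tiny pure-Python sketch of this flu network. The CPT probabilities below are invented for illustration; the code updates the belief in Flu from Fever evidence using Bayes' theorem.

```python
# Flu -> Fever, Flu -> Cough. CPT values below are illustrative assumptions.
P_flu = 0.10                                   # prior P(Flu = yes)
P_fever_given = {True: 0.80, False: 0.10}      # P(Fever = yes | Flu)
P_cough_given = {True: 0.70, False: 0.20}      # P(Cough = yes | Flu), not used in this single-evidence query

def posterior_flu(fever: bool) -> float:
    """P(Flu = yes | Fever) via Bayes' theorem."""
    likelihood_yes = P_fever_given[True] if fever else 1 - P_fever_given[True]
    likelihood_no  = P_fever_given[False] if fever else 1 - P_fever_given[False]
    num = likelihood_yes * P_flu
    den = num + likelihood_no * (1 - P_flu)
    return num / den

print("P(Flu | Fever=yes):", round(posterior_flu(True), 3))   # belief rises above the 0.10 prior
print("P(Flu | Fever=no): ", round(posterior_flu(False), 3))  # belief drops below the prior
```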
Applications:
→ Medical diagnosis
→ Spam filtering
→ Fault detection
→ Risk analysis
→ Natural language processing
Comparison with Traditional Models:
→ Unlike classical models, Bayesian networks can model causal relationships and
update beliefs based on new evidence.
Ques-Why PCY algorithm is preferred over Apriori algorithm? (AKTU)
1. Background:
→ Apriori Algorithm is a classical algorithm used for frequent itemset mining and
association rule learning in large datasets.
→ It generates candidate itemsets and prunes those that do not meet the minimum
support threshold using the Apriori property (all subsets of a frequent itemset must also
be frequent).
→ PCY Algorithm is an improved version of Apriori that focuses on reducing the number
of candidate itemsets and memory usage during the generation of frequent pairs.
2. Limitations of Apriori Algorithm:
→ Generates too many candidate pairs in the second pass, leading to high memory
usage.
→ Performs multiple passes over the database.
→ Time and space complexity increases rapidly with the number of items.
3. How PCY Improves Apriori:
The PCY algorithm introduces hashing and bitmap filtering in the second pass to reduce
the number of candidate pairs stored in memory.
4. Key Reasons PCY is Preferred Over Apriori:
→ Efficient Memory Usage
PCY uses a hash table to count occurrences of item pairs during the first pass.
→ If a bucket in the hash table exceeds the support threshold, it is marked in a bitmap (1
= frequent, 0 = not frequent).
→ This way, infrequent pairs can be filtered early, avoiding unnecessary storage and
computation.
→ Reduces Candidate Generation
PCY avoids generating candidate pairs that fall into infrequent hash buckets, unlike
Apriori which generates all possible pairs.
→ Faster Performance
Fewer candidate itemsets lead to less time spent scanning the database and faster
support counting.
→ Scalable for Large Datasets
Because it consumes less memory and processes fewer candidate pairs, PCY is more
scalable for datasets with many items and transactions.
Example Difference:
→ Suppose we have 1,000 frequent items.
Apriori would generate ~500,000 candidate pairs (using combinations).
PCY would use a hash function to reduce the number of these pairs significantly
before counting their actual support.
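A compact sketch of the PCY idea on a toy basket dataset: pass 1 counts items and hashes pairs into buckets, a bitmap marks frequent buckets, and pass 2 counts only pairs that survive the filter. The transactions, support threshold, and bucket count are assumptions for illustration only.

```python
# PCY sketch: hash pairs into buckets on pass 1, keep only pairs in frequent buckets on pass 2.
# Data, threshold, and bucket count are illustrative; a real system uses a fixed hash function.
from collections import Counter
from itertools import combinations

transactions = [["bread", "milk"], ["bread", "butter"], ["milk", "butter"],
                ["bread", "milk", "butter"], ["bread", "milk"]]
SUPPORT = 2
NUM_BUCKETS = 11

def bucket(pair):
    return hash(pair) % NUM_BUCKETS

# Pass 1: count single items and count pair buckets
item_counts = Counter()
bucket_counts = Counter()
for t in transactions:
    item_counts.update(t)
    for pair in combinations(sorted(t), 2):
        bucket_counts[bucket(pair)] += 1

frequent_items = {i for i, c in item_counts.items() if c >= SUPPORT}
bitmap = {b for b, c in bucket_counts.items() if c >= SUPPORT}   # frequent buckets only

# Pass 2: count only candidate pairs of frequent items that hash to a frequent bucket
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        if pair[0] in frequent_items and pair[1] in frequent_items and bucket(pair) in bitmap:
            pair_counts[pair] += 1

print({p: c for p, c in pair_counts.items() if c >= SUPPORT})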
Ques-Explain Datar-Gionis-Indyk-Motwani (DGIM) algorithm for counting
oneness in a window. (AKTU)
DGIM Algorithm (Datar–Gionis–Indyk–Motwani)
1. Problem Statement
Suppose you have a binary data stream (sequence of 0s and 1s) arriving continuously,
and you want to efficiently count the number of 1s in the last k bits (a sliding window of
size k).
Challenge: Storing the entire window uses too much space.
Goal: Count 1s approximately using logarithmic space, with guaranteed error bounds.
2. DGIM Algorithm – Key Idea
→ Instead of storing every bit, the DGIM algorithm stores summarized information using
buckets.
→ Each bucket contains a timestamp and a count of 1s it represents.
→ The algorithm approximates the number of 1s using a small number of buckets, with a
maximum error of 50% for the last bucket only.
3. Bucket Properties
→ Buckets store only 1s.
→ The number of 1s in a bucket is always a power of 2: 1, 2, 4, 8, …
→ For each power of 2, there can be at most two buckets.
4. Working of the Algorithm
Step 1: Initialize an empty list of buckets.
Step 2: Process the stream bit by bit as each new bit arrives (the newest bit sits at the right end of the stream)
→ If the incoming bit is 0
Do nothing.
→ If the incoming bit is 1
Create a new bucket of size 1 with a timestamp (position in stream).
If there are more than two buckets of the same size, merge the two oldest into one of
double the size.
Step 3: At any point, to estimate the count of 1s in the last k bits:
→ Traverse the buckets from newest to oldest.
→ Add the sizes of buckets whose timestamps fall within the window.
→ For the last bucket that only partially overlaps, add half its size (as an upper-bound
estimate).
5. Example
Let the data stream be:
... 1 0 1 1 0 1 1 1 0 1 (last 10 bits)
→ Buckets might look like:
One bucket of size 4 (the oldest bucket, only partially inside the window)
Two buckets of size 2
One bucket of size 1
To estimate the number of 1s in the last k bits:
→ Add the full sizes of the buckets entirely inside the window plus half the size of the oldest, partially overlapping bucket: 1 + 2 + 2 + (½ × 4) = 7 (approx.)
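A simplified sketch of DGIM bucket maintenance and estimation for an assumed window size N. It follows the rules described above (at most two buckets per size, merge the two oldest on overflow, count half of the oldest bucket), but it is illustrative rather than production-ready.

```python
# Simplified DGIM: keep (timestamp, size) buckets, at most two per size,
# and estimate the count of 1s in the last N bits. Illustrative sketch.
N = 10                      # window size (assumed)
buckets = []                # list of (timestamp, size), newest first
t = 0                       # current bit position

def add_bit(bit):
    global t
    t += 1
    # drop buckets that have slid entirely out of the window
    while buckets and buckets[-1][0] <= t - N:
        buckets.pop()
    if bit == 1:
        buckets.insert(0, (t, 1))
        # merge whenever three buckets share the same size: combine the two oldest of them
        i = 0
        while i + 2 < len(buckets):
            if buckets[i][1] == buckets[i + 1][1] == buckets[i + 2][1]:
                ts = buckets[i + 1][0]             # timestamp of the newer merged bucket
                size = buckets[i + 1][1] * 2
                buckets[i + 1:i + 3] = [(ts, size)]
            else:
                i += 1

def estimate():
    if not buckets:
        return 0
    total = sum(size for _, size in buckets[:-1])
    return total + buckets[-1][1] // 2             # count half of the oldest bucket

for b in [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]:
    add_bit(b)
print("Estimated 1s in last", N, "bits:", estimate())
```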
6. Advantages
→ Efficient memory usage: Uses O(log² N) space for a window of size N.
→ Fast update and query: Both insertion and counting are efficient.
→ Bounded error: the estimate is off by at most half the size of the oldest bucket, which bounds the relative error at roughly 50%.
7. Applications
→ Network traffic monitoring
→ Clickstream analysis
→ Real-time data mining
→ Counting events in time-based windows
Ques-Provide an in-depth comparison between the CLIQUE and ProCLUS
clustering algorithms. How do these methods handle challenges such as
noise, outliers, and varying cluster shapes? (AKTU)
Comparison Between CLIQUE and ProCLUS Clustering Algorithms
| Aspect | CLIQUE (Clustering In QUEst) | ProCLUS (Projected Clustering) |
| --- | --- | --- |
| Type of Algorithm | Grid-based, density-based subspace clustering | Partition-based subspace clustering (extension of k-medoid) |
| Main Idea | Divides space into grid cells and finds dense regions in subspaces | Projects data onto subspaces and performs clustering using medoids |
| Cluster Shape Handling | Can find arbitrarily shaped clusters in subspaces | Works well with globular clusters (similar to k-medoid) |
| Dimensionality Handling | Identifies clusters in different subspaces independently | Assigns each cluster to a different subspace |
| Grid-Based Approach | Yes – splits each dimension into intervals and creates grid units | No – does not use grids, uses medoid and distance measures |
| Parameter Sensitivity | Requires parameters like grid size and density threshold | Requires number of clusters (k) and average number of dimensions per cluster (l) |
| Noise & Outlier Handling | Robust to noise – sparse grid cells are discarded | Handles outliers moderately well, but not as robust as CLIQUE |
| Efficiency | Highly scalable – uses a bottom-up approach | More efficient than full-dimensional clustering, but less than CLIQUE |
| Interpretability | Good – clusters exist in clearly defined subspaces/grids | Moderate – depends on distance from medoids in projected subspaces |
| Complexity | Lower – due to grid structure and pruning | Higher – due to iterative optimization and subspace selection |
| Output Type | Overlapping clusters possible | Produces disjoint clusters |
| Cluster Overlap | Supports overlapping clusters | Does not support overlapping clusters |
| Examples of Use | Bioinformatics, astronomy, sensor data | Market segmentation, customer data clustering |
Handling of Challenges
1. Noise and Outliers
→ CLIQUE
→ Discards sparse grid cells, making it very robust to noise and outliers.
→Since it’s density-based, isolated points do not form clusters.
→ ProCLUS
→ Less robust than CLIQUE; some outliers may be included in clusters if they are close to
medoids.
→ Uses medoid selection to minimize this effect, but still not ideal for noisy data.
2. Varying Cluster Shapes
→ CLIQUE
→ Can detect arbitrarily shaped clusters, as it is grid and density-based.
→Does not assume any cluster shape.
→ ProCLUS
→ Best suited for spherical or convex clusters, similar to k-medoids.
→ Struggles with non-globular or irregular cluster shapes.
3. High Dimensionality / Subspaces
→ CLIQUE
→ Efficiently identifies dense regions in multiple subspaces without checking all
combinations.
→Bottom-up pruning avoids unnecessary computations.
→ ProCLUS
→ Projects each cluster into its own relevant subspace, reducing the curse of
dimensionality.
→ Learns relevant dimensions per cluster dynamically.
Ques-Compare various types of support vector and kernel methods of
data analysis. (AKTU)
1. Introduction
→ Support Vector Machines (SVM) are supervised learning models used for
classification, regression, and outlier detection.
→ Kernel methods enable SVMs to work efficiently in high-dimensional or non-linearly
separable spaces by transforming the data using kernel functions.
2. Types of Support Vector Approaches
| Type of SVM | Purpose | Key Features |
| --- | --- | --- |
| Linear SVM | Binary classification when data is linearly separable | Finds a straight hyperplane that best separates two classes |
| Non-Linear SVM | Classification when data is not linearly separable | Uses kernel trick to map data into higher-dimensional space |
| Soft Margin SVM | Allows misclassifications | Adds slack variables to tolerate noisy or overlapping data |
| Hard Margin SVM | No misclassification allowed | Assumes data is perfectly separable; sensitive to outliers |
| Support Vector Regression (SVR) | Predicts continuous values | Uses a margin of tolerance (epsilon) for regression instead of classification |
| One-Class SVM | Anomaly or outlier detection | Trained on only one class to detect deviations or novelties |
3. Kernel Methods in SVM
What is a Kernel?
A kernel function computes the similarity between two data points in a transformed
(high-dimensional) space without computing the transformation explicitly.
4. Common Kernel Types and Their Use Cases
| Kernel Type | Kernel Function | Use Case / Data Nature | Characteristics |
| --- | --- | --- | --- |
| Linear Kernel | K(x, y) = xᵀy | Linearly separable data | Simple, fast; equivalent to standard linear SVM |
| Polynomial Kernel | K(x, y) = (xᵀy + c)^d | Polynomial relationships among features | Captures non-linear patterns; degree d controls complexity |
| Radial Basis Function (RBF) / Gaussian Kernel | K(x, y) = exp(−γ‖x − y‖²) | Most commonly used; unknown data distribution | Maps data to infinite-dimensional space; effective for non-linear classification |
| Sigmoid Kernel | K(x, y) = tanh(αxᵀy + c) | Neural-net-like behavior | Related to perceptron models; not always positive semi-definite |
| Custom Kernel | User-defined similarity functions | Domain-specific problems | Can be tailored for string, graph, or image data |
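A short sketch comparing these kernels on the same non-linearly separable dataset using scikit-learn's SVC; the dataset and parameter values are illustrative assumptions.

```python
# Comparing SVM kernels on the same data with scikit-learn (illustrative sketch).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)   # non-linearly separable data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel, params in [("linear", {}),
                       ("poly", {"degree": 3}),
                       ("rbf", {"gamma": "scale"}),
                       ("sigmoid", {})]:
    clf = SVC(kernel=kernel, C=1.0, **params).fit(X_train, y_train)
    print(f"{kernel:8s} kernel -> test accuracy: {clf.score(X_test, y_test):.3f}")
```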
5. Comparison Based on Challenges and Strengths
| Criteria | Linear SVM | Non-Linear SVM with RBF/Poly | SVR | One-Class SVM |
| --- | --- | --- | --- | --- |
| Data Linearity | Works for linear | Works for non-linear | Works for regression | Works for novelty detection |
| Scalability | High | Lower than linear | Moderate | Lower for large datasets |
| Handling Outliers | Soft margin needed | Soft margin + kernel trick | Tolerant via epsilon | Sensitive to noise |
| Interpretability | High (simple hyperplane) | Moderate–Low (transformed space) | Moderate | Moderate |
| Kernel Dependency | No (doesn't use kernel) | Yes | Yes | Yes |
6. Applications
| Application Area | SVM/Kernel Used |
| --- | --- |
| Text Classification | Linear SVM (high-dimensional sparse data) |
| Image Recognition | RBF Kernel SVM |
| Financial Forecasting | SVR (Support Vector Regression) |
| Anomaly Detection in Networks | One-Class SVM |
| Bioinformatics (e.g., gene classification) | Polynomial / RBF Kernel SVM |
Ques-In the context of stream data, explain different approaches for
counting distinct elements. How do these methods address challenges
associated with continuously changing data? (AKTU)
Counting Distinct Elements in Stream Data
1. Introduction
→ In data stream environments, data arrives continuously at high speed and in large
volumes.
→ A common task is to count the number of unique (distinct) elements in the stream.
→ Traditional methods fail due to limitations like:
→ Limited memory availability
→ Inability to store or revisit past data
→ Requirement for real-time processing
2. Approaches for Counting Distinct Elements
A. Naive Approach (Exact Counting)
→ Maintains a hash table or set containing all unique elements encountered.
→ Every new element is checked and stored if not already present.
Limitations:
→ Requires memory proportional to the number of distinct elements (O(n)).
→ Not scalable for high-speed or unbounded data streams.
B. Approximate and Probabilistic Methods
→ These methods estimate the count of distinct elements using hashing and compact
data structures.
1. Flajolet-Martin Algorithm (FM)
→ Hashes each element and examines the position of the rightmost 1-bit in the hash.
→ Keeps track of the maximum number of trailing zeros among all hashes.
→ Uses this to estimate the number of distinct elements.
Advantages:
→Space-efficient
→Suitable for streaming environments
Disadvantages:
→ Can have high variance; typically used with multiple hash functions to improve
accuracy
2. HyperLogLog
→ Enhanced version of Flajolet-Martin that divides data into multiple buckets.
→ Each bucket tracks the maximum trailing zeros separately.
→ Averages over all buckets to get the final estimate.
Advantages:
→ Very low memory usage
→ High accuracy
→ Widely adopted in practice (used by Redis, Google BigQuery)
3. Linear Counting
→ Uses a bit array where each element is hashed to a specific bit position.
→ The number of unset bits (zeros) is used to estimate how many distinct elements have
been seen.
Advantages:
→ Simple and efficient for relatively small cardinalities
Limitations:
→ Accuracy decreases when many bits in the array are set (saturation)
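A compact sketch of linear counting: hash each element to one bit position and estimate n ≈ −m · ln(V), where V is the fraction of bits still zero. The array size and the example stream are assumed values for illustration.

```python
# Linear counting sketch: estimate distinct elements from the fraction of unset bits.
import hashlib
import math

M = 1 << 12                      # bit-array size (illustrative)
bits = [0] * M

def add(item):
    h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
    bits[h % M] = 1              # hash each element to one bit position

stream = ["ip-%d" % (i % 1500) for i in range(50_000)]   # 1500 distinct values
for item in stream:
    add(item)

zero_fraction = bits.count(0) / M
estimate = -M * math.log(zero_fraction)                  # n ≈ -m * ln(V)
print("True distinct: 1500, linear-counting estimate:", round(estimate))
```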
C. Sketch-Based Methods
Count-Min Sketch with Distinct Counting Variants
→ Designed primarily for frequency estimation, but can be combined with methods like
MinHash to estimate distinct counts.
→ Uses multiple hash functions and a 2D array of counters.
Advantages:
→ Space- and time-efficient
→ Suitable for distributed streaming systems
D. Sliding Window Models
→ In many applications, it is necessary to count distinct elements only within the most
recent window of time or data.
DGIM (Datar–Gionis–Indyk–Motwani) Algorithm
→ Initially designed to count the number of 1s in the last k bits of a binary stream.
→ Uses exponentially increasing bucket sizes to summarize the stream compactly.
→ Can be adapted for distinct element estimation over sliding windows.
Advantages:
→ Supports window-based analysis
→ Uses logarithmic memory with respect to the window size
Ques-Explore the challenges and considerations when performing
clustering in non-Euclidean spaces. How do distance metrics and
similarity measures differ in non-Euclidean environments, and what
impact does this have on clustering outcomes? (AKTU)
Clustering in Non-Euclidean Spaces
1. Introduction
→ Clustering is a fundamental task in data mining and machine learning, aiming to group
similar data points.
→ Many traditional clustering algorithms (e.g., K-means) assume that the data lies in a
Euclidean space, where distances are calculated using the Euclidean distance.
→ However, in many real-world scenarios (e.g., graphs, text, biological sequences), the
data may exist in a non-Euclidean space, where Euclidean assumptions do not hold.
2. What is a Non-Euclidean Space?
→ A non-Euclidean space is one where the geometry does not follow Euclidean
principles, such as the Pythagorean theorem.
→ Common examples include:
→ Graphs and networks
→ Manifolds
→ Discrete structures (e.g., strings, trees, sequences)
→ High-dimensional and sparse data (e.g., text data with cosine similarity)
3. Challenges in Non-Euclidean Clustering
A. Choice of Distance/Similarity Measure
→ Euclidean distance may not be meaningful in non-Euclidean data.
→ Alternative measures must be chosen based on the data type and application.
Examples:
→ Cosine similarity for text or document data
→ Edit distance (Levenshtein) for strings
→ Graph distance for network nodes
→ Dynamic Time Warping (DTW) for time series
B. Violation of Metric Properties
→ Many similarity measures in non-Euclidean spaces do not satisfy metric properties
(e.g., triangle inequality, symmetry).
→ Algorithms like K-means, which rely on centroids and metric space properties, may fail
or give misleading results.
C. Inapplicability of Centroids
→ In spaces like graphs or sequences, the concept of a mean or centroid is not well-
defined.
→ Algorithms relying on centroid computation (e.g., K-means) are unsuitable without
adaptation.
D. Curse of Dimensionality
→ In high-dimensional non-Euclidean spaces (e.g., text, gene data), distance measures
tend to lose discriminatory power.
→ All points may appear to be at a similar distance, leading to poor clustering quality.
E. Visualization and Interpretation
→ Non-Euclidean spaces are harder to visualize.
→ Interpretation of clusters and distances may become non-intuitive.
4. Considerations in Choosing Distance/Similarity Measures
A. Data Type Awareness
→ Choose distance measures tailored to the data type:
→→ Use cosine similarity for sparse vectors
→→ Use Jaccard distance for set-based data
→→ Use graph-based measures for networked data
B. Metric vs. Non-Metric Spaces
→ Ensure that chosen distance measures satisfy at least some metric properties when
possible.
→ In non-metric scenarios, consider kernel methods or embedding techniques.
C. Computational Complexity
→ Non-Euclidean distance computations (e.g., edit distance, DTW) can be expensive.
→ Clustering algorithms may need to be adapted for performance.
5. Impact on Clustering Algorithms
A. K-means
→ Performs poorly in non-Euclidean spaces due to reliance on arithmetic mean and
Euclidean distance.
→ Use alternatives like K-medoids or DBSCAN which are less dependent on Euclidean
geometry.
B. K-medoids (PAM, CLARANS)
→ Selects actual data points (medoids) instead of calculating means.
→ Can be used with any distance metric, making it suitable for non-Euclidean spaces.
C. DBSCAN and Density-Based Methods
→ Do not assume any geometry; based on density estimation using arbitrary distance
measures.
→ Suitable for irregular-shaped clusters and non-Euclidean data.
D. Spectral Clustering
→ Uses graph representations and eigenvalue decomposition.
→ Converts similarity matrix into a lower-dimensional space for clustering.
→ Particularly effective for graph and manifold-structured data.
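As an illustration of clustering with a non-Euclidean measure, the sketch below clusters strings with DBSCAN over a precomputed edit-distance matrix; the word list, eps, and min_samples values are assumptions made for demonstration.

```python
# Clustering strings (non-Euclidean data) with DBSCAN on a precomputed edit-distance matrix.
import numpy as np
from sklearn.cluster import DBSCAN

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[len(a)][len(b)]

words = ["kitten", "sitten", "mitten", "banana", "bananas", "cabana"]
D = np.array([[edit_distance(a, b) for b in words] for a in words])

labels = DBSCAN(eps=2, min_samples=2, metric="precomputed").fit_predict(D)
print(dict(zip(words, labels)))   # similar spellings land in the same cluster; -1 marks noise
```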
Ques-Discuss the case study of stock market predictions in detail. (AKTU)
Case Study: Stock Market Prediction
1. Introduction
→ The stock market is a dynamic financial environment where investors buy and sell
shares of companies.
→ Predicting stock prices or movements has long been a critical and complex task due to
market volatility and numerous influencing factors.
→ This case study explores the use of data science, machine learning, and deep
learning techniques to predict stock prices or trends.
2. Objectives of the Study
→ To predict future stock prices or trends using historical data.
→ To analyze the performance of different prediction models.
→ To compare statistical methods vs machine learning/deep learning techniques.
→ To identify key features affecting stock prices such as volume, moving averages, news
sentiment, etc.
3. Data Collection
→ Source: Historical stock data from sources like Yahoo Finance, Google Finance, Alpha
Vantage, Quandl, etc.
→ Attributes collected:
→ Open, High, Low, Close Prices (OHLC)
→ Volume
→ Moving Averages (SMA, EMA)
→ Technical Indicators (RSI, MACD)
→ News sentiment (optional for hybrid models)
4. Data Preprocessing
→ Handling missing values
→ Feature engineering: Calculating new attributes like moving averages, lag variables,
returns, volatility
→ Normalization or scaling of data
→ Splitting data into training and testing sets
→ Time-series formatting: Using windows or sequences of previous values as input for
prediction models
5. Models Used
A. Traditional Statistical Models
1. ARIMA (AutoRegressive Integrated Moving Average)
→ Used for univariate time-series forecasting.
→ Assumes stationarity in the time-series.
→ Parameters: p (autoregressive), d (differencing), q (moving average)
Limitations:
→ Cannot handle non-linear relationships
→ Requires manual tuning and assumption checking
B. Machine Learning Models
1. Linear Regression
→ Simple model assuming linear relationship between features and stock price.
→ Performs poorly with complex time-series data.
2. Random Forest Regression
→ An ensemble of decision trees to capture non-linear relationships.
→ Handles feature importance well but may not capture time-dependence effectively.
3. Support Vector Regression (SVR)
→ Effective for non-linear regression with kernel tricks.
→ Suitable for medium-sized datasets.
C. Deep Learning Models
1. LSTM (Long Short-Term Memory)
→ A special type of Recurrent Neural Network (RNN) that captures long-term
dependencies in time-series data.
→ Input: sequences of past data (e.g., 60 previous days)
→ Output: next day’s price or trend
Advantages:
→ Captures temporal patterns
→ Deals well with sequential data
2. GRU (Gated Recurrent Unit)
→ Similar to LSTM with fewer parameters
→ Faster training and comparable accuracy
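A minimal Keras sketch of the LSTM setup described above (60 past values in, next value out). It assumes TensorFlow is installed and uses a synthetic random-walk series; the layer sizes and training settings are illustrative, not a tuned trading model.

```python
# LSTM sketch: predict the next value from the previous 60 (illustrative, not a trading model).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

WINDOW = 60
series = np.cumsum(np.random.randn(2000)) + 100      # synthetic "price" series

# Build (samples, timesteps, features) windows and next-step targets
X = np.array([series[i:i + WINDOW] for i in range(len(series) - WINDOW)])[..., None]
y = series[WINDOW:]

model = keras.Sequential([
    layers.Input(shape=(WINDOW, 1)),
    layers.LSTM(50),                 # captures temporal dependencies across the window
    layers.Dense(1),                 # next-day value
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.1, verbose=0)

next_value = model.predict(series[-WINDOW:][None, :, None], verbose=0)[0, 0]
print("Predicted next value:", next_value)
```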
6. Evaluation Metrics
→ MAE (Mean Absolute Error)
→ RMSE (Root Mean Squared Error)
→ MAPE (Mean Absolute Percentage Error)
→ Accuracy (for classification-based trend prediction)
→ R-squared (for regression performance)
7. Key Findings
→ ARIMA works well for short-term, linear patterns but fails on sudden shocks or non-
linear changes.
→ Random Forest and SVR offer better performance than linear models but need careful
feature selection.
→ LSTM models outperform others in capturing sequential patterns, especially with
sufficient historical data.
→ Hybrid models (e.g., LSTM + sentiment analysis) improve performance when
combined with external signals like news or macroeconomic indicators.
8. Challenges in Stock Market Prediction
→ High volatility and noise in financial data
→ External events (e.g., economic reports, geopolitical changes) not captured by
historical prices
→ Overfitting due to limited training data
→ Delayed reaction to news in market prices
→ Non-stationarity of time series
9. Real-World Applications
→ Automated trading systems (Algo-Trading)
→ Stock screening and portfolio management
→ Financial risk analysis
→ Investor decision support systems
Ques-What is Prediction error? With the help of a suitable example, explain
prediction error in classification and regression. (AKTU)
Prediction Error
1. Definition
→ Prediction error is the difference between the actual (true) value and the predicted
value made by a machine learning model.
→ It helps assess how accurately a model performs on unseen data.
→ A lower prediction error indicates better model performance.
2. Types of Prediction Errors
→ Regression Prediction Error → Difference between actual and predicted continuous
values
→ Classification Prediction Error → Incorrect class labels assigned by the model
3. Prediction Error in Regression
Explanation:
→ In regression, the model predicts a continuous numeric value.
→ The prediction error is calculated as:
→→ Error = Actual Value − Predicted Value
→ Common metrics used:
→→ MAE (Mean Absolute Error)
→→ MSE (Mean Squared Error)
→→ RMSE (Root Mean Squared Error)
Example:
Suppose you're predicting the price of a house.
| House | Actual Price (₹ lakhs) | Predicted Price (₹ lakhs) | Error |
| --- | --- | --- | --- |
| A | 75 | 70 | 5 |
| B | 60 | 62 | -2 |
| C | 90 | 85 | 5 |
→ These differences (errors) are then used to calculate overall performance.
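The same numbers can be turned into MAE and RMSE with a few lines of Python (scikit-learn is assumed to be available):

```python
# Computing regression prediction error for the house-price example above (illustrative).
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual = [75, 60, 90]       # ₹ lakhs
predicted = [70, 62, 85]

errors = [a - p for a, p in zip(actual, predicted)]       # 5, -2, 5
mae = mean_absolute_error(actual, predicted)              # mean of |errors| = 4.0
rmse = mean_squared_error(actual, predicted) ** 0.5       # sqrt of mean squared error ≈ 4.24
print("Errors:", errors, "MAE:", mae, "RMSE:", round(rmse, 2))
```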
4. Prediction Error in Classification
Explanation:
→ In classification, the model predicts discrete class labels (e.g., "Yes" or "No").
→ The prediction error is the count or proportion of misclassified instances.
→ Common metrics:
→→ Accuracy = (Correct Predictions / Total Predictions)
→→ Error Rate = (Incorrect Predictions / Total Predictions)
→→ Also evaluated using Confusion Matrix, Precision, Recall, and F1 Score
Example:
Suppose you're predicting whether a student will pass or fail.
| Student | Actual Outcome | Predicted Outcome | Error |
| --- | --- | --- | --- |
| 1 | Pass | Pass | No Error |
| 2 | Fail | Pass | Misclassified |
| 3 | Pass | Fail | Misclassified |
→ Out of 3 predictions, 2 are incorrect → Error Rate = 2 / 3 = 66.7%
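The same example expressed in Python, computing accuracy, error rate, and a confusion matrix (scikit-learn assumed):

```python
# Computing classification prediction error for the pass/fail example above (illustrative).
from sklearn.metrics import accuracy_score, confusion_matrix

actual    = ["Pass", "Fail", "Pass"]
predicted = ["Pass", "Pass", "Fail"]

accuracy = accuracy_score(actual, predicted)       # 1 correct out of 3 ≈ 0.333
error_rate = 1 - accuracy                          # ≈ 0.667 (66.7%)
print("Accuracy:", round(accuracy, 3), "Error rate:", round(error_rate, 3))
print(confusion_matrix(actual, predicted, labels=["Pass", "Fail"]))
```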
Ques-Draw and discuss the architecture of Hive in detail. (AKTU)
1. User Interface
→ Hive supports multiple ways to interact:
→ CLI (Command Line Interface)
→ Web UI
→ JDBC/ODBC drivers (to connect with BI tools like Tableau, Power BI)
2. Driver
→ Acts as the controller of the lifecycle of a HiveQL query.
→ Responsibilities include:
→→ Receiving the query from the user interface
→→ Creating sessions and monitoring query execution
→→ Passing the query to the compiler
3. Compiler
→ Translates HiveQL queries into an execution plan.
→ Tasks performed:
→→ Parsing the query to check syntax and build an Abstract Syntax Tree (AST)
→→ Semantic analysis: checking metadata, validating tables/columns using Metastore
→→ Generating an optimized logical and physical query execution plan
4. Execution Engine
→ Executes the tasks based on the query plan.
→ Converts the logical plan into physical jobs—traditionally MapReduce jobs, but also
supports Apache Tez or Spark.
→ Submits jobs to the underlying Hadoop/YARN cluster.
→ Monitors job execution and handles task failures.
5. Metastore
→ Central repository of Hive metadata.
→ Stores information about:
→→ Databases, tables, columns, data types
→→ Partitions and buckets
→→ Location of data in HDFS or other file systems
→ Metastore can be embedded (derby) or run as a standalone service (MySQL,
PostgreSQL, etc.)
→ Enables Hive to be schema-on-read and supports schema evolution.
Additional Components
Driver Session Handler: Manages multiple queries and sessions.
Optimizer: Improves query execution plans by applying rule-based or cost-based
optimization techniques.
Ques-Differentiate between analysis and reporting in the context of data
analytics. How do these two aspects contribute to the overall
understanding of data? (AKTU)
| Aspect | Analysis | Reporting |
| --- | --- | --- |
| Definition | Examining and interpreting data to extract insights | Presenting data in a structured, summarized format |
| Purpose | To understand patterns, causes, and make predictions | To show what has happened in a clear format |
| Nature | Exploratory, diagnostic, predictive, prescriptive | Descriptive and routine |
| Questions Answered | "Why did it happen?", "What might happen?" | "What happened?" |
| Approach | In-depth analysis using models, statistics, or ML | Summarizing and visualizing data |
| Output Format | Insights, trends, forecasts, recommendations | Dashboards, charts, tables, static reports |
| Frequency | On demand or as needed for decision-making | Regular (daily, weekly, monthly) |
| Tools Used | Python, R, SQL, Jupyter, Power BI (analytics) | Excel, Tableau, Power BI, Google Data Studio |
| User Focus | Data analysts, data scientists, decision-makers | Business managers, stakeholders, executives |
Ques-Explore modern data analytic tools and their functionalities. How
have these tools transformed the landscape of data analytics? (AKTU)
Modern Data Analytics Tools and Their Functionalities
| Tool | Functionalities |
| --- | --- |
| Power BI | Data visualization, real-time dashboards, DAX functions, data modeling |
| Tableau | Interactive visual analytics, drag-and-drop interface, advanced charting |
| Google Data Studio | Free, cloud-based reporting tool, integrates with Google products, dashboards |
| Excel (Modern) | PivotTables, Power Query, data analysis add-ins, charts |
| Python (with Pandas, NumPy, Matplotlib) | Data cleaning, manipulation, statistical analysis, visualizations |
| R | Statistical computing, predictive modeling, data visualization (ggplot2) |
| Apache Spark | Big data processing, distributed computing, supports SQL, MLlib for ML |
| RapidMiner | Drag-and-drop analytics workflows, ML model building, no-code solution |
| KNIME | Open-source data analytics platform, visual workflows, supports scripting |
| Qlik Sense | Associative data model, real-time analytics, embedded analytics capabilities |
How These Tools Have Transformed Data Analytics
→ 1. Automation of Data Processing:
Tools like Power BI, Tableau, and Spark automate data integration, transformation, and
visualization, reducing manual effort and time.
→ 2. Democratization of Data Access:
Non-technical users can now explore and interpret data using intuitive interfaces (e.g.,
Google Data Studio, Tableau).
→ 3. Real-Time Analytics:
Real-time dashboards and alerts allow businesses to respond quickly to changes (e.g.,
Power BI, Qlik Sense).
→ 4. Scalability for Big Data:
Tools like Apache Spark and cloud-based platforms handle massive datasets efficiently.
→ 5. Integration of AI/ML:
Python, R, and RapidMiner support advanced machine learning and predictive modeling,
enabling smarter decision-making.
→ 6. Enhanced Collaboration:
Cloud tools like Google Data Studio and Power BI allow teams to collaborate on reports
and dashboards simultaneously.
→ 7. Visual Storytelling:
Interactive visuals and dashboards help convey insights more clearly and impactfully
than static reports.
→ 8. Customization and Extensibility:
Tools offer scripting support (Python, R, DAX) to create customized analytics solutions
tailored to specific needs.
→ 9. Cost Efficiency:
Open-source and freemium tools (e.g., KNIME, Google Data Studio) reduce entry barriers
for organizations.
→ 10. Improved Decision-Making:
With faster access to insights, decision-makers can take data-driven actions promptly,
improving business performance.
Ques-Explain various phases of Data Analytics Life Cycle. (AKTU)
The Data Analytics Life Cycle refers to a structured process that guides how data is
collected, processed, analyzed, and used to make decisions. It ensures that data analysis
is systematic, efficient, and valuable to the organization.
Below are the main phases of the Data Analytics Life Cycle, explained in detail:
1. Discovery (Problem Identification)
→ Purpose: Understand the business problem or objective clearly.
→ Activities:
Define the problem statement.
Identify key stakeholders.
Assess available resources (people, tools, time).
→ Outcome: A well-defined problem to solve using data.
2. Data Collection (Data Preparation)
→ Purpose: Gather data from various internal and external sources.
→ Activities:
Identify relevant data sources (databases, APIs, files).
Collect structured and unstructured data.
Document metadata and data formats.
→ Outcome: Raw data ready for preprocessing.
3. Data Cleaning and Preparation (Data Wrangling)
→ Purpose: Clean and organize data for analysis.
→ Activities:
Remove missing, duplicate, or inconsistent values.
Normalize and transform data.
Feature selection and engineering.
→ Outcome: High-quality, structured dataset.
4. Data Exploration and Analysis
→ Purpose: Understand the data through exploration and visualization.
→ Activities:
Use descriptive statistics and visualizations.
Identify patterns, correlations, and outliers.
Form hypotheses and begin model development.
→ Outcome: Initial insights and analysis direction.
5. Model Building (Advanced Analytics)
→ Purpose: Apply statistical or machine learning models to analyze data.
→ Activities:
Select appropriate models (e.g., regression, classification).
Train and test models.
Evaluate performance using metrics (accuracy, precision, etc.).
→ Outcome: Validated and optimized analytical model.
6. Deployment (Operationalization)
→ Purpose: Integrate the model into the business environment for use.
→ Activities:
Deploy the model in applications or dashboards.
Automate workflows and reporting.
Ensure accessibility for end-users.
→ Outcome: Model becomes a part of real-time decision-making processes.
7. Communication of Results
→ Purpose: Share findings with stakeholders in an understandable form.
→ Activities:
Create dashboards, reports, or presentations.
Explain insights and recommendations.
Provide actionable takeaways.
→ Outcome: Data-driven decisions made by business users.
8. Feedback and Iteration
→ Purpose: Continuously improve the model and process.
→ Activities:
Monitor model performance over time.
Update models with new data.
Incorporate user feedback.
→ Outcome: Refined analytics system that evolves with changing needs.
Thank you!!!
For notes, visit: http://lepic.mzelo.com