Introduction to Data Mining

DMML Module 1

What is Data?
• Data is any factual information or measurement that is collected and used for
making a decision, reasoning, or any calculation.
• We need to process data in order to extract and understand the information it contains.
• Data can be either qualitative (expressed as text – e.g., the colour of a product or a description of its features) or quantitative (expressed as numbers – e.g., the number of green items is 4).
What is Data Mining?
• Data Mining is defined as extracting information from huge sets of data.
• In other words, we can say that data mining is the procedure of mining
knowledge from data.
• Data mining is the use of efficient techniques for the analysis of very large
collections of data and the extraction of useful and possibly unexpected
patterns in data.

Why do we need Data Mining?


• Huge amounts of data are collected these days from different application domains. The amount and complexity of the collected data do not allow for manual analysis.
• Terabytes of data are generated every second by various devices across various platforms.
• It has often been observed that large amounts of data are more valuable and powerful than complex algorithms and models.
• These days, the collected data is one of the biggest assets of an online
company. Ex: Google search queries, tweets on Twitter, shopping history on
Amazon
• Thus, there is a need to harness the collective intelligence and knowledge from
these vast amounts of data. This is where Data Mining comes into the picture.
Types of Data

• Quantitative versus Qualitative data


o Quantitative Data (Numerical Data): This data can be described using numbers,
and basic mathematical procedures, including addition, are possible on the set.
They can be counted or measured.
▪ Discrete: – Represents the items that can be counted. The list of
possible values may be fixed (also called finite); or it may go from 0, 1,
2, on to infinity (making it countably infinite).
▪ Continuous - Represent measurements; their possible values cannot be
counted and can only be described using intervals on the real number
line
o Qualitative Data (Categorical Data): This data cannot be described using
numbers and basic mathematical operations cannot be performed on them.
This data is generally thought of as being described using "natural" categories
and language
▪ Nominal: Here data is described purely by name or category or label
with no magnitude attached. Ex: Gender, Nationality
▪ Ordinal: When a magnitude or ordering is attached to the categories, so that the data can be ranked, the data is ordinal. Data at the ordinal level provides a rank order, or the means to place one observation before another. Ex: Very Good, Good, Average, Poor
• Record data
o Data that consists of a collection of records, each of which consists of a fixed set
of attributes. Ex: voters list
• Transactional Data
o In general, each record in a transactional database captures a transaction, such
as a customer’s purchase, a flight booking or a user’s clicks on a webpage.
o A transaction typically includes a unique transaction identity number (trans ID)
and a list of the items making up the transaction, such as the items purchased
in the transaction.

• Temporal data
o Time Series Data: Time series data is a special type of sequential data in which
each record is a time series, i.e., a series of measurements taken over time. For
example, daily prices of various stocks.

o Sequence Data: Sequence data consists of a data set that is a sequence of


individual entities, such as a sequence of words or letters. It is quite similar to
sequential data, except that there are no time stamps; instead, there are
positions in an ordered sequence.

• Spatial Data
o Some objects have spatial attributes, such as positions or areas, as well as
other types of attributes. An example of spatial data is weather data that is
collected for a variety of geographical locations.
o Spatial databases are databases that, in addition to usual data, store
geographical information like maps, and global or regional positioning. Such
spatial databases present new challenges to data mining algorithms.
• Structured vs Unstructured
o Structured data is highly-organized and formatted so that it's easily searchable
in relational databases. Ex: names, dates, addresses, credit card numbers etc.
o Unstructured data has no predefined format or organization, making it much
more difficult to collect, process, and analyze. Ex: text, video files, audio files,
mobile activity, social media posts, satellite imagery, surveillance imagery etc.

3 Main Steps in Data Mining


• Mining consists of three major steps:
1. Explore the data to uncover themes and trends. This stage may include some
fairly complex analysis using a wide variety of statistical methods.
2. Build models to explain the data and identify patterns with validation and
verification. Multiple models are considered during this step.
3. Apply the model to new data to predict outcomes.

Knowledge Discovery in Databases (KDD) Steps (Data CISTM PE-KP)


1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the
database)
4. Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract
data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing
knowledge based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present mined knowledge to users)
• Steps 1 through 4 are different forms of data pre-processing, where data are
prepared for mining.
• The data mining step may interact with the user or a knowledge base.
• The interesting patterns are presented to the user and may be stored as new
knowledge in the knowledge base
Data Pre-processing - Need for it
• Measures for data quality:
o Accuracy: correct or wrong, accurate or not
o Completeness: Several attributes for various tuples have missing values, not
recorded
o Consistency: Eg. Department code used to categorize items may have
discrepancies
o Timeliness : Data not submitted on time
o Believability and Interpretability : Because of past errors, no trust in current
data
• Reasons for Inaccurate, incomplete and inconsistent Data
o Data collection instruments might be faulty
o Human or Computer errors while collecting data
o Users may purposely submit incorrect data
o Errors in data transmission
o Duplicate tuples

4 Major Steps in Data Pre-processing

• Data cleaning
o Fill in missing values, remove noise, identify or remove outliers, and resolve
inconsistencies
• Data integration
o Data is merged from multiple sources, so careful integration is required
o Data Migration tools, Data Synchronization tools
• Data reduction/selection
o Obtains a reduced representation of the data set that is much smaller in
volume, yet produces the nearly same/same analytical results.
o Dimensionality Reduction: In dimensionality reduction, data encoding
schemes are applied so as to obtain a reduced or “compressed”
representation of the original data.
o Numerosity reduction: In numerosity reduction, the data are replaced by
alternative, smaller representations using parametric models (e.g., regression
or log-linear models) or nonparametric models (e.g., histograms, clusters,
sampling, or data aggregation).
o Data compression
• Data transformation
o Normalization
o Transforming data to a single, consistent data type.
o Ex: All entries for a date field should be of the same type

Data Cleaning
• Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning
routines attempt to fill in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data.
• Lots of potentially incorrect data
o Incomplete: lacking attribute values or containing only aggregate data. Ex:
Occupation=“ ” (missing data)
o Noisy: containing noise, errors, or outliers. Ex: Salary = −10 (an error)
o Inconsistent: containing discrepancies in codes or names. Ex: Age = 42,
Birthday = 03/07/2010, discrepancy between duplicate records
o Intentional: disguised missing data Ex: Jan. 1 as everyone’s birthday

Ways to deal with Missing Values


• Ignore the tuple
o Usually done when the class label is missing (assuming task involves
classification)
o Not very effective, unless the tuple contains several attributes with missing
values.
o By ignoring the tuple, we do not make use of the remaining attributes’ values
in the tuple. Such data could have been useful to the task at hand.
• Fill in the missing value manually
o In general, this approach is time consuming and may not be feasible given a
large data set with many missing values.
• Use a global constant to fill in the missing value
o Replace all missing attribute values by the same constant such as a label like
“Unknown” or −∞. If missing values are replaced by, say, “Unknown,” then
the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common—that of “Unknown.” Hence,
although this method is simple, it is not foolproof.
• Use the mean or median to fill in the missing value
o For normal/symmetric data, mean can be used.
o But for skewed data, median is a better option
• Use the most probable value to fill in the missing value
o Replace the missing values using prediction models constructed using Data
Mining algorithms from the data that is already complete

Ways to deal with Noisy Data


• Noise is a random error or variance in a measured variable.
• Incorrect attribute values may be due to
o Faulty data collection instruments
o Data entry problems
o Data transmission problems
o Technology limitation
o Inconsistency in naming convention
• Binning
o Binning methods smooth a sorted data value by consulting its
“neighbourhood” i.e., the values around it.
o First sort the data and partition it into (equal-frequency) bins. Then one can smooth by bin means, bin medians, or bin boundaries (see the code sketch after this list).
o Because binning methods consult the neighborhood of values, they perform
local smoothing.
o In smoothing by bin means (or median), each value in a bin is replaced by the
mean (or median) value of the bin.
o In smoothing by bin boundaries, the minimum and maximum values in a
given bin are identified as the bin boundaries. Each bin value is then replaced
by the closest boundary value.
• Regression
o Data smoothing can also be done by regression, a technique that conforms
data values to a function
o Linear regression involves finding the “best” line to fit two attributes (or
variables) so that one attribute can be used to predict the other
• Outlier Analysis
o Outliers may be detected by clustering, for example, where similar values are
organized into groups, or “clusters.” Values that fall outside of the set of
clusters may be considered outliers.
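As referenced in the binning bullet above, here is a small illustrative Python sketch (using NumPy and a hypothetical data list) of equal-frequency binning followed by smoothing by bin means and by bin boundaries:

```python
import numpy as np

# Hypothetical sorted attribute values and a chosen number of equal-frequency bins.
data = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))
n_bins = 3
bins = np.array_split(data, n_bins)          # equal-frequency (equal-depth) partitioning

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
smoothed_by_means = [np.full(len(b), b.mean()) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closer of the bin min/max.
smoothed_by_boundaries = [
    np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins
]

print("Bins:", [b.tolist() for b in bins])
print("By means:", [b.round(2).tolist() for b in smoothed_by_means])
print("By boundaries:", [b.tolist() for b in smoothed_by_boundaries])
```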
Find the mean, median, mode, and range for the following list of values:
13, 18, 13, 14, 13, 16, 14, 21, 13
Solution:
Given data: 13, 18, 13, 14, 13, 16, 14, 21, 13
The mean is the usual average.
Mean = (13+18+13+14+13+16+14+21+13) / 9 = 15
The median is the middle value, so to rewrite the list in ascending order:
13, 13, 13, 13, 14, 14, 16, 18, 21
There are 9 numbers in the list, so the middle one will be the (9 + 1) / 2 = 5th number. Hence, the median is 14.

The mode is the number that is repeated more often than any other, so 13 is the
mode.

The largest value in the list is 21, and the smallest is 13, so the range is 21 – 13 = 8.
Mean = 15, Median = 14, Mode = 13, Range = 8
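The same result can be verified quickly with Python's built-in statistics module:

```python
import statistics

# Verifying the worked example above.
values = [13, 18, 13, 14, 13, 16, 14, 21, 13]

print("Mean:", statistics.mean(values))      # 15
print("Median:", statistics.median(values))  # 14
print("Mode:", statistics.mode(values))      # 13
print("Range:", max(values) - min(values))   # 8
```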

Data Integration
• Data Integration is the merging of data from multiple data stores. Data Mining
often requires merging of data from multiple sources (database, data cubes and
flat files)
• Careful Integration is required to reduce and avoid redundancies and
inconsistencies
Entity Identification Problem
• It deals with matching equivalent real-world entities from multiple data sources.
• For example, how can the data analyst or the computer be sure that customer_id
in one database and cust_number in another refer to the same attribute?
• Examples of metadata for each attribute include the name, meaning, data type,
and range of values permitted for the attribute, and null rules for handling blank,
zero, or null values. Such metadata can be used to help avoid errors in schema
integration.
Redundancy and Correlation Analysis
• An attribute (such as age) may be redundant if it can be “derived” from another
attribute or set of attributes (like DOB).
• Inconsistencies in attribute or dimension naming can also cause redundancies in
the resulting data set.
• Some redundancies can be detected by correlation analysis.
• Given two attributes, such analysis can measure how strongly one attribute implies
the other.
• For nominal data, we use the χ2 (chi-square) test.
• For numeric attributes, we can use the correlation coefficient and covariance.
o Correlation: a measure that indicates how strongly two variables are related. It lies between −1 and +1.
o Covariance: a measure indicating the extent to which two random variables change together. It is closely related to correlation. It lies between −∞ and +∞.

Correlation analysis of Nominal attributes using χ2 (chi-square) test


χ² = Σ (Observed − Expected)² / Expected
The larger the χ² value, the more likely it is that the two variables are related (dependent on each other).
                Male              Female             Count (row)
Fiction         250 (e11 = 90)    200 (e21 = 360)    450
Non-fiction     50 (e12 = 210)    1000 (e22 = 840)   1050
Count (col)     300               1200               n = 1500
(The expected counts eij are computed as (row total × column total) / n, e.g., e11 = 450 × 300 / 1500 = 90.)


χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840
   = 284.44 + 121.90 + 71.11 + 30.48 = 507.93

Degrees of freedom = (no. of rows − 1) × (no. of cols − 1) = 1, and the critical value = 10.828 (from the chi-square distribution table at the 0.001 significance level with 1 degree of freedom). Since our computed value (507.93) is above this, we can conclude that preferred_reading and gender are strongly correlated.
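For reference, the same test can be run with SciPy. This is a sketch assuming SciPy is installed; correction=False disables Yates' continuity correction so the result matches the hand computation above.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table from the example: rows = (fiction, non-fiction), cols = (male, female).
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print("chi2 =", round(chi2, 2))    # ~507.93
print("dof =", dof)                # 1
print("expected counts:\n", expected)
print("p-value =", p_value)        # far below 0.001 -> strong evidence of dependence
```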

Correlation coefficient for Numeric Data (Pearson’s product moment coefficient)

• If r > 0 → x and y are positively correlated (x’s values increase as y’s values increase)
• The higher the absolute value of r, the more strongly correlated they are.
• If r < 0 → x and y are negatively correlated; if r = 0 → there is no linear relationship between them
Covariance of Numeric Data

COV(A, B) = E(A . B) – (E(A) × E(B))


Ex: Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14)
If the stocks are affected by the same industry trends, will their prices rise or fall
together?
• E(A) = Mean of A = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
• E(B) = Mean of B = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
• COV(A, B) = E(A . B) – (E(A) × E(B))
= ((2×5 + 3×8 + 5×10 + 4×11 + 6×14) / 5) − (4 × 9.6) = 4
• Thus, A and B’s price will rise together since COV (A, B) > 0
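The same covariance, and the corresponding correlation coefficient, can be computed with NumPy using the stock data above:

```python
import numpy as np

# Stock prices from the example above.
A = np.array([2, 3, 5, 4, 6])
B = np.array([5, 8, 10, 11, 14])

# Population covariance, matching COV(A, B) = E(A·B) − E(A)·E(B).
cov_ab = np.mean(A * B) - np.mean(A) * np.mean(B)
print("COV(A, B) =", cov_ab)        # 4.0 -> prices tend to rise together

# Pearson correlation coefficient r.
r = np.corrcoef(A, B)[0, 1]
print("r =", round(r, 3))           # positive and close to 1
```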

Tuple Duplication (Data Integration Methods cont.)


• In addition to detecting redundancies between attributes, duplication should be
detected at tuple level also.
• Denormalized tables also result in duplication
• Duplication happens because of inaccurate data entry or updating data at some
places
Data Value Conflict Detection and Resolution
• Attribute values from different sources may differ in representation and scaling.
• For example, one university may adopt a quarter system with grades from A to F, while another may have semesters and adopt grades from 1 to 10.

Data Reduction/Selection
• Complex data analysis and mining on huge amounts of data can take a long time,
making such analysis impractical or infeasible.
• Data reduction techniques can be applied to obtain a reduced representation of
the data set that is much smaller in volume, yet closely maintains the integrity of
the original data.
• Data reduction strategies
o Dimensionality reduction:
▪ It is the process of reducing the number of random variables or
attributes under consideration.
▪ Curse of dimensionality
• Curse of Dimensionality refers to a set of problems that arise
when working with high-dimensional data.
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which is critical to
clustering, outlier analysis, becomes less meaningful
▪ Aims of Dimensionality Reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
▪ Ex: Wavelet Transforms, Principal Components Analysis, Attribute
Subset Selection
o Numerosity reduction (some simply call it: Data Reduction)
▪ Regression and Log-Linear Models
▪ Histograms, clustering, sampling
▪ Data cube aggregation
o Data compression
Data Reduction Techniques
• Dimensionality Reduction:
o Wavelet Transforms
▪ Data are transformed to preserve relative distance between objects at
different levels of resolution
▪ Allow natural clusters to become more distinguishable
▪ Used for image compression
▪ Wavelet transforms can be applied to multidimensional data such as a
data cube. This is done by first applying the transform to the first
dimension, then to the second, and so on.
▪ Wavelet transforms give good results on sparse or skewed data and on
data with ordered attributes.
o Principal Components Analysis (PCA)
▪ Find a projection that captures the largest amount of variation in data
▪ The original data are projected onto a much smaller space, resulting in
dimensionality reduction. We find the eigenvectors of the covariance
matrix, and these eigenvectors define the new space
▪ Works for numeric data only (see the illustrative code sketch at the end of this section)

o Attribute Subset Selection


▪ Attribute subset selection reduces the data set size by removing
irrelevant or redundant attributes (or dimensions).
▪ The goal of attribute subset selection is to find a minimum set of
attributes such that the resulting probability distribution of the data
classes is as close as possible to the original distribution obtained using
all attributes.
▪ Mining on a reduced set of attributes has an additional benefit: It
reduces the number of attributes appearing in the discovered patterns,
helping to make the patterns easier to understand.
▪ Redundant attributes
• Duplicate much or all of the information contained in one or more
other attributes
• E.g., purchase price of a product and the amount of sales tax paid
▪ Irrelevant attributes
• Contains information not useful for the data mining task at hand
• E.g., students' ID is often irrelevant to the task of predicting
students' GPA
• Numerosity reduction: These replace the original data volume by alternative,
smaller forms of data representation
o Parametric: Assume the data fits some model, estimate model parameters,
store only the parameters, and discard the data (except possible outliers)
Ex: Regression, Log-linear models
▪ Linear regression
• Data modeled to fit a straight line
• Often uses the least-square method to fit the line
▪ Multiple regression
• Allows a response variable Y to be modeled as a linear function of
multidimensional feature vector
▪ Log-linear model
• Approximates discrete multidimensional probability distributions

o Non-Parametric: Do not assume models. Ex: Histograms, Clustering,


Sampling, Data Cube Aggregation
▪ Histogram Analysis
• Divide data into buckets and store average (sum) for each bucket
• Partitioning rules:
o Equal-width: equal bucket range
o Equal-frequency (or equal-depth)
• Histograms are highly effective at approximating both sparse and
dense data, as well as highly skewed and uniform data
▪ Clustering
• They partition the objects into groups, or clusters, so that objects
within a cluster are “similar” to one another and “dissimilar” to
objects in other clusters.
▪ Sampling
• Sampling can be used as a data reduction technique because it
allows a large data set to be represented by a much smaller
random data sample (or subset).
• Types of Sampling:
o Simple random sample without replacement (SRSWOR)
o Simple random sample with replacement (SRSWR)
o Cluster sample
o Stratified sample

• Data Compression: Here, transformations are applied so as to obtain a reduced or


“compressed” representation of the original data.
o If the original data can be reconstructed from the compressed data without
any information loss, the data reduction is called lossless.
o If, instead, we can reconstruct only an approximation of the original data,
then the data reduction is called lossy.
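As referenced in the PCA bullet above, the following is a minimal dimensionality-reduction sketch using scikit-learn's PCA on hypothetical numeric data (the data, shapes, and parameters are illustrative assumptions, not part of these notes):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical numeric data: 200 samples, 5 attributes, one of them redundant.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)   # attribute 3 is almost a copy of attribute 0

pca = PCA(n_components=2)        # project onto the 2 directions of largest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                    # (200, 2)
print(pca.explained_variance_ratio_)      # fraction of variance captured by each component
```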

Data Transformation and Discretization


• Methods
o Smoothing: Remove noise from data
o Attribute/feature construction
▪ New attributes constructed from the given ones
o Aggregation: Summarization, data cube construction
o Normalization: Scaled to fall within a smaller, specified range
▪ min-max normalization
▪ z-score normalization
▪ normalization by decimal scaling
o Discretization: where the raw values of a numeric attribute (e.g., age) are
replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g.,
youth, adult, senior)
o Concept hierarchy generation for nominal data: Where attributes such as
street can be generalized to higher-level concepts, like city or country.

Methods of Data Normalization


1. Decimal Scaling Method for Normalization
It normalizes by moving the decimal point of the data values: each value vi is divided by 10^j, where j is the smallest integer such that the largest resulting absolute value is less than 1. The data value vi is normalized to vi' using:
vi' = vi / 10^j
Ex: Let the input data be: -10, 201, 301, -401, 501, 601, 701
Step 1: The maximum absolute value in the given data is 701.
Step 2: Divide the given data by 1000 (i.e., j = 3, since 701/1000 < 1).
Result: The normalized data is: -0.01, 0.201, 0.301, -0.401, 0.501, 0.601, 0.701

2. Z-score Normalization
Each value vi of attribute A is normalized to vi' = (vi − mean(A)) / std(A).
Ex:
Data: 8, 10, 15, 20
Mean = 13.25, SD ≈ 4.66 (population standard deviation)
Normalized data: −1.13, −0.70, 0.38, 1.45 (approximately)

3. Min-max normalization
A value v of attribute A is mapped to v' = ((v − min_A) / (max_A − min_A)) × (new_max − new_min) + new_min.
Ex: Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
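A minimal Python sketch of the three normalization methods, applied to the decimal-scaling example data above (illustrative only, not a full preprocessing pipeline):

```python
import numpy as np

data = np.array([-10, 201, 301, -401, 501, 601, 701], dtype=float)

# 1. Decimal scaling: divide by 10^j so that the largest absolute value is < 1 (here j = 3).
j = int(np.ceil(np.log10(np.abs(data).max())))
decimal_scaled = data / 10 ** j

# 2. Z-score normalization (population standard deviation).
zscores = (data - data.mean()) / data.std()

# 3. Min-max normalization to the range [0, 1].
minmax = (data - data.min()) / (data.max() - data.min())

print(decimal_scaled)
print(zscores.round(2))
print(minmax.round(3))
```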

Discretization
• Used to reduce the number of values for a given continuous attribute, by dividing
the attribute into a range of intervals.
• Interval value labels can be used to replace actual data values. These methods are
typically recursive, where a large amount of time is spent on sorting the data at
each step.
• The smaller the number of distinct values to sort, the faster these methods should
be.
• Many discretization techniques can be applied recursively in order to provide a
hierarchical or multi resolution partitioning of the attribute values known as
concept hierarchy.

Data Discretization techniques


Discretization techniques can be categorized based on whether it uses class
information, as:
• Supervised: the discretization process uses class information
• Unsupervised: the discretization process does not use class information
Discretization techniques can be categorized based on which direction it proceeds as:
• Top-down: If the process starts by first finding one or a few points (called split
points or cut points) to split the entire attribute range, and then repeats this
recursively on the resulting intervals.
• Bottom-up: starts by considering all of the continuous values as potential split-
points. Then removes some by merging neighborhood values to form intervals,
and then recursively applies this process to the resulting intervals.

• Binning
o It is a top-down splitting technique based on a specified number of bins
o Binning does not use class information and is therefore an unsupervised
discretization technique.
o Binning Methods:
▪ Equal-width (distance) partitioning
Divides the range into N intervals of equal size.
If A and B are the lowest and highest values of the attribute,
the width of intervals will be: W = (B –A)/N.
Ex: 4, 8, 15, 21, 21, 24, 25, 28, 34
W = (B –A)/N = (34 – 4) / 3 = 10 (N value -> Number of bins; our choice)
Bin 1: 4-14, Bin2: 15-24, Bin 3: 25-34
• Bin 1: 4, 8
• Bin 2: 15, 21, 21, 24
• Bin 3: 25, 28, 34
▪ Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing approximately same
number of samples. Ex: 4, 8, 15, 21, 21, 24, 25, 28, 34
Equal-depth (frequency) partitioning:
• Bin 1: 4, 8, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 28, 34
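The two partitionings above can be reproduced with pandas' cut (equal-width) and qcut (equal-depth). This is an illustrative sketch, and the exact bin edge conventions may differ slightly from the hand-worked intervals:

```python
import pandas as pd

data = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-width (distance) partitioning into 3 bins of width (34 - 4) / 3 = 10.
equal_width = pd.cut(data, bins=3)

# Equal-depth (frequency) partitioning into 3 bins with roughly 3 values each.
equal_depth = pd.qcut(data, q=3)

print(pd.concat([data, equal_width, equal_depth], axis=1,
                keys=["value", "equal_width_bin", "equal_depth_bin"]))
```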
• Histogram Analysis
o It is an unsupervised discretization technique because it does not use class
information.
o A histogram partitions the values of an attribute, A, into disjoint ranges called
buckets or bins.
• Cluster Analysis
o A clustering algorithm can be applied to discretize a numeric attribute, A, by
partitioning the values of A into clusters or groups.
o Clustering takes the distribution of A into consideration, as well as the
o closeness of data points, and therefore is able to produce high-quality
discretization results.
• Decision Tree Analysis
o This technique employs a top-down splitting approach.
o Unlike the other methods mentioned so far, decision tree approaches to
discretization are supervised, that is, they make use of class label
information.
• Correlation Analysis
o ChiMerge is a χ²-based discretization method which employs a bottom-up
approach by finding the best neighboring intervals and then merging them to
form larger intervals, recursively.
o ChiMerge is supervised in that it uses class information.
• Concept Hierarchy Generation
o It organizes concepts (i.e., attribute values) hierarchically and is usually
associated with each dimension in a data warehouse
o Concept hierarchy formation: Recursively reduce the data by collecting and
replacing low level concepts (such as numeric values for age) by higher level
concepts (such as youth, adult, or senior)
o Concept hierarchies can be explicitly specified by domain experts and/or data
warehouse designers
DMML Module 2
What Is Association Mining?
• Association Rule Mining is a Data Mining technique that finds patterns in data.
• The patterns found by Association Rule Mining represent relationships between
items. When this is used with sales data, it is referred to as Market Basket
Analysis.
Definitions
• Itemset
o A collection of one or more items
o Example: {Milk, Bread, Diaper}
o k-itemset: An itemset that contains k items
• Support count (σ)
o Frequency of occurrence of an itemset
o E.g. σ ({Milk, Bread, Diaper}) = 2
• Support
o Fraction of transactions that contain an itemset
o E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
o An itemset whose support is greater than or equal to a minsup threshold
The Apriori Principle
Main Observations:
• If an itemset is frequent, then all of its subsets must also be frequent
• If an itemset is not frequent, then all of its supersets cannot be frequent
Candidate Generation
• Construct a candidate of size k+1 by combining two frequent itemsets of size k
• Prune the generated k+1 itemsets that do not have all k-subsets to be frequent
Two-step approach
1. Frequent Itemset Generation
o Generate all itemsets whose support ≥ minsup
2. Rule Generation
o Generate high confidence rules from each frequent itemset, where each
rule is a partitioning of a frequent itemset into LHS and RHS.
Example: Frequent itemset: {A,B,C,D}
Rule: AB→CD
Apriori Algorithm Example
Given, the following data with minimum support count = 2

Step 1: K=1
i) Create a table containing support count of each item present in dataset called
Candidate Set C1.
ii) Compare candidate set C1’s support count with minimum
support count (here min_support=2).
If sup_count of candidate set items is less than min_support
then remove those items. This gives us itemset L1.
As every item’s sup_count >= 2, no items are removed.
Step 2: K=2
i) Now generate C2 by joining L1 with L1. After joining, we get the candidate set C2.
ii) Now check if all subsets of an itemset are frequent or not and if not
frequent remove that itemset.
Example: subset of {I1, I2} are {I1}, {I2} and they are frequent. Check
similarly for each itemset.
iii) Now find support count of these itemsets by searching in dataset
iv) Now remove the itemsets from C2 whose sup_count is less than
minimum support count to get L2.
(The C2 and L2 tables are shown as figures.)
Itemsets in C2 that are not frequent, and hence removed to obtain L2: {I1, I4}, {I3, I4}, {I3, I5}, {I4, I5}

Step 3: K=3
i) Now generate C3 by joining L2 with L2. Condition of joining Lk-1 and Lk-1 is that it
should have (K-2) elements in common. So here, for L2, first element should match.
So C3 generated by joining L2 is {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4},
{I2, I4, I5}, {I2, I3, I5}
ii) Check if all subsets of these itemsets are frequent or not and if not, then remove
that itemset.
Here subset of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3} which are frequent.
For {I2, I3, I4}, subset {I3, I4} is not frequent so remove it. Check for every itemset.
iii) Now, find support count of these remaining itemset and remove those that have
sup_count < minimum support count. After this, we get L3:
Not Frequent List
{I1, I4}, {I3, I4}, {I3, I5}, {I4, I5}, {I1, I3, I5},
{I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}
Step 4: K=4
i) Generate C4 by joining L3 and L3. Condition of joining Lk-1 and Lk-1 is that it should
have (K-2) elements in common. So here, for L3, first 2 elements should match.
ii) Check all subsets of these itemsets are frequent or not
Here itemset formed by joining L3 is {I1, I2, I3, I5} but its subset contains {I1, I3, I5},
which is not frequent. Hence, no itemset in C4.
We stop here because no frequent itemsets are found further.
Hence Frequent Itemsets are {I1, I2, I3} and {I1, I2, I5}
Association Rules from frequent Subsets
For each frequent Itemset X,
For every proper non-empty subset A of X, and if B = X - A,
then A ⇒ B is an association rule if
• confidence (A ⇒ B) ≥ min_confidence
• where, confidence(A ⇒ B) = support_count(A ∪ B) / support_count(A)

Example: From the previous example, we found that {I1, I2, I3} and {I1, I2, I5} itemsets
were frequent. Now, if the given min_confidence is 50%, let’s derive the association
rules.
For frequent itemset {I1, I2, I3}
{I1, I2} ⇒ {I3} ; confidence = sup(I1, I2, I3) / sup(I1, I2) = 2/4*100 = 50%
{I1, I3} ⇒ {I2} ; confidence = sup(I1, I2, I3) / sup(I1, I3) = 2/4*100 = 50%
{I2, I3} ⇒ {I1} ; confidence = sup(I1, I2, I3) / sup(I2, I3) = 2/4*100 = 50%
{I1} ⇒ {I2, I3} ; confidence = sup(I1, I2, I3) / sup(I1) = 2/6*100 = 33%
{I2} ⇒ {I1, I3} ; confidence = sup(I1, I2, I3) / sup(I2) = 2/7*100 = 28%
{I3} ⇒ {I1, I2} ; confidence = sup(I1, I2, I3) / sup(I3) = 2/6*100 = 33%
Hence, we infer that only the first 3 association rules are strong as the rest are < 50%.
For frequent itemset {I1, I2, I5}
{I1, I2} ⇒ {I5} ; confidence = sup(I1, I2, I5) / sup(I1, I2) = 2/4*100 = 50%
{I1, I5} ⇒ {I2} ; confidence = sup(I1, I2, I5) / sup(I1, I5) = 2/2*100 = 100%
{I2, I5} ⇒ {I1} ; confidence = sup(I1, I2, I5) / sup(I2, I5) = 2/2*100 = 100%
{I1} ⇒ {I2, I5} ; confidence = sup(I1, I2, I5) / sup(I1) = 2/6*100 = 33%
{I2} ⇒ {I1, I5} ; confidence = sup(I1, I2, I5) / sup(I2) = 2/7*100 = 28%
{I5} ⇒ {I1, I2} ; confidence = sup(I1, I2, I5) / sup(I5) = 2/2*100 = 100%
Hence, association rules 1, 2, 3 and 6 are strong.
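In practice, libraries such as mlxtend implement Apriori directly. The sketch below is a hedged illustration: it assumes mlxtend is installed, and the transaction list is the standard 9-transaction textbook dataset, which is consistent with the support counts used in the worked example above (the original transaction table is not reproduced in these notes).

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Minimum support count 2 out of 9 transactions -> use a threshold just below 2/9.
frequent = apriori(onehot, min_support=0.22, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)

print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```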
Applications of Apriori Algorithm
• Education field: Extracting association rules in data mining of admitted students
through characteristics and specialties.
• Medical field: Analysis of the patient’s database.
• Forestry: Analysis of probability and intensity of forest fire with the forest fire data.
• Recommender system: Used by companies like Amazon and Google for the auto-
complete feature.
Drawbacks of Apriori Algorithm
• Using Apriori needs a generation of candidate itemsets. These itemsets may be
large in number if the itemset in the database is huge.
• Apriori needs multiple scans of the database to check the support of each itemset
generated and this leads to high costs.
Frequent Pattern Growth Algorithm (FP Growth Algorithm)
• This algorithm is an improvement to the Apriori method.
• A frequent pattern is generated without the need for candidate generation.
• FP growth algorithm represents the database in the form of a tree called a
frequent pattern tree or FP tree.
• This tree structure will maintain the association between the itemsets.
FP Growth Algorithm Example 1
For the given Transaction Database, find the Frequent Itemset:
Given, minimum support count = 3.
Now, lets construct the frequency distribution
table for each distinct item.
Now remove the items whose support count
(frequency) is < 3.
The resulting items with sup_count is:
L = {K:5, E:4, M:3, O:3, Y:3}.
Now, re-write the given data after removing the infrequent items and arranging the remaining items in descending order of sup_count.
Constructing FP Tree

Item: Conditional Pattern Base → Conditional FP-Tree → Frequent Patterns Generated
Y: {{K, E, M, O: 1}, {K, E, O: 1}, {K, M: 1}} → {K: 3} (E, M, O have sup_count < 2) → {K, Y: 3} (combine the item with its conditional FP-tree)
O: {{K, E, M: 1}, {K, E: 2}} → {K: 3}, {E: 3} → {K, O: 3}, {E, O: 3}, {K, E, O: 3}
M: {{K, E: 2}, {K: 1}} → {K: 3} → {K, M: 3}
E: {{K: 4}} → {K: 4} → {K, E: 4}

FP Growth Algorithm Example 2

L = {I2:7, I1:6, I3:6, I4:2, I5:2}


FP Tree

Item: Conditional Pattern Base → Conditional FP-Tree → Frequent Patterns Generated
I5: {{I2, I1: 1}, {I2, I1, I3: 1}} → {I2: 2, I1: 2} → {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
I4: {{I2, I1: 1}, {I2: 1}} → {I2: 2} → {I2, I4: 2}
I3: {{I2: 2}, {I2, I1: 2}} | {{I1: 2}} → {I2: 4, I1: 2} | {I1: 2} → {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}
I1: {{I2: 4}} → {I2: 4} → {I2, I1: 4}
Note: In the row for I3, when the item is combined with its conditional FP-tree to generate frequent patterns, the sup_counts of an item are added up if it appears in more than one branch. When the item is combined with two or more items of the conditional FP-tree, the smallest sup_count is taken.
Advantages of FP Growth Algorithm
• This algorithm needs to scan the database only twice when compared to Apriori
which scans the transactions for each iteration.
• The pairing of items is not done in this algorithm and this makes it faster.
• The database is stored in a compact version in memory.
• It is efficient and scalable for mining both long and short frequent patterns.
Disadvantages of FP-Growth Algorithm
• FP Tree is more cumbersome and difficult to build than Apriori.
• It may be expensive.
• When the database is large, the algorithm may not fit in the shared memory.
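A corresponding sketch with mlxtend's FP-Growth implementation (same assumptions and transaction list as the Apriori sketch above); it produces the same frequent itemsets without generating candidate sets:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# FP-Growth scans the data only twice and builds the FP-tree internally.
print(fpgrowth(onehot, min_support=0.22, use_colnames=True))
```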
Extra Problems
FP Growth Example 3

FP Tree

L = {A:8, B:7, C:6, D:5, E:3}. The given data is already in descending order of sup_count
Item: Conditional Pattern Base (CPB) → Conditional FP-Tree (CFT) → Frequent Patterns Generated
E: {{A, D: 1}, {A, C, D: 1}} | {{B, C: 1}} → {(A: 2), (D: 2)} → {A, E: 2}, {D, E: 2}, {A, D, E: 2}
D: {{A: 1}, {A, B: 1}, {A, C: 1}, {A, B, C: 1}} | {{B, C: 1}} → {(A: 4), (B: 2), (C: 2)} → {A, D: 4}, {B, D: 2}, {C, D: 2}, {A, B, D: 2}, {A, C, D: 2} (Note: {B, C, D: 2} is not added as B and C belong to different branches.)
C: {{A: 1}, {A, B: 3}} | {{B: 2}} → {(A: 4), (B: 3)} | {(B: 2)} → {A, C: 4}, {B, C: 5}, {A, B, C: 3}
B: {{A: 5}} → {(A: 5)} → {A, B: 5}
DMML Module 3
Introduction to Machine Learning
Various Definitions of ML
• ML is the field of study that gives computers the ability to learn without being explicitly programmed.
• Well-Posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
• Hence ML is the study of algorithms that
o improve their performance P
o at some task T
o with experience E
• A well-defined learning task is given by <T, P, E>
Examples:
• Checkers/Chess Learning Problem:
o Task T: playing checkers
o Performance measure P: percentage of games won against opponent
o Training experience E: playing practice games against itself
• Handwriting Recognition:
o T: recognizing hand-written words
o P: percentage of words correctly classified
o E: A dataset of labelled handwritten words
• Self-Driving Robot:
o T: Driving on highways using vision sensors
o P: Average distance travelled before an error
o E: A sequence of images and steering commands recorded while observing a
human driver
• To filter emails as spam or not
o T: Classifying emails as spam or not
o P: The fraction of emails accurately classified as spam or not spam
o E: dataset of emails labelled as spam and not spam
• Fruit Prediction Problem
o T: forecasting different fruits for recognition
o P: able to predict maximum variety of fruits
o E: training machine with the largest datasets of fruits images
• Face Recognition Problem
o T: predicting different types of faces
o P: able to predict maximum types of faces
o E: datasets of different face images
• Automatic Translation of documents
o T: translating one type of language used in a document to other language
o P: able to convert one language to other efficiently
o E: dataset of different types of languages
Types of Machine Learning
Criteria: Supervised ML | Unsupervised ML | Reinforcement ML
Definition: Learns by using labelled data | Trained using unlabelled data without any guidance | Works by interacting with the environment
Type of data: Labelled data | Unlabelled data | No predefined data
Type of problems: Regression and Classification | Association and Clustering | Exploitation or Exploration
Supervision: Extra supervision | No supervision | No supervision
Algorithms: Linear Regression, Logistic Regression, SVM, KNN, etc. | K-Means, C-Means, Apriori, Hierarchical Clustering | Q-Learning, SARSA
Aim: Calculate outcomes | Discover underlying patterns | Learn a series of actions
Application: Risk Evaluation, Forecast Sales | Recommendation System, Anomaly Detection | Self-Driving Cars, Gaming, Robotics
(Pictorial representation: figure omitted)
Supervised ML:

• Supervised learning is the subcategory of machine learning that focuses on


learning a classification or regression model, that is, learning from labelled training
data .
Unsupervised ML:

• Unsupervised learning is a branch of machine learning that is concerned with


unlabelled data. Common tasks in unsupervised learning are clustering analysis
(assigning group memberships) and dimensionality reduction (compressing data
onto a lower-dimensional subspace or manifold).

Reinforcement ML
• Reinforcement is the process of learning from rewards while performing a series of
actions.
• In reinforcement learning, we do not tell the learner or agent, for example, a
robot, which action to take but merely assign a reward to each action and/or the
overall outcome.
• Instead of having a “correct/false” label for each step, the learner must discover or learn a behaviour that maximizes the reward for a series of actions.
• Typical applications of reinforcement learning involve playing games and some
form of robots, e.g., drones, warehouse robots, and more recently self-driving cars.
Parameters: Supervised ML | Unsupervised ML
Process: In a supervised learning model, input and output variables will be given. | In an unsupervised learning model, only input data will be given.
Input Data: Algorithms are trained using labelled data. | Algorithms are trained on data which is not labelled.
Algorithms Used: Support vector machine (SVM), neural networks, linear and logistic regression, random forest, and classification trees. | Unsupervised algorithms can be divided into categories such as clustering algorithms, K-means, hierarchical clustering, etc.
Computational Complexity: Supervised learning is a simpler method. | Unsupervised learning is computationally complex.
Use of Data: A supervised learning model uses training data to learn a link between the inputs and the outputs. | Unsupervised learning does not use output data.
Accuracy of Results: Highly accurate and trustworthy method. | Less accurate and trustworthy method.
Real Time Learning: Learning takes place offline. | Learning takes place in real time (Ex: consumer analysis).
Number of Classes: Number of classes is known. | Number of classes is not known.
Concept Learning
• Concept: It is a subset of objects or events defined over a larger set. For example,
we refer to the set of everything (i.e., all objects) as the set of things. Animals are a
subset of things, and birds are a subset of animals.
• Concept learning can be formulated as a problem of searching through a
predefined space of potential hypotheses for the hypothesis that best fits the
training examples.
• Hence, we are trying to learn the definition of a concept from given examples.
• Example of a Concept Learning task: To infer the “best” concept description from
the set of all possible hypotheses. Here, “best” means “which best generalizes all
the elements of the instance space”.
• Most General Hypothesis: every day is good for playing outside
<?, ?, ?, ?, ?>
• Most Specific Hypothesis: no day is good for playing outside
<0, 0, 0, 0, 0>
• Hence, the hypothesis <Sunny, Warm, ?, ?, ?> classifies all the instances where sky is ‘sunny’ and temperature is ‘warm’ as ‘Play’, and all others as ‘No Play’
Find-S Algorithm
1. Initialize h to the most specific hypothesis in H (Ex: <Φ, Φ, Φ, Φ, Φ>)
2. For each positive training instance x:
      For each attribute constraint a in h:
            If the constraint a is satisfied by x:
                  Do nothing
            Else:
                  Replace a in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
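A minimal Python sketch of Find-S, using a small hypothetical training set (the attribute values and labels below are illustrative, not taken from these notes; the last column is the class label, with "Yes" marking a positive example):

```python
data = [
    ["Sunny", "Warm", "Normal", "Strong", "Warm", "Same",   "Yes"],
    ["Sunny", "Warm", "High",   "Strong", "Warm", "Same",   "Yes"],
    ["Rainy", "Cold", "High",   "Strong", "Warm", "Change", "No"],
    ["Sunny", "Warm", "High",   "Strong", "Cool", "Change", "Yes"],
]

n_attrs = len(data[0]) - 1
h = ["0"] * n_attrs                      # most specific hypothesis

for *x, label in data:
    if label != "Yes":                   # Find-S ignores negative examples
        continue
    for i, value in enumerate(x):
        if h[i] == "0":                  # first positive example: copy its values
            h[i] = value
        elif h[i] != value:              # constraint not satisfied: generalize it
            h[i] = "?"

print(h)   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```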
Version Space
• A version space is a hierarchical representation of knowledge that enables you to
keep track of all the useful information supplied by a sequence of learning
examples without remembering any of the examples.
• A version space description consists of two complementary trees:
o One that contains nodes connected to overly general models
o One that contains nodes connected to overly specific models

• The key idea in version space learning is that specialization of the general models
and generalization of the specific models may ultimately lead to just one correct
model that matches all observed positive examples and does not match any
negative examples.

Candidate Elimination Algorithm


• The candidate-Elimination algorithm computes the version space containing all
(and only those) hypotheses from H that are consistent with an observed sequence
of training examples
• It begins by initializing the version space to the set of all hypotheses in H, that is, by
initializing the G boundary set to contain the most general hypotheses in H
G0 ← {<?, ?, ?, ?, ?, ?, ?>}
• And initializing the S boundary set to contain the most specific hypothesis.
S0 ← {<0, 0, 0, 0, 0, 0, 0>}
• These two boundary sets delimit the entire hypothesis space because every other
hypothesis in H is both more general than S0 and more specific than G0.
• As each training example is considered, the S and G boundary sets are generalized
and specialized, respectively to eliminate from the version space any hypothesis
found inconsistent with the new training example.
• After all the examples have been processed, the computed version space contains all the hypotheses that are consistent with the training examples.
DMML Module 4
Regression, Bayesian Learning, SVMs
Overview of Regression
• Regression analysis is a statistical method to model the relationship between a
dependent variable (target) and an independent (predictor) variable(s)
• Regression analysis helps us to understand how the value of the dependent variable
is changing corresponding to an independent variable when other independent
variables are held fixed.
• It predicts continuous/real values such as temperature, age, salary, price, etc.
• It involves determining the best fit line, i.e., a line drawn through the data points in such a way that the total distance between the line and the data points is minimized.
• Thus, regression analysis helps in the prediction of a continuous variable.
Terminologies Related to the Regression Analysis
• Dependent Variable: The main factor in Regression analysis, which we want to
predict or understand is called the dependent variable. It is also called target
variable.
• Independent Variable: The factors which affect the dependent variables or which
are used to predict the values of the dependent variables are called independent
variable, also called as a predictor.
• Outliers: Outlier is an observation which contains either very low value or very
high value in comparison to other observed values. An outlier may hamper the
result, so it should be avoided.
• Multicollinearity: If the independent variables are highly correlated with each other (meaning they change together at an approximately constant/linear rate), the condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most influential variables.
• Underfitting and Overfitting: If our algorithm works well with the training dataset
but not well with test dataset, then such problem is called Overfitting. And if our
algorithm does not perform well even with training dataset, then such problem is
called underfitting.
Linear Regression
• If there is only one input variable (x), then such linear regression is called simple
linear regression.
• And if there is more than one input variable, then such linear regression is
called multiple linear regression.
• Mathematically, we can represent a linear regression as: y = m*x + c
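A minimal simple-linear-regression sketch with scikit-learn on synthetic data (the true slope and intercept below are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data roughly following y = 3x + 2 with some noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 1))
y = 3 * x[:, 0] + 2 + rng.normal(scale=1.0, size=50)

model = LinearRegression().fit(x, y)
print("slope m ≈", round(model.coef_[0], 2))        # close to 3
print("intercept c ≈", round(model.intercept_, 2))  # close to 2
print("prediction at x = 4:", model.predict([[4.0]]))
```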

Logistic Regression
• Logistic regression is a method used to predict a dependent variable, given a set of
independent variables, such that the dependent variable is categorical.
• Logistic regression produces results in a binary format which is used to predict the
value of a categorical dependent variable. Hence, the outcome of logistic regression
is discrete/categorical like (Yes, No) or (0, 1), (True, False), (High, Low) etc
• It is a predictive analysis algorithm which works on the concept of probability.
• Logistic regression uses sigmoid/logistic function which is a complex cost function.
This sigmoid function is used to model the data in logistic regression.
• The sigmoid function: y = f(x) = 1 / (1 + e^(−x))
• Where, f(x) lies between 0 and 1 and x is the input to the function.
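A small sketch of the sigmoid function and a logistic regression classifier with scikit-learn (synthetic, illustrative data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sigmoid function from the formula above.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))    # values squashed into (0, 1)

# Hypothetical 1-D binary classification: label is 1 when the feature is above 5, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = (X[:, 0] + rng.normal(scale=1.0, size=100) > 5).astype(int)

clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.0], [8.0]]))            # expected: [0 1]
print(clf.predict_proba([[5.0]]))             # probabilities near 0.5 at the boundary
```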

Linear Regression | Logistic Regression
• It predicts a continuous dependent variable using a given set of independent variables. | It predicts a categorical dependent variable using a given set of independent variables.
• It is used for solving Regression problems. | It is used for solving Binary Classification problems.
• Here, we find the best fit line, by which we can easily predict the output. | Here, we find the S-curve by which we can classify the samples.
• The least squares estimation method is used for estimation of accuracy. | The maximum likelihood estimation method is used for estimation of accuracy.
• The output of Linear Regression must be a continuous value, such as price, age, etc. | The output of Logistic Regression must be a categorical value such as 0 or 1, Yes or No, etc.
• Linear regression uses the straight-line equation: y = mx + c | Logistic regression uses the sigmoid function: y = f(x) = 1 / (1 + e^(−x))

Naïve Bayes
• The Naive Bayes classifier works on the principle of conditional probability, as given
by the Bayes theorem.
• Naive Bayes algorithm falls under Supervised - classification.
• Applications:
o Face Recognition: As a classifier, it is used to identify the faces or its other
features, like nose, mouth, eyes, etc.
o Weather Prediction: It can be used to predict if the weather will be good or
bad.
o Medical Diagnosis: Doctors can diagnose patients by using the information
that the classifier provides. Healthcare professionals can use Naive Bayes to
indicate if a patient is at high risk for certain diseases and conditions, such as
heart disease, cancer, and other ailments.
o News Classification: With the help of a Naive Bayes classifier, Google News
recognizes whether the news is political, world news, and so on.
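A minimal Gaussian Naive Bayes sketch with scikit-learn on synthetic two-feature data (illustrative only; the class centres are assumptions):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical data: class 0 points centred at (0, 0), class 1 points centred at (3, 3).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(3, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X, y)
# The posterior follows Bayes' theorem: P(C | x) ∝ P(x | C) * P(C),
# with features assumed conditionally independent given the class.
print(clf.predict([[0.2, -0.1], [2.8, 3.3]]))       # expected: [0 1]
print(clf.predict_proba([[1.5, 1.5]]).round(3))     # posterior probabilities
```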
Bayesian Network
• Popularly known as Belief Networks, Bayesian Networks are used to model
uncertainties by using Directed Acyclic Graphs (DAG).
• A Directed Acyclic Graph is used to represent a Bayesian Network and like any
other statistical graph, a DAG contains a set of nodes and links, where the links
denote the relationship between the nodes.
• A DAG models the uncertainty of an event occurring based on the Conditional
Probability Distribution (CDP) of each random variable. A Conditional Probability
Table (CPT) is used to represent the CPD of each variable in the network.
• Applications:
o Disease Diagnosis: Bayesian Networks are commonly used in the field of
medicine for the detection and prevention of diseases. They can be used to
model the possible symptoms and predict whether or not a person is diseased.
o Optimized Web Search: Bayesian Networks are used to improve search
accuracy by understanding the intent of a search and providing the most
relevant search results. They can effectively map users’ intent to the relevant
content and deliver the search results.
o Spam Filtering: Bayesian models have been used in the Gmail spam filtering
algorithm for years now. They can effectively classify documents by
understanding the contextual meaning of a mail. They are also used in other
document classification applications.
o Gene Regulatory Networks: GRNs are a network of genes that are comprised
of many DNA segments. They are effectively used to communicate with other
segments of a cell either directly or indirectly. Mathematical models such as
Bayesian Networks are used to model such cell behavior in order to form
predictions.
o Biomonitoring: Bayesian Networks play an important role in monitoring the quantity of chemical doses used in pharmaceutical drugs.
Support Vector Machines
• Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.
• However, primarily, it is used for Classification problems in Machine Learning.
• An SVM model is a representation of the examples as points in space, mapped so
that the examples of the separate categories are divided by a hyperplane that is as
far as possible from each category.
• New examples are then mapped into that same space and predicted to belong to a
category based on which side of the hyperplane they fall.
• The data points or vectors that are closest to the hyperplane, and which affect the position of the hyperplane, are termed support vectors, since these vectors “support” the hyperplane.

• SVM can be of two types:


o Linear SVM: Linear SVM is used for linearly separable data, which means if a
dataset can be classified into two classes by using a single straight line, then
such data is termed as linearly separable data, and classifier is used called as
Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separated data,
which means if a dataset cannot be classified by using a straight line, then such
data is termed as non-linear data and classifier used is called as Non-linear SVM
classifier.
• Applications:
o Face detection – SVM classify parts of the image as a face and nonface and
create a square boundary around the face.
o Classification of images – Use of SVMs provides better search accuracy for
image classification. It provides better accuracy in comparison to the
traditional query-based searching techniques.
o Bioinformatics – It includes protein classification and cancer classification. We
use SVM for identifying the classification of genes, patients on the basis of
genes and other biological problems
o Handwriting recognition – SVMs are widely used to recognize handwritten characters.
SVM Kernel
• SVM algorithms use a set of mathematical functions that are defined as the kernel.
The function of kernel is to take data as input and transform it into the required
form. Different SVM algorithms use different types of kernel functions.
• Kernel plays a vital role in classification and is used to analyse some patterns in the
given dataset.
• They are very helpful in solving a non-linear problem by using a linear classifier.
• Kernels help us to deal with high dimensional data in a very efficient manner
• Kernels are a way to solve non-linear problems with the help of linear classifiers. This
is known as the kernel trick method.
• Kernel Function is a method used to take data as input and transform into the
required form of processing data.
• The value can be any type of kernel from linear to polynomial. If the value of the
kernel is linear then the decision boundary would be linear and two-dimensional.
• These kernel functions also help in giving decision boundaries for higher dimensions.
• These functions can be different types. For example, linear, nonlinear, polynomial,
radial basis function (RBF), and sigmoid.
Linear Kernel
• It is the most basic type of kernel, usually one dimensional in nature. It proves to be
the best function when there are lots of features.
• The linear kernel is mostly preferred for text-classification problems as most of these
kinds of classification problems can be linearly separated.
• It is given by the following equation: K(xi, xj) = xi · xj + c (where ‘·’ is the dot product and xi, xj represent the data points we want to classify).
Polynomial Kernel
• It is a more generalized representation of the linear kernel. It is not as preferred as
other kernel functions as it is less efficient and accurate.
• It is given by: K(xi, xj) = (xi · xj)^d, where ‘·’ is the dot product and d is the degree of the polynomial.
Gaussian Radial Basis Function (RBF)
• It is one of the most preferred and used kernel functions in svm.
• It is usually chosen for non-linear data. It helps to make proper separation when
there is no prior knowledge of data.
• It is given as: K(xi, xj) = exp(−γ ||xi − xj||²), where the value of gamma (γ) typically ranges from 0 to 1. A commonly used value of γ is 0.1.
Sigmoid Kernel
• It is mostly preferred for neural networks. This kernel function is similar to a two-
layer perceptron model of the neural network, which works as an activation function
for neurons.
• It is given as: K(xi, xj) = tanh(α (xi · xj) + c)
Conclusion on SVM Kernels
• The linear kernel is mostly preferred for text classification problems as it performs
well for large datasets.
• Gaussian kernels tend to give good results when there is no additional information
regarding data that is not available.
• RBF kernel is also a kind of Gaussian kernel which projects the high dimensional data
and then searches a linear separation for it.
• Polynomial kernels give good results for problems where all the training data is
normalized.
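A short scikit-learn sketch comparing the four kernels on synthetic, non-linearly separable data (the dataset and parameters are illustrative assumptions):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: not linearly separable, so non-linear kernels should do better.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, gamma="scale").fit(X_train, y_train)
    print(kernel, "accuracy:", round(clf.score(X_test, y_test), 2))
# The RBF kernel typically scores highest here, since the data is not linearly separable.
```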
DMML Module 5
Clustering, ANNs, Instance Based Learning
Clustering
• Cluster analysis, or clustering, is an unsupervised machine learning task.
• Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to one another than to the data points in other groups.
• It involves automatically discovering natural grouping in data. Clustering algorithms
only interpret the input data and find natural groups or clusters in feature space.
• A good clustering method will produce high quality clusters in which:
o the intra-class/intra-cluster similarity is high
o the inter-class similarity is low.
• The quality of a clustering result also depends on both the similarity measure used
by the method and its implementation.
• The quality of a clustering method is also measured by its ability to discover some
or all of the hidden patterns.

Why Clustering?

• It helps you to glance through the data to pull out some patterns or structures
before going deeper into analysing the data for specific findings.
• Organising data into clusters helps in identifying the underlying structure in the
data and finds applications across industries.
• For example, clustering could be used to classify disease in the field of medical
science, and can also be used in customer classification in marketing research

Types of Clustering Algorithms


1. Partitional algorithms: Construct various partitions and then evaluate them by
some criterion. Ex: k-means
2. Hierarchical algorithms: Create a hierarchical decomposition of the set of objects
using some criterion.
3. Density based: based on connectivity and density function
Centroid/Partition Method
• These methods partition the objects into k clusters and each partition forms one
cluster.
(Figure: the k cluster centroids, i.e., cluster centres)
• This method is used to optimize an objective criterion similarity function, such as the
sum of squared distances between each point and its cluster centroid, when the
distance is a major parameter.
• The most common example of partitioning clustering is the K-Means Clustering
algorithm.
• In this type, the dataset is divided into a set of k groups, where k is used to define
the number of pre-defined groups.
• The cluster centres are created in such a way that the distance between the data points
and their own cluster centroid is minimal compared to the distance to other cluster centroids.
Hierarchical Method
• The clusters formed in this method form a tree-type structure based on the
hierarchy. New clusters are formed using previously formed ones. It is divided
into two categories:
o Agglomerative (bottom-up approach)
o Divisive (top-down approach)
• Hierarchical clustering can be used as an alternative for the partitioned clustering
as there is no requirement of pre-specifying the number of clusters to be
created.
• In this technique, the dataset is divided into clusters to create a tree-like structure,
which is also called a dendrogram.
• The observations or any number of clusters can be selected by cutting the tree at
the correct level.
• The most common example of this method is the Agglomerative Hierarchical
algorithm.
Density based Methods
• These methods consider clusters as dense regions of similar points, separated from
each other by regions of lower density. These methods have good accuracy and the
ability to merge two clusters.
• It searches the data space for varied density of data points and isolates the different
density regions. It then assigns the data points within the same region as clusters
Applications of Clustering
• Marketing : It can be used to characterize & discover customer segments for
marketing purposes. Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing programs.
• Biology : It can be used for classification among different species of plants and
animals.
• Libraries : It is used in clustering different books on the basis of topics and
information.
• Insurance : It is used to understand customers and their policies and to identify
frauds, for example identifying groups of motor insurance policy holders with a high
average claim cost.
• City Planning: It is used to make groups of houses and to study their values based
on their geographical locations and other factors present.
• Earthquake studies: By learning the earthquake-affected areas we can determine
the dangerous zones.
• Apart from these general usages, clustering is used by Amazon in its recommendation
system to provide recommendations based on a user's past product searches.
• Netflix also uses this technique to recommend movies and web series to its
users based on their watch history.
K-means Clustering
The algorithm can be stated as follows.
• First it selects k number of objects at random from the set of n objects. These k
objects are treated as the centroids or center of gravities of k clusters.
• For each of the remaining objects, it is assigned to the closest centroid. Thus, it
forms a collection of objects assigned to each centroid and is called a cluster.
• Next, the centroid of each cluster is then updated (by calculating the mean values
of attributes of each object).
• The assignment and update steps are repeated until some stopping criterion is reached
(e.g., a maximum number of iterations, or the centroids/assignments no longer change).
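A minimal from-scratch sketch of these steps in NumPy is shown below; the random initialization, the convergence test and the handling of empty clusters are assumptions of this illustration.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k objects at random as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every object to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update each centroid as the mean of the objects assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```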
Advantages:
• The algorithm is simple, easy to understand and implement.
• It is also efficient: the time taken by K-means to cluster rises linearly with the
number of data points.
Disadvantages
• The user needs to specify an initial value of k
• The process of finding the clusters may not converge.
Choosing the value of ‘k’ in K-means
• The performance of the K-means clustering algorithm depends upon the quality of the
clusters that it forms.
• But choosing the optimal number of clusters is a big task. We use the Elbow Method
to find the optimal number of clusters.
Elbow Method
• We use within-sum-of-squares (WSS) as a measure to find the optimum number of
clusters that can be formed for a given data set.
• WSS is defined as the sum of the squared distance between each member of the
cluster and its centroid.
𝑊𝑆𝑆 = ∑ᵢ₌₁ᵐ (𝑥𝑖 − 𝑐𝑖)²
• The WSS is measured for each value of K. Since WSS keeps decreasing as K increases,
the optimal K is not simply the value with the least WSS, but the value beyond which
increasing K gives only a small reduction in WSS.
• Now, we draw a curve between WSS and the number of clusters.
• The sharp point of bend (where the plot looks like an arm or elbow) is considered
the best value of K.
• Since the graph shows the sharp bend, which looks like an elbow, hence it is known
as the elbow method.
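A hedged sketch of the elbow method using scikit-learn, where the fitted model's inertia_ attribute gives the WSS; the synthetic data and the range of K values tried are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

wss = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wss.append(km.inertia_)          # within-cluster sum of squares for this K

plt.plot(k_values, wss, marker="o")  # look for the "elbow" in this curve
plt.xlabel("Number of clusters K")
plt.ylabel("WSS")
plt.show()
```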
Hierarchical Clustering
• Hierarchical clustering algorithms fall into the following two categories.
o Agglomerative hierarchical algorithms (bottom-up approach) − In
agglomerative hierarchical algorithms, each data point is treated as a single
cluster, and then the closest pairs of clusters are successively merged (agglomerated). The
hierarchy of the clusters is represented as a dendrogram or tree structure.
o Divisive hierarchical algorithms (Top-down approach) − On the other hand,
in divisive hierarchical algorithms, all the data points are treated as one big
cluster and the process of clustering involves dividing the one big cluster into
various small clusters.
• Advantages:
o It is easy to understand and implement.
o We don’t have to pre-specify any particular number of clusters.
o Can obtain any desired number of clusters by cutting the Dendrogram at the
proper level.
o They may correspond to meaningful classification.
o Easy to decide the number of clusters by merely looking at the Dendrogram.
• Disadvantages:
o Hierarchical Clustering does not work well on vast amounts of data.
o Once a decision is made to combine two clusters, it can not be undone.
• The way in which the distance between two clusters is measured is crucial for
hierarchical clustering.
• There are various ways to calculate the distance between two clusters, and these
ways decide the rule for clustering.
o Single Linkage: It is the shortest distance between the closest points of the
two clusters.
o Complete Linkage: It is the farthest distance between two points in two
different clusters. It is one of the popular linkage methods as it forms tighter
clusters than single linkage.
o Average Linkage: It is the linkage method in which the distances between every
pair of points (one from each cluster) are added up and then divided by the total
number of pairs to calculate the average distance between two clusters. It is also
one of the most popular linkage methods.
o Centroid Method: It is the linkage method in which the clusters with the minimum
distance between their centroids are combined.
Agglomerative Clustering
1. Let each point be a cluster
2. Compute the distance matrix
3. Repeat
a. Merge the two closest clusters
b. Update the distance matrix
4. Until only one cluster remains
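A minimal sketch of agglomerative clustering with SciPy's hierarchy module, assuming scipy and matplotlib are available; the toy points and the choice of single linkage are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Toy data points
X = np.array([[1, 2], [1, 4], [5, 8], [6, 8], [9, 1], [8, 2]])

# Build the hierarchy; 'complete', 'average' or 'centroid' linkage could also be used
Z = linkage(X, method="single")

dendrogram(Z)                 # tree-like structure (dendrogram) of the merges
plt.show()

# Cut the dendrogram to obtain a desired number of clusters (here 3)
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```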
Artificial Neural Networks (ANN)
• A biological neuron has three main components:
o Dendrites: receive signals from other neurons.
o Soma (or cell body): sums the incoming signals. When sufficient input is
received, the cell fires; that is, it transmits a signal over its axon to other cells.
o Axon: carries the output signal from the cell body to other neurons.
• ANN is an information processing system that has certain performance
characteristics in common with biological neurons.
• Here, x1 to xm are the inputs to the neuron and Y is the output of the neuron. Each
input link is weighted, with weights w1 to wm, and 'b' is the bias.
• The Net Input is calculated as Yin = b + x1w1 + x2w2 + … + xmwm
• The output Y is calculated as Y = f(Yin) where f is the activation function.
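A tiny numeric sketch of this computation; the input, weight and bias values are made up, and a sigmoid is assumed for the activation function f.

```python
import numpy as np

x = np.array([0.5, 0.2, 0.1])     # inputs x1..x3 (illustrative values)
w = np.array([0.4, -0.3, 0.6])    # weights w1..w3 (illustrative values)
b = 0.1                           # bias

y_in = b + np.dot(x, w)           # net input: Yin = b + x1*w1 + x2*w2 + x3*w3
y = 1.0 / (1.0 + np.exp(-y_in))   # output Y = f(Yin), here f = sigmoid
print(y_in, y)
```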
Architecture of an ANN
• Input Layer:
o As the name suggests, it accepts inputs in several different formats provided
by the programmer.
• Hidden Layer:
o The hidden layer lies in between the input and output layers. It performs all
the calculations to find hidden features and patterns.
• Output Layer:
o The input goes through a series of transformations using the hidden layer,
which finally results in output that is conveyed using this layer.
o The weighted total is passed as an input to an activation function to
produce the output. Activation functions decide whether a node should fire
or not. Only the nodes that fire contribute to the output layer. There are
distinctive activation functions available that can be applied depending on the
sort of task we are performing.
Perceptron Learning Algorithm
A Perceptron is an algorithm for supervised learning of binary classifiers. This
algorithm enables neurons to learn and processes elements in the training set one at
a time.
The output is obtained by applying a step (threshold) activation function to the net
input, where theta (θ) is the threshold.
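A hedged sketch of one common formulation of the perceptron learning rule; the binary step activation, the update rule w ← w + α(t − y)x and the logical-AND example are assumptions of this illustration rather than material from the notes.

```python
import numpy as np

def perceptron_train(X, t, alpha=0.1, theta=0.0, epochs=100):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x_i, t_i in zip(X, t):            # process one training element at a time
            y_in = b + np.dot(w, x_i)
            y = 1 if y_in > theta else 0      # fire only if the net input exceeds the threshold θ
            if y != t_i:
                w += alpha * (t_i - y) * x_i  # nudge the weights towards the target
                b += alpha * (t_i - y)
                errors += 1
        if errors == 0:                       # stop once every sample is classified correctly
            break
    return w, b

# Example: learning the logical AND of two binary inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 0, 0, 1])
print(perceptron_train(X, t))
```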
Backpropagation
• While designing a Neural Network, in the beginning, we initialize weights with
some random values or any variable for that fact.
• So, it’s not necessary that whatever weight values we have selected will be correct,
or it fits our model the best.
• We have selected some weight values in the beginning, but our model output may be
far from the actual (target) output, i.e., the error may be high.
• To reduce the error, we need to somehow tell the model to change the
parameters (weights) such that the error becomes minimum.
• This is where backpropagation comes into play.
What is backpropagation?
• Backpropagation is the essence of neural network training.
• Backpropagation is a supervised learning algorithm, for training Multi-layer
Perceptrons (Artificial Neural Networks).
• It is the method of fine-tuning the weights of a neural network based on the error
rate obtained in the previous epoch (i.e., iteration).
• Proper tuning of the weights allows you to reduce error rates and make the
model reliable by increasing its generalization.
• The Backpropagation algorithm looks for the minimum value of the error function
in weight space using a technique called the delta rule or gradient descent. The
weights that minimize the error function are then considered to be a solution to the
learning problem.
The backpropagation algorithm consists of the following three phases:
• Phase 1 − Feed Forward Phase
• Phase 2 − Back Propagation of error
• Phase 3 − Updating of weights
Backpropagation Algorithm
Initialization
• Initialize the following to start the training: Weights and Learning rate (α)
• For easy calculation and simplicity, take some small random values.
Phase 1: Feed Forward Phase
1. Each input unit receives the input signal xᵢ and sends it to the hidden units, for all
1 ≤ i ≤ n
2. Calculate the net input 𝑯𝒊𝒏𝒋 at each hidden unit 𝑯𝒋 using the following relation
for all 𝟏 ≤ 𝒋 ≤ 𝒑
𝐻𝑖𝑛𝑗 = 𝑏𝑗 + ∑ᵢ₌₁ⁿ 𝑥𝑖 𝑤𝑖𝑗
where, 𝒃𝒋 is the bias on hidden unit 𝑯𝒋 and 𝒘𝒊𝒋 is the weight on the link coming
from input unit 𝑿𝒊 to 𝑯𝒋
Now calculate the net output by applying the following activation function:
𝐻𝑗 = 𝑓(𝐻𝑖𝑛𝑗 )
Send these output signals of the hidden layer units to the output layer units.
3. Calculate the net input 𝒀𝒊𝒏𝒌 at each output unit 𝒀𝒌 using the following relation
for all 𝟏 ≤ 𝒌 ≤ 𝒎
𝑌𝑖𝑛𝑘 = 𝑏𝑘 + ∑ⱼ₌₁ᵖ 𝐻𝑗 𝑤𝑗𝑘
where, 𝒃𝒌 is the bias on output unit 𝒀𝒌 and 𝒘𝒋𝒌 is the weight on the link coming
from hidden unit 𝑯𝒋 to 𝒀𝒌
Now calculate the net output by applying the following activation function:
𝑌𝑘 = 𝑓(𝑌𝑖𝑛𝑘 )
Phase 2: Back Propagation of error
1. Compute the error correcting term, in correspondence with the target pattern
received at each output unit, as follows
𝛿𝑘 = (𝑡𝑘 − 𝑌𝑘 ) 𝑓′(𝑌𝑖𝑛𝑘 )
On this basis, update the deltas of weight and bias of output layer as follows:
𝛥𝑤𝑗𝑘 = 𝛼 𝛿𝑘 𝐻𝑗
𝛥𝑏𝑘 = 𝛼 𝛿𝑘
Then, send 𝜹𝒌 back to the hidden layer.
2. Now each hidden unit sums the delta inputs it receives from the output units:
𝛿𝑖𝑛𝑗 = ∑ₖ₌₁ᵐ 𝛿𝑘 𝑤𝑗𝑘
Error term can be calculated as follows:
𝛿𝑗 = 𝛿𝑖𝑛𝑗 𝑓′(𝐻𝑖𝑛𝑗 )
On this basis, update the deltas of weight and bias of hidden layer as follows:
𝛥𝑤𝑖𝑗 = 𝛼 𝛿𝑗 𝑋𝑖
𝛥𝑏𝑗 = 𝛼 𝛿𝑗
Phase 3: Updating of weights
1. Each output unit 𝒀𝒌 updates the weight and bias as follows:
𝑤𝑗𝑘 (𝑛𝑒𝑤) = 𝑤𝑗𝑘 (𝑜𝑙𝑑 ) + 𝛥𝑤𝑗𝑘
𝑏𝑘 (𝑛𝑒𝑤) = 𝑏𝑘 (𝑜𝑙𝑑 ) + 𝛥𝑏𝑘
2. Each hidden unit 𝑯𝒋 updates the weight and bias as follows:
𝑤𝑖𝑗 (𝑛𝑒𝑤) = 𝑤𝑖𝑗 (𝑜𝑙𝑑 ) + 𝛥𝑤𝑖𝑗
𝑏𝑗 (𝑛𝑒𝑤) = 𝑏𝑗 (𝑜𝑙𝑑 ) + 𝛥𝑏𝑗
Continue the above three phases for every training pair until the stopping condition is
met. The stopping condition may be that a set number of epochs has been reached or
that the target output matches the actual output.
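A compact NumPy sketch of the three phases for a single hidden layer with a sigmoid activation; the XOR data, layer sizes and learning rate are illustrative, and the updates here are applied in batch over all training pairs rather than one pair at a time as in the algorithm above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)

# Toy training pairs (XOR), purely illustrative
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

n, p, m = 2, 4, 1          # input, hidden, output units
alpha = 0.5                # learning rate
W1 = rng.normal(scale=0.5, size=(n, p)); b1 = np.zeros(p)   # input -> hidden weights/bias
W2 = rng.normal(scale=0.5, size=(p, m)); b2 = np.zeros(m)   # hidden -> output weights/bias

for epoch in range(10000):
    # Phase 1: feed forward
    H_in = X @ W1 + b1
    H = sigmoid(H_in)
    Y_in = H @ W2 + b2
    Y = sigmoid(Y_in)

    # Phase 2: back propagation of error
    delta_k = (T - Y) * sigmoid_deriv(Y_in)        # output-layer error term
    delta_in_j = delta_k @ W2.T                    # error sent back to the hidden layer
    delta_j = delta_in_j * sigmoid_deriv(H_in)     # hidden-layer error term

    # Phase 3: updating of weights and biases
    W2 += alpha * H.T @ delta_k
    b2 += alpha * delta_k.sum(axis=0)
    W1 += alpha * X.T @ delta_j
    b1 += alpha * delta_j.sum(axis=0)

print(Y.round(2))   # should approach [[0], [1], [1], [0]]
```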
Instance Based Learning (IBL)
• Eager Learning: It is a type of ML model where, given a set of training tuples, it
constructs a classification model before receiving new data (i.e., test data) to
classify. Eg: ANNs, Decision Tree, Naive Bayes etc.
• Lazy Learning (aka Instance based Learning): It simply stores the training data or
only does some minor pre-processing and waits for a test sample to be given to it.
It does not build a model from the training data in advance; it only starts learning and
classifying once test data is given. Hence, it takes less time learning and more time classifying data.
• Examples of IBL/Lazy learning algorithms: K-Nearest Neighbours (KNN), locally
weighted regression, case-based reasoning etc.
• Advantages of IBL:
o Instead of estimating for the entire instance set, local approximations can be
made to the target function.
o This type of learning can adapt to new data easily, one which is collected as
we go.
• Disadvantages of IBL:
o Classification costs are high
o Large amount of memory required to store the data, and each query involves
starting the identification of a local model from scratch.
K-Nearest Neighbours
• K-nearest neighbours (KNN) algorithm is a type of supervised ML algorithm which
can be used for both classification as well as regression predictive problems.
• KNN is a lazy learning algorithm because it does not have a specialized training
phase and uses all of the training data at classification time.
• KNN is also a non-parametric learning algorithm because it doesn’t assume
anything about the underlying data.
• It requires three inputs:
o The training data
o A distance metric like Euclidean or Manhattan to compute the distance
between the test data and each individual training data
o The value of k, the number of neighbours to select/retrieve.
• To classify an unknown test data:
o It computes the distance of the test data to other training records
o Then, it identifies the k nearest neighbours
o Use class labels of nearest neighbours to determine the class label of an
unknown record (Ex: by taking the majority vote)
• Ideally, the value of k must not be a multiple of the number of class labels.
Otherwise, it may cause ties. (Ex: if number of classes in training data is 3, then k
should not be 3, 6, 9 etc)
• Popular distance metrics:
o Euclidean: √((x₂ − x₁)² + (y₂ − y₁)²)
o Squared Euclidean: (x₂ − x₁)² + (y₂ − y₁)²
o Manhattan: |x₂ − x₁| + |y₂ − y₁|
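A small from-scratch sketch of the KNN classification step, assuming Euclidean distance and a simple majority vote; the helper name and the toy data are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # Distance from the test record to every training record (Euclidean)
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the k nearest neighbours
    nearest = np.argsort(dists)[:k]
    # Majority vote among the neighbours' class labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage
X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2, 1]), k=3))   # prints "A"
```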
Advantages of KNN Algorithm:
• It is simple to implement.
• It is robust to the noisy training data
• It can be more effective if the training data is large.
• It has relatively high accuracy but there are much better supervised learning
models than KNN.
Disadvantages of KNN Algorithm:
• It always needs the value of K to be determined, which may be complex at times.
• The computation cost is high because of calculating the distance between the
data points for all the training samples.
• High memory storage required as compared to other supervised learning
algorithms
• Prediction rate is slow.
Types of Errors
• There are two main types of errors present in any machine learning model.
• They are Reducible Errors and Irreducible Errors.
• Irreducible errors are errors which will always be present in a machine learning
model, because of unknown variables, and whose values cannot be reduced.
• Reducible errors are those errors whose values can be further reduced to improve
a model. They are caused because our model’s output function does not match the
desired output function and can be optimized.
• We can further divide reducible errors into two: Bias and Variance
o Bias:
▪ Bias is the difference between our actual and predicted values. Bias is
the simple assumptions that our model makes about our data to be able
to predict new data.
▪ When the Bias is high, the assumptions made by our model are too basic,
and the model can't capture the important features of our data.
▪ This means that our model hasn’t captured patterns in the training data
and hence cannot perform well on the testing data too.
▪ This instance, where the model cannot find patterns in our training set
and hence fails for both train and test data, is called Underfitting.
o Variance
▪ We can define variance as the model’s sensitivity to fluctuations in the
data. Our model may learn from noise. This will cause our model to
consider trivial features as important.
▪ Variance is the change in prediction accuracy of ML model between
training data and test data.
▪ This simply means that if an ML model is predicting with an accuracy of
"x" on training data and its prediction accuracy on test data is "y" then
variance = x - y
▪ If the variance is high, our model will capture all the features of the
train data given to it, including the noise, will tune itself to the data,
and predict it very well, but when given new test data, it cannot predict
on it as it is too specific to the training data.
▪ Hence, our model will perform really well on training data and get high
accuracy but will fail to perform on new, test data. New data may not
have the exact same features and the model thus won’t be able to
predict it very well. This is called Overfitting.
• When building a supervised machine-learning algorithm, the goal is to achieve low
bias and low variance for the most accurate predictions. A model that exhibits
small variance and high bias will underfit the target, while a model with high
variance and little bias will overfit the target.
Ensemble Learning Methods
• Ensemble methods are techniques that aim at improving the accuracy of results in
models by combining multiple models instead of using a single model.
• The combined models increase the accuracy of the results significantly by
offsetting each individual model’s variances and biases. This has boosted the
popularity of ensemble methods in machine learning.
• Thus, the main idea behind ensemble learning is to group weak learners together
to form one strong learner that has better accuracy than any individual weak
learner.
• Three main types of ensemble learning:
o Bagging
▪ It considers homogenous weak learners, focuses on reducing variance
▪ It involves fitting many decision trees on different samples of the same
dataset and averaging the predictions.
o Boosting
▪ It considers homogenous weak learners, focuses on reducing bias
▪ It involves adding ensemble members sequentially that correct the
predictions made by prior models and outputs a weighted average of
the predictions.
o Stacking
▪ It considers heterogenous weak learners
▪ It involves fitting many different model types on the same data and
using another model to learn how best to combine the predictions.
• Alternatively, ensemble learning can also be classified as:
o Parallel ensemble methods:
▪ In this kind of Ensemble method, the base learners are generated in
parallel, so there is no data dependency between them.
▪ Every base learner is generated independently of the others.
▪ Example: Stacking
o Sequential ensemble methods:
▪ In this kind of Ensemble method, the base learners are generated
sequentially, so there is data dependency.
▪ Every base learner has some dependency on the previously generated ones.
▪ So, the previously mislabeled data are re-weighted so that the performance
of the overall system improves.
▪ Example: Boosting
Bagging
• Bagging is used when our objective is to reduce the variance of a decision tree.
Here the concept is to create a few subsets of data from the training sample,
which is chosen randomly with replacement.
• Now each collection of subset data is used to prepare their decision trees, thus,
we end up with an ensemble of various models.
• The average of the predictions from the numerous trees is used, which is more
powerful than a single decision tree.
• Random Forest is an extension of bagging. In addition to building each tree on a
random subset of the data, it also makes a random selection of features rather than
using all features to develop the trees. When we have numerous such random
trees, the result is called a Random Forest.
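A hedged sketch of bagging with scikit-learn's BaggingClassifier, whose default base learner is a decision tree; the dataset and the number of estimators are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Many trees, each fitted on a bootstrap sample (random subset with replacement)
# of the training data; the default base estimator is a decision tree
bag = BaggingClassifier(n_estimators=50, random_state=42)
bag.fit(X_train, y_train)
print("bagging accuracy:", bag.score(X_test, y_test))
```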
Stacking
• It is an ensemble method that seeks a diverse group of members by varying the
model types fit on the training data and using a model to combine predictions.
• Stacking mainly differs from bagging and boosting in that it considers
heterogeneous weak learners (different learning algorithms are combined),
whereas bagging and boosting mainly consider homogeneous weak learners.
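A hedged sketch of stacking with scikit-learn's StackingClassifier, combining heterogeneous base learners with a logistic-regression meta-model; the choice of learners and the dataset are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Heterogeneous weak learners...
base_learners = [
    ("tree", DecisionTreeClassifier(max_depth=3)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]
# ...combined by a meta-model that learns how best to weigh their predictions
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_train, y_train)
print("stacking accuracy:", stack.score(X_test, y_test))
```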
Boosting
• The key property of boosting ensembles is the idea of correcting prediction
errors. The models are fit and added to the ensemble sequentially such that the
second model attempts to correct the predictions of the first model, the third
corrects the second model, and so on.
• It is a method that in general decreases the bias error and builds strong predictive
models. The term ‘Boosting’ refers to a family of algorithms which converts a weak
learner to a strong learner.
• Ex: The AdaBoost algorithm commences by training a decision tree in which every
observation is assigned an equal weight.
• After analysing the first tree, we raise the weights of every observation that we
find complicated to classify. On the other hand, we decrease the weights for the
ones in which classification is not an issue.
• The second tree is therefore grown on this weighted data. The idea is to improve
upon the first tree's predictions.
• Therefore, the last ensemble model’s predictions will be the overall weighted
predictions provided by former tree models.
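A hedged sketch of AdaBoost with scikit-learn, where shallow decision trees (stumps) act as the weak learners; the dataset and hyperparameters are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Weak learners (decision stumps by default) are added sequentially;
# observations that are misclassified get higher weights for the next learner
boost = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=42)
boost.fit(X_train, y_train)
print("AdaBoost accuracy:", boost.score(X_test, y_test))
```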
Random Forest
• Random Forest is a popular machine learning algorithm that belongs to the
supervised learning technique. It can be used for both Classification and Regression
problems in ML.
• It is based on the concept of ensemble learning, which is a process of combining
multiple classifiers to solve a complex problem and to improve the performance of
the model.
• Random Forest is a classifier that contains a number of decision trees on various
subsets of the given dataset and takes the average to improve the predictive
accuracy of that dataset.
• Instead of relying on one decision tree, the random forest takes the prediction
from each tree and predicts the final output based on the majority vote of those
predictions.
• Random forest algorithm combines multiple decision-trees, resulting in a forest of
trees, hence the name Random Forest.
• One problem that might occur with one big (deep) single DT is that it can overfit.
• The point of RF is to prevent overfitting. It does this by creating random subsets of
the features and building smaller (shallow) trees using the subsets and then it
combines the subtrees.
• The greater number of trees in the forest leads to higher accuracy and prevents
the problem of overfitting.
Working of a Random Forest
• Random Forest works in two phases: the first is to create the random forest by
combining N decision trees, and the second is to make predictions using each tree
created in the first phase.
First Phase
1. Randomly select k records from a total of m records where k < m.
2. Among the k records, calculate the node D using the best split point.
3. Split the node into daughter nodes using the best split.
4. Repeat 1 to 3 steps until leaf nodes have been finalized.
5. Build forest by repeating steps 1 to 4 for 𝓃 number of times to create 𝓃 number of
trees.
Second Phase
• In the second stage, we make predictions using the trained random forest
algorithm.
• We take the test features and use the rules of each randomly created decision
tree to predict the outcome, and we store the predicted outcomes.
• Then, we calculate the votes for each predicted target.
• Finally, we consider the highest voted predicted target as the final prediction from
the random forest algorithm.
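A hedged sketch of these two phases with scikit-learn's RandomForestClassifier, which builds the trees on bootstrap samples with random feature subsets and then aggregates their votes; the dataset and parameters are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# First phase: build a forest of n trees on random subsets of records/features
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
forest.fit(X_train, y_train)

# Second phase: each tree votes and the majority vote is the final prediction
print("random forest accuracy:", forest.score(X_test, y_test))
```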
Applications of Random Forests
• Banking: Banking sector mostly uses this RF for the identification of loan risk.
• Medicine: Disease trends and risks of the disease can be identified using RF
• Land Use: We can identify the areas of similar land use
• Marketing: Marketing trends can be identified
• Health care: Identifying the treatment/ medicine based on history and symptoms.
• E-Commerce: recommended products and customer experience
• Stock Market: Predicting stock market portfolio performance.
Advantages of Random Forests
• Solves the Decision tree overfit problem. RFs with enough trees won’t overfit
• Random Forest can be used for both classification and regression problems.
• Random forests are extremely flexible and have very high accuracy.
• RFs maintain accuracy even when a large proportion of the data is missing.
Disadvantages of Random Forests
• RFs are more complex, harder and more time-consuming to construct than decision trees.
• A large number of trees can make the algorithm too slow and ineffective for real-time
predictions.
Key differences between RFs and DTs
• A Random Forest is a set of multiple decision-trees.
• Decision-trees are computationally faster as compared to random forests.
• Deep decision-trees may suffer from overfitting. Random forest reduces
overfitting by building trees on random subsets of the data and features.
• Random forest is difficult to interpret. But a decision-tree is easily interpretable
and can be converted to rules.