Introduction to Data Mining
What is Data?
• Data is any factual information or measurement that is collected and used for
making a decision, reasoning, or any calculation.
• We need to process the data in order to extract and understand the information
that the data is telling us.
• Data can be either qualitative (expressed as text, e.g., the color of a product or a
description of the product features) or quantitative (expressed as numbers, e.g., the
number of items with color green is 4)
What is Data Mining?
• Data Mining is defined as extracting information from huge sets of data.
• In other words, we can say that data mining is the procedure of mining
knowledge from data.
• Data mining is the use of efficient techniques for the analysis of very large
collections of data and the extraction of useful and possibly unexpected
patterns in data.
• Temporal data
o Time Series Data: Time series data is a special type of sequential data in which
each record is a time series, i.e., a series of measurements taken over time. For
example, daily prices of various stocks.
• Spatial Data
o Some objects have spatial attributes, such as positions or areas, as well as
other types of attributes. An example of spatial data is weather data that is
collected for a variety of geographical locations.
o Spatial databases are databases that, in addition to usual data, store
geographical information like maps, and global or regional positioning. Such
spatial databases present new challenges to data mining algorithms.
• Structured vs Unstructured
o Structured data is highly-organized and formatted so that it's easily searchable
in relational databases. Ex: names, dates, addresses, credit card numbers etc.
o Unstructured data has no predefined format or organization, making it much
more difficult to collect, process, and analyze. Ex: text, video files, audio files,
mobile activity, social media posts, satellite imagery, surveillance imagery etc.
• Data cleaning
o Fill in missing values, remove noise, identify or remove outliers, and resolve
inconsistencies
• Data integration
o Data is merged from multiple sources, so careful integration is required
o Tools: data migration tools, data synchronization tools
• Data reduction/selection
o Obtains a reduced representation of the data set that is much smaller in
volume, yet produces nearly the same (or the same) analytical results.
o Dimensionality Reduction: In dimensionality reduction, data encoding
schemes are applied so as to obtain a reduced or “compressed”
representation of the original data.
o Numerosity reduction: In numerosity reduction, the data are replaced by
alternative, smaller representations using parametric models (e.g., regression
or log-linear models) or nonparametric models (e.g., histograms, clusters,
sampling, or data aggregation).
o Data compression
• Data transformation
o Normalization
o Transforming data to a single, consistent data type
o Ex: all entries in a date field should have the same type/format
Data Cleaning
• Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning
routines attempt to fill in missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data.
• Lots of potentially incorrect data
o Incomplete: lacking attribute values or containing only aggregate data. Ex:
Occupation=“ ” (missing data)
o Noisy: containing noise, errors, or outliers. Ex: Salary = −10 (an error)
o Inconsistent: containing discrepancies in codes or names. Ex: Age = 42,
Birthday = 03/07/2010, discrepancy between duplicate records
o Intentional: disguised missing data Ex: Jan. 1 as everyone’s birthday
Example: consider a data set such as 13, 18, 13, 14, 13, 16, 14, 21, 13.
The mean is the sum of the values (135) divided by the number of values (9), i.e., 15.
The median is the middle value of the sorted list (13, 13, 13, 13, 14, 14, 16, 18, 21), i.e., 14.
The mode is the number that is repeated more often than any other, so 13 is the mode.
The largest value in the list is 21, and the smallest is 13, so the range is 21 − 13 = 8.
Mean = 15, Median = 14, Mode = 13, Range = 8
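As a quick sketch of these measures in Python (using the illustrative data set above):

```python
import statistics

data = [13, 18, 13, 14, 13, 16, 14, 21, 13]   # illustrative data set from the example

print(statistics.mean(data))    # 15
print(statistics.median(data))  # 14
print(statistics.mode(data))    # 13
print(max(data) - min(data))    # range: 21 - 13 = 8
```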
Data Integration
• Data Integration is the merging of data from multiple data stores. Data Mining
often requires merging of data from multiple sources (database, data cubes and
flat files)
• Careful Integration is required to reduce and avoid redundancies and
inconsistencies
Entity Identification Problem
• It deals with matching equivalent real-world entities from multiple data sources.
• For example, how can the data analyst or the computer be sure that customer_id
in one database and cust_number in another refer to the same attribute?
• Examples of metadata for each attribute include the name, meaning, data type,
and range of values permitted for the attribute, and null rules for handling blank,
zero, or null values. Such metadata can be used to help avoid errors in schema
integration.
Redundancy and Correlation Analysis
• An attribute (such as age) may be redundant if it can be “derived” from another
attribute or set of attributes (like DOB).
• Inconsistencies in attribute or dimension naming can also cause redundancies in
the resulting data set.
• Some redundancies can be detected by correlation analysis.
• Given two attributes, such analysis can measure how strongly one attribute implies
the other.
• For nominal data, we use the χ2 (chi-square) test.
• For numeric attributes, we can use the correlation coefficient and covariance.
o Correlation: is a measure that indicates how strongly two variables are
related. Lies between -1 and +1.
o Covariance: is a measure indicating the extent to which two random
variables change together. It is related to correlation. Lies between −∞ and +∞
Example (χ2 test): the χ2 test checks whether two categorical (predictor) variables are
related / dependent on each other, using χ2 = Σ (observed − expected)² / expected.
A 2 × 2 contingency table of gender (Male / Female) vs. preferred_reading is built with
the observed counts and with the expected counts derived from the row and column
totals. The χ2 value computed from this table is 507.93.
Degree of freedom = (nos. of rows − 1)(nos. of cols − 1) = 1 and Critical Value = 10.828
(from the Chi-Square Distribution Table at .001 error and 1 degree of freedom). Since our
computed value (507.93) is above this, we can conclude that preferred_reading and
gender are strongly correlated.
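A small sketch of the χ2 computation in NumPy. The observed counts below are assumed for illustration (they are consistent with the 507.93 value quoted above):

```python
import numpy as np

# Assumed observed counts: rows = preferred_reading (fiction, non-fiction),
# columns = gender (male, female); chosen so that chi-square matches the example.
observed = np.array([[250,  200],
                     [ 50, 1000]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

chi_square = ((observed - expected) ** 2 / expected).sum()
print(round(chi_square, 2))   # ~507.94 (507.93 in the notes, which round intermediate terms)
```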
• If r > 0 → x & y are positively correlated (x’s values increase as y’s values increase)
• The higher the value of r, the more strongly correlated they are.
• If r < 0 → x and y are negatively correlated; if r = 0 → x and y are uncorrelated (no linear relationship)
Covariance of Numeric Data
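The formulas themselves are not reproduced in these notes; as a sketch, the sample covariance (Cov(x, y) = Σ (x_i − x̄)(y_i − ȳ) / (n − 1)) and the correlation coefficient can be computed with NumPy on assumed data:

```python
import numpy as np

# Illustrative paired observations (assumed)
x = np.array([2.0, 3.0, 5.0, 7.0, 9.0])
y = np.array([10.0, 11.0, 14.0, 17.0, 22.0])

cov_xy = np.cov(x, y)[0, 1]      # sample covariance; can lie anywhere in (-inf, +inf)
r_xy = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient; lies in [-1, +1]

print(cov_xy, r_xy)
```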
Data Reduction/Selection
• Complex data analysis and mining on huge amounts of data can take a long time,
making such analysis impractical or infeasible.
• Data reduction techniques can be applied to obtain a reduced representation of
the data set that is much smaller in volume, yet closely maintains the integrity of
the original data.
• Data reduction strategies
o Dimensionality reduction:
▪ It is the process of reducing the number of random variables or
attributes under consideration.
▪ Curse of dimensionality
• Curse of Dimensionality refers to a set of problems that arise
when working with high-dimensional data.
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which is critical to
clustering, outlier analysis, becomes less meaningful
▪ Aims of Dimensionality Reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
▪ Ex: Wavelet Transforms, Principal Components Analysis, Attribute
Subset Selection
o Numerosity reduction (some simply call it: Data Reduction)
▪ Regression and Log-Linear Models
▪ Histograms, clustering, sampling
▪ Data cube aggregation
o Data compression
Data Reduction Techniques
• Dimensionality Reduction:
o Wavelet Transforms
▪ Data are transformed to preserve relative distance between objects at
different levels of resolution
▪ Allow natural clusters to become more distinguishable
▪ Used for image compression
▪ Wavelet transforms can be applied to multidimensional data such as a
data cube. This is done by first applying the transform to the first
dimension, then to the second, and so on.
▪ Wavelet transforms give good results on sparse or skewed data and on
data with ordered attributes.
o Principal Components Analysis (PCA)
▪ Find a projection that captures the largest amount of variation in data
▪ The original data are projected onto a much smaller space, resulting in
dimensionality reduction. We find the eigenvectors of the covariance
matrix, and these eigenvectors define the new space
▪ Works for numeric data only
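A minimal PCA sketch with scikit-learn (the numeric data matrix is assumed for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumed numeric data: 6 samples with 3 attributes
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.8],
              [1.9, 2.2, 1.1],
              [3.1, 3.0, 0.4],
              [2.3, 2.7, 0.9]])

# Project the data onto the 2 directions (eigenvectors of the covariance
# matrix) that capture the largest amount of variation
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (6, 2) -- reduced from 3 attributes to 2
print(pca.explained_variance_ratio_)  # variance captured by each component
```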
2. Z-score Normalization
A value v of attribute A is normalized to v′ = (v − Ā) / σA, where Ā is the mean and σA is
the standard deviation of A.
Ex:
Data: 8, 10, 15, 20
Mean = 13.25, SD ≈ 4.66
Normalized values: −1.13, −0.70, 0.38, 1.45
3. Min-max normalization
A value v of A is mapped to v′ = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA.
Ex: Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is
mapped to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
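A short sketch of both normalizations in NumPy, reusing the example values above:

```python
import numpy as np

# Z-score normalization of the example data: (v - mean) / std
data = np.array([8.0, 10.0, 15.0, 20.0])
z = (data - data.mean()) / data.std()      # population std, as in the example
print(np.round(z, 2))                      # [-1.13 -0.7   0.38  1.45]

# Min-max normalization of an income of $73,600 from [12000, 98000] to [0.0, 1.0]
v, v_min, v_max = 73_600.0, 12_000.0, 98_000.0
v_new = (v - v_min) / (v_max - v_min) * (1.0 - 0.0) + 0.0
print(round(v_new, 3))                     # 0.716
```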
Discretization
• Used to reduce the number of values for a given continuous attribute, by dividing
the attribute into a range of intervals.
• Interval value labels can be used to replace actual data values. These methods are
typically recursive, where a large amount of time is spent on sorting the data at
each step.
• The smaller the number of distinct values to sort, the faster these methods should
be.
• Many discretization techniques can be applied recursively in order to provide a
hierarchical or multi resolution partitioning of the attribute values known as
concept hierarchy.
• Binning
o It is a top-down splitting technique based on a specified number of bins
o Binning does not use class information and is therefore an unsupervised
discretization technique.
o Binning Methods:
▪ Equal-width (distance) partitioning
Divides the range into N intervals of equal size.
If A and B are the lowest and highest values of the attribute,
the width of intervals will be: W = (B –A)/N.
Ex: 4, 8, 15, 21, 21, 24, 25, 28, 34
W = (B − A)/N = (34 − 4) / 3 = 10 (N is the number of bins; here we choose N = 3)
Bin 1: 4–14, Bin 2: 15–24, Bin 3: 25–34
• Bin 1: 4, 8
• Bin 2: 15, 21, 21, 24
• Bin 3: 25, 28, 34
▪ Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing approximately same
number of samples. Ex: 4, 8, 15, 21, 21, 24, 25, 28, 34
Equal-depth (frequency) partitioning:
• Bin 1: 4, 8, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 28, 34
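Both partitioning schemes can be sketched with pandas, using the example values above (pd.cut for equal-width, pd.qcut for equal-depth):

```python
import pandas as pd

values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-width: 3 intervals of equal size over the range [4, 34]
print(pd.cut(values, bins=3).value_counts().sort_index())

# Equal-depth (frequency): 3 intervals containing roughly the same number of samples
print(pd.qcut(values, q=3).value_counts().sort_index())
```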
• Histogram Analysis
o It is an unsupervised discretization technique because it does not use class
information.
o A histogram partitions the values of an attribute, A, into disjoint ranges called
buckets or bins.
• Cluster Analysis
o A clustering algorithm can be applied to discretize a numeric attribute, A, by
partitioning the values of A into clusters or groups.
o Clustering takes the distribution of A into consideration, as well as the
closeness of data points, and therefore is able to produce high-quality
discretization results.
• Decision Tree Analysis
o This technique employs a top-down splitting approach.
o Unlike the other methods mentioned so far, decision tree approaches to
discretization are supervised, that is, they make use of class label
information.
• Correlation Analysis
o ChiMerge is a χ2 -based discretization method which employs a bottom-up
approach by finding the best neighboring intervals and then merging them to
form larger intervals, recursively.
o ChiMerge is supervised in that it uses class information.
• Concept Hierarchy Generation
o It organizes concepts (i.e., attribute values) hierarchically and is usually
associated with each dimension in a data warehouse
o Concept hierarchy formation: Recursively reduce the data by collecting and
replacing low level concepts (such as numeric values for age) by higher level
concepts (such as youth, adult, or senior)
o Concept hierarchies can be explicitly specified by domain experts and/or data
warehouse designers
DMML Module 2
What Is Association Mining?
• Association Rule Mining is a Data Mining technique that finds patterns in data.
• The patterns found by Association Rule Mining represent relationships between
items. When this is used with sales data, it is referred to as Market Basket
Analysis.
Definitions
• Itemset
o A collection of one or more items
o Example: {Milk, Bread, Diaper}
o k-itemset: An itemset that contains k items
• Support count (σ)
o Frequency of occurrence of an itemset
o E.g. σ ({Milk, Bread, Diaper}) = 2
• Support
o Fraction of transactions that contain an itemset
o E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
o An itemset whose support is greater than or equal to a minsup threshold
The Apriori Principle
Main Observations:
• If an itemset is frequent, then all of its subsets must also be frequent
• If an itemset is not frequent, then all of its supersets cannot be frequent
Candidate Generation
• Construct candidates of size k+1 by combining two frequent itemsets of size k
• Prune a generated (k+1)-itemset if any of its k-subsets is not frequent
Two-step approach
1. Frequent Itemset Generation
o Generate all itemsets whose support ≥ minsup
2. Rule Generation
o Generate high confidence rules from each frequent itemset, where each
rule is a partitioning of a frequent itemset into LHS and RHS.
Example: Frequent itemset: {A,B,C,D}
Rule: AB→CD
Apriori Algorithm Example
Given, the following data with minimum support count = 2
Step 1: K=1
i) Create a table containing support count of each item present in dataset called
Candidate Set C1.
ii) Compare candidate set C1’s support count with minimum
support count (here min_support=2).
If sup_count of candidate set items is less than min_support
then remove those items. This gives us itemset L1.
As every item’s sup_count >= 2, no items are removed.
Step 2: K=2
i) Now generate C2 by joining L1 with L1. After joining, we get the candidate 2-itemsets C2.
ii) Now check if all subsets of an itemset are frequent or not and if not
frequent remove that itemset.
Example: the subsets of {I1, I2} are {I1} and {I2}, and both are frequent. Check
similarly for each itemset.
iii) Now find support count of these itemsets by searching in dataset
iv) Now remove the itemsets from C2 whose sup_count is less than
minimum support count to get L2.
Step 3: K=3
i) Now generate C3 by joining L2 with L2. Condition of joining Lk-1 and Lk-1 is that it
should have (K-2) elements in common. So here, for L2, first element should match.
So C3 generated by joining L2 is {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4},
{I2, I4, I5}, {I2, I3, I5}
ii) Check if all subsets of these itemsets are frequent or not and if not, then remove
that itemset.
Here subset of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3} which are frequent.
For {I2, I3, I4}, subset {I3, I4} is not frequent so remove it. Check for every itemset.
iii) Now, find support count of these remaining itemset and remove those that have
sup_count < minimum support count. After this, we get L3:
Not Frequent List
{I1, I4}, {I3, I4}, {I3, I5}, {I4, I5}, {I1, I3, I5},
{I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}
Step 4: K=4
i) Generate C4 by joining L3 and L3. Condition of joining Lk-1 and Lk-1 is that it should
have (K-2) elements in common. So here, for L3, first 2 elements should match.
ii) Check all subsets of these itemsets are frequent or not
Here itemset formed by joining L3 is {I1, I2, I3, I5} but its subset contains {I1, I3, I5},
which is not frequent. Hence, no itemset in C4.
We stop here because no frequent itemsets are found further.
Hence Frequent Itemsets are {I1, I2, I3} and {I1, I2, I5}
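A compact Python sketch of Apriori frequent-itemset generation. The nine transactions below are assumed for illustration; they are consistent with the support counts used in this worked example (minimum support count = 2):

```python
from itertools import combinations

# Assumed transaction database, consistent with the worked example above
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_sup = 2

def support_count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions)

# L1: frequent 1-itemsets
items = sorted({i for t in transactions for i in t})
frequent = [{frozenset([i]) for i in items if support_count({i}) >= min_sup}]

# Lk: join L(k-1) with itself, prune by the Apriori principle, then count support
k = 2
while frequent[-1]:
    prev = frequent[-1]
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))}
    frequent.append({c for c in candidates if support_count(c) >= min_sup})
    k += 1

for level in frequent:
    for itemset in sorted(level, key=sorted):
        print(sorted(itemset), support_count(itemset))
```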
Association Rules from frequent Subsets
For each frequent Itemset X,
For every proper non-empty subset A of X, and if B = X - A,
then A ⇒ B is an association rule if
• confidence (A ⇒ B) ≥ min_confidence
• where, confidence(A ⇒ B) = support_count(A ∪ B) / support_count(A)
Example: From the previous example, we found that {I1, I2, I3} and {I1, I2, I5} itemsets
were frequent. Now, if the given min_confidence is 50%, let’s derive the association
rules.
For frequent itemset {I1, I2, I3}
{I1, I2} ⇒ {I3} ; confidence = sup(I1, I2, I3) / sup(I1, I2) = 2/4*100 = 50%
{I1, I3} ⇒ {I2} ; confidence = sup(I1, I2, I3) / sup(I1, I3) = 2/4*100 = 50%
{I2, I3} ⇒ {I1} ; confidence = sup(I1, I2, I3) / sup(I2, I3) = 2/4*100 = 50%
{I1} ⇒ {I2, I3} ; confidence = sup(I1, I2, I3) / sup(I1) = 2/6*100 = 33%
{I2} ⇒ {I1, I3} ; confidence = sup(I1, I2, I3) / sup(I2) = 2/7*100 = 28%
{I3} ⇒ {I1, I2} ; confidence = sup(I1, I2, I3) / sup(I3) = 2/6*100 = 33%
Hence, we infer that only the first 3 association rules are strong as the rest are < 50%.
For frequent itemset {I1, I2, I5}
{I1, I2} ⇒ {I5} ; confidence = sup(I1, I2, I5) / sup(I1, I2) = 2/4*100 = 50%
{I1, I5} ⇒ {I2} ; confidence = sup(I1, I2, I5) / sup(I1, I5) = 2/2*100 = 100%
{I2, I5} ⇒ {I1} ; confidence = sup(I1, I2, I5) / sup(I2, I5) = 2/2*100 = 100%
{I1} ⇒ {I2, I5} ; confidence = sup(I1, I2, I5) / sup(I1) = 2/6*100 = 33%
{I2} ⇒ {I1, I5} ; confidence = sup(I1, I2, I5) / sup(I2) = 2/7*100 = 28%
{I5} ⇒ {I1, I2} ; confidence = sup(I1, I2, I5) / sup(I5) = 2/2*100 = 100%
Hence, association rules 1, 2, 3 and 6 are strong.
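Continuing the sketch above (it reuses the support_count() helper and the transactions assumed there), rule confidences can be derived directly from the support counts:

```python
from itertools import combinations

def rules_from(itemset, min_conf=0.5):
    """Yield strong rules A => B, where A and B partition the frequent itemset."""
    items = frozenset(itemset)
    for r in range(1, len(items)):
        for lhs in map(frozenset, combinations(items, r)):
            rhs = items - lhs
            conf = support_count(items) / support_count(lhs)
            if conf >= min_conf:
                yield sorted(lhs), sorted(rhs), round(conf * 100)

for rule in rules_from({"I1", "I2", "I5"}):
    print(rule)   # e.g. (['I1', 'I2'], ['I5'], 50), (['I1', 'I5'], ['I2'], 100), ...
```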
Applications of Apriori Algorithm
• Education field: Extracting association rules in data mining of admitted students
through characteristics and specialties.
• Medical field: Analysis of the patient’s database.
• Forestry: Analysis of probability and intensity of forest fire with the forest fire data.
• Recommender system: Used by companies like Amazon and Google for the auto-
complete feature.
Drawbacks of Apriori Algorithm
• Using Apriori needs a generation of candidate itemsets. These itemsets may be
large in number if the itemset in the database is huge.
• Apriori needs multiple scans of the database to check the support of each itemset
generated and this leads to high costs.
Frequent Pattern Growth Algorithm (FP Growth Algorithm)
• This algorithm is an improvement to the Apriori method.
• A frequent pattern is generated without the need for candidate generation.
• FP growth algorithm represents the database in the form of a tree called a
frequent pattern tree or FP tree.
• This tree structure will maintain the association between the itemsets.
FP Growth Algorithm Example 1
For the given Transaction Database, find the Frequent Itemset:
Given, minimum support count = 3.
Now, let's construct the frequency distribution table for each distinct item.
Next, remove the items whose support count (frequency) is < 3.
The resulting items with their sup_counts are:
L = {K:5, E:4, M:3, O:3, Y:3}.
Now, re-write the given data after removing the infrequent items and arranging the
remaining items in descending order of sup_count.
Constructing the FP Tree
(The FP tree diagram for this example is not reproduced here.)
Mining the FP Tree: consider a second database whose frequent-item list is
L = {A:8, B:7, C:6, D:5, E:3}. The given data is already in descending order of sup_count.
For each item, the Conditional Pattern Base (CPB), Conditional FP Tree (CFT) and the
frequent patterns generated are:

Item E
o CPB: {{A, D: 1}, {A, C, D: 1}} | {{B, C: 1}}
o CFT: {(A: 2), (D: 2)}
o Frequent Patterns Generated: {A, E: 2}, {D, E: 2}, {A, D, E: 2}

Item D
o CPB: {{A: 1}, {A, B: 1}, {A, C: 1}, {A, B, C: 1}} | {{B, C: 1}}
o CFT: {(A: 4), (B: 2), (C: 2)}
o Frequent Patterns Generated: {A, D: 4}, {B, D: 2}, {C, D: 2}, {A, B, D: 2}, {A, C, D: 2}
  (Note: {B, C, D: 2} is not added as B and C belong to different branches.)

Item C
o CPB: {{A: 1}, {A, B: 3}} | {{B: 2}}
o CFT: {(A: 4), (B: 3)} | {(B: 2)}
o Frequent Patterns Generated: {A, C: 4}, {B, C: 5}, {A, B, C: 3}

Item B
o CPB: {{A: 5}}
o CFT: {(A: 5)}
o Frequent Patterns Generated: {A, B: 5}
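As a sketch, this kind of mining can also be done without explicit candidate generation using the third-party mlxtend library. The five transactions below are assumed for illustration; they are consistent with the frequency counts L = {K:5, E:4, M:3, O:3, Y:3} from Example 1:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Assumed transaction database (consistent with the counts in Example 1)
transactions = [
    ["E", "K", "M", "N", "O", "Y"],
    ["D", "E", "K", "N", "O", "Y"],
    ["A", "E", "K", "M"],
    ["C", "K", "M", "U", "Y"],
    ["C", "E", "I", "K", "O"],
]

# One-hot encode the transactions, then mine the FP tree.
# min_support = 3/5 corresponds to a minimum support count of 3.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
print(fpgrowth(onehot, min_support=3 / 5, use_colnames=True))
```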
DMML Module 3
Introduction to Machine Learning
Various Definitions of ML
• ML is the field of study that gives computers the ability to learn without being
explicitly programmed.
• Well Posed Learning Problem: A computer program is said to learn from
experience E with respect to some task T and some performance measure P, if its
performance on T, as measured by P improves with experience E.
• Hence ML is the study of algorithms that
o improve their performance P
o at some task T
o with experience E
• A well-defined-learning task is given by <T, P, E>
Examples:
• Checkers/Chess Learning Problem:
o Task T: playing checkers
o Performance measure P: percentage of games won against opponent
o Training experience E: playing practice games against itself
• Handwriting Recognition:
o T: recognizing hand-written words
o P: percentage of words correctly classified
o E: A dataset of labelled handwritten words
• Self-Driving Robot:
o T: Driving on highways using vision sensors
o P: Average distance travelled before an error
o E: A sequence of images and steering commands recorded while observing a
human driver
• To filter emails as spam or not
o T: Classifying emails as spam or not
o P: The fraction of emails accurately classified as spam or not spam
o E: dataset of emails labelled as spam and not spam
• Fruit Prediction Problem
o T: forecasting different fruits for recognition
o P: able to predict maximum variety of fruits
o E: training machine with the largest datasets of fruits images
• Face Recognition Problem
o T: predicting different types of faces
o P: able to predict maximum types of faces
o E: datasets of different face images
• Automatic Translation of documents
o T: translating one type of language used in a document to other language
o P: able to convert one language to other efficiently
o E: dataset of different types of languages
Types of Machine Learning
Criteria | Supervised ML | Unsupervised ML | Reinforcement ML
Definition | Learns by using labelled data | Trained using unlabelled data without any guidance | Works by interacting with the environment
Type of data | Labelled data | Unlabelled data | No predefined data
Type of problems | Regression and Classification | Association and Clustering | Exploitation or Exploration
Supervision | Extra supervision | No supervision | No supervision
Algorithms | Linear Regression, Logistic Regression, SVM, KNN etc. | K-Means, C-Means, Apriori, Hierarchical Clustering | Q-Learning, SARSA
Aim | Calculate outcomes | Discover underlying patterns | Learn a series of actions
Application | Risk Evaluation, Forecast Sales | Recommendation System, Anomaly Detection | Self-Driving Cars, Gaming, Robotics
(Pictorial representations of the three paradigms are not reproduced here.)
Reinforcement ML
• Reinforcement is the process of learning from rewards while performing a series of
actions.
• In reinforcement learning, we do not tell the learner or agent, for example, a
robot, which action to take but merely assign a reward to each action and/or the
overall outcome.
• Instead of having a correct/false label for each step, the learner must discover or
learn a behaviour that maximizes the reward for a series of actions.
• Typical applications of reinforcement learning involve playing games and some
form of robots, e.g., drones, warehouse robots, and more recently self-driving cars.
Version Space Learning
• The key idea in version space learning is that specialization of the general models
and generalization of the specific models may ultimately lead to just one correct
model that matches all observed positive examples and does not match any
negative examples.
Logistic Regression
• Logistic regression is a method used to predict a dependent variable, given a set of
independent variables, such that the dependent variable is categorical.
• Logistic regression produces results in a binary format which is used to predict the
value of a categorical dependent variable. Hence, the outcome of logistic regression
is discrete/categorical like (Yes, No) or (0, 1), (True, False), (High, Low) etc
• It is a predictive analysis algorithm which works on the concept of probability.
• Logistic regression uses the sigmoid/logistic function to map predictions to
probabilities. This sigmoid function is used to model the data in logistic regression.
• The sigmoid function: y = f(x) = 1 / (1 + e^(−x))
• where f(x) lies between 0 and 1 and x is the input to the function.
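A brief sketch of the sigmoid and of fitting a logistic regression classifier with scikit-learn (the toy data below is assumed for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(x):
    """Logistic function: maps any real input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))   # 0.5

# Assumed toy data: hours studied -> pass (1) / fail (0)
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)
print(model.predict([[1.2], [3.2]]))   # discrete/categorical output, e.g. [0 1]
print(model.predict_proba([[3.2]]))    # underlying probabilities
```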
Naïve Bayes
• The Naive Bayes classifier works on the principle of conditional probability, as given
by the Bayes theorem.
• Naive Bayes algorithm falls under Supervised - classification.
• Applications:
o Face Recognition: As a classifier, it is used to identify faces and facial
features, like the nose, mouth, eyes, etc.
o Weather Prediction: It can be used to predict if the weather will be good or
bad.
o Medical Diagnosis: Doctors can diagnose patients by using the information
that the classifier provides. Healthcare professionals can use Naive Bayes to
indicate if a patient is at high risk for certain diseases and conditions, such as
heart disease, cancer, and other ailments.
o News Classification: With the help of a Naive Bayes classifier, Google News
recognizes whether the news is political, world news, and so on.
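A minimal Gaussian Naive Bayes sketch with scikit-learn (the numeric toy data is assumed for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Assumed toy features: two measurements per sample, two classes
X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 3.9], [1.1, 2.0], [3.9, 4.2]])
y = np.array([0, 0, 1, 1, 0, 1])

# Gaussian Naive Bayes applies Bayes' theorem with a conditional-independence
# assumption between the features, given the class
clf = GaussianNB().fit(X, y)
print(clf.predict([[1.0, 2.0], [4.0, 4.0]]))   # e.g. [0 1]
print(clf.predict_proba([[1.0, 2.0]]))         # posterior class probabilities
```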
Bayesian Network
• Popularly known as Belief Networks, Bayesian Networks are used to model
uncertainties by using Directed Acyclic Graphs (DAG).
• A Directed Acyclic Graph is used to represent a Bayesian Network and like any
other statistical graph, a DAG contains a set of nodes and links, where the links
denote the relationship between the nodes.
• A DAG models the uncertainty of an event occurring based on the Conditional
Probability Distribution (CPD) of each random variable. A Conditional Probability
Table (CPT) is used to represent the CPD of each variable in the network.
• Applications:
o Disease Diagnosis: Bayesian Networks are commonly used in the field of
medicine for the detection and prevention of diseases. They can be used to
model the possible symptoms and predict whether or not a person is diseased.
o Optimized Web Search: Bayesian Networks are used to improve search
accuracy by understanding the intent of a search and providing the most
relevant search results. They can effectively map users’ intent to the relevant
content and deliver the search results.
o Spam Filtering: Bayesian models have been used in the Gmail spam filtering
algorithm for years now. They can effectively classify documents by
understanding the contextual meaning of a mail. They are also used in other
document classification applications.
o Gene Regulatory Networks: GRNs are a network of genes that are comprised
of many DNA segments. They are effectively used to communicate with other
segments of a cell either directly or indirectly. Mathematical models such as
Bayesian Networks are used to model such cell behavior in order to form
predictions.
o Biomonitoring: Bayesian Networks play an important role in monitoring the
quantity of chemical doses used in pharmaceutical drugs.
Support Vector Machines
• Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.
• However, primarily, it is used for Classification problems in Machine Learning.
• An SVM model is a representation of the examples as points in space, mapped so
that the examples of the separate categories are divided by a hyperplane that is as
far as possible from each category.
• New examples are then mapped into that same space and predicted to belong to a
category based on which side of the hyperplane they fall.
• The data points or vectors that are closest to the hyperplane and which affect
the position of the hyperplane are termed support vectors, since these vectors
“support” the hyperplane.
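A minimal SVM sketch with scikit-learn (the 2-D toy points are assumed for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Assumed toy 2-D points: two linearly separable classes
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear-kernel SVM finds the maximum-margin hyperplane between the classes
clf = SVC(kernel="linear").fit(X, y)

print(clf.support_vectors_)            # the points closest to the hyperplane
print(clf.predict([[3, 2], [7, 6]]))   # side of the hyperplane decides the class
```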
Why Clustering?
• It helps you to glance through the data to pull out some patterns or structures
before going deeper into analysing the data for specific findings.
• Organising data into clusters helps in identifying the underlying structure in the
data and finds applications across industries.
• For example, clustering could be used to classify disease in the field of medical
science, and can also be used in customer classification in marketing research
Applications of Clustering
• Marketing : It can be used to characterize & discover customer segments for
marketing purposes. Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing programs.
• Biology : It can be used for classification among different species of plants and
animals.
• Libraries : It is used in clustering different books on the basis of topics and
information.
• Insurance : It is used to segment customers and their policies and to identify
fraud, e.g., identifying groups of motor insurance policy holders with a high
average claim cost.
• City Planning: It is used to make groups of houses and to study their values based
on their geographical locations and other factors present.
• Earthquake studies: By learning the earthquake-affected areas we can determine
the dangerous zones.
• Apart from these general usages, it is used by Amazon in its recommendation
system to provide recommendations based on a user’s past product searches.
• Netflix also uses this technique to recommend movies and web series to its
users based on their watch history.
K-means Clustering
The algorithm can be stated as follows.
• First it selects k number of objects at random from the set of n objects. These k
objects are treated as the centroids or center of gravities of k clusters.
• Each of the remaining objects is assigned to the closest centroid. The collection
of objects assigned to a centroid is called a cluster.
• Next, the centroid of each cluster is then updated (by calculating the mean values
of attributes of each object).
• The assignment and update procedure is repeated until it reaches some stopping
criterion (e.g., a maximum number of iterations, centroids remain unchanged, or no
reassignments occur).
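A minimal K-means sketch with scikit-learn (the sample 2-D points are assumed):

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed 2-D points forming two loose groups
X = np.array([[1, 2], [1, 4], [2, 3], [8, 8], [9, 10], [10, 9]])

# k initial centroids are chosen, then assignment and centroid update alternate
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)            # cluster index assigned to each point
print(kmeans.cluster_centers_)   # final centroids (mean of each cluster)
```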
Advantages:
• The algorithm is simple, easy to understand and implement.
• It is also efficient. The time taken by K-means rises linearly with the
number of data points.
Disadvantages
• The user needs to specify an initial value of k
• The process of finding the clusters may not converge.
Choosing the value of ‘k’ in K-means
• The performance of the K-means clustering algorithm depends on the quality of
the clusters that it forms.
• But choosing the optimal number of clusters is a big task. We use the Elbow Method
to find the optimal number of clusters.
Elbow Method
• We use within-sum-of-squares (WSS) as a measure to find the optimum number of
clusters that can be formed for a given data set.
• WSS is defined as the sum of the squared distance between each member of the
cluster and its centroid.
WSS = Σ_{i=1..m} (x_i − c_i)², where c_i is the centroid of the cluster to which point x_i is assigned.
• The WSS is measured for each value of K. WSS keeps decreasing as K increases, so
we look for the value of K beyond which the decrease slows down sharply.
• Now, we draw a curve between WSS and the number of clusters.
• The sharp point of bend, where the plot looks like an arm, is considered as the
best value of K.
• Since the bend in the graph looks like an elbow, this is known as the elbow method.
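A sketch of the elbow computation with scikit-learn, where inertia_ is the library's name for the WSS (the synthetic blob data is assumed):

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed data: three well-separated blobs of 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in ([0, 0], [5, 5], [9, 0])])

# inertia_ = within-cluster sum of squares (WSS) for the fitted clustering
for k in range(1, 8):
    wss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(wss, 1))   # plot k vs. WSS and look for the "elbow" (here k = 3)
```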
Hierarchical Clustering
• Hierarchical clustering algorithms fall into the following two categories.
o Agglomerative hierarchical algorithms (bottom-up approach) − In
agglomerative hierarchical algorithms, each data point is treated as a single
cluster and pairs of clusters are successively merged (agglomerated). The
hierarchy of the clusters is represented as a dendrogram or tree structure.
o Divisive hierarchical algorithms (Top-down approach) − On the other hand,
in divisive hierarchical algorithms, all the data points are treated as one big
cluster and the process of clustering involves dividing the one big cluster into
various small clusters.
• Advantages:
o It is easy to understand and implement.
o We don’t have to pre-specify any particular number of clusters.
o Can obtain any desired number of clusters by cutting the Dendrogram at the
proper level.
o They may correspond to meaningful classification.
o Easy to decide the number of clusters by merely looking at the Dendrogram.
• Disadvantages:
o Hierarchical Clustering does not work well on vast amounts of data.
o Once a decision is made to combine two clusters, it can not be undone.
• How the distance between two clusters is measured is crucial for hierarchical
clustering.
• There are various ways to calculate the distance between two clusters, and these
ways decide the rule for clustering.
o Single Linkage: It is the shortest distance between the closest points of the
two clusters.
o Complete Linkage: It is the farthest distance between the two points of two
different clusters. It is one of the popular linkage methods as it forms tighter
clusters than single-linkage.
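A small agglomerative-clustering sketch with SciPy, showing the single- and complete-linkage options (the points are assumed for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Assumed 2-D points
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# 'single' = shortest distance between closest points of two clusters,
# 'complete' = farthest distance between points of two clusters
Z = linkage(X, method="single")                   # try method="complete" for tighter clusters
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters

print(labels)
```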
• Here, x1 to xm are the input to the neuron. Y is the output of the neuron. Each
input link is weighted, w1 to wm. ‘b’ is the bias.
• The Net Input is calculated as Yin = b + x1w1 + x2w2 + … + xmwm
• The output Y is calculated as Y = f(Yin) where f is the activation function.
Architecture of an ANN
• Input Layer:
o As the name suggests, it accepts inputs in several different formats provided
by the programmer.
• Hidden Layer:
o The hidden layer sits in between the input and output layers. It performs all
the calculations to find hidden features and patterns.
• Output Layer:
o The input goes through a series of transformations using the hidden layer,
which finally results in output that is conveyed using this layer.
o The weighted total is passed as an input to an activation function to
produce the output. Activation functions decide whether a node should fire
or not. Only the nodes that fire contribute to the output layer. There are
distinctive activation functions available that can be applied depending on the
sort of task we are performing
Hin_j = b_j + Σ_{i=1..n} x_i w_ij
where, 𝒃𝒋 is the bias on hidden unit 𝑯𝒋 and 𝒘𝒊𝒋 is the weight on the link coming
from input unit 𝑿𝒊 to 𝑯𝒋
Now calculate the net output by applying the following activation function:
𝐻𝑗 = 𝑓(𝐻𝑖𝑛𝑗 )
Send these output signals of the hidden layer units to the output layer units.
3. Calculate the net input 𝒀𝒊𝒏𝒌 at each output unit 𝒀𝒌 using the following relation
for all 𝟏 ≤ 𝒌 ≤ 𝒎
Yin_k = b_k + Σ_{j=1..p} H_j w_jk
where, 𝒃𝒌 is the bias on output unit 𝒀𝒌 and 𝒘𝒋𝒌 is the weight on the link coming
from hidden unit 𝑯𝒋 to 𝒀𝒌
Now calculate the net output by applying the following activation function:
𝑌𝑘 = 𝑓(𝑌𝑖𝑛𝑘 )
Phase 2: Back Propagation of error
1. Compute the error correcting term, in correspondence with the target pattern
received at each output unit, as follows
𝛿𝑘 = (𝑡𝑘 − 𝑌𝑘 ) 𝑓′(𝑌𝑖𝑛𝑘 )
On this basis, update the deltas of weight and bias of output layer as follows:
𝛥𝑤𝑗𝑘 = 𝛼 𝛿𝑘 𝐻𝑗
𝛥𝑏𝑘 = 𝛼 𝛿𝑘
Then, send 𝜹𝒌 back to the hidden layer.
2. Now each hidden unit sums the delta inputs it receives from the output units:
δin_j = Σ_{k=1..m} δ_k w_jk
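A compact NumPy sketch of the steps above for a single training pattern with sigmoid activation. The layer sizes and data are assumed, and the hidden-layer weight update (not shown in these notes) would follow the same Δw = α · δ · input pattern:

```python
import numpy as np

def f(x):            # sigmoid activation
    return 1.0 / (1.0 + np.exp(-x))

def f_prime(x):      # derivative of the sigmoid
    s = f(x)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
n, p, m = 3, 4, 2                      # input, hidden, output units (assumed sizes)
x = rng.normal(size=n)                 # one input pattern (assumed)
t = np.array([0.0, 1.0])               # its target pattern (assumed)
W_ih, b_h = rng.normal(size=(n, p)), np.zeros(p)   # input -> hidden weights, biases
W_ho, b_o = rng.normal(size=(p, m)), np.zeros(m)   # hidden -> output weights, biases
alpha = 0.1                            # learning rate

# Phase 1: feed forward
H_in = b_h + x @ W_ih                  # Hin_j = b_j + sum_i x_i w_ij
H = f(H_in)
Y_in = b_o + H @ W_ho                  # Yin_k = b_k + sum_j H_j w_jk
Y = f(Y_in)

# Phase 2: back propagation of error
delta_k = (t - Y) * f_prime(Y_in)      # error-correcting term at each output unit
dW_ho = alpha * np.outer(H, delta_k)   # delta of weights: Δw_jk = α δ_k H_j
db_o = alpha * delta_k                 # delta of biases:  Δb_k  = α δ_k
delta_in_j = W_ho @ delta_k            # δin_j = sum_k δ_k w_jk, sent back to the hidden layer

print(Y, delta_k, delta_in_j)
```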
K-Nearest Neighbours
• K-nearest neighbours (KNN) algorithm is a type of supervised ML algorithm which
can be used for both classification as well as regression predictive problems.
• KNN is a lazy learning algorithm because it does not have a specialized training
phase; it uses all of the training data at classification time.
• KNN is also a non-parametric learning algorithm because it doesn’t assume
anything about the underlying data.
• It requires three inputs:
o The training data
o A distance metric like Euclidean or Manhattan to compute the distance
between the test data and each individual training data
o The value of k, the number of neighbours to select/retrieve.
• To classify an unknown test data:
o It computes the distance of the test data to other training records
o Then, it identifies the k nearest neighbours
o Use class labels of nearest neighbours to determine the class label of an
unknown record (Ex: by taking the majority vote)
• Ideally, the value of k must not be a multiple of the number of class labels.
Otherwise, it may cause ties. (Ex: if number of classes in training data is 3, then k
should not be 3, 6, 9 etc)
• Popular distance metrics: Euclidean distance, Manhattan distance
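A minimal sketch of the KNN classifier described above, using scikit-learn with Euclidean distance (all data values are assumed for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Assumed training data: 2-D points with two class labels
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# k = 5 neighbours, Euclidean distance (the default metric)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Majority vote among the 5 nearest training points decides the class
print(knn.predict([[2, 2], [6, 5]]))   # e.g. [0 1]
```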
Bias and Variance
o Bias
▪ When the bias is high, the assumptions made by our model are too basic, and
the model can’t capture the important features of our data.
▪ This means that our model hasn’t captured patterns in the training data
and hence cannot perform well on the testing data either.
▪ This situation, where the model cannot find patterns in our training set
and hence fails for both train and test data, is called Underfitting.
o Variance
▪ We can define variance as the model’s sensitivity to fluctuations in the
data. Our model may learn from noise. This will cause our model to
consider trivial features as important.
▪ Variance is the change in prediction accuracy of ML model between
training data and test data.
▪ This simply means that if an ML model is predicting with an accuracy of
"x" on training data and its prediction accuracy on test data is "y" then
variance = x - y
▪ If the variance is high, our model will capture all the features of the
train data given to it, including the noise, will tune itself to the data,
and predict it very well, but when given new test data, it cannot predict
on it as it is too specific to the training data.
▪ Hence, our model will perform really well on training data and get high
accuracy but will fail to perform on new, test data. New data may not
have the exact same features and the model thus won’t be able to
predict it very well. This is called Overfitting.
Bagging
• Bagging is used when our objective is to reduce the variance of a decision tree.
Here the concept is to create a few subsets of data from the training sample,
which is chosen randomly with replacement.
• Now each collection of subset data is used to prepare their decision trees, thus,
we end up with an ensemble of various models.
• The average of the predictions from the numerous trees is used, which is more
powerful than a single decision tree.
• Random Forest is an extension of bagging. It takes one additional step: in addition
to building each tree on a random subset of the data, it also makes a random
selection of features rather than using all features to develop the trees. When we
have numerous random trees, it is called a Random Forest.
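A minimal bagging sketch with scikit-learn (the synthetic data is assumed; the default base learner of BaggingClassifier is a decision tree):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Assumed synthetic data for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Each base learner (a decision tree by default) is trained on a bootstrap
# sample, i.e. a random subset of the training data drawn with replacement
bagging = BaggingClassifier(n_estimators=50, random_state=0).fit(X, y)
print(bagging.score(X, y))   # training accuracy of the averaged ensemble
```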
Stacking
• It is an ensemble method that seeks a diverse group of members by varying the
model types fit on the training data and using a model to combine predictions.
Boosting
• The key property of boosting ensembles is the idea of correcting prediction
errors. The models are fit and added to the ensemble sequentially such that the
second model attempts to correct the predictions of the first model, the third
corrects the second model, and so on.
• It is a method that in general decreases the bias error and builds strong predictive
models. The term ‘Boosting’ refers to a family of algorithms which converts a weak
learner to a strong learner.
• Ex: Ada Boost algorithm: It commences by training decision trees. Every
observation during this procedure has an equal weight assigned to it.
• After analysing the first tree, we raise the weights of every observation that we
find complicated to classify. On the other hand, we decrease the weights for the
ones in which classification is not an issue.
• Therefore, we will notice the second tree growing on the weighted data. The
original idea for this is to make improvements upon the first tree’s predictions.
• Therefore, the last ensemble model’s predictions will be the overall weighted
predictions provided by former tree models.
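A minimal AdaBoost sketch with scikit-learn (the synthetic data is assumed for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Assumed synthetic data for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# AdaBoost fits weak learners sequentially; observations misclassified by a
# previous learner receive higher weights, so later trees focus on the hard cases
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(boost.score(X, y))
```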
Random Forest
• Random Forest is a popular machine learning algorithm that belongs to the
supervised learning technique. It can be used for both Classification and Regression
problems in ML.
• It is based on the concept of ensemble learning, which is a process of combining
multiple classifiers to solve a complex problem and to improve the performance of
the model.
• Random Forest is a classifier that contains a number of decision trees on various
subsets of the given dataset and takes the average to improve the predictive
accuracy of that dataset.
• Instead of relying on one decision tree, the random forest takes the prediction
from each tree and, based on the majority vote of those predictions, it predicts the
final output.
• Random forest algorithm combines multiple decision-trees, resulting in a forest of
trees, hence the name Random Forest.
• One problem that might occur with one big (deep) single DT is that it can overfit.
• The point of RF is to prevent overfitting. It does this by creating random subsets of
the features and building smaller (shallow) trees using the subsets and then it
combines the subtrees.
• The greater number of trees in the forest leads to higher accuracy and prevents
the problem of overfitting.
Working of a Random Forest
• Random Forest works in two phases: the first is to create the random forest by
combining N decision trees, and the second is to make predictions with each tree
created in the first phase.
First Phase
1. Randomly select k records from a total of m records where k < m.
2. Among the k records, calculate the node D using the best split point.
3. Split the node into daughter nodes using the best split.
4. Repeat 1 to 3 steps until leaf nodes have been finalized.
5. Build forest by repeating steps 1 to 4 for 𝓃 number of times to create 𝓃 number of
trees.
Second Phase
• In the second stage, we make predictions using the trained random forest
algorithm.
• We take the test features and use the rules of each randomly created decision
tree to predict the outcome and store the predicted outcome.
• Then, we calculate the votes for each predicted target.
• Finally, we consider the highest voted predicted target as the final prediction from
the random forest algorithm.
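A minimal random forest sketch with scikit-learn (the synthetic data is assumed for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumed synthetic data for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Each tree is grown on a bootstrap sample and a random subset of the features;
# the forest's prediction is the majority vote of the individual trees
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X, y)
print(forest.predict(X[:5]))   # majority-vote predictions for the first 5 samples
print(forest.score(X, y))
```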
Applications of Random Forests
• Banking: Banking sector mostly uses this RF for the identification of loan risk.
• Medicine: Disease trends and risks of the disease can be identified using RF
• Land Use: We can identify the areas of similar land use
• Marketing: Marketing trends can be identified
• Health care: Identifying the treatment/ medicine based on history and symptoms.
• E-Commerce: recommended products and customer experience
• Stock Market: Predicting stock market portfolio performance.
Advantages of Random Forests
• Solves the Decision tree overfit problem. RFs with enough trees won’t overfit
• Random Forest can be used for both classification and regression problems.
• Random forests are extremely flexible and have very high accuracy.
• RFs maintains accuracy even when a large proportion of the data are missing.
Disadvantages of Random Forests
• RFs are more complex, harder, and more time-consuming to construct than decision trees
• A large number of trees can make the algorithm too slow and ineffective for real-time
predictions.
Key differences between RFs and DTs
• A Random Forest is a set of multiple decision-trees.
• Decision-trees are computationally faster as compared to random forests.
• Deep decision-trees may suffer from overfitting. Random forest prevents
overfitting by creating trees on random subsets of the data and features.
• Random forest is difficult to interpret. But a decision-tree is easily interpretable
and can be converted to rules.