PM Notes
Kernel functions are a cornerstone in machine learning and statistical analysis, used primarily to
handle non-linear data distributions. They enable algorithms to operate in a higher-dimensional
feature space without explicitly calculating the transformation. This approach is known as the
"kernel trick."
A kernel function is a mathematical function that computes the similarity between two data points
in a transformed feature space. Mathematically, it is represented as K(x, y) = φ(x) · φ(y), where φ is the (implicit) mapping into the higher-dimensional feature space.
1. Handling Non-Linear Data:
o Kernel functions transform non-linearly separable data into higher dimensions, where linear patterns might emerge.
2. Efficient Computation:
3. Distribution Analysis:
1. Linear Kernel:
2. Polynomial Kernel:
3. Gaussian (RBF) Kernel:
4. Sigmoid Kernel:
o Example: A Gaussian kernel in KDE smooths data points to create a continuous PDF.
3. Clustering:
4. Dimensionality Reduction:
Efficiency: Through the kernel trick, computations in high-dimensional spaces are simplified.
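For illustration, a minimal Python sketch of the kernel trick (the feature map, sample points, and helper names below are chosen arbitrarily): a degree-2 polynomial kernel reproduces the inner product of an explicit feature map without ever constructing that map.

import numpy as np

def explicit_phi(x):
    # Explicit degree-2 polynomial feature map for a 2-D input (x1, x2):
    # phi(x) = [x1^2, x2^2, sqrt(2)*x1*x2]
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def poly_kernel(x, y):
    # Homogeneous polynomial kernel of degree 2: K(x, y) = (x . y)^2
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

# Both quantities are identical, but the kernel never builds phi explicitly.
print(np.dot(explicit_phi(x), explicit_phi(y)))  # 16.0
print(poly_kernel(x, y))                         # 16.0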
Feature ordering involves ranking or prioritizing features (variables) in a dataset based on their
relevance or importance to the clustering or classification task. The goal is to identify features that
contribute significantly to grouping data points and eliminate or down-weight those that add noise
or redundancy.
o Methods include:
2. Rank Features:
o Choose the top k features to use in clustering. k is determined based on domain knowledge or performance evaluation (e.g., through cross-validation).
Improved Clustering Quality: Eliminates irrelevant features that can distort distance metrics.
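As a small sketch of filter-style feature ordering (the relevance score here is simply the per-feature variance, an assumption for illustration; mutual information or any of the criteria above could be substituted), ranking features and keeping the top k might look like this:

import numpy as np

def order_features_by_variance(X, k):
    """Rank features by variance (a simple filter criterion) and keep the top k."""
    scores = X.var(axis=0)              # one relevance score per feature
    ranking = np.argsort(scores)[::-1]  # feature indices, best first
    selected = ranking[:k]
    return X[:, selected], selected

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X[:, 3] *= 5.0                          # make one feature clearly dominant
X_reduced, kept = order_features_by_variance(X, k=3)
print("kept feature indices:", kept)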
What is Clustering?
Clustering is an unsupervised learning technique used to group data points into clusters based on
their similarity. The quality of clustering largely depends on the features used. Irrelevant or
redundant features can:
Feature ordering can significantly impact clustering algorithms by ensuring that only the most
relevant features contribute to cluster formation.
1. Filter Methods:
2. Wrapper Methods:
3. Embedded Methods:
2. Improved Scalability:
3. Elimination of Redundancy:
Applications
Customer Segmentation: In marketing, feature ordering ensures clustering focuses on
relevant customer attributes.
Image Segmentation: Prioritizes features like texture, color, or intensity for effective
clustering.
1. Histogram Estimation
A histogram is one of the simplest ways to estimate the distribution of a dataset. It divides the data
into intervals (bins) and counts the number of data points in each bin.
3. Count Frequency:
4. Visualize:
Adaptability in Histograms:
Dynamic Binning: Adjust bin widths based on data density. Narrower bins in dense areas
capture details; wider bins in sparse areas reduce noise.
Adaptive Bin Placement: Use techniques like Sturges' Rule, Scott's Rule, or the Freedman-
Diaconis Rule to determine bin sizes dynamically.
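For example, NumPy's histogram function implements these rules directly; the snippet below (with an arbitrary bimodal sample) compares the bin counts each rule chooses:

import numpy as np

rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(0, 1, 500), rng.normal(6, 0.5, 200)])

# Compare bin counts chosen by the three rules mentioned above.
for rule in ("sturges", "scott", "fd"):   # 'fd' = Freedman-Diaconis
    counts, edges = np.histogram(data, bins=rule)
    print(f"{rule:8s}: {len(counts)} bins, bin width ~ {edges[1] - edges[0]:.3f}")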
Computational Efficiency:
o They require only one pass through the data to compute frequencies.
Window-based methods are a class of non-parametric density estimation techniques that use a
sliding "window" or kernel function to compute the density around each point.
How It Works:
o Smaller bandwidth h in dense regions for finer detail, larger h in sparse regions to reduce
noise.
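A minimal sketch of a window (Parzen) estimate with a Gaussian kernel, using a fixed bandwidth h for simplicity (the sample and evaluation grid below are illustrative):

import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def parzen_density(x_grid, samples, h):
    """Window (Parzen) estimate: average of kernels centred on each sample."""
    # Shape (len(x_grid), len(samples)): kernel evaluated for every pair.
    u = (x_grid[:, None] - samples[None, :]) / h
    return gaussian_kernel(u).mean(axis=1) / h

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.0, size=300)
x_grid = np.linspace(-2, 6, 200)
density = parzen_density(x_grid, samples, h=0.4)
print(density.max())   # peak of the estimated PDF, roughly near x = 2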
Computational Efficiency:
Window methods can be computationally intensive for large datasets due to summation
over all data points.
Applications:
Non-parametric regression.
Aspect: Histogram Estimation vs. Window (Kernel) Estimation
Accuracy: histograms are limited (discontinuous representation); kernel estimation is high (smooth and continuous).
Sensitivity to Binning: histograms are highly sensitive (bins significantly impact results); kernel estimation is less sensitive (the kernel smooths the data).
What is Entropy?
Entropy measures the uncertainty (or average information content) of a random variable; for a discrete variable X it is H(X) = -Σ p(x) log p(x), so higher entropy means greater uncertainty.
The goal is to select features that minimize uncertainty about the target variable, ensuring that the
selected features are both informative and non-redundant.
1. Mutual Information:
2. Joint Entropy:
o Avoids selecting features that are highly correlated or redundant by evaluating the
combined entropy of feature subsets.
4. Select a subset of features that minimizes joint entropy while maximizing mutual
information.
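As a sketch of this selection step (assuming scikit-learn is available; the synthetic data is illustrative), the snippet below scores features by mutual information with the class label and keeps the top three; joint-entropy/redundancy checks would be layered on top of this ranking:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Toy data: 10 features, only 3 of which carry information about the label.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=2, random_state=0)

mi = mutual_info_classif(X, y, random_state=0)   # mutual information per feature
ranking = np.argsort(mi)[::-1]
top_k = ranking[:3]
print("MI scores:", np.round(mi, 3))
print("selected feature indices:", top_k)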
Advantages:
Computational Efficiency: Focuses on statistical properties, reducing the need for iterative
modeling.
Orthogonal expansion is a mathematical technique that transforms features into an orthogonal basis,
ensuring that the selected features are uncorrelated. This approach is particularly useful in high-
dimensional spaces where redundancy among features is common.
1. Orthogonality:
o Two features X1 and X2 are orthogonal if X1 ⋅ X2 = 0 (their inner product is zero).
1. Decompose Features:
2. Select Components:
o Map the selected components to the original feature space for interpretability.
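A minimal PCA-based sketch of these steps (assuming scikit-learn; the Iris data is used only as a convenient example): the decomposition yields orthogonal components, the top components are retained, and the loadings relate them back to the original features:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # 4 correlated measurements

pca = PCA(n_components=2)                 # 1. decompose into orthogonal components
Z = pca.fit_transform(X)                  # 2. keep the top components (most variance)

print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
# Loadings relate each component back to the original features (interpretability).
print("component loadings:\n", np.round(pca.components_, 3))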
Advantages:
Applications:
Computational Complexity: lower for individual feature evaluations (entropy-based criteria); higher for orthogonal expansion due to matrix decomposition, but efficient for large correlated datasets.
5. Criteria for selecting between the K-Means and Fuzzy C-Means (FCM) algorithms for partitional clustering (detailed explanation)
Both K-Means and Fuzzy C-Means (FCM) are partition-based clustering algorithms that divide a
dataset into k clusters. However, their suitability depends on the nature of the data and the
specific requirements of the clustering task. Here's a detailed explanation of the criteria for choosing
between these two algorithms.
K-Means:
Assigns each data point to a single cluster based on proximity to the cluster centroid.
Objective: Minimize the sum of squared distances between data points and their assigned
cluster centroids.
Fuzzy C-Means (FCM):
Assigns each data point a membership value indicating its degree of belonging to each
cluster.
Objective: Minimize a weighted sum of squared distances, where weights are derived from
membership degrees.
Soft clustering: A data point can belong to multiple clusters with varying degrees.
A. Nature of Data
1. Distinct Clusters:
o K-Means works well with compact, spherical clusters and when there is no overlap.
2. Overlapping Clusters:
o Use FCM: When clusters overlap and there are no clear boundaries.
o FCM assigns partial membership, which is more realistic for overlapping data
distributions.
1. Hard Clustering:
o Use K-Means: If the goal is to assign each data point to a single cluster.
2. Soft Clustering:
o Use FCM: When a data point can belong to multiple clusters with varying degrees.
D. Computational Efficiency
1. K-Means:
2. FCM:
1. K-Means:
o Sensitive to noise and outliers, as it uses squared distances that amplify the effect of
extreme points.
2. FCM:
o More robust to noise and outliers because membership degrees dilute the impact of
any single point.
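The contrast between hard and soft assignment can be sketched directly. In the illustration below the centroids are assumed fixed, and the FCM memberships follow the standard update u_ij = 1 / Σ_k (d_ij / d_ik)^(2/(m-1)) with fuzzifier m = 2:

import numpy as np

def hard_assign(X, centroids):
    """K-Means style: each point goes to its single nearest centroid."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def fcm_memberships(X, centroids, m=2.0):
    """FCM style: degree of belonging of each point to every cluster."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
print(hard_assign(X, centroids))                    # [0 0 1]: one cluster per point
print(np.round(fcm_memberships(X, centroids), 3))   # point [1, 1] is partly shared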
3. Comparison Table
Criterion: K-Means vs. Fuzzy C-Means (FCM)
Computational Efficiency: K-Means is high (faster); FCM is lower (slower).
Orthogonal expansion involves decomposing a dataset into a set of orthogonal components. These
components are linearly independent and do not overlap in the information they represent.
Common methods for achieving orthogonal expansion include:
1. Principal Component Analysis (PCA):
o Transforms data into principal components that capture the maximum variance.
2. Gram-Schmidt Process:
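A minimal sketch of the classical Gram-Schmidt process applied to feature columns (the random data below is purely illustrative); the resulting columns are orthonormal:

import numpy as np

def gram_schmidt(X):
    """Orthonormalise the columns of X (classical Gram-Schmidt)."""
    Q = np.zeros_like(X, dtype=float)
    for j in range(X.shape[1]):
        v = X[:, j].astype(float)
        for i in range(j):
            v = v - np.dot(Q[:, i], X[:, j]) * Q[:, i]   # remove projections onto earlier columns
        Q[:, j] = v / np.linalg.norm(v)
    return Q

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))
Q = gram_schmidt(X)
print(np.round(Q.T @ Q, 6))    # identity matrix: the new columns are orthogonal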
In feature selection, orthogonal expansion ensures that selected features are independent,
improving the performance of downstream models.
Benefits of Orthogonal Expansion
1. Reduction of Redundancy:
1. Loss of Interpretability:
o Transformed features (e.g., principal components) may not have direct real-world
meanings, making it harder to interpret results.
o The process may remove features with low variance, even if they are important for
specific tasks (e.g., minority class detection).
5. Risk of Over-Simplification:
o In reducing dimensions, critical subtleties in the data might be lost, especially if too
few components are retained.
1. Image Processing:
o Orthogonal expansion (e.g., using PCA) reduces the number of pixels while
preserving the most critical visual information.
2. Text Mining:
1. Curse of Dimensionality:
o These can lead to overfitting, reduced generalization, and poor model performance.
o Processing and storing high-dimensional data require more time and resources.
o Many features may contribute little to the predictive power and may even introduce
noise.
Binary feature selection treats the inclusion of each feature as a binary decision:
Selected Features (1): Retained for the model because they are relevant and contribute to
prediction.
Discarded Features (0): Eliminated because they are irrelevant, redundant, or noisy.
1. Filter Methods:
o Features are evaluated individually based on statistical measures.
2. Wrapper Methods:
3. Embedded Methods:
o By removing irrelevant and redundant features, the model focuses only on the most
informative data.
o Lower dimensionality reduces the time and memory required for training and
inference.
3. Noise Reduction:
4. Improved Interpretability:
o With fewer features, the model becomes easier to interpret and explain.
A. Dimensionality Reduction
Models trained on fewer, relevant features are more robust and accurate.
B. Noise Elimination
By discarding noisy features, the signal-to-noise ratio improves, leading to better decision
boundaries.
C. Reduced Multicollinearity
Redundant features that are highly correlated are removed, preventing distortions in model
coefficients.
Fewer features mean smaller datasets, leading to faster data loading, storage, and
manipulation.
B. Simplified Algorithms
Reduces memory and CPU/GPU usage, enabling large-scale data analysis on limited
hardware.
9. Applications
1. Text Mining:
2. Bioinformatics:
3. Image Processing:
4. Fraud Detection:
Non-parametric decision-making refers to methods that do not make assumptions about the
underlying probability distributions of the data. Instead, these methods rely on the structure of the
data itself to form decision boundaries. Examples include:
An adaptive decision boundary function refers to a boundary that evolves based on the data. It:
1. Learns from the training data to determine the optimal separation between classes.
3. Changes dynamically with new data points (important for real-time or streaming data
scenarios).
A. Data-Driven Learning
Instead of assuming a functional form, these methods analyze the data distribution and class
labels to construct boundaries.
Boundaries adapt to regions of high density for specific classes, curving or bending as
needed.
B. Local Adjustments
o In Decision Trees, splits are determined recursively based on feature values, creating
boundaries that are specific to the data's structure.
C. Kernel Methods
Kernel functions (e.g., Gaussian, polynomial) project data into higher dimensions to make
complex decision boundaries possible.
Example: SVM with Radial Basis Function (RBF) adapts the boundary to fit intricate data
patterns in a higher-dimensional space.
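For illustration, the sketch below (assuming scikit-learn; the dataset is synthetic) fits a linear-kernel and an RBF-kernel SVM on two concentric circles, a dataset no straight line can separate; the RBF boundary adapts to the circular structure:

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: impossible to separate with a straight line.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.08, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = SVC(kernel="linear").fit(X_tr, y_tr)
rbf = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)

print("linear kernel accuracy:", round(linear.score(X_te, y_te), 3))  # near chance (~0.5)
print("RBF kernel accuracy:   ", round(rbf.score(X_te, y_te), 3))     # close to 1.0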
1. Flexibility:
2. Robustness:
3. Data-Driven Approach:
o Relies solely on data, avoiding biases introduced by incorrect assumptions about
distributions.
o Kernel-based methods, such as SVM with non-linear kernels, are effective in high-
dimensional spaces.
o Especially useful where data does not conform to standard distributions, such as
image recognition, speech processing, and medical diagnosis.
Computational Complexity: often higher for non-parametric methods; lower for parametric methods.
7. Applications
A. Medical Diagnosis:
Classifying diseases based on complex symptom patterns using KNN or kernel-based SVM.
B. Image Recognition:
C. Fraud Detection:
D. Speech Processing:
For each class, the discriminant function calculates a score that reflects how well an observation belongs to that class. The basic form of the linear discriminant function for class i can be expressed as g_i(x) = w_i^T x + w_i0, where w_i is the weight vector and w_i0 is the bias (threshold) term.
For linear discriminant functions, this expression defines a hyperplane that separates the different
classes in the feature space. For quadratic discriminant functions, the score involves quadratic
terms, leading to curved decision boundaries.
There are several types of discriminant functions, depending on the assumptions made about the distribution of the data. The most common types are the linear discriminant function (as used in Linear Discriminant Analysis, LDA) and the quadratic discriminant function (as used in Quadratic Discriminant Analysis, QDA).
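A small sketch comparing the two (assuming scikit-learn; the synthetic data is illustrative): LDA assumes a shared covariance matrix and yields linear boundaries, while QDA fits a covariance per class and yields quadratic ones:

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           random_state=0)

lda = LinearDiscriminantAnalysis()     # shared covariance -> linear boundary
qda = QuadraticDiscriminantAnalysis()  # per-class covariance -> quadratic boundary

print("LDA accuracy:", cross_val_score(lda, X, y, cv=5).mean().round(3))
print("QDA accuracy:", cross_val_score(qda, X, y, cv=5).mean().round(3))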
In the context of discriminant functions, the Mean Squared Error (MSE) can be seen as a way to
quantify how well a model performs by minimizing the error between predicted class labels and true
class labels. While MSE is commonly associated with regression tasks, it can also be used in
classification.
For Logistic Regression: MSE is not the optimal loss function, but in some cases, classification
tasks can be posed as a regression problem. For example, when applying a linear
discriminant model in a regression context (e.g., predicting class probabilities), MSE can be
minimized.
Discriminant functions are widely used in various classification tasks across different fields. Some of
their key applications include:
a) Pattern Recognition
Discriminant functions are often used in pattern recognition tasks where the goal is to classify objects
based on their features. For example, in face recognition or handwriting recognition, a discriminant
function can help differentiate between different categories (e.g., identifying a person’s face or the
character being written).
b) Medical Diagnosis
In medical fields, discriminant analysis is used to classify patients based on diagnostic data, such as
lab test results or medical images. For instance, LDA and QDA can be applied to distinguish between
healthy and diseased patients based on various biomarkers or imaging features.
c) Spam Detection
In text classification tasks, such as email spam detection, discriminant functions can be used to
classify emails as either spam or not spam based on features like word frequency, subject, sender,
and other metadata.
d) Credit Scoring
Discriminant functions are used in financial applications to classify applicants for loans or credit cards
into categories (e.g., low-risk vs. high-risk). The classification is based on financial indicators such as
income, credit history, and debt levels.
e) Speech Recognition
In speech recognition systems, discriminant functions are used to classify sounds or words based on
audio features. The goal is to map a sound or speech input to the corresponding phoneme or word.
Feature: Agglomerative Clustering vs. Divisive Clustering
Process: agglomerative merges the closest clusters until all data points are in one cluster; divisive recursively splits the clusters until each data point is in its own cluster.
Initial state: agglomerative starts with each data point as its own cluster; divisive starts with all data points in one cluster.
Sensitivity to noise: agglomerative is sensitive to noise and outliers, as early merges can affect the final clusters; divisive may be more robust to noise, as it starts with a more global view.
Conclusion
Agglomerative Hierarchical Clustering is the more widely used method, due to its simplicity,
ease of implementation, and intuitive understanding. It works well in many practical
situations and is especially suitable for small to medium-sized datasets where the number of
clusters is not known in advance.
Divisive Hierarchical Clustering can be more effective in cases where the clusters are well-
separated and distinct, but it is computationally more expensive and requires careful
handling of the splitting strategy.
Ultimately, the choice between agglomerative and divisive hierarchical clustering depends on the
nature of the dataset and the computational resources available. In most practical scenarios,
agglomerative clustering is preferred due to its lower complexity and ease of use.
11. How binary feature selection can refine the clustering process and improve classification (detailed explanation)
Feature selection is a critical step in both clustering and classification tasks because it helps improve
the efficiency, accuracy, and interpretability of the model. In many cases, datasets contain many
features, some of which may be irrelevant, redundant, or noisy. Binary feature selection refers to the
process of selecting relevant binary (0 or 1) features that contribute the most to distinguishing
between clusters or classes, which ultimately improves the performance of both clustering and
classification algorithms.
Clustering algorithms aim to group similar data points together. However, when using all available
features, especially binary ones, some features may introduce noise, leading to poor clustering
results. Binary feature selection helps by reducing the dimensionality and focusing on features that
help to more accurately define the clusters.
Noise Reduction: By selecting only the most relevant binary features, irrelevant or redundant
features are eliminated. These irrelevant features can distort the distance or similarity
measures used in clustering algorithms (e.g., K-means or hierarchical clustering). When noise
is reduced, clusters become more distinct and meaningful.
Selection Criteria: Binary feature selection can be achieved using various methods such as:
o Chi-Square Test: Tests the independence of binary features with respect to the
clusters.
o Mutual Information: Measures the dependency between binary features and cluster
labels. Features with high mutual information with cluster labels are retained.
o Correlation-based Feature Selection (CFS): Looks for features that have high
correlation with the cluster structure and low inter-correlation with each other.
In classification tasks, the goal is to assign labels (or classes) to new observations based on the
features of the training data. The effectiveness of a classifier depends on the features used to make
predictions. Binary feature selection plays a similar role in classification as it does in clustering,
improving the accuracy and efficiency of classifiers.
Reduced Overfitting: When too many features are included in the model, especially
irrelevant or noisy ones, the classifier may overfit the training data, meaning it learns
patterns that do not generalize well to unseen data. By selecting the most relevant binary
features, overfitting is reduced, and the model becomes better at generalizing to new data.
Faster Learning: Fewer features mean the classifier has fewer parameters to estimate during
the training process. This reduces the training time and improves the model's efficiency. For
example, algorithms like logistic regression or decision trees can train much faster when the
number of features is reduced.
Significant Features: Feature selection methods for binary features typically allow the
classifier to focus on the most relevant features. This means the resulting model will be
based on the features that best help distinguish between classes. For instance, in a medical
diagnosis model, features such as "has fever" or "has cough" might be selected because they
are strongly indicative of a disease category.
There are several methods to select the most relevant binary features for classification:
Chi-Square Test: Can be used to evaluate whether a binary feature is significantly related to
the class label. A high chi-square value indicates a strong relationship.
Recursive Feature Elimination (RFE): This method recursively removes the least significant
binary features, training the model at each step, until the best subset of features is found.
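A sketch of both criteria (assuming scikit-learn; the binary data and the label rule below are made up for illustration):

import numpy as np
from sklearn.feature_selection import RFE, chi2
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 6)).astype(float)             # six binary features
y = (X[:, 0].astype(bool) | X[:, 2].astype(bool)).astype(int)   # label depends only on features 0 and 2

# Filter approach: chi-square test between each binary feature and the label.
chi_scores, _ = chi2(X, y)
print("chi2 scores:", np.round(chi_scores, 2))

# Wrapper approach: recursively drop the weakest features.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print("RFE keeps features:", np.where(rfe.support_)[0])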
Consider, for example, predicting whether a customer will purchase again from binary features such as "Has visited website", "Subscribed to email list", and "Has purchased before". Without feature selection, the classifier might consider all these features, even if some (e.g., "Has
purchased before") are less relevant or redundant for the prediction. By performing binary feature
selection, you might find that "Has visited website" and "Subscribed to email list" are more
important in predicting whether a customer will purchase again. Removing the irrelevant or less
impactful features will result in a more efficient and accurate model.
In the context of hierarchical clustering, Ward's method is part of the broader family of
agglomerative clustering techniques and is often considered a more sophisticated and effective
method compared to simpler linkage-based methods like single linkage or complete linkage.
a) Agglomerative Approach
As with all hierarchical clustering methods, Ward's method starts with each individual data point as
its own cluster. The algorithm then proceeds as follows:
1. Initial Clusters: Initially, each data point is considered a cluster of its own.
2. Cluster Merging: At each step, Ward’s method calculates the increase in total within-cluster
variance that results from merging two clusters. This increase in variance is the key to how
the algorithm decides which clusters to merge.
3. Variance Minimization: The goal of Ward’s method is to merge clusters in such a way that
the increase in within-cluster variance is minimized. It does this by looking for the pair of
clusters whose merger leads to the smallest increase in variance.
4. Update Clusters: Once the clusters are merged, the algorithm updates the cluster centroid
(the mean of all points within the cluster) and continues the process of merging the next
closest clusters until all data points are in a single cluster or the desired number of clusters is
reached.
c) Centroid Updates
When two clusters are merged, the new centroid of the merged cluster is calculated as the weighted
average of the centroids of the two original clusters.
Complete linkage (also known as furthest point linkage) is another method of calculating the
distance between clusters in hierarchical clustering. Unlike Ward’s method, which minimizes the
increase in variance, complete linkage defines the distance between two clusters as the maximum distance between any pair of points, one from each cluster, and merges the pair of clusters for which this maximum distance is smallest.
Both Ward's method and complete linkage are agglomerative clustering techniques that aim to
create a hierarchy of clusters, but they differ in how they measure the distance between clusters and
how they affect the clustering results.
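Both linkages are available in SciPy's hierarchical-clustering routines; the sketch below (with an illustrative two-blob dataset plus an outlier) builds the merge tree with each method and cuts it into two clusters:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(3)
# Two well-separated blobs plus one outlier.
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2)),
               [[10.0, 10.0]]])

for method in ("ward", "complete"):
    Z = linkage(X, method=method)               # agglomerative merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, "cluster sizes:", np.bincount(labels)[1:])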
Feature: Ward's Method vs. Complete Linkage
Behavior with non-globular clusters: Ward's method performs better with globular clusters and may struggle with non-globular or irregular shapes; complete linkage is also better suited for globular clusters, but may perform less well on non-globular data due to its sensitivity to outliers.
Distance measures, also known as similarity measures or dissimilarity measures, help quantify how
similar or dissimilar two points (or features) are in a given feature space. In feature selection, these
distance measures are used to evaluate the relationship between features and the target variable
(for classification or regression tasks) or among the features themselves.
1. Euclidean Distance
Euclidean distance is the most commonly used distance measure, especially for continuous features.
It computes the straight-line distance between two points in a multi-dimensional space.
2. Manhattan Distance
Manhattan distance, also known as city-block distance or the L1 norm, is another distance metric that
sums the absolute differences between corresponding features.
3. Cosine Similarity
Cosine similarity is a measure of the cosine of the angle between two vectors. It ranges from 0
(orthogonal, no similarity) to 1 (same direction, maximum similarity). This measure is particularly
useful for sparse high-dimensional data like text data (e.g., when using TF-IDF for text classification).
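A minimal sketch computing the three measures discussed so far for a pair of feature vectors (the vectors are arbitrary):

import numpy as np

x = np.array([1.0, 0.0, 2.0, 3.0])
y = np.array([2.0, 1.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))          # straight-line (L2) distance
manhattan = np.sum(np.abs(x - y))                  # city-block (L1) distance
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))  # angle-based similarity

print(round(euclidean, 3), round(manhattan, 3), round(cosine, 3))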
4. Spearman's Rank Correlation
Spearman's rank correlation is a non-parametric measure of rank correlation. It assesses how well
the relationship between two variables can be described using a monotonic function, which means it
doesn’t require the relationship to be linear.
5. Chi-Square Test
The Chi-square test is a statistical method used to determine whether there is a significant
relationship between two categorical variables. It compares the observed frequency of occurrences
to the expected frequency if the variables were independent.
14. Distance measures and their impact on clustering and classification for high-dimensional data (detailed explanation)
In the context of high-dimensional data, the choice of distance measure is crucial for effective
clustering and classification. High-dimensional data, typically referred to as the curse of
dimensionality, presents challenges such as sparse data, increased computational complexity, and
difficulty in identifying meaningful patterns. In such cases, the distance measure used to compute
similarities or dissimilarities between data points becomes a key factor influencing the performance
of clustering and classification algorithms.
This detailed explanation will outline the various distance measures used in clustering and
classification, their impacts on high-dimensional data, and the potential challenges they introduce.
1. Euclidean Distance
Euclidean distance is the most commonly used distance metric for continuous features in both
clustering and classification tasks. It calculates the straight-line distance between two points in a
feature space.
o As the number of features increases, sparse data becomes more pronounced, and
the distance measure becomes less informative. This can negatively impact
clustering (e.g., K-means) and classification (e.g., k-NN) algorithms.
Solutions:
Dimensionality reduction techniques (like PCA) are often used to reduce the impact of high-
dimensional spaces by mapping the data to lower-dimensional spaces.
Feature selection can help by removing irrelevant features and improving the distance
measure's effectiveness.
2. Manhattan Distance
Manhattan distance, also known as city block distance, is the sum of the absolute differences
between the coordinates of two points.
Impact on Clustering and Classification for High-Dimensional Data:
Less Sensitive to Outliers: Manhattan distance is less sensitive to extreme values than
Euclidean distance, which may help reduce the negative impact of outliers in high-
dimensional spaces. However, high-dimensional data may still present challenges, especially
when most of the features are irrelevant.
o Like Euclidean distance, Manhattan distance suffers from the curse of dimensionality.
In high-dimensional spaces, the absolute differences between points become
increasingly similar, making it difficult to distinguish between data points effectively.
o Manhattan distance can provide better performance than Euclidean distance when
the data is sparse or when there is a significant difference in the scale of features.
However, it will still suffer from the curse of dimensionality if the number of features
grows too large, and feature selection or dimensionality reduction is still necessary.
Solutions:
3. Cosine Similarity
Cosine similarity measures the cosine of the angle between two vectors. It is often used for high-
dimensional sparse data (such as text data represented by word embeddings, TF-IDF, etc.).
o Cosine similarity is very effective for text data, where most features (words) are zero
for any given document. In this case, it helps capture the directional similarity
between feature vectors, disregarding their magnitude.
o For high-dimensional data such as word vectors, cosine similarity is less affected by
the curse of dimensionality compared to Euclidean or Manhattan distances, since it
focuses on the angle rather than the distance between points.
o Cosine similarity is generally more appropriate for clustering textual data or data
where the relative orientation of feature vectors matters more than their
magnitudes. It can be used with clustering methods like K-medoids or agglomerative
clustering, though it may not work well with K-means, since K-means relies on
distance (and the centroid of cosine similarity doesn't always make sense).
Solutions:
Dimensionality reduction (e.g., using SVD or PCA) can further enhance the effectiveness of
cosine similarity in high-dimensional data, by reducing irrelevant or noisy dimensions.
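As a closing sketch (assuming scikit-learn; the three toy documents are illustrative), TF-IDF vectors are reduced with truncated SVD before computing cosine similarities:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "kernel methods map data into high dimensional spaces",
    "clustering groups similar data points together",
    "kernel density estimation smooths the data distribution",
]

tfidf = TfidfVectorizer().fit_transform(docs)        # sparse, high-dimensional
reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

print(cosine_similarity(tfidf).round(2))             # similarity in the raw space
print(cosine_similarity(reduced).round(2))           # similarity after dimensionality reduction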