Unit 1: Basics of Pattern Recognition
• Definition: Pattern Recognition is the process of classifying input data into objects or
categories based on key features.
• Applications: Handwriting recognition, fingerprint identification, speech recognition,
facial detection, medical diagnosis.
• Types of Pattern Recognition:
o Statistical: Based on statistical information (e.g., Bayesian classifier).
o Syntactic: Based on grammar rules and structural relations; a pattern is described in terms of simpler primitives (e.g., an image pattern built up from pixel-level primitives).
o Neural: Uses models inspired by the human brain (e.g., neural networks).
• Learning Approaches:
o Supervised Learning: Training data with known labels is used to build models.
o Unsupervised Learning: Only input data is available (no labels); goal is to find
structure (e.g., clustering).
o Reinforcement Learning: Learns through feedback and rewards.
• System Architecture:
1. Sensing: Device captures data.
2. Preprocessing: Normalize, denoise, enhance features.
3. Feature Extraction: Extract meaningful and discriminative attributes (e.g., edge,
color, texture).
4. Classification: Assign input to the most probable class.
Unit 2: Bayesian Decision Theory
Classifiers, Discriminant Functions, Decision Surfaces
• Classifier: Algorithm that assigns an input to one of the predefined classes.
• Discriminant Function: Maps a feature vector to a value; the class with the highest
value is chosen.
o $g_i(x) > g_j(x)$ implies class $i$ is preferred over class $j$.
• Decision Surface: A boundary in feature space that separates different classes (e.g.,
line for 2D data).
Normal Density and Discriminant Functions
• Normal Distribution: Continuous probability distribution defined by mean $\mu$ and variance $\sigma^2$.
• PDF (1D):
$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
• In classification, the likelihood of a feature vector under each class is computed using the Gaussian model.
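A minimal sketch of evaluating class-conditional Gaussian likelihoods in Python (the class parameters below are illustrative assumptions, not values from these notes):
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    # 1-D normal density p(x) for mean mu and variance sigma2
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Illustrative class-conditional parameters (mean, variance) for two classes
params = {"class_0": (0.0, 1.0), "class_1": (2.0, 0.5)}
x = 1.2
likelihoods = {c: gaussian_pdf(x, mu, s2) for c, (mu, s2) in params.items()}
print(likelihoods)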
Discrete Features
• When input features are discrete (e.g., binary or categorical).
• Probability mass function is used.
• Bayes' theorem:
$P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\, P(\omega_i)}{P(x)}$
• Maximum a Posteriori (MAP) classifier selects class with highest posterior
probability.
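A minimal sketch of a MAP decision for a single discrete feature, assuming hypothetical class-conditional PMFs and priors (the numbers are illustrative only):
import numpy as np

# Hypothetical class-conditional PMFs over a discrete feature x in {0, 1, 2}
pmf = {"w1": [0.6, 0.3, 0.1],
       "w2": [0.2, 0.3, 0.5]}
prior = {"w1": 0.7, "w2": 0.3}

def map_classify(x):
    # Posterior is proportional to likelihood * prior; P(x) cancels in the argmax
    scores = {w: pmf[w][x] * prior[w] for w in pmf}
    return max(scores, key=scores.get)

print(map_classify(2))  # selects the class with the highest posterior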
Unit 3: Parameter Estimation Methods
Maximum Likelihood Estimation (MLE)
• Estimate parameters such that the likelihood of observed data is maximized.
• For Gaussian:
o Mean: $\hat{\mu} = \frac{1}{n} \sum_i x_i$
o Variance: $\hat{\sigma}^2 = \frac{1}{n} \sum_i (x_i - \hat{\mu})^2$
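The Gaussian MLE formulas above translate directly into code; a small numpy sketch with a toy sample:
import numpy as np

x = np.array([1.8, 2.1, 2.4, 1.9, 2.3])   # toy sample (illustrative)
mu_hat = x.mean()                          # MLE of the mean
var_hat = ((x - mu_hat) ** 2).mean()       # MLE of the variance (1/n, not 1/(n-1))
print(mu_hat, var_hat)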
Gaussian Mixture Models (GMM)
• Models data as a mixture of multiple Gaussians.
• Each component represents a cluster or subpopulation.
• Useful in real-world problems like speaker identification.
Expectation-Maximization (EM)
• Algorithm for parameter estimation in models like GMM.
1. E-step: Estimate hidden variables (posterior probabilities).
2. M-step: Update parameters to maximize expected likelihood.
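A minimal EM sketch for a two-component 1-D GMM, following the E-step/M-step loop above (initialisation and data are illustrative; a real implementation would add convergence checks and numerical safeguards):
import numpy as np

def em_gmm_1d(x, n_iter=50):
    # Crude initialisation of means, variances, and mixture weights
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and variances from the responsibilities
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 1, 200)])
print(em_gmm_1d(x))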
Bayesian Estimation
• Incorporates prior knowledge into parameter estimation.
• Uses Bayes’ theorem to update belief about parameters after seeing data.
• More robust when data is sparse or uncertain.
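As one concrete case (an assumption going beyond these notes), for a Gaussian likelihood with known variance and a Gaussian prior on the mean, the posterior mean and variance have closed forms; a small sketch:
import numpy as np

def posterior_mean_var(x, sigma2, mu0, sigma0_2):
    # Bayesian update of a Gaussian mean with known variance sigma2
    # and a Gaussian prior N(mu0, sigma0_2); the values passed in are illustrative.
    n, xbar = len(x), np.mean(x)
    mu_n = (sigma2 * mu0 + n * sigma0_2 * xbar) / (sigma2 + n * sigma0_2)
    var_n = (sigma2 * sigma0_2) / (sigma2 + n * sigma0_2)
    return mu_n, var_n

print(posterior_mean_var(np.array([1.9, 2.2, 2.0]), sigma2=1.0, mu0=0.0, sigma0_2=4.0))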
Unit 4: Hidden Markov Models (HMMs)
Discrete HMMs
• Markov process where state is hidden but generates observable symbols.
• Defined by:
o A: state transition probabilities
o B: emission probabilities (output given state)
o π: initial state distribution
• Applications: Speech recognition, bioinformatics.
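A minimal sketch of the forward algorithm, which computes the probability of an observation sequence under a discrete HMM defined by (A, B, π); the model numbers are illustrative assumptions:
import numpy as np

def forward(A, B, pi, obs):
    # Forward algorithm: P(observation sequence | model)
    alpha = pi * B[:, obs[0]]                 # initialisation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]         # induction step
    return alpha.sum()                        # termination

# Illustrative 2-state, 2-symbol model
A = np.array([[0.7, 0.3], [0.4, 0.6]])        # state transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])        # emission probabilities
pi = np.array([0.6, 0.4])                     # initial state distribution
print(forward(A, B, pi, obs=[0, 1, 0]))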
Continuous HMMs
• Observations are continuous, not discrete.
• Emission probability modeled using Gaussian or GMM.
Unit 5: Dimension Reduction Methods
Fisher’s Linear Discriminant
• Projects data to a lower dimension to maximize class separability.
• Maximizes the ratio of between-class scatter to within-class scatter.
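For the two-class case, the Fisher direction is proportional to $S_W^{-1}(m_1 - m_2)$; a small numpy sketch with toy data:
import numpy as np

def fisher_direction(X1, X2):
    # Two-class Fisher discriminant: w proportional to Sw^{-1} (m1 - m2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter = sum of the two class scatter matrices
    Sw = np.cov(X1, rowvar=False) * (len(X1) - 1) + np.cov(X2, rowvar=False) * (len(X2) - 1)
    w = np.linalg.solve(Sw, m1 - m2)
    return w / np.linalg.norm(w)

X1 = np.random.multivariate_normal([0, 0], np.eye(2), 100)   # toy class 1
X2 = np.random.multivariate_normal([3, 1], np.eye(2), 100)   # toy class 2
print(fisher_direction(X1, X2))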
Principal Component Analysis (PCA)
• Projects data to new axes (principal components) to capture maximum variance.
• Steps:
1. Mean normalization
2. Compute covariance matrix
3. Compute eigenvectors
4. Select top k components
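A minimal numpy sketch of the four PCA steps above:
import numpy as np

def pca(X, k):
    # PCA: mean-normalise, covariance, eigen-decomposition, keep top-k components
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1][:k]             # largest-variance directions first
    return Xc @ eigvecs[:, order]                     # projected data

X = np.random.randn(200, 5)                           # toy data
print(pca(X, k=2).shape)                              # (200, 2)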
Parzen Window
• Non-parametric way to estimate the PDF of a random variable.
• Uses kernels (e.g., Gaussian) placed on each data point.
• Good for visualizing data distribution without assuming a specific shape.
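A minimal 1-D Parzen-window sketch with a Gaussian kernel (the bandwidth h and toy data are illustrative choices):
import numpy as np

def parzen_density(x_grid, samples, h):
    # Parzen-window estimate: average of Gaussian kernels centred on each sample
    diffs = (x_grid[:, None] - samples[None, :]) / h
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / h                   # (1/(n*h)) * sum of kernels

samples = np.random.normal(0, 1, 300)                 # toy data
grid = np.linspace(-4, 4, 9)
print(parzen_density(grid, samples, h=0.5))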
K-Nearest Neighbours (KNN)
• For a new point, find the k closest training samples and vote for the class.
• Simple and effective; sensitive to the choice of k and distance metric.
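A minimal KNN sketch using Euclidean distance and majority voting (toy data for illustration):
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Vote among the k nearest training samples (Euclidean distance)
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [5, 5], [6, 5]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([5.5, 4.8]), k=3))   # -> 1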
Unit 6: Non-Parametric Techniques for Density Estimation
• No assumption about the form of the distribution.
• Techniques:
o Histogram: Divide data range into bins.
o KNN Density: Density at a point is estimated as $k/(nV)$, where $V$ is the volume needed to enclose the $k$ nearest samples.
o Parzen Window: Smooth version using kernel functions.
• Useful when underlying distribution is unknown or multimodal.
Unit 7: Linear Discriminant Function Based Classifier
Perceptron
• Linear binary classifier: decision based on the sign of $y = w^T x + b$
• Learning rule: Adjust weights based on classification error.
• Converges if data is linearly separable.
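A minimal sketch of the perceptron learning rule with labels in {-1, +1} (learning rate and epoch count are illustrative):
import numpy as np

def perceptron_train(X, y, lr=1.0, epochs=100):
    # Perceptron rule: on a mistake, move w and b toward the misclassified sample
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):                      # labels yi in {-1, +1}
            if yi * (w @ xi + b) <= 0:                # misclassified (or on the boundary)
                w += lr * yi * xi
                b += lr * yi
    return w, b

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])   # linearly separable toy data
y = np.array([1, 1, -1, -1])
print(perceptron_train(X, y))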
Support Vector Machine (SVM)
• Finds optimal hyperplane that maximizes margin between classes.
• Can use kernel trick to handle non-linear data (e.g., polynomial, RBF).
• Solves convex optimization problem.
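A minimal usage sketch with scikit-learn's SVC, assuming scikit-learn is available (toy data and near-default hyperparameters):
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [4, 5], [5, 4]])        # toy training data
y = np.array([0, 0, 1, 1])

clf = SVC(kernel='rbf', C=1.0)                        # kernel trick via the RBF kernel
clf.fit(X, y)
print(clf.predict([[0.5, 0.5], [4.5, 4.5]]))          # expected: [0 1]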
Unit 8: Non-Metric Methods for Pattern Classification
Non-Numeric (Nominal) Data
• Feature values are labels (e.g., male/female).
• Require non-metric classifiers (not based on distances).
Decision Trees
• Recursive structure of nodes and branches.
• Internal nodes → feature tests.
• Leaves → class labels.
• Split criteria: Information gain, Gini index, Chi-square.
• Advantages:
o Easy to understand
o Handles both categorical and numerical data
• Disadvantage: Prone to overfitting (can be solved using pruning).
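A small sketch of the Gini split criterion mentioned above: compute the impurity of each child node and weight by child size (a lower weighted impurity means a better split):
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_of_split(labels_left, labels_right):
    # Weighted impurity of a candidate split
    n = len(labels_left) + len(labels_right)
    return (len(labels_left) * gini(labels_left) + len(labels_right) * gini(labels_right)) / n

print(gini_of_split(np.array([0, 0, 0, 1]), np.array([1, 1, 1, 0])))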
Unit 9: Unsupervised Learning and Clustering
Criterion Functions
• Measure quality of clustering:
o Intra-cluster distance: Should be small.
o Inter-cluster distance: Should be large.
o Silhouette score, SSE, etc.
Clustering Algorithms:
• K-means:
1. Choose k centers randomly
2. Assign points to nearest center
3. Update centers
4. Repeat until convergence
• Hierarchical Clustering:
o Agglomerative: Merge closest clusters (bottom-up)
o Divisive: Recursively split clusters until only singleton clusters remain (top-down)
• Other Methods:
o DBSCAN: Density-based clustering, handles noise
o Mean-Shift: Moves points to high-density regions
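A minimal numpy sketch of the K-means loop described under Clustering Algorithms above (random seed and toy data are illustrative; the sketch assumes no cluster becomes empty):
import numpy as np

def kmeans(X, k, n_iter=100):
    # Lloyd's algorithm: assign points to the nearest centre, then recompute centres
    rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), k, replace=False)]   # random initial centres
    for _ in range(n_iter):
        labels = np.argmin(np.linalg.norm(X[:, None] - centers, axis=2), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5])   # two toy blobs
labels, centers = kmeans(X, k=2)
print(centers)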