21CSC305P - MACHINE LEARNING
UNIT-I
Machine learning - What and Why, Supervised Learning,
Unsupervised learning, Polynomial curve fitting, Probability theory -
discrete random variables, Fundamental rules, Bayes rule,
Independence and conditional independence, Continuous random
variables, Quantiles, mean and variance, Probability densities,
Expectation and covariance
Machine learning- What and Why
• The rise of big data demands machine learning for efficient data analysis and
decision-making.
• For instance, there are around 1 trillion web pages, and every second, one hour of
video content is uploaded to YouTube, equating to 10 years of content every day.
Additionally, thousands of human genomes, each consisting of approximately 3.8
billion base pairs, have been sequenced, and Walmart handles over 1 million
transactions per hour, resulting in databases containing more than 2.5 petabytes of
information.
• Machine learning comprises techniques that can automatically detect patterns within
data and leverage these patterns to predict future data or make decisions under
uncertainty.
• The optimal approach to addressing such challenges is through probability theory,
which applies to any problem involving uncertainty.
Types of Machine Learning
Predictive or Supervised Learning:
•Goal: Learn a mapping from inputs x to outputs y, given a labeled set of input-output pairs D = {(xi, yi)}, i = 1,…,N.
•Training set D: the set of input-output pairs; N: the number of training examples.
•Training input xi:
• Typically a D-dimensional vector of numbers.
• Represents features, attributes, or covariates (e.g., height and weight of a person).
• Can be complex structured objects (e.g., images, sentences, time series, molecular shapes, graphs).
•Output or response variable yi:
• Can be categorical/nominal (e.g., male or female) or real-valued (e.g., income level).
• Categorical problems are known as classification or pattern recognition.
• Real-valued problems are known as regression.
• Ordinal regression: Label space Y has a natural ordering (e.g., grades A-F).
Descriptive or Unsupervised Learning:
•Goal: Find "interesting patterns" in the data.
•Given only inputs D = {xi}, i = 1,…,N, with no corresponding outputs. Also known as knowledge discovery.
•No well-defined problem as patterns are not specified in advance.
Reinforcement Learning:
•Useful for learning how to act or behave when given occasional reward or punishment signals.
•Example: How a baby learns to walk.
Supervised learning
1. Classification:
• Goal of Classification: Learn a mapping from inputs x to outputs y,
where y∈{1,…,C} with C being the number of classes.
• Binary Classification: When C=2, known as binary classification (e.g.,
y∈{0,1})
• Multiclass Classification: When C>2, known as multiclass classification.
• Multi-label Classification: When class labels are not mutually exclusive
(e.g., someone classified as tall and strong), predicting multiple related
binary class labels (multiple output model).
• One way to formalize the problem is as function approximation.
Assume y = f(x) for an unknown function f; learning aims to estimate f
from a labeled training set and then predict with ŷ = f̂(x).
• Generalization: The main goal is to make accurate predictions on novel inputs not
seen before, emphasizing the importance of generalization over merely fitting the
training set.
Supervised learning-Cont.
Example
• Two classes of objects with labels 0 and 1.
• Inputs are colored shapes, described by D features or attributes.
• Features are stored in an N×D design matrix.
• Input features x can be discrete, continuous, or both. Vector of training labels y.
• Test objects: blue crescent, yellow circle, and blue arrow.
• These test objects have not been seen before, requiring generalization beyond the training set.
Generalization:
•Blue crescent likely has y=1 since all blue shapes in the training set are labeled 1.
•Yellow circle's label is unclear due to mixed labels for yellow objects and circles.
•Blue arrow's label is also unclear due to lack of specific information from the training set.
Supervised learning-Cont.
The need for probabilistic predictions:
• In classification, ambiguous cases should be handled by returning a probability
distribution over possible labels given the input and training set, denoted by p(y∣x,D)
•Compute the best guess as the most probable class label, ŷ = argmax_{c=1,…,C} p(y=c∣x,D)
(a small numeric sketch follows this list).
•This is the mode of the distribution p(y∣x,D) and is known as a MAP estimate (maximum a
posteriori).
• Confidence in predictions is crucial, especially in risk-averse domains like medicine
and finance.
• IBM's Watson for Jeopardy uses a confidence module to decide when to answer.
• Google's SmartASS (ad selection system) predicts the click-through rate (CTR) to
maximize expected profit.
• Systems like Watson and SmartASS assess the risk of their predictions, making
decisions based on confidence levels to optimize performance and minimize errors.
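The following minimal sketch (assuming NumPy and a made-up posterior vector, not any of the systems mentioned above) shows how a MAP prediction and its confidence can be read off p(y∣x,D):

```python
import numpy as np

# Hypothetical posterior p(y = c | x, D) over C = 3 classes for one test input.
class_probs = np.array([0.15, 0.70, 0.15])

# MAP estimate: the mode of the distribution, i.e. the most probable class label.
y_map = int(np.argmax(class_probs))

# Confidence of the prediction; a risk-averse system might abstain below a threshold.
confidence = class_probs[y_map]
print(y_map, confidence)  # 1 0.7
```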
Supervised learning-Cont.
Real-world applications:
(i) Document classification and email spam filtering
• In document classification, the primary objective is to categorize documents like web
pages or email messages into predefined classes C, determining p(y=c∣x,D), where x
represents the document's text representation.
• A classic example is email spam filtering, where classes are typically labeled as spam (
y=1 ) or non-spam ( y=0).
• Most classifiers assume a fixed-size input vector x. To handle variable-length documents,
a common approach is the bag of words (BoW) representation.
• Bag of Words (BoW):
• Documents are transformed into fixed-size feature vectors.
• Each vector element corresponds to a word from a predefined vocabulary.
• If a word appears in the document, its corresponding vector element is set to 1; otherwise,
it remains 0.
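A minimal sketch of this binary bag-of-words encoding, assuming Python and a tiny made-up vocabulary (not taken from any real spam corpus):

```python
import re

# Illustrative, hand-picked vocabulary; real systems use thousands of words.
vocabulary = ["free", "money", "meeting", "tomorrow", "offer"]

def bag_of_words(document):
    # Lowercase the text, keep only alphabetic tokens, and note which vocabulary words occur.
    words = set(re.findall(r"[a-z]+", document.lower()))
    return [1 if w in words else 0 for w in vocabulary]

print(bag_of_words("Free money offer!!!"))       # [1, 1, 0, 0, 1]
print(bag_of_words("Meeting tomorrow at noon"))  # [0, 0, 1, 1, 0]
```

Every document, regardless of its length, is thus mapped to a fixed-size vector that a standard classifier can consume.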
Supervised learning-Cont.
(ii) Classifying flowers
• The goal is to classify iris flowers into three types: setosa, versicolor, and virginica,
based on four extracted features: sepal length, sepal width, petal length, and petal
width.
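As an illustration, the iris task can be set up in a few lines with scikit-learn (assuming it is installed); logistic regression is just one of many classifiers that could be used here:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)          # 4 features: sepal/petal length and width
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))               # accuracy on held-out flowers
print(clf.predict_proba(X_te[:1]))         # p(y = c | x, D) for one test flower
```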
Supervised learning-Cont.
(iii) Image classification and handwriting recognition
• Image classification involves categorizing images
based on their content, such as indoor vs. outdoor
scenes, orientation (horizontal vs. vertical), or
presence of specific objects like dogs.
• MNIST (which stands for “Modified National
Institute of Standards and Technology”) is a widely used dataset for
handwritten digit recognition, containing 60,000
training images and 10,000 test images of digits (0-9).
• Each image is grayscale, sized 28x28 pixels, and
represents handwritten digits by various individuals.
• Images are represented as feature vectors, where each
pixel's grayscale value (ranging from 0 to 255) serves
as a feature.
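A small sketch of this pixel-based representation, assuming the images are already loaded as a NumPy array (the random array below merely stands in for the real MNIST images):

```python
import numpy as np

# Stand-in batch of grayscale digit images, shaped (N, 28, 28) with values 0-255.
images = np.random.randint(0, 256, size=(60000, 28, 28), dtype=np.uint8)

# Flatten each image into a 784-dimensional feature vector and scale pixels to [0, 1].
X = images.reshape(len(images), -1).astype(np.float32) / 255.0
print(X.shape)  # (60000, 784)
```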
Supervised learning-Cont.
(iv) Face detection and recognition
• Object detection, or localization, involves
identifying specific objects within an image. A
notable application is face detection, which is
crucial for tasks like autofocus in cameras and
privacy features in services like Google's
StreetView.
• One approach to face detection is the sliding
window detector method. It divides the image into
small overlapping patches at various locations,
scales, and orientations.
• Each patch is classified based on whether it
exhibits face-like textures or features. Locations
where the probability of containing a face is high
are identified as potential face locations.
• Modern digital cameras often integrate face
detection systems to assist with autofocus by
identifying and focusing on faces within the frame.
• Services like Google's StreetView use face
detection to automatically blur faces to protect
privacy.
Supervised learning-Cont.
2. Regression:
• Regression is just like classification, except that the response variable is continuous.
Here are some examples of real-world regression problems.
• Predict tomorrow’s stock market price given current market conditions and other possible side information.
• Predict the age of a viewer watching a given video on YouTube.
• Predict the location in 3d space of a robot arm end effector, given control signals (torques) sent to its various
motors.
• Predict the amount of prostate specific antigen (PSA) in the body as a function of a number of different clinical
measurements.
• Predict the temperature at any location inside a building using weather data, time, door sensors, etc.
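A minimal regression sketch on synthetic data (unrelated to the applications above), showing a continuous response being fit and predicted; it assumes scikit-learn is available:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic example: a continuous response with noise (data are made up for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1, size=200)

reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)   # recovered slope and intercept
print(reg.predict([[5.0]]))        # prediction at a novel input
```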
Unsupervised learning
• The goal is to discover “interesting structure” in the data; this is sometimes called
knowledge discovery.
• Unsupervised learning formalizes tasks as density estimation, aiming to model the
probability distribution p(xi∣θ) of input data xi given parameters θ. Unlike supervised
learning, where the focus is on predicting yi given xi and θ, unsupervised learning
directly estimates the density p(xi∣θ).
• Supervised learning involves conditional density estimation p(yi∣xi,θ), where yi is the
target variable. In contrast, unsupervised learning focuses on unconditional density
estimation p(xi∣θ), where xi represents feature vectors.
• In unsupervised learning, xi is typically a vector of features, necessitating the creation
of multivariate probability models to capture dependencies between different features.
• Supervised learning often uses simpler univariate probability models with input-
dependent parameters, focusing on predicting a single variable yi. This simplification is
not applicable in unsupervised settings due to the absence of labeled output.
• Unsupervised learning is more widely applicable than supervised learning since it does not
require costly and often scarce labeled data, making it feasible for modeling complex systems
where labeled data is limited or unavailable.
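As an illustrative sketch of unconditional density estimation p(xi∣θ), one could fit a Gaussian mixture to unlabeled inputs (synthetic data, scikit-learn assumed; the choice of model is not prescribed by the text):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 2-D inputs xi with no labels; the goal is to model p(xi | θ) directly.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_density = gm.score_samples(X[:3])   # log p(xi | θ) for a few points
print(log_density)
```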
Unsupervised learning-Cont.
1. Discovering clusters:
•Clustering involves grouping data points into clusters based on similarities in their
features, without predefined labels.
•The goal is to estimate the distribution p(K∣D) over the number of clusters K, indicating
the presence of subgroups within the data.
•Model selection in clustering aims to determine the optimal number of clusters K∗ often
approximated by the mode of p(K∣D). Unlike supervised learning where classes are
predefined, unsupervised learning allows flexibility in choosing the number of clusters that
best represent the underlying structure of the data.
•Each data point i is assigned to a cluster zi∈{1,…,K} based on the probability
p(zi=k∣xi,D), where xi is the feature vector of the data point.
•Assignments zi∗ are inferred to determine the cluster membership of each data point,
illustrated by different colors representing clusters in visualizations.
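One common way to realize these soft assignments, sketched here with a Gaussian mixture on synthetic data (scikit-learn assumed; K is fixed by hand rather than inferred from p(K∣D)):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic feature vectors xi; K = 3 is chosen by hand for illustration.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 4, 8)])

gm = GaussianMixture(n_components=3, random_state=0).fit(X)
resp = gm.predict_proba(X)     # p(zi = k | xi, D) for each point and cluster
z_star = resp.argmax(axis=1)   # hard assignments zi*
print(resp[0].round(3), z_star[:5])
```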
Unsupervised learning-Cont.
Applications of Clustering:
•Astronomy: Clustering methods like Autoclass have been used to discover new
types of stars based on astrophysical measurements.
•E-commerce: Clustering users based on purchasing or web-surfing behavior
allows for targeted advertising and personalized recommendations.
•Biology: Clustering flow-cytometry data helps identify different sub-populations of
cells, aiding in biological research such as understanding disease mechanisms.
Unsupervised learning-Cont.
2. Discovering latent factors:
• Dimensionality reduction involves projecting high-dimensional data into a lower-
dimensional subspace that captures essential characteristics of the data.
• Despite high-dimensional appearances, data often exhibit variability across a smaller
number of latent factors. Dimensionality reduction helps in focusing on these key
factors, such as lighting, pose, or identity in face image modeling.
• PCA is a common approach for dimensionality reduction, resembling an unsupervised
form of multi-output linear regression.
• Given high-dimensional responses y, PCA infers latent low-dimensional factors z that
explain most of the variability in y.
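A small PCA sketch on synthetic data (scikit-learn assumed), recovering low-dimensional latent factors z from higher-dimensional observations y:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic high-dimensional responses that actually vary along a few latent directions.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))                  # hidden low-dimensional factors
W = rng.normal(size=(2, 50))
Y = Z @ W + 0.1 * rng.normal(size=(500, 50))   # observed 50-dimensional data

pca = PCA(n_components=2).fit(Y)
Z_hat = pca.transform(Y)                       # inferred latent factors z
print(pca.explained_variance_ratio_)           # most variability captured by 2 components
```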
Unsupervised learning-Cont.
Applications:
• In biology, it is common to use PCA to interpret gene microarray data, to account for the
fact that each measurement is usually the result of many genes which are correlated in
their behavior by the fact that they belong to different biological pathways.
• In natural language processing, it is common to use a variant of PCA called latent
semantic analysis for document retrieval.
• In signal processing (e.g., of acoustic or neural signals), it is common to use ICA (which
is a variant of PCA) to separate signals into their different sources.
• In computer graphics, it is common to project motion capture data to a low dimensional
space, and use it to create animations.
Unsupervised learning-Cont.
3. Discovering graph structure
• Learning sparse graphical models involves representing relationships between correlated
variables using a graph G, where nodes depict variables and edges denote direct
dependencies. This approach is pivotal in both discovering new knowledge and enhancing
joint probability density estimators.
• In systems biology, sparse graphical models are used to uncover relationships among
biological entities. For instance, graphs derived from protein phosphorylation data reveal
complex interactions within cellular networks. Similarly, neural wiring diagrams in birds can
be reconstructed from EEG data, highlighting functional connectivity patterns.
• In fields like financial portfolio management, sparse graphs help model covariance between
stocks for better prediction and decision-making. Utilizing sparse graph structures has proven
beneficial in outperforming traditional methods, enabling more effective trading strategies.
• Applications extend to traffic prediction systems, such as JamBayes, which leverage learned
graphical models to forecast traffic flow dynamics. These models contribute to accurate
predictions and efficient management of transportation networks, illustrating the broad
applicability and utility of sparse graphical learning in real-world scenarios.
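As a hedged sketch, a sparse graph over correlated variables can be estimated from data with the graphical lasso (scikit-learn's GraphicalLasso assumed; the data and regularization strength below are illustrative only):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Synthetic data standing in for correlated variables (e.g. stock returns).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] += 0.8 * X[:, 0]        # make variables 0 and 1 directly dependent

model = GraphicalLasso(alpha=0.1).fit(X)
precision = model.precision_    # zeros in the precision matrix correspond to missing edges in G
print(np.round(precision, 2))
```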
Unsupervised learning-Cont.
4. Matrix completion
• Sometimes we have missing data, that is, variables whose values are unknown. For example, we might have
conducted a survey, and some people might not have answered certain questions.
• The corresponding design matrix will then have “holes” in it; these missing entries are often represented by
NaN, which stands for “not a number”. The goal of imputation is to infer plausible values for the missing
entries. This is sometimes called matrix completion.
• Image Inpainting: Technique to fill in missing parts of images due to scratches or occlusions, achieved by
modeling joint probability of pixels from clean images.
• Collaborative Filtering: Predicting user preferences for items (like movies) based on sparse ratings matrices,
aiming to fill in missing ratings for better recommendation systems.
• Market basket analysis:
❖ Involves examining a large, sparse binary matrix where columns represent items/products and rows
represent transactions.
❖ Each entry in the matrix indicates whether an item was purchased in a specific transaction. By analyzing
correlations among items often bought together, predictions can be made about additional items a consumer
might buy based on partial transaction data.
❖ This technique is also applicable in other domains, such as predicting file dependencies in software systems.
❖ Common methods for market basket analysis include frequent itemset mining, which generates association
rules, and probabilistic modeling, which fits a joint density model to the data.
❖ Data mining emphasizes interpretability of models, whereas machine learning focuses on model accuracy.
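A toy matrix-completion sketch using a rank-1 SVD reconstruction of a small ratings matrix (NumPy assumed; real collaborative-filtering systems use more sophisticated models):

```python
import numpy as np

# Made-up users x items ratings matrix with missing entries marked as NaN.
R = np.array([[5.0, 4.0, np.nan],
              [4.0, np.nan, 1.0],
              [np.nan, 5.0, 2.0],
              [1.0, 1.0, 5.0]])

missing = np.isnan(R)
filled = np.where(missing, np.nanmean(R, axis=0), R)   # start from column means

U, s, Vt = np.linalg.svd(filled, full_matrices=False)
low_rank = (U[:, :1] * s[:1]) @ Vt[:1, :]              # rank-1 approximation

imputed = np.where(missing, low_rank, R)               # keep observed entries, fill the holes
print(np.round(imputed, 2))
```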