Introduction to Machine Learning
What is Machine Learning?
The notes define machine learning as:
"An agent is said to learn from experience with respect to some class of tasks and a performance
measure P, if the learner's performance at tasks in the class, as measured by P, improves with
experience."
This definition, from computer scientist Tom M. Mitchell, breaks the concept down into a
learning agent and three key components:
Agent: The machine learning model or algorithm.
Tasks (T): The problem the model is trying to solve, such as image
classification or predicting house prices.
Experience (E): The training data used to train the model.
Performance Measure (P): The metric used to evaluate how well the
model is performing on the tasks, such as accuracy or mean squared
error.
In simple terms, a machine learns when its ability to perform a task gets better after it sees and
analyzes more data.
Machine Learning Paradigms
A machine learning paradigm refers to the fundamental approach or category of an algorithm
based on how it learns from data. Each paradigm is defined by the type of data it uses and the
objective of the learning process.
The three main paradigms are:
1. Supervised Learning
How it works: The algorithm learns from a labeled dataset, which
means the data includes both the input features and the correct output
(or "label"). It's like a student learning with a teacher who provides
correct answers.
Goal: To map inputs to outputs and make accurate predictions on new,
unseen data.
Examples (see the sketch after this list):
- Classification: Predicting a categorical output, such as whether an email is "spam" or "not spam."
- Regression: Predicting a continuous numerical value, such as the price of a house.
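For concreteness, here is a minimal supervised-learning sketch in Python. It assumes scikit-learn is installed, and the email features and labels are invented purely for illustration:

```python
# A toy spam classifier: learn from labeled examples, then predict a new one.
from sklearn.linear_model import LogisticRegression

# Each row is one email: [number of links, number of exclamation marks] (invented data).
X_train = [[0, 0], [1, 0], [7, 5], [9, 8], [0, 1], [8, 6]]
y_train = ["not spam", "not spam", "spam", "spam", "not spam", "spam"]

model = LogisticRegression()
model.fit(X_train, y_train)      # learn the input-to-output mapping from labeled data

print(model.predict([[6, 4]]))   # predict the label of a new, unseen email
```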
2. Unsupervised Learning
How it works: The algorithm is given unlabeled data and must
discover hidden patterns, structures, or relationships on its own. There
is no "teacher" providing correct answers.
Goal: To find intrinsic patterns or group the data into meaningful
categories.
Examples (see the sketch after this list):
- Clustering: Grouping similar data points together, like segmenting customers into different groups based on their purchasing habits.
- Association: Finding rules that describe relationships between items, such as "customers who buy diapers also tend to buy baby wipes."
3. Reinforcement Learning
How it works: An agent learns to make decisions by interacting with
an environment. It receives a reward for desired actions and a
penalty for undesired ones, similar to training a pet with treats. It
learns through a process of trial and error to maximize its cumulative
reward over time.
Goal: To learn the optimal actions to take in a given environment to
achieve a specific goal.
Examples (see the sketch after this list):
- Teaching a robot to navigate a maze.
- Training an AI to play a game like chess or Go.
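The sketch below shows tabular Q-learning, one standard reinforcement-learning algorithm, on an invented five-cell corridor where the agent earns a reward only at the rightmost cell; all hyperparameters are illustrative:

```python
# Tabular Q-learning on a tiny corridor: states 0..4, reward only at state 4.
import random

N_STATES, ACTIONS = 5, [-1, +1]                  # actions: step left or right
alpha, gamma, epsilon, episodes = 0.5, 0.9, 0.1, 500
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for _ in range(episodes):
    s = 0                                        # each episode starts at the left end
    while s != N_STATES - 1:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
        a = random.choice(ACTIONS) if random.random() < epsilon \
            else max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == N_STATES - 1 else 0.0   # reward ("treat") only at the goal
        # Q-learning update: nudge Q(s,a) toward reward + discounted best future value.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# The learned policy: the best action in each state (should be +1, "go right").
print([max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES)])
```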
Inductive Bias
Inductive bias is the set of assumptions a machine learning algorithm makes in order to
generalize from the training data to unseen data. Without some such assumptions, an algorithm
could not generalize at all: the training data alone says nothing about inputs the model has
never encountered. Inductive bias is essentially the model's "prior belief" about the nature of the
relationship between the inputs and outputs. Different algorithms have different inductive biases,
which makes them suitable for different types of problems.
The notes illustrate this with a simple example:
Problem: Given a person's age and income, predict if they will buy a
computer.
Training Set: The notes show a small training set with inputs (x_1,
x_2, etc.) and corresponding outputs (y_1, y_2, etc.). For example, a
person with an income of $30,000 and age of 25 did not buy a
computer (y_1 = -1), while a person with an income of $80,000 and
age of 45 did buy one (y_2 = +1).
The Process: The training data is fed into a training algorithm, which
uses its inductive bias to learn a model (e.g., a classifier). This model,
or "agent," then takes a new input X and predicts an output Ŷ, which is
compared to the actual target output Y to calculate the error.
The Role of Inductive Bias in the Example
The inductive bias of the classifier would be what allows it to make a prediction for a new person
(e.g., age 30, income $50,000) that isn't in the training set.
A linear classifier (like a logistic regression model) has an inductive
bias that assumes the classes can be separated by a straight line. It
might learn a rule like "if income > $50,000 then Buy."
A decision tree has an inductive bias that prefers shorter, simpler
trees. It might learn a set of if-then rules like "if age > 40 AND income
> $70,000 then Buy."
Each algorithm's assumptions guide how it learns and ultimately how it
makes predictions. Choosing the right algorithm for a problem often
comes down to choosing the right inductive bias; the sketch below
contrasts these two biases on the toy data.
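A quick comparison, assuming scikit-learn; income is in thousands of dollars, the first two rows follow the notes' example, and the remaining rows are invented so each model has enough data to fit:

```python
# Two different inductive biases applied to the same toy "will buy a computer?" task.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Each row: [income in $1000s, age]. Label: -1 = did not buy, +1 = bought.
X = [[30, 25], [80, 45], [40, 30], [75, 50], [35, 22], [90, 41]]
y = [-1, +1, -1, +1, -1, +1]

linear = LogisticRegression().fit(X, y)               # bias: a linear decision boundary
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)  # bias: short, simple if-then rules

new_person = [[50, 30]]                               # age 30, income $50,000 (unseen)
print("linear:", linear.predict(new_person), "tree:", tree.predict(new_person))
```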
Applications of Supervised Learning 📊
The notes list several real-world applications of machine learning, most of
which fall under the category of supervised learning (specifically,
classification):
Credit card fraud detection: This involves classifying a transaction as either a
“valid transaction” or a “fraudulent one.” The model is trained on historical
data of known good and bad transactions to learn the patterns of fraud.
Sentiment analysis: This is also known as “opinion mining” or “buzz
analysis.” The goal is to classify text data (like social media posts or product
reviews) as positive, negative, or neutral.
Churn prediction: This involves predicting whether a customer will stop using
a service (i.e., “potential churner” or “not”). It’s a critical application for
businesses to identify and retain at-risk customers.
Medical diagnosis: Machine learning models can analyze medical data (e.g.,
symptoms, test results) to assist in diagnosing diseases. This is a form of risk
analysis that helps determine the likelihood of a specific condition.
Risk analysis: This is a broad application, often used to assess the likelihood
of a particular outcome, such as financial risk, patient risk, or fraud risk.
Prediction or Regression 📉
The notes then focus on regression, a specific type of supervised learning.
The key concept is that the output is not a discrete category but a continuous value.
Example: The notes give the example of predicting “Temperature at time of
day.” The temperature can be any value within a range, not just a fixed
category like “hot” or “cold.”
Linear Regression
Minimize sum squared errors: The core principle of linear regression is to find
the best-fitting line through the data points by minimizing the sum of the
squared vertical distances (residuals) from each point to the line. This is the
objective function of the algorithm.
With sufficient data, simple enough: The notes state that linear regression
works well when there is enough data, and the relationship between the
variables is relatively simple.
With many dimensions, challenge is to avoid overfitting: In more complex
problems with many features (“dimensions”), the model can become too
specific to the training data and fail to generalize to new data. This is known
as overfitting.
Regularization: This is a technique used to combat overfitting by adding a
penalty that discourages overly complex models, for example by penalizing
large coefficient values.
Higher order functions and basis transformations: These techniques let
linear regression handle non-linear problems. By transforming the input
data, linear regression can fit more complex, curved relationships. The
notes give an example of transforming a single feature x_1 into multiple
features (x_1, x_1^2, x_1^3, ...) to fit a higher-order polynomial (see
the sketch after this list).
Applications: The notes list applications for regression, such as “Time series
predictions,” and provide the examples of “Rainfall” and “Phone time,” which
are continuous values that change over time.
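A sketch of the basis-transformation idea, assuming scikit-learn and NumPy. The single feature x is expanded into (x, x^2, x^3), and Ridge regression stands in for "linear regression plus a regularization penalty"; the data is synthetic:

```python
# Fit a curved relationship with a linear model by transforming the input feature.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = x.ravel() ** 3 - 2 * x.ravel() + rng.normal(0, 1, 30)   # noisy cubic relationship

# Basis transformation: each input x becomes the feature vector (x, x^2, x^3).
X_poly = PolynomialFeatures(degree=3, include_bias=False).fit_transform(x)

# Ridge = ordinary least squares plus a penalty on large weights (regularization).
model = Ridge(alpha=1.0).fit(X_poly, y)
print(model.coef_)   # one learned weight per transformed feature
```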
Linear regression is a supervised machine learning algorithm used to predict
a continuous output value (like price or temperature) based on one or more
input variables. It works by finding the “best-fit” straight line that represents
the relationship between the input variables and the output variable.
The core idea is to model the relationship with a linear equation, like y = mx
+ b.
y is the dependent variable (the value you want to predict).
x is the independent variable (the input you are using to make the prediction).
m is the slope of the line, which shows how much y is expected to change for
every one-unit change in x.
b is the y-intercept, the predicted value of y when x is zero.
The algorithm calculates the values for m and b that minimize the difference
between the predicted values and the actual data points. This is done using a
method called the least squares method, which minimizes the sum of the
squared errors.
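A minimal sketch of the least squares method, using the closed-form formulas for m and b; the house sizes and prices below are made up for illustration:

```python
# Simple linear regression from scratch via the least squares formulas.
import numpy as np

x = np.array([50, 70, 90, 110, 130])      # e.g. house size in square meters (invented)
y = np.array([150, 200, 260, 310, 360])   # e.g. price in thousands of dollars (invented)

# m = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2), b = mean_y - m * mean_x
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()               # the fitted line passes through the means

print(m, b)          # fitted slope and intercept
print(m * 100 + b)   # predicted price for a 100 m^2 house
```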
Key Concepts
Simple vs. Multiple Linear Regression: Simple linear regression
involves only one independent variable, while multiple linear
regression uses two or more.
Continuous Output: The output of linear regression is a continuous,
numerical value, distinguishing it from classification algorithms that
predict discrete categories.
Overfitting: If the model is too complex and fits the training data too
closely (especially with many variables), it may fail to generalize to
new data. Regularization is a technique used to prevent this by
penalizing complex models.
Applications 💡
Linear regression is used in many fields, including:
Sales forecasting: Predicting future sales based on past data and
marketing spend.
Real estate: Estimating house prices based on features like size,
number of rooms, and location.
Finance: Predicting stock prices or assessing financial risk.
The notes explain several concepts, including unsupervised learning, its
applications, and a specific technique called association rule mining.
Unsupervised Learning
Unsupervised learning is a type of machine learning that works with
unlabeled data. Unlike supervised learning, there’s no “correct” output
provided for the training data. The goal is for the algorithm to find
hidden patterns, structures, or relationships within the data on its own.
The notes mention several applications of this paradigm:
Customer Data: This refers to customer segmentation, where an
algorithm discovers different “classes of customers” based on their
purchasing habits, demographics, or browsing behavior.
Image Pixels: The notes mention discovering “regions” in images,
which is a common task in image segmentation. An unsupervised
algorithm can group similar pixels together to identify objects or
different parts of a picture without being told what they are.
Words: The notes mention finding “synonyms” and “topics” in
documents. This is a task in Natural Language Processing (NLP), where
models can cluster similar words or group related documents together
without needing pre-labeled categories.
Association Rule Mining
This is a specific unsupervised learning technique used to discover
frequent patterns and rules in large datasets. It finds conditional
dependencies between items, often expressed as an “if-then” rule.
The process has two main stages:
Find frequent patterns: The algorithm first identifies combinations of
items that appear together frequently in the data.
Derive associations from these patterns: Once the frequent patterns
are found, the algorithm creates association rules from them. For
example, if milk and bread form a frequent pattern, the rule could be "If
a customer buys milk, they will also buy bread." A sketch of both stages follows.
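A small from-scratch sketch of both stages on invented transactions; the 40% support threshold is arbitrary:

```python
# Stage 1: count frequent item pairs; stage 2: derive "if A then B" rules from them.
from collections import Counter
from itertools import combinations

transactions = [                       # invented market-basket data
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk", "eggs"},
]

item_counts = Counter(item for t in transactions for item in t)
pair_counts = Counter(pair for t in transactions for pair in combinations(sorted(t), 2))

for (a, b), n in pair_counts.items():
    support = n / len(transactions)       # how often the pair occurs overall
    if support >= 0.4:                    # stage 1: keep only frequent patterns
        confidence = n / item_counts[a]   # stage 2: strength of "if a then b"
        print(f"if {a} then {b} (support={support:.2f}, confidence={confidence:.2f})")
```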
The notes list three types of data where this is applied:
Sequences: Finding patterns in time-series data or a sequence of
events, like in “fault analysis.”
Transactions: This is the classic application of market basket analysis,
where the goal is to find relationships between items bought together
in transactions.
Graphs: This is used for “social network analysis” to find relationships
and communities within a network.