ML Unit 1 Important Questions and Answers

1. List and explain some successful applications of machine learning.
Or: Explain the machine learning process in detail with an example application.
Machine learning is one of the most exciting technologies one is likely to come across. As the name suggests, it gives computers a capability that makes them more similar to humans: the ability to learn. Machine learning is actively used today, perhaps in many more places than one would expect. Companies use it to improve business decisions, increase productivity, detect disease, forecast weather, and much more.
With the exponential growth of technology, we not only need better tools to understand the data we currently have, but we also need to prepare for the data we will have. To achieve this goal we need to build intelligent machines. We can write a program to do simple things, but hardwiring intelligence into it is usually difficult. The better approach is to give machines some way to learn things themselves: if a machine can learn from input, it does the hard work for us. This is where machine learning comes into action. Some of the most common examples are:
 Image Recognition
 Speech Recognition
 Recommender Systems
 Fraud Detection
 Self Driving Cars
 Medical Diagnosis
 Stock Market Trading
 Virtual Try On
Image Recognition
Image recognition is one of the reasons behind the boom in the field of Deep Learning. A task that started with classifying cat and dog images has now evolved to the level of face recognition and real-world use cases built on it, such as employee attendance tracking.
Image recognition has also helped revolutionize the healthcare industry by employing smart systems in disease recognition and diagnosis.

Speech Recognition
You have almost certainly come across, and communicated with, speech-recognition-based smart systems such as Alexa and Siri. In the backend, these systems are built on speech recognition: they are designed to convert voice instructions into text.
Another application of speech recognition we encounter in day-to-day life is performing Google searches just by speaking.

Recommender Systems
As the world becomes more and more digitalized, almost every tech giant tries to provide customized services to its users. This is possible because of recommender systems, which analyze a user’s preferences and search history and, based on that, recommend content or services.
YouTube is a common example: it recommends new videos and content based on the user’s past search and viewing patterns. Netflix recommends movies and series based on the interests a user provides when creating an account for the very first time.
Fraud Detection
In today’s world, most things have been digitalized: from buying a toothbrush to making transactions of millions of dollars, everything is accessible and easy to do. But with this digitization, cases of fraudulent transactions and fraudulent activities have increased. Identifying them is not easy, but machine learning systems are very efficient at these tasks.
Thanks to these applications, whenever the system detects red flags in a user’s activity, a suitable notification is sent to the administrator so that the case can be monitored properly for spam or fraud.

Self Driving Cars


If we ever saw a car being driven without a driver, we might once have assumed that some ghost was driving it. Thanks to machine learning and deep learning, in today’s world this is possible and not a story from a fiction book. Even though the algorithms and tech stack behind these technologies are highly advanced, at the core it is machine learning that has made these applications possible.
The most common example of this use case is Tesla’s cars, which are well tested and proven for autonomous driving.
Medical Diagnosis
If you are a machine learning practitioner, or even a student, you must have heard about projects like breast cancer classification, Parkinson’s disease classification, pneumonia detection, and many other health-related tasks performed by machine learning models with more than 90% accuracy.
These models work not only for disease diagnosis in human beings but also for plant-disease tasks, whether predicting the type of disease or detecting whether a disease is going to occur in the future.

Stock Market Trading


The stock market has remained a hot topic among working professionals and even students, because with sufficient knowledge of the markets and the forces that drive them, one can make a fortune in this domain. Attempts have been made to create intelligent systems that can predict future price trends and market value.
This can be considered an application of time series forecasting, because stock price data is sequential data in which the time at which each observation was taken is of utmost importance.
Virtual Try On
Have you ever purchased specs or lenses from Lenskart? If so, you must have come across its feature that lets you try different frames virtually without actually purchasing them or visiting the outlet. This has become possible because machine learning systems identify certain landmarks on a person’s face and then place the specs virtually on the face using those landmarks.

2. List some disciplines, with examples of their influence on and issues with machine learning.
1. Mathematics
Mathematics is the backbone of machine learning. It helps in designing
algorithms, making predictions, and optimizing models.
 Influence:
o Linear Algebra – Used in data representation (e.g., matrices for
neural networks).
o Probability & Statistics – Help in making predictions, uncertainty
estimation, and pattern recognition.
o Calculus – Used in optimization techniques like gradient descent to
improve ML models (a minimal sketch follows this list).
 Examples:
o Face recognition uses linear algebra to process image data.
o Weather prediction uses probability to estimate future conditions.
 Issues:
o High-dimensional data makes computations complex.
o Many ML models act as "black boxes," meaning we don’t fully
understand how they work.
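To make the calculus point above concrete, here is a minimal, self-contained sketch of gradient descent on a toy quadratic loss; the loss function, starting point, and learning rate are illustrative assumptions, not from any particular ML library.

```python
# Minimal gradient descent sketch: minimize f(w) = (w - 3)^2.
# The loss, starting point, and learning rate are illustrative choices.

def gradient(w):
    return 2 * (w - 3)  # df/dw for f(w) = (w - 3)^2

w = 0.0    # initial guess
lr = 0.1   # learning rate (step size)
for step in range(50):
    w -= lr * gradient(w)  # move against the gradient

print(round(w, 4))  # converges toward the minimizer w = 3
```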

2. Computer Science
Computer science provides the programming tools and frameworks for
implementing ML models efficiently.
 Influence:
o Algorithms – Data structures and algorithms improve ML
performance.
o Software Engineering – Helps build scalable ML systems.
o Big Data Processing – Enables handling large datasets using
technologies like Hadoop and Spark.
 Examples:
o Google search uses ML to provide better results.
o Self-driving cars use ML for decision-making.
 Issues:
o Training deep learning models requires high computational power.
o Poorly designed ML models can introduce security vulnerabilities
(e.g., adversarial attacks).

3. Neuroscience
Neuroscience inspires deep learning models, particularly neural networks.
 Influence:
o Neural networks are modeled after human brain neurons.
o Cognitive science helps in developing AI with human-like learning
abilities.
 Examples:
o Chatbots use deep learning to understand and respond to human
language.
o AI in medical research mimics brain functions to analyze diseases.
 Issues:
o Neural networks do not fully replicate human thinking.
o AI still lacks consciousness and true reasoning abilities.

4. Ethics and Philosophy


ML must be fair and accountable, avoiding biases and ethical problems.
 Influence:
o Ensures AI follows ethical principles like fairness and privacy.
o Discusses AI's impact on jobs and human rights.
 Examples:
o AI hiring systems must avoid bias against certain groups.
o Social media algorithms should not spread misinformation.
 Issues:
o Bias in training data leads to unfair decisions.
o AI can be used for harmful purposes (e.g., deepfake technology).

5. Psychology and Cognitive Science


Understanding human behavior helps AI become more natural and interactive.
 Influence:
o Helps in designing AI that interacts well with humans.
o Aids in developing recommendation systems (e.g., Netflix,
YouTube).
 Examples:
o AI assistants like Siri and Google Assistant use psychology-based
models.
o Personalized learning platforms adapt to students’ learning styles.
 Issues:
o AI struggles with understanding emotions and sarcasm.
o Human-like AI raises concerns about manipulation (e.g., deepfake
videos).

6. Economics and Game Theory


ML helps in business decision-making and strategic planning.
 Influence:
o AI-driven financial models predict stock market trends.
o Game theory helps AI make optimal decisions in competitive
environments.
 Examples:
o Uber uses AI to optimize ride pricing.
o AI is used in algorithmic trading to maximize profits.
 Issues:
o AI-driven financial systems can create market instability.
o AI pricing algorithms may lead to unfair competition.
7. Linguistics
Natural Language Processing (NLP) enables machines to understand and process
human language.
 Influence:
o Speech and text recognition allow AI to communicate.
o NLP models analyze text sentiment and meaning.
 Examples:
o Google Translate uses NLP for real-time language translation.
o AI chatbots in customer service understand and respond to users.
 Issues:
o AI struggles with sarcasm, slang, and cultural variations.
o NLP models can spread misinformation or generate biased content.

8. Law and Policy


Legal frameworks guide AI development to ensure responsible usage.
 Influence:
o Data privacy laws regulate AI usage (e.g., GDPR).
o AI governance prevents unethical practices.
 Examples:
o GDPR protects user data in AI-driven platforms.
o AI in court cases helps analyze legal documents.
 Issues:
o Laws struggle to keep up with fast AI advancements.
o Lack of global AI regulations leads to ethical concerns.

9. Physics
Physics contributes to AI optimization and simulations.
 Influence:
o Quantum computing enhances ML model efficiency.
o Physics-based AI helps in scientific discoveries.
 Examples:
o AI models predict weather changes using physics simulations.
o Quantum ML is being researched for faster computations.
 Issues:
o Quantum computing is still in its early stages.
o Some physics problems are too complex for ML to model accurately.

10. Biology and Medicine


ML is transforming healthcare through data analysis and automation.
 Influence:
o AI assists in medical imaging and drug discovery.
o Personalized medicine uses ML to suggest treatments.
 Examples:
o AI detects cancer from X-rays with high accuracy.
o AI predicts disease outbreaks based on data patterns.
 Issues:
o ML models in healthcare must be highly accurate to avoid
misdiagnosis.
o Data privacy concerns arise when handling patient information.

3. Explain in detail the types of machine learning algorithms.
Or: Analyse the supervised learning and unsupervised learning algorithms.
Machine learning (ML) algorithms are categorized into different types based on
how they learn from data. The main types are:
1. Supervised Learning
2. Unsupervised Learning
3. Semi-Supervised Learning
4. Reinforcement Learning

Supervised Learning
 Supervised learning is a process of providing input data as well as
correct output data to the machine learning model. The aim of a
supervised learning algorithm is to find a mapping function to map
the input variable(x) with the output variable(y).
 In the real world, supervised learning can be used for risk assessment, image classification, fraud detection, spam filtering, etc.

How Supervised Learning Works?


 In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on test data (a portion of the dataset held out from training), and then it predicts the output.

The working of supervised learning can be easily understood by the example below:

Suppose we have a dataset of different types of shapes which includes


square, rectangle, triangle, and Polygon. Now the first step is that we need to
train the model for each shape.

o If the given shape has four sides, and all the sides are equal, then it
will be labelled as a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides then it will be labelled
as hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the shape.

The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of its number of sides and predicts the output.
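The shape example above can be sketched in code. The following is a minimal illustration (not part of the original text): each shape is encoded by hand-made features, here the number of sides and whether all sides are equal, and a decision tree is trained on the labelled examples.

```python
# Sketch: supervised learning on the shape example.
# Features: [number_of_sides, all_sides_equal (1/0)] -- an illustrative encoding.
from sklearn.tree import DecisionTreeClassifier

X = [[4, 1], [4, 0], [3, 0], [6, 1]]       # square, rectangle, triangle, hexagon
y = ["square", "rectangle", "triangle", "hexagon"]

model = DecisionTreeClassifier().fit(X, y)  # training phase

# Test phase: a new, unseen shape with 3 unequal sides
print(model.predict([[3, 0]]))              # -> ['triangle']
```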

Steps Involved in Supervised Learning:

o First, determine the type of training dataset.
o Collect/gather the labelled training data.
o Split the dataset into a training set, a test set, and a validation set.
o Determine the input features of the training dataset, which should carry enough information for the model to accurately predict the output.
o Determine a suitable algorithm for the model, such as a support vector machine, decision tree, etc.
o Execute the algorithm on the training dataset. Sometimes we need a validation set to tune control parameters; this is a subset of the training data.
o Evaluate the accuracy of the model using the test set. If the model predicts the correct outputs, the model is accurate.

Types of supervised Machine learning Algorithms:


Supervised learning can be further divided into two types of problems:
1. Regression

Regression algorithms are used when there is a relationship between the input variable and the output variable. They are used for the prediction of continuous variables, such as weather forecasting, market trends, etc. Below are some popular regression algorithms which come under supervised learning:

o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification

Classification algorithms are used when the output variable is categorical, meaning there are discrete classes such as Yes/No, Male/Female, True/False. A common example is spam filtering. Popular classification algorithms include:

o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines

Advantages of Supervised learning:

o With the help of supervised learning, the model can predict the output
on the basis of prior experiences.
o In supervised learning, we can have an exact idea about the classes of
objects.
o Supervised learning model helps us to solve various real-world
problems such as fraud detection, spam filtering, etc.

Disadvantages of supervised learning:

o Supervised learning models are not suitable for handling very complex tasks.
o Supervised learning cannot predict the correct output if the test data is very different from the training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.
Unsupervised Learning
 Unsupervised learning is a type of machine learning in which models
are trained using unlabeled dataset and are allowed to act on that data
without any supervision.
 Unsupervised learning cannot be directly applied to a regression or
classification problem because unlike supervised learning, we have
the input data but no corresponding output data. The goal of
unsupervised learning is to find the underlying structure of dataset,
group that data according to similarities, and represent that
dataset in a compressed format.
 Example: Suppose the unsupervised learning algorithm is given an
input dataset containing images of different types of cats and dogs.
The algorithm is never trained upon the given dataset, which means it
does not have any idea about the features of the dataset. The task of
the unsupervised learning algorithm is to identify the image features
on their own. Unsupervised learning algorithm will perform this task
by clustering the image dataset into the groups according to
similarities between images.

Why use Unsupervised Learning?


Below are some main reasons which describe the importance of
Unsupervised Learning:

o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is similar to how a human learns to think from their own experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes it all the more important.
o In the real world, we do not always have input data with corresponding outputs; unsupervised learning is needed to solve such cases.

Working of Unsupervised Learning


Here, we take unlabeled input data, meaning it is not categorized and no corresponding outputs are given. This unlabeled input data is fed to the machine learning model in order to train it. The model first interprets the raw data to find hidden patterns and then applies a suitable algorithm such as k-means clustering, hierarchical clustering, etc.

Once a suitable algorithm is applied, it divides the data objects into groups according to the similarities and differences between the objects.
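A minimal sketch of this grouping step, assuming scikit-learn's k-means on made-up 2-D points (the data and the choice of two clusters are illustrative):

```python
# Sketch: clustering unlabeled data with k-means.
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled input data: two loose groups of 2-D points (made up).
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # group assignment for each point
print(kmeans.cluster_centers_)  # learned group centers
```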

Types of Unsupervised Learning Algorithm:


The unsupervised learning algorithm can be further categorized into two
types of problems:
o Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between data objects and categorizes them according to the presence and absence of those commonalities.
o Association: An association rule is an unsupervised learning method used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategies more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of association rules is Market Basket Analysis.

Unsupervised Learning algorithms:


Below is the list of some popular unsupervised learning algorithms:

o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition

Advantages of Unsupervised Learning

o Unsupervised learning can be used for more complex tasks than supervised learning, because it does not require labeled input data.
o Unsupervised learning is often preferable because unlabeled data is easier to obtain than labeled data.

Disadvantages of Unsupervised Learning

o Unsupervised learning is intrinsically more difficult than supervised


learning as it does not have corresponding output.
o The result of the unsupervised learning algorithm might be less
accurate as input data is not labeled, and algorithms do not know the
exact output in advance.
Semi-Supervised Learning
 Semi-Supervised learning is a type of Machine Learning algorithm
that represents the intermediate ground between Supervised and
Unsupervised learning algorithms. It uses the combination of labeled
and unlabeled datasets during the training period.
 The basic disadvantage of supervised learning is that it requires hand-labeling by ML specialists or data scientists, which is costly; unsupervised learning, in turn, has a limited spectrum of applications. To overcome these drawbacks, the concept of semi-supervised learning was introduced. In this approach, the training data is a combination of labeled and unlabeled data, with only a very small amount of labeled data and a huge amount of unlabeled data. This matters because labeled data is comparatively much more expensive to acquire than unlabeled data. Initially, similar data points are grouped using an unsupervised learning algorithm, and this grouping then helps turn the unlabeled data into labeled data.
 We can picture these algorithms with an example. Supervised learning is like a student under the supervision of an instructor at home and at college. If that student analyzes the same concept on their own without any help from the instructor, it becomes unsupervised learning. Under semi-supervised learning, the student revises the concept on their own after first studying it under the guidance of an instructor at college.

Assumptions followed by Semi-Supervised Learning


To work with the unlabeled dataset, there must be a relationship between the
objects. To understand this, semi-supervised learning uses any of the
following assumptions:

o Continuity assumption: Objects near each other tend to share the same group or label. This assumption is also used in supervised learning, where datasets are separated by decision boundaries. In semi-supervised learning it is combined with the smoothness assumption: decision boundaries should lie in low-density regions.
o Cluster assumption: The data divide into discrete clusters, and points in the same cluster share the same output label.
o Manifold assumption: The data lie approximately on a manifold of much lower dimension than the input space, which lets us use distances and densities defined on that manifold. High-dimensional data are often generated by a process with few degrees of freedom that is hard to model directly; the assumption is practical in exactly those cases.

Working of Semi-Supervised Learning


Semi-supervised learning uses pseudo-labeling to train the model with less labeled training data than supervised learning. The process can combine various neural network models and training schemes. The working of semi-supervised learning can be summarized as follows:

o First, the model is trained on the small amount of labeled data, just as in supervised learning, until it gives accurate results.
o Next, the algorithm applies the model to the unlabeled dataset to generate pseudo labels; these may not yet be accurate.
o The labels from the labeled training data and the pseudo labels are then linked together.
o The input data from the labeled and unlabeled training sets are likewise combined.
o Finally, the model is trained again on the combined input, as in the first step. This reduces errors and improves the accuracy of the model.
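One hedged way to realize this pseudo-labeling loop is scikit-learn's SelfTrainingClassifier; in the sketch below most labels are hidden (marked -1, scikit-learn's convention for "unlabeled") and the wrapper pseudo-labels them. The dataset and confidence threshold are illustrative choices.

```python
# Sketch: semi-supervised learning via pseudo labels (self-training).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(42)
y_partial = y.copy()
mask = rng.rand(len(y)) < 0.8   # hide 80% of the labels
y_partial[mask] = -1            # -1 marks "unlabeled"

base = LogisticRegression(max_iter=1000)
model = SelfTrainingClassifier(base, threshold=0.9).fit(X, y_partial)
print(model.score(X, y))        # accuracy against the full labels
```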
Reinforcement Learning (RL)
 Reinforcement Learning (RL) is a branch of machine learning that
focuses on how agents can learn to make decisions through trial and
error to maximize cumulative rewards. RL allows machines to learn
by interacting with an environment and receiving feedback based on
their actions. This feedback comes in the form of rewards or
penalties.
 Reinforcement Learning revolves around the idea that an agent (the
learner or decision-maker) interacts with an environment to achieve a
goal. The agent performs actions and receives feedback to optimize its
decision-making over time.
Agent: The decision-maker that performs actions.
Environment: The world or system in which the agent operates.
State: The situation or condition the agent is currently in.
Action: The possible moves or decisions the agent can make.
Reward: The feedback or result from the environment based on the
agent’s action.
How Reinforcement Learning Works?

The RL process involves an agent performing actions in an environment,


receiving rewards or penalties based on those actions, and adjusting its
behavior accordingly. This loop helps the agent improve its decision-making
over time to maximize the cumulative reward.
Here’s a breakdown of RL components:
 Policy: A strategy that the agent uses to determine the next action based
on the current state.
 Reward Function: A function that provides feedback on the actions
taken, guiding the agent towards its goal.
 Value Function: Estimates the future cumulative rewards the agent will
receive from a given state.
 Model of the Environment: A representation of the environment that
predicts future states and rewards, aiding in planning.
Reinforcement Learning Example: Navigating a Maze
Imagine a robot navigating a maze to reach a diamond while avoiding fire
hazards. The goal is to find the optimal path with the least number of
hazards while maximizing the reward:
 Each time the robot moves correctly, it receives a reward.
 If the robot takes the wrong path, it loses points.
The robot learns by exploring different paths in the maze. By trying various
moves, it evaluates the rewards and penalties for each path. Over time, the
robot determines the best route by selecting the actions that lead to the
highest cumulative reward.

The robot’s learning process can be summarized as follows:


1. Exploration: The robot starts by exploring all possible paths in the maze,
taking different actions at each step (e.g., move left, right, up, or down).
2. Feedback: After each move, the robot receives feedback from the
environment:
 A positive reward for moving closer to the diamond.
 A penalty for moving into a fire hazard.
3. Adjusting Behavior: Based on this feedback, the robot adjusts its
behavior to maximize the cumulative reward, favoring paths that avoid
hazards and bring it closer to the diamond.
4. Optimal Path: Eventually, the robot discovers the optimal path with the
least number of hazards and the highest reward by selecting the right
actions based on past experiences.
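A compact sketch of this loop, assuming a simplified one-dimensional "maze" (a corridor of states 0–4 with the diamond at state 4) and tabular Q-learning; the rewards and hyperparameters are made-up values for illustration:

```python
# Sketch: tabular Q-learning on a tiny corridor "maze".
# States 0..4; action 0 = move left, 1 = move right; diamond at state 4.
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.5, 0.9, 0.2   # learning rate, discount, exploration

rng = np.random.RandomState(0)
for episode in range(200):
    s = 0
    while s != 4:                        # until the diamond is reached
        # Explore randomly with probability epsilon, otherwise act greedily.
        a = rng.randint(2) if rng.rand() < epsilon else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else min(4, s + 1)
        r = 10 if s_next == 4 else -1    # reward at the diamond, penalty per step
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))  # learned policy: 1 (move right) in states 0..3
```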

Types of Reinforcements in RL
1. Positive Reinforcement
Positive reinforcement occurs when an event, triggered by a particular behavior, increases the strength and frequency of that behavior. In other words, it has a positive effect on behavior.
 Advantages: Maximizes performance; helps sustain change over time.
 Disadvantages: Overuse can lead to an overload of states that may reduce effectiveness.
2. Negative Reinforcement
Negative reinforcement is the strengthening of a behavior because a negative condition is stopped or avoided.
 Advantages: Increases behavior frequency; ensures a minimum performance standard.
 Disadvantages: It may only encourage just enough action to avoid penalties.

Applications of Reinforcement Learning

1. Robotics – Automates tasks in manufacturing, optimizing movements


for efficiency.
2. Game Playing – Develops advanced strategies for chess, Go, and
video games, often surpassing human players.
3. Industrial Control – Optimizes real-time industrial operations like
oil refining.
4. Personalized Training – Adapts instructional content to individual
learning patterns.

Advantages of Reinforcement Learning

 Solves complex problems beyond conventional techniques.


 Continuously learns and corrects errors.
 Interacts with environments for adaptive learning.
 Handles uncertain and dynamic environments.

Disadvantages of Reinforcement Learning

 Overkill for simple tasks.


 Requires high computational resources.
 Performance depends on well-designed reward functions.
 Difficult to debug and interpret agent decisions.
4. Explain in detail the testing of machine learning algorithms.
 Testing machine learning (ML) algorithms is crucial to
ensure they perform well on unseen data. This process
involves evaluating model accuracy, generalization, and
robustness. Below, we discuss different aspects of testing
ML models in detail.

1. Importance of Testing ML Algorithms


Testing ML algorithms ensures that models:
 Generalize well to new data.
 Avoid overfitting (memorizing training data instead of learning patterns).
 Are fair and unbiased when making predictions.
 Perform efficiently in real-world applications.

2. Steps in Testing ML Algorithms


Step 1: Splitting the Dataset
Before testing, the dataset is split into three parts:
 Training Set (70-80%) – Used to train the model.
 Validation Set (10-15%) – Used to fine-tune the model’s
hyperparameters.
 Test Set (10-15%) – Used to evaluate the final model’s performance.
Why split the data?
 It prevents data leakage, where the model learns from the test data before evaluation.
 It ensures the model is tested on unseen data for a fair evaluation.
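A minimal sketch of this split with scikit-learn, using synthetic data and the proportions from the text (70/15/15):

```python
# Sketch: splitting data into train / validation / test sets (70/15/15).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve out 70% for training, then split the remainder in half.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```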

Step 2: Performance Metrics for Evaluation


Different ML tasks require different evaluation metrics.
A. Classification Metrics (For Categorical Output)
Used when predicting classes, such as spam detection (Spam/Not Spam).
1. Accuracy – Percentage of correct predictions.
o Issue: Not reliable for imbalanced datasets (e.g., detecting rare diseases).
2. ROC-AUC (Receiver Operating Characteristic – Area Under Curve)
o Measures how well a model differentiates between classes.
o Higher AUC (closer to 1) = better classification performance.
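A sketch computing the two metrics above with scikit-learn on a synthetic binary task; the logistic regression model is an illustrative choice:

```python
# Sketch: evaluating a classifier with accuracy and ROC-AUC.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)              # hard class labels
y_prob = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

print("Accuracy:", accuracy_score(y_te, y_pred))
print("ROC-AUC :", roc_auc_score(y_te, y_prob))
```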
Step 3: Overfitting & Underfitting Detection
 Overfitting: The model memorizes training data but fails on new data.
Fix: Use regularization, dropout, or more diverse data.
 Underfitting: The model is too simple and fails to learn patterns.
Fix: Use more complex models, more features, or longer training.

Step 4: Cross-Validation Testing


To make sure the model works well on different data splits, use:
1. K-Fold Cross-Validation:
o Splits data into K subsets and tests the model K times.
o Reduces bias from a single train-test split.
2. Leave-One-Out Cross-Validation (LOOCV):
o Trains on all data except one instance, which is used for testing.
o Works well for small datasets but is computationally expensive.
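A minimal K-fold sketch (K = 5) with scikit-learn; the decision tree and the Iris dataset are illustrative choices:

```python
# Sketch: 5-fold cross-validation to reduce single-split bias.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores)          # one accuracy score per fold
print(scores.mean())   # average performance across the 5 folds
```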

3. Advanced Testing Techniques


A. A/B Testing
 Compares two models in a real-world environment to see which performs
better.
 Example: Facebook tests different AI algorithms for user
recommendations.
B. Robustness Testing
 Tests how the model performs on noisy or adversarial data.
 Example: Image classification models are tested with blurred images.
C. Stress Testing
 Evaluates model behavior under extreme conditions.
 Example: Testing self-driving car AI in unusual lighting or weather.

4. Debugging ML Models
If the model performs poorly, check:
✔ Data Quality – Are there missing values or incorrect labels?
✔ Feature Engineering – Are the right features being used?
✔ Hyperparameters – Are the model settings optimal?
✔ Bias & Fairness – Is the model treating all data groups fairly?

5. Automating ML Testing
ML testing can be automated using tools like:
 Scikit-learn (Python) – Provides built-in evaluation metrics.
 TensorFlow & PyTorch – For testing deep learning models.
 MLflow – Tracks experiments and model performance.
 Google Cloud AI & AWS SageMaker – For large-scale ML testing.

5. Evaluate the important statistical concepts of machine learning in detail.

Machine Learning Statistics: In the field of machine learning (ML),


statistics plays a pivotal role in extracting meaningful insights from data to
make informed decisions. Statistics provides the foundation upon which
various ML algorithms are built, enabling the analysis, interpretation, and
prediction of complex patterns within datasets.
Applications of Statistics in Machine Learning
Statistics is a key component of machine learning, with broad applicability
in various fields.
 Feature engineering relies heavily on statistics to convert geometric
features into meaningful predictors for machine learning algorithms.
 In image processing tasks like object recognition and segmentation,
statistics accurately reflect the shape and structure of objects in images.
 Anomaly detection and quality control benefit from statistics by
identifying deviations from norms, aiding in the detection of defects in
manufacturing processes.
 Environmental observation and geospatial mapping leverage statistical
analysis to monitor land cover patterns and ecological trends
effectively.
Overall, statistics plays a crucial role in machine learning, driving insights
and advancements across diverse industries and applications.

Types of Statistics
There are commonly two types of statistics, which are discussed below:
 Descriptive Statistics: "Descriptive Statistics" helps us simplify and
organize big chunks of data. This makes large amounts of data easier to
understand.
 Inferential Statistics: "Inferential Statistics" is a little different. It uses
smaller data to draw conclusions about a larger group. It helps us
predict and draw conclusions about a population.
Descriptive Statistics
Descriptive statistics summarize and describe the features of a dataset,
providing a foundation for further statistical analysis.

Mean, Median, Mode

 Mean: Calculated by summing all values present in the sample and dividing by the total number of values:
Mean(μ) = (Sum of Values) / (Number of Values)
 Median: The middle value of a sample when arranged from lowest to highest (the data must be sorted first).
For an odd number of data points: Median = ((n + 1)/2)-th value.
For an even number of data points: Median = average of the (n/2)-th value and the next value.
 Mode: The most frequently occurring value in the dataset.

Measures of Dispersion
 Range: The difference between the maximum and minimum values.
 Variance: The average squared deviation from the mean, representing
data spread.
 Standard Deviation: The square root of variance, indicating data
spread relative to the mean.
 Interquartile Range: The range between the first and third quartiles,
measuring data spread around the median.
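These measures can be computed directly; a minimal sketch with NumPy/SciPy on a made-up sample:

```python
# Sketch: descriptive statistics on a small made-up sample.
import numpy as np
from scipy import stats

x = np.array([2, 4, 4, 4, 5, 5, 7, 9])

print("Mean   :", x.mean())
print("Median :", np.median(x))
print("Mode   :", stats.mode(x, keepdims=False).mode)
print("Range  :", x.max() - x.min())
print("Var    :", x.var())      # average squared deviation from the mean
print("Std    :", x.std())      # square root of the variance
q1, q3 = np.percentile(x, [25, 75])
print("IQR    :", q3 - q1)      # spread around the median
```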
Measures of Shape
 Skewness: Indicates data asymmetry.
 Kurtosis: Measures the peakedness of the data distribution.

[Figure: Types of skewed data]


Covariance and Correlation

 Covariance measures the degree to which two variables change together:
Cov(X, Y) = Σ (Xᵢ − X̄)(Yᵢ − Ȳ) / n
 Correlation measures the strength and direction of the linear relationship between two variables. It is represented by the correlation coefficient, which ranges from −1 to 1: a positive correlation indicates a direct relationship, while a negative correlation implies an inverse relationship. Pearson's correlation coefficient is given by:
ρ(X, Y) = Cov(X, Y) / (σX σY)
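A sketch computing both quantities from the formulas above with NumPy on made-up data:

```python
# Sketch: covariance and Pearson correlation from the formulas above.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

cov = ((x - x.mean()) * (y - y.mean())).mean()  # population covariance
rho = cov / (x.std() * y.std())                 # Pearson correlation

print(cov, rho)
print(np.corrcoef(x, y)[0, 1])                  # NumPy's built-in check
```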

Visualization Techniques

 Histograms – Show data distribution.


 Box Plots – Highlight data spread and outliers.
 Scatter Plots – Illustrate variable relationships.

Probability Theory

 Random Variables – Variables with random outcomes.


 Probability Distributions – Describe likelihoods of outcomes.
o Binomial – Fixed trials with success/failure.
o Poisson – Counts events in a fixed interval.
o Normal – Symmetric continuous distribution.
 Law of Large Numbers – Sample mean approaches population mean
as size grows.
 Central Limit Theorem – Sample means approximate a normal
distribution with large samples.

Inferential Statistics

 Population & Sample – Entire group vs. subset for analysis.


 Estimation –
o Point – Single value estimate.
o Interval – Confidence interval range.
 Hypothesis Testing –
o Null vs. Alternative Hypothesis – No effect vs. significant
effect.
o Errors – Type I (false positive), Type II (false negative).
o p-Values – Measure significance.
o t-Tests & z-Tests – Compare means.
 ANOVA – Tests mean differences across groups.
 Chi-Square Tests – Evaluate categorical relationships.
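As one concrete instance of the hypothesis-testing bullets above, a sketch of a two-sample t-test with SciPy; the two samples are made up:

```python
# Sketch: two-sample t-test (null hypothesis: the two group means are equal).
from scipy import stats

group_a = [5.1, 5.5, 4.9, 5.3, 5.0, 5.2]
group_b = [5.8, 6.1, 5.9, 6.3, 6.0, 5.7]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)
# A small p-value (e.g., < 0.05) rejects the null hypothesis of equal means.
```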
Correlation & Regression

 Correlation – Measures variable relationships.


o Pearson – Linear relationships.
o Spearman – Monotonic relationships.
 Regression Analysis –
o Simple & Multiple Linear Regression – Predict relationships.
o Assumptions – Linearity, independence, normality.
o Evaluation Metrics – R², Adjusted R², RMSE.

Bayesian Statistics
Bayesian statistics incorporate prior knowledge with current evidence to
update beliefs.
Bayes' Theorem is a fundamental concept in probability theory that relates
conditional probabilities. It is named after the Reverend Thomas Bayes,
who first introduced the theorem. Bayes' Theorem is a mathematical
formula that provides a way to update probabilities based on new evidence.
The formula is as follows:
P(A|B) = P(B|A) · P(A) / P(B), where
 P(A∣B): The probability of event A given that event B has occurred
(posterior probability).
 P(B∣A): The probability of event B given that event A has occurred
(likelihood).
 P(A): The probability of event A occurring (prior probability).
 P(B): The probability of event B occurring.
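A small worked example of the theorem, with made-up numbers for a spam filter where A = "email is spam" and B = "email contains the word free":

```python
# Worked Bayes' theorem example with illustrative numbers.
p_spam = 0.2              # P(A): prior probability an email is spam
p_free_given_spam = 0.6   # P(B|A): "free" appears in spam
p_free_given_ham = 0.05   # "free" appears in non-spam
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)  # P(B)

p_spam_given_free = p_free_given_spam * p_spam / p_free  # P(A|B)
print(round(p_spam_given_free, 3))  # 0.75: the posterior after seeing "free"
```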
6. Explain in detail turning data into probabilities and the bias-variance tradeoff.

B. Probability Distributions in ML
 Discrete Distributions (for categorical data):
o Bernoulli Distribution: Binary outcomes (e.g., coin flip).

o Binomial Distribution: Multiple Bernoulli trials (e.g., number of


heads in 10 flips).
o Poisson Distribution: Rare event modeling (e.g., number of
customer complaints per hour).
 Continuous Distributions (for real-valued data):
o Normal (Gaussian) Distribution: Used in natural datasets like
height, weight.
o Exponential Distribution: Time between events (e.g., time until
next earthquake).
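A sketch drawing samples from the distributions listed above with NumPy; all parameters are illustrative:

```python
# Sketch: drawing samples from common distributions.
import numpy as np

rng = np.random.default_rng(0)

coin_flips = rng.binomial(n=1, p=0.5, size=10)    # Bernoulli: single trials
heads_in_10 = rng.binomial(n=10, p=0.5, size=5)   # Binomial: 10 flips at a time
complaints = rng.poisson(lam=3, size=5)           # Poisson: events per hour
heights = rng.normal(loc=170, scale=10, size=5)   # Normal: e.g., height in cm
wait_times = rng.exponential(scale=2.0, size=5)   # Exponential: time between events

print(coin_flips, heads_in_10, complaints, heights, wait_times, sep="\n")
```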
C. Probabilistic ML Models
 Naïve Bayes Classifier: Uses Bayes’ theorem to classify text, images, etc.
 Logistic Regression: Models probability of an event (e.g., predicting
customer churn).
 Hidden Markov Models (HMMs): Used in speech recognition.
What is Bias?
Bias is the difference between the values predicted by the machine learning model and the correct values. High bias gives a large error on training as well as testing data. It is recommended that an algorithm be low-bias, to avoid the problem of underfitting. With high bias, the predicted fit is a straight line that does not fit the data in the dataset accurately. Such fitting is known as underfitting. It happens when the hypothesis is too simple or linear in nature.

[Figure: High bias in the model – the hypothesis is an underfit straight line.]

What is Variance?
The variability of model prediction for a given data point, which tells us the spread of our predictions, is called the variance of the model. A model with high variance fits the training data in a very complex way and thus is not able to fit accurately data it has not seen before. As a result, such models perform very well on training data but have high error rates on test data. When a model has high variance, it is said to overfit the data: it fits the training set accurately via a complex curve and a high-order hypothesis, but this is not a solution, as the error on unseen data is high. While training a model, variance should be kept low.

[Figure: High variance in the model – the hypothesis is an overfit, highly complex curve.]
Bias Variance Tradeoff
If the algorithm is too simple (a hypothesis with a linear equation), it may have high bias and low variance and thus be error-prone. If the algorithm fits too complexly (a hypothesis with a high-degree equation), it may have high variance and low bias; in that case it will not perform well on new entries. There is something between these two conditions, known as the Bias-Variance Tradeoff. This tradeoff in complexity is why there is a tradeoff between bias and variance: an algorithm cannot be more complex and less complex at the same time.

We try to optimize the total error of the model by using the bias-variance tradeoff. The best fit is given by the hypothesis at the tradeoff point, where the error-versus-complexity curve reaches its minimum.

[Figure: Error vs. model complexity – the region of least total error.]

This point is the best choice for training the algorithm, giving low error on training as well as testing data.
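The tradeoff can be made visible by fitting polynomials of increasing degree to noisy data: a low degree underfits (high bias) and a very high degree overfits (high variance). The data, degrees, and noise level below are illustrative assumptions.

```python
# Sketch: bias-variance tradeoff via polynomial degree.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n=30):
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data()
x_test, y_test = make_data()   # unseen data from the same process

for degree in (1, 3, 15):
    coeffs = np.polyfit(x_train, y_train, degree)  # may warn at high degree
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

# Typical outcome: degree 1 has high error on both sets (underfitting, high
# bias); degree 15 has low train error but much higher test error
# (overfitting, high variance); degree 3 sits near the tradeoff point.
```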

7. Design machine learning stages and workflow in detail with an example application.
Machine Learning Lifecycle & Example Application
The Machine Learning Lifecycle is a structured approach to developing,
deploying, and maintaining ML models. It consists of multiple stages that ensure
the model’s accuracy, efficiency, and long-term usability. By following this
lifecycle, businesses and researchers can solve complex problems, gain data-
driven insights, and create scalable, sustainable models.

Steps in the Machine Learning Lifecycle


1. Problem Definition
The first step is to clearly define the business problem and its objectives. This
includes understanding the end goal, the type of predictions required, and the
impact of the solution.
✔ Collaborate with stakeholders to outline the problem.
✔ Define key objectives, expected outcomes, and constraints.
✔ Identify whether the problem is classification, regression, clustering, etc.
Example: Predict customer churn for a telecom company to reduce customer loss.
2. Data Collection
Gathering relevant data is essential, as the quality of data directly affects model
performance.
✔ Ensure data relevance, diversity, and sufficiency.
✔ Sources include databases, APIs, surveys, and web scraping.
✔ Consider ethical data collection practices.
Example: Collect customer interaction history, billing information, service
complaints, and call duration records.
3. Data Cleaning & Preprocessing
Raw data often contains errors, missing values, and inconsistencies. Cleaning
ensures that the model learns from high-quality data.
✔ Handle missing values, outliers, and duplicates.
✔ Normalize numerical features and encode categorical data.
✔ Convert unstructured data (e.g., text, images) into usable formats.
Example: Fill missing contract details, remove invalid customer entries, and
normalize monthly bill amounts.
4. Exploratory Data Analysis (EDA)
EDA helps in understanding the structure of data and identifying key patterns and
relationships.
✔ Use statistical summaries & visualizations to find trends.
✔ Identify correlations between features.
✔ Detect potential biases or imbalances in the dataset.
Example: Analyze customer service call frequency and its impact on churn rates.
5. Feature Engineering & Selection
This step focuses on refining or creating new features that improve model
performance.
✔ Select only the most relevant features to reduce complexity.
✔ Engineer new meaningful features from existing data.
✔ Use techniques like Principal Component Analysis (PCA) for dimensionality
reduction.
Example: Create a new feature called "Average Call Duration per Month" to
capture customer behavior patterns.
6. Model Selection
Choosing the right model is critical to obtaining the best predictions. The choice
depends on the problem type and dataset characteristics.
✔ Consider logistic regression, decision trees, SVM, deep learning, etc.
✔ Evaluate models based on their interpretability, accuracy, and complexity.
✔ Compare different algorithms using validation techniques.
Example: Use Random Forest to classify customers as "Likely to Churn" or "Not
Likely to Churn" based on historical data.
7. Model Training
Once a model is selected, it is trained using the dataset. The model learns patterns
and relationships to make predictions.
✔ Use supervised learning (for labeled data) or unsupervised learning (for
unlabeled data).
✔ Split the data into training and validation sets.
✔ Adjust model parameters to improve learning efficiency.
Example: Train the model on past customer data to predict future churn
probabilities.
8. Model Evaluation & Tuning
The model is tested on unseen data to assess its performance and identify areas
for improvement.
✔ Use evaluation metrics like accuracy, precision, recall, and F1-score.
✔ Perform hyperparameter tuning using grid search or random search.
✔ Check for overfitting and underfitting using validation curves.
Example: If Random Forest performs poorly, adjust tree depth and the number of
estimators to enhance accuracy.
9. Model Deployment
After successful evaluation, the model is integrated into a real-world system
where it can make predictions on new data.
✔ Deploy the model as an API, web service, or within a business system.
✔ Ensure scalability and real-time prediction capability.
✔ Implement security measures to protect user data.
Example: Deploy the churn prediction model within a telecom CRM system to
flag high-risk customers and trigger retention offers.
10. Model Monitoring & Maintenance
Once deployed, the model’s performance should be continuously monitored to
maintain its accuracy.
✔ Regularly update the model with new data.
✔ Detect concept drift, where patterns in data change over time.
✔ Implement feedback loops to improve model robustness.
Example: Retrain the churn prediction model every six months with updated
customer data to reflect changing behavior.
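A condensed, hypothetical sketch of this workflow end to end; the column names, the churn rule used to generate stand-in labels, and the synthetic data are all assumptions for illustration:

```python
# Sketch: condensed churn-prediction workflow on synthetic stand-in data.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "monthly_bill": rng.normal(60, 20, n),
    "contract_months": rng.integers(1, 48, n),
    "complaints": rng.poisson(1.0, n),
    "avg_call_minutes": rng.normal(12, 4, n),  # engineered feature (made up)
})
# Made-up churn rule plus noise, standing in for real labels.
churn = ((df["complaints"] > 1) & (df["contract_months"] < 12)) | (rng.random(n) < 0.1)

X_tr, X_te, y_tr, y_te = train_test_split(df, churn, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, model.predict(X_te)))  # evaluation step
```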

Example Application: Customer Churn Prediction


 Problem: A telecom company wants to predict customer churn to improve retention strategies.
 Data: Customer service history, billing details, contract duration, complaints.
 Model: Random Forest classifier.
 Deployment: Integrated into the telecom’s CRM system.
 Outcome: Flags high-risk customers and suggests personalized retention offers.
