Question Bank - Student
Aspect           | Supervised Learning                       | Unsupervised Learning
Data Type        | Labeled data (input-output pairs)         | Unlabeled data (no specific output)
Objective        | Predict or classify output based on input | Find hidden patterns or structures in data
Training Process | Learns from labeled examples              | Learns from the data without explicit labels
Applications     | Classification, Regression                | Clustering, Association, Dimensionality reduction
1. Problem Definition: Define the problem by specifying the initial state, the goal state, and the
set of possible actions.
2. Search: Explore possible actions and states to find a sequence that leads to the goal state.
8. Identify some early trends observed in the field of machine learning.(April/May 2024)
Rule-Based Systems: Initial approaches were based on explicitly programmed rules, using if-
then logic to make decisions.
Statistical Methods: Early trends also focused on statistical models and pattern recognition,
including linear regression and clustering algorithms.
9. Compare linear and nonlinear machine learning algorithms. (April/May 2024)
Linear Algorithms: Assume a linear relationship between input features and output. Examples
include Linear Regression and Linear SVM. These are simpler, easier to interpret, and less prone to
overfitting.
Nonlinear Algorithms: Can model complex, nonlinear relationships. Examples include Decision
Trees and Neural Networks. These are more flexible but often computationally intensive and more
prone to overfitting.
10. What are the different types of techniques available to reduce the dimensionality?
(April/May 2023)
Principal Component Analysis (PCA)
Linear Discriminant Analysis (LDA)
t-Distributed Stochastic Neighbor Embedding (t-SNE)
Autoencoders
Feature Selection
11. Define Data preparation and its process.
After collecting the data, we need to prepare it for further steps. Data preparation is the step in
which we organize the collected data and make it suitable for machine learning training.
Data exploration:
It is used to understand the nature of the data that we have to work with. We need to understand the
characteristics, format, and quality of the data. A better understanding of the data leads to a more
effective outcome. In this step, we look for correlations, general trends, and outliers.
Data pre-processing:
1. Tabular Representation: Organizing data in rows and columns, suitable for structured data
such as spreadsheets.
2. Vector Representation: Representing data as vectors, commonly used for text data (e.g., TF-
IDF, word embeddings).
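As a rough illustration of vector representation, the sketch below builds TF-IDF vectors for a tiny
corpus in plain Python. The three documents, the function name tfidf_vectors, and the weighting used
(term frequency times log(N / document frequency)) are assumptions made for this example; libraries
implement several variants of the formula.

```python
import math

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = {w: sum(1 for tokens in tokenized if w in tokens) for w in vocab}
    vectors = []
    for tokens in tokenized:
        row = []
        for w in vocab:
            tf = tokens.count(w) / len(tokens)   # term frequency in this document
            idf = math.log(n_docs / df[w])       # rarer words get a larger weight
            row.append(tf * idf)
        vectors.append(row)
    return vocab, vectors

# Hypothetical toy corpus.
docs = ["the cat sat", "the dog sat", "the dog barked"]
vocab, vectors = tfidf_vectors(docs)
```

A word that occurs in every document (here "the") gets weight 0 under this variant, while a word
unique to one document gets the largest weight.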
Machine learning life cycle involves seven major steps, which are given below:
o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Deployment
18. Differences between Artificial Intelligence (AI) and Machine learning (ML):
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
"Time series analysis is a statistical technique dealing in time series data, or trend analysis."
A time series contains sequential data points recorded at successive time intervals. Time series
analysis comprises the methods that either seek to understand the underlying structure of those
data points or use them to make predictions.
PART B
1. Define machine learning. Discuss in detail about the types of learning. (Nov 2022/Nov 2023)
2. Dissect the challenges and techniques associated with handling high dimensional data in machine
learning. (April/May 2024)
3. Explain the following uninformed search strategies with examples.(Nov/Dec 2023)
4. i. Define Machine Learning. What are the different types of Machine Learning?
ii. Explain Linearity and Nonlinearity Techniques (April 2023/Dec 2022)
5. Examine various techniques for data representation in machine learning. (April/May 2024)
6. Explain the fundamental concepts of supervised learning and unsupervised learning. Illustrate the
workflow of Machine Learning process in detail. (April/May 2024)
7. Explain the process of turning data into probabilities in machine learning.(April/May 2024)
8. Explain the Machine learning life cycle techniques.
9. Create an integration of Machine Learning models into real-world applications. Provide
examples from various domains such as healthcare, finance, and transportation.
10. Discuss in detail the different types of data representation and Visualizations. (Nov/Dec
2023)
UNIT – II
SUPERVISED LEARNING
Learning a Class from Examples, Linear, Non-linear, Multi-class and Multi-label
classification, Decision Trees: ID3, Classification and Regression Trees,
Regression: Linear Regression, Multiple Linear Regression, Logistic Regression,
Bayesian Network, Bayesian Classifier
CART is a predictive algorithm used in machine learning that explains how the target
variable's values can be predicted from other variables. It is a decision tree in which each
fork is a split on a predictor variable and each end node holds a prediction for the target
variable.
The CART algorithm works via the following process:
Classification | Regression
In this problem statement, the target variables are discrete. | In this problem statement, the target variables are continuous.
The name Naïve Bayes combines two words, Naïve and Bayes, which can be
described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the
basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an
apple. Each feature individually contributes to identifying it as an apple, without
depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
● Homoscedasticity Assumption
● No autocorrelations
Regression is a statistical method used to model and analyze the relationship between a
dependent variable (also called the response or outcome) and one or more independent variables
(also called predictors or features). The goal of regression analysis is to understand how the
dependent variable changes when any of the independent variables are varied, and to predict the
dependent variable based on new data.
Types of Regression:
● Linear Regression
● Logistic Regression
● Ridge Regression
● Lasso Regression
● Overfitting.
● High Variance.
● Low Bias.
Supervised learning is a type of machine learning where the model is trained on a labeled
dataset. This means the algorithm learns from input-output pairs, where the input features are
associated with the correct output (label). The goal of supervised learning is for the model to
learn a mapping from inputs to outputs so that it can make predictions on new, unseen data.
Regression analysis helps in the prediction of a continuous variable. Many real-world scenarios
call for forecasts, such as weather conditions, sales, and marketing trends, and these need a
technique that can make accurate predictions. Regression analysis, a statistical method used in
machine learning and data science, serves this purpose. Below are some other reasons
for using Regression analysis:
1. Regression estimates the relationship between the target and the independent variable.
2. It is used to find the trends in data.
3. It helps to predict real/continuous values.
4. By performing the regression, we can confidently determine the most important factor, the
least important factor, and how each factor is affecting the other factors.
Where,
P(B|A) is the likelihood: the probability of the evidence B given that hypothesis A is
true.
o Naïve Bayes is a fast and easy ML algorithm for predicting the class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions compared to many other algorithms.
o It is a popular choice for text classification problems.
o Naïve Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.
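The following sketch applies Bayes' theorem, P(A|B) = P(B|A) * P(A) / P(B), to a single categorical
feature. The 14 records reproduce the Outlook/Play table that appears in Part B of this unit; the
function name posterior is an assumption made for illustration. With only one feature, the
independence assumption is not exercised, so this shows the Bayes step rather than a full Naïve Bayes
classifier.

```python
from collections import Counter

# (Outlook, Play) records, copied from the Outlook/Play table in this unit.
data = [
    ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
    ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
    ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
    ("Overcast", "Yes"), ("Overcast", "Yes"),
]

def posterior(outlook, play):
    """P(play | outlook) computed as likelihood * prior / evidence."""
    n = len(data)
    class_counts = Counter(p for _, p in data)
    outlook_counts = Counter(o for o, _ in data)
    joint = sum(1 for o, p in data if o == outlook and p == play)
    likelihood = joint / class_counts[play]    # P(outlook | play)
    prior = class_counts[play] / n             # P(play)
    evidence = outlook_counts[outlook] / n     # P(outlook)
    return likelihood * prior / evidence

p_yes = posterior("Sunny", "Yes")
p_no = posterior("Sunny", "No")
prediction = "Yes" if p_yes > p_no else "No"
```

For a sunny day, the class with the larger posterior is chosen as the prediction.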
There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking while making a decision, so they are
easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
The complete process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where the nodes cannot be classified
further; such a final node is called a leaf node.
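Step-2 above can be sketched with information gain, the attribute selection measure used by ID3
(other measures, such as the Gini index, are also common). The toy rows, the "Windy" attribute, and
the function names are hypothetical and chosen only to keep the arithmetic easy to check by hand.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def best_attribute(rows, attributes, target):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical toy dataset.
rows = [
    {"Outlook": "Sunny", "Windy": "No", "Play": "No"},
    {"Outlook": "Sunny", "Windy": "Yes", "Play": "No"},
    {"Outlook": "Overcast", "Windy": "No", "Play": "Yes"},
    {"Outlook": "Overcast", "Windy": "Yes", "Play": "Yes"},
    {"Outlook": "Rainy", "Windy": "No", "Play": "Yes"},
    {"Outlook": "Rainy", "Windy": "Yes", "Play": "No"},
]
gain_outlook = information_gain(rows, "Outlook", "Play")
gain_windy = information_gain(rows, "Windy", "Play")
root = best_attribute(rows, ["Outlook", "Windy"], "Play")
```

The attribute returned by best_attribute becomes the node in Step-4, and the same procedure is then
repeated on each subset, as in Step-5.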
● Based on the best split points of each input in Step 1, the new “best” split point is
identified.
Polynomial Regression:
o Polynomial Regression is a type of regression that models a non-linear dataset using
a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the values
of x and the corresponding conditional values of y.
o When the data points in a dataset follow a non-linear pattern, linear regression will not
fit them well. Polynomial regression is needed to fit such data points.
o In Polynomial regression, the original features are transformed into polynomial features
of a given degree and then modeled using a linear model, so the data points are
fitted by a polynomial curve.
o Model the relationship between the two variables. Such as the relationship between
Income and expenditure, experience and Salary, etc.
o Forecasting new observations. Such as Weather forecasting according to temperature,
Revenue of a company according to the investments in a year, etc.
o When a dataset is arranged non-linearly, a linear model hardly covers any data
point. A curve, as produced by the Polynomial model, is suitable to cover most of the
data points.
Hence, if the datasets are arranged in a non-linear fashion, then we should use the Polynomial
Regression model instead of Simple Linear Regression.
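The idea of transforming features into polynomial terms and then fitting a linear model can be
sketched in plain Python as follows. This is a minimal sketch that solves the least-squares normal
equations with Gaussian elimination; the function names and the sample data (generated from
y = 2x^2 - 3x + 1) are assumptions for illustration, and in practice a numerical library would
normally be used.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (X^T X) w = X^T y."""
    n = degree + 1
    # Polynomial features: each x becomes [1, x, x^2, ..., x^degree].
    features = [[x ** p for p in range(n)] for x in xs]
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(f[i] * f[j] for f in features) for j in range(n)]
        + [sum(f[i] * y for f, y in zip(features, ys))]
        for i in range(n)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(aug[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (aug[i][n] - rest) / aug[i][i]
    return coeffs  # coeffs[p] multiplies x**p

def predict(coeffs, x):
    return sum(c * x ** p for p, c in enumerate(coeffs))

def sse(coeffs, xs, ys):
    """Sum of squared errors of the fitted curve on the data."""
    return sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys))

xs = [-2, -1, 0, 1, 2, 3]
ys = [2 * x * x - 3 * x + 1 for x in xs]   # non-linear sample data
line = fit_polynomial(xs, ys, degree=1)    # simple linear regression
quad = fit_polynomial(xs, ys, degree=2)    # polynomial regression
```

Comparing sse(line, xs, ys) with sse(quad, xs, ys) shows the straight line leaving a large error on
this data while the degree-2 fit recovers the generating curve.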
PART – B
1 What are the primary problems with decision trees, especially with regard to overfitting?
2 Explain in detail about Implementation of the Naïve Bayes algorithm with a suitable
example python program.
3 Explain the concepts of Classification and Regression Trees. Identify how they are used in
practice for both classification and regression tasks. (April/May 2024)
4 Develop Logistic Regression in detail with an example program.(April/May 2023)
7 Distinguish simple linear regression and multiple linear regression. How do you handle
multiple predictors in regression analysis? (April/May 2024)
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
10 Build ID3 and derive the procedure to construct a decision tree using ID3. (Nov/Dec 2023)
UNIT 3
ADVANCED SUPERVISED AND ENSEMBLE LEARNING
Ensemble Learning refers to a machine learning technique where multiple models are
combined to create a stronger predictive model. The goal is to improve the performance
and accuracy of the model by aggregating the predictions of several models. Popular
ensemble methods include:
Linearity: It can only solve linearly separable problems, meaning it struggles with data
that cannot be separated by a straight line or hyperplane.
Single-layer: The perceptron architecture is limited to a single layer of neurons, which
restricts its ability to model complex relationships.
Convergence issues: If the data is not linearly separable, the perceptron may not
converge to a solution.
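These limitations can be seen in a small experiment. The sketch below trains a single-layer
perceptron with the classic perceptron update rule on the AND function (linearly separable) and on
XOR (not linearly separable). The function name, learning rate, and epoch limit are assumptions
chosen for illustration.

```python
def train_perceptron(samples, epochs=50, lr=1.0):
    """Train a two-input perceptron; return (weights, bias, converged)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            output = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            delta = target - output
            if delta != 0:
                # Move the decision boundary toward the misclassified sample.
                w[0] += lr * delta * x1
                w[1] += lr * delta * x2
                b += lr * delta
                errors += 1
        if errors == 0:          # a full pass with no mistakes: converged
            return w, b, True
    return w, b, False

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

_, _, and_converged = train_perceptron(AND)
_, _, xor_converged = train_perceptron(XOR)
```

Training on AND stops after a few epochs, while training on XOR keeps making mistakes until the
epoch limit is reached, because no single straight line separates the XOR classes.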
● Linear Kernel
● Polynomial Kernel
● Sigmoid Kernel
8. Compare the differences between Linear and Non-Linear SVM.
A Linear SVM is a type of Support Vector Machine used when the data is linearly
separable. It works by finding a hyperplane that maximizes the margin between two
classes, ensuring that the distance from the hyperplane to the nearest data points (support
vectors) is as large as possible. The optimal hyperplane is determined by solving an
optimization problem that minimizes classification error while maximizing the margin.
Linear SVM is efficient and works well with linearly separable data.
Steps:
● Choose the number k.
● Calculate the distance between the query point and all training points
regression.
● Hard Voting: Uses the majority vote to decide the final class.
● Soft Voting: Averages the predicted probabilities and selects the class with the
highest average probability.
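The difference between the two schemes can be sketched with three hypothetical classifiers whose
predicted probabilities are made up for illustration. Two models lean slightly toward "ham" while
one is very confident about "spam", so hard and soft voting disagree.

```python
from collections import Counter

def hard_vote(predicted_labels):
    """Return the class predicted by the most models."""
    return Counter(predicted_labels).most_common(1)[0][0]

def soft_vote(probabilities):
    """Average the class probabilities and return (winning class, averages)."""
    classes = probabilities[0].keys()
    averages = {
        c: sum(p[c] for p in probabilities) / len(probabilities) for c in classes
    }
    return max(averages, key=averages.get), averages

# Hypothetical per-model probability outputs for one email.
probs = [
    {"spam": 0.45, "ham": 0.55},
    {"spam": 0.40, "ham": 0.60},
    {"spam": 0.95, "ham": 0.05},
]
labels = [max(p, key=p.get) for p in probs]   # each model's own predicted class
hard_result = hard_vote(labels)
soft_result, averages = soft_vote(probs)
```

Hard voting counts two votes for "ham" against one for "spam", whereas soft voting lets the
confident third model pull the average probability of "spam" above that of "ham".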
Advantages:
Less Sensitive to Noise: Bagging reduces variance without overemphasizing noisy data
points, unlike boosting, which can overfit on noise.
Stability: Bagging works well with high-variance models (e.g., decision trees), making
predictions more stable.
Steps:
The Voting scheme is an ensemble method where multiple models vote on the final
output. For classification, each model predicts a class, and the class with the majority
of votes is selected as the final prediction (majority voting). For regression, the
average of the outputs from all models is taken as the final prediction.
20. How does AdaBoost work to improve the performance of weak learners?
AdaBoost works by sequentially training weak learners, where each new learner
focuses on the errors made by the previous ones. It assigns higher weights to
misclassified data points, forcing the model to correct its mistakes. The predictions of all
learners are then combined, with each learner's influence determined by its accuracy.
AdaBoost improves the overall model by emphasizing difficult cases and refining the
weak learners’ predictions.
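One round of this reweighting can be sketched as below, following the classic discrete AdaBoost
formulas for binary classification. The function name and the five-sample example are assumptions
for illustration, and the sketch assumes the weak learner's weighted error lies strictly between 0
and 1.

```python
import math

def adaboost_round(weights, correct):
    """One AdaBoost reweighting step.

    weights: current sample weights (summing to 1)
    correct: True where the weak learner classified the sample correctly
    Returns (alpha, new_weights), where alpha is the learner's say in the final vote.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - error) / error)
    # Shrink the weights of correct samples, grow the weights of mistakes.
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.2] * 5                          # start with uniform weights
correct = [True, True, True, True, False]    # the learner misclassifies the last sample
alpha, new_weights = adaboost_round(weights, correct)
```

After the update, the single misclassified sample carries half of the total weight, so the next
weak learner is pushed to get it right.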
PART – B
4. Build the Machine Learning model to implement the Loan Status Prediction
using Support Vector Machine (SVM) Algorithm (use dataset name as
“Customer_details.csv”). (April/May 2023)
6. You are building a voting ensemble with three classifiers: Logistic Regression,
SVM, and Decision Tree. Discuss how model diversity impacts the
performance of voting-based ensembles. Justify your answer with examples.
8. Explain the concept of Stacking in ensemble learning. How does it differ from
other ensemble methods like bagging and boosting? Explain the process of
building a stacked model and discuss the advantages and challenges of stacking.
9. Compare and contrast AdaBoost and XGBoost. How do they differ in terms of
boosting mechanisms and performance?
10. Given a dataset with features such as age, income, and purchase history, apply
the Random Forest algorithm to predict whether a customer will buy a product
(binary classification: yes, or no). Outline the steps involved in applying the
Random Forest model and explain how you would evaluate its performance.
UNIT - IV
UNSUPERVISED LEARNING
PART A
1. What is Clustering?
Clustering is an unsupervised machine learning technique that groups similar data points into
clusters. Clustering scans unlabeled data and groups data points with similar features together.
Clustering can be used in many real-world applications, such as patient studies, marketing,
biomedical, and geospatial databases.
2. List out the applications of clustering algorithms. (Nov/Dec 2023)
1. Market segmentation.
2. Customer behaviour analysis.
3. Document categorization.
4. Image segmentation.
5. Anomaly detection in cybersecurity.
6. Genomics and bioinformatics.
7. Social network analysis.
8. Recommender systems.
K-mode Clustering is a variant of K-means clustering that is used for categorical data. In K-
means, centroids are defined as the mean of numerical values, but K-mode clustering uses
modes (most frequent values) to define the centroid for each cluster. The algorithm works
similarly to K-means:
3. Update the mode based on the most frequent categories in each cluster.
K-mode is used for clustering categorical attributes and is widely used in market segmentation
and customer data analysis.
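A minimal sketch of the K-mode loop is shown below. It assumes the simple matching (Hamming)
dissimilarity for the assignment step, which is the usual choice for categorical data; the records,
the initial modes, and the function names are hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def cluster_mode(records):
    """Attribute-wise most frequent value among the records of one cluster."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))

def assign(records, modes):
    """Assign each record to the cluster whose mode is closest."""
    return [min(range(len(modes)), key=lambda k: hamming(r, modes[k])) for r in records]

# Hypothetical categorical records: (color, size, shape).
records = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "oval"),
    ("blue", "small", "square"),
]
modes = [records[0], records[3]]   # initial modes picked by hand for this example
for _ in range(5):
    labels = assign(records, modes)
    modes = [
        cluster_mode([r for r, l in zip(records, labels) if l == k])
        for k in range(len(modes))
    ]
```

The loop alternates assignment and mode update, mirroring K-means with the mean replaced by the
mode and Euclidean distance replaced by a count of mismatched attributes.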
4. How does the K-means algorithm determine the optimal number of clusters?
Elbow Method: Plot the sum of squared errors (SSE) for different values of k. The point where
the SSE starts to level off (the "elbow") suggests the optimal k.
Silhouette Score: Measures how similar an object is to its own cluster compared to other
clusters. The optimal k maximizes the average silhouette score.
Gap Statistic: Compares the performance of clustering against random data to identify the
best k.
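The elbow method can be sketched as follows: run K-means for several values of k and record the SSE
each time. This is a bare-bones K-means in plain Python; the sample points, the evenly spaced
initial centroids, and the function names are assumptions for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    """Plain k-means; initial centroids are evenly spaced sample points (an assumption)."""
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids

def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

# Hypothetical data with two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
sse_by_k = {k: sse(points, kmeans(points, k)) for k in (1, 2, 3)}
```

On this data the SSE drops sharply from k = 1 to k = 2 and only slightly from k = 2 to k = 3, so
the elbow suggests k = 2.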
5. What is the Expectation Maximization algorithm used for?
The Expectation Maximization (EM) algorithm is used to estimate the parameters of a statistical
model when the observed data involve unobserved latent variables or missing values. It finds
maximum likelihood estimates of those parameters by iteratively performing "expectation" and
"maximization" steps based on the incomplete information available.
Divisive clustering:
o It is a top-down approach.
o Starts with a single cluster containing all data points and splits them iteratively.
Agglomerative clustering:
o It is a bottom-up approach.
o Starts with each data point as an individual cluster and merges them iteratively.
Gaussian Mixture Models represent data as a mixture of multiple Gaussian distributions, where
each Gaussian corresponds to a cluster. The model uses probabilistic measures to assign data
points to clusters based on their likelihood.
A dendrogram is a tree-like diagram that shows the hierarchical relationship between objects in
a hierarchical clustering algorithm. It's a network structure that's made up of a root node,
branches, and leaves. The main purpose of a dendrogram is to help determine how to best group
objects into clusters.
Selection Methods:
o Elbow Method: Analysing the variance explained as a function of K.
o Silhouette Score: Measuring the quality of clustering.
o Domain knowledge or trial-and-error.
13. When Will the Curse of Dimensionality Occur and How to Solve It?
● Occurrence: When data has too many dimensions, leading to sparse data and reduced
algorithm performance.
● Solutions:
o Dimensionality reduction techniques like PCA or LLE.
o Feature selection and engineering.
o Regularization methods.
The primary limitation of K-means clustering is its sensitivity to the initial selection of cluster
centroids, which can lead to suboptimal clustering results if not chosen carefully, and the
requirement to pre-define the number of clusters ("k") within the data, which can be challenging
to determine accurately in many cases.
Fuzzy Clustering is a type of clustering algorithm in machine learning that allows a data point to
belong to more than one cluster with different degrees of membership. Unlike traditional
clustering algorithms, such as k-means or hierarchical clustering, which assign each data point
to a single cluster, fuzzy clustering assigns a membership degree between 0 and 1 for each data
point for each cluster.
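The membership degrees can be illustrated with the standard fuzzy c-means membership formula,
u_j = 1 / sum_k (d_j / d_k)^(2 / (m - 1)), where d_j is the distance from the point to center j and
m > 1 is the fuzzifier. The cluster centers, the sample points, and the function name below are
assumptions for illustration; a full algorithm would also update the centers iteratively.

```python
def fuzzy_memberships(point, centers, m=2.0):
    """Membership degree of one point in each cluster (values sum to 1)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) ** 0.5 for c in centers]
    # A point lying exactly on a center belongs fully to that cluster.
    if 0.0 in dists:
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists]

# Hypothetical cluster centers.
centers = [(0.0, 0.0), (4.0, 0.0)]
near_first = fuzzy_memberships((1.0, 0.0), centers)   # closer to the first center
midpoint = fuzzy_memberships((2.0, 0.0), centers)     # equally far from both
```

A hard clustering method would place the first point entirely in the first cluster, whereas here it
receives a high but not exclusive membership, and the midpoint is shared equally.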
● Fuzzy Clustering:
● Hard Clustering:
While both GMM (Gaussian Mixture Model) and K-means clustering are unsupervised learning
algorithms used for grouping data, the key difference is that GMM assigns data points to
clusters probabilistically based on a mixture of Gaussian distributions, allowing for soft cluster
assignments and handling complex cluster shapes, whereas K-means uses a hard assignment
based on the nearest centroid, making it better for simple, spherical clusters.
Genetic modeling in clustering is a technique that uses genetic algorithms to find optimal
solutions for clustering problems. These algorithms are inspired by evolution and use
mathematics to implement the idea of survival of the fittest. They can search for a better
solution from many possible ones, and are less sensitive to the initial cluster centres.
● Data visualization:
● Feature extraction:
1. What is K-Mode clustering? Examine how it differs from K-means clustering and
give an example with details. (April/May 2024)
2. Examine the Hierarchical clustering algorithm and its types.
6. Analyze the steps in k-means algorithm. Cluster the following set of 4 objects into
two clusters using k-means A (3,5), B (4,5), C (1,3), D (2,4). Consider the objects A
and C as the initial cluster centers. (April/May 2023)
7. Illustrate Principal Component Analysis (PCA) method of dimensionality reduction
technique with suitable examples. (Nov/Dec 2023)
8. Explain in detail about the K-nearest neighbor algorithm using a given dataset.
6 7 Pass
7 8 Pass
5 5 Fail
8 8 Pass
9. Evaluate how genetic modeling techniques can be integrated with clustering
algorithms to optimize cluster assignments with examples. (April/May 2024)
10. Solve the multi-dimensional problem for the given network using Self-Organizing
Map.
X1:(1,0,1,0)
X2:(1,0,0,0)
X3:(1,1,1,1)
X4:(0,1,1,0)
UNIT V
APPLICATIONS OF MACHINE LEARNING
● Based on the accuracy of each model, we will use the algorithm with the highest
accuracy after testing all the models.
7. What is a Support Vector Machine?
Support Vector Machine (SVM) is a supervised learning algorithm used for classification and
regression problems. The main objective of SVM is to find a hyperplane in an N-dimensional space
(where N is the total number of features) that separates the data points of different classes. The
chosen hyperplane is the one that creates the maximum margin between the two classes.
8. What are Support Vectors in SVM?
Support Vectors are the data points that are nearest to the hyperplane. They influence the position
and orientation of the hyperplane, so removing the support vectors will alter the position of the
hyperplane. The support vectors help us build our support vector machine model.
20. What are the various ways that a fraudulent transaction can take place? [MAY 2023]
A fraudulent transaction can take place in various ways, such as through fake accounts, fake IDs,
and stealing money in the middle of a transaction.
21. Write any five Most popular Machine learning tools [MAY 2023]
PyTorch
TensorFlow
Colab
KNIME
Apache Mahout
22. Write some assumptions about the Genuine Emails.
1. Genuine emails are the ones that are sent to recipients to convey useful
information.
2. The recipient expects those emails or reads them to get new information.
23. Colab is supported under which platform?
Cloud services
24. Name the tool used for Data loading & Transformation and Data preprocessing &
visualization.
RapidMiner
27. How is diagnostic sensory data used in the medical diagnosis process? [MAY 2024]
The diagnostic sensory data can be given to a machine learning system, which analyzes
the signals and classifies the medical conditions into different predetermined types.
28. Name a few applications that use speech recognition technology to follow voice
instructions.
Google Assistant, Siri, Cortana, and Alexa
PART B
1. Identify the techniques used to improve the accuracy of email spam and malware detection
systems. [MAY 2024]
2. Choose the ethical considerations involved in deploying machine learning for online fraud
detection. [MAY 2024]
3. Explain the role of Machine Learning in Image recognition and Medical diagnosis. Explain
any one application with its implementation. [MAY 2023]
4. Build the Machine Learning model to implement Email Spam classification using Naïve
Bayes or support vector machines. [MAY 2023]
5. Discuss the application of machine learning in email spam and malware filtering.
[DEC 2022]
6. Write a program for online fraud detection. [DEC 2022]
7. Write the applications of machine learning in speech recognition, email Spam Malware
filtering, and online fraud detection. [DEC 2023]
PART C
1. Write short notes on precision and recall and explain the implementation program for Image
Recognition. [DEC 2023]
2. Describe how to evaluate machine learning models built for Speech Recognition.
3. Creating an email spam program for malicious purposes is unethical and illegal. However,
we can address email spam from the perspective of detecting and filtering spam emails,
which is a common problem in machine learning and cybersecurity. [MAY 2022]
4. Explain Online fraud detection in detail with a suitable program. [MAY 2024]
5. You are tasked with building a medical diagnosis system to assist healthcare providers in
identifying possible diseases based on a patient's symptoms. The system should predict
potential diseases and their likelihood based on a provided dataset containing:
● Symptoms reported by patients
● Patient demographic data (e.g., age, gender, weight, height)
● Laboratory test results (if available)
● Disease diagnoses corresponding to the symptoms and test results
The goal is to create an intelligent system that improves the efficiency of medical diagnoses,
reduces diagnostic errors, and aids medical professionals in decision-making.[MAY 2023]