Introduction To Machine Learning

The document provides answers to exam questions from the Winter-2024 and Summer-2025 'Introduction to Machine Learning' papers, covering topics such as human vs. machine learning, applications of machine learning, and examples of supervised, unsupervised, and reinforcement learning. It also discusses Python libraries like NumPy and Pandas, data preprocessing steps, performance evaluation methods, and the supervised learning process. Additionally, it includes code snippets for data visualization and manipulation using Matplotlib and NumPy.


Here are the answers to all questions from both the Winter-2024 and
Summer-2025 "Introduction to Machine Learning" exam papers.

❄️ Winter-2024 Exam (4350702)


Q.1: Fundamentals of Machine Learning
(a) Define human learning. How does human learning differ from machine learning? (3
Marks)
●​ Human Learning is the process of acquiring knowledge, skills, values, or behaviors
through experience, study, or being taught. It involves intuition, reasoning,
consciousness, and the ability to understand complex context from very few
examples.
●​ Difference from Machine Learning:
○​ Data Requirement: Humans can learn from a few examples, whereas
machines typically require large amounts of data.
○​ Process: Human learning is a cognitive and often abstract process. Machine
learning is a computational process based on statistical algorithms.
○​ Interpretation: Humans understand the "why" behind concepts. Machines
learn patterns and correlations without true understanding.

(b) Explain applications of machine learning. (4 Marks)

Machine learning is used in many fields. Key applications include:


●​ Recommendation Engines: Suggesting products on Amazon, movies on Netflix, or
music on Spotify based on your past behavior.
●​ Image & Speech Recognition: Unlocking your phone with your face (face
recognition), converting voice to text (Siri, Google Assistant), and identifying objects
in images.
●​ Spam Filtering: Automatically classifying emails as "spam" or "not spam."
●​ Medical Diagnosis: Analyzing medical images (like X-rays or MRIs) to detect
diseases like cancer at an early stage.
●​ Financial Services: Detecting fraudulent credit card transactions and predicting
stock market trends.
●​ Self-Driving Cars: Using ML algorithms to perceive the environment and navigate
without human input.

(c) Provide an example of an application for each type of machine learning


(supervised, unsupervised, and reinforcement learning). (7 Marks)
●​ Supervised Learning:
○​ Application: House Price Prediction.
○​ Explanation: The model is trained on a dataset of houses where the features
(e.g., size in square feet, number of bedrooms, location) and the correct
selling price (the label) are known. The goal is to predict the price of a new,
unseen house.
●​ Unsupervised Learning:
○​ Application: Customer Segmentation.
○​ Explanation: A retail company uses clustering algorithms on its customer
data (e.g., purchase history, browsing habits) to group similar customers
together. This helps in creating targeted marketing campaigns for different
segments without any prior labels of "customer types."
●​ Reinforcement Learning:
○​ Application: Training a Bot to Play a Game (e.g., Chess or Mario).
○​ Explanation: The bot (agent) learns by interacting with the game
(environment). It gets a reward for good moves (like capturing an opponent's
piece in chess or gaining points in Mario) and a penalty for bad moves (like
losing a piece or dying). Through trial and error, it learns a strategy (policy) to
maximize its total reward.

(c) OR: List three popular tools or technologies used in machine learning and explain
their significance. (7 Marks)
1.​ Scikit-learn:
○​ Description: A powerful and user-friendly Python library for general-purpose
machine learning.
○​ Significance: It is extremely popular for beginners and experts because it
provides a wide range of algorithms for classification, regression, and
clustering with a simple and consistent interface (.fit(), .predict()). It's the go-to
tool for most traditional ML tasks.
2.​ TensorFlow & PyTorch:
○​ Description: Open-source frameworks specifically designed for deep
learning and building neural networks.
○​ Significance: These are the industry standards for complex tasks like image
recognition, natural language processing (NLP), and AI research. They allow
developers to build, train, and deploy large-scale neural network models
efficiently, especially with GPU support.
3.​ Pandas:
○​ Description: A Python library for data manipulation and analysis.
○​ Significance: Machine learning is impossible without clean, well-structured
data. Pandas provides the fundamental data structure, the DataFrame, which
allows data scientists to easily load, clean, filter, transform, and analyze data
before feeding it into a machine learning model.

Q.2: Python Libraries for Machine Learning


(a) List and explain in brief commonly used Mathematical Functions in NumPy. (3
Marks)
●​ np.mean(): Calculates the average of the array elements.
●​ np.std(): Calculates the standard deviation, a measure of data spread.
●​ np.sum(): Calculates the sum of all elements.
●​ np.min() / np.max(): Find the minimum or maximum value in an array.
●​ np.sqrt(): Calculates the square root of each element.
●​ np.log(): Calculates the natural logarithm of each element.

(b) Create a bar plot using Matplotlib with the following data: x=['Rohit', 'Virat',
'Shikhar', 'Gill'], y=[45, 89, 13, 54]. Label the X-axis as "Player" and y-axis as "Score".
(4 Marks)

Python

import matplotlib.pyplot as plt

# Data
x = ['Rohit', 'Virat', 'Shikhar', 'Gill']
y = [45, 89, 13, 54]

# Create bar plot


plt.bar(x, y)

# Add labels
plt.xlabel("Player")
plt.ylabel("Score")

# Add a title
plt.title("Player Scores")

# Display the plot


plt.show()

(c) Write a NumPy program to implement the following operations: (7 Marks)

1) To find the maximum and minimum value of a given any single dimensional array

2) To compute the mean, standard deviation, and variance of a given array along the second
axis.

Python

import numpy as np

# --- Part 1 ---


print("--- Part 1: Max and Min in a 1D Array ---")
# Create a single dimensional array
arr_1d = np.array([15, 98, 34, 7, 56, 102, 4])

# Find maximum and minimum values


max_val = arr_1d.max()
min_val = arr_1d.min()

print(f"Array: {arr_1d}")
print(f"Maximum value: {max_val}")
print(f"Minimum value: {min_val}")
print("\n" + "="*40 + "\n")

# --- Part 2 ---


print("--- Part 2: Stats along the Second Axis (axis=1) ---")
# Create a 2D array
arr_2d = np.array([[10, 20, 30],
[40, 50, 60]])

# The second axis (axis=1) means we compute stats for each row
mean_val = np.mean(arr_2d, axis=1)
std_dev = np.std(arr_2d, axis=1)
variance = np.var(arr_2d, axis=1)

print(f"2D Array:\n{arr_2d}")
print(f"Mean along axis=1 (for each row): {mean_val}")
print(f"Standard Deviation along axis=1: {std_dev}")
print(f"Variance along axis=1: {variance}")

(OR Q.2 a) Create a NumPy array with values [9,8,7,6,5,4]. Access the third element of
the array. (3 Marks)

Python

import numpy as np

# Create the NumPy array


my_array = np.array([9, 8, 7, 6, 5, 4])

# Access the third element (index 2 because indexing starts from 0)


third_element = my_array[2]

print(f"The array is: {my_array}")


print(f"The third element is: {third_element}") # Output will be 7

(OR Q.2 b) Write and explain syntax of following operation in Pandas Data Frame: 1)
Remove Duplicate Rows 2) Clean Empty Cells (NaN values). (4 Marks)
1.​ Remove Duplicate Rows
○​ Syntax: df.drop_duplicates()
○​ Explanation: This method returns a new DataFrame with duplicate rows
removed. By default, it considers all columns to identify duplicates. It keeps
the first occurrence of a duplicated row.
2.​ Clean Empty Cells (NaN values)
○​ Syntax: df.dropna()
○​ Explanation: This method removes rows (or columns) that contain missing
values (NaN). By default, it drops any row that has at least one missing value.
You can use df.dropna(axis=1) to drop columns with missing values instead.
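
A minimal sketch combining both methods on a small made-up DataFrame (column names and values are invented for illustration):

Python

import pandas as pd
import numpy as np

# Made-up data with one duplicate row and one missing value
df = pd.DataFrame({
    'name': ['Asha', 'Asha', 'Ravi', 'Meena'],
    'score': [90, 90, np.nan, 75]
})

# 1. Remove duplicate rows (keeps the first occurrence by default)
df = df.drop_duplicates()

# 2. Drop rows containing any NaN value
df = df.dropna()

print(df)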

(OR Q.2 c) List and explain steps involved in building a model in scikit-learn? How
can you load a dataset in scikit-learn? (7 Marks)

Steps to Build a Model in Scikit-learn:


1.​ Choose a Model: Import the desired model class from the scikit-learn library. For
example, from sklearn.linear_model import LinearRegression.
2.​ Instantiate the Model: Create an instance of the model, where you can set
hyperparameters. For example, model = LinearRegression().
3.​ Prepare Data: Arrange your data into a features matrix X (the inputs) and a target
vector y (the desired output).
4.​ Fit the Model: Train the model on your data by calling the fit() method. For example,
model.fit(X, y). This is where the learning happens.
5.​ Make Predictions: Use the trained model to make predictions on new, unseen data
by calling the predict() method. For example, new_predictions =
model.predict(X_new).

How to Load a Dataset in Scikit-learn:

Scikit-learn comes with several small, built-in datasets for practice. You can load them using
specific functions from the sklearn.datasets module.

Python

# Example: Loading the Iris dataset


from sklearn.datasets import load_iris

# Load the dataset


iris = load_iris()

# The features and target are available as attributes


X = iris.data
y = iris.target

print("Iris dataset loaded successfully!")


print("Number of samples:", X.shape[0])
print("Number of features:", X.shape[1])

Q.3: Data Preprocessing and Model Evaluation
(a) Discuss the steps involved in reading a CSV file in Pandas. (3 Marks)
1.​ Import Pandas: First, you must import the pandas library, conventionally as pd.
○​ import pandas as pd
2.​ Use the read_csv() function: Pandas provides a dedicated function pd.read_csv()
to read data from a CSV file.
3.​ Provide the File Path: You pass the path to your CSV file as a string argument to
the function.
○​ df = pd.read_csv('path/to/your/file.csv')
4.​ Store in a DataFrame: The function returns the data as a pandas DataFrame, which
is a 2D labeled data structure, ready for analysis.

(b) Describe the purpose of dimensionality reduction technique in data


pre-processing. (4 Marks)

The main purpose of dimensionality reduction is to reduce the number of input variables
(features) in a dataset. This is important for several reasons:
●​ Reduces Overfitting: Fewer features can lead to a simpler model that generalizes
better to new data.
●​ Improves Performance: Training models is computationally faster with fewer
dimensions.
●​ Handles the "Curse of Dimensionality": In very high dimensions, data becomes
sparse, making it difficult for models to find patterns. Reducing dimensions can make
the data denser and patterns easier to find.
●​ Better Visualization: It's impossible to visualize data in more than 3 dimensions.
Reducing it to 2D or 3D allows for plotting and visual inspection.
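
As a concrete illustration, Principal Component Analysis (PCA) from scikit-learn is one widely used dimensionality reduction technique; this minimal sketch projects made-up 10-dimensional data down to 2 components:

Python

import numpy as np
from sklearn.decomposition import PCA

# Made-up dataset: 100 samples, 10 features
X = np.random.rand(100, 10)

# Reduce to 2 dimensions, e.g., for visualization
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("Original shape:", X.shape)         # (100, 10)
print("Reduced shape:", X_reduced.shape)  # (100, 2)
print("Explained variance ratio:", pca.explained_variance_ratio_)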

(c) Explain Performance Evaluation Methods: 1) Cross Validation 2) Confusion Matrix.


(7 Marks)
1.​ Cross-Validation:
○​ Purpose: To get a more reliable estimate of how a model will perform on
unseen data. A simple train-test split can be lucky or unlucky.
○​ How it Works (K-Fold Cross-Validation): The dataset is split into 'K' equal
parts (or "folds"). The model is trained K times. In each run, one fold is used
as the test set, and the remaining K-1 folds are used for training. The final
performance score is the average of the scores from all K runs. This ensures
that every data point gets to be in a test set exactly once.
2. Confusion Matrix:
○ Purpose: A table used to evaluate the performance of a classification
model. It gives a detailed breakdown of correct and incorrect predictions for
each class.
○ Structure: For a binary classification problem, it's a 2x2 matrix:
■ True Positive (TP): Correctly predicted positive class.
■ True Negative (TN): Correctly predicted negative class.
■ False Positive (FP): Incorrectly predicted positive class (Type I error).
■ False Negative (FN): Incorrectly predicted negative class (Type II error).
○ This matrix is used to calculate key metrics like Accuracy, Precision, Recall,
and F1-Score.
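
A minimal sketch of both methods with scikit-learn (the Iris dataset and logistic regression model are placeholders chosen only for illustration; any classifier could be substituted):

Python

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 1. K-Fold cross-validation: the score is averaged over 5 folds
scores = cross_val_score(model, X, y, cv=5)
print("Scores per fold:", scores)
print("Mean accuracy:", scores.mean())

# 2. Confusion matrix on a held-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model.fit(X_train, y_train)
print(confusion_matrix(y_test, model.predict(X_test)))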

(OR Q.3 a) Explain Performance improvement in machine learning. (3 Marks)

Improving the performance of a machine learning model involves several techniques:


●​ Get More Data: Often, the most effective way to improve a model is to train it on
more high-quality data.
●​ Feature Engineering: Creating new, more meaningful features from the existing
ones. For example, creating an "age" feature from a "date of birth" feature.
●​ Hyperparameter Tuning: Adjusting the settings of a model (e.g., the 'k' in KNN or
the learning rate in a neural network) to find the configuration that yields the best
performance.
●​ Try Different Models: No single algorithm is best for every problem. Trying different
models (e.g., SVM, Decision Tree, Gradient Boosting) can lead to better results.
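
As one concrete illustration of hyperparameter tuning, scikit-learn's GridSearchCV tries every combination in a parameter grid using cross-validation (the KNN classifier, the Iris dataset, and the candidate k values below are assumptions made only for this sketch):

Python

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Search over several values of 'k' with 5-fold cross-validation
param_grid = {'n_neighbors': [1, 3, 5, 7, 9]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X, y)

print("Best hyperparameters:", grid.best_params_)
print("Best cross-validated accuracy:", grid.best_score_)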

(OR Q.3 b) Differentiate between Numerical and Categorical Data. (4 Marks)

| Feature | Numerical Data | Categorical Data |
| :--- | :--- | :--- |
| Definition | Represents quantities and can be measured. | Represents qualities or labels and can be observed. |
| Values | Numbers. | Text or numbers representing categories. |
| Arithmetic | All mathematical operations are meaningful. | Mathematical operations are meaningless. |
| Sub-types | Discrete (e.g., number of students) and Continuous (e.g., height). | Nominal (no order, e.g., colors) and Ordinal (has an order, e.g., 'low', 'medium', 'high'). |
| Example | Temperature, price, age. | Gender, city name, product category. |

(OR Q.3 c) Explain steps involved in Preparing the Model activity in machine learning.
(7 Marks)

This phase, also known as Data Preprocessing, is a crucial set of steps taken to prepare
the raw data for a machine learning model.
1.​ Data Cleaning: This involves handling imperfections in the data.
○​ Handling Missing Values: Either removing rows/columns with missing data
(dropna()) or filling them in with a value like the mean, median, or mode
(fillna()).
○​ Handling Outliers: Identifying and dealing with data points that are
abnormally different from others.
2.​ Data Transformation: This involves converting data into a suitable format.
○​ Feature Scaling: Scaling numerical features to a common range (e.g., using
Standardization or Normalization) so that no single feature dominates the
learning process.
○​ Encoding Categorical Data: Converting categorical features (like 'Red',
'Green') into numbers (e.g., using One-Hot Encoding or Label Encoding)
because models can only process numerical data.
3.​ Feature Selection / Engineering:
○​ Selecting the most relevant features for the model to reduce complexity and
improve accuracy.
○​ Creating new features from existing ones if needed.
4.​ Splitting the Dataset:
○​ Dividing the processed data into a training set (used to train the model) and
a testing set (used to evaluate the model's performance on unseen data).
This is critical to check for overfitting.
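
A minimal sketch of these steps on a tiny made-up dataset (the column names, values, and the 25% test split are invented purely for illustration):

Python

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Made-up raw data: one numerical feature, one categorical feature, one target
df = pd.DataFrame({
    'salary': [30000, 50000, 45000, 80000],
    'city': ['Surat', 'Rajkot', 'Surat', 'Ahmedabad'],
    'purchased': [0, 1, 0, 1]
})

# Feature scaling (standardization) of the numerical column
df[['salary']] = StandardScaler().fit_transform(df[['salary']])

# One-hot encoding of the categorical column
df = pd.get_dummies(df, columns=['city'])

# Splitting into training and testing sets
X = df.drop(columns='purchased')
y = df['purchased']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_train)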

Q.4: Supervised Learning: Classification


(a) Explain steps involved in the supervised learning process. (3 Marks)
1.​ Data Collection: Gather a dataset containing both input features and their
corresponding correct output labels.
2.​ Data Preprocessing: Clean, transform, and prepare the data as described in Q.3(c)
OR.
3.​ Data Splitting: Divide the dataset into training and testing sets.
4.​ Model Training: Choose a suitable algorithm and train it on the training data. The
model learns the mapping between the features and labels.
5.​ Model Evaluation: Test the trained model on the unseen testing data to evaluate its
performance using metrics like accuracy or mean squared error.
6.​ Tuning & Deployment: Adjust model hyperparameters to improve performance.
Once satisfied, the model is ready to be deployed to make predictions on new,
real-world data.

(b) Explain logistic regression technique. (4 Marks)

Logistic Regression is a supervised learning algorithm used for classification, primarily


for binary outcomes (e.g., yes/no, 0/1, spam/not-spam).
●​ How it Works: It predicts the probability that a given input belongs to a certain class.
●​ The Sigmoid Function: It uses a mathematical function called the "sigmoid" or
"logit" function, which takes any real-valued number and squishes it into a range
between 0 and 1.
●​ Decision Boundary: The output of the sigmoid function is interpreted as a
probability. A threshold (typically 0.5) is set. If the probability is greater than the
threshold, the model predicts class 1; otherwise, it predicts class 0. This creates a
linear decision boundary to separate the classes.
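
A minimal sketch of the idea, showing the sigmoid function and scikit-learn's LogisticRegression on a made-up "hours studied vs. pass/fail" dataset (all numbers are illustrative):

Python

import numpy as np
from sklearn.linear_model import LogisticRegression

# The sigmoid squashes any real number into the range (0, 1)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(-3), sigmoid(0), sigmoid(3))  # ~0.05, 0.5, ~0.95

# Made-up data: hours studied -> pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba gives the probability; a 0.5 threshold gives the class
print(model.predict_proba([[3.5]]))
print(model.predict([[3.5]]))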

(c) Explain concept of support vector machine (SVM) in classification. (7 Marks)

Support Vector Machine (SVM) is a powerful supervised algorithm used for classification.
●​ Main Goal: The primary objective of SVM is to find the optimal hyperplane that best
separates the data points of different classes in the feature space.
●​ Hyperplane: A hyperplane is a decision boundary. In a 2D space, it's a line; in a 3D
space, it's a plane.
●​ Margin: SVM doesn't just find any hyperplane; it finds the one that has the
maximum margin. The margin is the distance between the hyperplane and the
nearest data points from either class. A larger margin leads to better generalization
and a more robust classifier.
●​ Support Vectors: The data points that are closest to the hyperplane and which
define the margin are called support vectors. These are the most critical data points
in the dataset.
●​ Kernel Trick: For data that is not linearly separable, SVM can use a "kernel trick" to
map the data into a higher dimension where a linear separator can be found. This
makes SVM very effective for complex, non-linear problems.
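
As an illustration, scikit-learn's SVC implements this idea; a minimal sketch on the built-in Iris dataset (the RBF kernel and C=1.0 are arbitrary choices for the example):

Python

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# kernel='linear' finds a maximum-margin hyperplane directly;
# kernel='rbf' uses the kernel trick for non-linear boundaries
svm = SVC(kernel='rbf', C=1.0)
svm.fit(X_train, y_train)

print("Support vectors per class:", svm.n_support_)
print("Test accuracy:", svm.score(X_test, y_test))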

(OR Q.4 a) Differentiate Linear Regression with Logistic Regression. (3 Marks)

| Feature | Linear Regression | Logistic Regression |
| :--- | :--- | :--- |
| Problem Type | Regression task. | Classification task. |
| Output | Predicts a continuous value (e.g., 123.45, -50, 42). | Predicts a probability, which is mapped to a discrete class (e.g., 0 or 1). |
| Relationship | Models a linear relationship between variables. | Models the probability of a class. |
| Equation | Uses a straight-line equation: y = β0 + β1x. | Uses the Sigmoid function to produce an S-shaped curve. |
| Use Case | Predicting house prices, temperature, stock values. | Spam detection, disease diagnosis, credit approval. |

(OR Q.4 b) Define Decision Trees algorithm. Explain Terminologies of Decision Trees.
(4 Marks)

A Decision Tree is a supervised learning algorithm that works by splitting the data into
smaller and smaller subsets based on a series of questions about the features. It creates a
tree-like model of decisions.
●​ Terminologies:
○​ Root Node: The topmost node in the tree, representing the entire dataset.
○​ Decision Node: A node that splits into two or more sub-nodes. It represents
a test on a feature.
○​ Leaf / Terminal Node: A node that does not split further. It represents the
final decision or class label.
○​ Splitting: The process of dividing a node into sub-nodes.
○​ Branch / Sub-Tree: A subsection of the entire tree.
○​ Pruning: The process of removing sub-nodes of a decision node to reduce
the complexity of the tree and prevent overfitting.
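
A minimal sketch using scikit-learn's DecisionTreeClassifier on the built-in Iris dataset (max_depth=3 is an arbitrary pre-pruning choice for illustration); export_text prints the learned root, decision, and leaf nodes:

Python

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=42)

# Limiting the depth is a simple way to control overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=list(iris.feature_names)))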

(OR Q.4 c) Write a Python program to implement K-Nearest Neighbour supervised


machine learning algorithm for a given dataset of animals which will classify
categories of 'Dog' and 'Cat'. (7 Marks)
Python

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Assume required data points/features.


# Let's create a simple dataset: [weight (kg), height (cm)]
# Label: 0 for Cat, 1 for Dog
X = np.array([
[3, 30], [4, 35], [2, 25], [5, 32], # Cats
[15, 60], [20, 75], [12, 55], [25, 80] # Dogs
])

y = np.array([0, 0, 0, 0, 1, 1, 1, 1]) # 0=Cat, 1=Dog

# Split data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- K-Nearest Neighbour Implementation ---

# 1. Choose a model and instantiate it. Let's use K=3.


knn = KNeighborsClassifier(n_neighbors=3)

# 2. Train the model


knn.fit(X_train, y_train)

# 3. Make predictions on the test data


y_pred = knn.predict(X_test)

# 4. Evaluate the model


accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

# --- Classify a new animal ---


# New animal with weight=18kg and height=70cm
new_animal = np.array([[18, 70]])
prediction = knn.predict(new_animal)

if prediction[0] == 0:
print("The new animal is classified as a 'Cat'.")
else:
print("The new animal is classified as a 'Dog'.")

Q.5: Unsupervised Learning


(a) List examples of Unsupervised Learning. (3 Marks)
●​ Clustering: Grouping similar customers based on purchasing behavior (Customer
Segmentation).
●​ Association Rule Mining: Finding items that are frequently bought together in a
supermarket (Market Basket Analysis).
●​ Dimensionality Reduction: Reducing the number of features in a dataset while
retaining important information (e.g., using PCA).
●​ Anomaly Detection: Identifying fraudulent transactions in a bank's transaction data.

(b) Give the difference between supervised and unsupervised machine learning. (4
Marks)

| Feature | Supervised Learning | Unsupervised Learning |
| :--- | :--- | :--- |
| Input Data | Uses labeled data (features + correct answers). | Uses unlabeled data (only features). |
| Goal | To predict an outcome or classify data. | To find hidden patterns or structure in data. |
| Process | The algorithm is "supervised" by the correct answers. | The algorithm explores the data on its own to find patterns. |
| Feedback | Direct feedback mechanism (it knows if a prediction is right or wrong). | No direct feedback mechanism. |
| Example Algos | Linear Regression, SVM, Decision Trees. | K-Means Clustering, PCA, Apriori. |

(c) Explain K-Means Clustering Algorithm. (7 Marks)

K-Means is an unsupervised learning algorithm used to partition a dataset into a


pre-determined number (K) of distinct, non-overlapping clusters.

The step-by-step process is as follows:


1.​ Choose K: Decide the number of clusters you want to create (e.g., K=3).
2.​ Initialize Centroids: Randomly select K data points from the dataset to act as the
initial cluster centers (centroids).
3.​ Assignment Step: For each data point in the dataset, calculate its distance (e.g.,
Euclidean distance) to all K centroids. Assign the data point to the cluster of the
nearest centroid.
4.​ Update Step: After all points have been assigned to clusters, recalculate the position
of the centroid for each cluster. The new centroid is the mean (average) of all data
points belonging to that cluster.
5.​ Repeat: Repeat the Assignment Step and the Update Step until the centroids no
longer move significantly or a maximum number of iterations is reached. At this point,
the algorithm has converged, and the final clusters are formed.
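
A minimal sketch of these steps using scikit-learn's KMeans on a handful of made-up 2D points (K=2 is chosen to match the two obvious groups in the data):

Python

import numpy as np
from sklearn.cluster import KMeans

# Made-up 2D points forming two rough groups
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 10], [10, 9]])

# Run K-Means with K=2 clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster labels:", labels)           # e.g., [0 0 0 1 1 1]
print("Final centroids:\n", kmeans.cluster_centers_)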

(OR Q.5 a) Define unsupervised learning. List applications of unsupervised learning.


(3 Marks)
●​ Definition: Unsupervised learning is a type of machine learning where the algorithm
learns patterns from data that is not labeled or classified. The system tries to learn
the structure of the data without any guidance on what the "correct" output is.
●​ Applications:
○​ Recommendation Systems
○​ Customer Segmentation
○​ Anomaly and Fraud Detection
○​ Medical Imaging (e.g., segmenting brain scans)

(OR Q.5 b) Explain Clustering and briefly list techniques of clustering. (4 Marks)
●​ Clustering: It is the task of grouping a set of objects (data points) in such a way that
objects in the same group (called a cluster) are more similar to each other than to
those in other groups. It is a fundamental technique in unsupervised learning.
●​ Clustering Techniques:
○​ Partitioning Methods: These methods divide the data into a pre-determined
number of non-overlapping clusters. (e.g., K-Means).
○​ Hierarchical Methods: These methods create a tree-like hierarchy of
clusters. They can be agglomerative (bottom-up) or divisive (top-down). (e.g.,
Agglomerative Clustering).
○​ Density-Based Methods: These methods connect areas of high data point
density into clusters, allowing for arbitrarily shaped clusters and identifying
outliers. (e.g., DBSCAN).

(OR Q.5 c) Define Association. Explain step by step process of Association. (7 Marks)
●​ Definition: Association is a rule-based machine learning method for discovering
interesting relationships or "associations" between variables in large datasets. The
classic example is "Market Basket Analysis."
●​ Step-by-Step Process (using the Apriori algorithm concept):
1.​ Define Metrics: The process relies on three key metrics:
■​ Support: How frequently an item or itemset appears in the dataset.
■​ Confidence: The likelihood that item B is purchased when item A is
purchased.
■​ Lift: The increase in the ratio of the sale of B when A is sold.
2.​ Set Minimum Thresholds: Define minimum thresholds for support and
confidence to filter out uninteresting rules.
3.​ Find Frequent Itemsets: The algorithm first scans the dataset to find all the
individual items that meet the minimum support threshold. These are the
"frequent 1-itemsets."
4.​ Generate Candidate Itemsets: It then joins the frequent 1-itemsets to create
"candidate 2-itemsets" and checks if they meet the minimum support. This
process is repeated iteratively (generating 3-itemsets from frequent
2-itemsets, and so on) until no more frequent itemsets can be found. This
uses the Apriori Principle: if an itemset is frequent, then all of its subsets
must also be frequent.
5.​ Generate Association Rules: From the final list of frequent itemsets, the
algorithm generates association rules (e.g., {A} -> {B}) that meet the minimum
confidence threshold. For example, the rule {Bread, Butter} -> {Milk} is
generated from the frequent itemset {Bread, Butter, Milk}.

☀️ Summer-2025 Exam (4350702)


Q.1: Fundamentals of Machine Learning & NumPy
(a) Define Machine Learning (ML). List types of ML. (3 Marks)
●​ Definition: Machine Learning is a field of artificial intelligence (AI) that gives
computers the ability to learn and improve from experience without being explicitly
programmed. It focuses on developing algorithms that can analyze data, learn from
it, and then make a determination or prediction about something.
●​ Types of ML:
1.​ Supervised Learning
2.​ Unsupervised Learning
3.​ Reinforcement Learning

(b) Give differences between Machine Learning and Human Learning. (4 Marks)
●​ Speed & Scale: Machines can process vast amounts of data much faster than
humans.
●​ Accuracy & Consistency: For specific, repetitive tasks, ML models can be more
accurate and are perfectly consistent, whereas humans are prone to fatigue and
error.
●​ Adaptability: Humans are far superior at adapting to new, unseen situations and can
learn from very limited information. ML models need to be retrained on new data.
●​ Data Dependency: ML is highly dependent on the quality and quantity of data.
Human learning can occur through abstract reasoning and intuition with little data.

(c) Write a python program to implement any five maths functions in Numpy. (7 Marks)

Python

import numpy as np

# Create two sample arrays


a = np.array([1, 4, 9, 16, 25])
b = np.array([1, 2, 3, 4, 5])

# 1. Square Root
sqrt_a = np.sqrt(a)
print(f"Original array a: {a}")
print(f"1. Square root of a: {sqrt_a}\n")

# 2. Add two arrays


sum_arr = np.add(a, b)
print(f"Original array b: {b}")
print(f"2. Sum of a and b: {sum_arr}\n")

# 3. Exponential function (e^x)


exp_b = np.exp(b)
print(f"3. Exponential of b: {exp_b}\n")

# 4. Find the maximum value


max_a = np.max(a)
print(f"4. Maximum value in a: {max_a}\n")

# 5. Calculate the sine of elements


# Create an array of angles in radians
angles = np.array([0, np.pi/2, np.pi])
sin_angles = np.sin(angles)
print(f"Angles array: {angles}")
print(f"5. Sine of angles: {sin_angles}\n")

(OR Q.1 c) How to define an array in Numpy? Create two Numpy arrays: 1. Array filled
with all zeros and 2. Array filled with all ones. Combine both in single array and
display their elements. (7 Marks)

In NumPy, an array is defined using the np.array() function, which converts a list-like
structure into a NumPy array.

Python

import numpy as np
# 1. Create an array filled with all zeros
# Let's make a 2x3 array of zeros
zeros_array = np.zeros((2, 3))
print("1. Array of all zeros:")
print(zeros_array)
print("-" * 20)

# 2. Create an array filled with all ones


# Let's make a 2x3 array of ones
ones_array = np.ones((2, 3))
print("2. Array of all ones:")
print(ones_array)
print("-" * 20)

# 3. Combine both arrays into a single array


# We can combine them vertically (stacking rows) using np.concatenate
combined_array = np.concatenate((zeros_array, ones_array), axis=0)

# Display the elements of the combined array


print("3. Combined array:")
print(combined_array)

Q.2: Pandas and Matplotlib


(a) Write a Pandas program to implement following: (3 Marks)

1. To convert the first column of a DataFrame as a Series.

2. To sort a given Series: [12, 25, 6, 4, 8, 10, 21, 22]

Python

import pandas as pd

# --- Part 1 ---


data = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
df = pd.DataFrame(data)
first_col_series = df['col1']
print("--- Part 1 ---")
print("Original DataFrame:\n", df)
print("\nFirst column as a Series:\n", first_col_series)
print(f"Type of the result: {type(first_col_series)}")
print("\n" + "="*20 + "\n")

# --- Part 2 ---


s = pd.Series([12, 25, 6, 4, 8, 10, 21, 22])
sorted_s = s.sort_values()
print("--- Part 2 ---")
print("Original Series:\n", s)
print("\nSorted Series:\n", sorted_s)

(b) Which type of machine learning system should you use to learn spam email
detection? Brief about selected model. (4 Marks)
●​ Type of ML System: Supervised Learning, specifically a Classification task. This
is because we have historical data of emails that are already labeled as either 'spam'
or 'not spam'.
●​ Selected Model: A Naive Bayes classifier is an excellent and common choice for
spam detection.
○​ Brief: It works on Bayes' theorem of probability. It calculates the probability of
an email being 'spam' given the presence of certain words in it (e.g., "free,"
"winner," "prize"). It's called "naive" because it assumes that the presence of
one word is independent of another, which is a simplifying assumption but
works very well in practice for text classification.
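
A minimal sketch of this approach with scikit-learn (the four example emails and their labels are made up; CountVectorizer converts the text into word counts, which MultinomialNB then uses):

Python

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up labeled training emails: 1 = spam, 0 = not spam
emails = [
    "win a free prize now", "free winner claim money",
    "meeting agenda for tomorrow", "please review the project report"
]
labels = [1, 1, 0, 0]

# Convert the text into word-count features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Train the Naive Bayes classifier
model = MultinomialNB()
model.fit(X, labels)

# Classify a new, unseen email
new_email = vectorizer.transform(["claim your free prize"])
print("Prediction (1 = spam):", model.predict(new_email))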

(c) Differentiate between plt.show() and plt.savefig() in Matplotlib. Write a program to


plot line, pie and bar graph with blue color, legends, and title. (7 Marks)

Difference:
●​ plt.show(): This function displays the plot in a pop-up window on your screen. It's
used for interactive viewing.
●​ plt.savefig('filename.png'): This function saves the current plot to a file on your
computer's disk (e.g., as a PNG, JPG, or PDF). It does not display the plot on the
screen.

Program:

Python

import matplotlib.pyplot as plt

# Data for plots


x_data = ['A', 'B', 'C', 'D']
y_data = [10, 25, 15, 30]

# Create a figure with 3 subplots


fig, axs = plt.subplots(1, 3, figsize=(18, 5))

# --- 1. Line Plot ---


axs[0].plot(x_data, y_data, color='blue', marker='o', label='Sales Data')
axs[0].set_title('Line Plot')
axs[0].legend()

# --- 2. Pie Chart ---


axs[1].pie(y_data, labels=x_data, autopct='%1.1f%%', colors=['#66b3ff', '#99ff99', '#ffcc99',
'#ff9999'])
axs[1].set_title('Pie Chart')
# Note: Legends are often handled by labels in pie charts, but you could add one if needed.

# --- 3. Bar Graph ---


axs[2].bar(x_data, y_data, color='blue', label='Inventory')
axs[2].set_title('Bar Graph')
axs[2].legend()

# Show the plots


plt.tight_layout() # Adjusts plots to prevent overlap
plt.show()

# To save the figure instead of showing it, you would use:


# plt.savefig('my_plots.png')

(OR Q.2 a) Write a Pandas program to implement following operations: (3 Marks)

1. To create a DataFrame from a dictionary and display it.

2. To sort the DataFrame first by 'name' in ascending order.

Python

import pandas as pd

# 1. Create a DataFrame from a dictionary


student_data = {
'name': ['Alice', 'Charlie', 'Bob'],
'age': [21, 20, 22],
'score': [88, 95, 85]
}
df = pd.DataFrame(student_data)
print("Original DataFrame:")
print(df)
print("-" * 20)

# 2. Sort the DataFrame by the 'name' column


df_sorted = df.sort_values(by='name')
print("\nDataFrame sorted by 'name':")
print(df_sorted)

(OR Q.2 b) Which type of machine learning system should you use to make a robot
learn how to walk? Brief about selected model. (4 Marks)
●​ Type of ML System: Reinforcement Learning (RL).
●​ Brief: In RL, an agent (the robot) learns to behave in an environment (the physical
world) by performing actions (moving its motors) and observing the rewards or
penalties it receives.
○​ For a walking robot, a positive reward could be given for moving forward
without falling over.
○​ A negative reward (penalty) would be given for falling down.
○​ Through countless trials and errors, the robot learns a policy (a strategy of
which actions to take) that maximizes its cumulative reward, which in this
case, corresponds to successful walking.

(OR Q.2 c) Differentiate between Numpy and Pandas. Create a series having the
names of 3 students in your class and assign their roll numbers as index values. Also
use attributes: index, dtype, shape, ndim with series. (7 Marks)

Difference:

| Feature | NumPy | Pandas |
| :--- | :--- | :--- |
| Primary Data Structure | ndarray (n-dimensional array). | Series (1D) and DataFrame (2D). |
| Data Type | Homogeneous (all elements must be the same type). | Heterogeneous (columns can have different types). |
| Indexing | Uses integer-based indexing. | Uses labeled indexing (and integer-based). |
| Use Case | Optimized for fast numerical and mathematical operations. | Best for data cleaning, manipulation, and analysis of tabular data. |

Program:

Python

import pandas as pd

# Create a Series with custom index


student_names = ['Amit', 'Priya', 'Rahul']
roll_numbers = [101, 102, 103]
student_series = pd.Series(data=student_names, index=roll_numbers)

print("Student Series:")
print(student_series)
print("-" * 20)

# Use attributes
print(f"Index: {student_series.index}")
print(f"Data Type (dtype): {student_series.dtype}")
print(f"Shape: {student_series.shape}")
print(f"Number of Dimensions (ndim): {student_series.ndim}")

Q.3: Data Preprocessing and Model Evaluation


(a) Explain the steps involved in the Preparing to Model activity in machine learning. (3
Marks)

This is also known as Data Preprocessing. The key steps are:

1.​ Data Cleaning: Handling missing values and outliers.


2.​ Data Transformation: Scaling numerical features and encoding categorical features.
3.​ Feature Selection: Choosing the most relevant features for the model.
4.​ Data Splitting: Dividing the dataset into training and testing sets.

(b) Define outliers in data. Write approaches to handle outliers. (4 Marks)


●​ Definition: An outlier is a data point that is significantly different from the other data
points in a dataset. It is an observation that lies an abnormal distance from other
values.
●​ Approaches to Handle Outliers:
1.​ Deletion: Simply remove the outlier records. This is feasible for large
datasets but can cause information loss.
2.​ Transformation: Apply mathematical transformations like taking the log or
square root of the data to reduce the effect of the outlier.
3.​ Imputation: Replace the outlier with a more reasonable value, such as the
mean, median, or mode of the feature.
4.​ Binning: Group the data into bins or categories, which can reduce the impact
of extreme values.
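
A minimal sketch of detecting outliers with the common IQR (interquartile range) rule and then handling them by deletion or imputation (the values are made up; the 1.5 × IQR cut-off is a convention, not the only possible choice):

Python

import pandas as pd

# Made-up data with one obvious outlier (300)
s = pd.Series([25, 28, 30, 27, 26, 29, 300])

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print("Outliers:", s[(s < lower) | (s > upper)].tolist())

# Approach 1: deletion
s_deleted = s[(s >= lower) & (s <= upper)]

# Approach 3: imputation with the median
s_imputed = s.where((s >= lower) & (s <= upper), s.median())
print("After imputation:", s_imputed.tolist())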

(c) Describe K-fold cross validation in details. (7 Marks)

K-Fold Cross-Validation is a robust technique for evaluating the performance of a machine


learning model and ensuring it generalizes well to new, independent data.

Detailed Process:
1.​ Shuffle the Dataset: The first step is to randomly shuffle the entire dataset to ensure
that the data is not ordered in any way.
2.​ Split into K Folds: The dataset is then split into K equal-sized, non-overlapping
subsets called "folds." A common value for K is 5 or 10.
3.​ Iterate K Times: The process iterates K times. In each iteration:
○​ Select Validation Fold: One of the K folds is chosen as the validation set (or
test set for that iteration).
○​ Select Training Folds: The remaining K-1 folds are combined to form the
training set.
○​ Train and Evaluate: The model is trained on the training set and then
evaluated on the validation set. The performance score (e.g., accuracy) for
that iteration is recorded.
4.​ Average the Scores: After all K iterations are complete, the final performance of the
model is calculated by taking the average of the K recorded scores. This average
score is a more reliable and less biased estimate of the model's performance than a
single train-test split.
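
A minimal sketch of this procedure using scikit-learn's KFold, which makes the K training/validation iterations explicit (the Iris dataset and logistic regression model are only placeholders; cross_val_score would give the same result in a single call):

Python

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# Shuffle the data and split it into K = 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []

for train_idx, val_idx in kf.split(X):
    # K-1 folds for training, the remaining fold for validation
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))

print("Scores per fold:", scores)
print("Average score:", np.mean(scores))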

(OR Q.3 a) Explain the importance of Dimensionality reduction and Feature subset
selection in Data Pre-Processing. (3 Marks)
●​ Dimensionality Reduction: Important for reducing the "Curse of Dimensionality,"
where models struggle with sparse data in high dimensions. It speeds up model
training and can help with data visualization.
●​ Feature Subset Selection: Important for simplifying models by removing irrelevant
or redundant features. This can improve model accuracy, reduce overfitting, and
make the model easier to interpret.

(OR Q.3 b) Write the difference between 1) Nominal and Ordinal data, 2) Interval and
Ratio data. (4 Marks)
1.​ Nominal vs. Ordinal Data (Both are Categorical):
○​ Nominal Data: Categories that have no natural order or ranking. Examples:
Gender ('Male', 'Female'), Colors ('Red', 'Blue').
○​ Ordinal Data: Categories that have a meaningful order or rank, but the
difference between them is not defined. Examples: Customer satisfaction
('Poor', 'Good', 'Excellent'), T-shirt size ('S', 'M', 'L').
2.​ Interval vs. Ratio Data (Both are Numerical):
○​ Interval Data: Ordered data where the difference between two values is
meaningful, but there is no true zero point. Examples: Temperature in
Celsius (0°C is a temperature, not the absence of heat), IQ score.
○​ Ratio Data: Ordered data with a meaningful difference and a true, absolute
zero. This means zero represents the absence of the attribute. Examples:
Height, weight, age, price.

(OR Q.3 c) Write a Pandas program to find and drop the missing values from the given
dataset. (7 Marks)

Python

import pandas as pd
import numpy as np

# Create a sample DataFrame with missing values (NaN)


data = {
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': [10, 20, 30, 40]
}
df = pd.DataFrame(data)

print("Original DataFrame with Missing Values:")


print(df)
print("-" * 35)

# --- 1. Find the missing values ---


# isnull() returns a DataFrame of booleans (True for NaN)
# .sum() counts the True values in each column
missing_values_count = df.isnull().sum()
print("Count of missing values in each column:")
print(missing_values_count)
print("-" * 35)

# --- 2. Drop the missing values ---


# The dropna() method removes rows with any NaN values
df_dropped = df.dropna()
print("DataFrame after dropping rows with missing values:")
print(df_dropped)

Q.4: Supervised Learning


(a) Define classification with example in supervised learning. List types of
classification. (3 Marks)
●​ Definition: Classification is a supervised learning task where the objective is to
predict a categorical class label. The model learns from labeled data to assign new,
unseen data points to one of a predefined set of categories.
●​ Example: Classifying an email as either 'spam' or 'not spam'.
●​ Types of Classification:
○​ Binary Classification: Two possible outcomes (e.g., Yes/No).
○​ Multi-class Classification: More than two mutually exclusive outcomes (e.g.,
'Cat', 'Dog', 'Bird').
○​ Multi-label Classification: Assigning multiple non-exclusive labels to an
instance (e.g., tagging a movie as 'Action', 'Thriller', and 'Sci-Fi').

(b) Write note on choosing k value in the KNN algorithm. (4 Marks)

Choosing the value of 'k' (the number of neighbors) in the K-Nearest Neighbors algorithm is
a critical step that significantly affects the model's performance.
●​ Small k (e.g., k=1): The model is highly flexible and sensitive to noise and outliers.
This can lead to a complex decision boundary and overfitting (high variance).
●​ Large k: The model becomes less sensitive to individual points and has a smoother
decision boundary. This can lead to underfitting (high bias), as it might fail to
capture the local structure of the data.
●​ How to Choose k: There is no single best value for k. The optimal k is typically
found through experimentation and cross-validation. A common practice is to:
1.​ Test the model's performance with a range of k values (e.g., from 1 to 20).
2.​ Plot the accuracy for each k.
3.​ Choose the k value that provides the best accuracy (often found at the
"elbow" of the curve, where performance stabilizes).
4.​ It's also a good practice to use an odd number for k in binary classification to
avoid ties.
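
For illustration, a minimal sketch of this procedure using cross-validation on the built-in Iris dataset (the dataset and the range of k values tested are assumptions made only for this example):

Python

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try odd values of k and record the cross-validated accuracy for each
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    acc = cross_val_score(knn, X, y, cv=5).mean()
    print(f"k = {k:2d} -> mean accuracy = {acc:.3f}")

# Choose the k with the highest and most stable accuracy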

c) Define simple linear regression using a graph explaining slope. Find the slope of
the graph where the lower point on the line is represented as (-3, -2) and the higher
point on the line is represented as (2, 2). (7 Marks)

●​ Definition: Simple Linear Regression is a statistical method used to model the


relationship between two continuous variables: a single independent variable (x) and
a dependent variable (y). It finds the best-fitting straight line (the regression line)
through the data points that minimizes the distance between the line and the points.
The equation for this line is y=β0​+β1​x.
●​ Graph Explanation:
○​ Slope (β1​): The slope of the regression line represents the rate of change. It
tells us how much the dependent variable (y) is expected to change for a
one-unit increase in the independent variable (x). A positive slope means y
increases as x increases; a negative slope means y decreases as x
increases.
● Calculation of Slope: The formula for the slope (m) given two points (x1, y1) and
(x2, y2) is:
m = (y2 − y1) / (x2 − x1)
Given the points (−3, −2) and (2, 2):
m = (2 − (−2)) / (2 − (−3)) = 4 / 5 = 0.8

The slope of the graph is 0.8.
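
The same calculation as a quick Python check:

Python

# Two points on the regression line
x1, y1 = -3, -2
x2, y2 = 2, 2

# Slope = rise over run
slope = (y2 - y1) / (x2 - x1)
print(slope)  # 0.8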

(OR Q.4 a) Define regression with example in supervised learning. List types of
regression. (3 Marks)
●​ Definition: Regression is a supervised learning task where the objective is to predict
a continuous numerical value. The model learns the relationship between input
features and a continuous target variable.
●​ Example: Predicting the price of a house based on its size, location, and number of
bedrooms.
●​ Types of Regression:
○​ Simple Linear Regression
○​ Multiple Linear Regression
○​ Polynomial Regression
○​ Ridge Regression
○​ Lasso Regression

(OR Q.4 b) Differentiate between classification and regression. (4 Marks)

| Feature | Classification | Regression |
| :--- | :--- | :--- |
| Output | Predicts a discrete class label (category). | Predicts a continuous quantity (number). |
| Question It Answers | "What class?" or "Is this A or B?" | "How much?" or "How many?" |
| Evaluation Metrics | Accuracy, Precision, Recall, F1-Score, Confusion Matrix. | Mean Absolute Error (MAE), Mean Squared Error (MSE), R-squared. |
| Example | Is this email spam or not spam? | What will be the temperature tomorrow? |

(OR Q.4 c) Discuss kNN algorithm in detail. (7 Marks)

The K-Nearest Neighbors (KNN) algorithm is a simple, yet effective, supervised machine
learning algorithm used for both classification and regression. It is a non-parametric and lazy
learning algorithm.
●​ Non-parametric: It makes no assumptions about the underlying data distribution.
●​ Lazy Learning: It does not build a model during the training phase. Instead, it simply
stores the entire training dataset. The computation is deferred until a prediction is
needed.

How it Works:
The core idea of KNN is that similar things exist in close proximity. To classify a new, unseen
data point, KNN follows these steps:

1.​ Choose a value for k: Decide on the number of nearest neighbors to consider (e.g.,
k=5).
2.​ Calculate Distances: Calculate the distance between the new data point and every
single point in the training dataset. The most common distance metric used is
Euclidean distance.
3.​ Identify the k Nearest Neighbors: Find the k data points from the training set that
have the smallest distances to the new point.
4.​ Make a Prediction:
○​ For Classification: The new data point is assigned to the class that is most
common among its k nearest neighbors (this is called a "majority vote").
○​ For Regression: The prediction for the new data point is the average of the
values of its k nearest neighbors.

Q.5: Unsupervised Learning


(a) Describe the main difference in the approach of k-means and k-medoids
algorithms with a neat diagram. (3 Marks)

The main difference between K-Means and K-Medoids lies in how they define the center of
a cluster.
●​ K-Means: The center, called a centroid, is the mean (average) of all the data points
in the cluster. This centroid is a calculated point and may not be an actual data point
in the dataset.
●​ K-Medoids: The center, called a medoid, is the most centrally located actual data
point within the cluster. It is chosen to minimize the total distance to all other points
in its cluster.

Because K-Medoids uses an actual data point as the center, it is more robust to outliers
than K-Means.

[Image comparing K-Means centroids vs K-Medoids medoids]
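
A minimal NumPy sketch of the difference on a made-up cluster containing one outlier, computing the mean-based centre (K-Means style) and the medoid (K-Medoids style) directly:

Python

import numpy as np

# One cluster of 2D points, with an outlier at (50, 50)
points = np.array([[1, 1], [2, 1], [1, 2], [2, 2], [50, 50]])

# K-Means style centre: the mean, which need not be an actual data point
centroid = points.mean(axis=0)

# K-Medoids style centre: the actual point with the smallest
# total distance to all other points in the cluster
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
medoid = points[dists.sum(axis=1).argmin()]

print("Centroid (mean):", centroid)    # pulled towards the outlier
print("Medoid (data point):", medoid)  # stays at a central real point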

(b) Differentiate Supervised Machine Learning and Unsupervised Machine Learning. (4


Marks)

This is a repeat of Q.5(b) from the Winter-2024 paper. Please refer to that answer.

(c) Explain how the Market Basket Analysis uses the concepts of association
analysis. (7 Marks)

Market Basket Analysis (MBA) is the classic application of association analysis. It is a


technique used by retailers to understand customer purchasing patterns by discovering
associations between items that are frequently bought together.
How it uses Association Analysis:
1.​ Identify Frequent Itemsets: The first step is to analyze transaction data (e.g., from a
supermarket) to find frequent itemsets—groups of items that appear together in
shopping baskets more often than a certain threshold. For example, it might discover
that {Bread, Butter, Milk} is a frequent itemset. This is done using the Support metric.
2.​ Generate Association Rules: From these frequent itemsets, association rules are
generated. A rule takes the form "If {A} then {B}," meaning that customers who buy
item A also tend to buy item B. For example, a rule could be: {Bread, Butter} ->
{Milk}.
3.​ Evaluate Rules: These rules are not just generated randomly; they are evaluated
based on key metrics from association analysis:
○​ Support: Measures the popularity of an itemset (e.g., the percentage of all
transactions that contain both bread and butter).
○​ Confidence: Measures the likelihood of the rule being true (e.g., out of all the
customers who bought bread and butter, what percentage also bought milk?).
○​ Lift: Measures how much more likely customers are to buy milk if they have
bread and butter, compared to the overall likelihood of buying milk. A lift
greater than 1 indicates a positive association.

By using these concepts, retailers can make practical business decisions like placing
associated items close to each other in a store, creating targeted promotions, and offering
product bundles.

(OR Q.5 a) List applications of unsupervised learning. (3 Marks)

This is a repeat of Q.5(a) OR from the Winter-2024 paper. Please refer to that answer.

(OR Q.5 b) What are the broad three categories of clustering techniques? Explain the
characteristics of each briefly. (4 Marks)

This is a repeat of Q.5(b) OR from the Winter-2024 paper. The three categories are:

1.​ Partitioning Clustering (e.g., K-Means): Divides data into a pre-set number of
non-overlapping clusters.
2.​ Hierarchical Clustering (e.g., Agglomerative): Creates a tree-like hierarchy of
clusters, not requiring the number of clusters to be specified beforehand.
3.​ Density-Based Clustering (e.g., DBSCAN): Groups dense regions of data points
into clusters and can identify arbitrarily shaped clusters and outliers.

(OR Q.5 c) List association methods and explain any one with example. (7 Marks)
●​ Association Methods:
1.​ Apriori
2.​ Eclat
3.​ FP-Growth
●​ Explanation of Apriori Algorithm:​
The Apriori algorithm is a classic method for mining frequent itemsets and generating
association rules. Its core idea is based on the Apriori Principle: "If an itemset is
frequent, then all of its subsets must also be frequent." This principle helps to prune
the search space efficiently.​
Steps with an Example:​
Imagine a small set of transactions: {Milk, Bread}, {Bread, Diapers}, {Milk, Diapers,
Beer}, {Milk, Bread, Diapers}. Let's set a minimum support of 50% (must appear in at
least 2 transactions).
1.​ Find Frequent 1-Itemsets:
■​ {Milk}: 3 times (75%) -> Frequent
■​ {Bread}: 3 times (75%) -> Frequent
■​ {Diapers}: 3 times (75%) -> Frequent
■​ {Beer}: 1 time (25%) -> Infrequent. We discard {Beer}.
2.​ Generate and Prune 2-Itemsets: We generate pairs only from the frequent
1-itemsets.
■​ {Milk, Bread}: 2 times (50%) -> Frequent
■​ {Milk, Diapers}: 2 times (50%) -> Frequent
■​ {Bread, Diapers}: 2 times (50%) -> Frequent
3. Generate and Prune 3-Itemsets: We join the frequent 2-itemsets. The only
candidate is {Milk, Bread, Diapers}.
■ {Milk, Bread, Diapers}: 1 time (25%) -> Infrequent. It is discarded, and
since no frequent 3-itemsets remain, the search stops here.
4. Generate Association Rules: From the frequent itemsets, we generate rules
that meet a minimum confidence threshold (e.g., 60%).
■ From the frequent 2-itemset {Milk, Bread}: one possible rule is
{Bread} -> {Milk}.
■ Confidence Calculation: The itemset {Bread} appears 3 times. The
itemset {Milk, Bread} appears 2 times.
■ Confidence = (Support of {Milk, Bread}) / (Support of {Bread}) = 2/3 ≈ 67%.
■ Since 67% > 60%, this is a strong rule.
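
A minimal brute-force check of these support and confidence values in plain Python (it simply counts itemsets in the four transactions above; it does not implement Apriori's candidate pruning):

Python

from itertools import combinations

# The four transactions from the example above
transactions = [
    {'Milk', 'Bread'},
    {'Bread', 'Diapers'},
    {'Milk', 'Diapers', 'Beer'},
    {'Milk', 'Bread', 'Diapers'},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

# Support of every candidate itemset of size 1, 2 and 3
items = sorted({item for t in transactions for item in t})
for size in (1, 2, 3):
    for combo in combinations(items, size):
        s = support(set(combo))
        status = "frequent" if s >= 0.5 else "infrequent"
        print(f"{combo}: support = {s:.0%} ({status})")

# Confidence of the rule {Bread} -> {Milk}
conf = support({'Milk', 'Bread'}) / support({'Bread'})
print(f"Confidence(Bread -> Milk) = {conf:.0%}")  # ~67%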
