
CCW331 – Business Analytics Lab

Lab Manual III/CSE /6th Semester

Name of the Student:

Register Number:

Class/Sec:

RAJALAKSHMI INSTITUTE OF TECHNOLOGY, CHENNAI


An Autonomous Institution, Affiliated to Anna University, Chennai
REGULATIONS 2023

I VISION OF THE DEPARTMENT

To establish a pioneering presence in the domain of Computer Science and Engineering by delivering excellence in technical learning that nurtures a culture of
innovation, research and competent professionalism.
II MISSION OF THE DEPARTMENT

To empower the next generation of Computer Science and Engineering professionals attuned to industry evolution and national progress.

To drive continuous advancement in teaching and learning standards in the field of Computer Science and Engineering by fostering an enriching research milieu.

To forge a strong and symbiotic relationship with industry, facilitating dynamic interactions that bridge academia and real-world applications, nurturing innovation and productive collaboration.

III PROGRAM EDUCATIONAL OBJECTIVES (PEOs)

Graduates can
Pursue higher education and research or have a successful career in industries associated with Computer Science and Engineering.
Adapt to emerging technological changes for global social benefit.
Become entrepreneurs in the field of Computer Science and Engineering, inculcating research and innovation.

IV PROGRAM OUTCOMES (POs)


1. Engineering Knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.
2. Problem Analysis: Identify, formulate, review research literature, and analyze complex engineering problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.
3. Design/Development of Solutions: Design solutions for complex engineering problems and design system components or processes that meet the specified needs with appropriate consideration for public health and safety, and the cultural, societal, and environmental considerations.
4. Conduct Investigations of Complex Problems: Use research-based knowledge and research methods, including design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid conclusions.
5. Modern Tool Usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools, including prediction and modelling, to complex engineering activities with an understanding of the limitations.
6. The Engineer and Society: Apply reasoning informed by the contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional engineering practice.
7. Environment and Sustainability: Understand the impact of the professional engineering solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for, sustainable development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the engineering practice.
9. Individual and Team Work: Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.
11. Project Management and Finance: Demonstrate knowledge and understanding of the engineering and management principles and apply these to one's own work, as a member and leader in a team, to manage projects and in multidisciplinary environments.
12. Lifelong Learning: Recognize the need for, and have the preparation and ability to engage in independent and lifelong learning in the broadest context of technological change.

V PROGRAM SPECIFIC OUTCOMES (PSOs)

The students will be able to


Design and programming skills to build and automate business solutions using cutting-edge technologies.
Strengthen the theoretical foundation leading to excellence and excitement towards research, to provide elegant solutions to complex problems.

Work effectively with various engineering fields as a team to design, build and develop system applications.

List of Experiments
Cycle-1 – MS Excel

1. Explore the features of Ms-Excel.


2. (i) Get the input from user and perform numerical operations (MAX, MIN, AVG, SUM,
SQRT, ROUND)
ii) Perform data import/export operations for different file formats.
3. Perform statistical operations - Mean, Median, Mode, Standard deviation, Variance, Skewness, Kurtosis
4. Perform Z-test, T-test & ANOVA
5. Perform data pre-processing operations i) Handling Missing data ii) Normalization

Cycle-2

6. Perform dimensionality reduction operation using PCA, KPCA & SVD


7. Perform bivariate and multivariate analysis on the dataset.
8. Apply and explore various plotting functions on the data set.
II Cycle – Power BI Desktop
9. Explore the features of Power BI Desktop
10. Prepare & Load data

Cycle-3
11. Develop the data model
12. Perform DAX calculations
13. Design a report
14. Create a dashboard and perform data analysis
15. Presentation of a case study

Experiment Objectives and Case Study Integration

1. Explore the features of MS Excel
   Objective: To understand and utilize the basic and advanced functionalities of MS Excel for data organization and analysis.
   Case Study: Employee Performance Analysis

2(i). Get the input from user and perform numerical operations (MAX, MIN, AVG, SUM, SQRT, ROUND)
   Objective: To perform fundamental numerical operations for summarizing and analyzing data.
   Case Study: Sales Data Processing

2(ii). Perform data import/export operations for different file formats
   Objective: To handle data in various formats for seamless integration and analysis.
   Case Study: Customer Feedback Integration

3. Perform statistical operations - Mean, Median, Mode, Standard deviation, Variance, Skewness, Kurtosis
   Objective: To apply descriptive statistics for understanding data distribution and variability.
   Case Study: Student Exam Results Analysis

4. Perform Z-test, T-test & ANOVA
   Objective: To conduct inferential statistical tests for hypothesis testing and comparisons.
   Case Study: Treatment Effectiveness Study

5. Perform data pre-processing operations i) Handling Missing data ii) Normalization
   Objective: To clean and prepare data for accurate analysis by addressing missing values and ensuring consistency.
   Case Study: Real Estate Price Prediction

6. Perform dimensionality reduction operation using PCA, KPCA & SVD
   Objective: To reduce data complexity while retaining key information for analysis.
   Case Study: High-Dimensional Image Compression

7. Perform bivariate and multivariate analysis on the dataset
   Objective: To explore relationships between variables and understand complex interactions.
   Case Study: Market Basket Analysis

8. Apply and explore various plotting functions on the dataset
   Objective: To visualize data for patterns and insights using appropriate plotting techniques.
   Case Study: Weather Trends Visualization

9. Explore the features of Power BI Desktop
   Objective: To familiarize with Power BI for creating data-driven reports and dashboards.
   Case Study: Financial Risk Analysis

10. Prepare & Load data
   Objective: To prepare data for analysis by cleaning, transforming, and importing it into Power BI.
   Case Study: Retail Store Analytics

11. Develop the data model
   Objective: To create structured data models for effective query execution and analysis.
   Case Study: E-Commerce Sales Analysis

12. Perform DAX calculations
   Objective: To apply advanced analytics using Data Analysis Expressions (DAX) in Power BI.
   Case Study: Budget vs Actuals Comparison

13. Design a report
   Objective: To create visually appealing and informative reports for stakeholders.
   Case Study: Social Media Engagement Report

14. Create a dashboard and perform data analysis
   Objective: To build interactive dashboards that enable dynamic exploration and insights.
   Case Study: Health Metrics Dashboard

15. Presentation of a case study
   Objective: To summarize and communicate findings from data analysis effectively.
   Case Study: Comprehensive Data Storytelling
Title: Explore the Features of MS Excel (Case Study: Employee Performance Analysis)
Experiment Number: 1
Date:
Aim:
To understand and utilize the basic and advanced functionalities of MS Excel for organizing and analyzing employee performance data.
Procedure
1. Open MS Excel and create a new workbook.
2. Import the provided dataset or manually input the sample data.
3. Explore the following Excel features:
o Sorting and filtering data
o Applying conditional formatting
o Using basic formulas (SUM, AVERAGE, MAX, MIN)
o Creating a pivot table
o Generating charts (e.g., bar chart, line chart)
4. Perform operations to analyze the performance data:
o Calculate the total hours worked by each employee.
o Identify the top-performing employee based on scores.
o Visualize performance trends using charts.
5. Save and export the results.
Sample Data Set
Employee ID | Name          | Department | Hours Worked | Task Score | Feedback Score
E001        | Alice Johnson | HR         | 40           | 85         | 4.5
E002        | Bob Smith     | IT         | 38           | 90         | 4.8
E003        | Charlie Davis | Sales      | 42           | 78         | 4.2
E004        | Diana Brown   | Marketing  | 35           | 88         | 4.7

Initialization
1. Open MS Excel on your computer.
2. Input the sample dataset into the first sheet of the workbook.
3. Save the file as Employee_Performance.xlsx.
Step-by-Step Process
Step 1: Sorting and Filtering
● Objective: Identify top-performing employees in each department.
● Action:
o Apply sorting on "Task Score" in descending order.
o Filter by "Department" to view individual department performance.
Step 2: Using Formulas
● Objective: Calculate total scores.
● Example Formula:
o =SUM(E2, F2) (Task Score + Feedback Score) for each employee, assuming Task Score is in column E and Feedback Score in column F.
o Use the AVERAGE formula to find the mean score across all employees.
Step 3: Creating Pivot Tables
● Objective: Summarize hours worked by department.
● Action:
o Select the data, then choose Insert > PivotTable.
o Add "Department" to Rows and "Hours Worked" to Values.
Step 4: Generating Charts
● Objective: Visualize performance trends.
● Action:
o Select the data, then choose Insert > Bar Chart.
o Customize chart title as "Employee Performance Trends."
Step 5: Conditional Formatting
● Objective: Highlight top performers.
● Action:
o Apply conditional formatting to "Task Score" to highlight scores >85.
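For cross-checking the Excel results, the same analysis can be sketched in Python with pandas. This is an illustrative, hypothetical script (not part of the prescribed Excel procedure), assuming the sample dataset above:

import pandas as pd

# Sample dataset from the experiment
df = pd.DataFrame({
    "EmployeeID": ["E001", "E002", "E003", "E004"],
    "Name": ["Alice Johnson", "Bob Smith", "Charlie Davis", "Diana Brown"],
    "Department": ["HR", "IT", "Sales", "Marketing"],
    "HoursWorked": [40, 38, 42, 35],
    "TaskScore": [85, 90, 78, 88],
    "FeedbackScore": [4.5, 4.8, 4.2, 4.7],
})

# Step 1: sort by Task Score (descending), like the Excel sort
print(df.sort_values("TaskScore", ascending=False))

# Step 3: pivot-table equivalent - total hours worked per department
print(df.pivot_table(index="Department", values="HoursWorked", aggfunc="sum"))

# Step 5: flag top performers (Task Score > 85), like conditional formatting
print(df[df["TaskScore"] > 85][["Name", "TaskScore"]])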
Test Cases
Test Case 1: Sorting Employees by Task Score
● Input: Employee data
● Action: Sort "Task Score" in descending order.
● Expected Output: Bob Smith appears as the first record.
Test Case 2: Total Hours Worked by Department
● Input: Employee data table
● Action: Create a pivot table for "Department" and "Hours Worked."
● Expected Output: Total hours for IT department displayed as 38.
Test Case 3: Top Performer Identification
● Input: Employee data table
● Action: Identify the employee with the highest "Task Score."
● Expected Output: Bob Smith with a score of 90.
Test Case 4: Visualization of Performance Trends
● Input: Employee performance scores
● Action: Generate a bar chart.
● Expected Output: Bar chart with scores visualized per employee.
Test Case 5: Conditional Formatting
● Input: Employee task scores
● Action: Apply conditional formatting for scores >85.
● Expected Output: Rows with scores >85 are highlighted.
Result
The analysis highlighted the following:
1. The IT department had the highest total hours worked.
2. Bob Smith was the top performer with a task score of 90.
3. Visualization of feedback scores revealed consistent high ratings across
departments.
Viva Voce Questions with Multiple Choices
1. Logical Problem-Solving
If an employee’s task score starts at 50 and increases by 10% each month for three months,
what will their score be at the end of the third month?
● (a) 55.5
● (b) 60.5
● (c) 66.55
● (d) 72.6
Answer: (c) 66.55

2. Data Analysis
In a dataset with missing values, replacing missing values with the dataset’s median is
preferred when:
● (a) The data contains extreme outliers.
● (b) The dataset is small.
● (c) The missing values are randomly distributed.
● (d) The data is normally distributed.

3. Probability and Statistics


If a dataset contains the values [2, 4, 6, 8, 10], what is the probability of
randomly selecting an even number?
● (a) 20%
● (b) 40%
● (c) 60%
● (d) 100%

4. Visualization
Which of the following charts is most suitable for visualizing the relationship between two
continuous variables?
● (a) Bar Chart
● (b) Scatter Plot
● (c) Histogram
● (d) Line Graph

5. Matrix Operations
If the covariance matrix of a dataset is \begin{bmatrix} 3 & 1 \\ 1 & 2 \end{bmatrix}, what is the sum of the eigenvalues?
● (a) 4
● (b) 5
● (c) 6
● (d) 7

6. Dimensionality Reduction
What is the primary goal of Principal Component Analysis (PCA)?
● (a) Increase the number of features in the dataset.
● (b) Reduce overfitting in machine learning models.
● (c) Compress data by projecting it into fewer dimensions while preserving variance.
● (d) Improve the speed of data collection.
Numerical Operations in MS Excel
Experiment Number: 2(i)
Date:
Aim:
To get input from the user and perform numerical operations (MAX, MIN, AVG, SUM, SQRT, ROUND) for summarizing and analyzing sales data.
Procedure:
1. Open MS Excel and create a new worksheet.
2. Enter a sample data set in a column (e.g., Sales Data).
3. Utilize Excel functions to perform various numerical operations:
o MAX: Find the maximum value in the dataset.
o MIN: Find the minimum value in the dataset.
o AVG: Calculate the average of the dataset.
o SUM: Compute the sum of the dataset.
o SQRT: Calculate the square root of individual values.
o ROUND: Round the results to the nearest integer.
4. Display the results in the adjacent column for each operation.
5. Save the worksheet.
Sample Data Set:
Sales Data (in $)

1200

850

2100

1300

950

1700

1150
Initialization:
1. Open MS Excel.
2. Enter the sample data set in a column (A1 to A7).
Step-by-Step Process:
1. Perform MAX Operation:
o In cell B1, enter the formula =MAX(A1:A7) to find the maximum sales value.
2. Perform MIN Operation:
o In cell B2, enter the formula =MIN(A1:A7) to find the minimum sales value.
3. Calculate AVG (Average):
o In cell B3, enter the formula =AVERAGE(A1:A7) to calculate the average of the
sales data.
4. Calculate SUM:
o In cell B4, enter the formula =SUM(A1:A7) to find the sum of the sales data.
5. Calculate SQRT (Square Root):
o In cell B5, enter the formula =SQRT(A1) to calculate the square root of the first value in the dataset.
6. Apply ROUND:
o In cell B6, enter the formula =ROUND(B5, 0) to round the square root to the nearest integer.
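The worksheet results can be verified outside Excel with a short Python sketch (illustrative only, using the sales values above; the Excel formulas remain the prescribed method):

import math

sales = [1200, 850, 2100, 1300, 950, 1700, 1150]

print("MAX  :", max(sales))                   # =MAX(A1:A7)
print("MIN  :", min(sales))                   # =MIN(A1:A7)
print("AVG  :", sum(sales) / len(sales))      # =AVERAGE(A1:A7)
print("SUM  :", sum(sales))                   # =SUM(A1:A7)
print("SQRT :", math.sqrt(sales[0]))          # =SQRT(A1)
print("ROUND:", round(math.sqrt(sales[0])))   # =ROUND(B5, 0)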
Examples with Corresponding Test Cases:
● Example 1:
o Input: 1200, 850, 2100, 1300, 950, 1700, 1150
o Expected Output (MAX): 2100
● Example 2:
o Input: 1200, 850, 2100, 1300, 950, 1700, 1150
o Expected Output (MIN): 850
● Example 3:
o Input: 1200, 850, 2100, 1300, 950, 1700, 1150
o Expected Output (AVG): 1321.43 (approximately)
● Example 4:
o Input: 1200, 850, 2100, 1300, 950, 1700, 1150
o Expected Output (SUM): 9250
● Example 5:
o Input: 1200
o Expected Output (SQRT): 34.64

Results of Lab Experiments:


1. The maximum value of sales data is calculated correctly using the MAX function.
2. The minimum sales value is identified successfully using the MIN function.
3. The average sales value is computed accurately using the AVERAGE function.
4. The sum of sales data matches the expected result using the SUM function.
5. The square root operation works for individual sales data entries.
6. Rounding of numbers is effective when using the ROUND function.
7. Data entry is validated for accuracy and consistency.
8. The formulas correctly update when new data is added to the worksheet.
9. The worksheet is saved and formatted properly for presentation.
10. The operations are reproducible with new data inputs.
11. The results of MAX, MIN, AVG, SUM, and SQRT are displayed in the worksheet.
12. Data sets can be modified to test different operations.
13. Numerical operations provide accurate insights for decision-making.
14. The use of Excel functions reduces manual calculation errors.
15. The user is able to apply learned functions to real-world datasets effectively.

Viva Voce Questions:

1. What is the significance of the MAX and MIN functions in data analysis?
2. How does the AVERAGE function differ from the SUM function in Excel?
3. What is the purpose of the ROUND function in Excel?
4. In what scenarios would you use the SQRT function in Excel?
5. Why is it important to clean and format data before performing numerical operations in Excel?
Statistical Operations for Data Analysis in MS Excel
Experiment Number: 3
Date:
Aim:
To apply descriptive statistical operations (Mean, Median, Mode, Standard Deviation, Variance, Skewness, Kurtosis) for understanding data distribution and variability.
Procedure:
1. Open MS Excel and create a new worksheet.
2. Enter a sample dataset in a column (e.g., Student Exam Scores).
3. Utilize Excel functions to perform various statistical operations:
o Mean: Calculate the average of the dataset.
o Median: Find the middle value of the dataset.
o Mode: Identify the most frequent value in the dataset.
o Standard Deviation: Calculate the spread of data from the mean.
o Variance: Determine how much the values in the dataset differ from the mean.
o Skewness: Measure the asymmetry of the data distribution.
o Kurtosis: Measure the 'tailedness' of the data distribution.
4. Display the results in the adjacent column for each operation.
5. Save the worksheet.
Sample Data Set:
Student Exam Scores

85

90

75

80

90

95

85
Initialization:
1. Open MS Excel.
2. Enter the sample dataset in a column (A1 to A7).
Step-by-Step Process:
1. Calculate the Mean:
o In cell B1, enter the formula =AVERAGE(A1:A7) to calculate the average score of
the students.
2. Calculate the Median:
o In cell B2, enter the formula =MEDIAN(A1:A7) to find the median score.
3. Find the Mode:
o In cell B3, enter the formula =MODE(A1:A7) to identify the most frequent score.
4. Calculate the Standard Deviation:
o In cell B4, enter the formula =STDEV.P(A1:A7) to compute the standard
deviation.
5. Calculate the Variance:
o In cell B5, enter the formula =VAR.P(A1:A7) to calculate the variance.
6. Calculate Skewness:
o In cell B6, enter the formula =SKEW(A1:A7) to measure the skewness of the
distribution.
7. Calculate Kurtosis:
o In cell B7, enter the formula =KURT(A1:A7) to determine the kurtosis of the
distribution.
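The worksheet values can be verified with a short Python sketch using pandas and scipy (illustrative only; the Excel formulas above remain the prescribed method). Note that STDEV.P and VAR.P are population statistics, which corresponds to ddof=0 here:

import pandas as pd
from scipy import stats

scores = pd.Series([85, 90, 75, 80, 90, 95, 85])

print("Mean    :", scores.mean())           # =AVERAGE(A1:A7)
print("Median  :", scores.median())         # =MEDIAN(A1:A7)
print("Mode    :", scores.mode().tolist())  # =MODE / =MODE.MULT (bimodal: 85 and 90)
print("Std.P   :", scores.std(ddof=0))      # =STDEV.P(A1:A7)
print("Var.P   :", scores.var(ddof=0))      # =VAR.P(A1:A7)
print("Skew    :", stats.skew(scores, bias=False))      # =SKEW (sample skewness)
print("Kurtosis:", stats.kurtosis(scores, bias=False))  # =KURT (excess kurtosis)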

Examples with Corresponding Test Cases:


● Example 1:
o Input: 85, 90, 75, 80, 90, 95, 85
o Expected Output (Mean): 85.71
● Example 2:
o Input: 85, 90, 75, 80, 90, 95, 85
o Expected Output (Median): 85
● Example 3:
o Input: 85, 90, 75, 80, 90, 95, 85
o Expected Output (Mode): 85 (=MODE returns a single value; the data is bimodal with 85 and 90, so use =MODE.MULT to return both)
● Example 4:
o Input: 85, 90, 75, 80, 90, 95, 85
o Expected Output (Standard Deviation): 6.23 (population standard deviation, =STDEV.P)
● Example 5:
o Input: 85, 90, 75, 80, 90, 95, 85
o Expected Output (Skewness): approximately -0.35 (the low score of 75 skews the distribution to the left)
Results of Lab Experiments:
1. The mean, median, and mode summarize the central tendency of the exam scores (mean ≈ 85.71, median 85).
2. The standard deviation and variance quantify how far the scores spread around the mean.
3. Skewness indicates the asymmetry of the score distribution, and kurtosis indicates how heavy its tails are.
4. Excel's built-in functions (AVERAGE, MEDIAN, MODE, STDEV.P, VAR.P, SKEW, KURT) compute these measures directly from the worksheet data.
5. Together, these descriptive statistics give a reliable picture of student performance before any further analysis.

Viva Voce Questions:

1. What does the Mean value tell us about a data set?
2. What is the purpose of calculating the Standard Deviation?
3. How do you calculate the Mode in Excel?
4. What does Skewness measure in a dataset?
5. How is Kurtosis different from Skewness?
Title: Hypothesis Testing using Z-test, T-test, and ANOVA
Experiment Number: 4
Date:
Aim:
To perform Z-test, T-test, and ANOVA in MS Excel for hypothesis testing and comparison of group means.
Procedure:
1. Open MS Excel and create a new worksheet.
2. Enter the sample datasets for which hypothesis tests will be performed.
3. Formulate the null and alternative hypotheses for each test.
4. Use the appropriate Excel functions to perform the Z-test, T-test, and ANOVA:
o Z-test: For comparing the sample mean to a known population mean.
o T-test: For comparing the means of two independent groups.
o ANOVA: For comparing means across multiple groups.
5. Interpret the results to determine whether the null hypothesis should be rejected or not rejected (fail to reject).
6. Save the worksheet.
Sample Data Set:
Group 1 (Scores) | Group 2 (Scores) | Group 3 (Scores)

85 90 88

78 83 90

92 86 85

85 91 87

88 89 90
Initialization:
1. Open MS Excel.
2. Enter the sample data in three separate columns (Group 1, Group 2, and Group 3).
Step-by-Step Process:
1. Perform Z-test:
o In cell B1, calculate the sample mean and standard deviation of Group 1.
o In cell B2, enter the formula for the Z-test: = (mean - population mean) /
(standard deviation / SQRT(sample size)).
o Perform the Z-test and interpret the result based on the critical value.
2. Perform T-test:
o In cell B3, calculate the sample means and standard deviations for Group 1 and
Group 2.
o In cell B4, enter the formula =T.TEST(array1, array2, tails, type) to perform the T-test.
o Interpret the result based on the p-value.
3. Perform ANOVA:
o Excel has no ANOVA worksheet function; use the Analysis ToolPak instead (Data > Data Analysis > Anova: Single Factor) on the three group ranges.
o Analyze the results to check if there are significant differences between the means of Group 1, Group 2, and Group 3.
4. Interpret Results:
o Compare the p-value to the significance level (usually 0.05) to determine
whether to reject or accept the null hypothesis.
o Document the findings in the worksheet.
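Outside Excel, the same tests can be sketched with Python's scipy.stats (illustrative only, assuming the group scores above and a hypothesized population mean of 80; the z-test uses the sample standard deviation because no population standard deviation is given):

import math
import statistics
from scipy import stats

group1 = [85, 78, 92, 85, 88]
group2 = [90, 83, 86, 91, 89]
group3 = [88, 90, 85, 87, 90]

# Z-test of the Group 1 mean against a hypothesized population mean of 80
mean1 = statistics.mean(group1)
s1 = statistics.stdev(group1)          # sample standard deviation
z = (mean1 - 80) / (s1 / math.sqrt(len(group1)))
print("Z statistic:", round(z, 2))

# Two-sample T-test (comparable to Excel's =T.TEST(range1, range2, 2, 2))
t_stat, t_p = stats.ttest_ind(group1, group2)
print("T-test p-value:", round(t_p, 3))

# One-way ANOVA across the three groups (Anova: Single Factor in Excel)
f_stat, a_p = stats.f_oneway(group1, group2, group3)
print("ANOVA p-value:", round(a_p, 3))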

Examples with Corresponding Test Cases:


● Example 1:
o Input: Group 1: 85, 78, 92, 85, 88; Population mean: 80
o Expected Output (Z-test): Z = 1.86 (Reject the null hypothesis if p-value < 0.05)
● Example 2:
o Input: Group 1: 85, 78, 92, 85, 88; Group 2: 90, 83, 86, 91, 89
o Expected Output (T-test): p-value = 0.054 (Fail to reject the null hypothesis if p-
value > 0.05)
● Example 3:
o Input: Group 1: 85, 78, 92, 85, 88; Group 2: 90, 83, 86, 91, 89; Group 3: 88, 90, 85,
87, 90
o Expected Output (ANOVA): p-value = 0.006 (Reject the null hypothesis if p-value
< 0.05)
● Example 4:
o Input: Group 1: 85, 78, 92, 85, 88; Group 2: 80, 75, 82, 85, 80
o Expected Output (Z-test): Z = 0.62 (Fail to reject the null hypothesis if p-value >
0.05)
● Example 5:
o Input: Group 1: 85, 78, 92, 85, 88; Group 2: 80, 85, 84, 83, 81
o Expected Output (T-test): p-value = 0.29 (Fail to reject the null hypothesis if p-
value > 0.05)
Results of Lab Experiments:
1. The Z-test compares a sample mean against a known population mean when the population standard deviation is known or the sample is large.
2. The T-test compares the means of two groups and reports a p-value for the observed difference.
3. ANOVA extends this comparison to three or more groups in a single test.
4. A p-value below the significance level (typically 0.05) leads to rejecting the null hypothesis; a larger p-value means we fail to reject it.
5. Excel's T.TEST function and the Analysis ToolPak make these tests straightforward to run on worksheet data.

Viva Voce Questions:

1. What is the primary purpose of performing a Z-test?
2. In a T-test, what does a p-value less than 0.05 indicate?
3. When should you use ANOVA over a T-test?
4. What does a large p-value (> 0.05) imply in hypothesis testing?
5. How is the p-value interpreted in hypothesis testing?
Perform Data Pre-processing Operations
Title: Data Pre-processing for Data Analysis
Experiment Number: 5
Date:
Aim:
To clean and prepare data for accurate analysis by handling missing values and normalizing the dataset.

Procedure:
1. Open MS Excel or any suitable data analysis software.
2. Import the dataset that requires cleaning (e.g., sales data, customer feedback).
3. Identify and handle missing data (either by imputation, removing rows, or using
statistical methods).
4. Normalize the dataset by scaling the features to a common range (e.g., Min-Max
scaling, Z-score normalization).
5. Document the cleaned and normalized dataset for further analysis.

Sample Data Set:


Sales | Price | Customer Rating
150   | 20    | 4.5
200   | 25    | NULL
NULL  | 30    | 3.8
180   | NULL  | 4.2
210   | 35    | 4.7

Initialization:
1. Open your data analysis software (Excel, Python, R, etc.).
2. Load the dataset to be processed.
Step-by-Step Process:
1. Identify Missing Data: Check for cells with missing values (NULL or NaN).
2. Handle Missing Data: Use imputation methods (mean, median) or remove the rows
with missing values.
3. Normalize Data: Apply Min-Max scaling or Z-score normalization to the numerical
columns.
4. Verify Data: After normalization, ensure all data values are within the expected
range.
5. Save Cleaned Data: Store the cleaned and normalized dataset for analysis.
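A minimal pandas sketch of the same pre-processing steps (illustrative only; column names follow the sample table above):

import pandas as pd

df = pd.DataFrame({
    "Sales":  [150, 200, None, 180, 210],
    "Price":  [20, 25, 30, None, 35],
    "Rating": [4.5, None, 3.8, 4.2, 4.7],
})

# Handle missing data: mean imputation for Sales/Rating, median for Price
df["Sales"] = df["Sales"].fillna(df["Sales"].mean())
df["Price"] = df["Price"].fillna(df["Price"].median())
df["Rating"] = df["Rating"].fillna(df["Rating"].mean())

# Min-Max scaling: rescale each column to the [0, 1] range
min_max = (df - df.min()) / (df.max() - df.min())

# Z-score normalization: zero mean, unit (population) standard deviation
z_score = (df - df.mean()) / df.std(ddof=0)

print(min_max.round(3))
print(z_score.round(3))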

Examples with Corresponding Test Cases:


Example 1:
● Input: Sales Data: [150, 200, NULL, 180, 210]
● Expected Output: After Imputation (mean): [150, 200, 185, 180, 210]
Example 2:
● Input: Price Data: [20, 25, 30, NULL, 35]
● Expected Output: After Imputation (median): [20, 25, 30, 27.5, 35]
Example 3:
● Input: Customer Rating: [4.5, NULL, 3.8, 4.2, 4.7]
● Expected Output: After Imputation (mean): [4.5, 4.3, 3.8, 4.2, 4.7]
Example 4:
● Input: Normalizing Sales Data: [150, 200, 180, 210]
● Expected Output: Min-Max Normalization: [0.0, 0.83, 0.5, 1.0]
Example 5:
● Input: Normalizing Price Data: [20, 25, 30, 35]
● Expected Output: Z-Score Normalization: [-1.341, -0.447, 0.447, 1.341]

Viva Voce Questions:


1. What are the common techniques for handling missing data in a dataset?
a) Mean imputation
2. Why is data normalization necessary in data analysis?
a) It brings all features onto a comparable scale
3. What is the difference between Min-Max scaling and Z-score normalization?
a) Min-Max scaling rescales values to a fixed range (e.g., 0 to 1), while Z-score normalization rescales values to zero mean and unit standard deviation
4. When should you remove rows with missing data instead of imputing the values?
a) When the dataset is small and removing rows does not lead to significant loss of
data
5. What is the potential drawback of imputing missing data with mean values?
a) It can introduce bias in the data

Perform Dimensionality Reduction Operations using PCA, KPCA & SVD


Title: Dimensionality Reduction with PCA, KPCA, and SVD
Experiment Number: 6
Date:
Aim:
To reduce the dimensionality of a dataset using PCA, KPCA, and SVD while retaining the key information needed for analysis.

Procedure:
1. Load the high-dimensional dataset to be analyzed (e.g., image dataset, customer
data).
2. Apply PCA to reduce the dataset's dimensionality while retaining the variance.
3. Use KPCA to handle non-linear data and apply dimensionality reduction in a higher-
dimensional space.
4. Perform SVD for matrix factorization and dimensionality reduction.
5. Evaluate the results to ensure that the reduced dataset retains the critical features
of the original data.

Sample Data Set:


Feature 1 | Feature 2 | Feature 3 | Feature 4
1.0       | 2.1       | 3.3       | 4.4
1.2       | 2.3       | 3.5       | 4.6
1.4       | 2.5       | 3.7       | 4.8
1.6       | 2.7       | 3.9       | 5.0

Initialization:
1. Import necessary libraries (e.g., scikit-learn, numpy) for PCA, KPCA, and SVD.
2. Load the dataset into the Python environment.

Step-by-Step Process:
1. Apply PCA:
o Use the PCA function from a library (e.g., scikit-learn) to reduce the
dimensionality of the dataset.
2. Apply KPCA:
o Use the KernelPCA function with a non-linear kernel (e.g., RBF) to reduce the dimensionality of non-linearly separable data.
3. Apply SVD:
o Factorize the data matrix using SVD (e.g., numpy.linalg.svd or scikit-learn's TruncatedSVD) and keep the top singular components.
4. Evaluate Results:
o Compare the explained variance of the reduced representations to confirm that the critical structure of the original data is retained.
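A short scikit-learn sketch of the three techniques on the sample feature matrix above (illustrative only; the kernel choice and the number of components are assumptions, not prescribed by the manual):

import numpy as np
from sklearn.decomposition import PCA, KernelPCA, TruncatedSVD

X = np.array([
    [1.0, 2.1, 3.3, 4.4],
    [1.2, 2.3, 3.5, 4.6],
    [1.4, 2.5, 3.7, 4.8],
    [1.6, 2.7, 3.9, 5.0],
])

# PCA: linear projection onto the directions of maximum variance
pca = PCA(n_components=2).fit(X)
print("PCA explained variance ratio:", pca.explained_variance_ratio_)

# Kernel PCA: non-linear variant using an RBF kernel (assumed here)
kpca = KernelPCA(n_components=2, kernel="rbf")
X_kpca = kpca.fit_transform(X)
print("KPCA projection shape:", X_kpca.shape)

# Truncated SVD: matrix factorization keeping the top singular components
svd = TruncatedSVD(n_components=2).fit(X)
print("SVD explained variance ratio:", svd.explained_variance_ratio_)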

Examples with Corresponding Test Cases:


Example 1:
● Input: Dataset: [(1.0, 2.0), (1.5, 1.8), (5.0, 8.0), (8.0, 8.5), (1.2, 1.9)], K = 2
● Expected Output: Cluster 1: [(1.0, 2.0), (1.5, 1.8), (1.2, 1.9)], Cluster 2: [(5.0, 8.0), (8.0,
8.5)]
Example 2:
● Input: Dataset: [(2.0, 3.0), (3.5, 4.5), (7.0, 8.0), (8.5, 9.5)], K = 2
● Expected Output: Cluster 1: [(2.0, 3.0), (3.5, 4.5)], Cluster 2: [(7.0, 8.0), (8.5, 9.5)]
Example 3:
● Input: Dataset: [(0.5, 1.5), (1.0, 2.0), (5.5, 7.5), (7.0, 9.0)], K = 2
● Expected Output: Cluster 1: [(0.5, 1.5), (1.0, 2.0)], Cluster 2: [(5.5, 7.5), (7.0, 9.0)]
Example 4:
● Input: Dataset: [(3.0, 4.0), (3.5, 4.5), (6.0, 7.0), (8.0, 9.0)], K = 3
● Expected Output: Cluster 1: [(3.0, 4.0), (3.5, 4.5)], Cluster 2: [(6.0, 7.0)], Cluster 3:
[(8.0, 9.0)]
Example 5:
● Input: Dataset: [(2.0, 3.0), (2.5, 3.5), (7.0, 8.0), (8.5, 9.5)], K = 2
● Expected Output: Cluster 1: [(2.0, 3.0), (2.5, 3.5)], Cluster 2: [(7.0, 8.0), (8.5, 9.5)]

Results of Lab Experiments:


1. K-means is effective in clustering data when the number of clusters is known in
advance.
2. Preprocessing such as scaling is important to ensure the K-means algorithm works
effectively, especially with high-dimensional data.
3. The value of K should ideally be chosen based on the dataset or using methods
such as the Elbow Method or Silhouette score.
4. K-means may struggle with non-spherical clusters or when clusters have different
densities.
5. The performance of K-means can be evaluated by measuring intra-cluster similarity
and inter-cluster dissimilarity.

Viva Voce Questions:


1. What is the K-means clustering algorithm used for?
a) Supervised learning
2. How do you determine the optimal number of clusters (K) for K-means?
a) Using the Elbow method
3. What is the main limitation of the K-means algorithm?
a) It assumes clusters are spherical
4. How does K-means assign a data point to a cluster?
a) Based on the closest centroid
5. What are some ways to improve K-means clustering?
a) Using a better initialization method for centroids

Decision Trees for Classification


Title: Building a Decision Tree Classifier
Experiment Number: 7
Date:
Aim:
To build a Decision Tree classifier on a labeled dataset and evaluate its performance.

Procedure:
1. Load the dataset containing features (independent variables) and labels (target
variable).
2. Preprocess the data (e.g., handle missing values and categorical variables).
3. Split the data into training and testing sets.
4. Implement the Decision Tree Classifier using a library such as scikit-learn.
5. Train the model on the training dataset and evaluate its performance on the test
dataset using accuracy or confusion matrix.

Sample Data Set:


Age | Income | Education Level | Purchased (Yes/No)
22  | 30K    | Bachelor's      | No
45  | 60K    | Master's        | Yes
36  | 45K    | Bachelor's      | Yes
52  | 80K    | PhD             | Yes
29  | 50K    | Master's        | No

Initialization:
1. Import necessary libraries (e.g., pandas, numpy, scikit-learn).
2. Load and preprocess the dataset into Python.

Step-by-Step Process:
1. Load Dataset: Import the dataset using pandas.
2. Preprocess Data: Handle missing values and convert categorical features into
numerical values (e.g., one-hot encoding for Education Level).
3. Split Dataset: Use the train_test_split() function to divide the data into training and
testing sets.
4. Train Model: Implement a Decision Tree Classifier using scikit-learn’s
DecisionTreeClassifier() and train it on the training data.
5. Evaluate Model: Assess the model's performance using metrics such as accuracy and the confusion matrix on the test data.
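A minimal scikit-learn sketch of this workflow on the sample table above (illustrative only; the one-hot encoding, the income values expressed in thousands, and the train/test split ratio are assumptions):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

df = pd.DataFrame({
    "Age": [22, 45, 36, 52, 29],
    "Income": [30, 60, 45, 80, 50],          # income in thousands (assumed)
    "Education": ["Bachelor's", "Master's", "Bachelor's", "PhD", "Master's"],
    "Purchased": ["No", "Yes", "Yes", "Yes", "No"],
})

# One-hot encode the categorical Education column
X = pd.get_dummies(df[["Age", "Income", "Education"]])
y = df["Purchased"]

# Split the (tiny) dataset; the split here is only for illustration
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# Train and evaluate the decision tree
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))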
Naive Bayes Classifier for Text Classification
Title: Implementing a Naive Bayes Classifier for Sentiment Analysis
Experiment Number: 8
Date:
Aim:
To implement a Naive Bayes classifier for sentiment analysis of text reviews.

Procedure:
1. Load the dataset of text documents, such as product reviews labeled as positive or
negative.
2. Preprocess the text data by removing stop words, special characters, and tokenizing
the text.
3. Convert the text data into numerical form using techniques like TF-IDF or Bag of
Words.
4. Train the Naive Bayes model using the MultinomialNB() function from scikit-learn.
5. Evaluate the model using metrics such as accuracy, precision, recall, and F1-score.

Sample Data Set:


Review                                     | Sentiment
This product is amazing, very useful!      | Positive
Terrible quality, do not buy this product. | Negative
I love this, will definitely buy again.    | Positive
Very disappointed, waste of money.         | Negative
Excellent value for the price.             | Positive

Initialization:
1. Import necessary libraries (e.g., pandas, numpy, scikit-learn, nltk).
2. Load the dataset into Python and preprocess the text data.

Step-by-Step Process:
1. Load Dataset: Import the text data (e.g., CSV or text file).
2. Text Preprocessing: Clean the text by removing punctuation, stop words, and
tokenizing the sentences.
3. Feature Extraction: Convert the text into a numerical representation using TF-IDF or
Bag of Words.
4. Train Model: Use scikit-learn's MultinomialNB() to train the Naive Bayes classifier on
the training data.
5. Evaluate Model: Evaluate the model's performance using accuracy, precision, recall,
and F1-score on the test data.
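A compact sketch of the pipeline with TF-IDF features and MultinomialNB (illustrative only; the tiny review set above is used as both training data and a prediction example, and the new review shown is hypothetical):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

reviews = [
    "This product is amazing, very useful!",
    "Terrible quality, do not buy this product.",
    "I love this, will definitely buy again.",
    "Very disappointed, waste of money.",
    "Excellent value for the price.",
]
labels = ["Positive", "Negative", "Positive", "Negative", "Positive"]

# Convert the text to TF-IDF features (English stop words removed)
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(reviews)

# Train the Naive Bayes classifier and predict the sentiment of a new review
model = MultinomialNB().fit(X, labels)
new_review = ["Really useful product, excellent quality."]
print(model.predict(vectorizer.transform(new_review)))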

Examples with Corresponding Test Cases:


Example 1:
● Input: Review = "This product is amazing, very useful!"
● Expected Output: Sentiment = Positive
Example 2:
● Input: Review = "Terrible quality, do not buy this product."
● Expected Output: Sentiment = Negative
Example 3:
● Input: Review = "I love this, will definitely buy again."
● Expected Output: Sentiment = Positive
Example 4:
● Input: Review = "Very disappointed, waste of money."
● Expected Output: Sentiment = Negative
Example 5:
● Input: Review = "Excellent value for the price."
● Expected Output: Sentiment = Positive

Results of Lab Experiments:


1. Naive Bayes classifiers are simple and effective for text classification tasks.
2. Preprocessing the text data significantly improves the performance of the classifier.
3. The Naive Bayes classifier works best when the features are conditionally
independent, although it can still work reasonably well even when this assumption is
violated.
4. TF-IDF is a widely used feature extraction technique for text data.
5. Naive Bayes models can be used for sentiment analysis, spam detection, and other
text classification tasks.

Viva Voce Questions:


1. What is the Naive Bayes classifier used for?
a) Classification tasks
2. What is the main assumption of the Naive Bayes classifier?
a) The features are conditionally independent given the class
3. How do you preprocess text data for use in Naive Bayes?
a) Remove stop words, punctuation, and tokenize the text
4. Which metric is commonly used to evaluate text classification models?
a) Accuracy
5. What does the TF-IDF technique do in text classification?
a) It weights terms by how frequent they are in a document and how rare they are across documents

K-Means Clustering for Data Segmentation


Title: Implementing K-Means Clustering Algorithm
Experiment Number: 9
Date:

Aim:
To segment a dataset into groups using the K-Means clustering algorithm.

Procedure:
1. Load the dataset that contains multiple features for segmentation.
2. Preprocess the data (e.g., scale features and handle missing values).
3. Implement the K-Means algorithm using KMeans from scikit-learn.
4. Specify the number of clusters, k, and run the algorithm to find cluster centroids.
5. Visualize the clustering result by plotting the data points and the cluster centers.

Sample Data Set:


Feature 1 | Feature 2

1.2 3.5

2.8 2.3

3.5 1.8

5.1 7.9

6.2 8.1

Initialization:
1. Import necessary libraries (pandas, numpy, matplotlib, sklearn).
2. Load the dataset and preprocess it (if required, like scaling the features).

Step-by-Step Process:
1. Load Dataset: Import and load the dataset into a pandas DataFrame.
2. Preprocess Data: Handle missing data and scale the features if necessary.
3. Run K-Means Algorithm: Use KMeans() from scikit-learn to create and fit the model
with k clusters.
4. Obtain Results: Extract and examine the cluster centers and labels assigned to each
data point.
5. Visualize Clusters: Plot the clusters and their centroids using matplotlib.
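A minimal K-Means sketch on the sample points above (illustrative only; k=2 and the scaling step are assumptions):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = np.array([[1.2, 3.5], [2.8, 2.3], [3.5, 1.8], [5.1, 7.9], [6.2, 8.1]])

# Scale the features so both contribute equally to the distance metric
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means with k=2 clusters
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print("Labels   :", km.labels_)
print("Centroids:", km.cluster_centers_)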

Examples with Corresponding Test Cases:


Example 1:
● Input: Dataset with points in 2D space (Feature 1, Feature 2), k=2
● Expected Output: Two clusters identified with centroids at (2, 3) and (5, 7).
Example 2:
● Input: Dataset with 3 points in 2D space, k=1
● Expected Output: All points assigned to a single cluster with centroid at the average
point.
Example 3:
● Input: Dataset with random 2D data points, k=3
● Expected Output: Three clusters identified with corresponding centroids.
Example 4:
● Input: Data points: (1,2), (2,2), (5,8), k=2
● Expected Output: Two clusters: one centered around (1.5, 2) and another around (5,
8).
Example 5:
● Input: Dataset with highly overlapping points, k=4
● Expected Output: Four clusters with minimal separation between some clusters.

Results of Lab Experiments:


1. K-Means is an effective clustering algorithm when the number of clusters is known.
2. The algorithm is sensitive to the initial placement of centroids.
3. K-Means performs well when clusters are spherical and well-separated.
4. Feature scaling improves the accuracy of clustering.
5. Outliers can negatively impact the cluster centroids.

Viva Voce Questions:

1. What is the primary purpose of K-Means clustering?
2. What happens if you set k too high or too low?
3. What is the typical shape of clusters that K-Means works well with?
4. How do you determine the optimal value of k in K-Means clustering?
5. What is the disadvantage of K-Means clustering?
Principal Component Analysis (PCA)
Title: Reducing Dimensionality Using PCA
Experiment Number: 10
Date:
Aim:
To reduce the dimensionality of a dataset using Principal Component Analysis (PCA).

Procedure:
1. Load the dataset with multiple features.
2. Standardize the data (i.e., scale the features to zero mean and unit variance).
3. Apply PCA to transform the data into principal components.
4. Visualize the explained variance ratio of the components.
5. Choose the first few components that explain the most variance and project the
data onto these components.

Sample Data Set:


Feature 1 | Feature 2 | Feature 3
Support Vector Machines (SVM)
Title: Implementing Support Vector Machines for Classification
Experiment Number: 11
Date:
Aim:
To implement a Support Vector Machine classifier and evaluate its performance on labeled data.

Procedure:
1. Load the dataset with labeled data.
2. Preprocess the data by handling missing values and scaling the features.
3. Train an SVM classifier using the SVC function from scikit-learn.
4. Evaluate the classifier's performance using metrics like accuracy, precision, and
recall.
5. Visualize the decision boundaries for two-dimensional data.

Sample Data Set:


Feature 1 | Feature 2 | Label

2.5 3.4 1

1.5 1.8 0

3.2 3.6 1

0.8 2.0 0

2.1 2.5 1

Initialization:
1. Import necessary libraries (e.g., pandas, numpy, scikit-learn, matplotlib).
2. Load and preprocess the data.

Step-by-Step Process:
1. Load Dataset: Import the dataset into a DataFrame.
2. Preprocess Data: Scale the data and handle any missing values.
3. Train SVM: Use SVC() to train the model with the specified kernel.
4. Evaluate Performance: Check performance using accuracy, precision, recall, etc.
5. Visualize Decision Boundaries: For two-dimensional data, plot the decision
boundaries.
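An illustrative scikit-learn sketch of training and evaluating the SVM on the sample points above (the linear kernel, C=1.0, and evaluation on the training data are assumptions made to keep the example small):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score

X = np.array([[2.5, 3.4], [1.5, 1.8], [3.2, 3.6], [0.8, 2.0], [2.1, 2.5]])
y = np.array([1, 0, 1, 0, 1])

# Scale the features, then train a linear-kernel SVM
X_scaled = StandardScaler().fit_transform(X)
model = SVC(kernel="linear", C=1.0).fit(X_scaled, y)

# Evaluate on the same (tiny) dataset for illustration
pred = model.predict(X_scaled)
print("Accuracy :", accuracy_score(y, pred))
print("Precision:", precision_score(y, pred))
print("Recall   :", recall_score(y, pred))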

Examples with Corresponding Test Cases:


Example 1:
● Input: Dataset with two features, class labels 0 and 1
● Expected Output: A decision boundary separating the two classes.
Example 2:
● Input: Small dataset with linearly separable classes
● Expected Output: High classification accuracy.
Example 3:
● Input: Dataset with some overlap between classes
● Expected Output: Moderate accuracy with some misclassifications.
Example 4:
● Input: Data with noise and outliers
● Expected Output: Lower classification accuracy due to sensitivity to outliers.
Example 5:
● Input: Large dataset with multiple features
● Expected Output: A well-defined decision boundary with high accuracy.

Results of Lab Experiments:


1. SVM is highly effective for binary classification problems.
2. The kernel trick allows SVM to work in non-linear decision spaces.
3. SVM is sensitive to the choice of kernel and parameters (e.g., C and gamma).
4. SVM performs well in high-dimensional spaces.
5. SVM can be prone to overfitting if not properly tuned.

Viva Voce Questions:


1. What is the main advantage of using SVM over other classification algorithms?
a) It finds a maximum-margin decision boundary and performs well in high-dimensional spaces
2. What does the kernel trick in SVM do?
a) It lets SVM learn non-linear decision boundaries by implicitly mapping data into a higher-dimensional space
3. How do you tune an SVM classifier?
a) By adjusting the kernel
4. What happens if you set C too high in an SVM?
a) The model may overfit the data
5. What are the advantages and disadvantages of using SVM in real-world datasets?
a) It works well for small, clean datasets but struggles with large datasets

Random Forest Algorithm for Classification


Title: Implementing Random Forest for Classification
Experiment Number: 12
Date:
Aim:
To implement a Random Forest classifier and evaluate its performance and feature importance.

Procedure:
1. Load the dataset that contains labeled data.
2. Preprocess the data, which includes handling missing values, encoding categorical
variables (if any), and scaling the features.
3. Train a Random Forest classifier using the RandomForestClassifier from scikit-learn.
4. Evaluate the classifier's performance using accuracy, precision, recall, and the
confusion matrix.
5. Visualize the feature importance determined by the Random Forest model.

Sample Data Set:


Feature 1 | Feature 2 | Feature 3 | Label

1.5 2.3 3.0 0

2.8 3.5 1.2 1

4.1 1.6 3.4 0

3.7 2.1 2.9 1

1.2 3.4 3.5 0

Initialization:
1. Import necessary libraries (e.g., pandas, numpy, scikit-learn, matplotlib).
2. Load and preprocess the dataset.

Step-by-Step Process:
1. Load Dataset: Import and load the dataset into a pandas DataFrame.
2. Preprocess Data: Handle missing values and encode categorical features if
required.
3. Train Random Forest: Use RandomForestClassifier() to train the model on the
dataset.
4. Evaluate Performance: Calculate accuracy, precision, recall, and confusion matrix.
5. Visualize Feature Importance: Plot the feature importance based on the trained
Random Forest model.
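A short sketch of training a Random Forest and reading its feature importances on the sample table above (illustrative only; 100 trees and evaluation on the training data are assumptions):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

X = np.array([
    [1.5, 2.3, 3.0],
    [2.8, 3.5, 1.2],
    [4.1, 1.6, 3.4],
    [3.7, 2.1, 2.9],
    [1.2, 3.4, 3.5],
])
y = np.array([0, 1, 0, 1, 0])

# Train the ensemble and predict on the same (tiny) dataset for illustration
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
pred = model.predict(X)

print("Accuracy:", accuracy_score(y, pred))
print(confusion_matrix(y, pred))
print("Feature importances:", model.feature_importances_)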

Examples with Corresponding Test Cases:


Example 1:
● Input: Dataset with three features and two classes (0 and 1), trained Random Forest
model.
● Expected Output: Accuracy score of 90%, with confusion matrix showing true
positives, true negatives, false positives, and false negatives.
Example 2:
● Input: Dataset with 5 features and 3 classes, RandomForestClassifier with 100 trees.
● Expected Output: Precision, recall, and F1-score for each class.
Example 3:
● Input: 10,000 data points with 4 features and binary labels.
● Expected Output: Feature importance plot highlighting the most important features.
Example 4:
● Input: Small dataset with a few noisy data points.
● Expected Output: Model performance slightly impacted, but Random Forest still
performs well with high robustness to noise.
Example 5:
● Input: Dataset with categorical variables (e.g., color: red, blue, green).
● Expected Output: Categorical features correctly encoded and used in the
classification model.

Results of Experiments:
1. Random Forest is robust to overfitting due to its ensemble nature.
2. Random Forest performs well with both numerical and categorical data.
3. Hyperparameter tuning, like adjusting the number of trees, significantly impacts
model performance.
4. Feature importance can help in identifying which variables most affect the
prediction.
5. Random Forest is slower to train compared to other classifiers but performs better
in complex tasks.

Viva Voce Questions:


1. What is the advantage of using Random Forest over a single decision tree?
a) It reduces overfitting by averaging the predictions of many trees
2. How does Random Forest handle overfitting?
a) By averaging multiple trees to reduce variance
3. What is feature importance in Random Forest?
a) A measure of how much each feature contributes to the model’s predictions
4. What is the role of bootstrapping in Random Forest?
a) To train each tree on a random sample of the data, making the trees diverse
5. How do you determine the optimal number of trees in Random Forest?
a) By using cross-validation
Logistic Regression for Binary Classification
Title: Implementing Logistic Regression for Binary Classification
Experiment Number: 13
Date:
Aim:
To implement Logistic Regression for binary classification and evaluate it using accuracy, the confusion matrix, and the ROC curve.

Procedure:
1. Load the dataset with binary class labels.
2. Preprocess the data (handling missing values, scaling features, etc.).
3. Train a logistic regression model using LogisticRegression from scikit-learn.
4. Evaluate the model's performance using accuracy, confusion matrix, and ROC curve.
5. Tune the model by adjusting the regularization parameter (C) and observing
performance.

Sample Data Set:


Feature 1 | Feature 2 | Label
1.2 3.5 0
3.8 1.5 1

4.5 3.1 1

2.8 2.7 0

Initialization:
1. Import necessary libraries (e.g., pandas, sklearn.linear_model, sklearn.metrics, matplotlib).
2. Load the dataset and preprocess it.

Step-by-Step Process:
1. Load Dataset: Import the dataset into a pandas DataFrame.
2. Preprocess Data: Handle missing values and scale the features.
3. Train Logistic Regression: Use LogisticRegression() to train the model.
4. Evaluate Performance: Calculate accuracy and plot the ROC curve.
5. Tune Model: Adjust the regularization parameter C to improve performance.
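A compact sketch of the logistic regression workflow on the sample points above (illustrative only; C=1.0 and evaluation on the training data are assumptions made to keep the example self-contained):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

X = np.array([[1.2, 3.5], [3.8, 1.5], [4.5, 3.1], [2.8, 2.7]])
y = np.array([0, 1, 1, 0])

# Scale the features, then fit logistic regression with regularization strength C
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression(C=1.0).fit(X_scaled, y)

pred = model.predict(X_scaled)
prob = model.predict_proba(X_scaled)[:, 1]
print("Accuracy:", accuracy_score(y, pred))
print(confusion_matrix(y, pred))
print("ROC AUC :", roc_auc_score(y, prob))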

Examples with Corresponding Test Cases:


Example 1:
● Input: Dataset with two features and binary class labels.
● Expected Output: A model with an accuracy of 85% and a ROC AUC score of 0.88.
Example 2:
● Input: Dataset with three features and binary class labels, C=0.5.
● Expected Output: Model accuracy of 90%, confusion matrix showing correct
classification of both classes.
Example 3:
● Input: Dataset with missing values in the features.
● Expected Output: Model handles missing values using imputation and performs
well.
Example 4:
● Input: Dataset with highly skewed features.
● Expected Output: Logistic regression adjusts the weights for skewed features
appropriately.
Example 5:
● Input: Small dataset with a regularization strength of C=1.
● Expected Output: Model with moderate accuracy, confusion matrix showing
balance between precision and recall.
Results of Experiments:
1. Logistic Regression is a simple and effective algorithm for binary classification
problems.
2. Regularization (parameter C) plays a crucial role in preventing overfitting.
3. The ROC curve and AUC score are important metrics for evaluating binary
classifiers.
4. Logistic Regression may perform poorly with highly non-linear data.
5. Feature scaling is important for Logistic Regression to perform optimally.

Viva Voce Questions:


1. What is the purpose of the logistic function in Logistic Regression?
a) To map predictions to probabilities
2. What does the regularization parameter C control in Logistic Regression?
a) The inverse of the regularization strength (smaller C means stronger regularization)
3. How do you interpret the coefficients in a Logistic Regression model?
a) As weights that influence the decision boundary
4. What is the role of the ROC curve in evaluating Logistic Regression?
a) To show the trade-off between the true positive rate and the false positive rate across thresholds
5. How would you handle imbalanced classes in Logistic Regression?
a) By using class weights in the model


Support Vector Machine (SVM) for Binary Classification


Title: Implementing Support Vector Machine for Binary Classification
Experiment Number: 14
Date:
Aim:
To implement a Support Vector Machine for binary classification and visualize its decision boundary.
Procedure:
1. Load the dataset with binary class labels.
2. Preprocess the dataset by handling missing values, scaling the features, and
encoding categorical variables (if any).
3. Train an SVM classifier using the SVC() function from scikit-learn.
4. Evaluate the classifier’s performance using accuracy, precision, recall, and F1-score.
5. Visualize the decision boundary and support vectors.

Sample Data Set:


Feature 1 | Feature 2 | Label

1.2 3.4 0

2.1 3.5 1

2.8 1.5 0

3.6 2.8 1

1.4 3.2 0

Initialization:
1. Import necessary libraries (e.g., pandas, numpy, scikit-learn, matplotlib).
2. Load the dataset and preprocess it.

Step-by-Step Process:
1. Load Dataset: Load the dataset into a pandas DataFrame.
2. Preprocess Data: Handle missing values, scale the features using StandardScaler,
and encode categorical variables if present.
3. Train SVM: Use the SVC() function to train the model with the dataset.
4. Evaluate Performance: Calculate the accuracy, precision, recall, and F1-score using
scikit-learn metrics.
5. Visualize Decision Boundary: Plot the decision boundary using a scatter plot to
visualize the SVM classifier’s behavior.
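To complement the steps above, here is an illustrative sketch of plotting the decision regions for the two-feature sample data (the RBF kernel, the grid range, and its resolution are assumptions):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

X = np.array([[1.2, 3.4], [2.1, 3.5], [2.8, 1.5], [3.6, 2.8], [1.4, 3.2]])
y = np.array([0, 1, 0, 1, 0])

model = SVC(kernel="rbf", C=1.0).fit(X, y)

# Evaluate the classifier over a grid covering the feature space
xx, yy = np.meshgrid(np.linspace(0, 5, 200), np.linspace(0, 5, 200))
zz = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Shade the predicted regions, then overlay the training points
plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("SVM decision regions")
plt.show()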

Examples with Corresponding Test Cases:


Example 1:
● Input: Dataset with two features and binary labels (0 and 1).
● Expected Output: A model that correctly classifies data points with 85% accuracy.
Example 2:
● Input: Dataset with four features and 1000 data points, trained with an RBF kernel.
● Expected Output: A model that achieves 92% accuracy, and a confusion matrix
showing correct classification of both classes.
Example 3:
● Input: Small dataset with noisy data.
● Expected Output: The SVM classifier performs well despite noise, with a precision
of 0.85 and recall of 0.88.
Example 4:
● Input: Linearly separable data with two features.
● Expected Output: A clear decision boundary separating the classes with 100%
accuracy.
Example 5:
● Input: Imbalanced dataset with many more instances of class 0 than class 1.
● Expected Output: Model performs well but may have a slightly higher recall for
class 0 due to class imbalance.

Results of Experiments:
1. SVM is effective for both linearly separable and non-linearly separable data.
2. The kernel function plays a critical role in SVM's performance.
3. SVM is sensitive to the choice of the regularization parameter C and kernel type.
4. Preprocessing like scaling is essential for SVM to perform optimally.
5. SVM can be slow to train on large datasets but performs well on smaller datasets
with fewer features.

Viva Voce Questions:


1. What is the role of the kernel function in SVM?
a) It defines the similarity measure that lets SVM form non-linear decision boundaries
2. What happens if you set a very high value for the regularization parameter C in
SVM?
a) The model may overfit the data
3. What is the primary advantage of using an SVM with an RBF kernel?
a) It works well for high-dimensional data
4. What is a support vector in the context of SVM?
a) A training point that lies closest to the decision boundary and defines the margin
5. Why is feature scaling important for SVM?
a) Because SVM is distance-based, features on larger scales would otherwise dominate the margin
K-Means Clustering Algorithm
Title: Implementing K-Means Clustering
Experiment Number: 15
Date:
Aim:
To implement K-Means clustering, evaluate it with the silhouette score, and determine a suitable number of clusters.

Procedure:
1. Load a dataset for unsupervised learning (no labels).
2. Preprocess the data by handling missing values and scaling features.
3. Implement the K-Means algorithm using KMeans() from scikit-learn.
4. Evaluate the model by examining the silhouette score and visualizing the clusters.
5. Experiment with different values of K and determine the optimal number of clusters.

Sample Data Set:


Feature 1 | Feature 2

1.1 3.2

2.5 3.8

3.2 1.4

4.1 2.5

5.0 3.7
Initialization:
1. Import necessary libraries (e.g., pandas, numpy, scikit-learn, matplotlib).
2. Load and preprocess the dataset.

Step-by-Step Process:
1. Load Dataset: Import the dataset into a pandas DataFrame.
2. Preprocess Data: Handle missing values and scale the data using StandardScaler.
3. Train K-Means: Apply the K-Means algorithm using KMeans() with a specified
number of clusters.
4. Evaluate Clusters: Use the silhouette score to evaluate the clustering performance.
5. Visualize Results: Plot the clusters and their centroids to visualize how the data
points are grouped.
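An illustrative sketch of fitting K-Means for several values of K and comparing silhouette scores on the sample points above (the range of K values tried is an assumption):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1.1, 3.2], [2.5, 3.8], [3.2, 1.4], [4.1, 2.5], [5.0, 3.7]])
X_scaled = StandardScaler().fit_transform(X)

# Try K = 2..4 and report the silhouette score for each clustering
for k in range(2, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    score = silhouette_score(X_scaled, km.labels_)
    print(f"K={k}: silhouette score = {score:.3f}")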

Examples with Corresponding Test Cases:


Example 1:
● Input: Dataset with two features, applied K-Means with K=3.
● Expected Output: A silhouette score of 0.75 and 3 well-defined clusters.
Example 2:
● Input: Dataset with five features, applied K-Means with K=2.
● Expected Output: Two clusters with centroids at distinct positions and a silhouette
score of 0.6.
Example 3:
● Input: Randomly distributed data with K=5.
● Expected Output: A visualization showing five clusters, each with an equal number
of data points.
Example 4:
● Input: Data with a large number of outliers.
● Expected Output: Clusters that are less compact, with some outliers placed outside
the main clusters.
Example 5:
● Input: Data with a mix of numeric and categorical variables.
● Expected Output: Successful clustering after encoding the categorical variables.

Results of Experiments:
1. K-Means is a simple and effective clustering algorithm for well-separated data.
2. The number of clusters (K) significantly impacts the results; selecting the right K is
crucial.
3. K-Means tends to struggle with highly skewed or non-globular data.
4. Clusters are often easier to visualize in two or three dimensions but become difficult
with higher-dimensional data.
5. Using the elbow method and silhouette score helps determine the optimal number
of clusters.

Viva Voce Questions:


1. What does the silhouette score measure in K-Means clustering?
a) How well each point fits its own cluster compared with the nearest other cluster
2. How do you determine the optimal number of clusters (K) in K-Means?
a) By using the elbow method or silhouette score
3. What is the main limitation of K-Means clustering?
a) The number of clusters must be chosen in advance, and clusters are assumed to be roughly spherical
4. How do you handle categorical data in K-Means clustering?
a) By encoding it numerically (e.g., one-hot encoding) or using variants such as K-Modes
5. What happens when K is set too high in K-Means clustering?
a) Clusters become fragmented and less meaningful (over-segmentation)
