
CCW331 – Business Analytics Lab

Lab Manual III/CSE /6th Semester

Name of the Student:

Register Number:

Class/Sec:

RAJALAKSHMI INSTITUTE OF TECHNOLOGY, CHENNAI


An Autonomous Institution, Affiliated to Anna University, Chennai
REGULATIONS 2023

I VISION OF THE DEPARTMENT

To establish a pioneering presence in the domain of Computer Science and Engineering by delivering excellence in technical learning that nurtures a culture of
innovation, research and competent professionalism.
II MISSION OF THE DEPARTMENT

To empower the next generation of Computer Science and Engineering professionals attuned to industry evolution and national progress.

To drive continuous advancement in teaching and learning standards in the field of Computer Science and Engineering by fostering an enriching research milieu.

To forge a strong and symbiotic relationship with industry, facilitating dynamic interactions that bridge academia and real-world applications, nurturing innovation and productive collaboration.

III PROGRAM EDUCATIONAL OBJECTIVES (PEOs)

Graduates can
Pursue higher education and research or have a successful career in industries associated with Computer Science and Engineering.
Adapt to emerging technological changes for global social benefit.
Become entrepreneurs in the field of Computer Science and Engineering, inculcating research and innovation.

IV PROGRAM OUTCOMES (POs)


1. Engineering Knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.
2. Problem Analysis: Identify, formulate, review research literature, and analyze complex engineering problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.
3. Design/Development of Solutions: Design solutions for complex engineering problems and design system components or processes that meet the specified needs with appropriate consideration for public health and safety, and the cultural, societal, and environmental considerations.
4. Conduct Investigations of Complex Problems: Use research-based knowledge and research methods, including design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid conclusions.
5. Modern Tool Usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools, including prediction and modelling, to complex engineering activities with an understanding of the limitations.
6. The Engineer and Society: Apply reasoning informed by the contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional engineering practice.
7. Environment and Sustainability: Understand the impact of the professional engineering solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for, sustainable development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the engineering practice.
9. Individual and Team Work: Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.
11. Project Management and Finance: Demonstrate knowledge and understanding of the engineering and management principles and apply these to one's own work, as a member and leader in a team, to manage projects and in multidisciplinary environments.
12. Lifelong Learning: Recognize the need for, and have the preparation and ability to engage in independent and lifelong learning in the broadest context of technological change.

V PROGRAM SPECIFIC OUTCOMES (PSOs)

The students will be able to


Design and programming skills to build and automate business solutions using cutting-edge technologies.
Strengthen the theoretical foundation leading to excellence and excitement towards research, to provide elegant solutions to complex problems.

Work effectively with various engineering fields as a team to design, build and develop system applications.

List of Experiments
Cycle-1 – MS Excel

1. Explore the features of Ms-Excel.


2. (i) Get the input from user and perform numerical operations (MAX, MIN, AVG, SUM,
SQRT, ROUND)
ii) Perform data import/export operations for different file formats.
3. Perform statistical operations - Mean, Median, Mode, Standard deviation, Variance, Skewness, Kurtosis
4. Perform Z-test, T-test & ANOVA
5. Perform data pre-processing operations i) Handling Missing data ii) Normalization

Cycle-2

6. Perform dimensionality reduction operation using PCA, KPCA & SVD


7. Perform bivariate and multivariate analysis on the dataset.
8. Apply and explore various plotting functions on the data set.
II Cycle – Power BI Desktop
9. Explore the features of Power BI Desktop
10. Prepare & Load data

Cycle-3
11. Develop the data model
12. Perform DAX calculations
13. Design a report
14. Create a dashboard and perform data analysis
15. Presentation of a case study

Experiment Objectives and Case Study Integration

1. Explore the features of MS Excel
   Objective: To understand and utilize the basic and advanced functionalities of MS Excel for data organization and analysis.
   Case Study: Employee Performance Analysis

2(i). Get the input from user and perform numerical operations (MAX, MIN, AVG, SUM, SQRT, ROUND)
   Objective: To perform fundamental numerical operations for summarizing and analyzing data.
   Case Study: Sales Data Processing

2(ii). Perform data import/export operations for different file formats
   Objective: To handle data in various formats for seamless integration and analysis.
   Case Study: Customer Feedback Integration

3. Perform statistical operations - Mean, Median, Mode, Standard deviation, Variance, Skewness, Kurtosis
   Objective: To apply descriptive statistics for understanding data distribution and variability.
   Case Study: Student Exam Results Analysis

4. Perform Z-test, T-test & ANOVA
   Objective: To conduct inferential statistical tests for hypothesis testing and comparisons.
   Case Study: Treatment Effectiveness Study

5. Perform data pre-processing operations i) Handling Missing data ii) Normalization
   Objective: To clean and prepare data for accurate analysis by addressing missing values and ensuring consistency.
   Case Study: Real Estate Price Prediction

6. Perform dimensionality reduction operation using PCA, KPCA & SVD
   Objective: To reduce data complexity while retaining key information for analysis.
   Case Study: High-Dimensional Image Compression

7. Perform bivariate and multivariate analysis on the dataset
   Objective: To explore relationships between variables and understand complex interactions.
   Case Study: Market Basket Analysis

8. Apply and explore various plotting functions on the dataset
   Objective: To visualize data for patterns and insights using appropriate plotting techniques.
   Case Study: Weather Trends Visualization

9. Explore the features of Power BI Desktop
   Objective: To familiarize with Power BI for creating data-driven reports and dashboards.
   Case Study: Financial Risk Analysis

10. Prepare & Load data
   Objective: To prepare data for analysis by cleaning, transforming, and importing it into Power BI.
   Case Study: Retail Store Analytics

11. Develop the data model
   Objective: To create structured data models for effective query execution and analysis.
   Case Study: E-Commerce Sales Analysis

12. Perform DAX calculations
   Objective: To apply advanced analytics using Data Analysis Expressions (DAX) in Power BI.
   Case Study: Budget vs Actuals Comparison

13. Design a report
   Objective: To create visually appealing and informative reports for stakeholders.
   Case Study: Social Media Engagement Report

14. Create a dashboard and perform data analysis
   Objective: To build interactive dashboards that enable dynamic exploration and insights.
   Case Study: Health Metrics Dashboard

15. Presentation of a case study
   Objective: To summarize and communicate findings from data analysis effectively.
   Case Study: Comprehensive Data Storytelling
Title: Explore the Features of MS Excel (Case Study: Employee Performance Analysis)
Experiment Number: 1
Date:
Aim:
To understand and utilize the basic and advanced functionalities of MS Excel for organizing and analyzing employee performance data.
Procedure
1. Open MS Excel and create a new workbook.
2. Import the provided dataset or manually input the sample data.
3. Explore the following Excel features:
o Sorting and filtering data
o Applying conditional formatting
o Using basic formulas (SUM, AVERAGE, MAX, MIN)
o Creating a pivot table
o Generating charts (e.g., bar chart, line chart)
4. Perform operations to analyze the performance data:
o Calculate the total hours worked by each employee.
o Identify the top-performing employee based on scores.
o Visualize performance trends using charts.
5. Save and export the results.
Sample Data Set
Employee ID | Name          | Department | Hours Worked | Task Score | Feedback Score
E001        | Alice Johnson | HR         | 40           | 85         | 4.5
E002        | Bob Smith     | IT         | 38           | 90         | 4.8
E003        | Charlie Davis | Sales      | 42           | 78         | 4.2
E004        | Diana Brown   | Marketing  | 35           | 88         | 4.7

Initialization
1. Open MS Excel on your computer.
2. Input the sample dataset into the first sheet of the workbook.
3. Save the file as Employee_Performance.xlsx.
Step-by-Step Process
Step 1: Sorting and Filtering
● Objective: Identify top-performing employees in each department.
● Action:
o Apply sorting on "Task Score" in descending order.
o Filter by "Department" to view individual department performance.
Step 2: Using Formulas
● Objective: Calculate total scores.
● Example Formula:
o =SUM(E2, F2) (Task Score + Feedback Score) for each employee, assuming Task Score is in column E and Feedback Score in column F.
o Use the AVERAGE formula to find the mean score across all employees.
Step 3: Creating Pivot Tables
● Objective: Summarize hours worked by department.
● Action:
o Select the data, then choose Insert > PivotTable.
o Add "Department" to Rows and "Hours Worked" to Values.
Step 4: Generating Charts
● Objective: Visualize performance trends.
● Action:
o Select the data, then choose Insert > Bar Chart.
o Customize chart title as "Employee Performance Trends."
Step 5: Conditional Formatting
● Objective: Highlight top performers.
● Action:
o Apply conditional formatting to "Task Score" to highlight scores >85.
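For cross-checking the Excel results, the same analysis can be sketched in Python with pandas. This is an illustrative, hypothetical script (not part of the prescribed Excel procedure), assuming the sample dataset above:

import pandas as pd

# Sample dataset from the experiment
df = pd.DataFrame({
    "EmployeeID": ["E001", "E002", "E003", "E004"],
    "Name": ["Alice Johnson", "Bob Smith", "Charlie Davis", "Diana Brown"],
    "Department": ["HR", "IT", "Sales", "Marketing"],
    "HoursWorked": [40, 38, 42, 35],
    "TaskScore": [85, 90, 78, 88],
    "FeedbackScore": [4.5, 4.8, 4.2, 4.7],
})

# Step 1: sort by Task Score (descending), like the Excel sort
print(df.sort_values("TaskScore", ascending=False))

# Step 3: pivot-table equivalent - total hours worked per department
print(df.pivot_table(index="Department", values="HoursWorked", aggfunc="sum"))

# Step 5: flag top performers (Task Score > 85), like conditional formatting
print(df[df["TaskScore"] > 85][["Name", "TaskScore"]])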
Test Cases
Test Case 1: Sorting Employees by Task Score
● Input: Employee data
● Action: Sort "Task Score" in descending order.
● Expected Output: Bob Smith appears as the first record.
Test Case 2: Total Hours Worked by Department
● Input: Employee data table
● Action: Create a pivot table for "Department" and "Hours Worked."
● Expected Output: Total hours for IT department displayed as 38.
Test Case 3: Top Performer Identification
● Input: Employee data table
● Action: Identify the employee with the highest "Task Score."
● Expected Output: Bob Smith with a score of 90.
Test Case 4: Visualization of Performance Trends
● Input: Employee performance scores
● Action: Generate a bar chart.
● Expected Output: Bar chart with scores visualized per employee.
Test Case 5: Conditional Formatting
● Input: Employee task scores
● Action: Apply conditional formatting for scores >85.
● Expected Output: Rows with scores >85 are highlighted.
Result
The analysis highlighted the following:
1. The IT department had the highest total hours worked.
2. Bob Smith was the top performer with a task score of 90.
3. Visualization of feedback scores revealed consistent high ratings across
departments.
Viva Voce Questions with Multiple Choices
1. Logical Problem-Solving
If an employee’s task score starts at 50 and increases by 10% each month for three months,
what will their score be at the end of the third month?
● (a) 55.5
● (b) 60.5
● (c) 66.55
● (d) 72.6
Answer: (c) 66.55

2. Data Analysis
In a dataset with missing values, replacing missing values with the dataset’s median is
preferred when:
● (a) The data contains extreme outliers.
● (b) The dataset is small.
● (c) The missing values are randomly distributed.
● (d) The data is normally distributed.

3. Probability and Statistics


If a dataset contains the values [2, 4, 6, 8, 10], what is the probability of
randomly selecting an even number?
● (a) 20%
● (b) 40%
● (c) 60%
● (d) 100%

4. Visualization
Which of the following charts is most suitable for visualizing the relationship between two
continuous variables?
● (a) Bar Chart
● (b) Scatter Plot
● (c) Histogram
● (d) Line Graph

5. Matrix Operations
If the covariance matrix of a dataset is \begin{bmatrix} 3 & 1 \\ 1 & 2 \end{bmatrix}, what is the sum of the eigenvalues?
● (a) 4
● (b) 5
● (c) 6
● (d) 7

6. Dimensionality Reduction
What is the primary goal of Principal Component Analysis (PCA)?
● (a) Increase the number of features in the dataset.
● (b) Reduce overfitting in machine learning models.
● (c) Compress data by projecting it into fewer dimensions while preserving variance.
● (d) Improve the speed of data collection.
Numerical Operations in MS Excel
Experiment Number: 2(i)
Date:
Aim:
To get input from the user and perform numerical operations (MAX, MIN, AVG, SUM, SQRT, ROUND) for summarizing and analyzing sales data.
Procedure:
1. Open MS Excel and create a new worksheet.
2. Enter a sample data set in a column (e.g., Sales Data).
3. Utilize Excel functions to perform various numerical operations:
o MAX: Find the maximum value in the dataset.
o MIN: Find the minimum value in the dataset.
o AVG: Calculate the average of the dataset.
o SUM: Compute the sum of the dataset.
o SQRT: Calculate the square root of individual values.
o ROUND: Round the results to the nearest integer.
4. Display the results in the adjacent column for each operation.
5. Save the worksheet.
Sample Data Set:
Sales Data (in $)

1200

850

2100

1300

950

1700

1150
Initialization:
1. Open MS Excel.
2. Enter the sample data set in a column (A1 to A7).
Step-by-Step Process:
1. Perform MAX Operation:
o In cell B1, enter the formula =MAX(A1:A7) to find the maximum sales value.
2. Perform MIN Operation:
o In cell B2, enter the formula =MIN(A1:A7) to find the minimum sales value.
3. Calculate AVG (Average):
o In cell B3, enter the formula =AVERAGE(A1:A7) to calculate the average of the
sales data.
4. Calculate SUM:
o In cell B4, enter the formula =SUM(A1:A7) to find the sum of the sales data.
5. Calculate SQRT (Square Root):
o In cell B5, enter the formula =SQRT(A1) to calculate the square root of the first value in the dataset.
6. Apply ROUND:
o In cell B6, enter the formula =ROUND(B5, 0) to round the square root to the nearest integer.
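The worksheet results can be verified outside Excel with a short Python sketch (illustrative only, using the sales values above; the Excel formulas remain the prescribed method):

import math

sales = [1200, 850, 2100, 1300, 950, 1700, 1150]

print("MAX  :", max(sales))                   # =MAX(A1:A7)
print("MIN  :", min(sales))                   # =MIN(A1:A7)
print("AVG  :", sum(sales) / len(sales))      # =AVERAGE(A1:A7)
print("SUM  :", sum(sales))                   # =SUM(A1:A7)
print("SQRT :", math.sqrt(sales[0]))          # =SQRT(A1)
print("ROUND:", round(math.sqrt(sales[0])))   # =ROUND(B5, 0)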
Examples with Corresponding Test Cases:
● Example 1:
o Input: 1200, 850, 2100, 1300, 950, 1700, 1150
o Expected Output (MAX): 2100
● Example 2:
o Input: 1200, 850, 2100, 1300, 950, 1700, 1150
o Expected Output (MIN): 850
● Example 3:
o Input: 1200, 850, 2100, 1300, 950, 1700, 1150
o Expected Output (AVG): 1321.43 (approximately)
● Example 4:
o Input: 1200, 850, 2100, 1300, 950, 1700, 1150
o Expected Output (SUM): 9250
● Example 5:
o Input: 1200
o Expected Output (SQRT): 34.64

Results of Lab Experiments:


1. The maximum value of sales data is calculated correctly using the MAX function.
2. The minimum sales value is identified successfully using the MIN function.
3. The average sales value is computed accurately using the AVERAGE function.
4. The sum of sales data matches the expected result using the SUM function.
5. The square root operation works for individual sales data entries.
6. Rounding of numbers is effective when using the ROUND function.
7. Data entry is validated for accuracy and consistency.
8. The formulas correctly update when new data is added to the worksheet.
9. The worksheet is saved and formatted properly for presentation.
10. The operations are reproducible with new data inputs.
11. The results of MAX, MIN, AVG, SUM, and SQRT are displayed in the worksheet.
12. Data sets can be modified to test different operations.
13. Numerical operations provide accurate insights for decision-making.
14. The use of Excel functions reduces manual calculation errors.
15. The user is able to apply learned functions to real-world datasets effectively.

Viva Voce Questions:

1. What is the significance of the MAX and MIN functions in data analysis?
2. How does the AVERAGE function differ from the SUM function in Excel?
3. What is the purpose of the ROUND function in Excel?
4. In what scenarios would you use the SQRT function in Excel?
5. Why is it important to clean and format data before performing numerical operations in Excel?
Statistical Operations for Data Analysis in MS Excel
Experiment Number: 3
Date:
Aim:
To apply descriptive statistical operations (Mean, Median, Mode, Standard Deviation, Variance, Skewness, Kurtosis) for understanding data distribution and variability.
Procedure:
1. Open MS Excel and create a new worksheet.
2. Enter a sample dataset in a column (e.g., Student Exam Scores).
3. Utilize Excel functions to perform various statistical operations:
o Mean: Calculate the average of the dataset.
o Median: Find the middle value of the dataset.
o Mode: Identify the most frequent value in the dataset.
o Standard Deviation: Calculate the spread of data from the mean.
o Variance: Determine how much the values in the dataset differ from the mean.
o Skewness: Measure the asymmetry of the data distribution.
o Kurtosis: Measure the 'tailedness' of the data distribution.
4. Display the results in the adjacent column for each operation.
5. Save the worksheet.
Sample Data Set:
Student Exam Scores

85

90

75

80

90

95

85
Initialization:
1. Open MS Excel.
2. Enter the sample dataset in a column (A1 to A7).
Step-by-Step Process:
1. Calculate the Mean:
o In cell B1, enter the formula =AVERAGE(A1:A7) to calculate the average score of
the students.
2. Calculate the Median:
o In cell B2, enter the formula =MEDIAN(A1:A7) to find the median score.
3. Find the Mode:
o In cell B3, enter the formula =MODE(A1:A7) to identify the most frequent score.
4. Calculate the Standard Deviation:
o In cell B4, enter the formula =STDEV.P(A1:A7) to compute the standard
deviation.
5. Calculate the Variance:
o In cell B5, enter the formula =VAR.P(A1:A7) to calculate the variance.
6. Calculate Skewness:
o In cell B6, enter the formula =SKEW(A1:A7) to measure the skewness of the
distribution.
7. Calculate Kurtosis:
o In cell B7, enter the formula =KURT(A1:A7) to determine the kurtosis of the
distribution.
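The worksheet values can be verified with a short Python sketch using pandas and scipy (illustrative only; the Excel formulas above remain the prescribed method). Note that STDEV.P and VAR.P are population statistics, which corresponds to ddof=0 here:

import pandas as pd
from scipy import stats

scores = pd.Series([85, 90, 75, 80, 90, 95, 85])

print("Mean    :", scores.mean())           # =AVERAGE(A1:A7)
print("Median  :", scores.median())         # =MEDIAN(A1:A7)
print("Mode    :", scores.mode().tolist())  # =MODE / =MODE.MULT (bimodal: 85 and 90)
print("Std.P   :", scores.std(ddof=0))      # =STDEV.P(A1:A7)
print("Var.P   :", scores.var(ddof=0))      # =VAR.P(A1:A7)
print("Skew    :", stats.skew(scores, bias=False))      # =SKEW (sample skewness)
print("Kurtosis:", stats.kurtosis(scores, bias=False))  # =KURT (excess kurtosis)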

Examples with Corresponding Test Cases:


● Example 1:
o Input: 85, 90, 75, 80, 90, 95, 85
o Expected Output (Mean): 85.71
● Example 2:
o Input: 85, 90, 75, 80, 90, 95, 85
o Expected Output (Median): 85
● Example 3:
o Input: 85, 90, 75, 80, 90, 95, 85
o Expected Output (Mode): 85 (=MODE returns a single value; the data is bimodal with 85 and 90, so use =MODE.MULT to return both)
● Example 4:
o Input: 85, 90, 75, 80, 90, 95, 85
o Expected Output (Standard Deviation): 6.23 (population standard deviation, =STDEV.P)
● Example 5:
o Input: 85, 90, 75, 80, 90, 95, 85
o Expected Output (Skewness): approximately -0.35 (the low score of 75 skews the distribution to the left)
Results of Lab Experiments:
1. The mean, median, and mode summarize the central tendency of the exam scores (mean ≈ 85.71, median 85).
2. The standard deviation and variance quantify how far the scores spread around the mean.
3. Skewness indicates the asymmetry of the score distribution, and kurtosis indicates how heavy its tails are.
4. Excel's built-in functions (AVERAGE, MEDIAN, MODE, STDEV.P, VAR.P, SKEW, KURT) compute these measures directly from the worksheet data.
5. Together, these descriptive statistics give a reliable picture of student performance before any further analysis.

Viva Voce Questions:

1. What does the Mean value tell us about a data set?
2. What is the purpose of calculating the Standard Deviation?
3. How do you calculate the Mode in Excel?
4. What does Skewness measure in a dataset?
5. How is Kurtosis different from Skewness?
Title: Hypothesis Testing using Z-test, T-test, and ANOVA
Experiment Number: 4
Date:
Aim:
To perform Z-test, T-test, and ANOVA in MS Excel for hypothesis testing and comparison of group means.
Procedure:
1. Open MS Excel and create a new worksheet.
2. Enter the sample datasets for which hypothesis tests will be performed.
3. Formulate the null and alternative hypotheses for each test.
4. Use the appropriate Excel functions to perform the Z-test, T-test, and ANOVA:
o Z-test: For comparing the sample mean to a known population mean.
o T-test: For comparing the means of two independent groups.
o ANOVA: For comparing means across multiple groups.
5. Interpret the results to determine whether the null hypothesis should be rejected or not rejected (fail to reject).
6. Save the worksheet.
Sample Data Set:
Group 1 (Scores) | Group 2 (Scores) | Group 3 (Scores)

85 90 88

78 83 90

92 86 85

85 91 87

88 89 90
Initialization:
1. Open MS Excel.
2. Enter the sample data in three separate columns (Group 1, Group 2, and Group 3).
Step-by-Step Process:
1. Perform Z-test:
o In cell B1, calculate the sample mean and standard deviation of Group 1.
o In cell B2, enter the formula for the Z-test: = (mean - population mean) /
(standard deviation / SQRT(sample size)).
o Perform the Z-test and interpret the result based on the critical value.
2. Perform T-test:
o In cell B3, calculate the sample means and standard deviations for Group 1 and
Group 2.
o In cell B4, enter the formula =T.TEST(array1, array2, tails, type) to perform the T-test.
o Interpret the result based on the p-value.
3. Perform ANOVA:
o Excel has no ANOVA worksheet function; use the Analysis ToolPak instead (Data > Data Analysis > Anova: Single Factor) on the three group ranges.
o Analyze the results to check if there are significant differences between the means of Group 1, Group 2, and Group 3.
4. Interpret Results:
o Compare the p-value to the significance level (usually 0.05) to determine
whether to reject or accept the null hypothesis.
o Document the findings in the worksheet.
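Outside Excel, the same tests can be sketched with Python's scipy.stats (illustrative only, assuming the group scores above and a hypothesized population mean of 80; the z-test uses the sample standard deviation because no population standard deviation is given):

import math
import statistics
from scipy import stats

group1 = [85, 78, 92, 85, 88]
group2 = [90, 83, 86, 91, 89]
group3 = [88, 90, 85, 87, 90]

# Z-test of the Group 1 mean against a hypothesized population mean of 80
mean1 = statistics.mean(group1)
s1 = statistics.stdev(group1)          # sample standard deviation
z = (mean1 - 80) / (s1 / math.sqrt(len(group1)))
print("Z statistic:", round(z, 2))

# Two-sample T-test (comparable to Excel's =T.TEST(range1, range2, 2, 2))
t_stat, t_p = stats.ttest_ind(group1, group2)
print("T-test p-value:", round(t_p, 3))

# One-way ANOVA across the three groups (Anova: Single Factor in Excel)
f_stat, a_p = stats.f_oneway(group1, group2, group3)
print("ANOVA p-value:", round(a_p, 3))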

Examples with Corresponding Test Cases:


● Example 1:
o Input: Group 1: 85, 78, 92, 85, 88; Population mean: 80
o Expected Output (Z-test): Z = 1.86 (Reject the null hypothesis if p-value < 0.05)
● Example 2:
o Input: Group 1: 85, 78, 92, 85, 88; Group 2: 90, 83, 86, 91, 89
o Expected Output (T-test): p-value = 0.054 (Fail to reject the null hypothesis if p-
value > 0.05)
● Example 3:
o Input: Group 1: 85, 78, 92, 85, 88; Group 2: 90, 83, 86, 91, 89; Group 3: 88, 90, 85,
87, 90
o Expected Output (ANOVA): p-value = 0.006 (Reject the null hypothesis if p-value
< 0.05)
● Example 4:
o Input: Group 1: 85, 78, 92, 85, 88; Group 2: 80, 75, 82, 85, 80
o Expected Output (Z-test): Z = 0.62 (Fail to reject the null hypothesis if p-value >
0.05)
● Example 5:
o Input: Group 1: 85, 78, 92, 85, 88; Group 2: 80, 85, 84, 83, 81
o Expected Output (T-test): p-value = 0.29 (Fail to reject the null hypothesis if p-
value > 0.05)
Results of Lab Experiments:
1. The Z-test compares a sample mean against a known population mean when the population standard deviation is known or the sample is large.
2. The T-test compares the means of two groups and reports a p-value for the observed difference.
3. ANOVA extends this comparison to three or more groups in a single test.
4. A p-value below the significance level (typically 0.05) leads to rejecting the null hypothesis; a larger p-value means we fail to reject it.
5. Excel's T.TEST function and the Analysis ToolPak make these tests straightforward to run on worksheet data.

Viva Voce Questions:

1. What is the primary purpose of performing a Z-test?
2. In a T-test, what does a p-value less than 0.05 indicate?
3. When should you use ANOVA over a T-test?
4. What does a large p-value (> 0.05) imply in hypothesis testing?
5. How is the p-value interpreted in hypothesis testing?
Perform Data Pre-processing Operations
Title: Data Pre-processing for Data Analysis
Experiment Number: 5
Date:
Aim:
To clean and prepare data for accurate analysis by handling missing values and normalizing the dataset.

Procedure:
1. Open MS Excel or any suitable data analysis software.
2. Import the dataset that requires cleaning (e.g., sales data, customer feedback).
3. Identify and handle missing data (either by imputation, removing rows, or using
statistical methods).
4. Normalize the dataset by scaling the features to a common range (e.g., Min-Max
scaling, Z-score normalization).
5. Document the cleaned and normalized dataset for further analysis.

Sample Data Set:


Sales | Price | Customer Rating
150   | 20    | 4.5
200   | 25    | NULL
NULL  | 30    | 3.8
180   | NULL  | 4.2
210   | 35    | 4.7

Initialization:
1. Open your data analysis software (Excel, Python, R, etc.).
2. Load the dataset to be processed.
Step-by-Step Process:
1. Identify Missing Data: Check for cells with missing values (NULL or NaN).
2. Handle Missing Data: Use imputation methods (mean, median) or remove the rows
with missing values.
3. Normalize Data: Apply Min-Max scaling or Z-score normalization to the numerical
columns.
4. Verify Data: After normalization, ensure all data values are within the expected
range.
5. Save Cleaned Data: Store the cleaned and normalized dataset for analysis.
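A minimal pandas sketch of the same pre-processing steps (illustrative only; column names follow the sample table above):

import pandas as pd

df = pd.DataFrame({
    "Sales":  [150, 200, None, 180, 210],
    "Price":  [20, 25, 30, None, 35],
    "Rating": [4.5, None, 3.8, 4.2, 4.7],
})

# Handle missing data: mean imputation for Sales/Rating, median for Price
df["Sales"] = df["Sales"].fillna(df["Sales"].mean())
df["Price"] = df["Price"].fillna(df["Price"].median())
df["Rating"] = df["Rating"].fillna(df["Rating"].mean())

# Min-Max scaling: rescale each column to the [0, 1] range
min_max = (df - df.min()) / (df.max() - df.min())

# Z-score normalization: zero mean, unit (population) standard deviation
z_score = (df - df.mean()) / df.std(ddof=0)

print(min_max.round(3))
print(z_score.round(3))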

Examples with Corresponding Test Cases:


Example 1:
● Input: Sales Data: [150, 200, NULL, 180, 210]
● Expected Output: After Imputation (mean): [150, 200, 185, 180, 210]
Example 2:
● Input: Price Data: [20, 25, 30, NULL, 35]
● Expected Output: After Imputation (median): [20, 25, 30, 27.5, 35]
Example 3:
● Input: Customer Rating: [4.5, NULL, 3.8, 4.2, 4.7]
● Expected Output: After Imputation (mean): [4.5, 4.3, 3.8, 4.2, 4.7]
Example 4:
● Input: Normalizing Sales Data: [150, 200, 180, 210]
● Expected Output: Min-Max Normalization: [0.0, 0.83, 0.5, 1.0]
Example 5:
● Input: Normalizing Price Data: [20, 25, 30, 35]
● Expected Output: Z-Score Normalization: [-1.341, -0.447, 0.447, 1.341]

Viva Voce Questions:


1. What are the common techniques for handling missing data in a dataset?
a) Mean imputation
2. Why is data normalization necessary in data analysis?
a) It brings all features onto a comparable scale
3. What is the difference between Min-Max scaling and Z-score normalization?
a) Min-Max scaling rescales values to a fixed range (e.g., 0 to 1), while Z-score normalization rescales values to zero mean and unit standard deviation
4. When should you remove rows with missing data instead of imputing the values?
a) When the dataset is small and removing rows does not lead to significant loss of
data
5. What is the potential drawback of imputing missing data with mean values?
a) It can introduce bias in the data

Perform Dimensionality Reduction Operations using PCA, KPCA & SVD


Title: Dimensionality Reduction with PCA, KPCA, and SVD
Experiment Number: 6
Date:
Aim:
To reduce the dimensionality of a dataset using PCA, KPCA, and SVD while retaining the key information needed for analysis.

Procedure:
1. Load the high-dimensional dataset to be analyzed (e.g., image dataset, customer
data).
2. Apply PCA to reduce the dataset's dimensionality while retaining the variance.
3. Use KPCA to handle non-linear data and apply dimensionality reduction in a higher-
dimensional space.
4. Perform SVD for matrix factorization and dimensionality reduction.
5. Evaluate the results to ensure that the reduced dataset retains the critical features
of the original data.

Sample Data Set:


Feature 1 | Feature 2 | Feature 3 | Feature 4
1.0       | 2.1       | 3.3       | 4.4
1.2       | 2.3       | 3.5       | 4.6
1.4       | 2.5       | 3.7       | 4.8
1.6       | 2.7       | 3.9       | 5.0

Initialization:
1. Import necessary libraries (e.g., scikit-learn, numpy) for PCA, KPCA, and SVD.
2. Load the dataset into the Python environment.

Step-by-Step Process:
1. Apply PCA:
o Use the PCA function from a library (e.g., scikit-learn) to reduce the
dimensionality of the dataset.
2. Apply KPCA:
o Use the KernelPCA function with a non-linear kernel (e.g., RBF) to reduce the dimensionality of non-linearly separable data.
3. Apply SVD:
o Factorize the data matrix using SVD (e.g., numpy.linalg.svd or scikit-learn's TruncatedSVD) and keep the top singular components.
4. Evaluate Results:
o Compare the explained variance of the reduced representations to confirm that the critical structure of the original data is retained.
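A short scikit-learn sketch of the three techniques on the sample feature matrix above (illustrative only; the kernel choice and the number of components are assumptions, not prescribed by the manual):

import numpy as np
from sklearn.decomposition import PCA, KernelPCA, TruncatedSVD

X = np.array([
    [1.0, 2.1, 3.3, 4.4],
    [1.2, 2.3, 3.5, 4.6],
    [1.4, 2.5, 3.7, 4.8],
    [1.6, 2.7, 3.9, 5.0],
])

# PCA: linear projection onto the directions of maximum variance
pca = PCA(n_components=2).fit(X)
print("PCA explained variance ratio:", pca.explained_variance_ratio_)

# Kernel PCA: non-linear variant using an RBF kernel (assumed here)
kpca = KernelPCA(n_components=2, kernel="rbf")
X_kpca = kpca.fit_transform(X)
print("KPCA projection shape:", X_kpca.shape)

# Truncated SVD: matrix factorization keeping the top singular components
svd = TruncatedSVD(n_components=2).fit(X)
print("SVD explained variance ratio:", svd.explained_variance_ratio_)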

Examples with Corresponding Test Cases:


Example 1:
● Input: Dataset: [(1.0, 2.0), (1.5, 1.8), (5.0, 8.0), (8.0, 8.5), (1.2, 1.9)], K = 2
● Expected Output: Cluster 1: [(1.0, 2.0), (1.5, 1.8), (1.2, 1.9)], Cluster 2: [(5.0, 8.0), (8.0,
8.5)]
Example 2:
● Input: Dataset: [(2.0, 3.0), (3.5, 4.5), (7.0, 8.0), (8.5, 9.5)], K = 2
● Expected Output: Cluster 1: [(2.0, 3.0), (3.5, 4.5)], Cluster 2: [(7.0, 8.0), (8.5, 9.5)]
Example 3:
● Input: Dataset: [(0.5, 1.5), (1.0, 2.0), (5.5, 7.5), (7.0, 9.0)], K = 2
● Expected Output: Cluster 1: [(0.5, 1.5), (1.0, 2.0)], Cluster 2: [(5.5, 7.5), (7.0, 9.0)]
Example 4:
● Input: Dataset: [(3.0, 4.0), (3.5, 4.5), (6.0, 7.0), (8.0, 9.0)], K = 3
● Expected Output: Cluster 1: [(3.0, 4.0), (3.5, 4.5)], Cluster 2: [(6.0, 7.0)], Cluster 3:
[(8.0, 9.0)]
Example 5:
● Input: Dataset: [(2.0, 3.0), (2.5, 3.5), (7.0, 8.0), (8.5, 9.5)], K = 2
● Expected Output: Cluster 1: [(2.0, 3.0), (2.5, 3.5)], Cluster 2: [(7.0, 8.0), (8.5, 9.5)]

Results of Lab Experiments:


1. K-means is effective in clustering data when the number of clusters is known in
advance.
2. Preprocessing such as scaling is important to ensure the K-means algorithm works
effectively, especially with high-dimensional data.
3. The value of K should ideally be chosen based on the dataset or using methods
such as the Elbow Method or Silhouette score.
4. K-means may struggle with non-spherical clusters or when clusters have different
densities.
5. The performance of K-means can be evaluated by measuring intra-cluster similarity
and inter-cluster dissimilarity.

Viva Voce Questions:


1. What is the K-means clustering algorithm used for?
a) Supervised learning
2. How do you determine the optimal number of clusters (K) for K-means?
a) Using the Elbow method
3. What is the main limitation of the K-means algorithm?
a) It assumes clusters are spherical
4. How does K-means assign a data point to a cluster?
a) Based on the closest centroid
5. What are some ways to improve K-means clustering?
a) Using a better initialization method for centroids

Decision Trees for Classification


Title: Building a Decision Tree Classifier
Experiment Number: 7
Date:
Aim:
To build a Decision Tree classifier on a labeled dataset and evaluate its performance.

Procedure:
1. Load the dataset containing features (independent variables) and labels (target
variable).
2. Preprocess the data (e.g., handle missing values and categorical variables).
3. Split the data into training and testing sets.
4. Implement the Decision Tree Classifier using a library such as scikit-learn.
5. Train the model on the training dataset and evaluate its performance on the test
dataset using accuracy or confusion matrix.

Sample Data Set:


Age | Income | Education Level | Purchased (Yes/No)
22  | 30K    | Bachelor's      | No
45  | 60K    | Master's        | Yes
36  | 45K    | Bachelor's      | Yes
52  | 80K    | PhD             | Yes
29  | 50K    | Master's        | No

Initialization:
1. Import necessary libraries (e.g., pandas, numpy, scikit-learn).
2. Load and preprocess the dataset into Python.

Step-by-Step Process:
1. Load Dataset: Import the dataset using pandas.
2. Preprocess Data: Handle missing values and convert categorical features into
numerical values (e.g., one-hot encoding for Education Level).
3. Split Dataset: Use the train_test_split() function to divide the data into training and
testing sets.
4. Train Model: Implement a Decision Tree Classifier using scikit-learn’s
DecisionTreeClassifier() and train it on the training data.
5. Evaluate Model: Assess the model's performance using metrics such as accuracy and the confusion matrix on the test data.
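A minimal scikit-learn sketch of this workflow on the sample table above (illustrative only; the one-hot encoding, the income values expressed in thousands, and the train/test split ratio are assumptions):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

df = pd.DataFrame({
    "Age": [22, 45, 36, 52, 29],
    "Income": [30, 60, 45, 80, 50],          # income in thousands (assumed)
    "Education": ["Bachelor's", "Master's", "Bachelor's", "PhD", "Master's"],
    "Purchased": ["No", "Yes", "Yes", "Yes", "No"],
})

# One-hot encode the categorical Education column
X = pd.get_dummies(df[["Age", "Income", "Education"]])
y = df["Purchased"]

# Split the (tiny) dataset; the split here is only for illustration
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# Train and evaluate the decision tree
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))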
Naive Bayes Classifier for Text Classification
Title: Implementing a Naive Bayes Classifier for Sentiment Analysis
Experiment Number: 8
Date:
Aim:
To implement a Naive Bayes classifier for sentiment analysis of text reviews.

Procedure:
1. Load the dataset of text documents, such as product reviews labeled as positive or
negative.
2. Preprocess the text data by removing stop words, special characters, and tokenizing
the text.
3. Convert the text data into numerical form using techniques like TF-IDF or Bag of
Words.
4. Train the Naive Bayes model using the MultinomialNB() function from scikit-learn.
5. Evaluate the model using metrics such as accuracy, precision, recall, and F1-score.

Sample Data Set:


Review                                     | Sentiment
This product is amazing, very useful!      | Positive
Terrible quality, do not buy this product. | Negative
I love this, will definitely buy again.    | Positive
Very disappointed, waste of money.         | Negative
Excellent value for the price.             | Positive

Initialization:
1. Import necessary libraries (e.g., pandas, numpy, scikit-learn, nltk).
2. Load the dataset into Python and preprocess the text data.

Step-by-Step Process:
1. Load Dataset: Import the text data (e.g., CSV or text file).
2. Text Preprocessing: Clean the text by removing punctuation, stop words, and
tokenizing the sentences.
3. Feature Extraction: Convert the text into a numerical representation using TF-IDF or
Bag of Words.
4. Train Model: Use scikit-learn's MultinomialNB() to train the Naive Bayes classifier on
the training data.
5. Evaluate Model: Evaluate the model's performance using accuracy, precision, recall,
and F1-score on the test data.
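A compact sketch of the pipeline with TF-IDF features and MultinomialNB (illustrative only; the tiny review set above is used as both training data and a prediction example, and the new review shown is hypothetical):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

reviews = [
    "This product is amazing, very useful!",
    "Terrible quality, do not buy this product.",
    "I love this, will definitely buy again.",
    "Very disappointed, waste of money.",
    "Excellent value for the price.",
]
labels = ["Positive", "Negative", "Positive", "Negative", "Positive"]

# Convert the text to TF-IDF features (English stop words removed)
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(reviews)

# Train the Naive Bayes classifier and predict the sentiment of a new review
model = MultinomialNB().fit(X, labels)
new_review = ["Really useful product, excellent quality."]
print(model.predict(vectorizer.transform(new_review)))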

Examples with Corresponding Test Cases:


Example 1:
● Input: Review = "This product is amazing, very useful!"
● Expected Output: Sentiment = Positive
Example 2:
● Input: Review = "Terrible quality, do not buy this product."
● Expected Output: Sentiment = Negative
Example 3:
● Input: Review = "I love this, will definitely buy again."
● Expected Output: Sentiment = Positive
Example 4:
● Input: Review = "Very disappointed, waste of money."
● Expected Output: Sentiment = Negative
Example 5:
● Input: Review = "Excellent value for the price."
● Expected Output: Sentiment = Positive

Results of Lab Experiments:


1. Naive Bayes classifiers are simple and effective for text classification tasks.
2. Preprocessing the text data significantly improves the performance of the classifier.
3. The Naive Bayes classifier works best when the features are conditionally
independent, although it can still work reasonably well even when this assumption is
violated.
4. TF-IDF is a widely used feature extraction technique for text data.
5. Naive Bayes models can be used for sentiment analysis, spam detection, and other
text classification tasks.

Viva Voce Questions:


1. What is the Naive Bayes classifier used for?
a) Classification tasks
2. What is the main assumption of the Naive Bayes classifier?
a) The features are conditionally independent given the class
3. How do you preprocess text data for use in Naive Bayes?
a) Remove stop words, punctuation, and tokenize the text
4. Which metric is commonly used to evaluate text classification models?
a) Accuracy
5. What does the TF-IDF technique do in text classification?
a) It weights terms by how frequent they are in a document and how rare they are across documents

K-Means Clustering for Data Segmentation


Title: Implementing K-Means Clustering Algorithm
Experiment Number: 9
Date:

Aim:
To segment a dataset into groups using the K-Means clustering algorithm.

Procedure:
1. Load the dataset that contains multiple features for segmentation.
2. Preprocess the data (e.g., scale features and handle missing values).
3. Implement the K-Means algorithm using KMeans from scikit-learn.
4. Specify the number of clusters, k, and run the algorithm to find cluster centroids.
5. Visualize the clustering result by plotting the data points and the cluster centers.

Sample Data Set:


Feature 1 | Feature 2

1.2 3.5

2.8 2.3

3.5 1.8

5.1 7.9

6.2 8.1

Initialization:
1. Import necessary libraries (pandas, numpy, matplotlib, sklearn).
2. Load the dataset and preprocess it (if required, like scaling the features).

Step-by-Step Process:
1. Load Dataset: Import and load the dataset into a pandas DataFrame.
2. Preprocess Data: Handle missing data and scale the features if necessary.
3. Run K-Means Algorithm: Use KMeans() from scikit-learn to create and fit the model
with k clusters.
4. Obtain Results: Extract and examine the cluster centers and labels assigned to each
data point.
5. Visualize Clusters: Plot the clusters and their centroids using matplotlib.
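A minimal K-Means sketch on the sample points above (illustrative only; k=2 and the scaling step are assumptions):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = np.array([[1.2, 3.5], [2.8, 2.3], [3.5, 1.8], [5.1, 7.9], [6.2, 8.1]])

# Scale the features so both contribute equally to the distance metric
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means with k=2 clusters
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print("Labels   :", km.labels_)
print("Centroids:", km.cluster_centers_)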

Examples with Corresponding Test Cases:


Example 1:
● Input: Dataset with points in 2D space (Feature 1, Feature 2), k=2
● Expected Output: Two clusters identified with centroids at (2, 3) and (5, 7).
Example 2:
● Input: Dataset with 3 points in 2D space, k=1
● Expected Output: All points assigned to a single cluster with centroid at the average
point.
Example 3:
● Input: Dataset with random 2D data points, k=3
● Expected Output: Three clusters identified with corresponding centroids.
Example 4:
● Input: Data points: (1,2), (2,2), (5,8), k=2
● Expected Output: Two clusters: one centered around (1.5, 2) and another around (5,
8).
Example 5:
● Input: Dataset with highly overlapping points, k=4
● Expected Output: Four clusters with minimal separation between some clusters.

Results of Lab Experiments:


1. K-Means is an effective clustering algorithm when the number of clusters is known.
2. The algorithm is sensitive to the initial placement of centroids.
3. K-Means performs well when clusters are spherical and well-separated.
4. Feature scaling improves the accuracy of clustering.
5. Outliers can negatively impact the cluster centroids.

Viva Voce Questions:

1. What is the primary purpose of K-Means clustering?
2. What happens if you set k too high or too low?
3. What is the typical shape of clusters that K-Means works well with?
4. How do you determine the optimal value of k in K-Means clustering?
5. What is the disadvantage of K-Means clustering?
Principal Component Analysis (PCA)
Title: Reducing Dimensionality Using PCA
Experiment Number: 10
Date:
Aim:
To reduce the dimensionality of a dataset using Principal Component Analysis (PCA).

Procedure:
1. Load the dataset with multiple features.
2. Standardize the data (i.e., scale the features to zero mean and unit variance).
3. Apply PCA to transform the data into principal components.
4. Visualize the explained variance ratio of the components.
5. Choose the first few components that explain the most variance and project the
data onto these components.

Sample Data Set:


Feature 1 | Feature 2 | Feature 3
Support Vector Machines (SVM)
Title: Implementing Support Vector Machines for Classification
Experiment Number: 11
Date:
Aim:
To implement a Support Vector Machine classifier and evaluate its performance on labeled data.

Procedure:
1. Load the dataset with labeled data.
2. Preprocess the data by handling missing values and scaling the features.
3. Train an SVM classifier using the SVC function from scikit-learn.
4. Evaluate the classifier's performance using metrics like accuracy, precision, and
recall.
5. Visualize the decision boundaries for two-dimensional data.

Sample Data Set:


Feature 1 | Feature 2 | Label

2.5 3.4 1

1.5 1.8 0

3.2 3.6 1

0.8 2.0 0

2.1 2.5 1

Initialization:
1. Import necessary libraries (e.g., pandas, numpy, scikit-learn, matplotlib).
2. Load and preprocess the data.

Step-by-Step Process:
1. Load Dataset: Import the dataset into a DataFrame.
2. Preprocess Data: Scale the data and handle any missing values.
3. Train SVM: Use SVC() to train the model with the specified kernel.
4. Evaluate Performance: Check performance using accuracy, precision, recall, etc.
5. Visualize Decision Boundaries: For two-dimensional data, plot the decision
boundaries.
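An illustrative scikit-learn sketch of training and evaluating the SVM on the sample points above (the linear kernel, C=1.0, and evaluation on the training data are assumptions made to keep the example small):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score

X = np.array([[2.5, 3.4], [1.5, 1.8], [3.2, 3.6], [0.8, 2.0], [2.1, 2.5]])
y = np.array([1, 0, 1, 0, 1])

# Scale the features, then train a linear-kernel SVM
X_scaled = StandardScaler().fit_transform(X)
model = SVC(kernel="linear", C=1.0).fit(X_scaled, y)

# Evaluate on the same (tiny) dataset for illustration
pred = model.predict(X_scaled)
print("Accuracy :", accuracy_score(y, pred))
print("Precision:", precision_score(y, pred))
print("Recall   :", recall_score(y, pred))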

Examples with Corresponding Test Cases:


Example 1:
● Input: Dataset with two features, class labels 0 and 1
● Expected Output: A decision boundary separating the two classes.
Example 2:
● Input: Small dataset with linearly separable classes
● Expected Output: High classification accuracy.
Example 3:
● Input: Dataset with some overlap between classes
● Expected Output: Moderate accuracy with some misclassifications.
Example 4:
● Input: Data with noise and outliers
● Expected Output: Lower classification accuracy due to sensitivity to outliers.
Example 5:
● Input: Large dataset with multiple features
● Expected Output: A well-defined decision boundary with high accuracy.

Results of Lab Experiments:


1. SVM is highly effective for binary classification problems.
2. The kernel trick allows SVM to work in non-linear decision spaces.
3. SVM is sensitive to the choice of kernel and parameters (e.g., C and gamma).
4. SVM performs well in high-dimensional spaces.
5. SVM can be prone to overfitting if not properly tuned.

Viva Voce Questions:


1. What is the main advantage of using SVM over other classification algorithms?
a) It finds a maximum-margin decision boundary and performs well in high-dimensional spaces
2. What does the kernel trick in SVM do?
a) It lets SVM learn non-linear decision boundaries by implicitly mapping data into a higher-dimensional space
3. How do you tune an SVM classifier?
a) By adjusting the kernel
4. What happens if you set C too high in an SVM?
a) The model may overfit the data
5. What are the advantages and disadvantages of using SVM in real-world datasets?
a) It works well for small, clean datasets but struggles with large datasets

Random Forest Algorithm for Classification


Title: Implementing Random Forest for Classification
Experiment Number: 12
Date:
Aim:
To implement a Random Forest classifier and evaluate its performance and feature importance.

Procedure:
1. Load the dataset that contains labeled data.
2. Preprocess the data, which includes handling missing values, encoding categorical
variables (if any), and scaling the features.
3. Train a Random Forest classifier using the RandomForestClassifier from scikit-learn.
4. Evaluate the classifier's performance using accuracy, precision, recall, and the
confusion matrix.
5. Visualize the feature importance determined by the Random Forest model.

Sample Data Set:


Feature 1 | Feature 2 | Feature 3 | Label

1.5 2.3 3.0 0

2.8 3.5 1.2 1

4.1 1.6 3.4 0

3.7 2.1 2.9 1

1.2 3.4 3.5 0

Initialization:
1. Import necessary libraries (e.g., pandas, numpy, scikit-learn, matplotlib).
2. Load and preprocess the dataset.

Step-by-Step Process:
1. Load Dataset: Import and load the dataset into a pandas DataFrame.
2. Preprocess Data: Handle missing values and encode categorical features if
required.
3. Train Random Forest: Use RandomForestClassifier() to train the model on the
dataset.
4. Evaluate Performance: Calculate accuracy, precision, recall, and confusion matrix.
5. Visualize Feature Importance: Plot the feature importance based on the trained
Random Forest model.
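A short sketch of training a Random Forest and reading its feature importances on the sample table above (illustrative only; 100 trees and evaluation on the training data are assumptions):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

X = np.array([
    [1.5, 2.3, 3.0],
    [2.8, 3.5, 1.2],
    [4.1, 1.6, 3.4],
    [3.7, 2.1, 2.9],
    [1.2, 3.4, 3.5],
])
y = np.array([0, 1, 0, 1, 0])

# Train the ensemble and predict on the same (tiny) dataset for illustration
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
pred = model.predict(X)

print("Accuracy:", accuracy_score(y, pred))
print(confusion_matrix(y, pred))
print("Feature importances:", model.feature_importances_)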

Examples with Corresponding Test Cases:


Example 1:
● Input: Dataset with three features and two classes (0 and 1), trained Random Forest
model.
● Expected Output: Accuracy score of 90%, with confusion matrix showing true
positives, true negatives, false positives, and false negatives.
Example 2:
● Input: Dataset with 5 features and 3 classes, RandomForestClassifier with 100 trees.
● Expected Output: Precision, recall, and F1-score for each class.
Example 3:
● Input: 10,000 data points with 4 features and binary labels.
● Expected Output: Feature importance plot highlighting the most important features.
Example 4:
● Input: Small dataset with a few noisy data points.
● Expected Output: Model performance slightly impacted, but Random Forest still
performs well with high robustness to noise.
Example 5:
● Input: Dataset with categorical variables (e.g., color: red, blue, green).
● Expected Output: Categorical features correctly encoded and used in the
classification model.

Results of Experiments:
1. Random Forest is robust to overfitting due to its ensemble nature.
2. Random Forest performs well with both numerical and categorical data.
3. Hyperparameter tuning, like adjusting the number of trees, significantly impacts
model performance.
4. Feature importance can help in identifying which variables most affect the
prediction.
5. Random Forest is slower to train compared to other classifiers but performs better
in complex tasks.

Viva Voce Questions:


1. What is the advantage of using Random Forest over a single decision tree?
a) It reduces overfitting by averaging the predictions of many trees
2. How does Random Forest handle overfitting?
a) By averaging multiple trees to reduce variance
3. What is feature importance in Random Forest?
a) A measure of how much each feature contributes to the model’s predictions
4. What is the role of bootstrapping in Random Forest?
a) To train each tree on a random sample of the data, making the trees diverse
5. How do you determine the optimal number of trees in Random Forest?
a) By using cross-validation
Logistic Regression for Binary Classification
Title: Implementing Logistic Regression for Binary Classification
Experiment Number: 13
Date:
Aim:
To implement Logistic Regression for binary classification and evaluate it using accuracy, the confusion matrix, and the ROC curve.

Procedure:
1. Load the dataset with binary class labels.
2. Preprocess the data (handling missing values, scaling features, etc.).
3. Train a logistic regression model using LogisticRegression from scikit-learn.
4. Evaluate the model's performance using accuracy, confusion matrix, and ROC curve.
5. Tune the model by adjusting the regularization parameter (C) and observing
performance.

Sample Data Set:


Feature 1 | Feature 2 | Label
1.2 3.5 0
3.8 1.5 1

4.5 3.1 1

2.8 2.7 0

Initialization:
1. Import necessary libraries (e.g., pandas, sklearn.linear_model, sklearn.metrics, matplotlib).
2. Load the dataset and preprocess it.

Step-by-Step Process:
1. Load Dataset: Import the dataset into a pandas DataFrame.
2. Preprocess Data: Handle missing values and scale the features.
3. Train Logistic Regression: Use LogisticRegression() to train the model.
4. Evaluate Performance: Calculate accuracy and plot the ROC curve.
5. Tune Model: Adjust the regularization parameter C to improve performance.
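A compact sketch of the logistic regression workflow on the sample points above (illustrative only; C=1.0 and evaluation on the training data are assumptions made to keep the example self-contained):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

X = np.array([[1.2, 3.5], [3.8, 1.5], [4.5, 3.1], [2.8, 2.7]])
y = np.array([0, 1, 1, 0])

# Scale the features, then fit logistic regression with regularization strength C
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression(C=1.0).fit(X_scaled, y)

pred = model.predict(X_scaled)
prob = model.predict_proba(X_scaled)[:, 1]
print("Accuracy:", accuracy_score(y, pred))
print(confusion_matrix(y, pred))
print("ROC AUC :", roc_auc_score(y, prob))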

Examples with Corresponding Test Cases:


Example 1:
● Input: Dataset with two features and binary class labels.
● Expected Output: A model with an accuracy of 85% and a ROC AUC score of 0.88.
Example 2:
● Input: Dataset with three features and binary class labels, C=0.5.
● Expected Output: Model accuracy of 90%, confusion matrix showing correct
classification of both classes.
Example 3:
● Input: Dataset with missing values in the features.
● Expected Output: Model handles missing values using imputation and performs
well.
Example 4:
● Input: Dataset with highly skewed features.
● Expected Output: Logistic regression adjusts the weights for skewed features
appropriately.
Example 5:
● Input: Small dataset with a regularization strength of C=1.
● Expected Output: Model with moderate accuracy, confusion matrix showing
balance between precision and recall.
Results of Experiments:
1. Logistic Regression is a simple and effective algorithm for binary classification
problems.
2. Regularization (parameter C) plays a crucial role in preventing overfitting.
3. The ROC curve and AUC score are important metrics for evaluating binary
classifiers.
4. Logistic Regression may perform poorly with highly non-linear data.
5. Feature scaling is important for Logistic Regression to perform optimally.

Viva Voce Questions:


1. What is the purpose of the logistic function in Logistic Regression?
a) To map predictions to probabilities
2. What does the regularization parameter C control in Logistic Regression?
a) The inverse of the regularization strength (smaller C means stronger regularization)
3. How do you interpret the coefficients in a Logistic Regression model?
a) As weights that influence the decision boundary
4. What is the role of the ROC curve in evaluating Logistic Regression?
a) To show the trade-off between the true positive rate and the false positive rate across thresholds
5. How would you handle imbalanced classes in Logistic Regression?
a) By using class weights in the model


Support Vector Machine (SVM) for Binary Classification


Title: Implementing Support Vector Machine for Binary Classification
Experiment Number: 14
Date:
Aim:
To implement a Support Vector Machine for binary classification and visualize its decision boundary.
Procedure:
1. Load the dataset with binary class labels.
2. Preprocess the dataset by handling missing values, scaling the features, and
encoding categorical variables (if any).
3. Train an SVM classifier using the SVC() function from scikit-learn.
4. Evaluate the classifier’s performance using accuracy, precision, recall, and F1-score.
5. Visualize the decision boundary and support vectors.

Sample Data Set:


Feature 1 | Feature 2 | Label

1.2 3.4 0

2.1 3.5 1

2.8 1.5 0

3.6 2.8 1

1.4 3.2 0

Initialization:
1. Import necessary libraries (e.g., pandas, numpy, scikit-learn, matplotlib).
2. Load the dataset and preprocess it.

Step-by-Step Process:
1. Load Dataset: Load the dataset into a pandas DataFrame.
2. Preprocess Data: Handle missing values, scale the features using StandardScaler,
and encode categorical variables if present.
3. Train SVM: Use the SVC() function to train the model with the dataset.
4. Evaluate Performance: Calculate the accuracy, precision, recall, and F1-score using
scikit-learn metrics.
5. Visualize Decision Boundary: Plot the decision boundary using a scatter plot to
visualize the SVM classifier’s behavior.
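To complement the steps above, here is an illustrative sketch of plotting the decision regions for the two-feature sample data (the RBF kernel, the grid range, and its resolution are assumptions):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

X = np.array([[1.2, 3.4], [2.1, 3.5], [2.8, 1.5], [3.6, 2.8], [1.4, 3.2]])
y = np.array([0, 1, 0, 1, 0])

model = SVC(kernel="rbf", C=1.0).fit(X, y)

# Evaluate the classifier over a grid covering the feature space
xx, yy = np.meshgrid(np.linspace(0, 5, 200), np.linspace(0, 5, 200))
zz = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Shade the predicted regions, then overlay the training points
plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("SVM decision regions")
plt.show()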

Examples with Corresponding Test Cases:


Example 1:
● Input: Dataset with two features and binary labels (0 and 1).
● Expected Output: A model that correctly classifies data points with 85% accuracy.
Example 2:
● Input: Dataset with four features and 1000 data points, trained with an RBF kernel.
● Expected Output: A model that achieves 92% accuracy, and a confusion matrix
showing correct classification of both classes.
Example 3:
● Input: Small dataset with noisy data.
● Expected Output: The SVM classifier performs well despite noise, with a precision
of 0.85 and recall of 0.88.
Example 4:
● Input: Linearly separable data with two features.
● Expected Output: A clear decision boundary separating the classes with 100%
accuracy.
Example 5:
● Input: Imbalanced dataset with many more instances of class 0 than class 1.
● Expected Output: Model performs well but may have a slightly higher recall for
class 0 due to class imbalance.

Results of Experiments:
1. SVM is effective for both linearly separable and non-linearly separable data.
2. The kernel function plays a critical role in SVM's performance.
3. SVM is sensitive to the choice of the regularization parameter C and kernel type.
4. Preprocessing like scaling is essential for SVM to perform optimally.
5. SVM can be slow to train on large datasets but performs well on smaller datasets
with fewer features.

Viva Voce Questions:


1. What is the role of the kernel function in SVM?
a) It defines the similarity measure that lets SVM form non-linear decision boundaries
2. What happens if you set a very high value for the regularization parameter C in
SVM?
a) The model may overfit the data
3. What is the primary advantage of using an SVM with an RBF kernel?
a) It works well for high-dimensional data
4. What is a support vector in the context of SVM?
a) A training point that lies closest to the decision boundary and defines the margin
5. Why is feature scaling important for SVM?
a) Because SVM is distance-based, features on larger scales would otherwise dominate the margin
K-Means Clustering Algorithm
Title: Implementing K-Means Clustering
Experiment Number: 15
Date:
Aim:
To implement K-Means clustering, evaluate it with the silhouette score, and determine a suitable number of clusters.

Procedure:
1. Load a dataset for unsupervised learning (no labels).
2. Preprocess the data by handling missing values and scaling features.
3. Implement the K-Means algorithm using KMeans() from scikit-learn.
4. Evaluate the model by examining the silhouette score and visualizing the clusters.
5. Experiment with different values of K and determine the optimal number of clusters.

Sample Data Set:


Feature 1 | Feature 2

1.1 3.2

2.5 3.8

3.2 1.4

4.1 2.5

5.0 3.7
Initialization:
1. Import necessary libraries (e.g., pandas, numpy, scikit-learn, matplotlib).
2. Load and preprocess the dataset.

Step-by-Step Process:
1. Load Dataset: Import the dataset into a pandas DataFrame.
2. Preprocess Data: Handle missing values and scale the data using StandardScaler.
3. Train K-Means: Apply the K-Means algorithm using KMeans() with a specified
number of clusters.
4. Evaluate Clusters: Use the silhouette score to evaluate the clustering performance.
5. Visualize Results: Plot the clusters and their centroids to visualize how the data
points are grouped.
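An illustrative sketch of fitting K-Means for several values of K and comparing silhouette scores on the sample points above (the range of K values tried is an assumption):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1.1, 3.2], [2.5, 3.8], [3.2, 1.4], [4.1, 2.5], [5.0, 3.7]])
X_scaled = StandardScaler().fit_transform(X)

# Try K = 2..4 and report the silhouette score for each clustering
for k in range(2, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    score = silhouette_score(X_scaled, km.labels_)
    print(f"K={k}: silhouette score = {score:.3f}")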

Examples with Corresponding Test Cases:


Example 1:
● Input: Dataset with two features, applied K-Means with K=3.
● Expected Output: A silhouette score of 0.75 and 3 well-defined clusters.
Example 2:
● Input: Dataset with five features, applied K-Means with K=2.
● Expected Output: Two clusters with centroids at distinct positions and a silhouette
score of 0.6.
Example 3:
● Input: Randomly distributed data with K=5.
● Expected Output: A visualization showing five clusters, each with an equal number
of data points.
Example 4:
● Input: Data with a large number of outliers.
● Expected Output: Clusters that are less compact, with some outliers placed outside
the main clusters.
Example 5:
● Input: Data with a mix of numeric and categorical variables.
● Expected Output: Successful clustering after encoding the categorical variables.

Results of Experiments:
1. K-Means is a simple and effective clustering algorithm for well-separated data.
2. The number of clusters (K) significantly impacts the results; selecting the right K is
crucial.
3. K-Means tends to struggle with highly skewed or non-globular data.
4. Clusters are often easier to visualize in two or three dimensions but become difficult
with higher-dimensional data.
5. Using the elbow method and silhouette score helps determine the optimal number
of clusters.

Viva Voce Questions:


1. What does the silhouette score measure in K-Means clustering?
a) How well each point fits its own cluster compared with the nearest other cluster
2. How do you determine the optimal number of clusters (K) in K-Means?
a) By using the elbow method or silhouette score
3. What is the main limitation of K-Means clustering?
a) The number of clusters must be chosen in advance, and clusters are assumed to be roughly spherical
4. How do you handle categorical data in K-Means clustering?
a) By encoding it numerically (e.g., one-hot encoding) or using variants such as K-Modes
5. What happens when K is set too high in K-Means clustering?
a) Clusters become fragmented and less meaningful (over-segmentation)
