Activity Title:
Data Cleaning and Analysis: Handling Missing Data, Outliers, Measures of Central
Tendency, and Clustering
Objective:
This activity aims to enhance students’ practical understanding of data cleaning and descriptive
analytics techniques. Students will apply methods to handle missing data, detect and treat
outliers, compute descriptive statistics, and perform clustering analysis using R programming.
By the end of this activity, students should be able to:
1. Create and manage their own dataset.
2. Handle missing and inconsistent data appropriately.
3. Detect and treat outliers using statistical techniques.
4. Compute and interpret mean, median, and mode.
5. Perform basic clustering and interpret the visualization output.
6. Save and document their R code for verification.
Instructions:
Step 1: Dataset Creation
1. Create a unique dataset consisting of 20–30 rows and 5–6 variables.
2. The dataset should be original and based on a theme of your choice, such as:
   - Academic performance
   - Sales or business data
   - Health or fitness data
   - Environmental monitoring
   - Technology usage or customer feedback
3. The dataset must include:
   - At least two numeric variables
   - At least one categorical variable
   - Some intentionally missing values
   - At least one potential outlier
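As an illustration only, a dataset meeting the requirements above could be built directly in R. The theme (student study habits), the column names, and every value below are invented for this sketch; your own dataset must be original.

```r
# Hypothetical example: student study habits (all names and values are invented)
set.seed(42)  # make the random values reproducible
n <- 25       # within the required 20-30 rows

students <- data.frame(
  student_id  = 1:n,
  program     = sample(c("BSIT", "BSCS", "BSIS"), n, replace = TRUE),  # categorical
  study_hours = round(runif(n, min = 2, max = 20), 1),                 # numeric
  exam_score  = round(rnorm(n, mean = 78, sd = 8)),                    # numeric
  absences    = sample(0:6, n, replace = TRUE)                         # numeric
)

# Introduce some missing values on purpose
students$study_hours[c(3, 11)] <- NA
students$exam_score[c(7, 19)]  <- NA

# Introduce one potential outlier on purpose
students$study_hours[15] <- 60

# Inspect the structure of the result
str(students)
```

Typing the values by hand or importing them with read.csv() works just as well; what matters is that the raw data already contains the missing values and the outlier before any cleaning starts.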
Step 2: Handling Missing Data
1. Identify missing data using R functions (e.g., is.na() or summary()).
2. Apply at least two methods to handle missing values (e.g., mean imputation, median
imputation, or deletion).
3. Briefly explain why each chosen method is appropriate for your dataset.
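The steps above can be sketched as follows. The data frame and its column names are hypothetical, and the pairing of each method with a column is only an example, not a recommendation for your data.

```r
# Small invented data frame; column names are hypothetical
df <- data.frame(
  study_hours = c(5, 8, NA, 12, 7, 9, NA, 6),
  exam_score  = c(70, 82, 75, NA, 68, 90, 77, 73)
)

# 1. Identify missing values
colSums(is.na(df))  # number of NAs in each column
summary(df)         # summary() also reports NA counts for numeric columns

# 2a. Mean imputation: replace NAs with the mean of the observed values
df_mean <- df
df_mean$study_hours[is.na(df_mean$study_hours)] <-
  mean(df$study_hours, na.rm = TRUE)

# 2b. Median imputation: replace NAs with the median of the observed values
df_median <- df
df_median$exam_score[is.na(df_median$exam_score)] <-
  median(df$exam_score, na.rm = TRUE)

# 2c. Deletion: drop every row that contains at least one NA
df_complete <- na.omit(df)
```

Median imputation is usually less affected by extreme values than mean imputation, while deletion shrinks an already small dataset. These are the kinds of trade-offs your explanation should address.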
Step 3: Outlier Detection and Treatment
1. Identify outliers using visualization (e.g., boxplot()) or statistical measures such as IQR.
2. Decide how to treat each outlier (retain, cap, or remove).
3. Provide a short justification for your decision.
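A minimal sketch of the IQR approach, assuming the common 1.5 x IQR rule for the fences. The vector below is invented, and 60 is the planted outlier.

```r
# Invented values; 60 is the planted outlier
study_hours <- c(5, 8, 6, 12, 7, 9, 10, 6, 60, 11)

# Quartiles and interquartile range
q1  <- unname(quantile(study_hours, 0.25, na.rm = TRUE))
q3  <- unname(quantile(study_hours, 0.75, na.rm = TRUE))
iqr <- q3 - q1

# Fences under the 1.5 * IQR rule (an assumed convention)
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Values falling outside the fences
outliers <- study_hours[study_hours < lower | study_hours > upper]
print(outliers)

# Visual check
boxplot(study_hours, main = "Study Hours", ylab = "Hours per week")

# Treatment option 1: cap values at the fences
capped <- pmin(pmax(study_hours, lower), upper)

# Treatment option 2: remove values outside the fences
removed <- study_hours[study_hours >= lower & study_hours <= upper]
```

Retaining the outlier is also a valid choice when the value is a genuine observation rather than an entry error. Whichever option you pick, state the reason.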
Step 4: Computation of Mean, Median, and Mode
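1. Compute the mean, median, and mode for at least two numeric variables. Note that base R's
mode() returns an object's storage type, not the statistical mode, so the mode must be
computed another way (e.g., from table()).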
2. Interpret your results clearly in relation to your dataset.
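The computation could look like the sketch below. Base R provides mean() and median(), but its mode() function reports the storage type of an object, so the helper get_mode() here is a hypothetical function written for this example. The scores are invented.

```r
# Invented exam scores with one missing value
scores <- c(70, 82, 75, 82, 68, 90, 75, 82, NA)

# Hypothetical helper: returns the most frequent value(s), ignoring NAs
get_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  counts <- tabulate(match(x, ux))  # frequency of each unique value
  ux[counts == max(counts)]         # may return several values if tied
}

mean_score   <- mean(scores, na.rm = TRUE)
median_score <- median(scores, na.rm = TRUE)
mode_score   <- get_mode(scores)

print(c(mean = mean_score, median = median_score))
print(mode_score)
```

When the mean and median differ noticeably, the distribution is likely skewed or affected by extreme values, which is worth mentioning in your interpretation.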
Step 5: Clustering Analysis
1. Select two numeric variables and perform k-means clustering in R (e.g., with kmeans()).
2. Visualize the resulting clusters using an appropriate graph (e.g., built with the ggplot2 or
factoextra package).
3. Interpret the visual results and describe the similarities or differences between the
clusters.
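A minimal k-means sketch using base R's kmeans(). The data are simulated, the choice of k = 2 is an assumption made for illustration, and scaling the variables first is a common practice rather than a requirement of this activity.

```r
set.seed(123)  # k-means starts from random centers, so fix the seed

# Simulated data with two loose groups; names and values are invented
df <- data.frame(
  study_hours = c(rnorm(10, mean = 5,  sd = 1), rnorm(10, mean = 15, sd = 1)),
  exam_score  = c(rnorm(10, mean = 65, sd = 3), rnorm(10, mean = 88, sd = 3))
)

# Standardize so both variables contribute equally to the distances
scaled <- scale(df)

# k = 2 is assumed here; nstart = 25 tries 25 random starts and keeps the best
km <- kmeans(scaled, centers = 2, nstart = 25)

# Attach the cluster labels to the original data
df$cluster <- factor(km$cluster)

# Per-cluster averages help describe how the clusters differ
aggregate(cbind(study_hours, exam_score) ~ cluster, data = df, FUN = mean)
```

The per-cluster averages give a concrete basis for describing each group, for example a cluster with fewer study hours and lower scores versus one with more hours and higher scores.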
Step 6: Visualization
1. Create at least two visualizations to support your analysis:
   - One plot showing data cleaning results (e.g., boxplot before and after outlier
     removal).
   - One clustering visualization (e.g., scatter plot of clusters).
2. Ensure all graphs have appropriate titles, axis labels, and legends.
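Both required plots can be sketched with base R graphics, which needs no extra packages; ggplot2 or factoextra can be substituted. All values below are invented, and the 1.5 x IQR rule and k = 2 are assumptions.

```r
set.seed(1)

# --- Plot 1: boxplot before and after outlier removal ---
hours <- c(5, 8, 6, 12, 7, 9, 10, 6, 60, 11)      # invented; 60 is the outlier
q     <- quantile(hours, c(0.25, 0.75))
fence <- q + c(-1.5, 1.5) * diff(q)               # lower and upper IQR fences
hours_clean <- hours[hours >= fence[1] & hours <= fence[2]]

boxplot(list(Before = hours, After = hours_clean),
        main = "Study Hours Before and After Outlier Removal",
        xlab = "Stage of cleaning",
        ylab = "Study hours per week")

# --- Plot 2: scatter plot of clusters ---
pts <- data.frame(
  hours = c(4, 5, 6, 5, 14, 15, 16, 15),          # invented values
  score = c(62, 66, 64, 68, 86, 90, 88, 92)
)
km <- kmeans(scale(pts), centers = 2, nstart = 25)

plot(pts$hours, pts$score,
     col = km$cluster, pch = 19,
     main = "K-Means Clusters of Students (k = 2)",
     xlab = "Study hours per week",
     ylab = "Exam score")
legend("topleft", legend = paste("Cluster", 1:2),
       col = 1:2, pch = 19, title = "Cluster")
```

Each plot carries a title and axis labels, and the cluster plot includes a legend, matching the labeling requirement above.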
Step 7: Saving and Submitting Your Code
1. Save all the R code you used for this activity in a plain-text (.txt) file (e.g., using Notepad).
2. The file should include comments (#) explaining what each section of your code does.
3. Save the file using this format: Lastname_Firstname_Rcode.txt
4. This file will allow the instructor to check and verify your R script.
Step 8: Final Submission
Submit the following:
1. Original dataset (before cleaning) — Lastname_Firstname_OriginalDataset.csv
2. Cleaned dataset (after cleaning and clustering) —
Lastname_Firstname_CleanedDataset.csv
3. R code file (saved from Notepad) — Lastname_Firstname_Rcode.txt
4. Written report (Word or PDF) — Lastname_Firstname_DataAnalysisReport.docx or .pdf
5. Visualization screenshots showing results and graphs — embedded in your report or
submitted separately.
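The two CSV files can be written from R with write.csv(). This is a sketch only: the data frames are tiny invented stand-ins, the cluster labels are placeholders for what kmeans() would return, and out_dir is an assumed output folder that you should replace with your own.

```r
# Tiny invented stand-ins for the real datasets
original <- data.frame(
  study_hours = c(5, 8, NA, 60, 7),
  exam_score  = c(70, 82, 75, 88, NA)
)

# Example cleaning: drop rows with NAs, then drop the planted outlier
cleaned <- na.omit(original)
cleaned <- cleaned[cleaned$study_hours <= 20, ]  # hypothetical cutoff
cleaned$cluster <- c(1, 2)                       # placeholder cluster labels

# Assumed output folder for this sketch; use your own project folder instead
out_dir <- tempdir()

orig_path  <- file.path(out_dir, "Lastname_Firstname_OriginalDataset.csv")
clean_path <- file.path(out_dir, "Lastname_Firstname_CleanedDataset.csv")

# row.names = FALSE avoids writing an extra index column
write.csv(original, orig_path,  row.names = FALSE)
write.csv(cleaned,  clean_path, row.names = FALSE)
```

Writing the original file before any cleaning step runs keeps the raw data intact for comparison.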
Evaluation Criteria
Criteria                        Description                                   Points
Dataset Creation                Originality, completeness, and organization   10
Handling Missing Data           Appropriate method and justification          15
Outlier Detection & Treatment   Correct identification and reasoning          15
Measures of Central Tendency    Accuracy and interpretation                   15
Clustering Analysis             Correct process and explanation               20
Visualization                   Relevance, clarity, and labeling              10
Code Submission                 Code correctness and proper documentation     5
Report Presentation             Clarity, structure, and depth of analysis     10
Total                                                                         100