Data Mining Lab Report
Introduction
This report analyses student performance based on a dataset containing
demographic details, academic records, and behavioural attributes. The study
primarily focuses on attributes such as age, study time, failures, absences, and
grades (G1, G2, G3). The objective is to explore correlations, data distributions,
and potential factors influencing academic success.
Data Overview
Dataset: Student performance dataset (Mathematics students)
Observations: 395
Attributes: 33 (7 selected for analysis: age, studytime, failures, absences,
G1, G2, G3)
Missing Values: No missing values were detected in the selected dataset.
Summary Statistics
The summary statistics give the five-number summary of each selected
attribute: minimum, first quartile, median, third quartile, and maximum.
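As an illustration of how a five-number summary can be computed, here is a minimal Python sketch. The function name five_number_summary and the toy grades are hypothetical and are not taken from the dataset; the quartile method (linear interpolation between observed values) is an assumption, and other tools may use a slightly different convention.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # method="inclusive" interpolates linearly between observed values,
    # treating the minimum and maximum as the extremes of the data.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

# Hypothetical toy grades, not values from the student dataset.
toy_grades = [5, 8, 10, 11, 12, 14, 15, 18, 20]
print(five_number_summary(toy_grades))
```

Applied to each of the seven selected attributes in turn, a function like this would produce one row of the summary table per attribute.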
Boxplots
The overall boxplot reveals the presence of outliers, especially in
absences.
Individual boxplots show that G3 scores vary widely, with a median
around 11.
A separate boxplot of the final grade (G3) was drawn to examine this
attribute more closely and to read off its minimum, median, and maximum.
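Boxplots commonly mark a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below assumes that convention, which may differ from the one used by the plotting tool behind this report. The function name iqr_outliers and the toy absence counts are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical toy absence counts, not values from the student dataset.
toy_absences = [0, 0, 2, 2, 4, 4, 6, 8, 10, 54]
print(iqr_outliers(toy_absences))
```

With a right-skewed attribute such as absences, only the upper fence tends to matter, because the lower fence falls below zero.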
Correlation Analysis
The correlation matrix shows the following relationships:
o G1 and G3: 0.86 (strong positive correlation)
o G2 and G3: 0.91 (very strong positive correlation)
o Study time and final grades show a weaker correlation.
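Each entry of the correlation matrix measures the strength and direction of
the linear relationship between two attributes, on a scale from -1 to 1.
Values near 1 indicate a strong positive relationship, values near -1 a
strong negative (inverse) relationship, and values near 0 little or no
linear relationship.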
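To make the meaning of a correlation coefficient concrete, here is a minimal sketch of the Pearson coefficient written out by hand. It assumes the matrix in this report uses Pearson correlation, which the report does not state. The function name pearson_r and the toy inputs are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data illustrating the three typical cases.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))    # perfectly increasing together
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))    # perfectly inverse
print(pearson_r([1, 2, 3, 4], [1, -1, -1, 1]))  # no linear relationship
```

The sign gives the direction of the relationship and the absolute value gives its strength, so a coefficient of -0.9 describes a stronger relationship than one of 0.3.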
Correlation Heatmap
A correlation heatmap visualizes the relationships between numerical
attributes.
Darker colours indicate stronger relationships, confirming the high
correlation between previous and final grades.
Lighter colours indicate weaker relationships, consistent with the weaker
correlation between study time and final grades.
Histograms
The histograms show the frequency distributions of the grades and the
other selected attributes in the CSV file. A histogram summarises
discrete or continuous data measured on an interval scale.
The grade distributions (G1, G2, G3) are slightly skewed, with most
students scoring between 8 and 15.
Absences show a right-skewed distribution, with a few students having
extremely high absenteeism.
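A histogram is built by counting how many values fall into each bin. The following sketch shows that step and adds a rough skewness check. The function name histogram_counts, the bin width of 5, and the toy absence counts are all hypothetical choices made for illustration.

```python
import statistics
from collections import Counter

def histogram_counts(values, bin_width):
    """Count the values falling into each bin [start, start + bin_width).

    Bins that contain no values are omitted from the result.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical toy absence counts, not values from the student dataset.
toy_absences = [0, 0, 2, 2, 4, 4, 6, 8, 10, 54]
bin_width = 5
for start, count in histogram_counts(toy_absences, bin_width).items():
    print(f"{start:>3}-{start + bin_width - 1:<3} {'#' * count}")

# A mean well above the median is a rough sign of a right-skewed distribution.
print(statistics.mean(toy_absences) > statistics.median(toy_absences))
```

In the toy data a single very large value pulls the mean far above the median, which is the same pattern the report describes for absences.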
Scatter Plots
A scatter plot represents the relationship between two variables in a
dataset. The independent variable is plotted on the x-axis and the
dependent variable on the y-axis.
G1 vs G3 and G2 vs G3 exhibit strong linear relationships, suggesting that
early performance is a good predictor of final grades.
Study time vs G3 does not show a strong correlation.
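A strong linear relationship in a scatter plot means a straight line fits the points well. As a sketch of what "predictor" means here, the code below fits a least-squares line to toy data. No regression was fitted in this report; the function name fit_line and the toy values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical toy data lying exactly on the line y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```

For a pair such as G2 and G3, the fitted line would give a predicted final grade for any second-period grade; for study time and G3, the points would scatter widely around the line.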
Density Plot of Final Grades (G3)
A density plot represents the distribution of a continuous variable and
helps identify patterns, trends, and the structure of the data.
The distribution of G3 is slightly skewed, with peaks around 10-14.
A density plot highlights the concentration of students within specific
grade ranges.
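A density plot is usually drawn from a kernel density estimate, which places a small smooth bump on every observation and adds the bumps together. The sketch below assumes a Gaussian kernel with a hand-picked bandwidth; the function name gaussian_kde, the bandwidth of 1.5, and the toy grades are hypothetical, and plotting tools normally choose the bandwidth automatically.

```python
import math

def gaussian_kde(values, bandwidth):
    """Return a function estimating the density at x with a Gaussian kernel."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation, centred on that observation.
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density

# Hypothetical toy grades, not values from the student dataset.
toy_grades = [8, 10, 10, 11, 12, 14]
density = gaussian_kde(toy_grades, bandwidth=1.5)
for grade in range(0, 21, 2):
    print(f"{grade:>2} {'#' * round(density(grade) * 100)}")
```

The estimate is highest where observations cluster, which is how a density plot shows the concentration of students in a grade range.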
Findings from the report
Previous grades (G1, G2) strongly predict final grades (G3).
Absences exhibit high variability but do not directly correlate with
performance.
Study time does not show a strong relationship with final grades.
Most students have G3 scores between 8 and 15, with a few outliers.
Conclusion
This study highlights the significance of early academic performance in
determining final grades. While study time and absences do not show strong
correlations with G3, prior grades (G1, G2) are crucial indicators. Future
analyses could include additional factors such as parental support and
extracurricular activities.