Assignment 4
📊 Worth: 9%
📅 Due: December 14 at midnight
🕑 Late submissions: 5% penalty per late day. Maximum of 5 late days allowed.
Feel free to add more functions to avoid code repetition
⚠ What to Submit
One team member should submit the following:
A .py file called performance_factors.py
Report in .pdf format
All graphs in .png format
Modules
Ensure that the following modules are installed:
matplotlib
numpy
Student Performance Factors
Context
(Optional read)
In this assignment, you will study the impact of various factors affecting student performance in
exams. "Student Performance Factors" is a synthetic dataset created for educational purposes only.
It can be found on Kaggle, a very popular data science website where many AI competitions are
hosted.
The dataset includes various factors that may affect student performance, such as study habits,
attendance, and parental involvement. Your goal is to identify which factors have the greatest
impact on student exam scores.
Dataset
The data contains 6608 lines and 20 columns. Each line represents the data for one student and
has the following columns:
Column | Type | Description | Index
Hours_Studied | int or float | Number of hours spent studying per week. | 0
Attendance | int or float | Percentage of classes attended. | 1
Parental_Involvement | string | Level of parental involvement in the student’s education (Low, Medium, High). | 2
Access_to_Resources | string | Availability of educational resources (Low, Medium, High). | 3
Extracurricular_Activities | string | Participation in extracurricular activities (Yes, No). | 4
Sleep_Hours | int or float | Average number of hours of sleep per night. | 5
Previous_Scores | int or float | Scores from previous exams. | 6
Motivation_Level | string | Student’s level of motivation (Low, Medium, High). | 7
Internet_Access | string | Availability of internet access (Yes, No). | 8
Tutoring_Sessions | int or float | Number of tutoring sessions attended per month. | 9
Family_Income | string | Family income level (Low, Medium, High). | 10
Teacher_Quality | string | Quality of the teachers (Low, Medium, High). | 11
School_Type | string | Type of school attended (Public, Private). | 12
Peer_Influence | string | Influence of peers on academic performance (Positive, Neutral, Negative). | 13
Physical_Activity | int or float | Average number of hours of physical activity per week. | 14
Learning_Disabilities | string | Presence of learning disabilities (Yes, No). | 15
Parental_Education_Level | string | Highest education level of parents (High School, College, Postgraduate). | 16
Distance_from_Home | string | Distance from home to school (Near, Moderate, Far). | 17
Gender | string | Gender of the student (Male, Female). | 18
Exam_Score | int or float | Final exam score. This is the dependent variable. | 19
Columns of interest
Focus on the following factors:
Hours_Studied
Teacher_Quality
School_Type
Two additional numeric factors of your choice.
PART I
1.1 Write the function read_data()
Input parameters:
the file_name
Returns:
A list containing all the required lists:
[exam_scores_lists, study_hours_list, choice1_list, choice2_list, teacher_list, school_list]
Task
Use the technique seen in class to read the .csv file
Important: Some lines are missing values in Hours_Studied and Teacher_Quality. In those
cases, append None as shown in the example below:
teacher_list = []
# ... code
for line in csv_reader:
    # ... code
    if line[TEACHER_INDEX] == '':  # missing categorical value
        teacher_list.append(None)
    else:
        teacher_list.append(line[TEACHER_INDEX])
For each line make sure to:
clean the data if necessary
convert it to the appropriate data type
add each value to its associated list.
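The steps above can be sketched as follows. This is only one possible layout, not the required solution: it assumes the technique seen in class is csv.reader, that the first line of the file holds the column names, and that the two free choices are Attendance and Previous_Scores. The index constants follow the table in the Dataset section, and the helpers to_number() and to_category() are hypothetical names introduced for this example.

```python
import csv

# Column indices taken from the table in the Dataset section.
HOURS_INDEX = 0
CHOICE1_INDEX = 1    # assumption: Attendance chosen as the first extra factor
CHOICE2_INDEX = 6    # assumption: Previous_Scores chosen as the second one
TEACHER_INDEX = 11
SCHOOL_INDEX = 12
SCORE_INDEX = 19


def to_number(text):
    """Hypothetical helper: return the cell as a float, or None if it is empty."""
    text = text.strip()
    return float(text) if text != '' else None


def to_category(text):
    """Hypothetical helper: return the cleaned string, or None if it is empty."""
    text = text.strip()
    return text if text != '' else None


def read_data(file_name):
    exam_scores_list = []
    study_hours_list = []
    choice1_list = []
    choice2_list = []
    teacher_list = []
    school_list = []

    with open(file_name, newline='') as csv_file:
        csv_reader = csv.reader(csv_file)
        next(csv_reader)  # assumption: skip the header line with the column names
        for line in csv_reader:
            exam_scores_list.append(to_number(line[SCORE_INDEX]))
            study_hours_list.append(to_number(line[HOURS_INDEX]))
            choice1_list.append(to_number(line[CHOICE1_INDEX]))
            choice2_list.append(to_number(line[CHOICE2_INDEX]))
            teacher_list.append(to_category(line[TEACHER_INDEX]))
            school_list.append(to_category(line[SCHOOL_INDEX]))

    return [exam_scores_list, study_hours_list, choice1_list,
            choice2_list, teacher_list, school_list]
```

Routing every cell through one of the two helpers keeps the missing-value check in a single place, which is in the spirit of adding functions to avoid code repetition.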
1.2 Write the function print_stats()
Input parameter
A list of exam scores scores_list
Return
None
Task
Gather some statistics on the student scores: the minimum, maximum, average, standard deviation
and median, as well as the count of students.
Calculate the minimum min_score, the maximum max_score, the average avg_score, the median
med_score and the standard deviation std
Calculate the count of elements in the list
Display the values as shown below
Hint: You can use NumPy's functions np.median(scores_list) to calculate the median,
np.mean(scores_list) to calculate the average and np.std(scores_list) to calculate the
standard deviation.
Call the function in main(). Copy the results into your report.
Example of output
------Exam scores statistics------
Average: 67.
Median : 67.
Min : 55
Max : 101
Std : 3.
Count : 6607
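A minimal sketch of how print_stats() could be organised, using the NumPy functions from the hint. The helper compute_stats() is a hypothetical name that is not part of the assignment, and both the two-decimal formatting and the label spacing are assumptions; match them to the expected output.

```python
import numpy as np


def compute_stats(scores_list):
    """Hypothetical helper: gather the six statistics in a dictionary."""
    return {
        "avg_score": float(np.mean(scores_list)),
        "med_score": float(np.median(scores_list)),
        "min_score": min(scores_list),
        "max_score": max(scores_list),
        # np.std computes the population standard deviation (ddof=0) by default.
        "std": float(np.std(scores_list)),
        "count": len(scores_list),
    }


def print_stats(scores_list):
    stats = compute_stats(scores_list)
    print("------Exam scores statistics------")
    # Assumption: two decimals for the computed values.
    print(f"Average: {stats['avg_score']:.2f}")
    print(f"Median : {stats['med_score']:.2f}")
    print(f"Min    : {stats['min_score']}")
    print(f"Max    : {stats['max_score']}")
    print(f"Std    : {stats['std']:.2f}")
    print(f"Count  : {stats['count']}")
```

Separating the computation from the printing is optional, but it makes the values easy to check before they are copied into the report.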
✅ Submit your work as .py file
PART II
In this section, we will focus on plotting and analysing the trends. You are free to design the
function in whichever way you want, but ensure that the graphs and the values are saved properly.
1.3 Write a function trend_analysis()
This step uses np.polyfit() to find a linear function that approximates the relationship between
student scores and other columns.
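As a small illustration of the fitting step, np.polyfit(x, y, 1) returns the two coefficients of a degree 1 polynomial, slope first and intercept second. The sketch below uses a hypothetical helper name, fit_line(), and assumes that pairs containing a None value should simply be skipped before fitting, since np.polyfit() cannot work with None.

```python
import numpy as np


def fit_line(x_list, y_list):
    """Hypothetical helper: fit y = a*x + b, skipping pairs with a missing value."""
    pairs = [(x, y) for x, y in zip(x_list, y_list)
             if x is not None and y is not None]
    x_values = np.array([x for x, _ in pairs], dtype=float)
    y_values = np.array([y for _, y in pairs], dtype=float)
    # Degree 1 gives the coefficients [a, b] of a*x + b, highest power first.
    a, b = np.polyfit(x_values, y_values, 1)
    return float(a), float(b)


# The three complete pairs lie on y = 2x + 1, so a is close to 2 and b close to 1.
a, b = fit_line([1, 2, None, 4], [3, 5, 7, 9])
print(f"y = {a:.2f}x + {b:.2f}")
```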
Task:
Fit a linear polynomial function to the data using np.polyfit(), where score_list is a
function of study_hours_list:
Plot both the original data and the model on the same graph.
Add axis labels and a graph title
Use different marker styles for the model
Add a legend and labels for the plots.
Save the figure as trend_hours_studied.png
Print the equation. Copy the results into your report.
equation = f"$y = {a:.2f}x + {b:.2f}$"
print_answer("Study Hours: ", equation)
(Optional) You can display the equation on the graph as such:
plt.text(x=20, y=70, s=equation, fontsize=14, color='black', ha='center',
         va='center')
Repeat the previous steps with the two other lists choice1_list and choice2_list.
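One way the whole task could fit together is sketched below. Since the design of the function is left open, the signature (x_list, score_list, x_label, file_name) is an assumption, as are the dashed red line for the model, the "Agg" backend (which writes figures to files without opening a window) and the choice to skip rows whose factor value is None. The sketch returns the equation rather than printing it through print_answer(), which is not defined in this handout.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: save figures to files without opening a window
import matplotlib.pyplot as plt
import numpy as np


def trend_analysis(x_list, score_list, x_label, file_name):
    # Keep only the pairs where the factor value is present.
    pairs = [(x, y) for x, y in zip(x_list, score_list) if x is not None]
    x_values = np.array([p[0] for p in pairs], dtype=float)
    y_values = np.array([p[1] for p in pairs], dtype=float)

    # Linear fit: a is the slope, b is the intercept.
    a, b = np.polyfit(x_values, y_values, 1)
    equation = f"$y = {a:.2f}x + {b:.2f}$"

    # Evaluate the model on sorted x values so the line is drawn left to right.
    x_model = np.sort(x_values)
    y_model = a * x_model + b

    plt.figure()
    plt.plot(x_values, y_values, "o", markersize=3, label="Data")
    plt.plot(x_model, y_model, "r--", label=f"Linear model: {equation}")
    plt.xlabel(x_label)
    plt.ylabel("Exam score")
    plt.title(f"Exam score vs. {x_label}")
    plt.legend()
    plt.savefig(file_name)
    plt.close()
    return equation
```

Passing the axis label and the file name as parameters lets the same function be called three times, for example trend_analysis(study_hours_list, score_list, "Hours studied", "trend_hours_studied.png"), followed by one call each for choice1_list and choice2_list.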
Example of graph
Happy Holidays! 🎄🎉✨ Wishing you a good end of semester and a restful break 🌟