SQL Databases – Final Project (Database Design and Implementation)
SQL Databases (DAMO-500-2)
Team
Mirza Yahya Baig
Sangjin Park
Wilson Kwesi Bli
Instructor: Dr. Hany Osman
6/12/2024
1
Table of Contents
Chapter 1: Background ................................................................................................................ 3
1.1 Background .......................................................................................................................... 3
1.2 Understanding the Database Entities & Relations ........................................................... 3
1.3 Research questions with Hypothesis .................................................................................. 4
Chapter 2: Data Collection and SQL Queries ............................................................................ 5
2.1.1 Research Question One .................................................................................................... 6
2.1.2 Research Question Two .................................................................................................... 7
2.1.3 Research Question Three ................................................................................................. 9
2.2.1 Non-Research Queries ....................................................................................................11
Chapter 3. Potential Data and Machine Learning Analysis ................................................... 22
Chapter 4: Conclusion ................................................................................................................ 24
2
Chapter 1: Background
1.1 Background
Formula 1 is often considered the pinnacle of motorsport: harmonious leading-edge technology
combined with driver skill delivers high-octane racing entertainment. Circuit length, layout, and
design do vary; hence, the performances of drivers and teams tend to vary quite a lot depending
on those characteristics of a circuit. Such understanding of the variations would give insights into
competitive advantages, driver consistency of skill, and team strategy.
The Formula 1 database contains data in a full dataset, including circuit characteristics, race
results, penalties, team standings, and driver performance with lap records. Using the dataset,
this study strives to analyze the interrelation of circuit characteristics and performance metrics
while trying to answer some fundamental questions about competitive trends in Formula 1.
1.2 Understanding the Database Entities & Relations
Below shows the cardinalities between the entities and relations of the database:
3
Figure 1: Formula 1 Database Entity Relationship Diagram
1.3 Research questions with Hypothesis
1. How does the length of a circuit influence lap record times?
Null Hypothesis (H₀): The length of the circuit does not influence lap record time.
Alternate Hypothesis (H₁): The length of the circuit does not influence lap record time.
4
2. Which drivers consistently perform best across circuits of varying lengths and
layouts?
Null Hypothesis (H₀): Driver performance does not significantly vary across circuits of
different lengths and layouts.
Alternate Hypothesis (H₁): Driver performance significantly varies across circuits of
different lengths and layouts.
3. Do specific circuit characteristics favor certain drivers or teams?
Null Hypothesis (H₀): Circuit characteristics do not significantly influence which drivers
or teams achieve lap records.
Alternate Hypothesis (H₁): Circuit characteristics significantly influence which drivers
or teams achieve lap records.
Chapter 2: Data Collection and SQL Queries
To do further analysis of the research questions, the next important step is to query the
database to extract the significant data and queries required. Below is screenshots of database
MySQL queries and output based on the researched questions.
5
2.1.1 Research Question One - How does the length of a circuit influence lap record times?
SQL Query:
Figure 2: Research Question One SQL Query
Output:
6
Figure 3: Research Question One - Output
2.1.2 Research Question Two - Which drivers consistently perform best across circuits of
varying lengths and layouts?
SQL Query:
7
Figure 4: Research Question Two SQL Query
8
Output:
Figure 5: Research Question Two - Output
2.1.3 Research Question Three - Do specific circuit characteristics favor certain drivers or
teams?
SQL Query:
9
Figure 6: Research Question Three Query
Output:
10
Figure 7: Research Question Three - Output
2.2.1 Non-Research Queries
Question 1: Which driver and their team collectively scored the highest total points in races
where the driver achieved a podium finish (Position 1, 2, or 3)?
SQL Code:
11
Figure 8: Non-Research Question One SQL Query
Output:
Figure 9: Non-Research Question One - Output
Question 2: Which team had the best collective driver performance in terms of average race
position across all drivers in the 2020 season?
12
SQL Code:
13
Figure 10: Non-Research Question Two SQL Query
Output:
Figure 11: Non-Research Question Two - Ouput
14
Question 3: How do weather conditions affect the frequency of penalties in races, and which
weather condition is associated with the highest average number of penalties per race?
SQL Code:
Figure 12: Non-Research Question Three SQL Query
15
Output:
Figure 13: Non-Research Question Three - Output
Question 4: How does driver experience (career points and championship) correlate with their
performance on penalty-heavy circuits?
SQL Code:
Figure 14: Non-Research Question Four SQL Query
16
Output:
Figure 15: Non-Research Question Four - Output
Question 5: Which team has the highest average points per race in the 2020 season?
SQL Code:
Figure 16: Non-Research Question Five SQL Query
17
Output:
Figure 17: Non-Research Question Five - Output
18
Question 6: Which country produced the youngest championship winning drivers?
SQL Code:
Figure 18: Non-Research Question Six SQL Query
19
Output:
Figure 19: Non-Research Question Six - Output
Question 7: Which team and driver combination received the most penalties?
SQL Code:
Figure 20: Non-Research Question Seven SQL Query
20
Output:
Figure 21: Non-Research Question Seven - Output
21
Chapter 3. Potential Data and Machine Learning Analysis
The database is crucial to the success of Formula 1. It represents the structure of the league, the
teams, sponsorship and the intensity of the league. Further machine learning analysis can be
performed based on the research question to gain further insight. For example, based on the
hypothesis, and extracted data from research question one, the potential data analysis
methods to be performed are described below:
▪ Descriptive Statistics:
✓ Calculate the mean, median, standard deviation, and range for both circuit length and lap
record times.
✓ Examine the spread and variation of lap record times relative to circuit lengths.
▪ Visualization:
✓ Create a scatter plot with circuit length on the x-axis and lap record times on the y-axis.
This will help visualize the relationship between the two variables.
✓ Add a trend line (linear or nonlinear) to observe the nature of the relationship.
✓ Include boxplots for lap record times grouped by circuit length categories (e.g., short,
medium, long circuits).
▪ Correlation Analysis:
✓ Compute the Pearson or Spearman correlation coefficient to assess the strength and
direction of the relationship between circuit length and lap record times.
▪ Regression Analysis:
22
✓ Perform a linear regression analysis to determine if circuit length significantly predicts
lap record times.
✓ Check the p-value for the slope to test if the relationship is statistically significant.
▪ Hypothesis Testing:
✓ Use ANOVA or t-tests to compare lap record times across predefined circuit length
categories (e.g., short circuits < 4.5 km, medium circuits 4.5–5.5 km, long circuits > 5.5
km).
✓ Test whether the variance in lap record times differs significantly between groups.
▪ Advanced Models:
✓ Incorporate circuit layout and other track-specific variables if available, using multiple
regression or machine learning models to evaluate their collective impact on lap record
times.
23
Chapter 4: Conclusion
Finally, we will conclude about the present futuristic possibility for the project. The performance
to bring into practice a strong database integrated with constructs queries is proved through the
project glare. The formula 1 database provides a strong foundation for trend analysis and pattern
recognition proven from the extracted data set and builds a better understanding of data
characteristics.
It is evident that further improvement, insights, and predictions can be learned from the data
through machine learning analysis, predictively analysis, clustering and segmentation and many
others.
These will be beneficial to the Formula 1 organization, as they can rely on these models and
insights to make informed decisions and further make the league lucrative for investments,
sponsorship and revenue transformations.
24