A Model-Based Approach for Flight Delay Prediction

Article in Indian Journal Of Applied Research · May 2023


DOI: 10.36106/ijar/4524156



Volume - 13 | Issue - 05 | May - 2023 | PRINT ISSN No. 2249 - 555X | DOI : 10.36106/ijar
Original Research Paper
Statistics

A MODEL-BASED APPROACH FOR FLIGHT DELAY PREDICTION

Ayan Dutta: School of Mathematical and Computational Sciences, Indian Association for the Cultivation of Science, India
Mrittika Nandi: School of Mathematical and Computational Sciences, Indian Association for the Cultivation of Science, India
Smita Sarkar: School of Mathematical and Computational Sciences, Indian Association for the Cultivation of Science, India
Swetlina Hota: School of Mathematical and Computational Sciences, Indian Association for the Cultivation of Science, India
Suklav Ghosh*: School of Mathematical and Computational Sciences, Indian Association for the Cultivation of Science, India (*Corresponding Author)
ABSTRACT Flight delays incur direct and indirect costs, such as managing crowds at the gate, extra remuneration for staff, food service, and lodging. The delayed arrival of aircraft has a significant impact on an airport's management, such as the reallotment of parking gates and runways and the scheduling of ground staff. The precise prediction of flight delays can provide passengers with dependable travel schedules and improve the service performance of airports and airlines. Objectives: To propose a model that can predict a flight delay using machine learning algorithms such as logistic regression. Methods: A logistic model, with and without shrinkage, was fit on an Airlines dataset covering the movements of 6585 domestic flights of 18 airlines in the United States; lasso, ridge and elastic net were then used to try to improve the predictive ability. Results: Among all the models, the day-specific model was found to give the best flight delay prediction. Conclusion: Since only 0.6 sensitivity and 0.5-0.6 specificity could be achieved by our predictive model, it is not very reliable, as there are other factors that influence a flight delay.

KEYWORDS : Logistic Regression, Ridge and Lasso Regularization, k-Means Clustering, Machine Learning, Predictive
Model
INTRODUCTION
Time is money. Delay is one of the most significant factors in the performance measure of any transportation system. Since air travel is an integral part of the world's transportation system, airline delays have adverse economic effects on passengers, airports and airlines. Every year approximately 20% of airline flights are canceled or delayed, costing passengers more than 20 billion dollars in money and time.

The air traffic system is bursting at the seams, with a huge number of passengers and flights clogging airports and creating congestion on runways and in the air. Flight delays are like "environmental awareness": everybody talks about them, but nobody seems to do anything.

Due to the complex nature of air transport, it is extremely hard to explain the reason for a delay. Some of the reasons for delay are mechanical issues, air traffic control, weather conditions, runway queues, ground delays, and capacity constraints.

Although accurate prediction of flight delay is difficult, we have tried to address the problem. We have attempted an in-depth analysis of the data and considered different models to see how the prediction accuracy changes.

DATA DESCRIPTION
To view the clean data, visit https://suklav.github.io/acad/projects/isical_kmd/Airlines.csv
The Airlines dataset was obtained from Kaggle and was originally collected by the US Department of Transportation. This dataset tracks the movement of various domestic flights within the United States. It has 539383 rows and 8 variables.

The different variables are:
1. Airline (There are 18 airlines)
2. Flight (Flight No.) (There are 6585 flights)
3. Airport From (There are 293 airports)
4. Airport To (There are 293 airports)
5. DayOfWeek (Monday(1), Tuesday(2), Wednesday(3), Thursday(4), Friday(5), Saturday(6), Sunday(7))
6. Time (departure time measured in minutes from midnight; the range is 10-1439)
7. Length (duration of flight in minutes)
8. Delay (0 represents No Delay, 1 represents Delay)

THEORY

LOGISTIC REGRESSION MODEL
Logistic regression is a supervised machine learning model. It is a method that predicts a discrete outcome based on training datasets. It is a widely used algorithm for classification and performs well with linearly separable classes.

It uses a logistic function to model a binary output variable. These binary results allow straightforward decisions to be taken between two choices. The model analyzes the relationship between the existing predictors and predicts the value of a response variable. For example, this model can be used to predict whether it will rain or not.

A standard logistic function is defined as

$$ f(x) = \frac{1}{1 + e^{-x}} $$

SHRINKAGE METHODS
Shrinkage refers to a set of regularization techniques that fit a regression model using all p predictors while imposing some restrictions on the size of their estimated coefficients. Shrinkage, also called regularization, has the effect of reducing variance, hence improving the model's stability, and can also set some of the coefficient estimates to zero, thus allowing for variable selection. [2]
Ridge Regression aims to minimize the penalized or regularized Residual Sum of Squares (RSS)

$$ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 $$

where λ ≥ 0 is a complexity parameter.

The coefficients get closer to 0 as λ rises. Ridge Regression, however, retains all p predictors in the final model regardless of the size of their coefficients, which can be demanding for a model with many features. Lasso Regression, which performs variable selection, overcomes this drawback.
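To make the effect of λ concrete, the following minimal sketch considers a single centered predictor with no intercept, an assumption made here purely for illustration. In that special case the ridge objective, the sum of (y_i − βx_i)² plus λβ², has the closed-form minimizer β = Σ x_i y_i / (Σ x_i² + λ). The function name and the toy data are hypothetical and are not taken from the Airlines dataset.

```python
def ridge_coefficient(x, y, lam):
    """Closed-form ridge estimate for one predictor and no intercept.

    Minimizes sum((y_i - b * x_i) ** 2) + lam * b ** 2 over b.
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


# Hypothetical toy data (roughly y = 2x), not from the Airlines dataset.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.2, 2.1, 3.9]

for lam in [0.0, 1.0, 10.0, 100.0]:
    beta = ridge_coefficient(x, y, lam)
    print(f"lambda = {lam:6.1f}  coefficient = {beta:.4f}")
```

In this setting the estimate shrinks toward 0 as λ grows but never becomes exactly 0 for any finite λ, which is why ridge on its own does not remove predictors from the model.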
Unlike Ridge Regression's L2 penalty, which squares the coefficients, Lasso Regression uses an L1 penalty based on the absolute values of the coefficients. To lower the mean squared error (MSE), it is crucial to tune the value of the penalty parameter λ in Lasso Regression.

Lasso Regression sets some of the coefficient estimates exactly equal to 0, whereas Ridge Regression typically shrinks the coefficients close to 0 without making them exactly 0.

The elastic net, on the other hand, linearly combines the L1 and L2 penalties. It thereby overcomes the limitations of Lasso, which uses a penalty function based on the L1 norm alone:

$$ \|\beta\|_1 = \sum_{j=1}^{p} |\beta_j| $$

CLASSIFICATION RULE
A classification algorithm is a supervised learning technique used to categorize new observations on the basis of training data. A program that performs classification learns from the provided dataset and then assigns new observations to one of several classes or groups.

CONFUSION MATRIX
A confusion matrix (or error matrix) is a particular table layout that helps in visualizing the performance of a classification algorithm. For a binary classifier, a typical confusion matrix cross-tabulates the actual classes against the predicted classes, giving the counts of true positives, false positives, true negatives and false negatives.
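As an illustration of how a binary classification rule produces the counts that go into a confusion matrix, the sketch below converts predicted probabilities into 0/1 labels with a cutoff and tallies the four possible outcomes. The function names, the 0.5 cutoff and the toy values are assumptions for illustration only and are not the authors' code.

```python
def classify(probabilities, cutoff=0.5):
    """Label an observation 1 (delay) when its predicted probability exceeds the cutoff."""
    return [1 if p > cutoff else 0 for p in probabilities]


def confusion_counts(actual, predicted):
    """Return the four cells of a 2x2 confusion matrix as a dictionary."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1      # delayed, predicted delayed
        elif a == 0 and p == 1:
            counts["FP"] += 1      # on time, predicted delayed
        elif a == 0 and p == 0:
            counts["TN"] += 1      # on time, predicted on time
        else:
            counts["FN"] += 1      # delayed, predicted on time
    return counts


# Hypothetical predicted delay probabilities and true labels.
probs = [0.91, 0.62, 0.48, 0.30, 0.75, 0.55, 0.10, 0.44]
actual = [1, 1, 1, 0, 0, 1, 0, 0]

predicted = classify(probs)
cells = confusion_counts(actual, predicted)

print("              predicted 0   predicted 1")
print(f"actual 0      {cells['TN']:^11}   {cells['FP']:^11}")
print(f"actual 1      {cells['FN']:^11}   {cells['TP']:^11}")
```

Rows correspond to the actual class and columns to the predicted class; other presentations transpose this layout, so the orientation should always be stated alongside the table.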
VARIOUS TERMS USED
1. Specificity - Specificity is a measure of how well an algorithm performs. It measures the percentage of actual negative outcomes that are correctly identified. For instance, the proportion of healthy individuals who are correctly classified as not having COVID.

2. Sensitivity - Sensitivity is another metric for evaluating an algorithm's performance. It measures the percentage of actual positive outcomes that are correctly identified. For instance, the proportion of COVID patients who are correctly diagnosed with the disease.

3. Goodness of fit - The term "goodness of fit" refers to how well a model fits a certain set of observations or how well it will forecast future results.
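Both metrics depend on the probability cutoff used to label an observation as positive, so it can help to see them computed at several cutoffs. The sketch below does this and reports the cutoff with the largest sum of sensitivity and specificity. That selection criterion, the function names and the toy data are assumptions made for illustration; the paper itself only states that a cutoff giving decent values of both metrics is chosen from a plot.

```python
def sensitivity_specificity(actual, probs, cutoff):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).

    Assumes the data contain at least one positive and one negative case.
    """
    tp = sum(1 for a, p in zip(actual, probs) if a == 1 and p > cutoff)
    fn = sum(1 for a, p in zip(actual, probs) if a == 1 and p <= cutoff)
    tn = sum(1 for a, p in zip(actual, probs) if a == 0 and p <= cutoff)
    fp = sum(1 for a, p in zip(actual, probs) if a == 0 and p > cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def best_cutoff(actual, probs, cutoffs):
    """Return (cutoff, sensitivity, specificity) maximizing sensitivity + specificity."""
    best = None
    for c in cutoffs:
        sens, spec = sensitivity_specificity(actual, probs, c)
        if best is None or sens + spec > best[1] + best[2]:
            best = (c, sens, spec)
    return best


# Hypothetical labels (1 = delay) and predicted delay probabilities.
actual = [1, 1, 1, 0, 0, 1, 0, 0]
probs = [0.91, 0.62, 0.48, 0.30, 0.75, 0.55, 0.10, 0.44]
cutoffs = [0.35, 0.45, 0.55, 0.65]

for c in cutoffs:
    sens, spec = sensitivity_specificity(actual, probs, c)
    print(f"cutoff {c:.2f}: sensitivity {sens:.2f}, specificity {spec:.2f}, sum {sens + spec:.2f}")

chosen = best_cutoff(actual, probs, cutoffs)
print("chosen cutoff:", chosen[0])
```

A classifier that guesses at random has sensitivity and specificity summing to about 1, so a sum above 1 is the minimum bar for a cutoff to be informative.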

K-MEANS CLUSTERING
A given data collection can be divided into k clusters using the unsupervised machine learning procedure known as "k-means clustering." It assigns objects to clusters so that objects in the same cluster are as similar as possible (high intra-class similarity), while objects from different clusters are as different as possible (low inter-class similarity). In other words, the total within-cluster variation is to be minimized. Each cluster is represented by its center, i.e., the mean of the points allocated to it. [1]

K-means algorithm:
1. The analyst specifies how many clusters (K) should be established. One method is to run k-means clustering for various values of k and plot the within-cluster sum of squares (wss) against the number of clusters. This plot represents the within-cluster variance, which decreases as k increases. The "knee" in the plot is usually taken as the optimal number of clusters, i.e., additional clusters beyond this point add little value.

2. K objects are chosen at random from the dataset as the initial cluster centers, or means.

3. Each observation is allocated to its closest centroid by calculating the Euclidean distance between the object and each centroid.

4. The centroid of each cluster is updated to the mean of all the data points currently in the cluster. The centroid of the kth cluster is a vector of length p (p is the number of variables) containing the means of all the variables for the observations in that cluster.

5. Steps 3 and 4 are iterated until the cluster assignments cease to change or the maximum number of iterations is reached. In this way the total wss is reduced. [1]

METHODOLOGY:
1. Repetitions were removed: for a fixed day of the week and a given pair of origin and destination airports, there are multiple rows with exactly the same entries. This generally happens when different people store the same data and the records are then compiled. The rectified dataset with no repetitions is significantly smaller.

2. The data is split into 18 parts, one for each airline. The airports are numbered from 1 to n, say, and these numbers are scaled by subtracting the mean and dividing by the standard deviation. This is done for the different days.

3. A predictive model, i.e., a logistic model, is fitted without any shrinkage for each airline, and the goodness of fit is assessed based on the confusion matrix. After getting a decent model, ridge, lasso and elastic net are used to check whether they improve the predictive ability of the model.

4. The sensitivity and specificity are computed from the confusion matrix for each airline, and it is seen that they are not both good at the same time, since our classification rule is based on this simple formulation: if p/(1-p) > 1, i.e., p > 0.5, we classify as 1, otherwise as 0.

Now, an airline is fixed and, starting from the p > 0.5 classification rule, we move to higher values, say 0.55, 0.60, 0.65, up to 0.80, i.e., if the predicted p > 0.6, the observation is classified as 1, else 0, and likewise. For each of these thresholds, the sensitivity and specificity are calculated and plotted on the same graph. From this plot, the optimal threshold is chosen as one for which both sensitivity and specificity are decent. This process is repeated for all airlines, and the threshold varies by airline. This is called an adaptive approach, where a threshold is chosen based on the data to achieve a certain level of accuracy in terms of sensitivity and specificity.

5. To obtain a better predictive model, each airline was fixed and logistic models were fit with respect to each day (airline-day specific).

6. To get an even better model, each day was fixed throughout the data and logistic models were fit (day-specific).

7. Now, the unsupervised k-means clustering approach is used by clustering the entire data excluding the response (delay: yes/no) and
the optimal number of clusters is found. Then, logistic models are run for each cluster.

RESULTS AND DISCUSSIONS:
Number of Delays Vs Airlines:

Observation:
We can observe that WN (Southwest Airlines) has the maximum number of delays and HA (Hawaiian Airlines) has the least number of delays. It is possible that Southwest Airlines is the least efficient airline and Hawaiian Airlines is the most efficient airline.

Number of Delays vs Day in week:

Observation:
We can clearly observe that Wednesday has the maximum number of delays and Saturday has the least number of delays. It is possible that Wednesday is the busiest day in the US and Saturday is the least busy day.

Airline specific analysis using logistic methods (with and without shrinkage):

Observations:
1. The logistic regression model has the best predictive ability, with the least prediction error.
2. Specificity and sensitivity are not both good at the same time.

Airline specific with optimal threshold values:

Since the predictive ability of the logistic models without shrinkage was observed to be the best, shrinkage methods were not used for further analysis. To get equally good specificity and sensitivity values, classification rules were applied over a range of threshold values to obtain the optimal threshold for each airline.

This optimal threshold was then used for further analysis.

Airline   Threshold Value   Sensitivity   Specificity   Sum
CO        0.54              0.5825462     0.6014539     1.1840001
US        0.42              0.5582086     0.5653754     1.123584
AA        0.46              0.5700344     0.518064      1.0880984
AS        0.4               0.5809889     0.5264059     1.1073948
DL        0.47              0.5854699     0.5617011     1.147171
B6        0.49              0.510716      0.5735901     1.0843061
HA        0.44              0.531218      0.5505255     1.0817435
OO        0.47              0.5833935     0.5575236     1.1409171
9E        0.44              0.5766208     0.5398162     1.116437
OH        0.35              0.5556734     0.5754911     1.1311645
EV        0.44              0.5566004     0.5764859     1.1330863
XE        0.42              0.566567      0.5848487     1.1514157
YV        0.34              0.6239148     0.5883436     1.2122584
UA        0.38              0.5646364     0.5902374     1.1548738
MQ        0.44              0.5499251     0.537902      1.0878271
FL        0.36              0.5999178     0.5777778     1.1776956
F9        0.48              0.5629838     0.5429487     1.1059325
WN        0.66              0.6103013     0.5993017     1.209603

For detailed results: https://suklav.github.io/acad/projects/isical_kmd/Specific_Sheet1.html

Observations:
1. The average of the sum of specificity and sensitivity is 1.132.
2. Clearly, the sum of specificity and sensitivity is decent and the predictive ability of the models has improved.

Airline-Day Specific:


Observations:
1. The average of the sum of specificity and sensitivity is 1.177.
2. Much better predictive ability is observed in this model.

Day Specific Results:

Observations:
1. The average of the sum of specificity and sensitivity is 1.182.
2. As we can observe, this model gives a better prediction.

Clustering results:
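The plot for this subsection is presumably the elbow plot of the within-cluster sum of squares (wss) against the number of clusters. A minimal sketch of how such a curve can be computed follows. It implements the standard k-means iteration in plain Python on hypothetical two-dimensional toy data, with fixed random seeds, a few random restarts and function names chosen here for illustration; it is not the authors' code and does not use the Airlines data.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points, k, max_iter=100, seed=0):
    """Standard k-means iteration; returns (centroids, assignments, total wss)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # K objects picked at random
    assignments = None
    for _ in range(max_iter):
        # Allocate each observation to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:       # assignments stopped changing
            break
        assignments = new_assignments
        # Update each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:                          # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    wss = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignments))
    return centroids, assignments, wss


def best_wss(points, k, restarts=10):
    """Smallest total wss over several random initializations."""
    return min(k_means(points, k, seed=s)[2] for s in range(restarts))


# Hypothetical toy data: four noisy groups of 25 points each.
rng = random.Random(1)
group_centers = [(0, 0), (10, 0), (0, 10), (10, 10)]
points = [
    (cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
    for cx, cy in group_centers
    for _ in range(25)
]

# Values for the elbow plot: total wss for k = 1, ..., 7.
for k in range(1, 8):
    print(k, round(best_wss(points, k), 1))
```

Plotting the printed wss values against k gives the elbow curve; the bend is read off by eye, so the chosen number of clusters remains a judgment call rather than an exact output of the algorithm.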

Observation:
From the plot, we observe that the optimal number of clusters is 4.

The results of the logistic regression model for each cluster are as follows:

Observation: The average of the sum of sensitivity and specificity is 1.102.

From the above observations, it can be concluded that the day-specific model gives the best flight delay prediction.

CONCLUSIONS
Even with the machine learning techniques considered here, only 0.6 sensitivity and 0.5-0.6 specificity could be achieved in our predictive model. This indicates that although the proposed model performs better than random guessing, it is not very reliable. Also, several additional factors that influence flight delay were not provided in the dataset.

This is possibly the first model-based approach for addressing flight delay in the United States. A better predictive model needs data on additional factors associated with flight delay, such as weather conditions, air traffic control, and the number of hours a pilot works in a week.

ACKNOWLEDGEMENT
We are immensely grateful to our supervisor Dr. Kiranmoy Das, Associate Professor, Applied Statistics Division, Indian Statistical Institute, Kolkata, for guiding us throughout this project. This project would not have been successful without his kind and valuable assistance and support. We are thankful for his belief in our abilities, his enthusiasm for this project, and his continuous insistence. He provided excellent teaching, encouragement, sound advice, and some great ideas.

APPENDIX
The code is uploaded here: https://suklav.github.io/acad/projects/isical_kmd/

REFERENCES:
[1] https://www.datanovia.com/en/lessons/k-means-clustering-in-r-algorith-and-practical-examples/
[2] http://busigence.com/blog/shrinkage-methods-in-linear-regression/
[3] https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc
[4] Sternberg, A., Carvalho, D., Soares, J. D. A., Ogasawara, E. (2017). A Review on Flight Delay Prediction.
[5] Ramos, G. J. (2014). Estimating The Effect Of Poverty On Violent Crime. Theses and Dissertations. 1696.
[6] Bhatia, B. (2018). Flight Delay Prediction.
[7] Raj, V., Raj, V., Singh, S., Mishra, A. (2021). Flight Delay Prediction.