PySpark Questions:
1. Remove the columns CANCELLATION_REASON and DIVERTED from a
Spark DataFrame.
flight_df = flight_df.drop("CANCELLATION_REASON", "DIVERTED")
flight_df.show()
2. Read the dataset into a PySpark DataFrame and filter flights that
have DEPARTURE_DELAY greater than 20 minutes and are flying to
LAX. Show the count.
flight_df = [Link]("FlightsDataset(in).csv", header=True,
inferSchema=True)
flight_df.show()
from [Link] import *
filtered_flight_df = flight_df.filter((col("DEPARTURE_DELAY") > 20) &
(col("DESTINATION_AIRPORT") == "LAX")).count()
print(filtered_flight_df)
3. Write a SQL query to find the total number of flights for each airline
and display it in descending order.
%%sparksql
select AIRLINE, count(AIRLINE) as total_flights from flight_table
group by AIRLINE order by total_flights desc;
MySQL Questions:
4. Create a view named on_time_flights that includes only flights that
had no departure or arrival delays.
[Link]("create view on_time_flights as select * from flight_table
where DEPARTURE_DELAY < 0 and ARRIVAL_DELAY < 0")
5. Use a CTE to calculate the average delay time per airport, then
retrieve airports with an average delay greater than 20 minutes.
%%sparksql
with delay_per_airport as (select ORIGIN_AIRPORT,
avg(DEPARTURE_DELAY) as avg_delay from flight_table group by
ORIGIN_AIRPORT)
select * from delay_per_airport where avg_delay > 20;
6. Write a SQL query to find the top 5 flights with the highest
DEPARTURE_DELAY.
%%sparksql
select AIRLINE, max(DEPARTURE_DELAY) as max_delay from
flight_table group by AIRLINE order by max_delay desc LIMIT 5;