0% found this document useful (0 votes)
39 views4 pages

PySpark SQL Assessment

Uploaded by

kalirew812
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
39 views4 pages

PySpark SQL Assessment

Uploaded by

kalirew812
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

PySpark Questions:

1. Remove the columns CANCELLATION_REASON and DIVERTED from a


Spark DataFrame.

flight_df = flight_df.drop("CANCELLATION_REASON", "DIVERTED")


flight_df.show()

2. Read the dataset into a PySpark DataFrame and filter flights that
have DEPARTURE_DELAY greater than 20 minutes and are flying to
LAX. Show the count.

flight_df = [Link]("FlightsDataset(in).csv", header=True,


inferSchema=True)
flight_df.show()
from [Link] import *

filtered_flight_df = flight_df.filter((col("DEPARTURE_DELAY") > 20) &


(col("DESTINATION_AIRPORT") == "LAX")).count()

print(filtered_flight_df)

3. Write a SQL query to find the total number of flights for each airline
and display it in descending order.

%%sparksql
select AIRLINE, count(AIRLINE) as total_flights from flight_table
group by AIRLINE order by total_flights desc;
MySQL Questions:

4. Create a view named on_time_flights that includes only flights that


had no departure or arrival delays.

[Link]("create view on_time_flights as select * from flight_table


where DEPARTURE_DELAY < 0 and ARRIVAL_DELAY < 0")

5. Use a CTE to calculate the average delay time per airport, then
retrieve airports with an average delay greater than 20 minutes.

%%sparksql
with delay_per_airport as (select ORIGIN_AIRPORT,
avg(DEPARTURE_DELAY) as avg_delay from flight_table group by
ORIGIN_AIRPORT)
select * from delay_per_airport where avg_delay > 20;
6. Write a SQL query to find the top 5 flights with the highest
DEPARTURE_DELAY.

%%sparksql
select AIRLINE, max(DEPARTURE_DELAY) as max_delay from
flight_table group by AIRLINE order by max_delay desc LIMIT 5;

You might also like