SA coding assessment: Data Engineering, Airlines
Version 2022.01
What you'll do:
We provide the dataset. You will load it into dataframes, and perform
some data cleansing and transformation tasks.
You will answer a series of questions to show insights from the data.
There are also some written-answer questions.
We care about the process, not the result. I.e., we're looking for proper
use of data engineering techniques and understanding of the code you've
written.
This Data Engineering section is scored out of 55 points.
# This folder is for you to write any data as needed. Write access is restricted elsewhere. You can always read from dbfs.
aws_role_id = "AROAUQVMTFU2DCVUR57M2"
user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user')
userhome = f"s3a://e2-interview-user-data/home/{aws_role_id}:{user}"
print(userhome)
The following questions use the airlines dataset located at dbfs:/interview-datasets/sa/airlines. All airlines questions expect answers that use the DataFrames API (Scala or Python). SQL-only answers are accepted but may receive reduced points. We will not accept answers that use the RDD API.
Airlines Question 1
Write code that uses the DataFrame API to read in the entire airlines data
set with clearly named columns.
# Please provide your code answer for Question 1 here.
# Reading schema - when loading the full /airlines directory, the header is not detected reliably
# by the inferSchema and header options alone, so read the schema from part-00000 first.
header = spark.read.csv("/interview-datasets/sa/airlines/part-00000", header="true")
airline_sch = header.schema

# Load the whole dataset with the explicit schema; header="true" drops the header record
airlineDf = spark.read.csv("/interview-datasets/sa/airlines/", inferSchema="false", header="true",
                           schema=airline_sch)
airlineDf.cache()
display(airlineDf)
Written Answer
Explain your answer to Question 1 here.
I used the Spark DataFrame API to read the airlines CSV files. The part-00000
file contains the header record, but when the files are loaded in parallel the
header is not always the first record read, so I added a separate step to read
the schema from part-00000 and then used that schema to load the full dataset
with named columns, with the header switch set to true so the header record in
part-00000 is dropped.
I cached the resulting DataFrame (airlineDf) because it is reused frequently in
the questions below; this is a best practice and improves performance.
Airlines Question 2 [5 Points]
How many unique airlines are present in this dataset?
# Please provide your code answer for Question 2 here
from pyspark.sql import functions as F
# Get distinct count of UniqueCarrier
unqCar = airlineDf.select(F.countDistinct("UniqueCarrier"))
display(unqCar)
Written Answer
I am using the DataFrame select method with the countDistinct function to find
the number of unique UniqueCarrier values in the cached dataset, and I store
the output in a new DataFrame, unqCar.
Airlines Question 3 [10 Points]
Which airline is delayed on departure most often? Show a bar graph of the
top five most delayed airlines.
# Please provide your code answer for Question 3
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Cast DepDelay to integer so it can be compared and counted
airdf = airlineDf.withColumn("iDepDelay", F.col("DepDelay").cast(T.IntegerType()))

# Count departure delays per carrier and keep the top five
delayedAirline = (airdf.filter(F.col("iDepDelay") > 0)
                       .groupBy("UniqueCarrier")
                       .agg(F.count("iDepDelay").alias("NumDelays"))
                       .sort(F.desc("NumDelays"))
                       .limit(5))
display(delayedAirline)
Written Answer
I followed the steps below to derive the result:
Filtered for flights with a departure delay by checking that the DepDelay field is greater than 0.
Aggregated the number of times each airline was delayed, grouping on the UniqueCarrier field.
Sorted NumDelays in descending order and kept the first 5 rows in the result DataFrame, delayedAirline.
Displayed the result as a bar chart of the 5 carriers and their respective delay counts.
Airlines Question 4 [15 Points]
Part a: What was the average arrival delay per airline?
Part b: Also, for each airline, on average did flights arrive early or late?
Calculate the average arrival delay per airline. Then, add a column to this
new dataframe (containing the grouped averages) that contains the string
"Late" if the average arrival for that airline arrive >15 minutes late,
"Early" if the average arrival is <0 minutes late, and "On-time" if the
average arrival is between 0 and 15 minutes late.
To add the additional column, use a Spark UDF. Additionally, make sure to
filter out or fill in null values in your dataframe (if there are any) prior to
applying the UDF.
# Please provide your code answer for Question 4
# Part a: cast ArrDelay to integer and average it per carrier
arrDelaydf = airlineDf.withColumn("iArrDelay", F.col("ArrDelay").cast(T.IntegerType()))
avgArrival = arrDelaydf.groupby("UniqueCarrier").agg(F.avg("iArrDelay").alias("AvgArrDelay"))
display(avgArrival)

# Part b: classify each carrier's average arrival delay with a UDF
from pyspark.sql.functions import udf

@udf("string")
def groupedAvg_udf(a):
    if a > 15:
        return 'Late'
    elif 0 <= a <= 15:
        return 'On-time'
    else:
        return 'Early'

# Drop any null averages before applying the UDF, as required by the question
groupedArr = avgArrival.na.drop(subset=["AvgArrDelay"]) \
                       .withColumn("groupedAvgArrival", groupedAvg_udf("AvgArrDelay"))
display(groupedArr)
Written Answer
Part A:
Convert the string field ArrDelay to an integer.
Calculate the average arrival delay per UniqueCarrier.
Part B:
Declare a UDF that takes the average arrival delay as input and applies the
logic to label it Late, On-time, or Early.
Drop null averages, then create a new DataFrame with an additional column that
uses the UDF to return the grouped average delay label.
Airlines Question 5 [15 Points]
What file format is airlines data stored in, and was this the most optimal
format for the questions asked above?
What format would you store this data in if you frequently queried only
the UniqueCarr and CancellationCode columns?
What if you frequently read entire rows of the dataset?
Note: Cite any sources used. You do not need a code answer for this
question.
Written Answer
The file format is CSV. Looking at the data and the operations performed above,
it would be better to store the data in Parquet format.
If I frequently query only a few columns, such as the carrier and cancellation
code columns, I would also prefer Parquet, since columnar formats perform well
for column-level operations, especially when the data volume is large.
If I frequently read entire rows, I would choose the ORC format, since it gives
good read performance at the row level: data is stored and processed at the
stripe level, with the option to skip rows based on the statistics stored in
the footer of each stripe.
I have answered based on my experience, but I usually refer to the Spark
documentation and Databricks learning materials when researching better ways
of designing solutions.
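As an illustration only (not part of the original submission), the minimal sketch below shows how the cached airlineDf from Question 1 could be persisted as Parquet and then queried on just two columns; the output path under userhome is an assumed location.
# Minimal sketch (assumption): write the airlines data out as Parquet under the writable userhome path
parquet_path = userhome + "/airlines_parquet"  # hypothetical output location
airlineDf.write.mode("overwrite").parquet(parquet_path)

# A columnar format lets Spark read only the columns a query touches,
# e.g. UniqueCarrier and CancellationCode
carrierCancellations = spark.read.parquet(parquet_path).select("UniqueCarrier", "CancellationCode")
display(carrierCancellations)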
Airlines Question 6 [5 Points]
If you needed to keep multiple versions of this dataset, why might you use
the Delta format to do this efficiently?
Written Answer
The Delta Lake format provides versioning (time travel) by version number or by
timestamp, which removes much of the complexity of managing multiple versions
of a dataset. It also provides reliable ACID transactions and better metadata
management.
For this use case I would keep a single Delta table, restoring older versions
or merging new data into it as needed. Delta also provides better caching for
frequently used queries, and converting from Parquet to Delta is available out
of the box on Databricks, which simplifies functional dependencies and works
the same across languages.
Most of the questions above touched only one or a few columns, so I would store
and process the data in Delta or Parquet format rather than in CSV.
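As an illustrative sketch only (assumed path, not part of the original answer), the airlines data could be written as a Delta table and an earlier version read back via time travel:
# Minimal sketch (assumption): store the airlines data as a Delta table under the writable userhome path
delta_path = userhome + "/airlines_delta"  # hypothetical output location
airlineDf.write.format("delta").mode("overwrite").save(delta_path)

# After later overwrites or merges, previous versions remain queryable
firstVersion = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)
display(firstVersion)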
SA coding assessment: Data Engineering, Baby Names
Version 2022.01
What you'll do:
We provide the dataset. You will load it into dataframes, and perform
some data cleansing and transformation tasks.
You will answer a series of questions to show insights from the data.
There are also some written-answer questions.
We care about the process, not the result. I.e., we're looking for proper
use of data engineering techniques and understanding of the code you've
written.
This Data Engineering section is scored out of 50 points.
Setup Env
# This folder is for you to write any data as needed. Write access is restricted elsewhere. You can always read from dbfs.
aws_role_id = "AROAUQVMTFU2DCVUR57M2"
user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user')
userhome = f"s3a://e2-interview-user-data/home/{aws_role_id}:{user}"
print(userhome)
s3a://e2-interview-user-data/home/AROAUQVMTFU2DCVUR57M2:[email protected]
Using Baby Names Data Set
This dataset comes from a website referenced by Data.gov. It lists baby
names used in the state of NY from 2007 to 2018.
Run the following two cells to copy this file to a usable location.
%scala
import java.net.URL
import java.io.File
import org.apache.commons.io.FileUtils

val tmpFile = new File("/tmp/rows.json")
FileUtils.copyURLToFile(new URL("https://health.data.ny.gov/api/views/jxy9-yhdk/rows.json?accessType=DOWNLOAD"), tmpFile)

import java.net.URL import java.io.File import org.apache.commons.io.FileUtils tmpFile: java.io.File = /tmp/rows.json
# https://docs.python.org/3/library/hashlib.html#blake2
from hashlib import blake2b

user = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user')
h = blake2b(digest_size=4)
h.update(user.encode("utf-8"))
display_name = "user_" + h.hexdigest()
print("Display Name: " + display_name)

dbutils.fs.cp('file:/tmp/rows.json', userhome + '/rows.json')
dbutils.fs.cp(userhome + '/rows.json', f"dbfs:/tmp/{display_name}/rows.json")
baby_names_path = f"dbfs:/tmp/{display_name}/rows.json"
print("Baby Names Path: " + baby_names_path)
dbutils.fs.head(baby_names_path)

# Ensure you use baby_names_path to answer the questions. A bug in Spark 2.X will cause your read to fail if you read the file from userhome.
# Please note that dbfs:/tmp is cleaned up daily at 6AM pacific
Baby Names Question 1 - Nested Data [15 Points]
Use Spark SQL's native JSON support to read the baby names file into a
dataframe. Use this dataframe to create a temporary table containing all
the nested data columns ("sid", "id", "position", "created_at",
"created_meta", "updated_at", "updated_meta", "meta", "year",
"first_name", "county", "sex", "count") so that they can be queried using
SQL.
Hint: you can use dbutils.fs.head(baby_names_path) to take a look at the
dataset before reading it in.
Suggested Steps:
1. Read in the JSON data
2. Pull all columns in the nested data column to top level, following the schema
specified above. There are built-in Spark SQL functions that will accomplish
this.
3. Create a temp table from this expanded dataframe using
createOrReplaceTempView()
# Please provide your code answer for Question 1 here
# dbutils.fs.head(baby_names_path)

# Load the JSON file into a DataFrame and explode the nested "data" array
from pyspark.sql.functions import explode

spark.read.json(baby_names_path, multiLine=True) \
     .select(explode("data").alias("BN")) \
     .createOrReplaceTempView("NameView")

# Promote each element of the nested array to a named top-level column
sqlContext.sql("""SELECT BN[0] AS sid, BN[1] AS id, BN[2] AS position,
    BN[3] AS created_at, BN[4] AS created_meta, BN[5] AS updated_at,
    BN[6] AS updated_meta, BN[7] AS meta, BN[8] AS year, BN[9] AS first_name,
    BN[10] AS county, BN[11] AS sex, BN[12] AS count
    FROM NameView""").createOrReplaceTempView("babyNames")

display(sqlContext.sql("SELECT * FROM babyNames LIMIT 10"))
Written Answer
1. I used Spark's native JSON read method, then used the explode function to
extract the nested "data" array and created a temp view, NameView.
2. Using NameView, I pulled each nested element out into a named column and
registered the result as the babyNames temp view using createOrReplaceTempView().
3. Finally, I display the first 10 records for review.
Baby Names Question 2 - Multiple Languages [10 Points]
Using the temp table you created in the question above, write a SQL
query that gives the most popular baby name for each year in the
dataset. Then, write the same query using either the Scala or Python
dataframe APIs.
Code Answer
# Please provide your code answer for Question 2 here. You will need separate cells for your SQL answer and your Python or Scala answer.

# SQL Answer
# Total count per (year, first_name)
sumBabyNames = spark.sql("SELECT year, first_name, SUM(count) AS sum_name FROM babyNames GROUP BY year, first_name ORDER BY first_name")
display(sumBabyNames)

sumBabyNames.createOrReplaceTempView("bNames")

# Rank names within each year and keep the most popular one
popNames = spark.sql("SELECT year, first_name, sum_name FROM (SELECT *, DENSE_RANK() OVER (PARTITION BY year ORDER BY sum_name DESC) AS rank FROM bNames) WHERE rank = 1 ORDER BY year")
display(popNames)
# Python code
from pyspark.sql.window import Window
from pyspark.sql.functions import col, dense_rank, desc

# Total count per (year, first_name)
nameDf = spark.sql("select year, first_name, sum(count) as scount from babyNames group by year, first_name")

# Keep the most popular name per year
popularName = nameDf.withColumn("rank", dense_rank().over(Window.partitionBy("year").orderBy(desc("scount")))) \
                    .where(col("rank") == 1) \
                    .select("year", "first_name", "scount") \
                    .orderBy("year")
popularName.show()
Written Answer
Please provide your brief, written description of your code here.
Baby Names Question 3 - Performance [10 Points]
Are there any performance considerations when choosing a language API
(SQL vs Python vs Scala) in the context of Spark?
Are there any performance considerations when using different data
representations (RDD, Dataframe) in Spark? Please explain, and provide
references if possible. No code answer is required.
visitors_path = "/interview-datasets/sa/births/births-with-visitor-data.json"

## Hint: the code below will read in the downloaded JSON files. However, the xml column needs to be given structure. Consider using a UDF.
# Read the births-with-visitor-data.json file into a dataframe and parse the nested XML fields into columns and print the total record count
df = spark.read.option("inferSchema", True).json(visitors_path)

import xml.etree.ElementTree as ET
import pyspark.sql.functions as F

# UDF: parse the visitors XML string into an array of (id, age, sex) structs
@F.udf('array<struct<id:string, age:string, sex:string>>')
def parse_visitors(s):
    root = ET.fromstring(s)
    return list(map(lambda x: x.attrib, root.findall('visitor')))

# Explode to one row per visitor and register a temp view
df.select("year", "county", "first_name", F.explode(parse_visitors('visitors')).alias('visitors')) \
  .select('year', 'county', 'first_name', 'visitors.*') \
  .createOrReplaceTempView("babyVisitors")

display(spark.sql("select count(*) from babyVisitors"))
bVisitors = spark.sql("select * from babyVisitors")
bVisitors.show(1)
bVisitors.printSchema()
## Hint: check for inconsistently capitalized field values. It will make your answer incorrect.
# Find the county with the highest average number of visitors across all births in that county
# Count visitors per birth record from the parsed XML, normalize county capitalization, then average per county
birthVisits = df.select(F.upper("county").alias("county"), F.size(parse_visitors("visitors")).alias("numVisitors"))
highAvgVisitors = birthVisits.groupBy("county").agg(F.avg("numVisitors").alias("avgVis")) \
                             .orderBy(F.desc("avgVis")).limit(1)
display(highAvgVisitors)
## Hint: check for inconsistently capitalized field values. It will make your answer incorrect.
# Find the average visitor age for a birth in the county of KINGS
# Normalize the county capitalization before filtering so all KINGS rows are included
bVisitors.filter(F.upper(col("county")) == 'KINGS') \
         .groupBy(F.upper("county").alias("county")) \
         .agg(F.avg("age").alias("avgAge")).show()
## Hint: check for inconsistently capitalized field values. It will make your answer incorrect.
# Find the most common birth visitor age in the county of KINGS

# When ignoring the inconsistently capitalized county field
mostComAge = spark.sql("SELECT county, age, COUNT(*) AS comAge FROM babyVisitors WHERE county = 'KINGS' GROUP BY county, age ORDER BY comAge DESC LIMIT 1")
display(mostComAge)

# When fixing the inconsistently capitalized county field by converting to UPPER case as the standard
mostComAge = spark.sql("SELECT UPPER(county) AS county, age, COUNT(*) AS comAge FROM babyVisitors WHERE UPPER(county) = 'KINGS' GROUP BY UPPER(county), age ORDER BY comAge DESC LIMIT 1")
display(mostComAge)
#4 - Written Answer
For this baby visitor dataset I apply the following logic in Spark SQL to find the most common visitor age in the county "KINGS":
a. Count the occurrences of each age, grouping by county, and apply the UPPER function so that inconsistently capitalized county values are not split across groups.
b. Order the result set in descending order of the count and pick the first row.
c. Display the final result from the dataframe.