
PySpark Test Questions

1. data = [
("Alice", "Engineering", 100000, 5, "2019-01-15"),
("Bob", "Engineering", 95000, 4, "2020-03-22"),
("Charlie", "HR", 70000, 2, "2018-07-30"),
("David", "HR", 60000, 3, "2019-10-10"),
("Eve", "Marketing", 85000, 4, "2021-05-15"),
("Frank", "Marketing", 80000, 3, "2017-12-01"),
("Grace", "Finance", 90000, 5, "2016-04-25"),
("Heidi", "Finance", 75000, 3, "2018-02-20"),
("Ivan", "Engineering", 95000, 4, "2020-12-18"),
("Judy", "Engineering", 92000, 2, "2017-09-11")
]
columns = ["Name", "Department", "Salary", "Experience", "JoiningDate"]
 List all employees who have a salary higher than the average salary of their
department.
 Identify the most recent joiner in each department.
 Find the median salary of employees in the company.
 List the names of employees along with their salary who have the same salary as
another employee.
 Find the employee with the second highest salary.
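A possible sketch for the first, third, and last tasks, assuming a running SparkSession named spark and df = spark.createDataFrame(data, columns); the intermediate column names (dept_avg, rnk) are illustrative:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame(data, columns)

# Employees earning more than their department's average salary
dept_win = Window.partitionBy("Department")
above_avg = (df.withColumn("dept_avg", F.avg("Salary").over(dept_win))
               .filter(F.col("Salary") > F.col("dept_avg")))

# Company-wide median salary (approximate, via percentile_approx)
median = df.select(F.expr("percentile_approx(Salary, 0.5)").alias("median_salary"))

# Employee(s) with the second highest salary, via dense_rank
rank_win = Window.orderBy(F.col("Salary").desc())
second = (df.withColumn("rnk", F.dense_rank().over(rank_win))
            .filter(F.col("rnk") == 2))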

2.

 sales_data = [
("2024-01-01", 1, 10, 100),
("2024-01-01", 2, 5, 200),
("2024-01-02", 1, 8, 100),
("2024-01-02", 2, 7, 200),
]
sales_columns = ["date", "product_id", "quantity", "price"]
sales_df = spark.createDataFrame(sales_data, schema=sales_columns)
 customer_data = [
(1, "diksha", 35),
(2, "vansh", 25),
(3, "adhyan", 45)
]
customer_columns = ["customer_id", "name", "age"]
customers_df = spark.createDataFrame(customer_data, schema=customer_columns)
 order_data = [
(101, 1, 1, 10),
(102, 1, 2, 5),
(103, 2, 1, 8),
(104, 3, 2, 7),
]
order_columns = ["order_id", "customer_id", "product_id", "quantity"]
orders_df = spark.createDataFrame(order_data, schema=order_columns)

 Write a PySpark script to filter sales_df to only include sales from "2024-01-01"
 Write a PySpark script to join customers_df with orders_df on customer_id and
include customer names in the result
 Write a PySpark script to group sales_df by product_id and calculate the total
quantity sold for each product.
 Write a PySpark script using window functions to calculate the running total of the
quantity sold for each product_id over time in sales_df
 Write a PySpark script to identify days in sales_df where the total revenue was
greater than 1000.
 Write a PySpark script to pivot sales_df so that each row represents a date, each
column represents a product_id, and the cell values are the total quantity sold on
that date.
 Write a PySpark script to calculate the total revenue generated by each customer
based on the joined orders_df and sales_df.
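One possible sketch for the first four tasks, assuming the three DataFrames above are in scope; variable names other than the given ones are illustrative:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Sales from 2024-01-01 only
jan1 = sales_df.filter(F.col("date") == "2024-01-01")

# Orders joined with customer names
named_orders = orders_df.join(customers_df, on="customer_id", how="inner") \
                        .select("order_id", "name", "product_id", "quantity")

# Total quantity sold per product
totals = sales_df.groupBy("product_id").agg(F.sum("quantity").alias("total_quantity"))

# Running total of quantity per product over time
win = (Window.partitionBy("product_id").orderBy("date")
             .rowsBetween(Window.unboundedPreceding, Window.currentRow))
running = sales_df.withColumn("running_quantity", F.sum("quantity").over(win))

The pivot task can follow the same pattern, e.g. sales_df.groupBy("date").pivot("product_id").sum("quantity").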

 Data for Tasks 1 to 5:


data = [(1, 'vikas', '20000'), (2, 'mahavat', '40000'), (3, 'jhon', '25000'),
        (4, 'rahul', '30000'), (5, 'vinod', '33000'), (6, 'junai', '52000'),
        (7, 'arjun', '18000'), (8, 'rakesh', '70000'), (9, 'mahima', '35000'),
        (10, 'gulshan', '62000')]

columns = ['id', 'name', 'salary']
df = spark.createDataFrame(data, schema=columns)

3. Task 1
a. Show the table data vertically
b. Show the content of only 4 rows
c. Show only the starting 3 characters of each column
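A sketch for Task 1, assuming df was created from the data above; all three variants are options of df.show:

df.show(vertical=True)   # print each row as a vertical key-value block
df.show(4)               # print only the first 4 rows
df.show(truncate=3)      # truncate every column value to its first 3 characters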

4. Task 2
o Change the datatype of the salary column
o Add an increment column whose value is 15% of the salary column
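A sketch for Task 2; the cast to int assumes the string salaries fit an integer type:

from pyspark.sql import functions as F

df = df.withColumn("salary", F.col("salary").cast("int")) \
       .withColumn("increment", F.col("salary") * 0.15)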

5. Task 3 - Using when, create a column whose value is 'low' when salary is smaller than
20000, 'mid' when it is between 20000 and 50000, and 'high' when it is greater than 50000
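A sketch for Task 3; the output column name salary_band is illustrative:

from pyspark.sql import functions as F

df = df.withColumn(
    "salary_band",
    F.when(F.col("salary") < 20000, "low")
     .when(F.col("salary") <= 50000, "mid")
     .otherwise("high"))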

6. Task 4 - Using the where function, filter the rows whose name starts with 'v', the rows
whose name has 'a' in the second-to-last position, and the rows whose name contains 'j', 'e', or 'u'
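A sketch for Task 4, treating the three conditions as three separate filters; in the SQL LIKE pattern '%a_', the underscore matches exactly one trailing character:

from pyspark.sql import functions as F

df.where(F.col("name").startswith("v")).show()   # names starting with 'v'
df.where(F.col("name").like("%a_")).show()       # 'a' at the second-to-last position
df.where(F.col("name").rlike("[jeu]")).show()    # names containing 'j', 'e', or 'u'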

7. Task 5 - Create a short_name column by extracting the first 3 letters of the name column
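A sketch for Task 5 using substring (1-based start position, length 3):

from pyspark.sql import functions as F

df = df.withColumn("short_name", F.substring("name", 1, 3))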

8. Write a schema for the following data:

data = [(1, ('vikas', 'yadav'), 20000), (2, ('mahavat', 'singh'), 40000),
        (3, ('jhon', 'merchant'), 25000), (4, ('rahul', 'verma'), 30000),
        (5, ('vinod', 'devangan'), 33000)]

columns = ['id', 'name', 'salary']
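One possible schema, assuming the name tuple holds a first and last name; the inner field names first_name and last_name are assumptions:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StructType([
        StructField("first_name", StringType(), True),   # assumed field name
        StructField("last_name", StringType(), True),    # assumed field name
    ]), True),
    StructField("salary", IntegerType(), True),
])
df = spark.createDataFrame(data, schema=schema)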

9. Create a column holding an array of both numbers a and b

data = [(1, 2), (3, 4)]
schema = ['a', 'b']
df = spark.createDataFrame(data, schema)
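A sketch using F.array; the new column name numbers is illustrative:

from pyspark.sql import functions as F

df = df.withColumn("numbers", F.array("a", "b"))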
10. d3 = [(1, 'Raghav', ['Excel', 'azure']), (2, 'Sohail', ['python', 'AWS']),
          (3, 'Raghav', ['java', 'GCP'])]
schema2 = ['id', 'name', 'skills']
o Create a column indicating whether the skills column contains 'java' or not
o Split the skills array into two separate columns, primary_skill and
secondary_skill
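A sketch for both tasks, assuming a DataFrame built from d3 and schema2 (the name df10 is illustrative); since every skills array here has exactly two elements, the two columns can be taken by index:

from pyspark.sql import functions as F

df10 = spark.createDataFrame(d3, schema2)

# Whether the skills array contains 'java'
df10 = df10.withColumn("has_java", F.array_contains("skills", "java"))

# First and second array elements as their own columns
df10 = df10.withColumn("primary_skill", F.col("skills")[0]) \
           .withColumn("secondary_skill", F.col("skills")[1])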

11. data = [(1, 'vikas', {'hair': 'black', 'eye': 'brown'}),
            (2, 'mahavat', {'hair': 'brown', 'eye': 'blue'}),
            (3, 'jhon', {'hair': 'tan', 'eye': 'green'}),
            (4, 'rahul', {'hair': 'grey', 'eye': 'brown'}),
            (5, 'vinod', {'hair': 'red', 'eye': 'red'})]

columns = ['id', 'name', 'properties']


o Write a schema for the table
o Create a separate column extracting the hair value into it
o Create two separate columns holding each key and its value respectively
o Create a separate column containing only the keys
o Create a separate column containing only the values
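A sketch covering all five tasks; the MapType schema and the variable names df11 and pairs are assumptions consistent with the data above:

from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, MapType)

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("properties", MapType(StringType(), StringType()), True),
])
df11 = spark.createDataFrame(data, schema=schema)

# Hair value pulled into its own column
df11 = df11.withColumn("hair", F.col("properties")["hair"])

# Each key-value pair as separate key/value columns (one row per pair)
pairs = df11.select("id", "name", F.explode("properties").alias("key", "value"))

# Only the keys / only the values, as array columns
df11 = df11.withColumn("keys", F.map_keys("properties")) \
           .withColumn("values", F.map_values("properties"))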
