
PySpark Test Questions

1. data = [
("Alice", "Engineering", 100000, 5, "2019-01-15"),
("Bob", "Engineering", 95000, 4, "2020-03-22"),
("Charlie", "HR", 70000, 2, "2018-07-30"),
("David", "HR", 60000, 3, "2019-10-10"),
("Eve", "Marketing", 85000, 4, "2021-05-15"),
("Frank", "Marketing", 80000, 3, "2017-12-01"),
("Grace", "Finance", 90000, 5, "2016-04-25"),
("Heidi", "Finance", 75000, 3, "2018-02-20"),
("Ivan", "Engineering", 95000, 4, "2020-12-18"),
("Judy", "Engineering", 92000, 2, "2017-09-11")
]
columns = ["Name", "Department", "Salary", "Experience", "JoiningDate"]
 List all employees who have a salary higher than the average salary of their
department.
 Identify the most recent joiner in each department.
 Find the median salary of employees in the company.
 List the names of employees along with their salary who have the same salary as
another employee.
 Find the employee with the second highest salary.
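A possible sketch for the first, third, and last tasks, assuming a running SparkSession named spark and df = spark.createDataFrame(data, columns); the intermediate column names (dept_avg, rnk) are illustrative:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame(data, columns)

# Employees earning more than their department's average salary
dept_win = Window.partitionBy("Department")
above_avg = (df.withColumn("dept_avg", F.avg("Salary").over(dept_win))
               .filter(F.col("Salary") > F.col("dept_avg")))

# Company-wide median salary (approximate, via percentile_approx)
median = df.select(F.expr("percentile_approx(Salary, 0.5)").alias("median_salary"))

# Employee(s) with the second highest salary, via dense_rank
rank_win = Window.orderBy(F.col("Salary").desc())
second = (df.withColumn("rnk", F.dense_rank().over(rank_win))
            .filter(F.col("rnk") == 2))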

2.

 sales_data = [
("2024-01-01", 1, 10, 100),
("2024-01-01", 2, 5, 200),
("2024-01-02", 1, 8, 100),
("2024-01-02", 2, 7, 200),
]
sales_columns = ["date", "product_id", "quantity", "price"]
sales_df = spark.createDataFrame(sales_data, schema=sales_columns)
 customer_data = [
(1, "diksha", 35),
(2, "vansh", 25),
(3, "adhyan", 45)
]
customer_columns = ["customer_id", "name", "age"]
customers_df = spark.createDataFrame(customer_data, schema=customer_columns)
 order_data = [
(101, 1, 1, 10),
(102, 1, 2, 5),
(103, 2, 1, 8),
(104, 3, 2, 7),
]
order_columns = ["order_id", "customer_id", "product_id", "quantity"]
orders_df = spark.createDataFrame(order_data, schema=order_columns)

 Write a PySpark script to filter sales_df to only include sales from "2024-01-01"
 Write a PySpark script to join customers_df with orders_df on customer_id and
include customer names in the result
 Write a PySpark script to group sales_df by product_id and calculate the total
quantity sold for each product.
 Write a PySpark script using window functions to calculate the running total of the
quantity sold for each product_id over time in sales_df
 Write a PySpark script to identify days in sales_df where the total revenue was
greater than 1000.
 Write a PySpark script to pivot sales_df so that each row represents a date, each
column represents a product_id, and the cell values are the total quantity sold on
that date.
 Write a PySpark script to calculate the total revenue generated by each customer
based on the joined orders_df and sales_df.
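One possible sketch for the first four tasks, assuming the three DataFrames above are in scope; variable names other than the given ones are illustrative:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Sales from 2024-01-01 only
jan1 = sales_df.filter(F.col("date") == "2024-01-01")

# Orders joined with customer names
named_orders = orders_df.join(customers_df, on="customer_id", how="inner") \
                        .select("order_id", "name", "product_id", "quantity")

# Total quantity sold per product
totals = sales_df.groupBy("product_id").agg(F.sum("quantity").alias("total_quantity"))

# Running total of quantity per product over time
win = (Window.partitionBy("product_id").orderBy("date")
             .rowsBetween(Window.unboundedPreceding, Window.currentRow))
running = sales_df.withColumn("running_quantity", F.sum("quantity").over(win))

The pivot task can follow the same pattern, e.g. sales_df.groupBy("date").pivot("product_id").sum("quantity").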

 Data for Tasks 1 to 5:


data = [(1, 'vikas', '20000'), (2, 'mahavat', '40000'), (3, 'jhon', '25000'),
        (4, 'rahul', '30000'), (5, 'vinod', '33000'), (6, 'junai', '52000'),
        (7, 'arjun', '18000'), (8, 'rakesh', '70000'), (9, 'mahima', '35000'),
        (10, 'gulshan', '62000')]

columns = ['id', 'name', 'salary']
df = spark.createDataFrame(data, schema=columns)

3. Task 1
a. Show the table data vertically
b. Show the content of only 4 rows
c. Show only the starting 3 characters of each column
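A sketch for Task 1, assuming df was created from the data above; all three variants are options of df.show:

df.show(vertical=True)   # print each row as a vertical key-value block
df.show(4)               # print only the first 4 rows
df.show(truncate=3)      # truncate every column value to its first 3 characters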

4. Task 2
o Change the datatype of the salary column
o Add an increment column whose value is 15% of the salary column
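A sketch for Task 2; the cast to int assumes the string salaries fit an integer type:

from pyspark.sql import functions as F

df = df.withColumn("salary", F.col("salary").cast("int")) \
       .withColumn("increment", F.col("salary") * 0.15)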

5. Task 3 - Using when, create a column whose value is 'low' when salary is smaller than
20000, 'mid' when it is between 20000 and 50000, and 'high' when it is greater than 50000
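A sketch for Task 3; the output column name salary_band is illustrative:

from pyspark.sql import functions as F

df = df.withColumn(
    "salary_band",
    F.when(F.col("salary") < 20000, "low")
     .when(F.col("salary") <= 50000, "mid")
     .otherwise("high"))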

6. Task 4 - Using the where function, filter the rows whose name starts with 'v', the rows
whose name has 'a' in the second-to-last position, and the rows whose name contains 'j', 'e', or 'u'
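A sketch for Task 4, treating the three conditions as three separate filters; in the SQL LIKE pattern '%a_', the underscore matches exactly one trailing character:

from pyspark.sql import functions as F

df.where(F.col("name").startswith("v")).show()   # names starting with 'v'
df.where(F.col("name").like("%a_")).show()       # 'a' at the second-to-last position
df.where(F.col("name").rlike("[jeu]")).show()    # names containing 'j', 'e', or 'u'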

7. Task 5 - Create a short_name column by extracting the first 3 letters of the name column
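A sketch for Task 5 using substring (1-based start position, length 3):

from pyspark.sql import functions as F

df = df.withColumn("short_name", F.substring("name", 1, 3))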

8. Write a schema for the following data:

data = [(1, ('vikas', 'yadav'), 20000), (2, ('mahavat', 'singh'), 40000),
        (3, ('jhon', 'merchant'), 25000), (4, ('rahul', 'verma'), 30000),
        (5, ('vinod', 'devangan'), 33000)]

columns = ['id', 'name', 'salary']
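One possible schema, assuming the name tuple holds a first and last name; the inner field names first_name and last_name are assumptions:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StructType([
        StructField("first_name", StringType(), True),   # assumed field name
        StructField("last_name", StringType(), True),    # assumed field name
    ]), True),
    StructField("salary", IntegerType(), True),
])
df = spark.createDataFrame(data, schema=schema)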

9. Create a column holding an array of both numbers a and b

data = [(1, 2), (3, 4)]
schema = ['a', 'b']
df = spark.createDataFrame(data, schema)
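A sketch using F.array; the new column name numbers is illustrative:

from pyspark.sql import functions as F

df = df.withColumn("numbers", F.array("a", "b"))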
10. d3 = [(1, 'Raghav', ['Excel', 'azure']), (2, 'Sohail', ['python', 'AWS']),
          (3, 'Raghav', ['java', 'GCP'])]
schema2 = ['id', 'name', 'skills']
o Create a column indicating whether the skills column contains 'java' or not
o Split the skills array into two separate columns, primary_skill and
secondary_skill
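A sketch for both tasks, assuming a DataFrame built from d3 and schema2 (the name df10 is illustrative); since every skills array here has exactly two elements, the two columns can be taken by index:

from pyspark.sql import functions as F

df10 = spark.createDataFrame(d3, schema2)

# Whether the skills array contains 'java'
df10 = df10.withColumn("has_java", F.array_contains("skills", "java"))

# First and second array elements as their own columns
df10 = df10.withColumn("primary_skill", F.col("skills")[0]) \
           .withColumn("secondary_skill", F.col("skills")[1])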

11. data = [(1, 'vikas', {'hair': 'black', 'eye': 'brown'}),
            (2, 'mahavat', {'hair': 'brown', 'eye': 'blue'}),
            (3, 'jhon', {'hair': 'tan', 'eye': 'green'}),
            (4, 'rahul', {'hair': 'grey', 'eye': 'brown'}),
            (5, 'vinod', {'hair': 'red', 'eye': 'red'})]

columns = ['id', 'name', 'properties']


o Write a schema for the table
o Create a separate column extracting the hair value into it
o Create two separate columns holding each key and its value respectively
o Create a separate column containing only the keys
o Create a separate column containing only the values
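A sketch covering all five tasks; the MapType schema and the variable names df11 and pairs are assumptions consistent with the data above:

from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, MapType)

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("properties", MapType(StringType(), StringType()), True),
])
df11 = spark.createDataFrame(data, schema=schema)

# Hair value pulled into its own column
df11 = df11.withColumn("hair", F.col("properties")["hair"])

# Each key-value pair as separate key/value columns (one row per pair)
pairs = df11.select("id", "name", F.explode("properties").alias("key", "value"))

# Only the keys / only the values, as array columns
df11 = df11.withColumn("keys", F.map_keys("properties")) \
           .withColumn("values", F.map_values("properties"))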
