10 PySpark DataFrame Interview Questions with Solutions
Here are 10 critical interview questions on Spark DataFrame operations along with their solutions:
Question 1: How do you create a DataFrame in Spark from a collection of data? #
Solution:
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("CreateDataFrame").getOrCreate()
# Sample data
data = [("John", 25), ("Doe", 30), ("Jane", 28)]
# Create DataFrame
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)
# Show DataFrame
df.show()
# Stop Spark session
spark.stop()
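A common follow-up is how to control column types explicitly instead of letting Spark infer them. A minimal sketch using an explicit schema (the field names and types simply mirror the sample data above):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Define the schema explicitly so "age" is an integer rather than inferred
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df_typed = spark.createDataFrame(data, schema)
df_typed.printSchema()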
Question 2: How do you select specific columns from a DataFrame? #
Solution:
# Select specific columns
selected_df = df.select("name", "age")
# Show DataFrame
selected_df.show()
Question 3: How do you filter rows in a DataFrame based on a condition? #
Solution:
# Filter rows where age is greater than 25
filtered_df = df.filter(df["age"] > 25)
# Show DataFrame
filtered_df.show()
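Interviewers often ask for alternative filter syntaxes as well. Both of the following are equivalent to the condition above (a small sketch assuming the same df):

from pyspark.sql.functions import col
# Using the col() helper
filtered_df = df.filter(col("age") > 25)
# Using a SQL expression string (where() is an alias for filter())
filtered_df = df.where("age > 25")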
Question 4: How do you group by a column and perform an aggregation in a Spark
DataFrame? #
Solution:
# Sample data
data = [("John", "HR", 3000), ("Doe", "HR", 4000),("Jane",
"IT", 5000), ("Mary", "IT", 6000)]
# Create DataFrame
columns = ["name", "department", "salary"]
df = [Link](data, columns)
# Group by department and calculate average salary
avg_salary_df = [Link]("department").avg("salary")
# Show the result
avg_salary_df.show()
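If you need a named result column or several aggregations at once, agg() is the more flexible form. A brief sketch assuming the same DataFrame:

from pyspark.sql.functions import avg, max
# agg() lets you alias the output columns and combine aggregations
summary_df = df.groupBy("department").agg(
    avg("salary").alias("avg_salary"),
    max("salary").alias("max_salary")
)
summary_df.show()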
Question 5: How do you join two DataFrames in Spark? #
Solution:
# Sample data
data1 = [("John", 1), ("Doe", 2), ("Jane", 3)]
data2 = [(1, "HR"), (2, "IT"), (3, "Finance")]
# Create DataFrames
columns1 = ["name", "dept_id"]
columns2 = ["dept_id", "department"]
df1 = spark.createDataFrame(data1, columns1)
df2 = spark.createDataFrame(data2, columns2)
# Join DataFrames on dept_id
joined_df = df1.join(df2, "dept_id")
# Show the result
joined_df.show()
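By default join() performs an inner join. A likely follow-up is how to choose the join type; a short sketch using the same two DataFrames:

# Keep all rows from df1 even when there is no matching dept_id in df2
left_joined_df = df1.join(df2, on="dept_id", how="left")
left_joined_df.show()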
Question 6: How do you handle missing data in a Spark DataFrame? #
Solution:
# Sample data
data = [("John", None), ("Doe", 25), ("Jane", None), ("Mary", 30)]
# Create DataFrame
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)
# Fill missing values with a default value
df_filled = df.fillna({"age": 0})
# Show the result
df_filled.show()
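Filling is only one option; you can also drop rows that contain nulls. A minimal sketch on the same DataFrame:

# Drop rows where "age" is null
df_dropped = df.dropna(subset=["age"])
df_dropped.show()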
Question 7: How do you apply a custom function to a DataFrame column using UDF? #
Solution:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# Define UDF to convert department to uppercase
# (reuses the DataFrame from Question 4, which has a "department" column)
def convert_uppercase(department):
    return department.upper()
# Register UDF
convert_uppercase_udf = udf(convert_uppercase, StringType())
# Apply UDF to DataFrame
df_transformed = df.withColumn("department_upper", convert_uppercase_udf(df["department"]))
# Show the result
df_transformed.show()
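Worth mentioning in an interview: Python UDFs move data between the JVM and Python, so a built-in function is usually faster when one exists. A sketch of the built-in alternative for this case:

from pyspark.sql.functions import upper
# Same result without a Python UDF
df_builtin = df.withColumn("department_upper", upper(df["department"]))
df_builtin.show()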
Question 8: How do you sort a DataFrame by a specific column? #
Solution:
# Sort DataFrame by age
sorted_df = df.orderBy("age")
# Show the result
sorted_df.show()
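orderBy() sorts in ascending order by default; to sort descending, a small sketch on the same column:

from pyspark.sql.functions import desc
sorted_desc_df = df.orderBy(desc("age"))
sorted_desc_df.show()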
Question 9: How do you add a new column to a DataFrame? #
Solution:
# Add a new column derived from an existing column
df_with_new_column = df.withColumn("new_column", df["age"] * 2)
# Show the result
df_with_new_column.show()
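To add a column holding a literal constant (rather than one derived from existing columns), use lit(). A brief sketch; the column name and value below are purely illustrative:

from pyspark.sql.functions import lit
# "country" and "USA" are placeholder values for illustration
df_with_constant = df.withColumn("country", lit("USA"))
df_with_constant.show()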
Question 10: How do you remove duplicate rows from a DataFrame? #
Solution:
# Sample data with duplicates
data = [("John", 25), ("Doe", 30), ("Jane", 28), ("John", 25)]
# Create DataFrame
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)
# Remove duplicate rows
df_deduplicated = df.dropDuplicates()
# Show the result
df_deduplicated.show()
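dropDuplicates() also accepts a subset of columns when only some fields define a duplicate; a short sketch on the same DataFrame:

# Keep the first row seen for each distinct name
df_unique_names = df.dropDuplicates(["name"])
df_unique_names.show()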
These questions and solutions cover fundamental and advanced operations with Spark DataFrames, which are
essential for data processing and analysis using Spark.
PySpark Client Mode and Cluster Mode #
Apache Spark can run in multiple deployment modes, including client and cluster modes, which determine
where the Spark driver program runs and how tasks are scheduled across the cluster. Understanding the
differences between these modes is essential for optimizing Spark job performance and resource utilization.
1. PySpark Client Mode #
Client mode is a deployment mode where the Spark driver runs on the machine where the spark-submit
command is executed. The driver program communicates with the cluster’s executors to schedule and
execute tasks.
Key Characteristics of Client Mode: #
Driver Location: Runs on the machine where the user launches the application.
Best for Interactive Use: Ideal for development, debugging, and interactive sessions like using
notebooks (e.g., Jupyter) where you want immediate feedback.
Network Dependency: The driver needs to maintain a constant connection with the executors. If the
network connection between the client machine and the cluster is unstable, the job can fail.
Resource Utilization: The client machine’s resources (CPU, memory) are used for the driver, so a
powerful client machine is beneficial.
Code Implementation for Client Mode: #
To run a PySpark application in client mode, you would use the spark-submit command with --deploy-mode
client. Here’s an example:
spark-submit \
--master yarn \
--deploy-mode client \
--num-executors 3 \
--executor-cores 2 \
--executor-memory 4G \
--driver-memory 2G \
my_pyspark_script.py
Explanation:
--master yarn: Specifies YARN as the cluster manager.
--deploy-mode client: Runs the driver on the client machine where the command is executed.
--num-executors, --executor-cores, --executor-memory: Configures the number of executors, CPU
cores per executor, and memory allocation per executor.
--driver-memory: Allocates memory for the driver program on the client machine.
my_pyspark_script.py: The PySpark script that contains your Spark application code.
PySpark Script Example:
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder \
    .appName("ClientModeExample") \
    .getOrCreate()
# Sample DataFrame creation
data = [("John", 30), ("Doe", 25), ("Alice", 29)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Perform operations
df.show()
df.groupBy("Age").count().show()
# Stop SparkSession
spark.stop()
2. PySpark Cluster Mode #
Cluster mode is a deployment mode where the Spark driver runs inside the cluster, typically on one of the
worker nodes, and not on the client machine. This mode is more suitable for production jobs that require high
availability and reliability.
Key Characteristics of Cluster Mode: #
Driver Location: Runs on one of the cluster’s worker nodes.
Best for Production: Suitable for production environments where long-running jobs need stability and
don’t require interactive sessions.
Less Network Dependency: Since the driver is located within the cluster, it has more stable
connections with executors, reducing the risk of job failures due to network issues.
Resource Management: Utilizes cluster resources for the driver, freeing up client resources and often
providing more powerful hardware for the driver process.
Code Implementation for Cluster Mode: #
To run a PySpark application in cluster mode, you use spark-submit with --deploy-mode cluster. Here’s an
example:
spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 5 \
--executor-cores 4 \
--executor-memory 8G \
--driver-memory 4G \
--conf spark.yarn.submit.waitAppCompletion=false \
my_pyspark_script.py
Explanation:
--master yarn: Specifies YARN as the cluster manager.
--deploy-mode cluster: Runs the driver on a worker node within the cluster.
--num-executors, --executor-cores, --executor-memory: Configures the number of executors, CPU
cores per executor, and memory allocation per executor.
--driver-memory: Allocates memory for the driver program within the cluster.
--conf spark.yarn.submit.waitAppCompletion=false: Submits the application and returns immediately
without waiting for job completion. This is useful for running jobs asynchronously in a production
environment.
my_pyspark_script.py: The PySpark script that contains your Spark application code.
PySpark Script Example:
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder \
    .appName("ClusterModeExample") \
    .getOrCreate()
# Load data from HDFS (the paths below are placeholders)
df = spark.read.csv("hdfs:///path/to/input.csv", header=True, inferSchema=True)
# Perform operations
result_df = df.filter(df["age"] > 30).groupBy("city").count()
# Save the result back to HDFS
result_df.write.csv("hdfs:///path/to/output")
# Stop SparkSession
spark.stop()
Choosing Between Client Mode and Cluster Mode #
Use Client Mode:
For interactive analysis or debugging using notebooks.
When you need immediate feedback and are running jobs from your local machine.
For smaller workloads where the driver’s resource needs are minimal.
Use Cluster Mode:
For production jobs that require high reliability and scalability.
When running long-running batch jobs or when the driver needs significant resources.
When you want to avoid network instability affecting the driver’s connection to the executors.
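If you ever need to confirm which mode a running application was submitted in, the deploy mode is exposed through the Spark configuration. A minimal sketch (assumes an active SparkSession named spark):

# Returns "client" or "cluster" depending on how the job was submitted
deploy_mode = spark.sparkContext.getConf().get("spark.submit.deployMode")
print(deploy_mode)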
Conclusion #
Understanding the differences between client mode and cluster mode in PySpark is crucial for effectively
managing resources and optimizing job performance. Client mode is great for development and debugging,
while cluster mode is ideal for production environments where stability and resource management are
critical. By leveraging these modes appropriately, you can ensure your Spark jobs run efficiently and reliably.