10 PySpark DataFrame Interview Questions with Solutions
Here are 10 critical interview questions on Spark DataFrame operations along with their solutions:
Question 1: How do you create a DataFrame in Spark from a collection of data? #
Solution:
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("CreateDataFrame").getOrCreate()
# Sample data
data = [("John", 25), ("Doe", 30), ("Jane", 28)]
# Create DataFrame
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)
# Show DataFrame
df.show()
# Stop Spark session
spark.stop()
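A common follow-up is how to control column types explicitly instead of letting Spark infer them. A minimal sketch using an explicit schema (the field names and types simply mirror the sample data above):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Define the schema explicitly so "age" is an integer rather than inferred
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df_typed = spark.createDataFrame(data, schema)
df_typed.printSchema()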
Question 2: How do you select specific columns from a DataFrame? #
Solution:
# Select specific columns
selected_df = df.select("name", "age")
# Show DataFrame
selected_df.show()
Question 3: How do you filter rows in a DataFrame based on a condition? #
Solution:
# Filter rows where age is greater than 25
filtered_df = df.filter(df["age"] > 25)
# Show DataFrame
filtered_df.show()
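Interviewers often ask for alternative filter syntaxes as well. Both of the following are equivalent to the condition above (a small sketch assuming the same df):

from pyspark.sql.functions import col
# Using the col() helper
filtered_df = df.filter(col("age") > 25)
# Using a SQL expression string (where() is an alias for filter())
filtered_df = df.where("age > 25")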
Question 4: How do you group by a column and perform an aggregation in a Spark
DataFrame? #
Solution:
# Sample data
data = [("John", "HR", 3000), ("Doe", "HR", 4000),("Jane",
"IT", 5000), ("Mary", "IT", 6000)]
# Create DataFrame
columns = ["name", "department", "salary"]
df = [Link](data, columns)
# Group by department and calculate average salary
avg_salary_df = [Link]("department").avg("salary")
# Show the result
avg_salary_df.show()
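If you need a named result column or several aggregations at once, agg() is the more flexible form. A brief sketch assuming the same DataFrame:

from pyspark.sql.functions import avg, max
# agg() lets you alias the output columns and combine aggregations
summary_df = df.groupBy("department").agg(
    avg("salary").alias("avg_salary"),
    max("salary").alias("max_salary")
)
summary_df.show()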
Question 5: How do you join two DataFrames in Spark? #
Solution:
# Sample data
data1 = [("John", 1), ("Doe", 2), ("Jane", 3)]
data2 = [(1, "HR"), (2, "IT"), (3, "Finance")]
# Create DataFrames
columns1 = ["name", "dept_id"]
columns2 = ["dept_id", "department"]
df1 = spark.createDataFrame(data1, columns1)
df2 = spark.createDataFrame(data2, columns2)
# Join DataFrames on dept_id
joined_df = df1.join(df2, "dept_id")
# Show the result
joined_df.show()
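By default join() performs an inner join. A likely follow-up is how to choose the join type; a short sketch using the same two DataFrames:

# Keep all rows from df1 even when there is no matching dept_id in df2
left_joined_df = df1.join(df2, on="dept_id", how="left")
left_joined_df.show()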
Question 6: How do you handle missing data in a Spark DataFrame? #
Solution:
# Sample data
data = [("John", None), ("Doe", 25), ("Jane", None), ("Mary", 30)]
# Create DataFrame
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)
# Fill missing values with a default value
df_filled = df.fillna({"age": 0})
# Show the result
df_filled.show()
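Filling is only one option; you can also drop rows that contain nulls. A minimal sketch on the same DataFrame:

# Drop rows where "age" is null
df_dropped = df.dropna(subset=["age"])
df_dropped.show()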
Question 7: How do you apply a custom function to a DataFrame column using UDF? #
Solution:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# Define UDF to convert department to uppercase
# (reuses the DataFrame from Question 4, which has a "department" column)
def convert_uppercase(department):
    return department.upper()
# Register UDF
convert_uppercase_udf = udf(convert_uppercase, StringType())
# Apply UDF to DataFrame
df_transformed = df.withColumn("department_upper", convert_uppercase_udf(df["department"]))
# Show the result
df_transformed.show()
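Worth mentioning in an interview: Python UDFs move data between the JVM and Python, so a built-in function is usually faster when one exists. A sketch of the built-in alternative for this case:

from pyspark.sql.functions import upper
# Same result without a Python UDF
df_builtin = df.withColumn("department_upper", upper(df["department"]))
df_builtin.show()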
Question 8: How do you sort a DataFrame by a specific column? #
Solution:
# Sort DataFrame by age
sorted_df = df.orderBy("age")
# Show the result
sorted_df.show()
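orderBy() sorts in ascending order by default; to sort descending, a small sketch on the same column:

from pyspark.sql.functions import desc
sorted_desc_df = df.orderBy(desc("age"))
sorted_desc_df.show()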
Question 9: How do you add a new column to a DataFrame? #
Solution:
# Add a new column derived from an existing column
df_with_new_column = df.withColumn("new_column", df["age"] * 2)
# Show the result
df_with_new_column.show()
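To add a column holding a literal constant (rather than one derived from existing columns), use lit(). A brief sketch; the column name and value below are purely illustrative:

from pyspark.sql.functions import lit
# "country" and "USA" are placeholder values for illustration
df_with_constant = df.withColumn("country", lit("USA"))
df_with_constant.show()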
Question 10: How do you remove duplicate rows from a DataFrame? #
Solution:
# Sample data with duplicates
data = [("John", 25), ("Doe", 30), ("Jane", 28), ("John", 25)]
# Create DataFrame
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)
# Remove duplicate rows
df_deduplicated = df.dropDuplicates()
# Show the result
df_deduplicated.show()
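dropDuplicates() also accepts a subset of columns when only some fields define a duplicate; a short sketch on the same DataFrame:

# Keep the first row seen for each distinct name
df_unique_names = df.dropDuplicates(["name"])
df_unique_names.show()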
These questions and solutions cover fundamental and advanced operations with Spark DataFrames, which are
essential for data processing and analysis using Spark.
PySpark Client Mode and Cluster Mode #
Apache Spark can run in multiple deployment modes, including client and cluster modes, which determine
where the Spark driver program runs and how tasks are scheduled across the cluster. Understanding the
differences between these modes is essential for optimizing Spark job performance and resource utilization.
1. PySpark Client Mode #
Client mode is a deployment mode where the Spark driver runs on the machine where the spark-submit
command is executed. The driver program communicates with the cluster’s executors to schedule and
execute tasks.
Key Characteristics of Client Mode: #
Driver Location: Runs on the machine where the user launches the application.
Best for Interactive Use: Ideal for development, debugging, and interactive sessions like using
notebooks (e.g., Jupyter) where you want immediate feedback.
Network Dependency: The driver needs to maintain a constant connection with the executors. If the
network connection between the client machine and the cluster is unstable, the job can fail.
Resource Utilization: The client machine’s resources (CPU, memory) are used for the driver, so a
powerful client machine is beneficial.
Code Implementation for Client Mode: #
To run a PySpark application in client mode, you would use the spark-submit command with --deploy-mode
client. Here’s an example:
spark-submit \
--master yarn \
--deploy-mode client \
--num-executors 3 \
--executor-cores 2 \
--executor-memory 4G \
--driver-memory 2G \
my_pyspark_script.py
Explanation:
--master yarn: Specifies YARN as the cluster manager.
--deploy-mode client: Runs the driver on the client machine where the command is executed.
--num-executors, --executor-cores, --executor-memory: Configures the number of executors, CPU
cores per executor, and memory allocation per executor.
--driver-memory: Allocates memory for the driver program on the client machine.
my_pyspark_script.py: The PySpark script that contains your Spark application code.
PySpark Script Example:
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder \
    .appName("ClientModeExample") \
    .getOrCreate()
# Sample DataFrame creation
data = [("John", 30), ("Doe", 25), ("Alice", 29)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Perform operations
df.show()
df.groupBy("Age").count().show()
# Stop SparkSession
spark.stop()
2. PySpark Cluster Mode #
Cluster mode is a deployment mode where the Spark driver runs inside the cluster, typically on one of the
worker nodes, and not on the client machine. This mode is more suitable for production jobs that require high
availability and reliability.
Key Characteristics of Cluster Mode: #
Driver Location: Runs on one of the cluster’s worker nodes.
Best for Production: Suitable for production environments where long-running jobs need stability and
don’t require interactive sessions.
Less Network Dependency: Since the driver is located within the cluster, it has more stable
connections with executors, reducing the risk of job failures due to network issues.
Resource Management: Utilizes cluster resources for the driver, freeing up client resources and often
providing more powerful hardware for the driver process.
Code Implementation for Cluster Mode: #
To run a PySpark application in cluster mode, you use spark-submit with --deploy-mode cluster. Here’s an
example:
spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 5 \
--executor-cores 4 \
--executor-memory 8G \
--driver-memory 4G \
--conf spark.yarn.submit.waitAppCompletion=false \
my_pyspark_script.py
Explanation:
--master yarn: Specifies YARN as the cluster manager.
--deploy-mode cluster: Runs the driver on a worker node within the cluster.
--num-executors, --executor-cores, --executor-memory: Configures the number of executors, CPU
cores per executor, and memory allocation per executor.
--driver-memory: Allocates memory for the driver program within the cluster.
--conf spark.yarn.submit.waitAppCompletion=false: Submits the application and returns immediately
without waiting for job completion. This is useful for running jobs asynchronously in a production
environment.
my_pyspark_script.py: The PySpark script that contains your Spark application code.
PySpark Script Example:
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder \
    .appName("ClusterModeExample") \
    .getOrCreate()
# Load data from HDFS (the paths below are placeholders)
df = spark.read.csv("hdfs:///path/to/input.csv", header=True, inferSchema=True)
# Perform operations
result_df = df.filter(df["age"] > 30).groupBy("city").count()
# Save the result back to HDFS
result_df.write.csv("hdfs:///path/to/output")
# Stop SparkSession
spark.stop()
Choosing Between Client Mode and Cluster Mode #
Use Client Mode:
For interactive analysis or debugging using notebooks.
When you need immediate feedback and are running jobs from your local machine.
For smaller workloads where the driver’s resource needs are minimal.
Use Cluster Mode:
For production jobs that require high reliability and scalability.
When running long-running batch jobs or when the driver needs significant resources.
When you want to avoid network instability affecting the driver’s connection to the executors.
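If you ever need to confirm which mode a running application was submitted in, the deploy mode is exposed through the Spark configuration. A minimal sketch (assumes an active SparkSession named spark):

# Returns "client" or "cluster" depending on how the job was submitted
deploy_mode = spark.sparkContext.getConf().get("spark.submit.deployMode")
print(deploy_mode)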
Conclusion #
Understanding the differences between client mode and cluster mode in PySpark is crucial for effectively
managing resources and optimizing job performance. Client mode is great for development and debugging,
while cluster mode is ideal for production environments where stability and resource management are
critical. By leveraging these modes appropriately, you can ensure your Spark jobs run efficiently and reliably.