Handling Null Values & Missing Data in PySpark
Karthik Kondpak
1. Checking for Null Values
Before handling null values, it's good practice to check
for their presence in a DataFrame.
Using isNull() and isNotNull()
from pyspark.sql.functions import col
df.filter(col("column_name").isNull()).show()
df.filter(col("column_name").isNotNull()).show()
Checking Nulls Across All Columns
df.select([col(c).isNull().alias(c) for c in df.columns]).show()
Counting Null Values
from pyspark.sql.functions import count, when
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()
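Note that isNull() does not treat NaN as null; for float/double columns that may contain NaN, a combined check can be used (a sketch that selects only the floating-point columns):
from pyspark.sql.functions import count, when, col, isnan
float_cols = [f.name for f in df.schema.fields if f.dataType.typeName() in ("float", "double")]
df.select([count(when(col(c).isNull() | isnan(col(c)), c)).alias(c) for c in float_cols]).show()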
2. Dropping Null Values
PySpark provides the .dropna() function (equivalently df.na.drop()) to remove rows containing null values.
df.na.drop()
Dropping Rows Based on Conditions
Drop rows if any column has null:
df.na.drop("any")
Drop rows only if all columns are null:
df.na.drop("all")
Drop rows based on a subset of columns:
df.na.drop(subset=["column1", "column2"])
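na.drop() also accepts a thresh argument to keep only rows with at least a given number of non-null values, for example:
df.na.drop(thresh=2)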
3. Filling Null Values
Instead of dropping null values, you can replace
them using .fillna().
Filling with a Specific Value
df.na.fill("default_value").show()
df.na.fill(0).show()
Filling with Different Values for Different Columns
df.na.fill({"column1": "unknown", "column2": 0}).show()
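Note that a single fill value only applies to columns of a matching type: df.na.fill(0) touches numeric columns and df.na.fill("default_value") touches string columns. A quick sketch with a toy DataFrame (assuming an active SparkSession named spark; column names are illustrative):
toy_df = spark.createDataFrame([("a", 1), (None, None)], ["name", "age"])
toy_df.na.fill(0).show()          # only "age" is filled
toy_df.na.fill("unknown").show()  # only "name" is filled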
4. Replacing Null with Mean/Median/Mode
To replace null values with the column mean,
median, or mode:
Filling with Mean
from pyspark.sql.functions import mean
mean_value = df.select(mean(col("column_name"))).collect()[0][0]
df = df.na.fill(mean_value, subset=["column_name"])
Filling with Median
median_value = df.approxQuantile("column_name", [0.5], 0.01)[0]
df = df.na.fill(median_value, subset=["column_name"])
Filling with Mode
from pyspark.sql.functions import col
# filter out nulls first so a null group cannot be picked as the mode
mode_value = df.filter(col("column_name").isNotNull()).groupBy("column_name").count().orderBy(col("count").desc()).first()[0]
df = df.na.fill(mode_value, subset=["column_name"])
5. Replacing Null Values with Forward Fill & Backward Fill
PySpark doesn’t have built-in functions for forward fill
(ffill) or backward fill (bfill), but you can achieve it using
window functions.
Forward Fill (ffill)
import sys
from pyspark.sql.window import Window
from pyspark.sql.functions import last
window_spec = Window.orderBy("some_column").rowsBetween(-sys.maxsize, 0)
df = df.withColumn("column_name", last("column_name", True).over(window_spec))
Backward Fill (bfill)
from pyspark.sql.functions import first
window_spec = Window.orderBy("some_column").rowsBetween(0, sys.maxsize)
df = df.withColumn("column_name", first("column_name", True).over(window_spec))
6. Handling Null Values in Aggregations
By default, PySpark ignores null values in aggregations.
Count Including Nulls
from pyspark.sql.functions import count, lit
df.agg(count(lit(1))).show()
Count Excluding Nulls (the default when counting a specific column)
df.agg(count("column_name")).show()
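A quick toy example (assuming an active SparkSession named spark) showing that avg() and count() on a column skip nulls, while count(lit(1)) counts every row:
from pyspark.sql.functions import avg, count, lit
toy = spark.createDataFrame([(1,), (None,), (3,)], ["value"])
toy.agg(avg("value"), count("value"), count(lit(1))).show()
# avg = 2.0, count("value") = 2, count(lit(1)) = 3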
7. Handling Null Values in Joins
While performing joins, null values in keys can create
issues.
Using fillna() Before Joins
df1 = df1.na.fill("unknown", subset=["join_column"])
df2 = df2.na.fill("unknown", subset=["join_column"])
joined_df = df1.join(df2, on="join_column", how="inner")
Using coalesce() to Handle Null Joins
from pyspark.sql.functions import coalesce, lit
df1 = df1.withColumn("join_column", coalesce(col("join_column"), lit("default_value")))
df2 = df2.withColumn("join_column", coalesce(col("join_column"), lit("default_value")))
joined_df = df1.join(df2, "join_column", "inner")
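Alternatively, the null-safe equality operator eqNullSafe() lets rows whose keys are both null match each other, without rewriting the key column:
joined_df = df1.join(df2, df1["join_column"].eqNullSafe(df2["join_column"]), "inner")
Note that this form keeps join_column from both DataFrames, so one copy is typically dropped or renamed afterwards.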
8. Handling Nulls in Complex Data Types
Handling Null in Arrays
array_remove() cannot drop nulls (null comparisons never match), so use the higher-order filter function instead (or array_compact() in Spark 3.4+):
from pyspark.sql.functions import expr
df = df.withColumn("array_column", expr("filter(array_column, x -> x IS NOT NULL)"))
Handling Null in Structs
.fillna() does not reach into nested fields, so fill nulls inside a struct by rebuilding it with the fields coalesced (a sketch, assuming the struct has fields field1 and field2):
from pyspark.sql.functions import struct, coalesce, lit, col
df = df.withColumn("struct_column", struct(
    coalesce(col("struct_column.field1"), lit("unknown")).alias("field1"),
    col("struct_column.field2").alias("field2")))