How to Use na() and isEmpty() Functions in PySpark
Author: Aamir Shahzad
Published on: March 2025
Introduction
In this blog post, you'll learn how to use the na() and isEmpty() functions in PySpark to handle missing data and to check whether a DataFrame is empty. Both are essential for data preprocessing and validation in big data pipelines.
What is na() in PySpark?
Accessing na on a DataFrame (as the property df.na) returns a DataFrameNaFunctions object, which is used to handle null values. Its most common methods are:
- fill() - replace null values with a specified value.
- drop() - remove rows containing null values.
- replace() - replace specific values with new ones.
Example: Using na() Function
from pyspark.sql import SparkSession
# Initialize Spark Session
spark = SparkSession.builder.appName("PySpark_na_and_isEmpty").getOrCreate()
# Sample data with nulls
data = [
("Aamir Shahzad", "Engineering", 5000),
("Ali", None, 4000),
("Raza", "Marketing", None),
("Bob", "Sales", 4200),
("Lisa", None, None)
]
columns = ["Name", "Department", "Salary"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Show original DataFrame
df.show()
Expected Output
+-------------+-----------+------+
| Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering| 5000|
| Ali| null| 4000|
| Raza| Marketing| null|
| Bob| Sales| 4200|
| Lisa| null| null|
+-------------+-----------+------+
Fill null values in Department and Salary columns
df_filled = df.na.fill({
"Department": "Not Assigned",
"Salary": 0
})
df_filled.show()
Expected Output
+-------------+-------------+------+
| Name| Department|Salary|
+-------------+-------------+------+
|Aamir Shahzad| Engineering| 5000|
| Ali| Not Assigned| 4000|
| Raza| Marketing| 0|
| Bob| Sales| 4200|
| Lisa| Not Assigned| 0|
+-------------+-------------+------+
Drop rows with any null values
df_dropped = df.na.drop()
df_dropped.show()
Expected Output
+-------------+-----------+------+
| Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering| 5000|
| Bob| Sales| 4200|
+-------------+-----------+------+
Replace a specific value
df_replaced = df.na.replace("Sales", "Business Development")
df_replaced.show()
Expected Output
+-------------+--------------------+------+
|         Name|          Department|Salary|
+-------------+--------------------+------+
|Aamir Shahzad|         Engineering|  5000|
|          Ali|                null|  4000|
|         Raza|           Marketing|  null|
|          Bob|Business Development|  4200|
|         Lisa|                null|  null|
+-------------+--------------------+------+
What is isEmpty() in PySpark?
The isEmpty() function (available since PySpark 3.3) checks whether a DataFrame has no rows. It is helpful for validating the results of filters, joins, or other transformations before acting on them.
Example: Using isEmpty() Function
# Filter rows with Salary greater than 10000
df_filtered = df.filter(df.Salary > 10000)

# Check if the DataFrame is empty
if df_filtered.isEmpty():
    print("The DataFrame is empty!")
else:
    df_filtered.show()
Expected Output
The DataFrame is empty!
Explanation: There are no rows in the DataFrame where Salary > 10000, so isEmpty() returns True.
Watch the Video Tutorial
For a complete walkthrough of the na() and isEmpty() functions in PySpark, check out the accompanying video tutorial.


