PySpark Null & Comparison Functions Explained
This PySpark tutorial explains how to use essential functions for handling nulls, filtering data, and performing pattern matching in DataFrames using:
between()
isNull() and isNotNull()
isin()
like(), rlike(), and ilike()
1. Create a Sample DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("NullComparisonOps").getOrCreate()
data = [
(1, "Aamir", 50000),
(2, "Ali", None),
(3, "Bob", 45000),
(4, "Lisa", 60000),
(5, "Zara", None),
(6, "ALINA", 55000)
]
columns = ["id", "name", "salary"]
df = spark.createDataFrame(data, columns)
df.show()
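Python None values become SQL nulls in the DataFrame, and Spark infers salary as a nullable long. To confirm the inferred schema, you can print it:
df.printSchema()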
2. Use the between() Function
Select employees whose salary is between 45000 and 60000:
df.filter(col("salary").between(45000, 60000)).show()
3. Use isNull() and isNotNull()
Filter rows where salary is null:
df.filter(col("salary").isNull()).show()
Filter rows where salary is not null:
df.filter(col("salary").isNotNull()).show()
4. Use the isin() Function
Filter names that are in the list ["Aamir", "Lisa"]:
df.filter(col("name").isin("Aamir", "Lisa")).show()
5. Use like(), rlike(), and ilike()
Names that start with 'A':
df.filter(col("name").like("A%")).show()
Names matching regex (e.g., all names ending in 'a'):
df.filter(col("name").rlike(".*a$")).show()
Case-insensitive LIKE (if using Spark 3.3+):
df.filter(col("name").ilike("ali%")).show()