PySpark String Functions Explained
In this tutorial, you'll learn how to use the PySpark string functions contains(), startswith(), endswith(), and substr(). These functions are useful for filtering, searching, and extracting string data in PySpark DataFrames.
🔹 Sample Data
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("StringFunctionsExample").getOrCreate()
data = [
    (1, "Aamir"),
    (2, "Ali"),
    (3, "Bob"),
    (4, "Lisa"),
    (5, "Zara"),
    (6, "ALINA"),
    (7, "amrita"),
    (8, "Sana")
]
columns = ["id", "name"]
df = spark.createDataFrame(data, columns)
df.show()
🔹 contains() Function
Filter rows where name contains "a":
df.filter(col("name").contains("a")).show()
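Note that contains() matches case-sensitively: "ALINA" is not returned even though it contains an uppercase "A". To illustrate the semantics on the same sample names, here is a plain-Python sketch (not Spark itself):

```python
# Plain-Python mirror of col("name").contains("a") on the sample data.
# Spark's Column.contains("a") keeps a row when the lowercase letter "a"
# appears anywhere in the string -- the check is case-sensitive.
names = ["Aamir", "Ali", "Bob", "Lisa", "Zara", "ALINA", "amrita", "Sana"]

matches = [n for n in names if "a" in n]
print(matches)  # ['Aamir', 'Lisa', 'Zara', 'amrita', 'Sana']
```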
🔹 startswith() Function
Filter rows where name starts with "A":
df.filter(col("name").startswith("A")).show()
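startswith() is also case-sensitive, so "amrita" (lowercase "a") is excluded even though it starts with the same letter. A plain-Python sketch of the same check:

```python
# Plain-Python mirror of col("name").startswith("A") -- case-sensitive,
# so "amrita" (lowercase "a") does not match.
names = ["Aamir", "Ali", "Bob", "Lisa", "Zara", "ALINA", "amrita", "Sana"]

matches = [n for n in names if n.startswith("A")]
print(matches)  # ['Aamir', 'Ali', 'ALINA']
```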
🔹 endswith() Function
Filter rows where name ends with "a":
df.filter(col("name").endswith("a")).show()
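Here, "ALINA" is filtered out because it ends with an uppercase "A", not a lowercase "a". The same logic in plain Python, for illustration:

```python
# Plain-Python mirror of col("name").endswith("a") -- "ALINA" ends with an
# uppercase "A", so the case-sensitive match drops it.
names = ["Aamir", "Ali", "Bob", "Lisa", "Zara", "ALINA", "amrita", "Sana"]

matches = [n for n in names if n.endswith("a")]
print(matches)  # ['Lisa', 'Zara', 'amrita', 'Sana']
```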
🔹 substr() Function
Extract first two characters from name:
df.withColumn("first_two", col("name").substr(1, 2)).show()
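Unlike Python slicing, Column.substr(startPos, length) takes a 1-based start position and a length, so substr(1, 2) means "start at character 1 and take 2 characters". In plain Python that corresponds to s[0:2], as this sketch on the sample names shows:

```python
# substr(1, 2) in Spark == s[0:2] in Python: 1-based start, then a length.
names = ["Aamir", "Ali", "Bob", "Lisa", "Zara", "ALINA", "amrita", "Sana"]

first_two = [n[0:2] for n in names]  # mirrors col("name").substr(1, 2)
print(first_two)  # ['Aa', 'Al', 'Bo', 'Li', 'Za', 'AL', 'am', 'Sa']
```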


