Extracting Substrings in PySpark
In this tutorial, you'll learn how to use PySpark string functions like substr(), substring(), overlay(), left(), and right() to manipulate string columns in DataFrames.
1️⃣ Initialize Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SubstringTutorial").getOrCreate()
2️⃣ Sample DataFrame
data = [("EMP-0001-NC", "maria.fernandez@healthcare.org"),
("EMP-0002-TX", "john.davis@fintech.com"),
("EMP-0003-CA", "lucy.liu@retailmart.net"),
("EMP-0004-FL", "peter.parker@dailybugle.com"),
("EMP-0005-NY", "bruce.wayne@gotham.org")]
columns = ["employee_code", "email"]
df = spark.createDataFrame(data, columns)
df.show(truncate=False)
Output:
+-------------+-----------------------------+
|employee_code|email |
+-------------+-----------------------------+
|EMP-0001-NC |maria.fernandez@healthcare.org|
|EMP-0002-TX |john.davis@fintech.com |
|EMP-0003-CA |lucy.liu@retailmart.net |
|EMP-0004-FL |peter.parker@dailybugle.com |
|EMP-0005-NY |bruce.wayne@gotham.org |
+-------------+-----------------------------+
3️⃣ substr() Function
Extract substring starting at a specific position with defined length.
from pyspark.sql.functions import col, substr
df_substr = df.withColumn("emp_id", substr("employee_code", 5, 4))
df_substr.show(truncate=False)
Output:
+-------------+-----------------------------+------+
|employee_code|email |emp_id|
+-------------+-----------------------------+------+
|EMP-0001-NC |maria.fernandez@healthcare.org|0001 |
|EMP-0002-TX |john.davis@fintech.com |0002 |
|EMP-0003-CA |lucy.liu@retailmart.net |0003 |
|EMP-0004-FL |peter.parker@dailybugle.com |0004 |
|EMP-0005-NY |bruce.wayne@gotham.org |0005 |
+-------------+-----------------------------+------+
4️⃣ substring() Function
Alias for substr() - works the same way.
from pyspark.sql.functions import substring
df_substring = df.withColumn("emp_id", substring("employee_code", 5, 4))
df_substring.show(truncate=False)
5️⃣ overlay() Function
Replaces part of the string with another string.
from pyspark.sql.functions import overlay
df_overlay = df.withColumn("masked_email", overlay("email", "XXXXXX", 1, 5))
df_overlay.show(truncate=False)
6️⃣ left() and right() Functions
Extract leftmost or rightmost characters from string.
from pyspark.sql.functions import left, right
df_left = df.withColumn("left_code", left("employee_code", 3))
df_right = df.withColumn("right_code", right("employee_code", 2))
df_left.show(truncate=False)
df_right.show(truncate=False)



No comments:
Post a Comment
Note: Only a member of this blog may post a comment.