Substring Functions in PySpark: substr(), substring(), overlay(), left(), right() with Real Examples | PySpark Tutorial

Extract Substrings in PySpark | substr(), substring(), overlay(), left(), right()

Extracting Substrings in PySpark

In this tutorial, you'll learn how to use PySpark string functions like substr(), substring(), overlay(), left(), and right() to manipulate string columns in DataFrames.

1️⃣ Initialize Spark Session

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SubstringTutorial").getOrCreate()

2️⃣ Sample DataFrame

data = [("EMP-0001-NC", "maria.fernandez@healthcare.org"),
        ("EMP-0002-TX", "john.davis@fintech.com"),
        ("EMP-0003-CA", "lucy.liu@retailmart.net"),
        ("EMP-0004-FL", "peter.parker@dailybugle.com"),
        ("EMP-0005-NY", "bruce.wayne@gotham.org")]

columns = ["employee_code", "email"]

df = spark.createDataFrame(data, columns)
df.show(truncate=False)

Output:

+-------------+-----------------------------+
|employee_code|email                        |
+-------------+-----------------------------+
|EMP-0001-NC  |maria.fernandez@healthcare.org|
|EMP-0002-TX  |john.davis@fintech.com       |
|EMP-0003-CA  |lucy.liu@retailmart.net      |
|EMP-0004-FL  |peter.parker@dailybugle.com  |
|EMP-0005-NY  |bruce.wayne@gotham.org       |
+-------------+-----------------------------+

3️⃣ substr() Function

Extract substring starting at a specific position with defined length.

from pyspark.sql.functions import col, substr
df_substr = df.withColumn("emp_id", substr("employee_code", 5, 4))
df_substr.show(truncate=False)

Output:

+-------------+-----------------------------+------+
|employee_code|email                        |emp_id|
+-------------+-----------------------------+------+
|EMP-0001-NC  |maria.fernandez@healthcare.org|0001  |
|EMP-0002-TX  |john.davis@fintech.com       |0002  |
|EMP-0003-CA  |lucy.liu@retailmart.net      |0003  |
|EMP-0004-FL  |peter.parker@dailybugle.com  |0004  |
|EMP-0005-NY  |bruce.wayne@gotham.org       |0005  |
+-------------+-----------------------------+------+

4️⃣ substring() Function

Alias for substr() - works the same way.

from pyspark.sql.functions import substring
df_substring = df.withColumn("emp_id", substring("employee_code", 5, 4))
df_substring.show(truncate=False)

5️⃣ overlay() Function

Replaces part of the string with another string.

from pyspark.sql.functions import overlay
df_overlay = df.withColumn("masked_email", overlay("email", "XXXXXX", 1, 5))
df_overlay.show(truncate=False)

6️⃣ left() and right() Functions

Extract leftmost or rightmost characters from string.

from pyspark.sql.functions import left, right

df_left = df.withColumn("left_code", left("employee_code", 3))
df_right = df.withColumn("right_code", right("employee_code", 2))

df_left.show(truncate=False)
df_right.show(truncate=False)

📺 Watch the Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.