Clean and Transform Strings in PySpark using regexp_replace()
If you're dealing with messy strings in your data pipelines, PySpark's regexp_replace()
function is your go-to tool. This tutorial walks through common use cases like masking emails, cleaning up price text, and normalizing phone numbers using regular expressions.
📘 Sample DataFrame
from pyspark.sql import SparkSession

# create (or reuse) the active Spark session
spark = SparkSession.builder.getOrCreate()

data = [
    ("aamir@example.com", "Price: $123.45", "(123) 456-7890"),
    ("user_1@domain.org", "Total = 999.99", "123-456-7890"),
    ("test.email@site.net", "12.34 per item", "123.456.7890")
]
columns = ["email", "price_text", "phone"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+-------------------+--------------+--------------+
|              email|    price_text|         phone|
+-------------------+--------------+--------------+
|  aamir@example.com|Price: $123.45|(123) 456-7890|
|  user_1@domain.org|Total = 999.99|  123-456-7890|
|test.email@site.net|12.34 per item|  123.456.7890|
+-------------------+--------------+--------------+
🔐 Mask Email Usernames
from pyspark.sql.functions import regexp_replace

# "^.*@" greedily matches everything up to (and including) the "@",
# so the whole username is replaced and only the domain survives
df = df.withColumn("masked_email", regexp_replace("email", "^.*@", "***@"))
df.select("email", "masked_email").show(truncate=False)
Output: Everything up to and including the `@` is replaced with `***@`, so only the domain is kept (e.g. `***@example.com`)
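If you'd rather keep a hint of the original username, a capture group can preserve its first character. This is a small sketch of my own, continuing from the same DataFrame (the hint_email column name is made up for illustration):

# keep the first character, mask the rest of the username;
# "$1" in the replacement refers back to the captured character
df = df.withColumn("hint_email", regexp_replace("email", "^(.)[^@]*@", "$1***@"))
df.select("email", "hint_email").show(truncate=False)

This turns aamir@example.com into a***@example.com instead of masking the username entirely.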
💲 Clean Price Text (remove $ and = symbols)
# the character class [$=] matches a literal "$" or "=" anywhere in the string
df = df.withColumn("clean_price", regexp_replace("price_text", "[$=]", ""))
df.select("price_text", "clean_price").show(truncate=False)
Output: Removes the `$` and `=` symbols; surrounding words such as `Price:` and `per item` are left in place
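If the goal is the bare numeric amount rather than just stripping symbols, a negated character class can drop everything except digits and the decimal point. A sketch continuing from the same DataFrame (clean_amount is a name I've made up):

# keep only digits and ".", turning "Price: $123.45" into "123.45"
df = df.withColumn("clean_amount", regexp_replace("price_text", "[^0-9.]", ""))
df.select("price_text", "clean_amount").show(truncate=False)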
📞 Normalize Phone Numbers
# [^0-9] is a negated class, so every non-digit character is stripped
df = df.withColumn("clean_phone", regexp_replace("phone", "[^0-9]", ""))
df.select("phone", "clean_phone").show(truncate=False)
Output: Keeps only digits, so all three formats collapse to the same string `1234567890`, which makes downstream comparisons and joins reliable
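Once only digits remain, capture groups can rewrite them into one standard layout. This is a sketch assuming 10-digit US numbers (formatted_phone is my own column name); $1, $2, and $3 refer back to the three captured groups:

# split the 10 digits into area code, prefix, and line number
df = df.withColumn(
    "formatted_phone",
    regexp_replace("clean_phone", "^(\\d{3})(\\d{3})(\\d{4})$", "($1) $2-$3")
)
df.select("clean_phone", "formatted_phone").show(truncate=False)

With the sample data, every row comes out as (123) 456-7890 regardless of its original formatting.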