Clean and Transform Strings in PySpark using regexp_replace()
If you're dealing with messy strings in your data pipelines, PySpark's regexp_replace()
function is your go-to tool. This tutorial walks through common use cases like masking emails, cleaning up price text, and normalizing phone numbers using regular expressions.
📘 Sample DataFrame
from pyspark.sql import SparkSession

# create (or reuse) the active Spark session
spark = SparkSession.builder.getOrCreate()

data = [
    ("aamir@example.com", "Price: $123.45", "(123) 456-7890"),
    ("user_1@domain.org", "Total = 999.99", "123-456-7890"),
    ("test.email@site.net", "12.34 per item", "123.456.7890")
]
columns = ["email", "price_text", "phone"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+-------------------+--------------+--------------+
|              email|    price_text|         phone|
+-------------------+--------------+--------------+
|  aamir@example.com|Price: $123.45|(123) 456-7890|
|  user_1@domain.org|Total = 999.99|  123-456-7890|
|test.email@site.net|12.34 per item|  123.456.7890|
+-------------------+--------------+--------------+
🔐 Mask Email Usernames
from pyspark.sql.functions import regexp_replace

# "^.*@" greedily matches everything up to (and including) the "@",
# so the whole username is replaced and only the domain survives
df = df.withColumn("masked_email", regexp_replace("email", "^.*@", "***@"))
df.select("email", "masked_email").show(truncate=False)
Output: Everything up to and including the `@` is replaced with `***@`, so only the domain is kept (e.g. `***@example.com`)
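If you'd rather keep a hint of the original username, a capture group can preserve its first character. This is a small sketch of my own, continuing from the same DataFrame (the hint_email column name is made up for illustration):

# keep the first character, mask the rest of the username;
# "$1" in the replacement refers back to the captured character
df = df.withColumn("hint_email", regexp_replace("email", "^(.)[^@]*@", "$1***@"))
df.select("email", "hint_email").show(truncate=False)

This turns aamir@example.com into a***@example.com instead of masking the username entirely.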
💲 Clean Price Text (remove $ and = symbols)
# the character class [$=] matches a literal "$" or "=" anywhere in the string
df = df.withColumn("clean_price", regexp_replace("price_text", "[$=]", ""))
df.select("price_text", "clean_price").show(truncate=False)
Output: Removes the `$` and `=` symbols; surrounding words such as `Price:` and `per item` are left in place
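If the goal is the bare numeric amount rather than just stripping symbols, a negated character class can drop everything except digits and the decimal point. A sketch continuing from the same DataFrame (clean_amount is a name I've made up):

# keep only digits and ".", turning "Price: $123.45" into "123.45"
df = df.withColumn("clean_amount", regexp_replace("price_text", "[^0-9.]", ""))
df.select("price_text", "clean_amount").show(truncate=False)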
📞 Normalize Phone Numbers
# [^0-9] is a negated class, so every non-digit character is stripped
df = df.withColumn("clean_phone", regexp_replace("phone", "[^0-9]", ""))
df.select("phone", "clean_phone").show(truncate=False)
Output: Keeps only digits, so all three formats collapse to the same string `1234567890`, which makes downstream comparisons and joins reliable
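Once only digits remain, capture groups can rewrite them into one standard layout. This is a sketch assuming 10-digit US numbers (formatted_phone is my own column name); $1, $2, and $3 refer back to the three captured groups:

# split the 10 digits into area code, prefix, and line number
df = df.withColumn(
    "formatted_phone",
    regexp_replace("clean_phone", "^(\\d{3})(\\d{3})(\\d{4})$", "($1) $2-$3")
)
df.select("clean_phone", "formatted_phone").show(truncate=False)

With the sample data, every row comes out as (123) 456-7890 regardless of its original formatting.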