Extract Substrings with regexp_extract() in PySpark
The regexp_extract() function allows you to use regular expressions to extract substrings from string columns in PySpark. This is extremely useful when working with emails, logs, or structured patterns like phone numbers or dates.
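To make the behaviour concrete, here is a minimal pure-Python sketch of what regexp_extract() does for a single string: it finds the first match of the pattern and returns the requested capture group, or an empty string when nothing matches. The helper name extract is hypothetical and is not part of PySpark. It uses Python's re module, whereas Spark uses Java regular expressions, so treat it as an approximation for simple patterns like the ones in this post.

```python
import re


def extract(text, pattern, idx):
    """Hypothetical local stand-in for regexp_extract on one string."""
    match = re.search(pattern, text)  # first match anywhere in the string
    if match is None:
        return ""  # no match: empty string, not None
    group = match.group(idx)
    return group if group is not None else ""


print(extract("user1@example.com", r"@(.*)", 1))  # example.com
print(extract("no-at-sign", r"@(.*)", 1) == "")   # True
```

A sketch like this is handy for trying a pattern on a few sample strings before running it on a full DataFrame.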
📘 Sample Data
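from pyspark.sql import SparkSession

# In a notebook, `spark` usually already exists; getOrCreate() reuses it.
spark = SparkSession.builder.appName("regexp_extract_demo").getOrCreate()

data = [
    ("user1@example.com", "[INFO] Login failed at 2024-04-15 10:23:45", "(123) 456-7890"),
    ("john.doe@mail.org", "[ERROR] Disk full at 2024-04-15 12:00:00", "125-456-7890"),
    ("alice@company.net", "[WARN] High memory at 2024-04-15 14:30:10", "123.456.7890")
]
columns = ["email", "log", "phone"]
df = spark.createDataFrame(data, columns)
df.show()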
📧 Extract Domain from Email
from pyspark.sql.functions import regexp_extract, col
df = df.withColumn("domain", regexp_extract(col("email"), "@(.*)", 1))
df.select("email", "domain").show(truncate=False)
Output: Extracts everything after the @ in the email.
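Because .* is greedy, the pattern above also captures any text that follows the domain. The sketch below checks this locally with Python's re module on a hypothetical value that has trailing text, and compares it with a tighter character class. The stricter pattern is an illustration and not a full email validator.

```python
import re

# Hypothetical value with trailing text after the address
text = "user1@example.com (work)"

# Original pattern: everything after the @, including the trailing note
greedy = re.search(r"@(.*)", text).group(1)

# Tighter variant: only letters, digits, dots and hyphens after the @
strict = re.search(r"@([A-Za-z0-9.-]+)", text).group(1)

print(greedy)  # example.com (work)
print(strict)  # example.com
```

For the clean sample data in this post both patterns give the same result; the tighter one only matters when the column can contain extra text.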
🔍 Extract Log Level
df = df.withColumn("log_level", regexp_extract(col("log"), "\\[(.*?)\\]", 1))
df.select("log", "log_level").show(truncate=False)
Output: Extracts the text inside square brackets (e.g., INFO, ERROR, WARN).
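The ? after .* makes the match lazy, so it stops at the first closing bracket. The following sketch shows the difference using Python's re module on a hypothetical log line that contains two bracketed parts.

```python
import re

# Hypothetical log line with two bracketed sections
line = "[INFO] Retry [2/3] at 2024-04-15 10:23:45"

# Lazy: stops at the first "]"
lazy = re.search(r"\[(.*?)\]", line).group(1)

# Greedy: runs on to the last "]" in the line
greedy = re.search(r"\[(.*)\]", line).group(1)

print(lazy)    # INFO
print(greedy)  # INFO] Retry [2/3
```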
📅 Extract Date from Log
pattern_date = "(\\d{4}-\\d{2}-\\d{2})"
df = df.withColumn("log_date", regexp_extract(col("log"), pattern_date, 1))
df.select("log", "log_date").show(truncate=False)
Output: Captures date in the format YYYY-MM-DD from the log string.
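The same idea extends to several capture groups: the last argument of regexp_extract() selects which group to return. As a local sketch with Python's re module, the hypothetical pattern below captures the date in group 1 and the time in group 2. In PySpark you would pass 1 or 2 as the group index to pick one of them.

```python
import re

log = "[INFO] Login failed at 2024-04-15 10:23:45"

# Group 1: date (YYYY-MM-DD), group 2: time (HH:MM:SS)
pattern_datetime = r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2})"

match = re.search(pattern_datetime, log)
log_date = match.group(1)
log_time = match.group(2)

print(log_date)  # 2024-04-15
print(log_time)  # 10:23:45
```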
📞 Extract Area Code from Phone Number
df = df.withColumn("area_code", regexp_extract(col("phone"), "(\\d{3})", 1))
df.select("phone", "area_code").show(truncate=False)
Output: Captures the first run of three consecutive digits, which is the area code in all three sample formats.
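The pattern works on the sample data because the area code is always the first group of three digits. The sketch below, using Python's re module, confirms this for the three sample formats and shows a hypothetical value where other digits come first and the result is no longer the area code.

```python
import re

phones = ["(123) 456-7890", "125-456-7890", "123.456.7890"]

# First run of three consecutive digits in each sample value
area_codes = [re.search(r"(\d{3})", p).group(1) for p in phones]
print(area_codes)  # ['123', '125', '123']

# Hypothetical value where an extension precedes the number
tricky = "Ext 4521: 123-456-7890"
wrong = re.search(r"(\d{3})", tricky).group(1)
print(wrong)  # 452
```

If the column can contain such values, a more specific pattern (for example one anchored to the start of the string) is worth considering.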