PySpark Tutorial: Extract Data with regexp_extract() in PySpark | Regex Patterns Made Easy #pyspark

Extract Substrings with regexp_extract() in PySpark

The regexp_extract() function allows you to use regular expressions to extract substrings from string columns in PySpark. This is extremely useful when working with emails, logs, or structured patterns like phone numbers or dates.
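Its signature is regexp_extract(str, pattern, idx): str is the string column to search, pattern is a Java-style regular expression, and idx picks the capture group to return (0 is the whole match, and an empty string comes back when nothing matches). A quick throwaway sketch, assuming an active SparkSession named spark and an illustrative code column:

from pyspark.sql.functions import regexp_extract

# Tiny throwaway DataFrame just to show the call shape
demo = spark.createDataFrame([("order-42",)], ["code"])

# Group 1 is the digits captured by (\d+); group 0 would be the whole match "order-42"
demo.select(regexp_extract("code", "order-(\\d+)", 1).alias("order_id")).show()  # order_id -> "42"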

📘 Sample Data

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; skip this step if one already exists, e.g. in a notebook
spark = SparkSession.builder.appName("regexp_extract_tutorial").getOrCreate()

data = [
  ("user1@example.com", "[INFO] Login failed at 2024-04-15 10:23:45", "(123) 456-7890"),
  ("john.doe@mail.org", "[ERROR] Disk full at 2024-04-15 12:00:00", "125-456-7890"),
  ("alice@company.net", "[WARN] High memory at 2024-04-15 14:30:10", "123.456.7890")
]

columns = ["email", "log", "phone"]
df = spark.createDataFrame(data, columns)
df.show()

📧 Extract Domain from Email

from pyspark.sql.functions import regexp_extract, col

df = df.withColumn("domain", regexp_extract(col("email"), "@(.*)", 1))
df.select("email", "domain").show(truncate=False)

Output: Extracts everything after @ in the email.
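The same idea extends to several capture groups in one pattern. A hedged sketch (the username and tld column names and the email_pattern below are my own, not part of the original tutorial):

# Group 1 = the part before @, group 2 = the full domain, group 3 = the TLD
email_pattern = "([^@]+)@(.+\\.([a-z]+))$"

df = df.withColumn("username", regexp_extract(col("email"), email_pattern, 1)) \
       .withColumn("tld", regexp_extract(col("email"), email_pattern, 3))
df.select("email", "username", "tld").show(truncate=False)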

🔍 Extract Log Level

df = df.withColumn("log_level", regexp_extract(col("log"), "\\[(.*?)\\]", 1))
df.select("log", "log_level").show(truncate=False)

Output: Extracts the text inside square brackets (e.g., INFO, ERROR, WARN).
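A complementary capture, sketched here with an assumed log_message column name, grabs everything after the bracketed level:

# Match the closing bracket, then capture the rest of the line
df = df.withColumn("log_message", regexp_extract(col("log"), "\\]\\s*(.*)", 1))
df.select("log_level", "log_message").show(truncate=False)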

📅 Extract Date from Log

pattern_date = "(\\d{4}-\\d{2}-\\d{2})"
df = df.withColumn("log_date", regexp_extract(col("log"), pattern_date, 1))
df.select("log", "log_date").show(truncate=False)

Output: Captures date in the format YYYY-MM-DD from the log string.
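Because the captured value is still a plain string, you may want a proper DateType column. A small follow-up sketch using to_date (the log_date_parsed column name is illustrative):

from pyspark.sql.functions import to_date

# Parse the extracted yyyy-MM-dd string into a DateType column
df = df.withColumn("log_date_parsed", to_date(col("log_date"), "yyyy-MM-dd"))
df.select("log_date", "log_date_parsed").printSchema()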

📞 Extract Area Code from Phone Number

df = df.withColumn("area_code", regexp_extract(col("phone"), "(\\d{3})", 1))
df.select("phone", "area_code").show(truncate=False)

Output: Captures the first 3 digits (area code) from the phone number.
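Note that regexp_extract() returns an empty string when the pattern does not match. When phone formats vary, a common companion is regexp_replace(), which strips every non-digit before you extract anything; a sketch with an assumed phone_digits column name:

from pyspark.sql.functions import regexp_replace

# Remove everything that is not a digit, e.g. "(123) 456-7890" -> "1234567890"
df = df.withColumn("phone_digits", regexp_replace(col("phone"), "[^0-9]", ""))
df.select("phone", "phone_digits", "area_code").show(truncate=False)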

🎥 Watch the Full Tutorial
