PySpark Tutorial: Extract Data with regexp_extract() in PySpark | Regex Patterns Made Easy #pyspark

Extract Substrings with regexp_extract() in PySpark

The regexp_extract() function allows you to use regular expressions to extract substrings from string columns in PySpark. This is extremely useful when working with emails, logs, or structured patterns like phone numbers or dates.
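Its signature is regexp_extract(str, pattern, idx): str is the string column to search, pattern is a Java-style regular expression, and idx picks the capture group to return (0 is the whole match, and an empty string comes back when nothing matches). A quick throwaway sketch, assuming an active SparkSession named spark and an illustrative code column:

from pyspark.sql.functions import regexp_extract

# Tiny throwaway DataFrame just to show the call shape
demo = spark.createDataFrame([("order-42",)], ["code"])

# Group 1 is the digits captured by (\d+); group 0 would be the whole match "order-42"
demo.select(regexp_extract("code", "order-(\\d+)", 1).alias("order_id")).show()  # order_id -> "42"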

📘 Sample Data

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; skip this step if one already exists, e.g. in a notebook
spark = SparkSession.builder.appName("regexp_extract_tutorial").getOrCreate()

data = [
  ("user1@example.com", "[INFO] Login failed at 2024-04-15 10:23:45", "(123) 456-7890"),
  ("john.doe@mail.org", "[ERROR] Disk full at 2024-04-15 12:00:00", "125-456-7890"),
  ("alice@company.net", "[WARN] High memory at 2024-04-15 14:30:10", "123.456.7890")
]

columns = ["email", "log", "phone"]
df = spark.createDataFrame(data, columns)
df.show()

📧 Extract Domain from Email

from pyspark.sql.functions import regexp_extract, col

df = df.withColumn("domain", regexp_extract(col("email"), "@(.*)", 1))
df.select("email", "domain").show(truncate=False)

Output: Extracts everything after @ in the email.
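The same idea extends to several capture groups in one pattern. A hedged sketch (the username and tld column names and the email_pattern below are my own, not part of the original tutorial):

# Group 1 = the part before @, group 2 = the full domain, group 3 = the TLD
email_pattern = "([^@]+)@(.+\\.([a-z]+))$"

df = df.withColumn("username", regexp_extract(col("email"), email_pattern, 1)) \
       .withColumn("tld", regexp_extract(col("email"), email_pattern, 3))
df.select("email", "username", "tld").show(truncate=False)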

🔍 Extract Log Level

df = df.withColumn("log_level", regexp_extract(col("log"), "\\[(.*?)\\]", 1))
df.select("log", "log_level").show(truncate=False)

Output: Extracts the text inside square brackets (e.g., INFO, ERROR, WARN).
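A complementary capture, sketched here with an assumed log_message column name, grabs everything after the bracketed level:

# Match the closing bracket, then capture the rest of the line
df = df.withColumn("log_message", regexp_extract(col("log"), "\\]\\s*(.*)", 1))
df.select("log_level", "log_message").show(truncate=False)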

📅 Extract Date from Log

pattern_date = "(\\d{4}-\\d{2}-\\d{2})"
df = df.withColumn("log_date", regexp_extract(col("log"), pattern_date, 1))
df.select("log", "log_date").show(truncate=False)

Output: Captures date in the format YYYY-MM-DD from the log string.
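Because the captured value is still a plain string, you may want a proper DateType column. A small follow-up sketch using to_date (the log_date_parsed column name is illustrative):

from pyspark.sql.functions import to_date

# Parse the extracted yyyy-MM-dd string into a DateType column
df = df.withColumn("log_date_parsed", to_date(col("log_date"), "yyyy-MM-dd"))
df.select("log_date", "log_date_parsed").printSchema()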

📞 Extract Area Code from Phone Number

df = df.withColumn("area_code", regexp_extract(col("phone"), "(\\d{3})", 1))
df.select("phone", "area_code").show(truncate=False)

Output: Captures the first 3 digits (area code) from the phone number.
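Note that regexp_extract() returns an empty string when the pattern does not match. When phone formats vary, a common companion is regexp_replace(), which strips every non-digit before you extract anything; a sketch with an assumed phone_digits column name:

from pyspark.sql.functions import regexp_replace

# Remove everything that is not a digit, e.g. "(123) 456-7890" -> "1234567890"
df = df.withColumn("phone_digits", regexp_replace(col("phone"), "[^0-9]", ""))
df.select("phone", "phone_digits", "area_code").show(truncate=False)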

🎥 Watch the Full Tutorial
