Extract Substrings with regexp_substr() in PySpark

In this tutorial, you'll learn how to use the regexp_substr() function in PySpark to extract specific patterns or substrings using regular expressions. This function is especially helpful for extracting dates, prices, or identifiers from messy text data.
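
Note: regexp_substr() is available from PySpark 3.5.0 onward. It takes two arguments, the input string column and the regex pattern column, and returns the first matching substring (or NULL when nothing matches). Because the pattern argument is a column, a literal regex is usually wrapped in lit(). Below is a minimal sketch of the call shape; the demo column names note and order_id are illustrative, not from the data used later.

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_substr, col, lit

spark = SparkSession.builder.getOrCreate()

# Pull the first run of digits out of a small demo column.
demo = spark.createDataFrame([("order-12345-shipped",)], ["note"])
demo.select(regexp_substr(col("note"), lit("\\d+")).alias("order_id")).show()
# The order_id column contains "12345".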

📘 Sample Data

data = [
    ("[INFO] Task completed at 2024-04-10 14:33:22", "Price: $199.99"),
    ("[ERROR] Failed on 2022-12-25 08:15:00", "Price: $49.50 + tax"),
    ("[WARN] Updated 2022-01-01 10:00:00", "10")
]
cols = ["log_msg", "price"]
df = spark.createDataFrame(data, cols)
df.show(truncate=False)

Output:

+--------------------------------------------+-------------------+
|log_msg                                     |price              |
+--------------------------------------------+-------------------+
|[INFO] Task completed at 2024-04-10 14:33:22|Price: $199.99     |
|[ERROR] Failed on 2022-12-25 08:15:00       |Price: $49.50 + tax|
|[WARN] Updated 2022-01-01 10:00:00          |10                 |
+--------------------------------------------+-------------------+

📅 Extract Date from Log Message

from pyspark.sql.functions import regexp_substr, col, lit

# regexp_substr() takes the input column and a pattern column; wrap the
# literal regex in lit(). It returns the first match, or NULL if none.
df = df.withColumn("log_date", regexp_substr(col("log_msg"), lit("\\d{4}-\\d{2}-\\d{2}")))
df.show(truncate=False)

Output: a new log_date column containing the YYYY-MM-DD date extracted from each log message (for example, 2024-04-10).
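
The same function handles the other identifiers mentioned in the introduction. As a sketch that is not part of the original example (the column name log_level is illustrative), the bracketed log level can be pulled out with a character-class pattern:

from pyspark.sql.functions import regexp_substr, col, lit

# Extract the bracketed log level, e.g. [INFO], [ERROR], [WARN].
df = df.withColumn("log_level", regexp_substr(col("log_msg"), lit("\\[[A-Z]+\\]")))
df.select("log_msg", "log_level").show(truncate=False)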

💲 Extract Price from Price String

df = df.withColumn("extracted_price", regexp_substr(col("price"), "\\d+\\.\\d+", 0))
df.show(truncate=False)

Output: the numeric price (for example, 199.99) extracted as a string; rows with no decimal value, such as "10", return NULL.
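
Because regexp_substr() returns a string (and NULL where the pattern finds no match), a common follow-up is casting the result to a numeric type. A minimal sketch, assuming the extracted_price column created above; price_value is an illustrative name:

from pyspark.sql.functions import col

# Cast the extracted string to a double for numeric comparisons.
# Rows where the pattern did not match (e.g. "10") remain NULL.
df = df.withColumn("price_value", col("extracted_price").cast("double"))
df.select("price", "extracted_price", "price_value").show(truncate=False)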

🎥 Watch the Full Tutorial

