Extract Substrings with regexp_substr() in PySpark
In this tutorial, you'll learn how to use the regexp_substr() function in PySpark to extract substrings that match a regular expression. It is especially helpful for pulling dates, prices, or identifiers out of messy text data. Note that regexp_substr() was added in PySpark 3.5.0; on earlier versions, regexp_extract() offers similar functionality. When the pattern does not match, regexp_substr() returns NULL.
📘 Sample Data
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession.builder.getOrCreate()

data = [
    ("[INFO] Task completed at 2024-04-10 14:33:22", "Price: $199.99"),
    ("[ERROR] Failed on 2022-12-25 08:15:00", "Price: $49.50 + tax"),
    ("[WARN] Updated 2022-01-01 10:00:00", "10")
]
cols = ["log_msg", "price"]
df = spark.createDataFrame(data, cols)
df.show(truncate=False)
Output:
+--------------------------------------------+-------------------+
|log_msg                                     |price              |
+--------------------------------------------+-------------------+
|[INFO] Task completed at 2024-04-10 14:33:22|Price: $199.99     |
|[ERROR] Failed on 2022-12-25 08:15:00       |Price: $49.50 + tax|
|[WARN] Updated 2022-01-01 10:00:00          |10                 |
+--------------------------------------------+-------------------+
📅 Extract Date from Log Message
from pyspark.sql.functions import regexp_substr, col, lit

# regexp_substr takes the input column and the pattern as a Column (hence lit);
# unlike regexp_extract, it has no third group-index argument
df = df.withColumn("log_date", regexp_substr(col("log_msg"), lit(r"\d{4}-\d{2}-\d{2}")))
df.show(truncate=False)
Output: a new log_date column holding the first date in YYYY-MM-DD format found in each log message (e.g. 2024-04-10).
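The same idea extracts other timestamp components. Because Spark's Java regex engine and Python's re module agree on simple patterns like these, you can prototype a pattern locally before running it in Spark. This is a quick sanity-check sketch (the time pattern is an illustrative addition, not part of the original example); regexp_substr returns the first match, much like re.search here:

```python
import re

log_msg = "[INFO] Task completed at 2024-04-10 14:33:22"

# regexp_substr returns the first substring matching the pattern,
# analogous to re.search(...).group() in Python
date_match = re.search(r"\d{4}-\d{2}-\d{2}", log_msg)
time_match = re.search(r"\d{2}:\d{2}:\d{2}", log_msg)

print(date_match.group())  # 2024-04-10
print(time_match.group())  # 14:33:22
```

Once the pattern looks right, drop it into regexp_substr() with lit() as shown above.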
💲 Extract Price from Price String
# Again, pass the pattern as a Column via lit; no group-index argument
df = df.withColumn("extracted_price", regexp_substr(col("price"), lit(r"\d+\.\d+")))
df.show(truncate=False)
Output: a new extracted_price column with the numeric price value (e.g. 199.99). Note that the third row's bare "10" contains no decimal point, so the pattern does not match and the result is NULL.
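If you also want whole-number prices like the bare "10" to match, making the fractional part optional with \d+(\.\d+)? does the trick. A quick local check with Python's re module illustrates both behaviors (these simple patterns behave the same in Spark's Java regex engine):

```python
import re

prices = ["Price: $199.99", "Price: $49.50 + tax", "10"]

# \d+\.\d+ requires a decimal point, so "10" yields no match (NULL in Spark)
strict = [m.group() if (m := re.search(r"\d+\.\d+", p)) else None for p in prices]
print(strict)  # ['199.99', '49.50', None]

# \d+(\.\d+)? makes the fractional part optional, so "10" is captured too
loose = [m.group() if (m := re.search(r"\d+(\.\d+)?", p)) else None for p in prices]
print(loose)  # ['199.99', '49.50', '10']
```

In Spark, the same change means passing lit(r"\d+(\.\d+)?") as the pattern to regexp_substr().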