split_part() in PySpark – Extract String Parts by Delimiter
In this tutorial, you'll learn how to use the split_part()
function in PySpark to extract a specific part of a string by a given delimiter, such as pulling the username from an email address or the ZIP code from a location string.
📘 Sample Data
from pyspark.sql import SparkSession

# Create (or reuse) the SparkSession used throughout this tutorial
spark = SparkSession.builder.appName("split_part_demo").getOrCreate()

data = [
    ("john.doe@example.com", "NY-10001-USA"),
    ("alice.smith@domain.org", "CA-90001-USA"),
    ("bob99@company.net", "TX-73301-USA")
]
columns = ["email", "location"]

df = spark.createDataFrame(data, columns)
df.show(truncate=False)
Output:
+----------------------+------------+
|email                 |location    |
+----------------------+------------+
|john.doe@example.com  |NY-10001-USA|
|alice.smith@domain.org|CA-90001-USA|
|bob99@company.net     |TX-73301-USA|
+----------------------+------------+
📍 Extract Specific Parts Using split_part()
from pyspark.sql.functions import split_part, col, lit

# split_part(src, delimiter, partNum) expects Column arguments (available since Spark 3.5.0),
# so literal delimiters and part numbers are wrapped in lit(). Part numbers are 1-based.
df = df.withColumn("username", split_part(col("email"), lit("@"), lit(1))) \
       .withColumn("domain",   split_part(col("email"), lit("@"), lit(2))) \
       .withColumn("state",    split_part(col("location"), lit("-"), lit(1))) \
       .withColumn("zip",      split_part(col("location"), lit("-"), lit(2))) \
       .withColumn("country",  split_part(col("location"), lit("-"), lit(3)))

df.select("email", "username", "domain", "location", "state", "zip", "country").show(truncate=False)
Output:
+----------------------+-----------+-----------+------------+-----+-----+-------+
|email                 |username   |domain     |location    |state|zip  |country|
+----------------------+-----------+-----------+------------+-----+-----+-------+
|john.doe@example.com  |john.doe   |example.com|NY-10001-USA|NY   |10001|USA    |
|alice.smith@domain.org|alice.smith|domain.org |CA-90001-USA|CA   |90001|USA    |
|bob99@company.net     |bob99      |company.net|TX-73301-USA|TX   |73301|USA    |
+----------------------+-----------+-----------+------------+-----+-----+-------+
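📍 Negative Part Numbers and a Pre-3.5 Fallback
split_part() uses 1-based part numbers; a negative part number counts from the end of the string, an out-of-range part number returns an empty string, and a part number of 0 raises an error. The sketch below illustrates the negative-index form, plus a split() + getItem() fallback for clusters on Spark versions older than 3.5.0, where split_part() is not available. The column names country_from_end and zip_via_split are just illustrative names added here; the DataFrame is the sample one from above.
from pyspark.sql.functions import split_part, split, col, lit

# Negative partNum counts parts from the right: -1 is the last segment,
# so this pulls the country code without knowing how many parts exist.
df = df.withColumn("country_from_end", split_part(col("location"), lit("-"), lit(-1)))

# Pre-3.5 fallback: split() returns an array column, and getItem() is 0-based,
# so getItem(1) fetches the second segment (the ZIP code).
df = df.withColumn("zip_via_split", split(col("location"), "-").getItem(1))

df.select("location", "country_from_end", "zip_via_split").show(truncate=False)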