String Splitting in PySpark
In this tutorial, you'll learn how to use split(str, pattern[, limit]) to break strings into arrays of substrings. The pattern is treated as a regular expression, so metacharacters such as the pipe must be escaped. We'll cover parsing emails, splitting full names, and handling pipe-delimited data.
📦 Sample Data
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    ("john.doe@example.com", "John Doe", "john|doe|35|NY"),
    ("alice.smith@mail.org", "Alice Smith", "alice|smith|29|CA"),
    ("bob.jones@test.net", "Bob Jones", "bob|jones|42|TX")
]
columns = ["email", "full_name", "user_data"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+----------------------+-------------+-------------------+
| email                | full_name   | user_data         |
+----------------------+-------------+-------------------+
| john.doe@example.com | John Doe    | john|doe|35|NY    |
| alice.smith@mail.org | Alice Smith | alice|smith|29|CA |
| bob.jones@test.net   | Bob Jones   | bob|jones|42|TX   |
+----------------------+-------------+-------------------+
📬 Split Email into Username and Domain
from pyspark.sql.functions import split
df = df.withColumn("email_parts", split("email", "@"))
df.select("email", "email_parts").show(truncate=False)
Output: ["john.doe", "example.com"], ["alice.smith", "mail.org"], etc.
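If you want the username and domain as separate columns rather than a single array, you can index into the array with getItem. A minimal sketch follows; the column names username and domain are illustrative, not part of the original example.
from pyspark.sql.functions import col

# getItem(n) pulls the n-th element out of the array column
df = df.withColumn("username", col("email_parts").getItem(0)) \
       .withColumn("domain", col("email_parts").getItem(1))
df.select("email", "username", "domain").show(truncate=False)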
👤 Split Full Name into First and Last Name
df = df.withColumn("name_split", split("full_name", " "))
df.select("full_name", "name_split").show(truncate=False)
Output: ["John", "Doe"], ["Alice", "Smith"], etc.
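If some names contain more than two words (middle names, suffixes), the optional limit argument caps the number of pieces. A small sketch, assuming you want everything after the first space kept together in the second element:
# limit=2: split at the first space only, the rest stays in one piece
df = df.withColumn("name_limited", split("full_name", " ", 2))
df.select("full_name", "name_limited").show(truncate=False)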
📎 Split Pipe-Delimited User Data
df = df.withColumn("user_fields", split("user_data", "\\|"))
df.select("user_data", "user_fields").show(truncate=False)
Output: ["john", "doe", "35", "NY"], etc.
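From here you can promote the array elements to typed columns with getItem and cast. The column names first_name, last_name, age, and state below are illustrative assumptions about what the fields mean.
from pyspark.sql.functions import col

# pull each field out of the array and cast age to an integer
df = df.withColumn("first_name", col("user_fields").getItem(0)) \
       .withColumn("last_name", col("user_fields").getItem(1)) \
       .withColumn("age", col("user_fields").getItem(2).cast("int")) \
       .withColumn("state", col("user_fields").getItem(3))
df.select("first_name", "last_name", "age", "state").show()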