Split Strings in PySpark | split(str, pattern, limit) Function Explained with Examples | PySpark Tutorial


In this tutorial, you’ll learn how to use split(str, pattern[, limit]) to break string columns into arrays of substrings. The pattern is a Java regular expression, and the optional limit caps the number of resulting elements. We'll cover email parsing, splitting full names, and handling pipe-delimited data.

📦 Sample Data

from pyspark.sql import SparkSession

# Start a SparkSession (needed before creating DataFrames)
spark = SparkSession.builder.appName("SplitExamples").getOrCreate()

data = [
  ("john.doe@example.com", "John Doe", "john|doe|35|NY"),
  ("alice.smith@mail.org", "Alice Smith", "alice|smith|29|CA"),
  ("bob.jones@test.net", "Bob Jones", "bob|jones|42|TX")
]

columns = ["email", "full_name", "user_data"]
df = spark.createDataFrame(data, columns)
df.show()

Output:

+----------------------+-------------+-------------------+
| email                | full_name   | user_data         |
+----------------------+-------------+-------------------+
| john.doe@example.com | John Doe    | john|doe|35|NY    |
| alice.smith@mail.org | Alice Smith | alice|smith|29|CA |
| bob.jones@test.net   | Bob Jones   | bob|jones|42|TX   |
+----------------------+-------------+-------------------+

📬 Split Email into Username and Domain

from pyspark.sql.functions import split

df = df.withColumn("email_parts", split("email", "@"))
df.select("email", "email_parts").show(truncate=False)

Output: [john.doe, example.com], [alice.smith, mail.org], [bob.jones, test.net]

👤 Split Full Name into First and Last Name

df = df.withColumn("name_split", split("full_name", " "))
df.select("full_name", "name_split").show(truncate=False)

Output: [John, Doe], [Alice, Smith], [Bob, Jones]

📎 Split Pipe-Delimited User Data

# split's pattern is a regular expression, so the pipe must be escaped
df = df.withColumn("user_fields", split("user_data", "\\|"))
df.select("user_data", "user_fields").show(truncate=False)

Output: [john, doe, 35, NY], [alice, smith, 29, CA], [bob, jones, 42, TX]

