How to Clean Strings in PySpark | lower(), trim(), initcap() Explained with Real Data

📌 What You’ll Learn

  • How to use lower() to convert text to lowercase
  • How to use trim() to remove leading/trailing spaces
  • How to use initcap() to capitalize the first letter of each word
  • Chaining multiple string functions (see step 4️⃣ below)

📊 Sample Data

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession.builder.getOrCreate()

# Raw names with inconsistent casing and stray spaces
data = [
    (" Aamir ",),
    ("LISA ",),
    ("  charLie   ",),
    ("BOB",),
    (" eli",)
]
columns = ["raw_name"]
df = spark.createDataFrame(data, columns)
df.show(truncate=False)
Output:
+------------+
|raw_name    |
+------------+
| Aamir      |
|LISA        |
|  charLie   |
|BOB         |
| eli        |
+------------+

🔧 Cleaning the Data with PySpark Functions

1️⃣ Apply trim()

from pyspark.sql.functions import trim

# Remove leading and trailing spaces from raw_name
df_trimmed = df.withColumn("trimmed", trim("raw_name"))
df_trimmed.show(truncate=False)
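
trim() removes spaces only from the ends of a string; spaces inside the value are kept. If you only need to strip one side, PySpark also provides ltrim() and rtrim(). A minimal sketch on the same DataFrame (the left_trimmed/right_trimmed column names are only illustrative):

from pyspark.sql.functions import ltrim, rtrim

# ltrim() removes only leading spaces, rtrim() only trailing spaces
df_sides = (
    df.withColumn("left_trimmed", ltrim("raw_name"))
      .withColumn("right_trimmed", rtrim("raw_name"))
)
df_sides.show(truncate=False)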

2️⃣ Apply lower() and upper()

from pyspark.sql.functions import lower, upper

# Convert the trimmed names to all lowercase and all uppercase
df_lower = df_trimmed.withColumn("lowercase", lower("trimmed"))
df_upper = df_trimmed.withColumn("uppercase", upper("trimmed"))
df_lower.show(truncate=False)
df_upper.show(truncate=False)
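
A common reason to normalize case with lower() (or upper()) is case-insensitive matching. A minimal sketch that finds "BOB" regardless of how it was typed (the literal "bob" is just an example value):

from pyspark.sql.functions import lower

# Compare the lowercased column against a lowercase literal
df_trimmed.filter(lower("trimmed") == "bob").show(truncate=False)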

3️⃣ Apply initcap()

from pyspark.sql.functions import initcap

# Capitalize the first letter of each word (title case)
df_initcap = df_trimmed.withColumn("titlecase", initcap("trimmed"))
df_initcap.show(truncate=False)
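
4️⃣ Chain multiple functions

As promised in the list above, these steps can be combined in a single expression. A minimal sketch that trims and title-cases raw_name in one pass (clean_name is just an illustrative column name):

from pyspark.sql.functions import trim, initcap

# trim() runs first, then initcap() capitalizes each word of the trimmed value
df_clean = df.withColumn("clean_name", initcap(trim("raw_name")))
df_clean.show(truncate=False)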


