String Formatting in PySpark | concat_ws, format_number, printf, repeat, lpad, rpad

This tutorial demonstrates how to use PySpark string functions like concat_ws, format_number, format_string, printf, repeat, lpad, and rpad for formatting, combining, and manipulating string values in DataFrames.

📦 Sample Data Creation

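from pyspark.sql import SparkSession
# Explicit imports avoid shadowing Python built-ins; printf requires Spark 3.5+
from pyspark.sql.functions import (
    col, lit, concat_ws, format_number, format_string, printf, repeat, lpad, rpad
)

spark = SparkSession.builder.appName("StringFormattingDemo").getOrCreate()

# Each row: first name, salary, currency code (stored in the "country" column), last name
data = [
    ("John", 950000, "USD", "Smith"),
    ("Alice", 120500, "EUR", "Brown"),
    ("Bob", 87999, "INR", "Taylor")
]

columns = ["first_name", "salary", "country", "last_name"]
df = spark.createDataFrame(data, columns)
df.show()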

📊 Sample DataFrame Output

+----------+------+-------+---------+
|first_name|salary|country|last_name|
+----------+------+-------+---------+
|      John|950000|    USD|    Smith|
|     Alice|120500|    EUR|    Brown|
|       Bob| 87999|    INR|   Taylor|
+----------+------+-------+---------+

🔗 concat_ws()

df.withColumn("name_country", concat_ws("-", "first_name", "country")).show()

Output: Adds a name_country column that joins first_name and country with a hyphen, e.g. John-USD

💲 format_number()

df.withColumn("formatted_salary", format_number("salary", 2)).show()

Output: Adds a string column with comma thousands separators and 2 decimal places, e.g. 950,000.00

🧾 format_string()

df.withColumn("greeting", format_string("Hello %s %s", col("first_name"), col("country"))).show()

Output (first row): Hello John USD

🔢 printf()

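df.withColumn("price_tag", printf(lit("Amount = %.2f"), col("salary").cast("double"))).show()

Output (first row): Amount = 950000.00

Note: printf() is only available in Spark 3.5 and later. A plain Python string passed as its first argument is interpreted as a column name, so the format string is wrapped in lit(). The salary column is a long, so it is cast to double to match the %.2f specifier. On older versions, format_string() accepts the same format specifiers.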

🔁 repeat()

df.withColumn("excited", repeat(lit("test"), 2)).show()

Output: Adds an excited column containing testtest in every row

⬅️➡️ lpad() and rpad()

df.withColumn("last_lpad", lpad("last_name", 10, "*")) \
  .withColumn("first_rpad", rpad("first_name", 10, "-")).show()

Output: Pads each string to 10 characters, on the left with * (e.g. *****Smith) and on the right with - (e.g. John------)
