String Formatting in PySpark
This tutorial demonstrates how to use PySpark string functions such as concat_ws, format_number, format_string, printf, repeat, lpad, and rpad to format, combine, and manipulate string values in DataFrames.
Sample Data Creation
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws, format_number, format_string, lit, lpad, printf, repeat, rpad
spark = SparkSession.builder.appName("StringFormattingDemo").getOrCreate()
data = [
    ("John", 950000, "USD", "Smith"),
    ("Alice", 120500, "EUR", "Brown"),
    ("Bob", 87999, "INR", "Taylor")
]
columns = ["first_name", "salary", "country", "last_name"]
df = spark.createDataFrame(data, columns)
df.show()
Sample DataFrame Output
+----------+------+-------+---------+
|first_name|salary|country|last_name|
+----------+------+-------+---------+
|      John|950000|    USD|    Smith|
|     Alice|120500|    EUR|    Brown|
|       Bob| 87999|    INR|   Taylor|
+----------+------+-------+---------+
concat_ws()
df.withColumn("full_name", concat_ws("-", "first_name", "country")).show()
Output: Adds a full_name column that joins first_name and country with a hyphen, e.g. John-USD. For an actual full name, pass "last_name" instead of "country".
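One detail worth knowing is that concat_ws skips null inputs instead of turning the whole result into null. The plain-Python sketch below only models that behavior for a single row; concat_ws_model is a hypothetical helper for illustration, not a PySpark function, and None stands in for a SQL NULL.

```python
def concat_ws_model(sep, *values):
    """Rough single-row model of concat_ws: join the non-null values with sep."""
    # None plays the role of SQL NULL; nulls are dropped, not propagated.
    return sep.join(str(v) for v in values if v is not None)


print(concat_ws_model("-", "John", "USD"))  # John-USD
print(concat_ws_model("-", "John", None))   # John
```

Because nulls are dropped, a row with a missing country would end up with just the first name and no trailing hyphen.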
format_number()
df.withColumn("formatted_salary", format_number("salary", 2)).show()
Output: Adds a string column with the salary rounded to 2 decimal places and grouped with thousands separators, e.g. 950,000.00
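To preview what format_number should produce for the sample salaries without starting Spark, the same pattern can be approximated in plain Python. This is only a sketch: format_number_model is a hypothetical helper, and rounding of exact ties may differ from Spark for non-integer inputs.

```python
def format_number_model(value, decimals):
    """Approximate format_number: thousands separators plus fixed decimals."""
    # The result is a string, just like the column format_number returns.
    return f"{value:,.{decimals}f}"


for salary in (950000, 120500, 87999):
    print(format_number_model(salary, 2))
# 950,000.00
# 120,500.00
# 87,999.00
```

Since the result is a string, keep the original numeric salary column around for any further arithmetic or sorting.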
format_string()
df.withColumn("greeting", format_string("Hello %s %s", col("first_name"), col("country"))).show()
Output: Hello John USD (for the first row)
printf()
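# printf (Spark 3.5+) expects the format as a column, hence lit(); %.2f needs a double, hence the cast
df.withColumn("price_tag", printf(lit("Amount = %.2f"), col("salary").cast("double"))).show()
Output: Amount = 950000.00 (for the first row)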
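In Spark SQL, printf is an alias of format_string, and both follow Java-style format specifiers. For simple cases like %s and %.2f, Python's % operator gives a reasonable preview of the result. The sketch below is plain Python, not Spark code, and greeting and price_tag are hypothetical helpers. One difference to keep in mind: Python happily formats an integer with %.2f, while Java's formatter, as far as I know, rejects an integer argument for %f, which is why the explicit float conversion is shown here.

```python
rows = [("John", 950000, "USD"), ("Alice", 120500, "EUR"), ("Bob", 87999, "INR")]


def greeting(first_name, country):
    """Preview of the 'Hello %s %s' pattern for one row."""
    return "Hello %s %s" % (first_name, country)


def price_tag(salary):
    """Preview of the 'Amount = %.2f' pattern for one row."""
    # Convert explicitly, mirroring a cast of the integer column to double.
    return "Amount = %.2f" % float(salary)


for first_name, salary, country in rows:
    print(greeting(first_name, country), "|", price_tag(salary))
# Hello John USD | Amount = 950000.00
# Hello Alice EUR | Amount = 120500.00
# Hello Bob INR | Amount = 87999.00
```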
repeat()
df.withColumn("excited", repeat(lit("test"), 2)).show()
Output: testtest
⬅️➡️ lpad() and rpad()
df.withColumn("last_lpad", lpad("last_name", 10, "*")) \
.withColumn("first_rpad", rpad("first_name", 10, "-")).show()
Output: Pads each string to 10 characters, on the left with * (e.g. *****Smith) and on the right with - (e.g. John------)
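A point that is easy to miss: lpad and rpad also shorten values that are longer than the target length. The plain-Python sketch below models the padding and truncation for a single value; lpad_model and rpad_model are hypothetical helpers, not PySpark functions.

```python
def lpad_model(s, length, pad):
    """Rough model of lpad: pad on the left, or truncate if s is too long."""
    if len(s) >= length:
        return s[:length]
    # Repeat the pad string and cut it to exactly the missing width.
    fill = (pad * length)[: length - len(s)]
    return fill + s


def rpad_model(s, length, pad):
    """Rough model of rpad: pad on the right, or truncate if s is too long."""
    if len(s) >= length:
        return s[:length]
    fill = (pad * length)[: length - len(s)]
    return s + fill


print(lpad_model("Smith", 10, "*"))  # *****Smith
print(rpad_model("John", 10, "-"))   # John------
print(lpad_model("Taylor", 3, "*"))  # Tay
```

So a target length shorter than the data silently cuts values off, which matters when padding is used to build fixed-width output.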