PySpark cast() vs astype() Explained
In this tutorial, we'll convert PySpark DataFrame columns from one type to another using cast() and astype(). The examples start from a DataFrame whose columns are all strings and convert them to integers, floats, and doubles.
1. Sample DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("CastExample").getOrCreate()
data = [
    ("1", "Aamir", "50000.5"),
    ("2", "Ali", "45000.0"),
    ("3", "Bob", None),
    ("4", "Lisa", "60000.75"),
]
columns = ["id", "name", "salary"]
df = spark.createDataFrame(data, columns)
df.printSchema()
df.show()
2. Using cast() Function
Convert id to integer and salary to float:
df_casted = df.withColumn("id", col("id").cast("int")) \
    .withColumn("salary", col("salary").cast("float"))
df_casted.printSchema()
df_casted.show()
3. Using astype() Function
astype() is an alias for cast() and is used in the same way. Here it converts the salary column, already cast to float, to double:
df_astype = df_casted.withColumn("salary", col("salary").astype("double"))
df_astype.printSchema()
df_astype.show()
Output:
Original DataFrame (all columns as strings):
+---+-----+--------+
| id| name|  salary|
+---+-----+--------+
|  1|Aamir| 50000.5|
|  2|  Ali| 45000.0|
|  3|  Bob|    null|
|  4| Lisa|60000.75|
+---+-----+--------+
After cast():
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- salary: float (nullable = true)
After astype():
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- salary: double (nullable = true)


