Core Data Types in PySpark Explained
Understand and work with IntegerType, FloatType, DoubleType, DecimalType, and StringType in PySpark. Learn through practical examples and code snippets.
1. Introduction
In this tutorial, we explore the fundamental data types in PySpark and how they are used in DataFrames. We cover commonly used types such as:
IntegerType
FloatType
DoubleType
DecimalType
StringType
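Beyond these, PySpark also offers narrower and wider integral types (ByteType, ShortType, LongType), which differ only in storage size. As a quick orientation, here is a plain-Python sketch of the value ranges these signed integral types can hold (the type names refer to pyspark.sql.types; the ranges follow from their 1-, 2-, 4-, and 8-byte signed representations):

```python
# Value ranges of PySpark's signed integral types (1, 2, 4, and 8 bytes).
# Computed in plain Python; the names refer to pyspark.sql.types classes.
ranges = {
    "ByteType":    (-(2 ** 7),  2 ** 7 - 1),   # -128 .. 127
    "ShortType":   (-(2 ** 15), 2 ** 15 - 1),  # -32768 .. 32767
    "IntegerType": (-(2 ** 31), 2 ** 31 - 1),  # roughly +/- 2.1 billion
    "LongType":    (-(2 ** 63), 2 ** 63 - 1),  # roughly +/- 9.2 quintillion
}

for name, (lo, hi) in ranges.items():
    print(f"{name:12s} {lo} .. {hi}")
```

Picking the smallest type that fits your data keeps schemas compact, but IntegerType and LongType are the safe defaults for most workloads.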
2. Sample PySpark Code
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, IntegerType, FloatType, DoubleType,
    LongType, ShortType, ByteType, BooleanType, StringType,
)

spark = SparkSession.builder.appName("CoreDataTypesDemo").getOrCreate()

schema = StructType([
    StructField("int_col", IntegerType(), True),
    StructField("float_col", FloatType(), True),
    StructField("double_col", DoubleType(), True),
    StructField("long_col", LongType(), True),
    StructField("short_col", ShortType(), True),
    StructField("byte_col", ByteType(), True),
    StructField("bool_col", BooleanType(), True),
    StructField("string_col", StringType(), True),
    StructField("decimal_col", StringType(), True),  # stored as a string first; cast to DecimalType in the next step
])

data = [
    (1, 3.14, 2.7182818284, 9223372036854775807, 32767, 127, True, "Aamir", "1234.5678"),
]

df = spark.createDataFrame(data, schema=schema)
df.show()
df.printSchema()
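Note the difference between float_col and double_col: FloatType is a 4-byte single-precision value, while DoubleType is an 8-byte double. A minimal pure-Python sketch using the struct module shows how round-tripping a value through 32-bit storage loses digits (the assumption here is that this mirrors what happens when the sample value 2.7182818284 above is stored in float_col rather than double_col):

```python
import struct

value = 2.7182818284  # the double_col / float_col sample value above

# Round-trip through a 4-byte IEEE 754 single, the storage FloatType uses.
as_float32 = struct.unpack("f", struct.pack("f", value))[0]

# Python floats are already 8-byte IEEE 754 doubles, matching DoubleType.
as_float64 = value

print(f"DoubleType keeps: {as_float64}")
print(f"FloatType keeps:  {as_float32}")  # only ~7 significant digits survive
```

This is why DoubleType is the usual default for fractional columns; FloatType mainly saves space when reduced precision is acceptable.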
3. Convert Decimal Column to DecimalType
from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType
df = df.withColumn("decimal_col", col("decimal_col").cast(DecimalType(10, 4)))
df.printSchema()
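DecimalType(10, 4) means at most 10 significant digits in total, 4 of them after the decimal point, so the integer part may use at most 6 digits; in Spark's default (non-ANSI) mode, a cast of a value that does not fit yields null rather than an error. As a rough sketch of that precision/scale rule, here is the same check written with Python's decimal module (the helper name fits_decimal is illustrative, and the null-on-overflow behavior is worth verifying on your own cluster):

```python
from decimal import Decimal, ROUND_HALF_UP

PRECISION, SCALE = 10, 4  # mirrors DecimalType(10, 4)

def fits_decimal(text: str) -> bool:
    """Check whether a numeric string fits precision 10, scale 4."""
    # Round to 4 fractional digits, then count total significant digits.
    d = Decimal(text).quantize(Decimal(1).scaleb(-SCALE), rounding=ROUND_HALF_UP)
    return len(d.as_tuple().digits) <= PRECISION

print(fits_decimal("1234.5678"))     # True: 8 digits total, 4 after the point
print(fits_decimal("1234567.8901"))  # False: 11 digits exceed precision 10
```

Choosing the precision and scale up front avoids silent nulls from oversized values later in the pipeline.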
4. Output
+-------+---------+------------+-------------------+---------+--------+--------+----------+-----------+
|int_col|float_col|  double_col|           long_col|short_col|byte_col|bool_col|string_col|decimal_col|
+-------+---------+------------+-------------------+---------+--------+--------+----------+-----------+
|      1|     3.14|2.7182818284|9223372036854775807|    32767|     127|    true|     Aamir|  1234.5678|
+-------+---------+------------+-------------------+---------+--------+--------+----------+-----------+
root
|-- int_col: integer (nullable = true)
|-- float_col: float (nullable = true)
|-- double_col: double (nullable = true)
|-- long_col: long (nullable = true)
|-- short_col: short (nullable = true)
|-- byte_col: byte (nullable = true)
|-- bool_col: boolean (nullable = true)
|-- string_col: string (nullable = true)
|-- decimal_col: decimal(10,4) (nullable = true)
5. Conclusion
Understanding PySpark's core data types helps you define precise schemas, avoid unintended implicit casts, and keep numeric precision under control. This tutorial covered the key types used in most data processing tasks.