Core Data Types in PySpark Explained
Understand and work with IntegerType, FloatType, DoubleType, DecimalType, and StringType in PySpark. Learn through practical examples and code snippets.
1. Introduction
In this tutorial, we explore the fundamental data types in PySpark and how they are used in DataFrames. We cover commonly used types such as:
IntegerType
FloatType
DoubleType
DecimalType
StringType
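Beyond these, PySpark also offers narrower and wider integral types (ByteType, ShortType, LongType), which differ only in storage size. As a quick orientation, here is a plain-Python sketch of the value ranges these signed integral types can hold (the type names refer to pyspark.sql.types; the ranges follow from their 1-, 2-, 4-, and 8-byte signed representations):

```python
# Value ranges of PySpark's signed integral types (1, 2, 4, and 8 bytes).
# Computed in plain Python; the names refer to pyspark.sql.types classes.
ranges = {
    "ByteType":    (-(2 ** 7),  2 ** 7 - 1),   # -128 .. 127
    "ShortType":   (-(2 ** 15), 2 ** 15 - 1),  # -32768 .. 32767
    "IntegerType": (-(2 ** 31), 2 ** 31 - 1),  # roughly +/- 2.1 billion
    "LongType":    (-(2 ** 63), 2 ** 63 - 1),  # roughly +/- 9.2 quintillion
}

for name, (lo, hi) in ranges.items():
    print(f"{name:12s} {lo} .. {hi}")
```

Picking the smallest type that fits your data keeps schemas compact, but IntegerType and LongType are the safe defaults for most workloads.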
2. Sample PySpark Code
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, IntegerType, FloatType, DoubleType,
    LongType, ShortType, ByteType, BooleanType, StringType,
)

spark = SparkSession.builder.appName("CoreDataTypesDemo").getOrCreate()

schema = StructType([
    StructField("int_col", IntegerType(), True),
    StructField("float_col", FloatType(), True),
    StructField("double_col", DoubleType(), True),
    StructField("long_col", LongType(), True),
    StructField("short_col", ShortType(), True),
    StructField("byte_col", ByteType(), True),
    StructField("bool_col", BooleanType(), True),
    StructField("string_col", StringType(), True),
    StructField("decimal_col", StringType(), True),  # stored as a string first; cast to DecimalType in the next step
])

data = [
    (1, 3.14, 2.7182818284, 9223372036854775807, 32767, 127, True, "Aamir", "1234.5678"),
]

df = spark.createDataFrame(data, schema=schema)
df.show()
df.printSchema()
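Note the difference between float_col and double_col: FloatType is a 4-byte single-precision value, while DoubleType is an 8-byte double. A minimal pure-Python sketch using the struct module shows how round-tripping a value through 32-bit storage loses digits (the assumption here is that this mirrors what happens when the sample value 2.7182818284 above is stored in float_col rather than double_col):

```python
import struct

value = 2.7182818284  # the double_col / float_col sample value above

# Round-trip through a 4-byte IEEE 754 single, the storage FloatType uses.
as_float32 = struct.unpack("f", struct.pack("f", value))[0]

# Python floats are already 8-byte IEEE 754 doubles, matching DoubleType.
as_float64 = value

print(f"DoubleType keeps: {as_float64}")
print(f"FloatType keeps:  {as_float32}")  # only ~7 significant digits survive
```

This is why DoubleType is the usual default for fractional columns; FloatType mainly saves space when reduced precision is acceptable.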
3. Convert Decimal Column to DecimalType
from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType
df = df.withColumn("decimal_col", col("decimal_col").cast(DecimalType(10, 4)))
df.printSchema()
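DecimalType(10, 4) means at most 10 significant digits in total, 4 of them after the decimal point, so the integer part may use at most 6 digits; in Spark's default (non-ANSI) mode, a cast of a value that does not fit yields null rather than an error. As a rough sketch of that precision/scale rule, here is the same check written with Python's decimal module (the helper name fits_decimal is illustrative, and the null-on-overflow behavior is worth verifying on your own cluster):

```python
from decimal import Decimal, ROUND_HALF_UP

PRECISION, SCALE = 10, 4  # mirrors DecimalType(10, 4)

def fits_decimal(text: str) -> bool:
    """Check whether a numeric string fits precision 10, scale 4."""
    # Round to 4 fractional digits, then count total significant digits.
    d = Decimal(text).quantize(Decimal(1).scaleb(-SCALE), rounding=ROUND_HALF_UP)
    return len(d.as_tuple().digits) <= PRECISION

print(fits_decimal("1234.5678"))     # True: 8 digits total, 4 after the point
print(fits_decimal("1234567.8901"))  # False: 11 digits exceed precision 10
```

Choosing the precision and scale up front avoids silent nulls from oversized values later in the pipeline.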
4. Output
+-------+---------+------------+-------------------+---------+--------+--------+----------+-----------+
|int_col|float_col|  double_col|           long_col|short_col|byte_col|bool_col|string_col|decimal_col|
+-------+---------+------------+-------------------+---------+--------+--------+----------+-----------+
|      1|     3.14|2.7182818284|9223372036854775807|    32767|     127|    true|     Aamir|  1234.5678|
+-------+---------+------------+-------------------+---------+--------+--------+----------+-----------+
root
|-- int_col: integer (nullable = true)
|-- float_col: float (nullable = true)
|-- double_col: double (nullable = true)
|-- long_col: long (nullable = true)
|-- short_col: short (nullable = true)
|-- byte_col: byte (nullable = true)
|-- bool_col: boolean (nullable = true)
|-- string_col: string (nullable = true)
|-- decimal_col: decimal(10,4) (nullable = true)
5. Conclusion
Understanding PySpark's core data types helps you define precise schemas, avoid unintended implicit casts, and keep numeric precision under control. This tutorial covered the key types used in most data processing tasks.