PySpark Complex Data Types Explained
ArrayType, MapType, StructType & StructField for Beginners
Introduction
This tutorial walks you through PySpark's complex data types: ArrayType, MapType, and StructType, along with StructField, the class used to describe each field inside a StructType. These types are essential when working with semi-structured data such as JSON and with deeply nested schemas.
1. Importing Required Libraries
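from pyspark.sql import SparkSession
# Import only the type classes used in this tutorial instead of a wildcard.
from pyspark.sql.types import ArrayType, IntegerType, MapType, StringType, StructField, StructType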
2. Creating a Spark Session
spark = SparkSession.builder \
    .appName("ComplexDataTypesDemo") \
    .getOrCreate()
3. Define Schema with Complex Data Types
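schema = StructType([
    StructField("id", IntegerType(), True),
    # Array of strings; the second argument (containsNull) allows null elements.
    StructField("skills", ArrayType(StringType(), True), True),
    # Map of string keys to string values; the third argument (valueContainsNull) allows null values.
    StructField("meta", MapType(StringType(), StringType(), True), True),
    # Nested struct with its own named fields.
    StructField("full_name", StructType([
        StructField("first", StringType(), True),
        StructField("last", StringType(), True)
    ]), True)
])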
4. Sample Data
data = [
    (1, ["Python", "Spark"], {"course": "PySpark", "level": "Beginner"}, ("Aamir", "Shahzad")),
    (2, ["SQL", "Databricks"], {"course": "Spark SQL", "level": "Intermediate"}, ("Ali", "Raza"))
]
5. Create DataFrame
df = spark.createDataFrame(data, schema=schema)
6. Display Data
df.show(truncate=False)
7. Print Schema
df.printSchema()
Output
+---+-----------------+--------------------------------------------+----------------+
|id |skills           |meta                                        |full_name       |
+---+-----------------+--------------------------------------------+----------------+
|1  |[Python, Spark]  |{course -> PySpark, level -> Beginner}      |{Aamir, Shahzad}|
|2  |[SQL, Databricks]|{course -> Spark SQL, level -> Intermediate}|{Ali, Raza}     |
+---+-----------------+--------------------------------------------+----------------+
root
 |-- id: integer (nullable = true)
 |-- skills: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- meta: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- full_name: struct (nullable = true)
 |    |-- first: string (nullable = true)
 |    |-- last: string (nullable = true)