PySpark Complex Data Types Explained | ArrayType, MapType, StructType & StructField for Beginners | PySpark Tutorial

Introduction

This tutorial walks you through PySpark's complex data types: ArrayType, MapType, and StructType, along with StructField, which describes each field inside a struct. These types are essential when you work with semi-structured data such as JSON or with deeply nested schemas.

1. Importing Required Libraries

from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, ArrayType,
                               MapType, StringType, IntegerType)

2. Creating a Spark Session

spark = SparkSession.builder \
    .appName("ComplexDataTypesDemo") \
    .getOrCreate()

3. Define Schema with Complex Data Types

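schema = StructType([
    # Plain scalar column; the third argument (True) marks the field as nullable
    StructField("id", IntegerType(), True),
    # Array of strings; ArrayType's second argument is containsNull
    StructField("skills", ArrayType(StringType(), True), True),
    # Map with string keys and string values; the third argument is valueContainsNull
    StructField("meta", MapType(StringType(), StringType(), True), True),
    # Nested struct with its own list of fields
    StructField("full_name", StructType([
        StructField("first", StringType(), True),
        StructField("last", StringType(), True)
    ]), True)
])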

4. Sample Data

data = [
    (1, ["Python", "Spark"], {"course": "PySpark", "level": "Beginner"}, ("Aamir", "Shahzad")),
    (2, ["SQL", "Databricks"], {"course": "Spark SQL", "level": "Intermediate"}, ("Ali", "Raza"))
]

5. Create DataFrame

df = spark.createDataFrame(data, schema=schema)

6. Display Data

df.show(truncate=False)

7. Print Schema

df.printSchema()

Output

+---+-----------------+--------------------------------------------+----------------+
|id |skills           |meta                                        |full_name       |
+---+-----------------+--------------------------------------------+----------------+
|1  |[Python, Spark]  |{course -> PySpark, level -> Beginner}      |{Aamir, Shahzad}|
|2  |[SQL, Databricks]|{course -> Spark SQL, level -> Intermediate}|{Ali, Raza}     |
+---+-----------------+--------------------------------------------+----------------+

root
 |-- id: integer (nullable = true)
 |-- skills: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- meta: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- full_name: struct (nullable = true)
 |    |-- first: string (nullable = true)
 |    |-- last: string (nullable = true)
