PySpark Tutorial: Flatten Arrays & Structs with explode(), inline(), and struct() #pyspark

PySpark Tutorial: Flatten Arrays and Structs with explode(), inline(), and struct()

🔥 PySpark Tutorial: Flatten Arrays and Structs

Learn how to use explode(), inline(), and struct() in PySpark to work with nested array and struct data efficiently.

📦 Sample Data

data_array = [
  ("Aamir", ["apple", "banana", "cherry"]),
  ("Sara", ["orange", "grape"]),
  ("John", ["melon", "kiwi", "pineapple"]),
  ("Lina", ["pear", "peach"])
]

df_array = spark.createDataFrame(data_array, ["name", "fruits"])
df_array.show()

Output:

+-----+------------------------+
| name|                  fruits|
+-----+------------------------+
|Aamir| [apple, banana, cherry]|
| Sara|         [orange, grape]|
| John| [melon, kiwi, pineapple]|
| Lina|           [pear, peach]|
+-----+------------------------+

💥 explode() – Flatten Arrays

df_exploded = df_array.select("name", explode("fruits").alias("fruit"))
df_exploded.show()

Output:

+-----+--------+
| name|   fruit|
+-----+--------+
|Aamir|  apple |
|Aamir| banana |
|Aamir| cherry |
| Sara| orange |
| Sara|  grape |
| John|  melon |
| John|   kiwi |
| John|pineapple|
| Lina|   pear |
| Lina|  peach |
+-----+--------+

🧱 struct() – Create Struct from Columns

df_struct = df_array.select("name", struct("fruits").alias("fruit_struct"))
df_struct.show(truncate=False)

🧩 inline() – Flatten Array of Structs

from pyspark.sql.functions import inline
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

data_struct = [
  ("Aamir", [{"fruit": "apple", "color": "red"}, {"fruit": "banana", "color": "yellow"}]),
  ("Sara", [{"fruit": "orange", "color": "orange"}]),
  ("John", [{"fruit": "melon", "color": "green"}, {"fruit": "kiwi", "color": "brown"}]),
  ("Lina", [{"fruit": "pear", "color": "green"}, {"fruit": "peach", "color": "pink"}])
]

schema = StructType([
  StructField("name", StringType(), True),
  StructField("fruits", ArrayType(StructType([
    StructField("fruit", StringType(), True),
    StructField("color", StringType(), True)
  ])), True)
])

df_struct = spark.createDataFrame(data_struct, schema)
df_inline = df_struct.select("name", inline("fruits"))
df_inline.show()

🎥 Watch Full Video Tutorial

📄 Some of the contents in this website were created with assistance from ChatGPT and Gemini.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.