PySpark JSON Functions Explained | How to Parse, Transform & Extract JSON Fields in PySpark | PySpark Tutorial

In this guide, you'll learn how to work with JSON strings and columns using the built-in PySpark SQL functions get_json_object, from_json, to_json, and schema_of_json.

Sample JSON Data


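from pyspark.sql import SparkSession

# Reuse the active session if one exists (notebooks usually predefine `spark`; the app name is arbitrary)
spark = SparkSession.builder.appName("json-functions-demo").getOrCreate()

# Each row holds one JSON document as a plain string
data_json = [
  ('{"name": "Aamir", "age": 30, "city": "New York"}',),
  ('{"name": "Sara", "age": 25, "city": "San Francisco"}',)
]

df_json = spark.createDataFrame(data_json, ["json_data"])
df_json.show(truncate=False)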
    

Output:


+----------------------------------------------------+
|json_data                                           |
+----------------------------------------------------+
|{"name": "Aamir", "age": 30, "city": "New York"}    |
|{"name": "Sara", "age": 25, "city": "San Francisco"}|
+----------------------------------------------------+
    

Extract Data with get_json_object()


from pyspark.sql.functions import get_json_object

df_extracted = df_json.select(get_json_object("json_data", "$.name").alias("name"))
df_extracted.show()
    

Output:


+-----+
| name|
+-----+
|Aamir|
| Sara|
+-----+
    

Convert JSON to Struct with from_json()


from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True)
])

df_struct = df_json.select(from_json("json_data", schema).alias("parsed_json"))
df_struct.show(truncate=False)
    

Convert Struct to JSON with to_json()


from pyspark.sql.functions import to_json

df_back_to_json = df_struct.select(to_json("parsed_json").alias("json_string"))
df_back_to_json.show(truncate=False)
    

Extract Schema using schema_of_json()


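from pyspark.sql.functions import lit, schema_of_json

# schema_of_json() infers from a literal JSON string, not a column reference
sample_json = df_json.first()["json_data"]
df_json.select(schema_of_json(lit(sample_json)).alias("schema")).show(truncate=False)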
    
