JSON Functions in PySpark – Complete Hands-On Tutorial
In this guide, you'll learn how to work with JSON strings and columns using built-in PySpark SQL functions like get_json_object, from_json, to_json, schema_of_json, explode, and more.
Sample JSON Data
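from pyspark.sql import SparkSession
# Reuse the active session (for example in a notebook) or start a new one
spark = SparkSession.builder.appName("json-functions").getOrCreate()
data_json = [
    ('{"name": "Aamir", "age": 30, "city": "New York"}',),
    ('{"name": "Sara", "age": 25, "city": "San Francisco"}',)
]
df_json = spark.createDataFrame(data_json, ["json_data"])
df_json.show(truncate=False)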
Output:
+----------------------------------------------------+
|json_data                                           |
+----------------------------------------------------+
|{"name": "Aamir", "age": 30, "city": "New York"}    |
|{"name": "Sara", "age": 25, "city": "San Francisco"}|
+----------------------------------------------------+
Extract Data with get_json_object()
from pyspark.sql.functions import get_json_object
df_extracted = df_json.select(get_json_object("json_data", "$.name").alias("name"))
df_extracted.show()
Output:
+-----+
| name|
+-----+
|Aamir|
| Sara|
+-----+
Convert JSON to Struct with from_json()
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True)
])
df_struct = df_json.select(from_json("json_data", schema).alias("parsed_json"))
df_struct.show(truncate=False)
Convert Struct to JSON with to_json()
from pyspark.sql.functions import to_json
df_back_to_json = df_struct.select(to_json("parsed_json").alias("json_string"))
df_back_to_json.show(truncate=False)
Extract Schema using schema_of_json()
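from pyspark.sql.functions import schema_of_json, lit
# schema_of_json() expects a literal JSON string, not a column name, so sample one row first
sample_json = df_json.first()["json_data"]
df_json.select(schema_of_json(lit(sample_json)).alias("schema")).show(truncate=False)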