How to Write DataFrame to Parquet File in Azure Blob Storage Using PySpark
In this tutorial, you'll learn how to use PySpark to save a DataFrame as a Parquet file in Azure Blob Storage. We'll walk through creating a Spark session, configuring access to Azure Blob Storage with a SAS token, writing the DataFrame to Parquet, and reading it back to verify the result.
1️⃣ Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("WriteParquetToBlob").getOrCreate()
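Note that reading and writing wasbs:// paths requires the hadoop-azure connector on the Spark classpath. Managed platforms such as Databricks or Synapse already ship it; on a plain local or self-managed cluster you can pull it in when building the session. The sketch below is one way to do that, assuming the connector versions shown (3.3.4 / 8.6.6) match your Hadoop build - adjust them to your environment.
# Optional sketch: build the session with the hadoop-azure connector added explicitly.
# The package versions below are assumptions; match them to your Spark/Hadoop distribution.
spark = (
    SparkSession.builder
    .appName("WriteParquetToBlob")
    .config(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-azure:3.3.4,com.microsoft.azure:azure-storage:8.6.6"
    )
    .getOrCreate()
)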
2️⃣ Create Sample DataFrame
data = [
    ("Aamir Shahzad", "Lahore", "Pakistan"),
    ("Ali Raza", "Karachi", "Pakistan"),
    ("Bob", "New York", "USA"),
    ("Lisa", "Toronto", "Canada")
]
columns = ["full_name", "city", "country"]
df = spark.createDataFrame(data, columns)
df.show()
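For reference, df.show() should print a fixed-width table along these lines:
+-------------+--------+--------+
|    full_name|    city| country|
+-------------+--------+--------+
|Aamir Shahzad|  Lahore|Pakistan|
|     Ali Raza| Karachi|Pakistan|
|          Bob|New York|     USA|
|         Lisa| Toronto|  Canada|
+-------------+--------+--------+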
3️⃣ Configure Azure Blob Storage
# Replace with your actual values
storage_account = "yourstorageaccount"
container = "yourcontainer"
sas_token = "your_sas_token"
spark.conf.set(
    f"fs.azure.sas.{container}.{storage_account}.blob.core.windows.net",
    sas_token
)
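A SAS token is only one way to authenticate. If you hold the storage account access key, the WASB driver also accepts it via a different configuration key. A minimal sketch, assuming "your_account_key" is the key copied from the Azure portal:
# Alternative sketch: authenticate with the storage account key instead of a SAS token.
# "your_account_key" is a placeholder for the key from the Azure portal.
account_key = "your_account_key"
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.blob.core.windows.net",
    account_key
)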
4️⃣ Define Output Path in Azure
output_path = f"wasbs://{container}@{storage_account}.blob.core.windows.net/people_parquet"
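If your storage account is ADLS Gen2 with the hierarchical namespace enabled, the ABFS driver and dfs endpoint are generally preferred over wasbs://. The commented line below is a sketch of the equivalent path; note that the authentication configuration keys differ for ABFS.
# Sketch: equivalent path for an ADLS Gen2 account using the ABFS driver.
# output_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/people_parquet"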
5️⃣ Write DataFrame to Parquet
# repartition(1) combines the output into a single Parquet part file
df.repartition(1).write \
    .format("parquet") \
    .mode("overwrite") \
    .save(output_path)
print(f"✅ DataFrame written as Parquet to Azure Blob at: {output_path}")
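repartition(1) is convenient for a small demo because it produces a single part file, but it removes write parallelism. For larger datasets you would typically drop it and, if it suits your query patterns, partition the output by a column instead. The sketch below shows one such variant; the explicit compression option is shown only for illustration, since snappy is already Spark's default Parquet codec.
# Sketch of an alternative write: keep parallelism and partition the output by country.
df.write \
    .format("parquet") \
    .mode("overwrite") \
    .option("compression", "snappy") \
    .partitionBy("country") \
    .save(output_path)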
6️⃣ Read the Parquet File Back
df_read = spark.read \
    .format("parquet") \
    .load(output_path)
print("✅ Data read from Azure Blob (Parquet):")
df_read.show()
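As a final sanity check, you can confirm that the round trip preserved the schema and row count. A minimal sketch:
# Verify the round trip: schema and row count should match the original DataFrame.
df_read.printSchema()
assert df_read.count() == df.count(), "Row count mismatch after Parquet round trip"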