How to Write DataFrame to Parquet File in Azure Blob Storage Using PySpark

In this tutorial, you'll learn how to use PySpark to save a DataFrame as a Parquet file in Azure Blob Storage. We'll walk through setting up Spark, configuring Azure access, and writing the data efficiently.

1️⃣ Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WriteParquetToBlob").getOrCreate()
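If you're running outside an environment that already bundles the Azure Blob Storage connector (for example, Databricks), the wasbs:// filesystem needs the hadoop-azure package on the classpath. Below is a minimal sketch of adding it while building the session; the version shown is an assumption and should match your cluster's Hadoop version.

from pyspark.sql import SparkSession

# Assumption: pull in the hadoop-azure connector at session startup.
# Pick the version that matches your Hadoop distribution.
spark = (
    SparkSession.builder
    .appName("WriteParquetToBlob")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-azure:3.3.4")
    .getOrCreate()
)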

2️⃣ Create Sample DataFrame

data = [
  ("Aamir Shahzad", "Lahore", "Pakistan"),
  ("Ali Raza", "Karachi", "Pakistan"),
  ("Bob", "New York", "USA"),
  ("Lisa", "Toronto", "Canada")
]
columns = ["full_name", "city", "country"]

df = spark.createDataFrame(data, columns)
df.show()

3️⃣ Configure Azure Blob Storage

# Replace with your actual values
storage_account = "yourstorageaccount"
container = "yourcontainer"
sas_token = "your_sas_token"

spark.conf.set(
  f"fs.azure.sas.{container}.{storage_account}.blob.core.windows.net",
  sas_token
)
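If you prefer to authenticate with the storage account access key instead of a SAS token, the configuration key changes slightly. A minimal sketch, assuming you have the account key available (the variable below is a placeholder):

# Alternative: authenticate with the storage account access key instead of a SAS token
account_key = "your_account_key"  # placeholder value

spark.conf.set(
  f"fs.azure.account.key.{storage_account}.blob.core.windows.net",
  account_key
)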

4️⃣ Define Output Path in Azure

output_path = f"wasbs://{container}@{storage_account}.blob.core.windows.net/people_parquet"

5️⃣ Write DataFrame to Parquet

# repartition(1) collapses the data into one partition so a single Parquet part file is written
df.repartition(1).write \
  .format("parquet") \
  .mode("overwrite") \
  .save(output_path)

print(f"✅ DataFrame written as Parquet to Azure Blob at: {output_path}")
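For larger datasets you usually would not force everything into a single file. As a sketch of an alternative, you can let Spark write multiple part files and partition the output by a column that exists in the DataFrame (here, country):

# Optional: write multiple part files, partitioned into one folder per country value
df.write \
  .format("parquet") \
  .mode("overwrite") \
  .partitionBy("country") \
  .save(output_path)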

6️⃣ Read the Parquet File Back

df_read = spark.read \
  .format("parquet") \
  .load(output_path)

print("✅ Data read from Azure Blob (Parquet):")
df_read.show()
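As a small optional check, you can confirm that the round trip preserved the schema, since Parquet stores column names and types with the data:

# Optional sanity check: column names and types survive the Parquet round trip
df_read.printSchema()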
