Write Single CSV, Parquet, JSON Files Using Apache Spark Pool in Azure Synapse

📘 Overview

When using Apache Spark Pools in Azure Synapse Analytics, writing data to a single output file (CSV, Parquet, or JSON) is a common requirement for downstream systems, data sharing, and export scenarios. By default, Spark writes each partition as its own part file, so a normal write produces multiple files, but you can force it to generate just one.
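
As a quick way to see why multiple files appear, you can check a DataFrame's partition count; each non-empty partition becomes its own part file on write. A minimal sketch using a throwaway DataFrame:

%%pyspark
# Each non-empty partition is written as a separate part-XXXXX file,
# so this DataFrame would produce up to 8 output files by default.
df_demo = spark.range(0, 1000, numPartitions=8)
print(df_demo.rdd.getNumPartitions())  # 8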

🛠️ Step-by-Step: Writing Single Output File

✅ Step 1: Sample Data

%%pyspark
data = [("Alice", "USA", 1000), ("Bob", "Canada", 1500)]
columns = ["Name", "Country", "Sales"]

df = spark.createDataFrame(data, columns)
df.show()

✅ Step 2: Coalesce to Single Partition

df_single = df.coalesce(1)  # Combine all data into one partition

✅ Step 3: Write to CSV

df_single.write.mode("overwrite") \
    .option("header", "true") \
    .csv("abfss://output@storageaccount.dfs.core.windows.net/singlefile/csv/")

✅ Step 4: Write to Parquet

df_single.write.mode("overwrite").parquet("abfss://output@storageaccount.dfs.core.windows.net/singlefile/parquet/")

✅ Step 5: Write to JSON

df_single.write.mode("overwrite").json("abfss://output@storageaccount.dfs.core.windows.net/singlefile/json/")

💡 Notes

  • coalesce(1) collapses the DataFrame into one partition, which is what makes a single output file possible
  • Spark still writes into a folder; the data lands in a file named like part-00000-<id>.csv, which you can rename manually in the Data Lake or with code (see the sketch after this list)
  • Add .option("header", "true") for CSV if you want column headers

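If a downstream consumer needs an exact file name, one option is to locate the part file and rename it with mssparkutils, the file-system utility available in Synapse notebooks. A minimal sketch, assuming the CSV write from Step 3 has already run and that fs.mv accepts a full destination path for a rename (sales.csv is a hypothetical target name):

%%pyspark
from notebookutils import mssparkutils

src_dir = "abfss://output@storageaccount.dfs.core.windows.net/singlefile/csv/"
target = "abfss://output@storageaccount.dfs.core.windows.net/singlefile/sales.csv"

# Find the single part file Spark produced and move/rename it.
part_files = [f for f in mssparkutils.fs.ls(src_dir) if f.name.startswith("part-")]
mssparkutils.fs.mv(part_files[0].path, target)
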
📦 Output Location (ADLS Gen2)

Ensure your Spark pool has access to the storage container. Use the abfss:// URI format with the correct container (filesystem) and storage account, as illustrated below.
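
For reference, the URI breaks down as abfss://<container>@<storage-account>.dfs.core.windows.net/<path>. The account-key setting below is the standard ABFS configuration and is shown only as one possible fallback if the pool's managed identity has not been granted access (the key value is a placeholder):

%%pyspark
# abfss://<container>@<storage-account>.dfs.core.windows.net/<path>
output_path = "abfss://output@storageaccount.dfs.core.windows.net/singlefile/csv/"

# Fallback auth via account key (standard ABFS setting); prefer the
# pool's managed identity or a linked service where possible.
spark.conf.set(
    "fs.azure.account.key.storageaccount.dfs.core.windows.net",
    "<storage-account-key>"
)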

📌 Best Practices

  • Use repartition(1) instead of coalesce(1) when the input partitions are uneven or skewed (see the sketch after this list)
  • Always validate output size; a single file forfeits parallel writes and is not suitable for very large datasets
  • Use .mode("overwrite") cautiously to avoid accidental loss of data

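The difference, briefly: coalesce(1) avoids a shuffle but can collapse upstream work onto a single task, while repartition(1) performs a full shuffle so upstream stages keep their parallelism and only the final write runs single-threaded. A minimal sketch using the DataFrame from Step 1:

%%pyspark
# coalesce(1): no shuffle, but upstream work may run on one task
# because coalesce is folded into the same stage.
single_a = df.coalesce(1)

# repartition(1): full shuffle; upstream stages keep their parallelism,
# and only the final write stage runs as a single task.
single_b = df.repartition(1)

single_b.write.mode("overwrite").parquet(
    "abfss://output@storageaccount.dfs.core.windows.net/singlefile/parquet/"
)
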
📈 Use Cases

  • Exporting datasets for business users or external partners
  • Feeding ML pipelines or BI dashboards
  • Generating one file per run for versioned archival

📚 Credit: Content created with the help of ChatGPT and Gemini.
