Write Single CSV, Parquet, JSON Files Using Apache Spark Pool in Azure Synapse
📘 Overview
When using Apache Spark Pools in Azure Synapse Analytics, writing data to a single output file (CSV, Parquet, or JSON) is a common requirement for downstream systems, data sharing, and export scenarios. By default, Spark writes output as a set of part files (one per partition), but you can force it to produce just one.
🛠️ Step-by-Step: Writing Single Output File
✅ Step 1: Sample Data
```python
%%pyspark
data = [("Alice", "USA", 1000), ("Bob", "Canada", 1500)]
columns = ["Name", "Country", "Sales"]
df = spark.createDataFrame(data, columns)
df.show()
```
✅ Step 2: Coalesce to Single Partition
```python
df_single = df.coalesce(1)  # Combine all data into one partition
```
✅ Step 3: Write to CSV
```python
df_single.write.mode("overwrite") \
    .option("header", "true") \
    .csv("abfss://output@storageaccount.dfs.core.windows.net/singlefile/csv/")
```
✅ Step 4: Write to Parquet
```python
df_single.write.mode("overwrite") \
    .parquet("abfss://output@storageaccount.dfs.core.windows.net/singlefile/parquet/")
```
✅ Step 5: Write to JSON
```python
df_single.write.mode("overwrite") \
    .json("abfss://output@storageaccount.dfs.core.windows.net/singlefile/json/")
```
💡 Notes
- `coalesce(1)` reduces the DataFrame to one partition, which is required for a single output file
- The result is still saved inside the target folder as a `part-00000-*` file; you may rename it manually in Data Lake or with Python
- Use `.option("header", "true")` for CSV if you want column headers
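As the notes mention, even with `coalesce(1)` Spark writes the data as a `part-00000-*` file inside the target folder. A minimal sketch of renaming it with plain Python, assuming the output lands on a path the driver can reach (the folder and file names below are simulated; for ADLS Gen2 paths you would use a file utility such as `mssparkutils.fs.mv` in Synapse instead):

```python
import glob
import os
import tempfile

# Simulate a Spark output folder containing one part file
out_dir = tempfile.mkdtemp()
with open(os.path.join(out_dir, "part-00000-abc123.csv"), "w") as f:
    f.write("Name,Country,Sales\nAlice,USA,1000\n")

# Locate the single part file written by coalesce(1)
part_files = glob.glob(os.path.join(out_dir, "part-*.csv"))
assert len(part_files) == 1, "expected exactly one part file"

# Rename it to a friendly, stable name
final_path = os.path.join(out_dir, "sales.csv")
os.rename(part_files[0], final_path)
print(os.path.basename(final_path))  # sales.csv
```

Spark also drops marker files such as `_SUCCESS` into the folder; you may want to delete those as part of the same cleanup step.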
📦 Output Location (ADLS Gen2)
Ensure your Spark pool has access to the storage container. Use the `abfss://` URI format with the correct filesystem (container) and storage account.
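For reference, the `abfss://` URI follows the shape `abfss://<container>@<account>.dfs.core.windows.net/<path>`. A quick sketch of assembling it (the container, account, and path values below are the placeholder names used in the steps above, not real resources):

```python
# abfss URI anatomy: abfss://<container>@<account>.dfs.core.windows.net/<path>
container = "output"           # ADLS Gen2 filesystem (container) name
account = "storageaccount"     # storage account name
path = "singlefile/csv/"       # folder path inside the container

uri = f"abfss://{container}@{account}.dfs.core.windows.net/{path}"
print(uri)  # abfss://output@storageaccount.dfs.core.windows.net/singlefile/csv/
```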
📌 Best Practices
- Use `repartition(1)` instead of `coalesce(1)` if you're dealing with uneven or skewed partitions
- Always validate output size; single files are not optimal for huge datasets
- Use `.mode("overwrite")` cautiously to avoid accidental data loss
📈 Use Cases
- Exporting datasets for business users or external partners
- Feeding ML pipelines or BI dashboards
- Generating one file per run for versioned archival
📚 Credit: Content created with the help of ChatGPT and Gemini.