How to Use the write Function to Create a Single CSV File in Blob Storage from a DataFrame | PySpark Tutorial

How to Write a Single CSV File to Azure Blob Storage from PySpark DataFrame

By default, PySpark writes DataFrames to CSV as multiple files due to partitioning. In this tutorial, you'll learn how to use coalesce(1) to save your DataFrame as a single CSV file to Azure Blob Storage, along with setting headers and options.

1️⃣ Step 1: Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WriteSingleCSV").getOrCreate()

2️⃣ Step 2: Create Sample DataFrame

data = [
  ("Aamir Shahzad", "Lahore", "Pakistan"),
  ("Ali Raza", "Karachi", "Pakistan"),
  ("Bob", "New York", "USA"),
  ("Lisa", "Toronto", "Canada")
]
columns = ["full_name", "city", "country"]
df = spark.createDataFrame(data, schema=columns)

df.show()

3️⃣ Step 3: Configure Azure Blob Storage Access

spark.conf.set("fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net", "<your-access-key>")

4️⃣ Step 4: Write to Single CSV File

output_path = "wasbs://<your-container>@<your-storage-account>.blob.core.windows.net/people_data"

# coalesce(1) collapses the DataFrame into a single partition,
# so Spark writes exactly one part file inside output_path
df.coalesce(1).write \
  .option("header", "true") \
  .mode("overwrite") \
  .csv(output_path)

5️⃣ Step 5: Confirm CSV File in Blob

# You can list files using dbutils if on Databricks:
files = dbutils.fs.ls(output_path)
for f in files:
    print(f.name)
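
Note that output_path is a folder: Spark writes the data into a single part-*.csv file inside it (plus metadata files such as _SUCCESS). If you are on Databricks, a minimal sketch like the one below copies that part file to a friendlier, stable name; the target path is just an example.

# Find the single part file Spark produced inside the output folder (Databricks only)
part_file = [f.path for f in dbutils.fs.ls(output_path) if f.name.endswith(".csv")][0]

# Copy it to a stable, human-friendly file name (illustrative path)
final_path = "wasbs://<your-container>@<your-storage-account>.blob.core.windows.net/people_data.csv"
dbutils.fs.cp(part_file, final_path)
print("Copied", part_file, "to", final_path)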

📺 Watch the Full Tutorial

Azure Synapse Analytics Dedicated SQL Pool Overview | What It Is & How It Works | Azure Synapse Analytics Tutorial

Dedicated SQL Pool in Azure Synapse Analytics

🔹 What is a Dedicated SQL Pool?

A Dedicated SQL Pool in Azure Synapse Analytics (formerly Azure SQL Data Warehouse) is a scalable, high-performance data warehousing solution built on a Massively Parallel Processing (MPP) architecture. It enables analytics over large volumes of structured data using standard SQL.

🔹 Key Characteristics

  • Formerly: Azure SQL Data Warehouse
  • Deployment: Standalone or via Synapse Studio
  • Architecture: MPP (Massively Parallel Processing)
  • Elasticity: Compute and storage scale independently
  • Cost Efficiency: Pause/resume compute resources to save costs

🔹 Core Components

  • Control Node: Acts as the SQL endpoint and orchestrates query execution.
  • Compute Nodes: Run the distributed portions of the query.
  • Distributions: Data is sharded into 60 distributions that are processed in parallel.
  • Data Movement Service (DMS): Transfers data between nodes when needed.

⚙️ Query Execution Flow

  1. User submits query to the Control Node.
  2. The query is parsed and optimized, and an execution plan is created.
  3. Tasks are dispatched to Compute Nodes by the MPP engine.
  4. Each node processes its data slice independently.
  5. DMS transfers data as needed (for joins/aggregations).
  6. Final result is assembled by the Control Node and returned to the user.

🔹 Distribution Methods

  • HASH: Distributes rows by hashing a column. Best for large fact tables.
  • ROUND ROBIN: Distributes rows evenly across all distributions without a distribution column. Good for staging data.
  • REPLICATE: Copies the table to all nodes. Ideal for small dimension tables.

Use the DISTRIBUTION option in the WITH clause of CREATE TABLE to define the method during table creation.
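
For illustration, here is a minimal sketch of a hash-distributed table; the table and column names are made up for this example:

CREATE TABLE dbo.FactSales
(
    SaleId     BIGINT NOT NULL,
    CustomerId INT NOT NULL,
    Amount     DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),  -- or ROUND_ROBIN / REPLICATE
    CLUSTERED COLUMNSTORE INDEX       -- default index type in dedicated SQL pools
);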

⚙️ Data Warehouse Unit (DWU)

DWU (Data Warehouse Unit) represents the combined CPU, memory, and IO power provisioned for the Dedicated SQL Pool. It defines performance tiers.

Performance Level | Compute Nodes | Memory (GB)
DW100c            | 1             | 60
DW200c            | 1             | 120
DW500c            | 1             | 300
DW1000c           | 2             | 600
DW1500c           | 3             | 900
...               | ...           | ...
DW30000c          | 60            | 18000
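
The performance level can be changed later with T-SQL executed against the master database of the logical server; the pool name below is illustrative:

-- Scale the dedicated SQL pool to a different DWU level
ALTER DATABASE MyDedicatedPool
MODIFY (SERVICE_OBJECTIVE = 'DW1000c');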

🧠 SQL Features Support

Supported Features:

  • Tables, Views, Indexes, Schemas
  • Stored Procedures, Functions, Temp Tables
  • Dynamic SQL, DML, DDL
  • Dynamic Management Views (DMVs) for monitoring (see the sample query below)

Not Supported:

  • Triggers
  • Cross-database queries

Works Differently Than SQL Server: Identity columns, constraints, and table relationships do not behave exactly as they do in a traditional SQL Server database.
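
As an example of the DMV support listed above, the standard sys.dm_pdw_exec_requests view can be queried to monitor recent requests on the pool:

-- Inspect the most recent requests submitted to the dedicated SQL pool
SELECT TOP 10 request_id, status, submit_time, total_elapsed_time, command
FROM sys.dm_pdw_exec_requests
ORDER BY submit_time DESC;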

✅ Benefits

  • Massive scalability up to petabytes
  • Cost-effective with pause/resume capabilities
  • Ideal for enterprise-scale analytics and BI
  • Familiar, broad T-SQL surface for development (with the exceptions noted above)

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

PySpark writeTo() Explained: Save, Append, Overwrite DataFrames to Tables | PySpark Tutorial

PySpark writeTo() Explained – Save, Append, Overwrite DataFrames

In this tutorial, you'll learn how to use the writeTo() function in PySpark to write DataFrames into managed or external tables using different write modes like append, overwrite, and more.

Step 1: Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WriteToExample").getOrCreate()

Step 2: Create a Sample DataFrame

data = [
  ("Aamir Shahzad1", "Lahore", "Pakistan"),
  ("Ali Raza1", "Karachi", "Pakistan"),
  ("Bob1", "New York", "USA"),
  ("Lisa", "Toronto", "Canada")
]
columns = ["full_name", "city", "country"]

df = spark.createDataFrame(data, schema=columns)
df.show()

Step 3: Write to Table using writeTo()

# This will create or replace the table if it exists
df.writeTo("default.people_table").createOrReplace()
print("✅ Table 'people_table' written using writeTo(). You can query it using SQL.")

Step 4: Query Table using Spark SQL

spark.sql("SELECT * FROM default.people_table").show()

Notes:

  • You can use .create() to fail if the table already exists.
  • Use .append() to add data to an existing table (see the sketch after these notes).
  • This works with tables in Hive metastore or Unity Catalog in Databricks.
  • The table will be stored under the 'default' database unless otherwise specified.
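
For reference, here is a minimal sketch of the append and create modes mentioned in the notes above, reusing the same df and table from this tutorial:

# Append the same rows to the existing table (the table must already exist)
df.writeTo("default.people_table").append()

# Or use .create() instead, which fails if the table already exists:
# df.writeTo("default.people_table").create()

# Verify the row count after appending
spark.sql("SELECT COUNT(*) AS row_count FROM default.people_table").show()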

📺 Watch the Full Tutorial

Azure Synapse Serverless SQL Pool Feature Support – What’s Allowed and What’s Not | Azure Synapse Analytics Tutorial

📘 Serverless SQL Pool - Feature Support Overview | Azure Synapse Analytics

Azure Synapse Analytics Serverless SQL Pool enables you to analyze data in your data lake without needing to provision infrastructure. It's ideal for ad hoc querying and lightweight reporting scenarios.

✅ What's Supported in Serverless SQL Pool

  • Schemas, Views, Stored Procedures, Functions: You can define logical schemas, and create reusable views, stored procedures, and inline table-valued functions.
  • External Resources: Serverless SQL supports querying external sources such as Azure Data Lake Storage (ADLS), Azure Blob Storage, and Azure Cosmos DB analytical store.
  • T-SQL Functions: A broad range of built-in T-SQL functions are supported (string, date, math, and aggregate).
  • Auto Statistics: Automatically generated statistics are used to optimize queries. Manual updates aren't supported.
  • Cross-Database Queries: You can query across databases within the same Synapse workspace.

❌ What’s Not Supported

  • Persistent Tables: Serverless SQL doesn’t support tables that store data inside the pool. Use EXTERNAL TABLE definitions over data lake files instead (see the sketch below).
  • Triggers: No support for DML triggers since DML operations are restricted.
  • INSERT/UPDATE/DELETE: You cannot perform data modification via T-SQL. These are read-only environments.
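
As a sketch of the EXTERNAL TABLE approach mentioned above, you define the table over files that stay in the data lake rather than storing data in the pool. The data source, file format, path, and column names here are illustrative:

-- One-time setup: where the files live and how to parse them
CREATE EXTERNAL DATA SOURCE SalesLake
WITH (LOCATION = 'https://yourlakehouse.dfs.core.windows.net/sales');

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);

-- The external table itself holds only metadata; the data stays in the lake
CREATE EXTERNAL TABLE dbo.Customer
(
    customer_id INT,
    full_name   VARCHAR(100),
    city        VARCHAR(50)
)
WITH (
    LOCATION = 'customer.csv',
    DATA_SOURCE = SalesLake,
    FILE_FORMAT = CsvFormat
);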

🔗 Best Practices & Recommendations

  • Use for read-heavy, on-demand analytics
  • Pair with Synapse Pipelines or Spark for write operations
  • Ideal for querying CSV, Parquet, and JSON files directly from the data lake

🧪 Example Query with OPENROWSET


SELECT *
FROM OPENROWSET(
    BULK 'https://yourlakehouse.dfs.core.windows.net/sales/customer.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS result;

📺 Watch the Full Video Tutorial

Credit: This blog post was created with the help of ChatGPT and Gemini.