Incrementally Write Data to Delta Lake in Azure Synapse Analytics

📘 Overview

Delta Lake provides ACID-compliant storage that enables scalable and reliable data lake solutions. With Apache Spark Pools in Azure Synapse Analytics, you can incrementally write data to Delta tables using merge operations or overwrite modes for upserts.

💡 Why Incremental Writes?

  • Efficient handling of new or updated records
  • Reduced cost and faster performance over full reloads
  • Supports upsert (insert + update) logic

🛠️ Step-by-Step: Upsert to Delta Table

1. Load New Data

%%pyspark
new_data = [
    (1, "Alice", "2024-01-01"),
    (2, "Bob", "2024-01-02")
]
columns = ["id", "name", "modified_date"]
df_new = spark.createDataFrame(new_data, columns)

2. Write Base Delta Table (if not exists)

df_new.write.format("delta").mode("overwrite") \
    .save("abfss://container@account.dfs.core.windows.net/delta/customer")

3. Merge New Data (Incremental Write)

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "abfss://container@account.dfs.core.windows.net/delta/customer")

delta_table.alias("target").merge(
    df_new.alias("source"),
    "target.id = source.id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

📦 Notes

  • You must import DeltaTable from the Delta Lake module
  • The merge function ensures existing records are updated and new ones inserted
  • Delta Lake auto-manages transaction logs for rollback and audit

✅ Best Practices

  • Use partitioning if writing large volumes of data
  • Track modified dates to avoid reprocessing old records (see the sketch after this list)
  • Validate schema before merges to prevent errors
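Building on the modified-date tip above, here is a minimal sketch of filtering the source down to new or changed rows before merging. The last_processed_date value is a stand-in; in practice you would read the watermark from a control table or pipeline parameter:

%%pyspark
from pyspark.sql.functions import col

# Hypothetical watermark; normally read from a control table or pipeline parameter
last_processed_date = "2024-01-01"

df_incremental = df_new.filter(col("modified_date") > last_processed_date)

delta_table.alias("target").merge(
    df_incremental.alias("source"),
    "target.id = source.id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()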

📈 Use Cases

  • CDC (Change Data Capture) implementation
  • Daily/Hourly incremental ingestion jobs
  • Data warehouse staging layer with Delta Lake

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

PySpark Tutorial: Flatten Arrays and Structs with explode(), inline(), and struct()

Learn how to use explode(), inline(), and struct() in PySpark to work with nested array and struct data efficiently.

📦 Sample Data

data_array = [
  ("Aamir", ["apple", "banana", "cherry"]),
  ("Sara", ["orange", "grape"]),
  ("John", ["melon", "kiwi", "pineapple"]),
  ("Lina", ["pear", "peach"])
]

df_array = spark.createDataFrame(data_array, ["name", "fruits"])
df_array.show()

Output:

+-----+------------------------+
| name|                  fruits|
+-----+------------------------+
|Aamir| [apple, banana, cherry]|
| Sara|         [orange, grape]|
| John| [melon, kiwi, pineapple]|
| Lina|           [pear, peach]|
+-----+------------------------+

💥 explode() – Flatten Arrays

df_exploded = df_array.select("name", explode("fruits").alias("fruit"))
df_exploded.show()

Output:

+-----+--------+
| name|   fruit|
+-----+--------+
|Aamir|  apple |
|Aamir| banana |
|Aamir| cherry |
| Sara| orange |
| Sara|  grape |
| John|  melon |
| John|   kiwi |
| John|pineapple|
| Lina|   pear |
| Lina|  peach |
+-----+--------+

🧱 struct() – Create Struct from Columns

df_struct = df_array.select("name", struct("fruits").alias("fruit_struct"))
df_struct.show(truncate=False)
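The call above wraps a single column in a struct. struct() is more often used to bundle several columns into one nested column; here is a minimal sketch on the same df_array (the struct column name "person" is just an illustration):

from pyspark.sql.functions import struct, col

# Combine name and fruits into one nested column, then read the fields back out
df_bundled = df_array.select(struct("name", "fruits").alias("person"))
df_bundled.select(col("person.name"), col("person.fruits")).show(truncate=False)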

🧩 inline() – Flatten Array of Structs

from pyspark.sql.functions import inline
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

data_struct = [
  ("Aamir", [{"fruit": "apple", "color": "red"}, {"fruit": "banana", "color": "yellow"}]),
  ("Sara", [{"fruit": "orange", "color": "orange"}]),
  ("John", [{"fruit": "melon", "color": "green"}, {"fruit": "kiwi", "color": "brown"}]),
  ("Lina", [{"fruit": "pear", "color": "green"}, {"fruit": "peach", "color": "pink"}])
]

schema = StructType([
  StructField("name", StringType(), True),
  StructField("fruits", ArrayType(StructType([
    StructField("fruit", StringType(), True),
    StructField("color", StringType(), True)
  ])), True)
])

df_struct = spark.createDataFrame(data_struct, schema)
df_inline = df_struct.select("name", inline("fruits"))
df_inline.show()
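For reference, inline() expands each struct in the array into its own row, with one column per struct field, so df_inline.show() should produce output along these lines:

+-----+------+------+
| name| fruit| color|
+-----+------+------+
|Aamir| apple|   red|
|Aamir|banana|yellow|
| Sara|orange|orange|
| John| melon| green|
| John|  kiwi| brown|
| Lina|  pear| green|
| Lina| peach|  pink|
+-----+------+------+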

🎥 Watch Full Video Tutorial

📄 Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Create, Modify & Manage Tables with T-SQL in Fabric Warehouse

This tutorial walks you through essential T-SQL operations in Microsoft Fabric Warehouse. You’ll learn how to create tables, insert data, update schema, drop columns, and even handle schema evolution when direct alterations aren't supported. Ideal for developers and DBAs new to Fabric!

📘 Step 1: Create a New Table

Let’s begin by creating a basic Employee table.

CREATE TABLE dbo.Employee (
    EmployeeID INT,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    Department VARCHAR(50)
);

SELECT * FROM dbo.Employee;

✅ Step 2: Insert Sample Data

INSERT INTO dbo.Employee (EmployeeID, FirstName, LastName, Department)
VALUES
(1, 'Aamir', 'Shahzad', 'IT'),
(2, 'Sara', 'Ali', 'HR'),
(3, 'John', 'Doe', 'Finance');

👁️ Step 3: View Data

SELECT * FROM dbo.Employee;

➕ Step 4: Add a New Column

Add a HireDate column to your table:

ALTER TABLE dbo.Employee
ADD HireDate DATE;

📝 Step 5: Update New Column with Values

UPDATE dbo.Employee SET HireDate = '2023-01-01' WHERE Department = 'IT';
UPDATE dbo.Employee SET HireDate = '2023-02-15' WHERE EmployeeID = 2;
UPDATE dbo.Employee SET HireDate = '2023-03-10' WHERE Department = 'Finance';

👁️ View Updated Table

SELECT * FROM dbo.Employee;

🧽 Step 6: Drop an Unused Column

Remove the Department column once it's no longer needed:

ALTER TABLE dbo.Employee
DROP COLUMN Department;

👀 View Table After Dropping Column

SELECT * FROM dbo.Employee;

⚠️ Step 7: Change Data Type (Unsupported Directly)

Direct column type change like increasing VARCHAR(50) to VARCHAR(100) is not currently supported in Fabric Warehouse. Instead, use a workaround.

✅ Workaround: Migrate to New Table

-- Step 1: Create a new table with updated column size
CREATE TABLE dbo.Employee_New (
    EmployeeID INT,
    FirstName VARCHAR(100),
    LastName VARCHAR(50),
    HireDate DATE
);

-- Step 2: Copy data
INSERT INTO dbo.Employee_New (EmployeeID, FirstName, LastName, HireDate)
SELECT EmployeeID, FirstName, LastName, HireDate
FROM dbo.Employee;

-- Step 3: Drop old table
DROP TABLE dbo.Employee;

-- Step 4: Rename new table
EXEC sp_rename 'dbo.Employee_New', 'Employee';

-- Final view
SELECT * FROM dbo.Employee;

🎬 Watch the Full Tutorial

Blog post written with the help of ChatGPT.

Read CSV, Parquet & JSON Files Using Apache Spark Pool in Azure Synapse

📘 Overview

Apache Spark Pools in Azure Synapse Analytics allow you to read and process data from various file formats such as CSV, Parquet, and JSON stored in Azure Data Lake. This is especially useful for data exploration, transformation, and analytics workflows.

🛠️ Prerequisites

  • Spark Pool attached to your Synapse Workspace
  • Storage account with access permissions configured
  • Notebook interface using %%pyspark or language of choice

✅ Example 1: Read a CSV File

%%pyspark
df_csv = spark.read.option("header", "true") \
                   .option("inferSchema", "true") \
                   .csv("abfss://data@yourstorageaccount.dfs.core.windows.net/input/customers.csv")
df_csv.show()

✅ Example 2: Read a Parquet File

%%pyspark
df_parquet = spark.read.parquet("abfss://data@yourstorageaccount.dfs.core.windows.net/input/sales.parquet")
df_parquet.printSchema()
df_parquet.show()

✅ Example 3: Read a JSON File

%%pyspark
df_json = spark.read.option("multiline", "true") \
                    .json("abfss://data@yourstorageaccount.dfs.core.windows.net/input/products.json")
display(df_json)

🔍 Explanation

  • option("header", "true"): Treats first row as header (CSV)
  • inferSchema: Automatically detects column types (CSV/JSON); an explicit-schema alternative is sketched after this list
  • multiline: Important when dealing with nested or pretty-printed JSON
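If you prefer not to rely on inferSchema (which costs an extra pass over the file), you can supply an explicit schema. A minimal sketch, assuming the customers file has id, name, and signup_date columns; the column names and types here are illustrative only:

%%pyspark
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType

customer_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("signup_date", DateType(), True)
])

df_csv = spark.read.option("header", "true") \
                   .schema(customer_schema) \
                   .csv("abfss://data@yourstorageaccount.dfs.core.windows.net/input/customers.csv")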

📦 File Path Format

Use abfss:// protocol to point to Azure Data Lake Gen2:

abfss://<container>@<account>.dfs.core.windows.net/<path>/filename

📌 Tips

  • For large files, prefer Parquet for performance and compression
  • Validate file path and permissions to avoid access errors
  • Use display(df) or df.show() to preview data
  • Always check and handle nulls or unexpected formats

🎯 Use Cases

  • Data ingestion and transformation in Spark
  • Reading raw data from staging zones
  • Feeding analytics and ML pipelines

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

Write Single CSV, Parquet, JSON Files Using Apache Spark Pool in Azure Synapse

📘 Overview

When using Apache Spark Pools in Azure Synapse Analytics, writing data to a single output file (CSV, Parquet, or JSON) is a common requirement for downstream systems, data sharing, and export scenarios. By default, Spark writes output in a distributed format (multiple part files), but you can force it to generate just one.

🛠️ Step-by-Step: Writing Single Output File

✅ Step 1: Sample Data

%%pyspark
data = [("Alice", "USA", 1000), ("Bob", "Canada", 1500)]
columns = ["Name", "Country", "Sales"]

df = spark.createDataFrame(data, columns)
df.show()

✅ Step 2: Coalesce to Single Partition

df_single = df.coalesce(1)  # Combine all data into one partition

✅ Step 3: Write to CSV

df_single.write.mode("overwrite").option("header", "true").csv("abfss://output@storageaccount.dfs.core.windows.net/singlefile/csv/")

✅ Step 4: Write to Parquet

df_single.write.mode("overwrite").parquet("abfss://output@storageaccount.dfs.core.windows.net/singlefile/parquet/")

✅ Step 5: Write to JSON

df_single.write.mode("overwrite").json("abfss://output@storageaccount.dfs.core.windows.net/singlefile/json/")

💡 Notes

  • coalesce(1) reduces to one partition — required for a single output file
  • The result will be saved as a part-00000 file inside the output folder; you can rename it manually in the Data Lake or with a few lines of Python (see the sketch after this list)
  • Use .option("header", "true") for CSV if you want column headers
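As a minimal sketch of the Python option, the mssparkutils file-system helpers available in Synapse notebooks can locate the part file and rename it. The output folder and target file name below are assumptions for illustration:

%%pyspark
from notebookutils import mssparkutils

output_dir = "abfss://output@storageaccount.dfs.core.windows.net/singlefile/csv/"

# Find the part file Spark produced and move it to a friendlier name
part_file = [f.path for f in mssparkutils.fs.ls(output_dir) if f.name.startswith("part-")][0]
mssparkutils.fs.mv(part_file, output_dir + "customers.csv")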

📦 Output Location (ADLS Gen2)

Ensure your Spark pool has access to the storage container. Use abfss:// format with the correct filesystem and storage account.

📌 Best Practices

  • Use repartition(1) instead of coalesce(1) if you're dealing with uneven or skewed partitions
  • Always validate output size; single files are not optimal for huge datasets
  • Use `.mode("overwrite")` cautiously to avoid accidental loss of data

📈 Use Cases

  • Exporting datasets for business users or external partners
  • Feeding ML pipelines or BI dashboards
  • Generating one file per run for versioned archival

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

Transforming Arrays and Maps in PySpark: transform(), filter(), zip_with()

This tutorial explains advanced functions in PySpark to manipulate array and map collections using:

  • transform()
  • filter()
  • zip_with()

Sample Data Setup

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, transform, filter, zip_with
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.appName("TransformFilterZipWith").getOrCreate()

data = [("John", [1, 2, 3, 4]), ("Lisa", [10, 20, 30]), ("Aamir", [5, 15, 25])]
df = spark.createDataFrame(data, ["name", "numbers"])
df.show()
        

Output:

+-----+------------+
| name|     numbers|
+-----+------------+
| John| [1, 2, 3, 4]|
| Lisa|[10, 20, 30]|
|Aamir| [5, 15, 25]|
+-----+------------+

1️⃣ transform()

Definition: Applies a transformation to each element in the array.

df.select("name", transform("numbers", lambda x: x * 2).alias("transformed")).show()

Output:

+-----+------------+
| name| transformed|
+-----+------------+
| John|[2, 4, 6, 8]|
| Lisa|[20, 40, 60]|
|Aamir|[10, 30, 50]|
+-----+------------+

2️⃣ filter()

Definition: Filters array elements based on a condition.

df.select("name", filter("numbers", lambda x: x > 10).alias("filtered")).show()

Output:

+-----+--------+
| name|filtered|
+-----+--------+
| John|      []|
| Lisa|[20, 30]|
|Aamir|[15, 25]|
+-----+--------+

3️⃣ zip_with()

Definition: Combines two arrays element-wise using a binary function.

from pyspark.sql.functions import lit, array

df2 = df.withColumn("multiplier", array(lit(10), lit(20), lit(30), lit(40)))
df2.select("name", zip_with("numbers", "multiplier", lambda x, y: x + y).alias("zipped")).show()
        

Output:

+-----+------------------+
| name|            zipped|
+-----+------------------+
| John|  [11, 22, 33, 44]|
| Lisa|[20, 40, 60, NULL]|
|Aamir|[15, 35, 55, NULL]|
+-----+------------------+

Note: Lisa and Aamir have only three numbers, so zip_with() pads the shorter array with null before applying the function, and null + 40 evaluates to NULL.

📺 Watch the Full Tutorial on YouTube

© 2025 Aamir Shahzad | PySpark Tutorials

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Microsoft Fabric Warehouse Tutorial – Schemas, Tables & SELECT Queries

In this tutorial, we explore the fundamentals of Microsoft Fabric Warehouse including schemas, table creation, data insertion, and various types of SELECT queries. This guide is perfect for beginners looking to understand structured querying in Fabric’s T-SQL engine.

📘 What is a Schema?

A schema in Microsoft Fabric Warehouse is a logical container that holds database objects such as tables, views, and procedures. Think of it as a folder or namespace to help organize your database.

✅ Create a Schema

CREATE SCHEMA Sales;

📘 What is a Table?

A table is a structured object used to store rows of data across predefined columns. Tables are always created within schemas and support SQL operations like INSERT, SELECT, UPDATE, and DELETE.

✅ Create a Table Inside the Schema

CREATE TABLE Sales.Customer (
    CustomerID INT,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    Country VARCHAR(20),
    SignupDate DATETIME2(3)
);

✅ Insert Sample Data

INSERT INTO Sales.Customer (CustomerID, FirstName, LastName, Country, SignupDate)
VALUES
(1, 'Aamir', 'Shahzad', 'USA', '2024-01-01'),
(2, 'Sara', 'Ali', 'Canada', '2024-02-15'),
(3, 'John', 'Doe', 'UK', '2024-03-10');

📘 What is SELECT?

The SELECT statement is used to query and retrieve data from one or more tables. You can select all columns, specific columns, filter results, sort data, and aggregate results.

✅ Select All Records

SELECT * FROM Sales.Customer;
SELECT * FROM dbo.Customer;

✅ Select Specific Columns

SELECT FirstName, Country FROM Sales.Customer;

✅ Filter Data Using WHERE

SELECT * FROM Sales.Customer
WHERE CustomerID > 1;

✅ Sort Data Using ORDER BY

SELECT * FROM Sales.Customer
ORDER BY SignupDate DESC;

✅ Count Total Rows

SELECT COUNT(*) AS TotalCustomers FROM Sales.Customer;

🎬 Watch the Full Tutorial

Blog post written with the help of ChatGPT.

Using Magic Commands in Azure Synapse Notebooks | Examples & Tips

📘 What Are Magic Commands?

In Azure Synapse Notebooks, magic commands allow you to switch the execution context or language in a specific cell. These commands start with %% and tell Synapse which language or runtime to use in the cell.

🧪 Common Magic Commands in Synapse

  • %%pyspark – Run Python code using PySpark
  • %%spark – Run Scala code
  • %%sql – Execute Spark SQL queries
  • %%csharp – Run .NET for Apache Spark code
  • %%md – Write Markdown text (headings, notes)

✅ Example: PySpark Cell

%%pyspark
df = spark.read.csv('abfss://data@account.dfs.core.windows.net/sample.csv', header=True)
df.show()

✅ Example: Spark SQL Cell

%%sql
SELECT * FROM mytable LIMIT 10

✅ Example: Markdown Cell

%%md
### This is a Markdown cell
Use this for annotations and titles

⚙️ How to Use Magic Commands

  • Click + Code in Synapse Notebook
  • Begin the cell with the desired magic command (e.g., %%pyspark)
  • Write your code below and run the cell

💡 Tips

  • Use %%sql for querying Delta/Parquet data from Spark tables
  • Use %%md to format notebook sections and add comments
  • Stick to one language per cell; switching context within a cell is not supported

📈 Best Practices

  • Label each section of your notebook with %%md to keep it organized
  • Run small chunks of code in separate cells for easier debugging
  • Use PySpark for transformation logic and SQL for querying (see the example below)
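As a small illustration of that last point, a %%pyspark cell can register a temporary view that a later %%sql cell queries; the file path and view name below are assumptions:

%%pyspark
df = spark.read.parquet("abfss://data@account.dfs.core.windows.net/sales/")
df.createOrReplaceTempView("sales_vw")  # visible to %%sql cells in the same session

%%sql
SELECT * FROM sales_vw LIMIT 10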

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

Mastering PySpark Map Functions | create_map(), map_keys(), map_values(), map_concat()

In this tutorial, you'll learn how to use key PySpark map functions including create_map(), map_keys(), map_values(), map_concat(), and more with practical examples and real outputs.

🔧 Step 1: Initialize Spark

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MapFunctions").getOrCreate()

📦 Step 2: Create DataFrame with map column

from pyspark.sql.functions import create_map, lit

data = [
    ("Aamir", "USA", "English"),
    ("Sara", "Canada", "French"),
    ("John", "UK", "English"),
    ("Lina", "Mexico", "Spanish")
]

df = spark.createDataFrame(data, ["name", "country", "language"])
df.show()
Output:
+------+--------+--------+
| name | country|language|
+------+--------+--------+
|Aamir | USA    |English |
|Sara  | Canada |French  |
|John  | UK     |English |
|Lina  | Mexico |Spanish |
+------+--------+--------+

🧱 create_map()

Creates a new map from key-value pairs.

df_map = df.select("name", create_map(
    lit("country"), lit("USA"),
    lit("language"), lit("English")
).alias("new_map"))
df_map.show(truncate=False)
Output:
+-----+-------------------------------------+
|name |new_map                              |
+-----+-------------------------------------+
|Aamir|{country -> USA, language -> English}|
|Sara |{country -> USA, language -> English}|
|John |{country -> USA, language -> English}|
|Lina |{country -> USA, language -> English}|
+-----+-------------------------------------+
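The example above fills the map with fixed literal values, which is why every row looks identical. More often you will pair literal keys with the actual column values; a small sketch on the same DataFrame (the map column name is arbitrary):

from pyspark.sql.functions import create_map, lit, col

df_map_cols = df.select("name", create_map(
    lit("country"), col("country"),
    lit("language"), col("language")
).alias("profile_map"))
df_map_cols.show(truncate=False)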

🗝 map_keys()

Returns an array of all keys from a map.

from pyspark.sql.functions import map_keys

df_keys = df_map.select("name", map_keys("new_map").alias("keys"))
df_keys.show(truncate=False)

📥 map_values()

Returns an array of all values from a map.

from pyspark.sql.functions import map_values

df_values = df_map.select("name", map_values("new_map").alias("values"))
df_values.show(truncate=False)

🔁 map_concat()

Concatenates two or more maps into one.

from pyspark.sql.functions import map_concat

df_concat = df.select("name", map_concat(
    create_map(lit("status"), lit("active")),
    create_map(lit("region"), lit("east"))
).alias("concatenated_map"))
df_concat.show(truncate=False)

🔍 map_contains_key()

Checks whether a key exists in a map.

from pyspark.sql.functions import map_contains_key

df_contains_key = df_map.select("name", map_contains_key("new_map", lit("country")).alias("has_country"))
df_contains_key.show(truncate=False)

📺 Watch the Full Video Tutorial

© 2025 Aamir Shahzad | PySpark Tutorials

Some of the contents in this website were created with assistance from ChatGPT and Gemini

Load Data to Warehouse Table from ADLS Gen2 Using Dataflow Gen2

In this Microsoft Fabric tutorial, you'll learn how to create a Dataflow Gen2 to load data from Azure Data Lake Storage Gen2 (ADLS Gen2) directly into a Warehouse table. Dataflows offer a graphical interface for no-code or low-code ETL and are ideal for self-service data preparation and reuse across the Fabric ecosystem.

✅ Setting up Dataflow Gen2 in Microsoft Fabric

To begin, go to your Microsoft Fabric workspace and click “New > Dataflow Gen2”. Choose Blank Dataflow or start from a template. You’ll enter the Power Query web-based editor, which allows you to pull, transform, and publish data.

Dataflow Gen2 supports various sources and destinations and is optimized for working with Fabric Lakehouses, Warehouses, and cloud file sources.

✅ Connecting to ADLS Gen2 as a Source

To load data from ADLS Gen2:

  • Click “+ Add New Source” and select Azure Data Lake Storage Gen2.
  • Provide the account URL and navigate to the desired folder or file.
  • Select the format (CSV, Parquet, etc.), and preview the dataset.
Once the data is previewed, click “Transform Data” to begin preparing it for loading.

✅ Transforming and Mapping Data for Warehouse Tables

Inside the Power Query editor:

  • Rename columns to match your Warehouse schema.
  • Change data types (e.g., convert string to datetime, float, boolean, etc.).
  • Apply filters, remove nulls, or enrich with calculated columns.
After transformation, go to “Destination” and choose a Fabric Warehouse table. If the table does not exist, you can optionally create it from your transformed schema.

✅ Publishing and Running the Dataflow

Once the transformations and mappings are complete:

  1. Click “Publish” to save the dataflow to your workspace.
  2. Run the dataflow manually or schedule it to refresh at a specific interval.
  3. Monitor load progress and view row counts and error logs from the Dataflow Monitoring tab.
The loaded data is now available for querying via T-SQL or Power BI directly from your Fabric Warehouse.

✅ Tips for Efficient and Scalable Ingestion

  • Use incremental refresh: Configure filters for date/time columns to process only new or updated rows.
  • Partition large files: Break input files into manageable chunks to optimize performance.
  • Reuse dataflows: Promote common logic to shared dataflows across multiple teams or projects.
  • Profile data: Use the “Column profile” tool in Power Query to identify anomalies before loading.
  • Test transformations: Always validate small samples before running a full ingest on large datasets.

🎬 Watch the Full Tutorial

Blog post written with the help of ChatGPT.

How to Create Your First Azure Synapse Notebook

📘 What is a Synapse Notebook?

An Azure Synapse Notebook is an interactive development environment inside Synapse Studio where you can run data exploration, transformation, and analytics workloads using Spark. You can write code in PySpark, SQL, Scala, or .NET.

🧰 Use Cases for Synapse Notebooks

  • Data exploration on Data Lake or SQL Pools
  • Machine learning experiments
  • ETL and data processing using Spark
  • Running parameterized reports and scripts

🛠️ Steps to Create Your First Notebook

✅ Step 1: Launch Synapse Studio

Go to your Synapse workspace in the Azure portal and click Open Synapse Studio.

✅ Step 2: Create a New Notebook

  • Navigate to the Develop tab.
  • Click on the + icon → Notebook.

✅ Step 3: Add and Configure Code Cells

  • Select the language for each cell (e.g., %%pyspark, %%sql).
  • Write your logic in the cell and run it using the play ▶️ button.

✅ Step 4: Attach Spark Pool

Choose a configured Apache Spark Pool at the top to run the notebook.

✅ Step 5: Save and Organize

Click Save As to name and store your notebook inside a Synapse folder.

💡 Tips

  • You can switch languages in each cell using %%sql, %%pyspark, etc.
  • Use markdown cells for annotations: click “+ Text” to add headings, notes, etc.
  • Notebooks can be integrated into Synapse Pipelines for orchestration.

🔍 Example PySpark Cell

%%pyspark
df = spark.read.csv('abfss://data@storageaccount.dfs.core.windows.net/sample.csv', header=True)
df.show()

📈 Example SQL Cell

%%sql
SELECT * FROM mytable LIMIT 10

📦 Save & Share

You can export notebooks as .ipynb or .json files and share them across teams. Use Git integration in Synapse Studio for version control.

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

Advanced PySpark Array Functions | slice(), concat(), element_at(), sequence()

🔍 Advanced Array Manipulations in PySpark

This tutorial explores advanced array functions in PySpark including slice(), concat(), element_at(), and sequence() with real-world DataFrame examples.

📦 Sample DataFrame

data = [
    ("Aamir", [1, 2, 3, 4, 5, 6]),
    ("Sara", [7, 8, 9, 10, 11]),
    ("John", [12, 13, 14, 15])
]
df = spark.createDataFrame(data, ["name", "arr"])
df.show()

✅ Output

+------+------------------+
| name | arr              |
+------+------------------+
| Aamir| [1, 2, 3, 4, 5, 6]|
| Sara | [7, 8, 9, 10, 11]|
| John | [12, 13, 14, 15] |
+------+------------------+

1️⃣ slice()

Definition: Returns a subset of the array starting at a specified index for a given length.

from pyspark.sql.functions import slice
df.select("name", "arr", slice("arr", 2, 3).alias("sliced_array")).show()

✅ Output

+------+------------------+-------------+
| name | arr              | sliced_array|
+------+------------------+-------------+
| Aamir| [1, 2, 3, 4, 5, 6]| [2, 3, 4]   |
| Sara | [7, 8, 9, 10, 11]| [8, 9, 10]  |
| John | [12, 13, 14, 15] | [13, 14, 15]|
+------+------------------+-------------+

2️⃣ concat()

Definition: Combines multiple arrays into one array.

from pyspark.sql.functions import concat, array, lit
df.select("name", "arr", concat("arr", array(lit(100), lit(200))).alias("concatenated")).show()

✅ Output

+------+------------------+------------------------+
| name | arr              | concatenated           |
+------+------------------+------------------------+
| Aamir| [1, 2, 3, 4, 5, 6]| [1, 2, 3, 4, 5, 6, 100, 200]|
| Sara | [7, 8, 9, 10, 11]| [7, 8, 9, 10, 11, 100, 200]|
| John | [12, 13, 14, 15] | [12, 13, 14, 15, 100, 200]|
+------+------------------+------------------------+

3️⃣ element_at()

Definition: Returns the element at the specified index in the array. Index starts at 1.

from pyspark.sql.functions import element_at
df.select("name", "arr", element_at("arr", 4).alias("element_at_4")).show()

✅ Output

+------+------------------+-------------+
| name | arr              | element_at_4|
+------+------------------+-------------+
| Aamir| [1, 2, 3, 4, 5, 6]| 4          |
| Sara | [7, 8, 9, 10, 11]| 10         |
| John | [12, 13, 14, 15] | 15         |
+------+------------------+-------------+

4️⃣ sequence()

Definition: Generates an array of numbers from start to end.

from pyspark.sql.functions import sequence
df.select("name", "arr", sequence(lit(1), lit(5)).alias("seq_1_to_5")).show()

✅ Output

+------+------------------+-------------+
| name | arr              | seq_1_to_5  |
+------+------------------+-------------+
| Aamir| [1, 2, 3, 4, 5, 6]| [1, 2, 3, 4, 5] |
| Sara | [7, 8, 9, 10, 11]| [1, 2, 3, 4, 5] |
| John | [12, 13, 14, 15] | [1, 2, 3, 4, 5] |
+------+------------------+-------------+
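sequence() also accepts an optional step argument, which is useful for ranges that skip values. A quick sketch (start, stop, and step values chosen purely for illustration):

from pyspark.sql.functions import sequence, lit

df.select("name", sequence(lit(1), lit(9), lit(2)).alias("odd_numbers")).show()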

📺 Watch the Full Tutorial

Load Data to Warehouse Table from ADLS Gen2 Using Pipeline

In this step-by-step Microsoft Fabric tutorial, you'll learn how to build a pipeline that connects to Azure Data Lake Storage Gen2, retrieves CSV/Parquet files, maps the data, and loads it into a Fabric Warehouse table. Pipelines in Microsoft Fabric offer a low-code, efficient approach to manage data flows across cloud environments.

✅ How to Configure a Pipeline in Microsoft Fabric

Begin by navigating to your Microsoft Fabric workspace and selecting “New > Data pipeline”. Give your pipeline a meaningful name. You’ll see a blank canvas where you can add different activities like source, transformation, and sink (destination).

Pipelines in Fabric resemble Azure Data Factory and provide native support for integrating data from a wide variety of sources including ADLS Gen2, SQL, Lakehouse, REST APIs, and more.

✅ How to Connect to ADLS Gen2 and Select Source Files

Drag the Copy Data activity onto the canvas. In the source tab:

  • Click “+ New” to create a new connection to your ADLS Gen2 account.
  • Provide the storage account URL or browse the linked services.
  • Navigate to the desired container and folder where your files are stored (e.g. /input/customer.csv).
  • Choose file format (CSV, Parquet, etc.), and configure schema detection options.

✅ How to Map and Load Data into Warehouse Tables

On the Sink tab of the Copy Data activity:

  • Select your destination as a Microsoft Fabric Warehouse.
  • Pick the appropriate Warehouse and table name (e.g., dbo.Customer).
  • Enable schema mapping. Fabric attempts auto-mapping, but you can also manually map source columns to destination fields.
  • Choose write behavior – e.g., Insert, Upsert, or Truncate + Load.

✅ End-to-End Data Flow Setup and Execution

Once both source and sink are configured:

  1. Validate the pipeline to catch schema or connection errors.
  2. Click “Publish All” to save your work.
  3. Trigger the pipeline manually or schedule it via the trigger tab.
The data will flow from ADLS Gen2 into your Fabric Warehouse, and you can verify it by querying the target table.

✅ Best Practices for Pipeline-Based Data Ingestion

  • Use parameterized pipelines to make reusable components for different file sources or tables.
  • Monitor execution logs to diagnose failures or slow performance.
  • Partition large datasets when reading from lake to avoid memory pressure during ingestion.
  • Schedule during off-peak hours to maximize performance and reduce contention.
  • Set up retry policies for fault tolerance in case of transient connectivity issues.

🎬 Watch the Full Tutorial

Blog post written with the help of ChatGPT.

How to Create and Configure Spark Pools in Azure Synapse Analytics

📘 Overview

Apache Spark Pools in Azure Synapse Analytics are powerful, scalable environments used for big data processing, data exploration, and machine learning. Creating and configuring Spark Pools properly is essential for optimal performance and cost management. This guide walks you through the process step-by-step.

🛠️ Step-by-Step: Create a Spark Pool

✅ Step 1: Go to Synapse Workspace

  • Open the Azure Portal.
  • Search for and open your Azure Synapse Analytics workspace.

✅ Step 2: Create a New Apache Spark Pool

  • In the Synapse workspace pane, click on Apache Spark pools.
  • Click + New to create a new Spark pool.

✅ Step 3: Configure the Basics

  • Name: Enter a unique Spark pool name (e.g., sparkpool-dev).
  • Node size: Choose based on workload (e.g., Small, Medium, Large).
  • Node count: Define the minimum and maximum number of nodes.
  • Auto scale: Enable for dynamic scaling during load variations.
  • Auto pause: Set idle time after which Spark shuts down to save costs.

✅ Step 4: Review and Create

  • Click Review + create.
  • Verify configuration and click Create.

⚙️ Optional Settings

  • Dynamic Allocation: Lets Spark grow or shrink executors at runtime.
  • Tags: Add tags for billing and organizational metadata.
  • Library Management: Add custom libraries later from Synapse Studio.

📌 Tips for Configuration

  • Start with smaller nodes and auto-scale — monitor usage, then scale up if needed.
  • Use auto-pause to shut down Spark sessions during inactivity.
  • Keep TTL (Time-to-Live) low in dev/test environments to reduce cost.
  • Ensure the Spark pool is attached to the right Linked Services and Storage Accounts.

💡 When to Use Spark Pools

  • Running PySpark or Scala notebooks
  • Building ETL/ELT data pipelines
  • Training and scoring machine learning models
  • Querying semi-structured data like JSON, Parquet, or CSV in Data Lake

📈 Monitoring Spark Pool Usage

  • Use Synapse Studio's Monitor Hub to view active sessions and Spark jobs.
  • Integrate with Azure Log Analytics for deeper performance insights.

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

PySpark Set-Like Array Functions | arrays_overlap(), array_union(), flatten(), array_distinct() Explained

In this tutorial, we will explore some useful PySpark array functions that deal with set-like operations. Functions like arrays_overlap(), array_union(), flatten(), and array_distinct() are essential for transforming and manipulating array data in a way that resembles set operations.

Introduction to Set-Like Array Functions

PySpark provides powerful array functions that allow us to perform set-like operations such as finding intersections between arrays, flattening nested arrays, and removing duplicates from arrays. These operations are similar to those used in set theory, but they operate on arrays in PySpark DataFrames.

1. arrays_overlap()

Definition: The arrays_overlap() function returns true if two arrays share at least one common element, otherwise false.


from pyspark.sql import SparkSession
from pyspark.sql.functions import arrays_overlap

# Sample data
data = [([1, 2, 3], [3, 4, 5]), ([6, 7, 8], [1, 2, 3])]

# Create Spark session
spark = SparkSession.builder.appName("SetLikeFunctions").getOrCreate()

# Create DataFrame
df = spark.createDataFrame(data, ["array1", "array2"])

# Apply arrays_overlap function
df.select(arrays_overlap("array1", "array2").alias("overlap")).show()
        

Output:


+-------+
|overlap|
+-------+
|   true|
|  false|
+-------+
        

2. array_union()

Definition: The array_union() function returns a new array that contains the union of the two input arrays, removing any duplicates.


from pyspark.sql.functions import array_union

# Sample data
data = [([1, 2, 3], [3, 4, 5]), ([6, 7], [7, 8, 9])]

# Create DataFrame
df = spark.createDataFrame(data, ["array1", "array2"])

# Apply array_union function
df.select(array_union("array1", "array2").alias("union")).show()
        

Output:


+---------------+
|          union|
+---------------+
|[1, 2, 3, 4, 5]|
|   [6, 7, 8, 9]|
+---------------+
        

3. flatten()

Definition: The flatten() function takes an array of arrays and returns a single array with all the elements from the inner arrays.


from pyspark.sql.functions import flatten

# Sample data: each row holds an array of arrays
data = [([[1, 2], [3, 4]],), ([[5, 6], [7, 8]],)]

# Create DataFrame
df = spark.createDataFrame(data, ["nested_array"])

# Apply flatten function to merge the inner arrays into one
df.select(flatten("nested_array").alias("flattened_array")).show()

Output:

+---------------+
|flattened_array|
+---------------+
|   [1, 2, 3, 4]|
|   [5, 6, 7, 8]|
+---------------+
        

4. array_distinct()

Definition: The array_distinct() function returns an array with all duplicate elements removed.


from pyspark.sql.functions import array_distinct

# Sample data
data = [([1, 2, 2, 3], [3, 4, 4, 5]), ([6, 7, 7, 8], [7, 8, 9, 9])]

# Create DataFrame
df = spark.createDataFrame(data, ["array1", "array2"])

# Apply array_distinct function
df.select(array_distinct("array1").alias("distinct_array")).show()
        

Output:


+--------------+
|distinct_array|
+--------------+
|     [1, 2, 3]|
|     [6, 7, 8]|
+--------------+
        

Watch the Tutorial

In this tutorial, we explored set-like operations on arrays using PySpark's built-in functions like arrays_overlap(), array_union(), flatten(), and array_distinct(). These functions are highly useful for data manipulation and transformation in PySpark DataFrames.

What Are Apache Spark Pools in Azure Synapse Analytics | Overview & Use Cases

📘 Definition

Apache Spark Pools in Azure Synapse are distributed computing environments used to run big data processing and analytics workloads using Apache Spark — an open-source analytics engine for large-scale data processing. These pools are pre-integrated, managed by Azure, and offer a serverless-like experience.

💡 Key Concepts

  • Managed Spark Environment: No need to install or maintain infrastructure.
  • Language Support: Supports PySpark, Scala, .NET (C#), and Spark SQL.
  • Autoscaling: Automatically scales based on the workload.
  • TTL (Time-to-Live): Automatically shuts down idle sessions to reduce cost.
  • Notebook Integration: Works seamlessly within Synapse Studio.
  • Multi-User Support: Multiple users can run notebooks on the same pool concurrently.

💮 Why Use Spark Pools in Synapse?

  • Unified Analytics Platform: Combines SQL and Spark in one workspace.
  • Big Data Ready: Perfect for analyzing large datasets from Azure Data Lake Storage.
  • ETL and ML Workflows: Ideal for transformations, feature engineering, and training ML models.
  • Team Collaboration: Share and schedule notebooks directly in Synapse Studio.

🔹 Architecture Highlights

  • Apache Spark runtime executes notebooks and batch jobs.
  • Hosted within the Synapse Workspace.
  • Supports access to Azure Data Lake, Blob Storage, Azure SQL, and Synapse SQL Pools.
  • Can be used within Synapse Pipelines or standalone notebooks.

🔧 Common Use Cases

  • Interactive data exploration and transformation
  • ETL/ELT pipelines using PySpark or Scala
  • Data science workflows and ML model training
  • Querying massive data lakes
  • Data standardization and cleansing

⚖️ Spark Pool vs Dedicated SQL Pool

Feature   | Spark Pool                                  | Dedicated SQL Pool
----------|---------------------------------------------|---------------------------
Engine    | Apache Spark                                | MPP SQL Engine (T-SQL)
Language  | PySpark, Scala, .NET, SQL                   | T-SQL
Best For  | Big data, unstructured/semi-structured data | Structured data warehousing
Execution | In-memory parallel processing               | Disk-based distributed SQL
Use Case  | ML, ETL, streaming                          | Reporting, OLAP, BI

🔹 Summary

Apache Spark Pools in Azure Synapse provide a powerful, scalable, and collaborative solution for data engineers, data scientists, and analysts. With features like notebook integration, autoscaling, and full support for modern data science languages, Spark Pools make it easy to run big data and machine learning workloads in the cloud — without worrying about infrastructure.

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

PySpark Array Functions | array(), array_contains(), sort_array(), array_size() Explained with Examples

Introduction to PySpark Array Functions

In this tutorial, we will explore various PySpark Array functions that help with manipulating arrays. We will cover:

  • array(): Creates an array from columns in a DataFrame.
  • array_contains(): Checks if an array contains a specific element.
  • sort_array(): Sorts an array in ascending or descending order.
  • array_size(): Returns the size of the array.

Example 1: Creating Arrays with array()

Definition: The array() function is used to create an array from columns in a DataFrame.


from pyspark.sql import SparkSession
from pyspark.sql.functions import array

# Initialize Spark session
spark = SparkSession.builder.appName("PySpark Array Functions").getOrCreate()

# Sample data
data = [("Aamir", 25), ("Sara", 30), ("John", 22)]
df = spark.createDataFrame(data, ["name", "age"])

# Create an array from columns
df_with_array = df.select("name", array("name", "age").alias("name_age_array"))
df_with_array.show()
        

Output:


+-----+--------------+
| name|name_age_array|
+-----+--------------+
|Aamir|   [Aamir, 25]|
| Sara|    [Sara, 30]|
| John|    [John, 22]|
+-----+--------------+
        

Example 2: Using array_contains() to Check for an Element

Definition: The array_contains() function checks if an array contains a specific element.


from pyspark.sql.functions import array_contains

# Check if the array contains the value 30
df_with_check = df_with_array.withColumn("contains_30", array_contains("name_age_array", 30))
df_with_check.show()
        

Output:


+-----+--------------+-----------+
| name|name_age_array|contains_30|
+-----+--------------+-----------+
|Aamir|   [Aamir, 25]|      false|
| Sara|    [Sara, 30]|       true|
| John|    [John, 22]|      false|
+-----+--------------+-----------+
        

Example 3: Sorting an Array with sort_array()

Definition: The sort_array() function sorts an array in ascending or descending order.


from pyspark.sql.functions import sort_array

# Create an array and sort it
df_sorted = df.withColumn("sorted_ages", sort_array(array("age")).alias("sorted_ages"))
df_sorted.show()
        

Output:


+-----+---+-----------+
| name|age|sorted_ages|
+-----+---+-----------+
|Aamir| 25|       [25]|
| Sara| 30|       [30]|
| John| 22|       [22]|
+-----+---+-----------+
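Because each array above holds a single age, the sort is hard to see. Here is a small sketch on a multi-value array, including descending order; the sample values are illustrative:

from pyspark.sql.functions import sort_array

df_multi = spark.createDataFrame([("Aamir", [30, 10, 20])], ["name", "scores"])
df_multi.select("name",
                sort_array("scores").alias("ascending"),
                sort_array("scores", asc=False).alias("descending")).show()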
        

Example 4: Getting Array Size with array_size()

Definition: The array_size() function returns the size of the array.


from pyspark.sql.functions import array_size

# Get the size of the array
df_with_size = df_with_array.withColumn("array_size", array_size("name_age_array"))
df_with_size.show()
        

Output:


+-----+--------------+----------+
| name|name_age_array|array_size|
+-----+--------------+----------+
|Aamir|   [Aamir, 25]|         2|
| Sara|    [Sara, 30]|         2|
| John|    [John, 22]|         2|
+-----+--------------+----------+
        

Watch the Full Tutorial

Backup and Restore for Dedicated SQL Pools in Azure Synapse Analytics

📘 Introduction

In Azure Synapse Analytics, Dedicated SQL Pools are automatically backed up by the platform, enabling you to restore to a point-in-time or clone for dev/test purposes. Understanding how to manage backups and perform restores is essential for business continuity and disaster recovery planning.

🔐 Backup Options in Synapse

  • Automatic Backups: Handled by Azure. Includes full and incremental backups.
  • User-Initiated Snapshots: Create on-demand restore points (preview feature).
  • Geo-Redundant Backups: Synapse supports geo-backups across paired regions.

♻️ Restore Options

  • Point-in-Time Restore: Restore a dedicated SQL pool to a past timestamp (within 7 days).
  • Restore as New Pool: You cannot overwrite an existing pool. Instead, restore into a new name.

🛠️ Restore Example Using Portal

  1. Navigate to your Synapse Workspace → Dedicated SQL Pools.
  2. Select the pool you want to restore.
  3. Click on Restore.
  4. Select Restore Point (either point-in-time or snapshot).
  5. Provide a new pool name and click Restore.

💡 Restore Use Cases

  • Recover from accidental deletes or updates
  • Clone production data for testing
  • Validate changes before promoting them to prod

✅ Important Notes

  • Backups are retained for 7 days by default.
  • You cannot restore over the existing pool — a new pool name is always required.
  • No need to configure backup — it's automatic and included in Synapse pricing.

📦 Alternative Options

If you need longer-term retention:

  • Export table data to Parquet/CSV in Data Lake
  • Use Azure Data Factory or Synapse Pipelines for automation

📈 Monitoring Restore Progress

You can track the restore operation in the Azure Portal under Notifications or by checking the status of the new dedicated SQL pool.

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

PySpark Date Truncation Tutorial | trunc(), date_trunc(), last_day() with Real Examples

Introduction

In this tutorial, we'll dive into PySpark's date truncation functions: trunc(), date_trunc(), and last_day(). These functions are extremely helpful for manipulating date and time data efficiently. We will walk through real-world examples to demonstrate how to use them effectively for various date manipulations.

1. trunc() Function

Definition: The trunc() function truncates a date or timestamp to a specific unit, such as year, month, day, etc.

Example: Truncate the date to the beginning of the year:


from pyspark.sql.functions import trunc
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName("DateTruncation").getOrCreate()

# Sample DataFrame
data = [("2022-07-14",), ("2021-11-09",)]
df = spark.createDataFrame(data, ["date"])

# Truncate the date to the start of the year
df.select(trunc("date", "YYYY").alias("year_start")).show(truncate=False)
        

Output:


+----------+
|year_start|
+----------+
|2022-01-01|
|2021-01-01|
+----------+
        

2. date_trunc() Function

Definition: The date_trunc() function truncates a date or timestamp to the specified unit, such as "hour", "minute", or "day". This function is more flexible than trunc().

Example: Truncate the date to the start of the month:


from pyspark.sql.functions import date_trunc

# Truncate the date to the start of the month
df.select(date_trunc("month", "date").alias("month_start")).show(truncate=False)
        

Output:


+-------------------+
|month_start        |
+-------------------+
|2022-07-01 00:00:00|
|2021-11-01 00:00:00|
+-------------------+

Note: date_trunc() returns a timestamp, so the truncated value carries a 00:00:00 time component.
        

3. last_day() Function

Definition: The last_day() function returns the last day of the month for the given date or timestamp.

Example: Get the last day of the month:


from pyspark.sql.functions import last_day

# Get the last day of the month
df.select(last_day("date").alias("last_day_of_month")).show(truncate=False)
        

Output:


+----------------+
|last_day_of_month|
+----------------+
|2022-07-31      |
|2021-11-30      |
+----------------+
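A common follow-up is computing how many days remain in the month, which combines last_day() with datediff(). A minimal sketch on the same DataFrame:

from pyspark.sql.functions import last_day, datediff, col

df.select("date", datediff(last_day("date"), col("date")).alias("days_to_month_end")).show()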
        

Watch the Full Tutorial

For more PySpark tutorials, don't forget to subscribe to my channel and check out other tutorials.

Monitor Long Running & Frequent Queries in Lakehouse Tables

📘 Overview

Performance tuning is a key part of maintaining a healthy analytics platform. In this tutorial, you'll learn how to monitor long-running and frequent queries against Lakehouse tables in Microsoft Fabric using built-in monitoring tools, helping you identify bottlenecks and optimize workloads effectively.

✅ Topics Covered

  • How to access monitoring tools in Microsoft Fabric
  • How to view historical query performance
  • Identify bottlenecks and optimize slow queries
  • Monitor frequent queries for performance tuning
  • Use metrics for better resource management and debugging

🔍 Accessing Monitoring Tools

  1. Navigate to your Lakehouse in the Fabric workspace.
  2. Click on the SQL Analytics Endpoint tab.
  3. Open the Monitor section from the left panel.
  4. Choose Query History or Performance dashboards.

📊 Analyze Long-Running Queries

  • Sort the query history by Duration to identify slow queries.
  • Click on each query to view the execution plan and performance details.
  • Look for high-cost operations like joins on unindexed columns or large shuffles.

🔁 Identify and Tune Frequent Queries

  • Filter or group queries by Text Hash to identify frequently executed ones.
  • Review their execution stats, I/O, and compute time.
  • Optimize such queries by adding caching, reducing complexity, or restructuring joins.

📈 Metrics & Best Practices

  • Use duration, rows read, and memory used to assess query impact.
  • Capture performance baselines over time to detect anomalies.
  • Schedule heavy queries during off-peak hours when possible.
  • Document repetitive queries and use views to centralize logic.

🎥 Watch the Full Tutorial

Blog created with help from ChatGPT and Gemini.

Microsoft Fabric Warehouse Tutorial for Beginners: Lakehouse vs Warehouse

What is a Microsoft Fabric Warehouse?

A Microsoft Fabric Warehouse is a dedicated SQL-based storage and compute layer in Microsoft Fabric that delivers data-warehousing performance on an elastic, serverless foundation. Built on the same open Delta Lake format that powers Fabric Lakehouse, it provides full T-SQL support, sophisticated cost-based query optimization, and seamless integration with Power BI, Data Factory pipelines, and Fabric’s real-time experiences.

Think of it as your “single source of truth” for structured, curated data that requires fast analytic queries, governed schemas, and enterprise security—all without the traditional admin overhead of provisioning or scaling physical hardware.

Key Features & Use-Cases of a Fabric Warehouse

  • Instant Elasticity – compute spins up on demand and pauses automatically when idle.
  • Open Delta Storage – lake-native tables stored in your OneLake, enabling ACID transactions plus time-travel.
  • End-to-End T-SQL – use the language you know for DDL, DML, and advanced analytics.
  • Built-in Integration – drag-and-drop pipelines, notebooks, and Power BI all in one workspace.
  • Fine-Grained Security – row-level & column-level security, AAD-based access, and Purview tag lineage.

Typical scenarios include: enterprise reporting marts, finance conformed dimensions, customer 360 views, or any workload where sub-second SQL latency is business-critical.

Lakehouse vs Warehouse – Where Do They Differ?

Dimension            | Lakehouse                                      | Warehouse
---------------------|------------------------------------------------|----------------------------------------------
Interface & Language | Spark, PySpark, Scala, SQL on Spark            | Fully ANSI T-SQL; no Spark surface required
Typical Data         | Semi-structured & big data                     | Highly structured, curated facts & dims
Performance Goal     | High-throughput, ELT heavy lifting             | Low-latency BI & ad-hoc SQL analytics
Compute Model        | Spark clusters (per job or session)            | Instant-on serverless SQL engines
Best When…           | You need ML, streaming, or huge parquet ingest | You need governed, predictable BI query speed

When Should You Use Each Model?

Use a Lakehouse when your data teams are heavily Spark-oriented, working with large semi-structured sources, or preparing data for machine learning. It shines for data engineering pipelines, streaming ingestion, and advanced data science.

Use a Warehouse when business users demand consistently fast SQL reports, dimensional models, and fine-grained RBAC. Warehouses are optimized for dashboard refreshes, financial consolidations, and ad-hoc slicing by analysts who live in Power BI or Excel.

In practice, many Fabric solutions combine both—a Lakehouse for raw/bronze & silver data processing and a Warehouse for the gold, consumption-ready layer.

Step-by-Step Walkthrough for Absolute Beginners

  1. Create a Workspace: In the Fabric portal, hit New ► Workspace and assign a Fabric capacity.
  2. Add a Warehouse: Click New ► Warehouse, give it a name, and hit Create.
  3. Load Data: Open Get Data, choose Azure Data Lake / OneLake, or simply upload a CSV file.
  4. Build Tables: Use the CREATE TABLE … AS SELECT pattern or define schemas via the designer, then load your data.
  5. Query with T-SQL: Run SELECT TOP 100 * to validate and explore.
  6. Secure & Share: Set row-level security, assign workspace roles, and publish a Power BI report directly from the Warehouse ribbon.
  7. Monitor Performance: Check the Warehouse Monitoring pane for query metrics and auto-pause settings.
Blog post drafted with help from ChatGPT.

Table Partitioning in Azure Synapse Dedicated SQL Pool | Concepts & Demo

📘 What is Table Partitioning?

Table partitioning in Azure Synapse Dedicated SQL Pool is a powerful technique used to divide a large table into smaller, more manageable parts (called partitions) based on a partition key — usually a date column. This helps optimize performance and simplify maintenance tasks.

🔍 Why Use Table Partitioning?

  • Performance Boost (Partition Elimination): Only relevant partitions are scanned during queries, reducing IO and speeding up results.
  • Faster Data Management: Load or remove data by entire partition instead of row-by-row.
  • Maintenance Efficiency: Index rebuilds can be scoped to specific partitions.

📐 RANGE RIGHT vs. RANGE LEFT

Partitioning uses boundary values. With RANGE RIGHT (used in this demo), the boundary value belongs to the higher partition.

PARTITION (OrderDateKey RANGE RIGHT FOR VALUES (20200101, 20210101, 20220101))

Partition | Data Range
1         | OrderDateKey < 20200101
2         | 20200101 ≤ OrderDateKey < 20210101
3         | 20210101 ≤ OrderDateKey < 20220101
4         | OrderDateKey ≥ 20220101

🛠️ Demo Script Highlights

✅ 1. Create Partitioned Table

CREATE TABLE dbo.SalesFact_Partitioned
(
    SaleID INT,
    ProductID INT,
    Quantity INT,
    SaleDate DATE,
    OrderDateKey INT,
    CustomerID INT
)
WITH
(
    DISTRIBUTION = HASH(CustomerID),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (OrderDateKey RANGE RIGHT FOR VALUES (
        20200101, 20210101, 20220101, 20230101, 20240101
    ))
);

✅ 2. Insert Sample Data

INSERT INTO dbo.SalesFact_Partitioned (...) VALUES (..., 20200101, ...); -- Goes to Partition 2 (the boundary value belongs to the higher partition)
INSERT INTO dbo.SalesFact_Partitioned (...) VALUES (..., 20200715, ...); -- Partition 2
-- Repeat for partitions 3–5

✅ 3. Create Identical Staging Table

CREATE TABLE dbo.SalesFact_Stage (...) 
WITH (
    DISTRIBUTION = HASH(CustomerID),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (OrderDateKey RANGE RIGHT FOR VALUES (
        20200101, 20210101, 20220101, 20230101, 20240101
    ))
);

✅ 4. Switch Partition from Stage to Main Table

This is an instant metadata operation (very fast):

ALTER TABLE dbo.SalesFact_Stage
SWITCH PARTITION 2 TO dbo.SalesFact_Partitioned PARTITION 2 WITH (TRUNCATE_TARGET = ON);

📊 Check Row Count Per Partition

SELECT row_count, partition_nmbr 
FROM dbo.vTableSizes 
WHERE table_name = 'SalesFact_Partitioned' AND row_count > 0;

📌 Notes & Best Practices

  • Staging and target tables must match exactly (schema, index, distribution, partition)
  • Use partition switching for bulk loading and archiving scenarios
  • Filter by partitioning column to enable partition elimination in queries

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.