T-SQL Notebooks in Fabric – Run Queries, Visualize Data & Save Views | Microsoft Fabric Tutorial

Microsoft Fabric now supports T-SQL Notebooks — a powerful, interactive workspace for data engineers, SQL developers, and analysts to write, execute, and visualize T-SQL code with documentation and rich outputs.

Key Features of T-SQL Notebooks

  • Write and run T-SQL code directly in notebook cells
  • Connect to Warehouses or SQL analytics endpoints
  • Visualize results using built-in charts and column profiling
  • Save SQL queries as Views
  • Supports cross-warehouse querying with 3-part naming
  • Schedule notebooks and use them in pipelines

Step 1: Add Warehouse as Data Source

  • Click + Data sources in the notebook toolbar
  • Select your desired Warehouse or SQL analytics endpoint
  • Set one as the Primary source

Example Queries

-- Query top records from a table
SELECT TOP 5 * FROM TechDWH.dbo.Customers;

Code Templates

Right-click a table in Object Explorer to generate SQL templates:

  • SELECT Top 100
  • Create Table
  • Drop Table
  • CTAS (Create Table As Select)

Save Query as View

Write your query and highlight it, then click “Save as View” in the toolbar:

SELECT * FROM TechDWH.dbo.Orders
WHERE OrderAmount > 1000;

Cross-Warehouse Queries

SELECT * FROM SalesDW.dbo.SalesCustomer;

Use three-part naming convention: Database.Schema.Table
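
For instance, a cross-warehouse join might look like the following sketch (the warehouse, table, and column names are illustrative and assume both warehouses are attached as data sources):

-- Join a table from the primary warehouse with a table from another warehouse
SELECT c.CustomerID, c.FirstName, s.SalesAmount
FROM TechDWH.dbo.Customers AS c
JOIN SalesDW.dbo.SalesCustomer AS s
  ON c.CustomerID = s.CustomerID;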

Visualizing Query Results

  • After query execution, go to the Inspect tab
  • Explore data distribution and column stats
  • Switch to charts (bar, pie, etc.) for quick insights

SELECT TOP 100 CustomerID, FirstName, LastName, Country FROM TechDWH.dbo.Customers;

Watch the Full Tutorial

Blog post written with the help of ChatGPT.

How to Read Lake Database Tables & Views in Azure Synapse Using PySpark | Azure Synapse Tutorial

Overview

In Azure Synapse Analytics, you can use PySpark to read tables and views from a Lake Database. This allows you to harness the power of Apache Spark for analytics on curated datasets in your Data Lake.

✅ Why Read Lake Database in PySpark?

  • Supports reading from Delta and Parquet formats
  • Leverages Spark for distributed processing
  • Easy integration with Synapse Notebooks

Step-by-Step Guide

Step 1: Read a Table from Lake Database

df = spark.read.table("LakeDBName.TableName")
df.show()

Replace LakeDBName and TableName with your actual Lake Database and table names.

Step 2: Read a View from Lake Database

view_df = spark.read.table("LakeDBName.ViewName")
display(view_df)

Step 3: Register as Temp View for SQL Queries

df.createOrReplaceTempView("temp_table")
spark.sql("SELECT * FROM temp_table WHERE column = 'value'").show()

Sample Use Case

  • Read curated sales data from Lake Database
  • Register as temp view
  • Join with another dataset using Spark SQL
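
A minimal sketch of that flow (the Lake Database, table, and column names are illustrative):

# Read curated sales data and a product dataset from the Lake Database
sales_df = spark.read.table("LakeDBName.Sales")
products_df = spark.read.table("LakeDBName.Products")

# Register both as temp views and join them with Spark SQL
sales_df.createOrReplaceTempView("sales")
products_df.createOrReplaceTempView("products")

joined_df = spark.sql("""
    SELECT s.SaleID, s.ProductID, p.ProductName, s.Amount
    FROM sales s
    JOIN products p ON s.ProductID = p.ProductID
""")
joined_df.show()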

Notes

  • Ensure the Spark pool has access to the Lake Database metadata
  • Use display() in notebooks to visualize the output

Watch the Full Video Tutorial

Credit: Content created with the help of ChatGPT and Gemini.

PySpark Tutorial: Extract Data with regexp_extract() in PySpark | Regex Patterns Made Easy #pyspark

The regexp_extract() function allows you to use regular expressions to extract substrings from string columns in PySpark. This is extremely useful when working with emails, logs, or structured patterns like phone numbers or dates.

Sample Data

data = [
  ("user1@example.com", "[INFO] Login failed at 2024-04-15 10:23:45", "(123) 456-7890"),
  ("john.doe@mail.org", "[ERROR] Disk full at 2024-04-15 12:00:00", "125-456-7890"),
  ("alice@company.net", "[WARN] High memory at 2024-04-15 14:30:10", "123.456.7890")
]

columns = ["email", "log", "phone"]
df = spark.createDataFrame(data, columns)
df.show()

Extract Domain from Email

from pyspark.sql.functions import regexp_extract, col

df = df.withColumn("domain", regexp_extract(col("email"), "@(.*)", 1))
df.select("email", "domain").show(truncate=False)

Output: Extracts everything after @ in the email.

Extract Log Level

df = df.withColumn("log_level", regexp_extract(col("log"), "\\[(.*?)\\]", 1))
df.select("log", "log_level").show(truncate=False)

Output: Extracts the text inside square brackets (e.g., INFO, ERROR, WARN).

Extract Date from Log

pattern_date = "(\\d{4}-\\d{2}-\\d{2})"
df = df.withColumn("log_date", regexp_extract(col("log"), pattern_date, 1))
df.select("log", "log_date").show(truncate=False)

Output: Captures date in the format YYYY-MM-DD from the log string.

Extract Area Code from Phone Number

df = df.withColumn("area_code", regexp_extract(col("phone"), "(\\d{3})", 1))
df.select("phone", "area_code").show(truncate=False)

Output: Captures the first 3 digits (area code) from the phone number.
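
The same pattern-based approach extends to other fields; for example, a sketch that pulls the time portion out of the log line (the pattern is illustrative):

pattern_time = "(\\d{2}:\\d{2}:\\d{2})"
df = df.withColumn("log_time", regexp_extract(col("log"), pattern_time, 1))
df.select("log", "log_time").show(truncate=False)

Output: Captures the time in HH:MM:SS format from the log string.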

Watch the Full Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

What Are Views in Lake Databases | Azure Synapse Analytics Tutorial

Overview

In Azure Synapse Analytics, views in Lake Databases allow users to define logical query layers on top of data without duplicating it. These views are stored in the metastore and can be queried using both Spark and Serverless SQL Pools, enabling seamless analytics across structured and semi-structured data.

What Is a View in Lake Database?

A view in a Lake Database is a virtual table created by saving a SQL query. It does not store data physically but references data from existing Delta or Parquet tables.

How to Create a View in Lake Database

✅ Step 1: Use Spark SQL to Create a View

%%spark
CREATE OR REPLACE VIEW lake.vw_customer_summary AS
SELECT country, COUNT(*) AS customer_count
FROM lake.customer
GROUP BY country;

✅ Step 2: Query the View

%%sql
SELECT * FROM lake.vw_customer_summary;

Benefits of Views

  • No data duplication — views reference base tables
  • Reusable logic — great for summarizing and reporting
  • Accessible from both Spark and Serverless SQL Pools
  • Can be used to hide complexity or standardize queries

Use Cases

  • Creating data marts from Lakehouse tables
  • Power BI models referencing summarized data
  • Abstracting joins, filters, and calculations

⚠️ Limitations

  • Views in Lake Databases are read-only
  • Cannot include procedural logic or parameters
  • Underlying tables must be Delta/Parquet and accessible to Spark and Serverless

Watch the Video Tutorial

Credit: Content created with the help of ChatGPT and Gemini.

PySpark Tutorial: Using find_in_set() in PySpark | Search String Position in a Delimited List

In this tutorial, you'll learn how to use the find_in_set() function in PySpark to search for a specific string inside a comma-separated list. The function returns the position (1-based index) of the string if found, or 0 if not found.

Sample Data

data = [
    ("apple", "apple,banana,grape"),
    ("car", "She went to the park. Then she read a book."),
    ("pen", "He is smart. Isn't he? Yes, he is.")
]

columns = ["word", "sentence"]
df = spark.createDataFrame(data, columns)
df.show()

Output:

+-----+-----------------------------+
|word |sentence                     |
+-----+-----------------------------+
|apple|apple,banana,grape           |
|car  |She went to the park...      |
|pen  |He is smart. Isn't he? Yes...|
+-----+-----------------------------+

Apply find_in_set() to Search Word Index

from pyspark.sql.functions import find_in_set, col

df = df.withColumn("word_index", find_in_set(col("word"), col("sentence")))
df.show(truncate=False)

Output:

+-----+--------------------------+-----------+
|word |sentence                  |word_index |
+-----+--------------------------+-----------+
|apple|apple,banana,grape        |1          |
|car  |She went to the park...   |0          |
|pen  |He is smart. Isn't he?... |0          |
+-----+--------------------------+-----------+

Watch the Full Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Cross Warehouse Queries in Fabric – Query Across Warehouses with T-SQL | Microsoft Fabric Tutorial

In Microsoft Fabric, you can execute cross warehouse queries to join or fetch data from multiple warehouses within the same workspace using standard T-SQL. This is useful for creating unified reports, performing analytics across departments, or consolidating data silos.

What Are Cross Warehouse Queries?

  • Allows querying across two or more Fabric Warehouses
  • Supports read-only access to external warehouse tables
  • Uses a three-part naming convention (Warehouse.Schema.Table) to reference objects in another warehouse
  • Enables you to join data without ETL duplication

✅ Prerequisites

  • Both warehouses must be in the same workspace
  • You must have read permissions on the external warehouse
  • SQL editor must be used within a Fabric Warehouse context

Example Query

Assume you are connected to SalesWarehouse and want to pull data from FinanceWarehouse:

SELECT o.OrderID, o.CustomerID, f.Balance
FROM dbo.Orders o
JOIN [FinanceWarehouse].[dbo].[CustomerAccounts] f
  ON o.CustomerID = f.CustomerID;

This query joins the Orders table from the current warehouse with the CustomerAccounts table from the FinanceWarehouse.

Use Cases

  • Central reporting across multiple business domains
  • Joining data from regional and global warehouses
  • Performing data validation across systems
  • Avoiding ETL and staging for short-term analytics

⚠️ Limitations

  • Write operations (INSERT, UPDATE) are not allowed on external warehouse tables
  • Performance depends on table sizes and Fabric capacity
  • Cross queries are scoped to the same workspace only
  • Only supported in SQL Editor, not in Dataflow Gen2 or Power BI visuals directly

Tips for Efficient Cross Warehouse Queries

  • Use TOP or WHERE clauses to limit data early
  • Apply indexing strategies on foreign keys in both warehouses
  • Use SELECT ... INTO to materialize external data into local staging tables if needed repeatedly
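
For the last tip, a minimal sketch of materializing external data into a local staging table (schema, column names, and the filter are illustrative):

SELECT f.CustomerID, f.Balance
INTO dbo.Staging_CustomerAccounts
FROM [FinanceWarehouse].[dbo].[CustomerAccounts] AS f
WHERE f.Balance > 1000;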

Watch the Full Tutorial

Blog post written with the help of ChatGPT.

How to Query Tables from Lake Database Using Serverless SQL Pool in Azure Synapse | Azure Synapse Analytics Tutorial

Overview

In Azure Synapse Analytics, a Lake Database created using Spark pools can be queried directly using Serverless SQL Pools. This allows seamless integration of big data stored in Data Lake with familiar T-SQL syntax for reporting and analytics without data movement.

Why Query Lake Databases with Serverless SQL?

  • No need to provision or manage compute resources
  • Familiar T-SQL interface for querying big data
  • Cost-effective pay-per-query model
  • Supports querying Delta, Parquet, and CSV formats

Step-by-Step: Querying Spark-Created Tables

✅ Step 1: Create Table in Lake Database via Spark (Optional)

%%spark
CREATE TABLE lakehouse.customer (
  id INT,
  name STRING,
  country STRING
)
USING DELTA
LOCATION 'abfss://datalake@youraccount.dfs.core.windows.net/lake/customer'

✅ Step 2: Confirm Table Appears in Lake Database

Go to Synapse Studio > Data > Lake Database and confirm the table is visible under the specified database.

✅ Step 3: Query the Table Using Serverless SQL

-- Use Serverless SQL Pool
SELECT TOP 10 * 
FROM [YourLakeDB].[dbo].[customer];

✅ The table is accessible through the built-in metadata and requires no external table definition.

Tips

  • Ensure Spark and SQL pools have access to the same storage and metadata
  • Use TOP or WHERE clauses for performance optimization
  • Lake DB tables should be created under Spark with valid Delta/Parquet paths

Use Cases

  • Ad-hoc querying of curated data lakes
  • Feeding Power BI dashboards directly from Lake DB
  • Joining lake tables with Serverless SQL objects like external tables or views

Watch the Video Tutorial

Credit: Content created with the help of ChatGPT and Gemini.

Split_part() in PySpark Explained | Split Strings by Delimiter and Extract Specific Part #pyspark | PySpark Tutorial

In this tutorial, you'll learn how to use the split_part() function in PySpark to extract specific substrings by a given delimiter, such as pulling username from an email, or ZIP code from a location string.

Sample Data

data = [
    ("john.doe@example.com", "NY-10001-USA"),
    ("alice.smith@domain.org", "CA-90001-USA"),
    ("bob99@company.net", "TX-73301-USA")
]

columns = ["email", "location"]

df = spark.createDataFrame(data, columns)
df.show()

Output:

+----------------------+--------------+
| email                | location     |
+----------------------+--------------+
| john.doe@example.com | NY-10001-USA |
| alice.smith@domain.org | CA-90001-USA |
| bob99@company.net    | TX-73301-USA |
+----------------------+--------------+

Extract Specific Parts Using split_part()

from pyspark.sql.functions import split_part, col

df = df.withColumn("username", split_part(col("email"), "@", 1)) \
       .withColumn("domain", split_part(col("email"), "@", 2)) \
       .withColumn("state", split_part(col("location"), "-", 1)) \
       .withColumn("zip", split_part(col("location"), "-", 2)) \
       .withColumn("country", split_part(col("location"), "-", 3))

df.select("email", "username", "domain", "location", "state", "zip", "country").show(truncate=False)

Output:

+----------------------+----------------+------------------+--------------+-----+-----+-------+
| email                | username       | domain           | location     |state| zip |country|
+----------------------+----------------+------------------+--------------+-----+-----+-------+
| john.doe@example.com | john.doe       | example.com      | NY-10001-USA | NY  |10001| USA   |
| alice.smith@domain.org| alice.smith   | domain.org       | CA-90001-USA | CA  |90001| USA   |
| bob99@company.net    | bob99          | company.net      | TX-73301-USA | TX  |73301| USA   |
+----------------------+----------------+------------------+--------------+-----+-----+-------+

Watch the Full Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Time Travel in Fabric Warehouse – Query Historical Data with T-SQL | Microsoft Fabric Tutorial

Microsoft Fabric's Time Travel feature allows you to query past versions of your data warehouse tables using T-SQL. It’s a powerful capability for historical analysis, reproducibility, and data recovery.

What is Time Travel?

  • Query data as it existed at a specific point in time
  • Retention up to 30 days
  • Uses the FOR TIMESTAMP AS OF clause in SELECT statements
  • Read-only – no updates, deletes, or inserts allowed

✅ Syntax Example

SELECT * 
FROM dbo.dimension_customer
OPTION (FOR TIMESTAMP AS OF '2024-03-13T19:39:35.280');

Use this to get the current UTC timestamp in the correct format:

SELECT CONVERT(varchar, SYSUTCDATETIME(), 126) AS TruncatedUTC;

Use Cases for Time Travel

  • Historical trend analysis and reporting
  • Machine learning model reproducibility
  • Root cause investigations and debugging
  • Safe verification during ETL pipeline development
  • Compliance and audit traceability

Permissions

  • Roles: Admin, Member, Contributor, Viewer
  • Security enforced: RLS, CLS, and DDM apply even in time travel

⚠️ Limitations

  • Only SELECT statements are supported
  • Not supported inside views or CTAS queries
  • Only UTC timestamps are allowed (no local timezones)
  • Supports up to 3-digit millisecond precision
  • Not usable on Lakehouse endpoints or Power BI datasets directly

Watch the Full Tutorial

Blog post written with the help of ChatGPT.

Delta Table Tutorial in Azure Synapse: Create, Insert, Update, Delete & Merge with Spark SQL | Azure Synapse Analytics Tutorial

Overview

Delta Lake is a powerful open-source storage layer that brings ACID transactions to Apache Spark. In Azure Synapse, Delta Tables enable you to perform CREATE, INSERT, UPDATE, DELETE, and MERGE operations directly within your Spark SQL workflows.

1. Create a Delta Table

%%spark
CREATE TABLE lake.delta_customers (
  id INT,
  name STRING,
  country STRING
)
USING DELTA
LOCATION 'abfss://data@yourstorageaccount.dfs.core.windows.net/delta/customers'

➕ 2. Insert Records

%%spark
INSERT INTO lake.delta_customers VALUES
  (1, 'Alice', 'USA'),
  (2, 'Bob', 'Canada');

✏️ 3. Update Records

%%spark
UPDATE lake.delta_customers
SET country = 'United States'
WHERE name = 'Alice';

❌ 4. Delete Records

%%spark
DELETE FROM lake.delta_customers
WHERE id = 2;

5. Merge (Upsert) Records

%%spark
MERGE INTO lake.delta_customers AS target
USING (
  SELECT 2 AS id, 'Robert' AS name, 'UK' AS country
) AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

6. Query the Delta Table

%%sql
SELECT * FROM lake.delta_customers;

Why Use Delta in Synapse?

  • ACID Transactions
  • Time Travel and Versioning (see the sketch below)
  • Efficient Upserts (MERGE)
  • Seamless integration with Spark SQL
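
As a quick illustration of the time travel bullet above, an earlier version of a Delta table can be read back with PySpark; a minimal sketch, assuming the same storage path used when the table was created:

%%pyspark
# versionAsOf = 0 reads the table as of its first commit
hist_df = spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("abfss://data@yourstorageaccount.dfs.core.windows.net/delta/customers")
hist_df.show()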

Use Cases

  • Change Data Capture (CDC) pipelines
  • Slowly Changing Dimensions (SCD)
  • Transactional lakehouse architectures

Watch the Full Video Tutorial

Credit: Content created with the help of ChatGPT and Gemini.

Split Strings in PySpark | split(str, pattern, limit) Function Explained with Examples | PySpark Tutorial

In this tutorial, you’ll learn how to use split(str, pattern[, limit]) to break strings into arrays. We'll cover email parsing, splitting full names, and handling pipe-delimited data.

Sample Data

data = [
  ("john.doe@example.com", "John Doe", "john|doe|35|NY"),
  ("alice.smith@mail.org", "Alice Smith", "alice|smith|29|CA"),
  ("bob.jones@test.net", "Bob Jones", "bob|jones|42|TX")
]

columns = ["email", "full_name", "user_data"]
df = spark.createDataFrame(data, columns)
df.show()

Output:

+----------------------+-----------+--------------------+
| email                | full_name | user_data          |
+----------------------+-----------+--------------------+
| john.doe@example.com | John Doe  | john|doe|35|NY      |
| alice.smith@mail.org | Alice Smith | alice|smith|29|CA |
| bob.jones@test.net   | Bob Jones | bob|jones|42|TX     |
+----------------------+-----------+--------------------+

Split Email into Username and Domain

from pyspark.sql.functions import split

df = df.withColumn("email_parts", split("email", "@"))
df.select("email", "email_parts").show(truncate=False)

Output: Shows username and domain in an array

Split Full Name into First and Last Name

df = df.withColumn("name_split", split("full_name", " "))
df.select("full_name", "name_split").show(truncate=False)

Output: ["John", "Doe"], ["Alice", "Smith"], etc.

Split Pipe-Delimited User Data

df = df.withColumn("user_fields", split("user_data", "\\|"))
df.select("user_data", "user_fields").show(truncate=False)

Output: ["john", "doe", "35", "NY"], etc.

Watch the Full Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Spark Table Formats in Azure Synapse: Parquet, CSV, Delta, JSON, Avro & ORC Explained | Azure Synapse Analytics Tutorial

Overview

Apache Spark in Azure Synapse supports multiple table formats, each optimized for different use cases like performance, compatibility, or schema evolution. This post explores how to create and use Parquet, Delta, CSV, JSON, Avro, and ORC table formats.

Common File Formats and Their Use Cases

  • Parquet – Columnar storage, best for analytics
  • Delta – ACID compliant, supports updates/deletes
  • CSV – Simple format for interoperability
  • JSON – Best for semi-structured data
  • Avro – Row-based format, good for schema evolution
  • ORC – Columnar, highly optimized for big data

Creating Tables in Different Formats

✅ 1. Parquet Table

%%spark
CREATE TABLE lake.parquet_table (
  id INT, name STRING
)
USING PARQUET
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/parquet'

✅ 2. Delta Table

%%spark
CREATE TABLE lake.delta_table (
  id INT, product STRING
)
USING DELTA
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/delta'

✅ 3. CSV Table

%%spark
CREATE TABLE lake.csv_table (
  id INT, region STRING
)
USING CSV
OPTIONS ('header'='true')
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/csv'

✅ 4. JSON Table

%%spark
CREATE TABLE lake.json_table
USING JSON
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/json'

✅ 5. Avro Table

%%spark
CREATE TABLE lake.avro_table
USING AVRO
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/avro'

✅ 6. ORC Table

%%spark
CREATE TABLE lake.orc_table
USING ORC
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/orc'

Format Comparison Table

+---------+-----------------------------+------------------+-------------------+----------------------------+
| Format  | Type                        | Schema Evolution | ACID Transactions | Best Use Case              |
+---------+-----------------------------+------------------+-------------------+----------------------------+
| Parquet | Columnar                    | Partial          | No                | Read-heavy analytics       |
| Delta   | Columnar + transaction logs | Yes              | Yes               | ETL with updates/deletes   |
| CSV     | Row                         | No               | No                | Simple data exchange       |
| JSON    | Row                         | No               | No                | Log data or API responses  |
| Avro    | Row                         | Yes              | No                | Data pipelines & evolution |
| ORC     | Columnar                    | Yes              | No                | Big data batch processing  |
+---------+-----------------------------+------------------+-------------------+----------------------------+

Best Practices

  • Use Delta for ACID-compliant workloads
  • Use Parquet or ORC for optimized analytical queries
  • Use CSV only when human readability or legacy system compatibility is needed

Watch the Full Tutorial

Credit: Content created with the help of ChatGPT and Gemini.

String Formatting in PySpark | concat_ws, format_number, printf, repeat, lpad, rpad Explained | PySpark Tutorial

This tutorial demonstrates how to use PySpark string functions like concat_ws, format_number, format_string, printf, repeat, lpad, and rpad for formatting, combining, and manipulating string values in DataFrames.

Sample Data Creation

from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder.appName("StringFormattingDemo").getOrCreate()

data = [
    ("John", 950000, "USD", "Smith"),
    ("Alice", 120500, "EUR", "Brown"),
    ("Bob", 87999, "INR", "Taylor")
]

columns = ["first_name", "salary", "country", "last_name"]
df = spark.createDataFrame(data, columns)
df.show()

Sample DataFrame Output

+-----------+--------+-------+----------+
|first_name | salary |country| last_name|
+-----------+--------+-------+----------+
| John      | 950000 | USD   | Smith    |
| Alice     | 120500 | EUR   | Brown    |
| Bob       |  87999 | INR   | Taylor   |
+-----------+--------+-------+----------+

concat_ws()

df.withColumn("full_name", concat_ws("-", "first_name", "country")).show()

Output: Creates a new column combining first name and country with hyphen

format_number()

df.withColumn("formatted_salary", format_number("salary", 2)).show()

Output: Adds a formatted string version of salary with 2 decimal places

format_string()

df.withColumn("greeting", format_string("Hello %s %s", col("first_name"), col("country"))).show()

Output: Hello John USD

printf()

df.withColumn("price_tag", printf("Amount = %.2f", col("salary"))).show()

Output: Amount = 950000.00

repeat()

df.withColumn("excited", repeat(lit("test"), 2)).show()

Output: testtest

⬅️➡️ lpad() and rpad()

df.withColumn("last_lpad", lpad("last_name", 10, "*")) \
  .withColumn("first_rpad", rpad("first_name", 10, "-")).show()

Output: Pads strings to given length with characters on left/right

Watch the Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Query Data & Use Data Preview in SQL Editor in Fabric | Microsoft Fabric Tutorial For Data Engineers

Microsoft Fabric offers a powerful and intuitive SQL Editor for data engineers to query, preview, and analyze data in real time. In this tutorial, you’ll learn how to effectively use the SQL Editor and Data Preview features to work with warehouse tables.

✅ How to Open the SQL Editor

  1. Navigate to your Warehouse or Lakehouse in Microsoft Fabric
  2. Click on the SQL Editor tab from the top menu
  3. You can also open an existing SQL script or create a new one

✍️ Writing and Running SQL Queries

Use the editor to write T-SQL queries just like in SSMS or Azure Data Studio. Example:

SELECT TOP 100 * FROM dbo.Orders
WHERE OrderAmount > 500
ORDER BY OrderDate DESC;

Click Run (▶️) or press Ctrl + Enter to execute.

Using Data Preview

  • Right-click any table from the object explorer
  • Select Preview data
  • Fabric shows the first 100 rows with basic metadata
  • You can use this to validate structure and sample content

Filter, Sort & Explore Result Sets

  • Click column headers to sort ascending or descending
  • Use the filter icon to apply basic text or numeric filters
  • Pin important columns for easier analysis

Tips for Efficient Querying

  • Use TOP or LIMIT to avoid large result sets during testing
  • Alias long table names for cleaner syntax
  • Click on “Format SQL” for auto-formatting queries
  • Use INFORMATION_SCHEMA.TABLES to find all tables:
SELECT * FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'dbo';

Watch the Full Tutorial

Blog post written with the help of ChatGPT.

PySpark Tutorial: String Search in PySpark - How to Use contains, startswith, endswith, like, rlike, locate

Learn how to perform string filtering and matching in PySpark using functions like contains(), startswith(), endswith(), like, rlike, and locate().

Sample DataFrame

+--------------+----------------------+----------+
|customer_name |email                 |status    |
+--------------+----------------------+----------+
|Aamir Khan    |aamir@example.com     |Active    |
|Bob Smith     |bob.smith@mail.com    |Inactive  |
|Lisa Jones    |lisa.j@data.org       |Active    |
|Charlie Brown |charlie@peanuts.net   |Pending   |
|Alice Wonder  |alice.w@wonder.org    |Active    |
|Eve H         |eve.h@allabout.eve    |Inactive  |
|John Doe      |john.doe@unknown.com  |Active    |
+--------------+----------------------+----------+

✅ contains()

Find customers whose email contains 'example'.

df.filter(col("email").contains("example")).show(truncate=False)

✅ startswith() and endswith()

Filter emails starting with 'a' and ending with '.org'.

df.filter(col("email").startswith("a")).show(truncate=False)
df.filter(col("email").endswith(".org")).show(truncate=False)

✅ like()

Use SQL-style wildcard match.

df.filter(col("customer_name").like("%is%")).show(truncate=False)

✅ rlike()

Use regex pattern match (e.g., find email domains starting with m or d).

df.filter(col("email").rlike("^[md]")).show(truncate=False)

✅ locate()

Find the position of a substring in the email (e.g., the '@' position).

df.select("email", locate("@", col("email")).alias("at_pos")).show(truncate=False)

Watch the Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Restore Warehouse In-Place in Fabric – Step-by-Step with Restore Point | Microsoft Fabric Tutorial

In this Microsoft Fabric tutorial, you'll learn how to restore your warehouse in-place using system or user-defined restore points. This capability is essential for disaster recovery, testing environments, or rolling back undesired changes.

What is In-Place Restore?

  • Restores a warehouse to a prior point-in-time
  • Overwrites the existing warehouse (name stays the same)
  • Restores all components: models, queries, tables
  • Useful for undoing accidental deletes, corruption, or failed deployments

Types of Restore Points

  • System-Created: Every 8 hours automatically (up to 180 kept)
  • User-Defined: Manually created before/after major changes
  • Retention for both: 30 days

✅ Common Use Cases

  • Rollback after a schema error
  • Reset test environments to a clean state
  • Recover deleted data or fix bad ETL runs

Step-by-Step: How to Restore

  1. Open Microsoft Fabric Portal
  2. Navigate to your Warehouse
  3. Click Settings → Restore Points
  4. Select a system or user-defined restore point
  5. Click Restore In-Place

⚠️ If you receive Error 5064, try again after a few moments.

Permissions Matrix

  • View Restore Points: Admin, Member, Contributor, Viewer
  • Create Restore Points: Admin, Member, Contributor
  • Perform Restore: Admin only

Billing & Cost

  • Storage: Charges apply for Parquet storage under OneLake
  • Compute: Charged when restore point is created or restored
  • Billing is against Fabric Capacity

⚠️ Limitations

  • Cannot rename warehouse on restore (same name only)
  • Restore points expire after 30 days — no extensions
  • Cannot restore across different warehouses
  • Restore does not include downstream linked pipelines

Watch the Full Video Tutorial

Blog post written with the help of ChatGPT.

Clone Tables in Fabric Warehouse – Zero Copy Cloning with T-SQL | Microsoft Fabric Tutorial

In this tutorial, we’ll walk through how to use zero-copy table cloning in Microsoft Fabric Warehouse using simple T-SQL commands. Cloning tables is useful for development, testing, historical reporting, or machine learning experimentation—all without duplicating data.

What is a Zero-Copy Clone?

  • Creates a new table referencing the same data files (Parquet in OneLake)
  • No physical data duplication
  • Cloned tables behave as independent tables
  • Extremely fast and storage-efficient

✅ Syntax to Clone a Table

Clone Current State

CREATE TABLE dbo.Sales_Clone AS CLONE OF dbo.Sales;

Clone Table as of Specific Time

CREATE TABLE dbo.Sales_Clone_History 
AS CLONE OF dbo.Sales 
WITH ( SNAPSHOT_TIME = '2025-05-20T10:00:00Z' );

Use Cases for Cloning

  • Development and Testing
  • Point-in-time Historical Reporting
  • Recovery and troubleshooting
  • Data Archiving for compliance
  • Reproducible ML Training Pipelines

⚠️ Limitations to Keep in Mind

  • Cloning is limited to tables within the same warehouse
  • Cannot clone tables from Lakehouse SQL endpoints
  • Cloning entire schema or warehouse is not supported
  • Time-travel cloning must be within 30-day retention
  • Schema changes in the source table may block time-based clones

Tip: Clones are disposable. Deleting a clone does not impact the original table.

Watch the Full Tutorial

Blog post written with the help of ChatGPT.

Managed vs External Tables in Spark SQL Lake Database | Azure Synapse Analytics Tutorial

Overview

In Azure Synapse Analytics, you can create both Managed and External Tables within a Lake Database using Spark SQL. Understanding the difference is essential for managing data, storage, and performance effectively.

What Are Managed Tables?

Managed tables are those where both the metadata and the data are controlled by Synapse Spark. When you drop the table, the data and metadata are both deleted.

✅ Create a Managed Table

%%spark
CREATE TABLE lakehouse.managed_employee (
    id INT,
    name STRING,
    salary FLOAT
)

✔️ No LOCATION clause is specified — Spark decides where to store the data internally.

What Are External Tables?

External tables store only the metadata in the Lake Database while the data resides externally in user-defined paths like ADLS Gen2.

✅ Create an External Table

%%spark
CREATE TABLE lakehouse.external_sales (
    id INT,
    product STRING,
    amount DOUBLE
)
USING DELTA
LOCATION 'abfss://data@yourstorageaccount.dfs.core.windows.net/sales/external/'

✔️ Data lives outside Spark's internal catalog. Deleting the table will NOT remove the data.

Key Differences

+-------------------------+--------------------+----------------------+
| Feature                 | Managed Table      | External Table       |
+-------------------------+--------------------+----------------------+
| Data location           | Managed by Synapse | Custom ADLS path     |
| Drop table deletes data | Yes                | No                   |
| Typical usage           | Quick prototyping  | Production pipelines |
+-------------------------+--------------------+----------------------+

Best Practices

  • Use external tables for production where you need to retain data
  • Use managed tables for temporary analysis or rapid testing
  • Prefer Delta format for ACID support and versioning

Example: Read External Table

%%sql
SELECT * FROM lakehouse.external_sales;
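
To confirm whether a given table is managed or external, Spark SQL's DESCRIBE TABLE EXTENDED reports the table type; for example, using the table created above:

%%sql
DESCRIBE TABLE EXTENDED lakehouse.external_sales;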

Watch the Full Video Explanation

Credit: Content created with the help of ChatGPT and Gemini.

Substring Functions in PySpark: substr(), substring(), overlay(), left(), right() with Real Examples | PySpark Tutorial

In this tutorial, you'll learn how to use PySpark string functions like substr(), substring(), overlay(), left(), and right() to manipulate string columns in DataFrames.

1️⃣ Initialize Spark Session

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SubstringTutorial").getOrCreate()

2️⃣ Sample DataFrame

data = [("EMP-0001-NC", "maria.fernandez@healthcare.org"),
        ("EMP-0002-TX", "john.davis@fintech.com"),
        ("EMP-0003-CA", "lucy.liu@retailmart.net"),
        ("EMP-0004-FL", "peter.parker@dailybugle.com"),
        ("EMP-0005-NY", "bruce.wayne@gotham.org")]

columns = ["employee_code", "email"]

df = spark.createDataFrame(data, columns)
df.show(truncate=False)

Output:

+-------------+-----------------------------+
|employee_code|email                        |
+-------------+-----------------------------+
|EMP-0001-NC  |maria.fernandez@healthcare.org|
|EMP-0002-TX  |john.davis@fintech.com       |
|EMP-0003-CA  |lucy.liu@retailmart.net      |
|EMP-0004-FL  |peter.parker@dailybugle.com  |
|EMP-0005-NY  |bruce.wayne@gotham.org       |
+-------------+-----------------------------+

3️⃣ substr() Function

Extract substring starting at a specific position with defined length.

from pyspark.sql.functions import col, substr
df_substr = df.withColumn("emp_id", substr("employee_code", 5, 4))
df_substr.show(truncate=False)

Output:

+-------------+-----------------------------+------+
|employee_code|email                        |emp_id|
+-------------+-----------------------------+------+
|EMP-0001-NC  |maria.fernandez@healthcare.org|0001  |
|EMP-0002-TX  |john.davis@fintech.com       |0002  |
|EMP-0003-CA  |lucy.liu@retailmart.net      |0003  |
|EMP-0004-FL  |peter.parker@dailybugle.com  |0004  |
|EMP-0005-NY  |bruce.wayne@gotham.org       |0005  |
+-------------+-----------------------------+------+

4️⃣ substring() Function

Alias for substr() - works the same way.

from pyspark.sql.functions import substring
df_substring = df.withColumn("emp_id", substring("employee_code", 5, 4))
df_substring.show(truncate=False)

5️⃣ overlay() Function

Replaces part of the string with another string.

from pyspark.sql.functions import overlay, lit
df_overlay = df.withColumn("masked_email", overlay("email", lit("XXXXXX"), 1, 5))
df_overlay.show(truncate=False)

6️⃣ left() and right() Functions

Extract leftmost or rightmost characters from string.

from pyspark.sql.functions import left, right

df_left = df.withColumn("left_code", left("employee_code", 3))
df_right = df.withColumn("right_code", right("employee_code", 2))

df_left.show(truncate=False)
df_right.show(truncate=False)

Watch the Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

User Defined Inline Table-Valued Functions with CROSS APPLY | Microsoft Fabric Tutorial

In this Microsoft Fabric tutorial, we’ll demonstrate how to create and use inline table-valued functions (iTVFs) and apply them with CROSS APPLY. These are great tools for reusable logic and dynamic row-by-row evaluation.

What is a Function in SQL?

A function is a reusable SQL object that returns either a value or a table. In Fabric Warehouse, only inline table-valued functions (iTVFs) are currently supported.

Not Supported:

  • Scalar functions
  • Multi-statement table-valued functions

✅ Supported:

  • Inline Table-Valued Functions (iTVFs)

Step 1: Create Sample Orders Table

CREATE TABLE dbo.Orders (
    OrderID INT,
    CustomerName VARCHAR(100),
    OrderDate DATE,
    OrderAmount DECIMAL(10, 2)
);

Step 2: Insert Sample Data

INSERT INTO dbo.Orders VALUES
(1, 'Aamir', '2024-01-10', 1200.00),
(2, 'Sara', '2024-02-15', 800.00),
(3, 'John', '2024-03-20', 300.00),
(4, 'Mia', '2024-03-22', 1500.00),
(5, 'Aamir', '2024-04-01', 200.00);

Step 3: Create Discount Function with CROSS APPLY

This iTVF applies a 10% discount to a given amount:

CREATE FUNCTION dbo.fn_ApplyDiscount (@Amount DECIMAL(10,2))
RETURNS TABLE
AS
RETURN (
    SELECT @Amount * 0.90 AS DiscountedAmount
);

✅ Using CROSS APPLY

SELECT 
    o.OrderID, o.CustomerName, o.OrderAmount,
    d.DiscountedAmount
FROM dbo.Orders o
CROSS APPLY dbo.fn_ApplyDiscount(o.OrderAmount) d;

Step 4: Create Function to Filter Orders

This function returns orders where the amount is greater than or equal to a specified value:

CREATE FUNCTION dbo.fn_GetHighValueOrders (@MinAmount DECIMAL(10,2))
RETURNS TABLE
AS
RETURN (
    SELECT OrderID, CustomerName, OrderAmount
    FROM dbo.Orders
    WHERE OrderAmount >= @MinAmount
);

✅ Usage Example:

SELECT * FROM dbo.fn_GetHighValueOrders(200.00);
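
The two functions can also be combined; a sketch that keeps only higher-value orders and then applies the discount to each returned row:

SELECT
    h.OrderID, h.CustomerName, h.OrderAmount,
    d.DiscountedAmount
FROM dbo.fn_GetHighValueOrders(500.00) AS h
CROSS APPLY dbo.fn_ApplyDiscount(h.OrderAmount) AS d;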

Optional Cleanup

-- DROP FUNCTION dbo.fn_ApplyDiscount;
-- DROP FUNCTION dbo.fn_GetHighValueOrders;
-- DROP TABLE dbo.Orders;

Watch the Full Tutorial

Blog post written with the help of ChatGPT.

Lake Database in Azure Synapse: Spark SQL vs Delta vs Parquet | Full Demo with Create & Insert | Azure Synapse Analytics Tutorial

Overview

Azure Synapse Lake Databases offer powerful data lake querying capabilities using Spark SQL. In this tutorial, we demonstrate how to:

  • Create and use Lake Databases
  • Compare Parquet, Delta, and Spark SQL tables
  • Insert, update, and read data using Apache Spark in Synapse

Step 1: Create Spark SQL Table (Unmanaged)

%%spark
CREATE TABLE IF NOT EXISTS lakehouse.spark_table (
    id INT, name STRING, age INT
)
USING PARQUET
LOCATION 'abfss://lake@storageaccount.dfs.core.windows.net/tables/spark_table'

Step 2: Create Delta Table

%%spark
CREATE TABLE lakehouse.delta_table (
    id INT, product STRING, quantity INT
)
USING DELTA
LOCATION 'abfss://lake@storageaccount.dfs.core.windows.net/tables/delta_table'

Step 3: Insert Data into Tables

%%spark
INSERT INTO lakehouse.spark_table VALUES (1, 'Ali', 30), (2, 'Zara', 25);
INSERT INTO lakehouse.delta_table VALUES (1, 'Phone', 10), (2, 'Tablet', 20);

Step 4: Query Tables

%%sql
SELECT * FROM lakehouse.spark_table;
SELECT * FROM lakehouse.delta_table;

✏️ Step 5: Update Delta Table

Delta supports ACID transactions, including UPDATE and DELETE operations.

%%spark
UPDATE lakehouse.delta_table SET quantity = 15 WHERE id = 1;
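
A DELETE against the same Delta table works the same way:

%%spark
DELETE FROM lakehouse.delta_table WHERE id = 2;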

Comparison: Parquet vs Delta vs Spark Tables

+------------------------+-----------+---------------------------------+-----------------+
| Feature                | Parquet   | Delta                           | Spark SQL Table |
+------------------------+-----------+---------------------------------+-----------------+
| Storage format         | Columnar  | Delta (based on Parquet + logs) | Varies          |
| ACID support           | No        | Yes                             | No              |
| Supports update/delete | No        | Yes                             | No              |
| Performance            | Fast read | Fast read + version control     | Simple          |
+------------------------+-----------+---------------------------------+-----------------+

Best Practices

  • Use Delta Lake for write-heavy transactional workloads
  • Use Parquet for optimized reads and archival
  • Validate permissions for Lake DB and storage paths

Watch the Full Demo

Credit: Content created with the help of ChatGPT and Gemini.

How to Clean Strings in PySpark | lower(), trim(), initcap() Explained with Real Data

What You’ll Learn

  • How to use lower() to convert text to lowercase
  • How to use trim() to remove leading/trailing spaces
  • How to use initcap() to capitalize first letter of each word
  • Chaining multiple string functions

Sample Data

data = [
    (" Aamir ",),
    ("LISA ",),
    ("  charLie   ",),
    ("BOB",),
    (" eli",)
]
columns = ["raw_name"]
df = spark.createDataFrame(data, columns)
df.show(truncate=False)

Output:
+-----------+
|raw_name   |
+-----------+
| Aamir     |
|LISA       |
|  charLie  |
|BOB        |
| eli       |
+-----------+

Cleaned Data using PySpark Functions

1️⃣ Apply trim()

from pyspark.sql.functions import trim
df_trimmed = df.withColumn("trimmed", trim("raw_name"))
df_trimmed.show(truncate=False)

2️⃣ Apply lower() and upper()

from pyspark.sql.functions import lower, upper
df_lower = df_trimmed.withColumn("lowercase", lower("trimmed"))
df_upper = df_trimmed.withColumn("uppercase", upper("trimmed"))
df_lower.show(truncate=False)
df_upper.show(truncate=False)

3️⃣ Apply initcap()

from pyspark.sql.functions import initcap
df_initcap = df_trimmed.withColumn("titlecase", initcap("trimmed"))
df_initcap.show(truncate=False)
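
The chaining mentioned at the start can be done in a single expression; a minimal sketch combining the three functions:

from pyspark.sql.functions import trim, lower, initcap
df_clean = df.withColumn("clean_name", initcap(lower(trim("raw_name"))))
df_clean.show(truncate=False)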

Watch the Full Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Stored Procedure in Fabric Warehouse – Input & Output Parameters SQL Demo | Microsoft Fabric Tutorial

In this Microsoft Fabric tutorial, you'll learn how to create and execute a stored procedure with input and output parameters. This is especially useful when you want to encapsulate query logic and reuse it across reports, data pipelines, and dashboards.

What is a Stored Procedure?

A stored procedure is a reusable set of SQL statements saved in the database. It can take input parameters, return output values, and perform operations like reading or modifying data.

✅ Benefits of Stored Procedures:

  • Encapsulates business logic
  • Promotes reuse and consistency
  • Supports input and output parameters
  • Can include conditional and complex logic

Step 1: Create a Sample Table

CREATE TABLE dbo.Orders (
    OrderID INT,
    CustomerName VARCHAR(100),
    OrderAmount DECIMAL(10,2)
);

Step 2: Insert Sample Data

INSERT INTO dbo.Orders VALUES
(1, 'Aamir', 500.00),
(2, 'Sara', 750.00),
(3, 'John', 1200.00);

⚙️ Step 3: Create the Stored Procedure

This procedure accepts a customer name as input and returns the total order amount as output:

CREATE OR ALTER PROCEDURE dbo.GetCustomerTotalOrders
    @CustomerName VARCHAR(100),
    @TotalAmount DECIMAL(10,2) OUTPUT
AS
BEGIN
    SELECT @TotalAmount = SUM(OrderAmount)
    FROM dbo.Orders
    WHERE CustomerName = @CustomerName;
END;

Step 4: Execute the Stored Procedure

-- Declare output variable
DECLARE @Total DECIMAL(10,2);

-- Execute with input and capture output
EXEC dbo.GetCustomerTotalOrders
    @CustomerName = 'Aamir',
    @TotalAmount = @Total OUTPUT;

-- Display the result
SELECT 'Total Order Amount for Aamir' AS Description, @Total AS TotalAmount;

Optional Cleanup

-- DROP PROCEDURE dbo.GetCustomerTotalOrders;
-- DROP TABLE dbo.Orders;

View Final Orders Table

SELECT * FROM dbo.Orders;

Watch the Full Tutorial

Blog post written with the help of ChatGPT.

Export Data from Synapse Dedicated SQL Pool to Single CSV File in ADLS Gen2 | Azure Synapse Analytics Tutorial

Overview

In Azure Synapse Analytics, it’s common to export data from a Dedicated SQL Pool to Azure Data Lake Storage Gen2 in CSV format for reporting, archival, or integration with downstream tools. This tutorial demonstrates how to do that and ensure the output is saved as a single CSV file.

Step-by-Step: Export to Single CSV File

✅ 1. Query Data Using Synapse Spark

%%pyspark
df = spark.read \
    .synapsesql("yourdedicatedpool.dbo.SalesData")
df.show()

✅ 2. Repartition to Single File

df_single = df.repartition(1)

This ensures only one output file is generated when writing to storage.

✅ 3. Write to ADLS Gen2 as CSV

df_single.write \
    .mode("overwrite") \
    .option("header", "true") \
    .csv("abfss://export@yourstorageaccount.dfs.core.windows.net/sales/csv/")

Notes

  • repartition(1) should only be used for small to medium datasets due to single-threaded write.
  • Azure Storage Explorer can be used to rename the output file (e.g., from part-00000.csv to sales_report.csv); a notebook-based alternative is sketched after this list.
  • Make sure Synapse workspace has access permissions (RBAC or SAS).
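
As an alternative to renaming in Azure Storage Explorer, the single part file can be renamed from the notebook itself; a minimal sketch, assuming the Synapse mssparkutils file-system utilities are available and using the illustrative output path from above:

%%pyspark
from notebookutils import mssparkutils

out_dir = "abfss://export@yourstorageaccount.dfs.core.windows.net/sales/csv/"

# Locate the single part-*.csv file written by Spark and move it to a friendly name
part_file = [f.path for f in mssparkutils.fs.ls(out_dir)
             if f.name.startswith("part-") and f.name.endswith(".csv")][0]
mssparkutils.fs.mv(part_file, out_dir + "sales_report.csv")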

Use Cases

  • Exporting summarized data for business intelligence
  • Staging datasets for Power BI or external tools
  • Archiving daily/weekly sales reports

Watch the Video Tutorial

Credit: Content created with the help of ChatGPT and Gemini.

PySpark Tutorial: Master PySpark Sorting: sort(), asc(), desc() Explained with Examples #pyspark

In this post, you'll learn how to use various sort functions in PySpark to order data by ascending/descending, and control the handling of nulls. This guide is perfect for anyone working with big data in Spark!

1️⃣ Setup

from pyspark.sql import SparkSession
from pyspark.sql.functions import asc, desc, asc_nulls_first, desc_nulls_last

spark = SparkSession.builder.appName("SortFunctionsDemo").getOrCreate()

2️⃣ Sample Data

sample_data = [
    ("Alice", 5000),
    ("Bob", None),
    ("Cara", 6200),
    ("Dan", None),
    ("Eli", 4500),
    ("Fay", 7000)
]

columns = ["employee", "sales"]
df = spark.createDataFrame(sample_data, columns)
df.show()

Output:

+--------+-----+
|employee|sales|
+--------+-----+
|   Alice| 5000|
|     Bob| null|
|    Cara| 6200|
|     Dan| null|
|     Eli| 4500|
|     Fay| 7000|
+--------+-----+

3️⃣ Sort by Ascending

df.orderBy(asc("sales")).show()

4️⃣ Sort with Nulls First

df.orderBy(asc_nulls_first("sales")).show()

5️⃣ Sort with Nulls Last

df.orderBy(desc_nulls_last("sales")).show()

6️⃣ Descending Order

df.orderBy(desc("sales")).show()

Video Tutorial

© 2024 Aamir Shahzad — All rights reserved.
Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Insert, Update & Delete in Fabric Warehouse – Full SQL Demo for Beginners | Microsoft Fabric Tutorial

In this tutorial, you'll learn how to use the core DML (Data Manipulation Language) commands — INSERT, UPDATE, and DELETE — in Microsoft Fabric Warehouse. These are the foundation of interacting with structured data in your SQL environment.

What Are DML Commands?

  • INSERT: Adds new rows into a table
  • UPDATE: Modifies existing rows
  • DELETE: Removes rows

Fabric Warehouse fully supports these operations using standard T-SQL syntax.

Step 1: Create a Sample Table

CREATE TABLE dbo.Products (
    ProductID INT,
    ProductName VARCHAR(100),
    Category VARCHAR(50),
    Price DECIMAL(10, 2),
    Stock INT
);

Step 2: Insert Sample Data

INSERT INTO dbo.Products (ProductID, ProductName, Category, Price, Stock) VALUES
(1, 'Laptop', 'Electronics', 1200.00, 10),
(2, 'Smartphone', 'Electronics', 800.00, 25),
(3, 'Desk Chair', 'Furniture', 150.00, 40),
(4, 'Monitor', 'Electronics', 300.00, 15);

Verify Insert:

SELECT * FROM dbo.Products;

✏️ Step 3: Update Data

Increase the price of all Electronics items by 10%:

UPDATE dbo.Products
SET Price = Price * 1.10
WHERE Category = 'Electronics';

Verify Update:

SELECT * FROM dbo.Products;

❌ Step 4: Delete Data

Delete all products with stock less than 20:

DELETE FROM dbo.Products
WHERE Stock < 20;

Verify Delete:

SELECT * FROM dbo.Products;

Watch the Full Tutorial

Blog post written with the help of ChatGPT.

Read CSV File & Load to Dedicated SQL Pool Table Using Apache Spark Pool Notebook | Azure Synapse Analytics Tutorial

Overview

In this tutorial, you'll learn how to use Apache Spark Pools in Azure Synapse Analytics to read a CSV file from ADLS Gen2 and write that data into a Dedicated SQL Pool table. This is useful for ETL and data ingestion pipelines within your Synapse workspace.

Step-by-Step Workflow

✅ Step 1: Read CSV File from ADLS Gen2

%%pyspark
csv_df = spark.read.option("header", "true") \
    .csv("abfss://data@yourstorageaccount.dfs.core.windows.net/input/employees.csv")
csv_df.show()

✅ Step 2: Write Data to Dedicated SQL Pool

csv_df.write \
    .mode("overwrite") \
    .synapsesql("yourdedicatedpool.dbo.Employee")

Note: The `.synapsesql()` method is available by default when using Spark Pools in Synapse with built-in connectors.

✅ Step 3: Validate Inserted Data

%%pyspark
check_df = spark.read.synapsesql("yourdedicatedpool.dbo.Employee")
check_df.show()

Tips

  • Ensure the destination table exists before writing
  • Use overwrite or append mode depending on your use case
  • Review schema compatibility between DataFrame and SQL table

Use Cases

  • Bulk load historical data into data warehouse
  • ETL transformation using PySpark before loading
  • Automated data ingestion from CSV sources

Watch the Video Tutorial

Credit: Content created with the help of ChatGPT and Gemini.

PySpark Window Functions Explained | Rank, Dense_Rank, Lead, Lag, NTILE | Real-World Demo | PySpark Tutorial

This blog post will walk you through essential PySpark window functions like rank(), lag(), lead(), dense_rank(), and ntile() using practical examples. These functions are crucial for complex data analytics, ordering, and partitioning scenarios.

Sample Dataset

+--------+----------+-----+----------+
|Employee|Department|Sales|Sale_Date |
+--------+----------+-----+----------+
|Alice   |Sales     |5000 |2024-01-01|
|Ben     |Sales     |5500 |2024-01-01|
|Cara    |Marketing |4800 |2024-01-01|
|Dan     |HR        |5300 |2024-01-01|
|Alice   |Sales     |6200 |2024-01-02|
|Ben     |Sales     |5500 |2024-01-02|
|Cara    |Marketing |5100 |2024-01-02|
|Dan     |HR        |6000 |2024-01-02|
|Alice   |Sales     |7000 |2024-01-03|
|Ben     |Sales     |6400 |2024-01-03|
+--------+----------+-----+----------+

⚙️ Code Example

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number, rank, dense_rank, percent_rank, lag, lead, ntile
from pyspark.sql.window import Window

# Start Spark session
spark = SparkSession.builder.appName("WindowFunctionsDemo").getOrCreate()

# Sample DataFrame
data = [
    ("Alice", "Sales", 5000, "2024-01-01"),
    ("Ben", "Sales", 5500, "2024-01-01"),
    ("Cara", "Marketing", 4800, "2024-01-01"),
    ("Dan", "HR", 5300, "2024-01-01"),
    ("Alice", "Sales", 6200, "2024-01-02"),
    ("Ben", "Sales", 5500, "2024-01-02"),
    ("Cara", "Marketing", 5100, "2024-01-02"),
    ("Dan", "HR", 6000, "2024-01-02"),
    ("Alice", "Sales", 7000, "2024-01-03"),
    ("Ben", "Sales", 6400, "2024-01-03"),
]
schema = ["employee", "department", "sales", "sale_date"]
df = spark.createDataFrame(data, schema)
df.show()

Applying Window Functions

window_spec = Window.partitionBy("employee").orderBy("sale_date")

df.withColumn("rank", rank().over(window_spec)).show()
df.withColumn("dense_rank", dense_rank().over(window_spec)).show()
df.withColumn("percent_rank", percent_rank().over(window_spec)).show()
df.withColumn("lag", lag("sales", 1).over(window_spec)).show()
df.withColumn("lead", lead("sales", 1).over(window_spec)).show()
df.withColumn("ntile", ntile(2).over(window_spec)).show()

Watch Full Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

How to Write, Read & Append Data to Dedicated SQL Pool Table Using Notebook in Azure Synapse | Azure Synapse Analytics Tutorial

Overview

Azure Synapse Notebooks allow seamless interaction with Dedicated SQL Pools using Spark. You can create and manage SQL tables from PySpark DataFrames, perform bulk writes, and query them using Spark SQL or Synapse SQL on-demand.

Step-by-Step Examples

✅ 1. Create DataFrame

%%pyspark
data = [(1, "Aamir", 5000), (2, "Lisa", 6000)]
columns = ["id", "name", "salary"]
df = spark.createDataFrame(data, columns)

✅ 2. Write Data to Dedicated SQL Pool

Use the built-in synapsesql connector for Spark Pools:

df.write \
  .mode("overwrite") \
  .synapsesql("yourdedicatedpool.dbo.Employee")

✅ 3. Read Data from SQL Pool Table

df_read = spark.read \
  .synapsesql("yourdedicatedpool.dbo.Employee")

df_read.show()

✅ 4. Append Data to Existing Table

df_append = spark.createDataFrame([(3, "John", 7000)], ["id", "name", "salary"])
df_append.write \
  .mode("append") \
  .synapsesql("yourdedicatedpool.dbo.Employee")

Tips

  • Use overwrite to replace and append to add rows
  • Validate schema compatibility before appending
  • Check Synapse firewall rules for access

Use Cases

  • ETL workflows from Spark to Dedicated SQL Pool
  • Reading warehouse data for ML or transformation
  • Bulk appending logs or batch jobs

Watch the Video Tutorial

Credit: Content created with the help of ChatGPT and Gemini.