Time Travel in Fabric Warehouse – Query Historical Data with T-SQL | Microsoft Fabric Tutorial

Time Travel in Fabric Warehouse – Query Historical Data with T-SQL

Microsoft Fabric's Time Travel feature allows you to query past versions of your data warehouse tables using T-SQL. It’s a powerful capability for historical analysis, reproducibility, and data recovery.

What is Time Travel?

  • Query data as it existed at a specific point in time
  • Retention up to 30 days
  • Uses the FOR TIMESTAMP AS OF clause in SELECT statements
  • Read-only – no updates, deletes, or inserts allowed

✅ Syntax Example

SELECT * 
FROM dbo.dimension_customer
OPTION (FOR TIMESTAMP AS OF '2024-03-13T19:39:35.280');

Use this to get the current UTC timestamp in the correct format:

SELECT CONVERT(varchar, SYSUTCDATETIME(), 126) AS TruncatedUTC;

Use Cases for Time Travel

  • Historical trend analysis and reporting
  • Machine learning model reproducibility
  • Root cause investigations and debugging
  • Safe verification during ETL pipeline development
  • ✅ Compliance and audit traceability

Permissions

  • Roles: Admin, Member, Contributor, Viewer
  • Security enforced: RLS, CLS, and DDM apply even in time travel

⚠️ Limitations

  • Only SELECT statements are supported
  • Not supported inside views or CTAS queries
  • Only UTC timestamps are allowed (no local timezones)
  • Supports up to 3-digit millisecond precision
  • Not usable on Lakehouse endpoints or Power BI datasets directly

Watch the Full Tutorial

Blog post written with the help of ChatGPT.

Delta Table Tutorial in Azure Synapse: Create, Insert, Update, Delete & Merge with Spark SQL | Azure Synapse Analytics Tutorial

Delta Table Tutorial in Azure Synapse: Create, Insert, Update, Delete, Merge

Delta Table Tutorial in Azure Synapse: Create, Insert, Update, Delete & Merge with Spark SQL

Overview

Delta Lake is a powerful open-source storage layer that brings ACID transactions to Apache Spark. In Azure Synapse, Delta Tables enable you to perform CREATE, INSERT, UPDATE, DELETE, and MERGE operations directly within your Spark SQL workflows.

1. Create a Delta Table

%%spark
CREATE TABLE lake.delta_customers (
  id INT,
  name STRING,
  country STRING
)
USING DELTA
LOCATION 'abfss://data@yourstorageaccount.dfs.core.windows.net/delta/customers'

➕ 2. Insert Records

%%spark
INSERT INTO lake.delta_customers VALUES
  (1, 'Alice', 'USA'),
  (2, 'Bob', 'Canada');

✏️ 3. Update Records

%%spark
UPDATE lake.delta_customers
SET country = 'United States'
WHERE name = 'Alice';

❌ 4. Delete Records

%%spark
DELETE FROM lake.delta_customers
WHERE id = 2;

5. Merge (Upsert) Records

%%spark
MERGE INTO lake.delta_customers AS target
USING (
  SELECT 2 AS id, 'Robert' AS name, 'UK' AS country
) AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

6. Query the Delta Table

%%sql
SELECT * FROM lake.delta_customers;

Why Use Delta in Synapse?

  • ACID Transactions
  • Time Travel and Versioning (see the sketch after this list)
  • Efficient Upserts (MERGE)
  • Seamless integration with Spark SQL
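
Time travel is listed above but not demonstrated in this post; here is a minimal PySpark sketch (an addition, reusing the table and path created earlier) that inspects the Delta history and reads an older version by path:

%%pyspark
# Show the Delta transaction log history for the table created above
spark.sql("DESCRIBE HISTORY lake.delta_customers").show(truncate=False)

# Read an earlier version of the table straight from its storage location
old_df = (spark.read.format("delta")
          .option("versionAsOf", 0)   # version 0 = the table as originally created
          .load("abfss://data@yourstorageaccount.dfs.core.windows.net/delta/customers"))
old_df.show()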

Use Cases

  • Change Data Capture (CDC) pipelines
  • Slowly Changing Dimensions (SCD)
  • Transactional lakehouse architectures

Watch the Full Video Tutorial

Credit: Content created with the help of ChatGPT and Gemini.

Split Strings in PySpark | split(str, pattern, limit) Function Explained with Examples | PySpark Tutorial

String Splitting in PySpark | split(str, pattern[, limit]) Explained

String Splitting in PySpark

In this tutorial, you’ll learn how to use split(str, pattern[, limit]) to break strings into arrays. We'll cover email parsing, splitting full names, and handling pipe-delimited data.

Sample Data

data = [
  ("john.doe@example.com", "John Doe", "john|doe|35|NY"),
  ("alice.smith@mail.org", "Alice Smith", "alice|smith|29|CA"),
  ("bob.jones@test.net", "Bob Jones", "bob|jones|42|TX")
]

columns = ["email", "full_name", "user_data"]
df = spark.createDataFrame(data, columns)
df.show()

Output:

+----------------------+-------------+-------------------+
| email                | full_name   | user_data         |
+----------------------+-------------+-------------------+
| john.doe@example.com | John Doe    | john|doe|35|NY    |
| alice.smith@mail.org | Alice Smith | alice|smith|29|CA |
| bob.jones@test.net   | Bob Jones   | bob|jones|42|TX   |
+----------------------+-------------+-------------------+

Split Email into Username and Domain

from pyspark.sql.functions import split

df = df.withColumn("email_parts", split("email", "@"))
df.select("email", "email_parts").show(truncate=False)

Output: Shows username and domain in an array

Split Full Name into First and Last Name

df = df.withColumn("name_split", split("full_name", " "))
df.select("full_name", "name_split").show(truncate=False)

Output: ["John", "Doe"], ["Alice", "Smith"], etc.

Split Pipe-Delimited User Data

df = df.withColumn("user_fields", split("user_data", "\\|"))
df.select("user_data", "user_fields").show(truncate=False)

Output: ["john", "doe", "35", "NY"], etc.
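
split() also takes an optional third argument, limit, which caps the number of array elements; anything after the last allowed split is kept in the final element. A short added example on the same user_data column:

# limit=2: split only on the first pipe; the remainder stays in the second element
df = df.withColumn("name_and_rest", split("user_data", "\\|", 2))
df.select("user_data", "name_and_rest").show(truncate=False)

Output: ["john", "doe|35|NY"], etc.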

Watch the Full Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Spark Table Formats in Azure Synapse: Parquet, CSV, Delta, JSON, Avro & ORC Explained | Azure Synapse Analytics Tutorial

Spark Table Formats in Azure Synapse: Parquet, CSV, Delta, JSON, Avro & ORC

Spark Table Formats in Azure Synapse: Parquet, CSV, Delta, JSON, Avro & ORC Explained

Overview

Apache Spark in Azure Synapse supports multiple table formats, each optimized for different use cases like performance, compatibility, or schema evolution. This post explores how to create and use Parquet, Delta, CSV, JSON, Avro, and ORC table formats.

Common File Formats and Their Use Cases

  • Parquet – Columnar storage, best for analytics
  • Delta – ACID compliant, supports updates/deletes
  • CSV – Simple format for interoperability
  • JSON – Best for semi-structured data
  • Avro – Row-based format, good for schema evolution
  • ORC – Columnar, highly optimized for big data

Creating Tables in Different Formats

✅ 1. Parquet Table

%%spark
CREATE TABLE lake.parquet_table (
  id INT, name STRING
)
USING PARQUET
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/parquet'

✅ 2. Delta Table

%%spark
CREATE TABLE lake.delta_table (
  id INT, product STRING
)
USING DELTA
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/delta'

✅ 3. CSV Table

%%spark
CREATE TABLE lake.csv_table (
  id INT, region STRING
)
USING CSV
OPTIONS ('header'='true')
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/csv'

✅ 4. JSON Table

%%spark
CREATE TABLE lake.json_table
USING JSON
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/json'

✅ 5. Avro Table

%%spark
CREATE TABLE lake.avro_table
USING AVRO
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/avro'

✅ 6. ORC Table

%%spark
CREATE TABLE lake.orc_table
USING ORC
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/orc'

Format Comparison Table

| Format  | Type                        | Supports Schema Evolution | ACID Transactions | Best Use Case              |
|---------|-----------------------------|---------------------------|-------------------|----------------------------|
| Parquet | Columnar                    | Partial                   | No                | Read-heavy analytics       |
| Delta   | Columnar + transaction logs | Yes                       | Yes               | ETL with updates/deletes   |
| CSV     | Row                         | No                        | No                | Simple data exchange       |
| JSON    | Row                         | No                        | No                | Log data or API responses  |
| Avro    | Row                         | Yes                       | No                | Data pipelines & evolution |
| ORC     | Columnar                    | Yes                       | No                | Big data batch processing  |
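
If you prefer the DataFrame API over CREATE TABLE statements, the same formats can be compared by writing one small DataFrame out in each of them. A sketch (the paths and storage account are placeholders, and Avro assumes the spark-avro package that ships with Synapse Spark pools):

%%pyspark
df = spark.createDataFrame([(1, "Laptop"), (2, "Monitor")], ["id", "product"])

base = "abfss://data@account.dfs.core.windows.net/formats"
df.write.mode("overwrite").parquet(f"{base}/parquet_demo")
df.write.mode("overwrite").format("delta").save(f"{base}/delta_demo")
df.write.mode("overwrite").option("header", "true").csv(f"{base}/csv_demo")
df.write.mode("overwrite").json(f"{base}/json_demo")
df.write.mode("overwrite").format("avro").save(f"{base}/avro_demo")
df.write.mode("overwrite").orc(f"{base}/orc_demo")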

Best Practices

  • Use Delta for ACID-compliant workloads
  • Use Parquet or ORC for optimized analytical queries
  • Use CSV only when human readability or legacy system compatibility is needed

Watch the Full Tutorial

Credit: Content created with the help of ChatGPT and Gemini.

String Formatting in PySpark | concat_ws, format_number, printf, repeat, lpad, rpad Explained | PySpark Tutorial

String Formatting in PySpark | concat_ws, format_number, printf, repeat, lpad, rpad

String Formatting in PySpark

This tutorial demonstrates how to use PySpark string functions like concat_ws, format_number, format_string, printf, repeat, lpad, and rpad for formatting, combining, and manipulating string values in DataFrames.

Sample Data Creation

from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder.appName("StringFormattingDemo").getOrCreate()

data = [
    ("John", 950000, "USD", "Smith"),
    ("Alice", 120500, "EUR", "Brown"),
    ("Bob", 87999, "INR", "Taylor")
]

columns = ["first_name", "salary", "country", "last_name"]
df = spark.createDataFrame(data, columns)
df.show()

Sample DataFrame Output

+-----------+--------+-------+----------+
|first_name | salary |country| last_name|
+-----------+--------+-------+----------+
| John      | 950000 | USD   | Smith    |
| Alice     | 120500 | EUR   | Brown    |
| Bob       |  87999 | INR   | Taylor   |
+-----------+--------+-------+----------+

concat_ws()

df.withColumn("full_name", concat_ws("-", "first_name", "country")).show()

Output: Creates a new column combining first name and country with hyphen

format_number()

df.withColumn("formatted_salary", format_number("salary", 2)).show()

Output: Adds a formatted string version of salary with 2 decimal places

format_string()

df.withColumn("greeting", format_string("Hello %s %s", col("first_name"), col("country"))).show()

Output: Hello John USD

printf()

df.withColumn("price_tag", printf("Amount = %.2f", col("salary"))).show()

Output: Amount = 950000.00

repeat()

df.withColumn("excited", repeat(lit("test"), 2)).show()

Output: testtest

⬅️➡️ lpad() and rpad()

df.withColumn("last_lpad", lpad("last_name", 10, "*")) \
  .withColumn("first_rpad", rpad("first_name", 10, "-")).show()

Output: Pads strings to given length with characters on left/right

Watch the Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Query Data & Use Data Preview in SQL Editor in Fabric | Microsoft Fabric Tutorial For Data Engineers

Query Data & Use Data Preview in SQL Editor in Fabric | Microsoft Fabric Tutorial

Query Data & Use Data Preview in SQL Editor in Fabric | Microsoft Fabric Tutorial For Data Engineers

Microsoft Fabric offers a powerful and intuitive SQL Editor for data engineers to query, preview, and analyze data in real time. In this tutorial, you’ll learn how to effectively use the SQL Editor and Data Preview features to work with warehouse tables.

✅ How to Open the SQL Editor

  1. Navigate to your Warehouse or Lakehouse in Microsoft Fabric
  2. Click on the SQL Editor tab from the top menu
  3. You can also open an existing SQL script or create a new one

✍️ Writing and Running SQL Queries

Use the editor to write T-SQL queries just like in SSMS or Azure Data Studio. Example:

SELECT TOP 100 * FROM dbo.Orders
WHERE OrderAmount > 500
ORDER BY OrderDate DESC;

Click Run (▶️) or press Ctrl + Enter to execute.

Using Data Preview

  • Right-click any table from the object explorer
  • Select Preview data
  • Fabric shows the first 100 rows with basic metadata
  • You can use this to validate structure and sample content

Filter, Sort & Explore Result Sets

  • Click column headers to sort ascending or descending
  • Use the filter icon to apply basic text or numeric filters
  • Pin important columns for easier analysis

Tips for Efficient Querying

  • Use TOP to avoid large result sets during testing
  • Alias long table names for cleaner syntax
  • Click on “Format SQL” for auto-formatting queries
  • Use INFORMATION_SCHEMA.TABLES to find all tables:
SELECT * FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'dbo';

Watch the Full Tutorial

Blog post written with the help of ChatGPT.

PySpark Tutorial : String Search in PySpark - How to Use contains, startswith, endswith, like, rlike, locate

String Search in PySpark | contains, startswith, endswith, like, rlike, locate

String Search in PySpark

Learn how to perform string filtering and matching in PySpark using functions like contains(), startswith(), endswith(), like, rlike, and locate().

Sample DataFrame

+---------------+----------------------+----------+
| customer_name | email                | status   |
+---------------+----------------------+----------+
| Aamir Khan    | aamir@example.com    | Active   |
| Bob Smith     | bob.smith@mail.com   | Inactive |
| Lisa Jones    | lisa.j@data.org      | Active   |
| Charlie Brown | charlie@peanuts.net  | Pending  |
| Alice Wonder  | alice.w@wonder.org   | Active   |
| Eve H         | eve.h@allabout.eve   | Inactive |
| John Doe      | john.doe@unknown.com | Active   |
+---------------+----------------------+----------+

✅ contains()

Find customers whose email column contains 'mail'.

df.filter(col("email").contains("mail")).show(truncate=False)

✅ startswith() and endswith()

Filter emails starting with 'a' and ending with '.org'.

df.filter(col("email").startswith("a")).show(truncate=False)
df.filter(col("email").endswith(".org")).show(truncate=False)

✅ like()

Use SQL-style wildcard match.

df.filter(col("customer_name").like("%is%")).show(truncate=False)

✅ rlike()

Use regex pattern match (e.g., find email domains starting with m or d).

df.filter(col("email").rlike("^[md]")).show(truncate=False)

✅ locate()

Find position of substring in the email (e.g., '@' position).

df.select("email", instr(col("email"), "@").alias("at_pos")).show(truncate=False)

Watch the Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Restore Warehouse In-Place in Fabric – Step-by-Step with Restore Point | Microsoft Fabric Tutorial

Restore Warehouse In-Place in Fabric – Step-by-Step with Restore Point

In this Microsoft Fabric tutorial, you'll learn how to restore your warehouse in-place using system or user-defined restore points. This capability is essential for disaster recovery, testing environments, or rolling back undesired changes.

What is In-Place Restore?

  • Restores a warehouse to a prior point-in-time
  • Overwrites the existing warehouse (name stays the same)
  • Restores all components: models, queries, tables
  • Useful for undoing accidental deletes, corruption, or failed deployments

Types of Restore Points

  • System-Created: Every 8 hours automatically (up to 180 kept)
  • User-Defined: Manually created before/after major changes
  • Retention for both: 30 days

✅ Common Use Cases

  • Rollback after a schema error
  • Reset test environments to a clean state
  • Recover deleted data or fix bad ETL runs

Step-by-Step: How to Restore

  1. Open Microsoft Fabric Portal
  2. Navigate to your Warehouse
  3. Click Settings → Restore Points
  4. Select a system or user-defined restore point
  5. Click Restore In-Place

⚠️ If you receive Error 5064, try again after a few moments.

Permissions Matrix

  • View Restore Points: Admin, Member, Contributor, Viewer
  • Create Restore Points: Admin, Member, Contributor
  • Perform Restore: Admin only

Billing & Cost

  • Storage: Charges apply for Parquet storage under OneLake
  • Compute: Charged when restore point is created or restored
  • Billing is against Fabric Capacity

⚠️ Limitations

  • Cannot rename warehouse on restore (same name only)
  • Restore points expire after 30 days — no extensions
  • Cannot restore across different warehouses
  • Restore does not include downstream linked pipelines

Watch the Full Video Tutorial

Blog post written with the help of ChatGPT.

Clone Tables in Fabric Warehouse – Zero Copy Cloning with T-SQL | Microsoft Fabric Tutorial

Clone Tables in Fabric Warehouse – Zero Copy Cloning with T-SQL

In this tutorial, we’ll walk through how to use zero-copy table cloning in Microsoft Fabric Warehouse using simple T-SQL commands. Cloning tables is useful for development, testing, historical reporting, or machine learning experimentation—all without duplicating data.

What is a Zero-Copy Clone?

  • Creates a new table referencing the same data files (Parquet in OneLake)
  • No physical data duplication
  • Cloned tables behave as independent tables
  • Extremely fast and storage-efficient

✅ Syntax to Clone a Table

Clone Current State

CREATE TABLE dbo.Sales_Clone AS CLONE OF dbo.Sales;

Clone Table as of Specific Time

CREATE TABLE dbo.Sales_Clone_History 
AS CLONE OF dbo.Sales 
WITH ( SNAPSHOT_TIME = '2025-05-20T10:00:00Z' );

Use Cases for Cloning

  • Development and Testing
  • Point-in-time Historical Reporting
  • Recovery and troubleshooting
  • Data Archiving for compliance
  • Reproducible ML Training Pipelines

⚠️ Limitations to Keep in Mind

  • Cloning is limited to tables within the same warehouse
  • Cannot clone tables from Lakehouse SQL endpoints
  • Cloning entire schema or warehouse is not supported
  • Time-travel cloning must be within 30-day retention
  • Schema changes in the source table may block time-based clones

Tip: Clones are disposable. Deleting a clone does not impact the original table.

Watch the Full Tutorial

Blog post written with the help of ChatGPT.

Managed vs External Tables in Spark SQL Lake Database | Azure Synapse Analytics Tutorial

Managed vs External Tables in Spark SQL Lake Database | Azure Synapse Analytics

Managed vs External Tables in Spark SQL Lake Database | Azure Synapse Analytics Explained

Overview

In Azure Synapse Analytics, you can create both Managed and External Tables within a Lake Database using Spark SQL. Understanding the difference is essential for managing data, storage, and performance effectively.

What Are Managed Tables?

Managed tables are those where both the metadata and the data are controlled by Synapse Spark. When you drop the table, the data and metadata are both deleted.

✅ Create a Managed Table

%%spark
CREATE TABLE lakehouse.managed_employee (
    id INT,
    name STRING,
    salary FLOAT
)

✔️ No LOCATION clause is specified — Spark decides where to store the data internally.

๐Ÿ“ What Are External Tables?

External tables store only the metadata in the Lake Database while the data resides externally in user-defined paths like ADLS Gen2.

✅ Create an External Table

%%spark
CREATE TABLE lakehouse.external_sales (
    id INT,
    product STRING,
    amount DOUBLE
)
USING DELTA
LOCATION 'abfss://data@yourstorageaccount.dfs.core.windows.net/sales/external/'

✔️ Data lives outside Spark's internal catalog. Deleting the table will NOT remove the data.

Key Differences

| Feature                 | Managed Table      | External Table       |
|-------------------------|--------------------|----------------------|
| Data Location           | Managed by Synapse | Custom ADLS path     |
| Drop Table Deletes Data | Yes                | No                   |
| Usage                   | Quick prototyping  | Production pipelines |
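
A quick way to see the drop behavior in practice is to drop both tables and then list the external path: the managed table's data disappears with it, while the external Delta files remain in the lake. A sketch using the tables created above (mssparkutils ships with Synapse notebooks):

%%pyspark
from notebookutils import mssparkutils

spark.sql("DROP TABLE lakehouse.managed_employee")   # removes metadata and data
spark.sql("DROP TABLE lakehouse.external_sales")     # removes metadata only

# The external files are still there after the drop
files = mssparkutils.fs.ls("abfss://data@yourstorageaccount.dfs.core.windows.net/sales/external/")
print([f.name for f in files])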

Best Practices

  • Use external tables for production where you need to retain data
  • Use managed tables for temporary analysis or rapid testing
  • Prefer Delta format for ACID support and versioning

Example: Read External Table

%%sql
SELECT * FROM lakehouse.external_sales;

Watch the Full Video Explanation

Credit: Content created with the help of ChatGPT and Gemini.

Substring Functions in PySpark: substr(), substring(), overlay(), left(), right() with Real Examples | PySpark Tutorial

Extract Substrings in PySpark | substr(), substring(), overlay(), left(), right()

Extracting Substrings in PySpark

In this tutorial, you'll learn how to use PySpark string functions like substr(), substring(), overlay(), left(), and right() to manipulate string columns in DataFrames.

1️⃣ Initialize Spark Session

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SubstringTutorial").getOrCreate()

2️⃣ Sample DataFrame

data = [("EMP-0001-NC", "maria.fernandez@healthcare.org"),
        ("EMP-0002-TX", "john.davis@fintech.com"),
        ("EMP-0003-CA", "lucy.liu@retailmart.net"),
        ("EMP-0004-FL", "peter.parker@dailybugle.com"),
        ("EMP-0005-NY", "bruce.wayne@gotham.org")]

columns = ["employee_code", "email"]

df = spark.createDataFrame(data, columns)
df.show(truncate=False)

Output:

+-------------+-----------------------------+
|employee_code|email                        |
+-------------+-----------------------------+
|EMP-0001-NC  |maria.fernandez@healthcare.org|
|EMP-0002-TX  |john.davis@fintech.com       |
|EMP-0003-CA  |lucy.liu@retailmart.net      |
|EMP-0004-FL  |peter.parker@dailybugle.com  |
|EMP-0005-NY  |bruce.wayne@gotham.org       |
+-------------+-----------------------------+

3️⃣ substr() Function

Extract substring starting at a specific position with defined length.

from pyspark.sql.functions import col, lit, substr

# substr() requires Spark 3.5+; position and length are passed as column expressions
df_substr = df.withColumn("emp_id", substr("employee_code", lit(5), lit(4)))
df_substr.show(truncate=False)

Output:

+-------------+-----------------------------+------+
|employee_code|email                        |emp_id|
+-------------+-----------------------------+------+
|EMP-0001-NC  |maria.fernandez@healthcare.org|0001  |
|EMP-0002-TX  |john.davis@fintech.com       |0002  |
|EMP-0003-CA  |lucy.liu@retailmart.net      |0003  |
|EMP-0004-FL  |peter.parker@dailybugle.com  |0004  |
|EMP-0005-NY  |bruce.wayne@gotham.org       |0005  |
+-------------+-----------------------------+------+

4️⃣ substring() Function

Works like substr(), but accepts plain integer position and length arguments.

from pyspark.sql.functions import substring
df_substring = df.withColumn("emp_id", substring("employee_code", 5, 4))
df_substring.show(truncate=False)

5️⃣ overlay() Function

Replaces part of the string with another string.

from pyspark.sql.functions import lit, overlay

# the replacement value must be a column expression, so wrap the literal in lit()
df_overlay = df.withColumn("masked_email", overlay("email", lit("XXXXXX"), 1, 5))
df_overlay.show(truncate=False)

6️⃣ left() and right() Functions

Extract leftmost or rightmost characters from string.

from pyspark.sql.functions import left, lit, right

# left()/right() are available in Spark 3.5+; the length is passed as a column expression
df_left = df.withColumn("left_code", left("employee_code", lit(3)))
df_right = df.withColumn("right_code", right("employee_code", lit(2)))

df_left.show(truncate=False)
df_right.show(truncate=False)

Watch the Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

User Defined Inline Table-Valued Functions with CROSS APPLY | Microsoft Fabric Tutorial

In this Microsoft Fabric tutorial, we’ll demonstrate how to create and use inline table-valued functions (iTVFs) and apply them with CROSS APPLY. These are great tools for reusable logic and dynamic row-by-row evaluation.

What is a Function in SQL?

A function is a reusable SQL object that returns either a value or a table. In Fabric Warehouse, only inline table-valued functions (iTVFs) are currently supported.

Not Supported:

  • Scalar functions
  • Multi-statement table-valued functions

✅ Supported:

  • Inline Table-Valued Functions (iTVFs)

Step 1: Create Sample Orders Table

CREATE TABLE dbo.Orders (
    OrderID INT,
    CustomerName VARCHAR(100),
    OrderDate DATE,
    OrderAmount DECIMAL(10, 2)
);

Step 2: Insert Sample Data

INSERT INTO dbo.Orders VALUES
(1, 'Aamir', '2024-01-10', 1200.00),
(2, 'Sara', '2024-02-15', 800.00),
(3, 'John', '2024-03-20', 300.00),
(4, 'Mia', '2024-03-22', 1500.00),
(5, 'Aamir', '2024-04-01', 200.00);

Step 3: Create Discount Function with CROSS APPLY

This iTVF applies a 10% discount to a given amount:

CREATE FUNCTION dbo.fn_ApplyDiscount (@Amount DECIMAL(10,2))
RETURNS TABLE
AS
RETURN (
    SELECT @Amount * 0.90 AS DiscountedAmount
);

✅ Using CROSS APPLY

SELECT 
    o.OrderID, o.CustomerName, o.OrderAmount,
    d.DiscountedAmount
FROM dbo.Orders o
CROSS APPLY dbo.fn_ApplyDiscount(o.OrderAmount) d;

Step 4: Create Function to Filter Orders

This function returns orders where the amount is greater than or equal to a specified value:

CREATE FUNCTION dbo.fn_GetHighValueOrders (@MinAmount DECIMAL(10,2))
RETURNS TABLE
AS
RETURN (
    SELECT OrderID, CustomerName, OrderAmount
    FROM dbo.Orders
    WHERE OrderAmount >= @MinAmount
);

✅ Usage Example:

SELECT * FROM dbo.fn_GetHighValueOrders(200.00);

Optional Cleanup

-- DROP FUNCTION dbo.fn_ApplyDiscount;
-- DROP FUNCTION dbo.fn_GetHighValueOrders;
-- DROP TABLE dbo.Orders;

Watch the Full Tutorial

Blog post written with the help of ChatGPT.

Lake Database in Azure Synapse: Spark SQL vs Delta vs Parquet | Full Demo with Create & Insert | Azure Synapse Analytics Tutorial

Lake Database in Azure Synapse: Spark SQL vs Delta vs Parquet

Lake Database in Azure Synapse: Spark SQL vs Delta vs Parquet | Full Demo

Overview

Azure Synapse Lake Databases offer powerful data lake querying capabilities using Spark SQL. In this tutorial, we demonstrate how to:

  • Create and use Lake Databases
  • Compare Parquet, Delta, and Spark SQL tables
  • Insert, update, and read data using Apache Spark in Synapse

Step 1: Create Spark SQL Table (Unmanaged)

%%spark
CREATE TABLE IF NOT EXISTS lakehouse.spark_table (
    id INT, name STRING, age INT
)
USING PARQUET
LOCATION 'abfss://lake@storageaccount.dfs.core.windows.net/tables/spark_table'

Step 2: Create Delta Table

%%spark
CREATE TABLE lakehouse.delta_table (
    id INT, product STRING, quantity INT
)
USING DELTA
LOCATION 'abfss://lake@storageaccount.dfs.core.windows.net/tables/delta_table'

Step 3: Insert Data into Tables

%%spark
INSERT INTO lakehouse.spark_table VALUES (1, 'Ali', 30), (2, 'Zara', 25);
INSERT INTO lakehouse.delta_table VALUES (1, 'Phone', 10), (2, 'Tablet', 20);

Step 4: Query Tables

%%sql
SELECT * FROM lakehouse.spark_table;
SELECT * FROM lakehouse.delta_table;

✏️ Step 5: Update Delta Table

Delta supports ACID transactions, including UPDATE and DELETE operations.

%%spark
UPDATE lakehouse.delta_table SET quantity = 15 WHERE id = 1;

Comparison: Parquet vs Delta vs Spark Tables

| Feature                | Parquet   | Delta                           | Spark SQL Table |
|------------------------|-----------|---------------------------------|-----------------|
| Storage Format         | Columnar  | Delta (based on Parquet + logs) | Varies          |
| ACID Support           | No        | Yes                             | No              |
| Supports Update/Delete | No        | Yes                             | No              |
| Performance            | Fast read | Fast + version control          | Simple          |

Best Practices

  • Use Delta Lake for write-heavy transactional workloads
  • Use Parquet for optimized reads and archival
  • Validate permissions for Lake DB and storage paths

Watch the Full Demo

Credit: Content created with the help of ChatGPT and Gemini.

How to perform String Cleaning in PySpark lower, trim, initcap Explained with Real Data | PySpark Tutorial

How to Clean Strings in PySpark | lower(), trim(), initcap() Explained with Real Data

How to Clean Strings in PySpark

Using lower(), trim(), and initcap() with Real Data

What You’ll Learn

  • How to use lower() to convert text to lowercase
  • How to use trim() to remove leading/trailing spaces
  • How to use initcap() to capitalize first letter of each word
  • Chaining multiple string functions

Sample Data

data = [
    (" Aamir ",),
    ("LISA ",),
    ("  charLie   ",),
    ("BOB",),
    (" eli",)
]
columns = ["raw_name"]
df = spark.createDataFrame(data, columns)
df.show(truncate=False)

Output:
+-----------+
|raw_name   |
+-----------+
| Aamir     |
|LISA       |
|  charLie  |
|BOB        |
| eli       |
+-----------+

Cleaned Data using PySpark Functions

1️⃣ Apply trim()

from pyspark.sql.functions import trim
df_trimmed = df.withColumn("trimmed", trim("raw_name"))
df_trimmed.show(truncate=False)

2️⃣ Apply lower() and upper()

from pyspark.sql.functions import lower, upper
df_lower = df_trimmed.withColumn("lowercase", lower("trimmed"))
df_upper = df_trimmed.withColumn("uppercase", upper("trimmed"))
df_lower.show(truncate=False)
df_upper.show(truncate=False)

3️⃣ Apply initcap()

from pyspark.sql.functions import initcap
df_initcap = df_trimmed.withColumn("titlecase", initcap("trimmed"))
df_initcap.show(truncate=False)
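
The functions also chain naturally, which covers the last bullet in the "What You'll Learn" list; for example, trimming, lowercasing, and title-casing in a single expression:

from pyspark.sql.functions import trim, lower, initcap

# trim -> lower -> initcap chained in one column expression
df_clean = df.withColumn("clean_name", initcap(lower(trim("raw_name"))))
df_clean.show(truncate=False)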

Watch the Full Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Stored Procedure in Fabric Warehouse Input & Output Parameters SQL Demo | Microsoft Fabric Tutorial

Stored Procedure in Fabric Warehouse – Input & Output Parameters SQL Demo

In this Microsoft Fabric tutorial, you'll learn how to create and execute a stored procedure with input and output parameters. This is especially useful when you want to encapsulate query logic and reuse it across reports, data pipelines, and dashboards.

What is a Stored Procedure?

A stored procedure is a reusable set of SQL statements saved in the database. It can take input parameters, return output values, and perform operations like reading or modifying data.

✅ Benefits of Stored Procedures:

  • Encapsulates business logic
  • Promotes reuse and consistency
  • Supports input and output parameters
  • Can include conditional and complex logic

Step 1: Create a Sample Table

CREATE TABLE dbo.Orders (
    OrderID INT,
    CustomerName VARCHAR(100),
    OrderAmount DECIMAL(10,2)
);

Step 2: Insert Sample Data

INSERT INTO dbo.Orders VALUES
(1, 'Aamir', 500.00),
(2, 'Sara', 750.00),
(3, 'John', 1200.00);

⚙️ Step 3: Create the Stored Procedure

This procedure accepts a customer name as input and returns the total order amount as output:

CREATE PROCEDURE dbo.GetCustomerTotalOrders
    @CustomerName VARCHAR(100),
    @TotalAmount DECIMAL(10,2) OUTPUT
AS
BEGIN
    SELECT @TotalAmount = SUM(OrderAmount)
    FROM dbo.Orders
    WHERE CustomerName = @CustomerName;
END;

Step 4: Execute the Stored Procedure

-- Declare output variable
DECLARE @Total DECIMAL(10,2);

-- Execute with input and capture output
EXEC dbo.GetCustomerTotalOrders
    @CustomerName = 'Aamir',
    @TotalAmount = @Total OUTPUT;

-- Display the result
SELECT 'Total Order Amount for Aamir' AS Description, @Total AS TotalAmount;

Optional Cleanup

-- DROP PROCEDURE dbo.GetCustomerTotalOrders;
-- DROP TABLE dbo.Orders;

View Final Orders Table

SELECT * FROM dbo.Orders;

Watch the Full Tutorial

Blog post written with the help of ChatGPT.

Export Data from Synapse Dedicated SQL Pool to Single CSV File in ADLS Gen2 | Azure Synapse Analytics Tutorial

Export Data from Synapse Dedicated SQL Pool to Single CSV File in ADLS Gen2

Overview

In Azure Synapse Analytics, it’s common to export data from a Dedicated SQL Pool to Azure Data Lake Storage Gen2 in CSV format for reporting, archival, or integration with downstream tools. This tutorial demonstrates how to do that and ensure the output is saved as a single CSV file.

Step-by-Step: Export to Single CSV File

✅ 1. Query Data Using Synapse Spark

%%pyspark
df = spark.read \
    .synapsesql("yourdedicatedpool.dbo.SalesData")
df.show()

✅ 2. Repartition to Single File

df_single = df.repartition(1)

This ensures only one output file is generated when writing to storage.
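
If the DataFrame already has only a few partitions, coalesce(1) is a common lighter-weight alternative to repartition(1), since it avoids a full shuffle (a small sketch, not from the original demo):

# coalesce(1) narrows the existing partitions into one without a full shuffle
df_single = df.coalesce(1)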

✅ 3. Write to ADLS Gen2 as CSV

df_single.write \
    .mode("overwrite") \
    .option("header", "true") \
    .csv("abfss://export@yourstorageaccount.dfs.core.windows.net/sales/csv/")

Notes

  • repartition(1) should only be used for small to medium datasets due to single-threaded write.
  • Azure Storage Explorer can be used to rename the output file (e.g., from part-00000.csv to sales_report.csv).
  • Make sure Synapse workspace has access permissions (RBAC or SAS).

Use Cases

  • Exporting summarized data for business intelligence
  • Staging datasets for Power BI or external tools
  • Archiving daily/weekly sales reports

Watch the Video Tutorial

Credit: Content created with the help of ChatGPT and Gemini.

PySpark Tutorial : Master PySpark Sorting: sort(), asc(), desc() Explained with Examples #pyspark

Sort Functions in PySpark Explained with Examples

In this post, you'll learn how to use various sort functions in PySpark to order data by ascending/descending, and control the handling of nulls. This guide is perfect for anyone working with big data in Spark!

1️⃣ Setup

from pyspark.sql import SparkSession
from pyspark.sql.functions import asc, desc, asc_nulls_first, desc_nulls_last

spark = SparkSession.builder.appName("SortFunctionsDemo").getOrCreate()

2️⃣ Sample Data

sample_data = [
    ("Alice", 5000),
    ("Bob", None),
    ("Cara", 6200),
    ("Dan", None),
    ("Eli", 4500),
    ("Fay", 7000)
]

columns = ["employee", "sales"]
df = spark.createDataFrame(sample_data, columns)
df.show()

Output:

+--------+-----+
|employee|sales|
+--------+-----+
|   Alice| 5000|
|     Bob| null|
|    Cara| 6200|
|     Dan| null|
|     Eli| 4500|
|     Fay| 7000|
+--------+-----+

3️⃣ Sort by Ascending

df.orderBy(asc("sales")).show()

4️⃣ Sort with Nulls First

df.orderBy(asc_nulls_first("sales")).show()

5️⃣ Sort with Nulls Last

df.orderBy(desc_nulls_last("sales")).show()

6️⃣ Descending Order

df.orderBy(desc("sales")).show()
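
7️⃣ Sorting by Multiple Columns

Both sort() and orderBy() accept several sort expressions; for example, descending sales with ties broken alphabetically by employee (a small added example):

df.sort(desc("sales"), asc("employee")).show()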

Video Tutorial

© 2024 Aamir Shahzad — All rights reserved.
Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Insert, Update & Delete in Fabric Warehouse Full SQL Demo for Beginners | Microsoft Fabric Tutorial

Insert, Update & Delete in Fabric Warehouse – Full SQL Demo for Beginners

Insert, Update & Delete in Fabric Warehouse – Full SQL Demo for Beginners

In this tutorial, you'll learn how to use the core DML (Data Manipulation Language) commands — INSERT, UPDATE, and DELETE — in Microsoft Fabric Warehouse. These are the foundation of interacting with structured data in your SQL environment.

What Are DML Commands?

  • INSERT: Adds new rows into a table
  • UPDATE: Modifies existing rows
  • DELETE: Removes rows

Fabric Warehouse fully supports these operations using standard T-SQL syntax.

Step 1: Create a Sample Table

CREATE TABLE dbo.Products (
    ProductID INT,
    ProductName VARCHAR(100),
    Category VARCHAR(50),
    Price DECIMAL(10, 2),
    Stock INT
);

Step 2: Insert Sample Data

INSERT INTO dbo.Products (ProductID, ProductName, Category, Price, Stock) VALUES
(1, 'Laptop', 'Electronics', 1200.00, 10),
(2, 'Smartphone', 'Electronics', 800.00, 25),
(3, 'Desk Chair', 'Furniture', 150.00, 40),
(4, 'Monitor', 'Electronics', 300.00, 15);

Verify Insert:

SELECT * FROM dbo.Products;

✏️ Step 3: Update Data

Increase the price of all Electronics items by 10%:

UPDATE dbo.Products
SET Price = Price * 1.10
WHERE Category = 'Electronics';

Verify Update:

SELECT * FROM dbo.Products;

❌ Step 4: Delete Data

Delete all products with stock less than 20:

DELETE FROM dbo.Products
WHERE Stock < 20;

Verify Delete:

SELECT * FROM dbo.Products;

Watch the Full Tutorial

Blog post written with the help of ChatGPT.

Read CSV File & Load to Dedicated SQL Pool Table Using Apache Spark Pool Notebook | Azure Synapse Analytics Tutorial

Read CSV File & Load to Dedicated SQL Pool Table | Synapse Notebook

Read CSV File & Load to Dedicated SQL Pool Table Using Apache Spark Pool Notebook

Overview

In this tutorial, you'll learn how to use Apache Spark Pools in Azure Synapse Analytics to read a CSV file from ADLS Gen2 and write that data into a Dedicated SQL Pool table. This is useful for ETL and data ingestion pipelines within your Synapse workspace.

Step-by-Step Workflow

✅ Step 1: Read CSV File from ADLS Gen2

%%pyspark
csv_df = spark.read.option("header", "true") \
    .csv("abfss://data@yourstorageaccount.dfs.core.windows.net/input/employees.csv")
csv_df.show()

✅ Step 2: Write Data to Dedicated SQL Pool

csv_df.write \
    .mode("overwrite") \
    .synapsesql("yourdedicatedpool.dbo.Employee")

Note: The `.synapsesql()` method is available by default when using Spark Pools in Synapse with built-in connectors, and it performs the write itself, so no separate `.save()` call is needed.

✅ Step 3: Validate Inserted Data

%%sql
SELECT * FROM yourdedicatedpool.dbo.Employee;

Tips

  • Ensure the destination table exists before writing
  • Use overwrite or append mode depending on your use case
  • Review schema compatibility between DataFrame and SQL table

Use Cases

  • Bulk load historical data into data warehouse
  • ETL transformation using PySpark before loading
  • Automated data ingestion from CSV sources

Watch the Video Tutorial

Credit: Content created with the help of ChatGPT and Gemini.

PySpark Window Functions Explained | Rank, Dense_Rank, Lead, Lag, NTILE | Real-World Demo | PySpark Tutorial

Mastering Window Functions in PySpark | Rank, Lag, Lead, Dense_Rank, Ntile

Mastering Window Functions in PySpark

This blog post will walk you through essential PySpark window functions like rank(), lag(), lead(), dense_rank(), and ntile() using practical examples. These functions are crucial for complex data analytics, ordering, and partitioning scenarios.

Sample Dataset

+--------+----------+-----+----------+
|Employee|Department|Sales|Sale_Date |
+--------+----------+-----+----------+
|Alice   |Sales     |5000 |2024-01-01|
|Ben     |Sales     |5500 |2024-01-01|
|Cara    |Marketing |4800 |2024-01-01|
|Dan     |HR        |5300 |2024-01-01|
|Alice   |Sales     |6200 |2024-01-02|
|Ben     |Sales     |5500 |2024-01-02|
|Cara    |Marketing |5100 |2024-01-02|
|Dan     |HR        |6000 |2024-01-02|
|Alice   |Sales     |7000 |2024-01-03|
|Ben     |Sales     |6400 |2024-01-03|
+--------+----------+-----+----------+

⚙️ Code Example

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number, rank, dense_rank, percent_rank, lag, lead, ntile
from pyspark.sql.window import Window

# Start Spark session
spark = SparkSession.builder.appName("WindowFunctionsDemo").getOrCreate()

# Sample DataFrame
data = [
    ("Alice", "Sales", 5000, "2024-01-01"),
    ("Ben", "Sales", 5500, "2024-01-01"),
    ("Cara", "Marketing", 4800, "2024-01-01"),
    ("Dan", "HR", 5300, "2024-01-01"),
    ("Alice", "Sales", 6200, "2024-01-02"),
    ("Ben", "Sales", 5500, "2024-01-02"),
    ("Cara", "Marketing", 5100, "2024-01-02"),
    ("Dan", "HR", 6000, "2024-01-02"),
    ("Alice", "Sales", 7000, "2024-01-03"),
    ("Ben", "Sales", 6400, "2024-01-03"),
]
schema = ["employee", "department", "sales", "sale_date"]
df = spark.createDataFrame(data, schema)
df.show()

Applying Window Functions

window_spec = Window.partitionBy("employee").orderBy("sale_date")

df.withColumn("rank", rank().over(window_spec)).show()
df.withColumn("dense_rank", dense_rank().over(window_spec)).show()
df.withColumn("percent_rank", percent_rank().over(window_spec)).show()
df.withColumn("lag", lag("sales", 1).over(window_spec)).show()
df.withColumn("lead", lead("sales", 1).over(window_spec)).show()
df.withColumn("ntile", ntile(2).over(window_spec)).show()

Watch Full Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

How to Write, Read & Append Data to Dedicated SQL Pool Table Using Notebook in Azure Synapse | Azure Synapse Analytics Tutorial

Write, Read & Append Data to Dedicated SQL Pool Table Using Notebook in Azure Synapse

How to Write, Read & Append Data to Dedicated SQL Pool Table Using Notebook in Azure Synapse

Overview

Azure Synapse Notebooks allow seamless interaction with Dedicated SQL Pools using Spark. You can create and manage SQL tables from PySpark DataFrames, perform bulk writes, and query them using Spark SQL or Synapse SQL on-demand.

Step-by-Step Examples

✅ 1. Create DataFrame

%%pyspark
data = [(1, "Aamir", 5000), (2, "Lisa", 6000)]
columns = ["id", "name", "salary"]
df = spark.createDataFrame(data, columns)

✅ 2. Write Data to Dedicated SQL Pool

Use the SQL Data Warehouse Spark connector (format com.databricks.spark.sqldw) with JDBC connection options:

df.write \
  .format("com.databricks.spark.sqldw") \
  .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<dbname>") \
  .option("dbtable", "dbo.Employee") \
  .option("user", "<username>") \
  .option("password", "<password>") \
  .mode("overwrite") \
  .save()

✅ 3. Read Data from SQL Pool Table

df_read = spark.read \
  .format("com.databricks.spark.sqldw") \
  .option("url", "jdbc:sqlserver://...") \
  .option("dbtable", "dbo.Employee") \
  .option("user", "...") \
  .option("password", "...") \
  .load()

df_read.show()

✅ 4. Append Data to Existing Table

df_append = spark.createDataFrame([(3, "John", 7000)], ["id", "name", "salary"])
df_append.write \
  .format("com.databricks.spark.sqldw") \
  .option("url", "jdbc:...") \
  .option("dbtable", "dbo.Employee") \
  .option("user", "...") \
  .option("password", "...") \
  .mode("append") \
  .save()

Tips

  • Use overwrite to replace and append to add rows
  • Validate schema compatibility before appending
  • Check Synapse firewall rules for access

Use Cases

  • ETL workflows from Spark to Dedicated SQL Pool
  • Reading warehouse data for ML or transformation
  • Bulk appending logs or batch jobs

Watch the Video Tutorial

Credit: Content created with the help of ChatGPT and Gemini.

Summarizing Data with Aggregate Functions in PySpark _ sum(), sum_distinct(), bit_and() | PySpark Tutorial

PySpark Aggregation Functions: sum(), sum_distinct(), bit_and() Explained

PySpark Aggregation Functions: sum(), sum_distinct(), bit_and()

In this tutorial, we will explore key PySpark aggregation functions that help summarize your data efficiently. We’ll go over sum(), sum_distinct(), and bit_and() with practical examples using a DataFrame.

Sample DataFrame

+------+------+
| name | sales|
+------+------+
| Aamir|  500 |
| Sara |  300 |
| John |  300 |
| Lina |  200 |
| Aamir|  550 |
| Sara |  650 |
| John |  800 |
| Lina |  250 |
+------+------+

➕ sum()

Description: Calculates the total sum of all values in a column.

from pyspark.sql.functions import sum

df_sum = df.agg(sum("sales").alias("total_sales"))
df_sum.show()

Output:

+------------+
| total_sales|
+------------+
|       3550 |
+------------+

sum_distinct()

Description: Calculates the sum of only distinct (unique) values in a column.

from pyspark.sql.functions import sum_distinct

df_sum_distinct = df.agg(sum_distinct("sales").alias("total_distinct_sales"))
df_sum_distinct.show()

Output:

+---------------------+
| total_distinct_sales|
+---------------------+
|                3250 |
+---------------------+

⚙️ bit_and()

Description: Computes the bitwise AND of all non-null input values in a column.

from pyspark.sql.functions import bit_and

df_bit_and = df.agg(bit_and("sales").alias("bitwise_and_sales"))
df_bit_and.show()

Output:

+--------------------+
| bitwise_and_sales  |
+--------------------+
|                  0 |
+--------------------+
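
The same aggregates can also be computed per group rather than over the whole DataFrame; for example, total sales per name (a small added example):

df_by_name = df.groupBy("name").agg(sum("sales").alias("total_sales_per_name"))
df_by_name.show()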

Watch the Video Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Create and Use Views in Fabric Warehouse – SQL Tutorial with Example | Microsoft Fabric Tutorial

Create and Use Views in Fabric Warehouse – SQL Tutorial with Example

In this Microsoft Fabric tutorial, we explore how to use SQL Views in Fabric Warehouse to simplify queries, secure data, and generate reusable logic. Views allow you to abstract complex logic and limit sensitive access while maintaining flexibility in data operations.

What is a VIEW?

A VIEW is a virtual table based on the result of a SQL query. Views do not store data physically but allow easier querying, especially when dealing with complex joins, filters, or column-level security.

Benefits of Using Views:

  • Simplifies complex queries
  • Provides abstraction and data masking
  • Promotes reusable query logic
  • Improves security by exposing only needed columns

Supported & Unsupported View Types in Fabric

  • ✅ Regular Views (non-materialized)
  • ✅ Views on external tables
  • ❌ Materialized Views – Not supported
  • ❌ Indexed Views – Not supported

Step 1: Create a Sample Table

CREATE TABLE dbo.Employees (
    EmployeeID INT,
    FullName VARCHAR(50),
    Department VARCHAR(50),
    Salary DECIMAL(10,2),
    HireDate DATE
);

Step 2: Insert Sample Data

INSERT INTO dbo.Employees VALUES
(1, 'Aamir Shahzad', 'IT', 80000, '2020-01-15'),
(2, 'Sara Khan', 'HR', 65000, '2021-06-10'),
(3, 'John Doe', 'Finance', 90000, '2019-03-20'),
(4, 'Mia Wong', 'IT', 75000, '2022-02-25');

Step 3: Create a Basic View

View that returns employee names and departments:

CREATE VIEW dbo.vw_EmployeeBasicInfo AS
SELECT FullName, Department
FROM dbo.Employees;

-- Query the view
SELECT * FROM dbo.vw_EmployeeBasicInfo;

Step 4: Create a Security View

Hide salary details from general users:

CREATE VIEW dbo.vw_PublicEmployeeInfo AS
SELECT EmployeeID, FullName, Department
FROM dbo.Employees;

-- Query the view
SELECT * FROM dbo.vw_PublicEmployeeInfo;

Step 5: Create an Aggregated View

Average salary by department:

CREATE VIEW dbo.vw_DepartmentSalaryAvg AS
SELECT Department, AVG(Salary) AS AvgSalary
FROM dbo.Employees
GROUP BY Department;

-- Query the view
SELECT * FROM dbo.vw_DepartmentSalaryAvg;

Step 6: Unsupported View Features

  • ❌ ORDER BY in views (unless used with TOP or OFFSET-FETCH)
  • ❌ Nested CTEs
  • ❌ Procedural logic or parameters

Watch the Full Tutorial

Blog post written with the help of ChatGPT.

Fix “File Cannot Be Opened” Error in Azure Synapse | File Not Found or In Use Issue Explained | Azure Synapse Analytics Tutorial

Fix “File Cannot Be Opened” Error in Azure Synapse | File Not Found or In Use Issue

Fix “File Cannot Be Opened” Error in Azure Synapse | File Not Found or In Use Issue Explained

Overview

While working with Serverless SQL Pools or Spark Pools in Azure Synapse Analytics, you may encounter the frustrating error:

Msg 105019, Level 16, State 1, Line 1  
External file cannot be opened because it does not exist or is used by another process.

This error can occur when accessing files using OPENROWSET, external tables, or Spark reads from abfss:// paths.

❌ Common Causes

  • The file or folder does not exist at the specified path
  • You are referencing a folder but did not specify WITH (FORMAT = 'CSV') properly
  • File is still being written to and locked by another process
  • Synapse doesn't have permission to access the container or path
  • File is empty or contains incompatible format/schema

✅ How to Fix It

1. Double-Check File Path

SELECT *  
FROM OPENROWSET(
    BULK 'https://yourstorage.dfs.core.windows.net/container/data/file.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0'
) AS rows;

✅ Use Azure Storage Explorer to verify that the file and folder exist.
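
From a Synapse notebook you can also check the path programmatically before querying it; a minimal sketch using mssparkutils (the container and path below are placeholders):

%%pyspark
from notebookutils import mssparkutils

path = "abfss://container@yourstorage.dfs.core.windows.net/data/"
try:
    for f in mssparkutils.fs.ls(path):
        print(f.name, f.size)
except Exception as e:
    print("Path not found or not accessible:", e)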

2. Use Trailing Slash for Folder Reads

-- Correct: reading all files from folder
BULK 'https://.../datafolder/'

3. Check File Permissions

Ensure Synapse has Storage Blob Data Contributor role assigned via IAM or Managed Identity.

4. Validate File Format

Make sure your file is in the correct format and that headers, delimiters, or encodings are compatible with your query options.

5. Delay Access if File Is in Use

If the file is still being written by another service (e.g., ADF pipeline or streaming job), introduce a delay or check for file locks before reading.

Best Practices

  • Always verify storage path and access manually when error occurs
  • Use wildcards to test file patterns if unsure
  • Enable diagnostic logs for more details

Helpful Diagnostic Query

-- To test path access
SELECT TOP 10 *  
FROM OPENROWSET(
    BULK 'https://.../testfolder/*.csv',
    FORMAT = 'CSV',
    FIRSTROW = 2
) AS rows;

Watch the Video Tutorial

Credit: Content created with the help of ChatGPT and Gemini.

Statistical Aggregations in PySpark | avg(), mean(), median(), mode() | PySpark Tutorial

Statistical Aggregations in PySpark | avg(), mean(), median(), mode()

Statistical Aggregations in PySpark

Learn how to use avg(), mean(), median(), and mode() in PySpark with practical examples.

What You'll Learn

  • How to compute the average (mean) using avg() and mean()
  • How to calculate the median value in a DataFrame
  • How to determine the mode (most frequent value)

Sample Data

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data = [("Aamir", 500), ("Sara", 300), ("John", 700),
        ("Lina", 200), ("Aamir", 550), ("Sara", 650),
        ("John", 700), ("Lina", 250), ("John", 700)]
schema = StructType([
    StructField("name", StringType(), True),
    StructField("sales", IntegerType(), True)
])
df = spark.createDataFrame(data, schema)

Example: avg() & mean()

from pyspark.sql.functions import avg, mean

df_avg = df.agg(avg("sales").alias("average_sales"))
df_avg.show()

Output:

+-----------------+
|    average_sales|
+-----------------+
|505.5555555555556|
+-----------------+

Example: median()

median = df.approxQuantile("sales", [0.5], 0.0)[0]
print("Median:", median)

Output:

Median: 550.0

Example: mode()

mode_df = df.groupBy("sales").count().orderBy("count", ascending=False)
mode = mode_df.first()["sales"]
print("Mode:", mode)

Output:

Mode: 700
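
On newer runtimes (Spark 3.4 and later), median() and mode() also exist as built-in aggregate functions, which avoids the approxQuantile and groupBy workarounds shown above; a short sketch:

from pyspark.sql.functions import median, mode   # built-in aggregates, Spark 3.4+

df.agg(median("sales").alias("median_sales"),
       mode("sales").alias("mode_sales")).show()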
    

Video Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Mount ADLS Gen2 Storage Using Linked Service in Azure Synapse Analytics | Azure Synapse Tutorial

Mount ADLS Gen2 Storage Using Linked Service in Azure Synapse Analytics

Overview

Azure Synapse Analytics allows you to securely connect to Azure Data Lake Storage Gen2 (ADLS Gen2) using a Linked Service. Once configured, you can easily access files using the abfss:// path format within Spark notebooks or pipelines.

Why Use Linked Service?

  • Centralized and secure credential management
  • Reuse connection across notebooks and pipelines
  • No need to hard-code credentials in code
  • Simplified access using friendly names and workspace integration

Step-by-Step: Configure and Use Linked Service

✅ Step 1: Create Linked Service

  1. Go to your Synapse workspace
  2. Navigate to Manage → Linked Services
  3. Click + New and choose Azure Data Lake Storage Gen2
  4. Enter a name (e.g., adls2-linked)
  5. Choose authentication method (Account Key, Service Principal, or Managed Identity)
  6. Test connection and click Create

✅ Step 2: Reference Storage in Notebooks

%%pyspark
df = spark.read.option("header", "true") \
    .csv("abfss://<container>@<storageaccount>.dfs.core.windows.net/path/to/data.csv")
df.show()
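
Since the title mentions mounting: once the linked service exists, you can also mount the container with mssparkutils and read through the mount point. A sketch using the linked service name from this post; the mount point name is illustrative:

%%pyspark
from notebookutils import mssparkutils

mssparkutils.fs.mount(
    "abfss://<container>@<storageaccount>.dfs.core.windows.net",
    "/mnt/adls2",
    {"linkedService": "adls2-linked"}
)

# Mounted paths are addressed through the synfs scheme and the current job id
job_id = mssparkutils.env.getJobId()
df_mounted = spark.read.option("header", "true").csv(f"synfs:/{job_id}/mnt/adls2/path/to/data.csv")
df_mounted.show()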

✅ Step 3: Access from Synapse Pipelines

Once the linked service is created, you can:

  • Use it in Copy Data and Data Flow activities
  • Browse ADLS files using the Synapse Studio UI

Example abfss Path Format

abfss://raw@techbrothersadlsgen2.dfs.core.windows.net/sales/data.csv

Tips

  • Use Managed Identity for enterprise security integration
  • Ensure Synapse workspace has Storage Blob Data Contributor access to ADLS
  • Use path autocompletion in notebooks for easy access

Watch the Video Tutorial

Credit: Content created with the help of ChatGPT.

PySpark Aggregations: count(), count_distinct(), first(), last() Explained with Examples #pyspark | PySpark Tutorial

Basic Aggregations in PySpark | count(), count_distinct(), first(), last()

Basic Aggregations in PySpark

Learn how to use count(), count_distinct(), first(), and last() functions with real-world PySpark DataFrames. A beginner-friendly walkthrough.

Step 1: Sample Dataset

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data = [
  ("Aamir", "New York", 31),
  ("Sara", "San Francisco", 25),
  ("John", "Los Angeles", 35),
  ("Lina", "Chicago", 28),
  ("Aamir", "Lahore", 30),
  ("John", "Los Angeles", 35)
]

schema = StructType([
  StructField("name", StringType(), True),
  StructField("city", StringType(), True),
  StructField("age", IntegerType(), True)
])

df = spark.createDataFrame(data, schema)
df.show()

Output:

+-----+-------------+---+
| name|         city|age|
+-----+-------------+---+
|Aamir|     New York| 31|
| Sara|San Francisco| 25|
| John|  Los Angeles| 35|
| Lina|      Chicago| 28|
|Aamir|       Lahore| 30|
| John|  Los Angeles| 35|
+-----+-------------+---+

Step 2: Aggregation Functions

1️⃣ count()

# aggregate functions used in the examples below (any_value requires Spark 3.5+)
from pyspark.sql.functions import count, count_distinct, any_value, first, last

df.groupBy("name").agg(count("name").alias("name_count")).show()

2️⃣ count_distinct()

df.groupBy("name").agg(count_distinct("age").alias("distinct_age_count")).show()

3️⃣ any_value()

df.groupBy("name").agg(any_value("city").alias("any_city")).show()

4️⃣ first()

df.groupBy("name").agg(first("city").alias("first_city")).show()

5️⃣ last()

df.groupBy("name").agg(last("city").alias("last_city")).show()

Watch the Full Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Mastering SQL JOINs in Fabric Warehouse INNER, LEFT, RIGHT, FULL, CROSS | Microsoft Fabric Tutorial

Mastering SQL JOINs in Fabric Warehouse – INNER, LEFT, RIGHT, FULL, CROSS | Microsoft Fabric Tutorial

Mastering SQL JOINs in Fabric Warehouse – INNER, LEFT, RIGHT, FULL, CROSS

In this Microsoft Fabric tutorial, you’ll learn how to use different types of SQL JOINs to combine data across tables inside Fabric Warehouse. We’ll walk through INNER, LEFT, RIGHT, FULL OUTER, and CROSS JOIN with clear examples.

What is a SQL JOIN?

A JOIN clause is used to combine rows from two or more tables based on a related column between them. JOINs allow you to query relational data stored in separate tables in a unified view.

Types of JOINs in Fabric Warehouse

  • INNER JOIN – Returns only the matching rows from both tables
  • LEFT JOIN – Returns all rows from the left table and matched rows from the right
  • RIGHT JOIN – Returns all rows from the right table and matched rows from the left
  • FULL OUTER JOIN – Returns all rows from both tables, matched or not
  • CROSS JOIN – Returns a Cartesian product (all combinations of rows)

Step 1: Create Tables

CREATE TABLE dbo.Customers (
    CustomerID INT,
    CustomerName VARCHAR(50),
    Country VARCHAR(50)
);

CREATE TABLE dbo.Orders (
    OrderID INT,
    CustomerID INT,
    OrderAmount DECIMAL(10,2),
    OrderDate DATE
);

Step 2: Insert Sample Data

INSERT INTO dbo.Customers (CustomerID, CustomerName, Country) VALUES
(1, 'Aamir', 'USA'),
(2, 'Sara', 'Canada'),
(3, 'John', 'UK'),
(4, 'Mia', 'Australia');

INSERT INTO dbo.Orders (OrderID, CustomerID, OrderAmount, OrderDate) VALUES
(101, 1, 500.00, '2024-01-01'),
(102, 2, 700.00, '2024-01-02'),
(103, 1, 300.00, '2024-01-05'),
(104, 5, 250.00, '2024-01-07');  -- CustomerID 5 does not exist in Customers

Step 3: SQL JOIN Examples

✅ INNER JOIN

SELECT c.CustomerName, o.OrderID, o.OrderAmount
FROM dbo.Customers c
INNER JOIN dbo.Orders o ON c.CustomerID = o.CustomerID;

✅ LEFT JOIN

SELECT c.CustomerName, o.OrderID, o.OrderAmount
FROM dbo.Customers c
LEFT JOIN dbo.Orders o ON c.CustomerID = o.CustomerID;

✅ RIGHT JOIN

SELECT c.CustomerName, o.OrderID, o.OrderAmount
FROM dbo.Customers c
RIGHT JOIN dbo.Orders o ON c.CustomerID = o.CustomerID;

✅ FULL OUTER JOIN

SELECT c.CustomerName, o.OrderID, o.OrderAmount
FROM dbo.Customers c
FULL OUTER JOIN dbo.Orders o ON c.CustomerID = o.CustomerID;

✅ CROSS JOIN (Bonus)

SELECT c.CustomerName, o.OrderID
FROM dbo.Customers c
CROSS JOIN dbo.Orders o;

Watch the Full Tutorial

Blog post written with the help of ChatGPT.