Understanding IDENTITY Behavior in Azure Synapse SQL Pool | Non-Sequential IDs Explained | Azure Synapse Analytics Tutorial

Understanding IDENTITY Behavior in Azure Synapse SQL Pool | Non-Sequential IDs Explained

📘 What is a Surrogate Key?

A surrogate key is an artificial, system-generated unique identifier used to represent each row in a table. It is not derived from actual business data and is commonly used in data warehouse dimension tables to simplify joins with fact tables.

🔹 Using IDENTITY in Synapse SQL Pools

In Azure Synapse Dedicated SQL Pools, you can use the IDENTITY property to automatically generate surrogate keys. However, due to the distributed nature of Synapse, IDENTITY values are not guaranteed to be sequential.

✅ Create Table with IDENTITY

CREATE TABLE dbo.DimCustomer
(
    CustomerSK INT IDENTITY(1,1) NOT NULL,
    CustomerName NVARCHAR(100),
    Country NVARCHAR(50)
)
WITH (
    DISTRIBUTION = HASH(CustomerName),
    CLUSTERED COLUMNSTORE INDEX
);

🚀 Insert Sample Data

INSERT INTO dbo.DimCustomer (CustomerName, Country) VALUES ('Alice', 'USA');
INSERT INTO dbo.DimCustomer (CustomerName, Country) VALUES ('Bob', 'Canada');
INSERT INTO dbo.DimCustomer (CustomerName, Country) VALUES ('Charlie', 'UK');

🛠️ Insert Custom IDENTITY Value

Sometimes you might need to insert a specific surrogate key (e.g., -1 for "Unknown"). You can do that by temporarily enabling IDENTITY_INSERT:

SET IDENTITY_INSERT dbo.DimCustomer ON;
INSERT INTO dbo.DimCustomer (CustomerSK, CustomerName, Country)
VALUES (-1, 'Unknown', 'N/A');
SET IDENTITY_INSERT dbo.DimCustomer OFF;

🔎 Query Metadata About IDENTITY Column

SELECT sm.name AS schema_name,
       tb.name AS table_name,
       co.name AS column_name,
       CASE WHEN ic.column_id IS NOT NULL THEN 1 ELSE 0 END AS is_identity
FROM sys.schemas sm
JOIN sys.tables tb ON sm.schema_id = tb.schema_id
JOIN sys.columns co ON tb.object_id = co.object_id
LEFT JOIN sys.identity_columns ic ON co.object_id = ic.object_id AND co.column_id = ic.column_id
WHERE sm.name = 'dbo' AND tb.name = 'DimCustomer';

๐Ÿ“ Check Seed & Increment Values

SELECT ic.seed_value, ic.increment_value
FROM sys.identity_columns ic
JOIN sys.columns c ON ic.object_id = c.object_id AND ic.column_id = c.column_id
JOIN sys.tables t ON t.object_id = c.object_id
WHERE t.name = 'DimCustomer';

⚠️ IDENTITY Behavior Caveats

  • IDENTITY values may skip numbers and are not guaranteed to be sequential, because inserts run in parallel across distributions.
  • An IDENTITY column cannot be used as the distribution column.
  • CTAS and SELECT INTO do not preserve IDENTITY properties.
  • Use INSERT INTO ... SELECT when populating IDENTITY tables.

💡 Best Practices

  • Use IDENTITY columns for dimension surrogate keys.
  • Do not rely on IDENTITY values being sequential or gap-free across rows.
  • For reporting or incremental loading, use a MAX(ID) lookup instead of assuming contiguous values (see the sketch below).
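
A minimal sketch of that MAX(ID) lookup pattern, assuming a hypothetical staging table stg.DimCustomer that feeds the dimension above:

-- Capture the current high-water mark before the load
DECLARE @MaxSK INT;
SELECT @MaxSK = ISNULL(MAX(CustomerSK), 0) FROM dbo.DimCustomer;

-- Load new rows; IDENTITY assigns CustomerSK values that are unique but not necessarily contiguous
INSERT INTO dbo.DimCustomer (CustomerName, Country)
SELECT CustomerName, Country
FROM stg.DimCustomer;

-- Pick up only the rows generated by this load (assumes no concurrent loads)
SELECT CustomerSK, CustomerName, Country
FROM dbo.DimCustomer
WHERE CustomerSK > @MaxSK;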

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

PySpark String Functions Explained | contains(), startswith(), substr(), endswith() with Examples | PySpark Tutorial

PySpark String Functions Explained | contains(), startswith(), substr(), endswith()

PySpark String Functions Explained

In this tutorial, you'll learn how to use PySpark string functions like contains(), startswith(), substr(), and endswith(). These functions are very helpful for filtering, searching, and extracting string data in PySpark DataFrames.

🔹 Sample Data

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("StringFunctionsExample").getOrCreate()

data = [
    (1, "Aamir"),
    (2, "Ali"),
    (3, "Bob"),
    (4, "Lisa"),
    (5, "Zara"),
    (6, "ALINA"),
    (7, "amrita"),
    (8, "Sana")
]
columns = ["id", "name"]
df = spark.createDataFrame(data, columns)
df.show()

🔹 contains() Function

Filter rows where name contains "a":

df.filter(col("name").contains("a")).show()

🔹 startswith() Function

Filter rows where name starts with "A":

df.filter(col("name").startswith("A")).show()

🔹 endswith() Function

Filter rows where name ends with "a":

df.filter(col("name").endswith("a")).show()

🔹 substr() Function

Extract first two characters from name:

df.withColumn("first_two", col("name").substr(1, 2)).show()
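
Because these functions return boolean columns, you can combine them with & (and), | (or), and ~ (not) like any other filter condition. For example:

# Names that start with "A" AND end with "r"
df.filter(col("name").startswith("A") & col("name").endswith("r")).show()

# Names that do NOT contain "a"
df.filter(~col("name").contains("a")).show()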

🎥 Watch the Full Tutorial

Full Notebook Interface Walkthrough with Hands-On Demo | Microsoft Fabric Tutorial for Beginners and Advanced Users / Data Engineers

Full Notebook Interface Walkthrough with Hands-On Demo | Microsoft Fabric Tutorial

🧠 Full Notebook Interface Walkthrough with Hands-On Demo

Microsoft Fabric Tutorial for Beginners and Data Engineers

📓 What Is a Notebook in Microsoft Fabric?

A notebook in Microsoft Fabric is an interactive development environment where you can run PySpark, Spark SQL, or R code, visualize results, and collaborate with your team. It includes advanced features like AI assistance, Git integration, scheduling, and Spark session management.

🏠 Home Tab

  • ▶️ Run All: Executes all cells in order
  • 💾 Save / Auto-Save: Keeps your work updated automatically
  • 🔗 Connect: Allows you to connect your notebook to a Spark runtime
  • 🧪 Kernel: Select or restart the notebook’s compute kernel

🧠 AI & Copilot Tab

  • Generate code with AI suggestions
  • Explain cell logic using natural language
  • Auto-generate visualizations and queries based on schema

🔢 Cell Operations Tab

  • Add code or markdown cells
  • Move cells up or down
  • Clear output or delete cell

📅 Schedule Tab

  • Convert notebook into scheduled pipeline activities
  • Trigger based on time or event
  • Integrate into Fabric Data Pipelines

📂 Data Tab

  • Browse Lakehouse files and tables directly within the notebook
  • Drag and drop tables to auto-generate read/load code (a sketch of typical generated code follows below)
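
The generated code varies with the item you drag in, but it is typically a simple Spark read along these lines (the table and file names below are placeholders, not part of the walkthrough):

# Read a Lakehouse table that was dragged into the notebook
df = spark.read.table("sales_orders")
display(df)

# Read a file from the Lakehouse Files area instead
df_csv = spark.read.format("csv").option("header", "true").load("Files/raw/sales.csv")
display(df_csv)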

⚙️ Settings Tab

  • Select Spark pool and session timeout
  • Adjust language mode (PySpark, SQL, R)

🧪 Hands-On Demo Overview

In the video walkthrough below, we demonstrate how to:

  • Navigate and utilize each tab
  • Write and run code in code cells
  • Use markdown for documentation
  • Explore and query Lakehouse data
  • Leverage Spark SQL and visual output

🎬 Watch the Full UI Tour & Demo

Blog created with help from ChatGPT and Gemini.

Install OneLake File Explorer to Access Fabric from Local Computer | Microsoft Fabric Tutorial for Beginners and Advanced Users

Install OneLake File Explorer to Access Fabric from Local Computer | Microsoft Fabric Tutorial

Install OneLake File Explorer to Access Fabric from Local Computer

Microsoft Fabric Tutorial

📘 Introduction

In this tutorial, you'll learn how to install and use OneLake File Explorer to access your Microsoft Fabric data (Lakehouse files) directly from your Windows file system.

OneLake File Explorer is a simple yet powerful tool that brings the data lake experience closer to local development by letting you browse, upload, and manage files stored in your Fabric Lakehouse.

🧭 Why Use OneLake File Explorer?

  • 🚀 Access Fabric files just like local folders
  • 📂 Drag-and-drop support for CSV, JSON, and Parquet
  • 🔒 Secure access using Microsoft Entra ID (Azure AD)
  • 🔄 Sync between your Lakehouse and desktop tools (VS Code, Excel, Power BI Desktop)

🛠️ How to Install OneLake File Explorer

  1. Go to the official Microsoft download page: Download OneLake File Explorer
  2. Download and run the installer
  3. Sign in with your Microsoft Fabric account (work or school)
  4. Once installed, OneLake will appear in your File Explorer under This PC

📂 Browse and Manage Lakehouse Data

After installation, you can:

  • Open OneLake in File Explorer
  • Navigate through your Fabric workspaces and Lakehouses
  • Drag and drop files into /Files or /Tables
  • Rename, move, or delete files like a normal folder

💡 Best Practices

  • Keep a folder structure like raw/, processed/, curated/
  • Ensure your files are in Parquet or Delta format for better performance
  • Use OneLake Explorer for quick uploads during development and testing

🎬 Watch the Video Tutorial

Blog created with help from ChatGPT and Gemini.

Indexing in Azure Synapse Dedicated SQL Pool Explained – Columnstore, Heap, Clustered & Rebuild Tips | Azure Synapse Analytics Tutorial

Indexing in Azure Synapse Dedicated SQL Pool Explained – Columnstore, Heap, Clustered & Rebuild Tips

📘 Introduction

Indexes in Azure Synapse Dedicated SQL Pool play a crucial role in optimizing query performance and storage. Synapse supports multiple types of indexes depending on your workload type—batch analytics, lookups, or staging data.

This blog covers:

  • Clustered Columnstore Index
  • Heap Tables
  • Clustered Index
  • Nonclustered Index
  • Index Rebuild Tips

🔹 1. Clustered Columnstore Index (Default & Best for Analytics)

Best suited for large, analytical workloads. This index compresses data and improves performance. It's the default if no index is specified.

CREATE TABLE dbo.SalesFact_Columnstore
(
    SaleID INT,
    CustomerID INT,
    Amount FLOAT,
    SaleDate DATE
)
WITH (
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH(CustomerID)
);

🔹 2. Heap Table (No Index)

No indexes are applied. Ideal for fast loading in staging or transient tables where performance tuning is not required yet.

CREATE TABLE dbo.SalesFact_Heap
(
    SaleID INT,
    CustomerID INT,
    Amount FLOAT,
    SaleDate DATE
)
WITH (
    HEAP,
    DISTRIBUTION = ROUND_ROBIN
);

🔹 3. Clustered Index (Rowstore)

Stores data in a sorted row-based format. Best for small dimension or lookup tables and selective filters.

CREATE TABLE dbo.Customer_Detail
(
    CustomerID INT NOT NULL,
    FullName NVARCHAR(100),
    Region NVARCHAR(50)
)
WITH (
    CLUSTERED INDEX (CustomerID),
    DISTRIBUTION = HASH(CustomerID)
);

🔹 4. Nonclustered Index

Creates a separate index on one or more columns for faster lookups. Use selectively as each index adds overhead.

CREATE INDEX idx_Region ON dbo.Customer_Detail(Region);

🔄 5. Rebuilding Indexes (Especially for Columnstore)

Rebuilding restores compression and performance. Ideal for maintenance schedules, especially after heavy loads or updates.

-- Rebuild all indexes on a table
ALTER INDEX ALL ON dbo.SalesFact_Columnstore REBUILD;

-- Rebuild a specific partition (if partitioned)
-- ALTER INDEX ALL ON dbo.SalesFact_Columnstore REBUILD PARTITION = 1;

-- Optional compression setting
-- ALTER INDEX ALL ON dbo.SalesFact_Columnstore 
--   REBUILD PARTITION = 1 WITH (DATA_COMPRESSION = COLUMNSTORE);
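
If you want to confirm which index type a table actually uses, sys.indexes records it. A quick check against the demo tables created above:

-- List index type per table (type_desc shows CLUSTERED COLUMNSTORE, HEAP, CLUSTERED, or NONCLUSTERED)
SELECT t.name AS table_name,
       i.name AS index_name,
       i.type_desc
FROM sys.tables t
JOIN sys.indexes i ON t.object_id = i.object_id
WHERE t.name IN ('SalesFact_Columnstore', 'SalesFact_Heap', 'Customer_Detail');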

✅ Best Practices

  • Use CLUSTERED COLUMNSTORE for large fact tables and analytics.
  • Use HEAP for fast, staging loads or temporary tables.
  • Use CLUSTERED INDEX for small dimension or lookup tables.
  • Use NONCLUSTERED INDEX for tuning specific query filters.
  • Rebuild indexes regularly to maintain performance.

⚠️ Note: Index rebuilds are offline operations — schedule them during maintenance windows.

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

PySpark when() and otherwise() Explained | Apply If-Else Conditions to DataFrames #pysparktutorial

PySpark when() and otherwise() Explained | Apply If-Else Conditions to DataFrames

PySpark when() and otherwise() Explained

In this tutorial, you'll learn how to use the when() and otherwise() functions in PySpark to apply if-else style conditional logic directly to DataFrames. These functions are useful for transforming values in a column based on conditions.

🔹 Step 1: Create SparkSession & Sample Data

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.appName("WhenOtherwiseExample").getOrCreate()

data = [
    (1, "Aamir", 50000),
    (2, "Ali", None),
    (3, "Bob", 45000),
    (4, "Lisa", 60000),
    (5, "Zara", None),
    (6, "ALINA", 55000)
]
columns = ["id", "name", "salary"]
df = spark.createDataFrame(data, columns)
df.show()

🔹 Step 2: Apply when() and otherwise()

# Create a new column with conditional labels
df_with_label = df.withColumn(
    "salary_label",
    when(col("salary") >= 55000, "High")
    .otherwise("Low")
)

df_with_label.show()

🔹 Step 3: Apply Multiple Conditions

# Multiple when conditions
df_with_category = df.withColumn(
    "salary_category",
    when(col("salary") > 60000, "Very High")
    .when((col("salary") >= 50000) & (col("salary") <= 60000), "Medium")
    .when(col("salary") < 50000, "Low")
    .otherwise("Unknown")
)

df_with_category.show()
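
The same pattern is handy for data cleanup, for example substituting a default value when salary is NULL (a small sketch using the DataFrame above):

# Replace missing salaries with 0 using when()/otherwise()
df_filled = df.withColumn(
    "salary_filled",
    when(col("salary").isNull(), 0).otherwise(col("salary"))
)
df_filled.show()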

🎥 Watch Video Tutorial

Primary Key & Unique Constraints in Azure Synapse Dedicated SQL Pool | NOT ENFORCED Explained | Azure Synapse Analytics Tutorial

Primary Key & Unique Constraints in Azure Synapse Dedicated SQL Pool | NOT ENFORCED Explained

๐Ÿ” Introduction

In Azure Synapse Dedicated SQL Pools, constraints such as PRIMARY KEY and UNIQUE are allowed — but with a twist: they are NOT ENFORCED. This means they serve as metadata hints for the query optimizer rather than enforcing data integrity rules like in traditional SQL Server.

📌 Key Points

  • FOREIGN KEY constraints are not supported.
  • PRIMARY KEY and UNIQUE constraints must include NOT ENFORCED.
  • Duplicates are not prevented — you are responsible for ensuring data quality.
  • These constraints can help the query optimizer generate better plans.

🛠️ Demo Overview

Step 1: Create Demo Table

CREATE TABLE dbo.Products (
    ProductID INT NOT NULL,
    ProductName NVARCHAR(100),
    Category NVARCHAR(50)
)
WITH (DISTRIBUTION = ROUND_ROBIN);

Step 2: Insert Duplicate Values

INSERT INTO dbo.Products VALUES (1, 'Laptop', 'Electronics');
INSERT INTO dbo.Products VALUES (2, 'Tablet', 'Electronics');
INSERT INTO dbo.Products VALUES (1, 'Monitor', 'Electronics'); -- Duplicate ProductID
INSERT INTO dbo.Products VALUES (3, 'Desk', 'Furniture');

Step 3: Add UNIQUE Constraint (NOT ENFORCED)

ALTER TABLE dbo.Products
ADD CONSTRAINT UQ_ProductID UNIQUE (ProductID) NOT ENFORCED;

⚠️ The constraint will be created even though duplicates exist.

Step 4: Remove Duplicate Manually

DELETE FROM dbo.Products WHERE ProductName = 'Monitor';

Step 5: Replace UNIQUE with PRIMARY KEY (also NOT ENFORCED)

ALTER TABLE dbo.Products
DROP CONSTRAINT UQ_ProductID;

ALTER TABLE dbo.Products
ADD CONSTRAINT PK_ProductID PRIMARY KEY NONCLUSTERED (ProductID) NOT ENFORCED;

💡 Why Use These Constraints?

  • Helps Synapse generate optimized query execution plans.
  • Useful for BI/reporting layers that assume uniqueness.
  • Important for documentation and data governance.

🚫 Limitations

  • Constraints won't stop bad data — no runtime enforcement.
  • You must manually ensure data quality via queries or ELT logic (see the duplicate check below).
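
A minimal duplicate check you can run as part of that ELT validation, using the demo table above:

-- ProductID values that violate the (not enforced) uniqueness assumption
SELECT ProductID, COUNT(*) AS row_count
FROM dbo.Products
GROUP BY ProductID
HAVING COUNT(*) > 1;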

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

PySpark Null & Comparison Functions: between(), isNull(), isin(), like(), rlike(), ilike() | PySpark Tutorial

PySpark Null & Comparison Functions Explained | between(), isNull(), isin(), like(), rlike(), ilike()

PySpark Null & Comparison Functions Explained

This PySpark tutorial explains how to use essential functions for handling nulls, filtering data, and performing pattern matching in DataFrames using:

  • between()
  • isNull() and isNotNull()
  • isin()
  • like(), rlike(), and ilike()

1. Create a Sample DataFrame

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("NullComparisonOps").getOrCreate()

data = [
    (1, "Aamir", 50000),
    (2, "Ali", None),
    (3, "Bob", 45000),
    (4, "Lisa", 60000),
    (5, "Zara", None),
    (6, "ALINA", 55000)
]

columns = ["id", "name", "salary"]
df = spark.createDataFrame(data, columns)
df.show()

2. Use between() Function

Select employees whose salary is between 45000 and 60000:

df.filter(col("salary").between(45000, 60000)).show()

3. Use isNull() and isNotNull()

Filter rows where salary is null:

df.filter(col("salary").isNull()).show()

Filter rows where salary is not null:

df.filter(col("salary").isNotNull()).show()

4. Use isin() Function

Filter names that are in the list ["Aamir", "Lisa"]:

df.filter(col("name").isin("Aamir", "Lisa")).show()

5. Use like(), rlike(), and ilike()

Names that start with 'A':

df.filter(col("name").like("A%")).show()

Names matching regex (e.g., all names ending in 'a'):

df.filter(col("name").rlike(".*a$")).show()

Case-insensitive LIKE (if using Spark 3.3+):

df.filter(col("name").ilike("ali%")).show()

📺 Watch Full Tutorial

CTAS vs SELECT INTO in Azure Synapse Analytics | Create Table in Dedicated SQL Pool | Azure Synapse Analytics Tutorial

CTAS vs SELECT INTO in Azure Synapse Analytics | Create Table in Dedicated SQL Pool

📘 Overview

In Azure Synapse Analytics, you have two common options for creating new tables based on the results of a query:

  • CTAS – Create Table As Select
  • SELECT INTO – Select data into a new table

While both approaches allow you to create a new table from existing data, there are important differences in features, performance, and support within Synapse Dedicated SQL Pools.

⚖️ Comparison Table: CTAS vs SELECT INTO

Feature                           | CTAS                             | SELECT INTO
Supported in Dedicated SQL Pool   | ✅ Yes                           | ❌ No
Control Over Distribution Method  | ✅ Yes                           | ❌ No
Control Over Indexes              | ✅ Yes (e.g., columnstore index) | ❌ No
Used in Serverless SQL Pool       | ❌ No                            | ✅ Yes
Syntax Simplicity                 | Moderate                         | Very Simple

✅ Example: CTAS in Dedicated SQL Pool

CREATE TABLE dbo.TotalSalesByProduct
WITH (
    DISTRIBUTION = HASH(ProductID),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT ProductID, SUM(Quantity) AS TotalQuantity
FROM dbo.SalesFact
GROUP BY ProductID;

This will create a distributed and indexed table optimized for analytics.

✅ Example: SELECT INTO in Serverless SQL Pool

SELECT ProductID, SUM(Quantity) AS TotalQuantity
INTO dbo.TotalSalesByProduct
FROM OPENROWSET(
    BULK 'https://storageaccount.dfs.core.windows.net/container/sales.parquet',
    FORMAT = 'PARQUET'
) AS sales
GROUP BY ProductID;

This method is easier and works well in exploratory scenarios in Serverless SQL Pools, but doesn't allow control over performance features.

📌 Key Takeaways

  • Use CTAS in Dedicated SQL Pool for production-grade performance.
  • Use SELECT INTO in Serverless SQL Pool for lightweight operations or temporary results.
  • CTAS gives you full control over distribution, storage format, and indexing.

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

Create Your First Lakehouse & Load CSV, Parquet, JSON Files | Microsoft Fabric Tutorial for Data Engineers

Create Your First Lakehouse & Load CSV, Parquet, JSON Files | Microsoft Fabric Tutorial

Create Your First Lakehouse & Load CSV, Parquet, JSON Files

Microsoft Fabric Tutorial

📘 What is a Lakehouse in Microsoft Fabric?

A Lakehouse in Microsoft Fabric combines the scalability and flexibility of a data lake with the structured querying power of a data warehouse. It stores files in Delta Lake format, enabling analytics directly over raw and structured data — using SQL, notebooks, or Power BI.

✅ What You'll Learn in This Tutorial

  • What a Lakehouse is and its role in Microsoft Fabric
  • Step-by-step process to create a new Lakehouse
  • How to upload and manage CSV, Parquet, and JSON files
  • How Microsoft Fabric unifies data lake and data warehouse capabilities
  • Practical tips to structure your Lakehouse for analytics workloads

🛠️ Step-by-Step: Creating Your First Lakehouse

  1. Log in to Microsoft Fabric Portal
  2. Go to your workspace and click + New → Lakehouse
  3. Give your Lakehouse a name and hit Create

Once created, you'll land in the Lakehouse explorer which allows you to manage files, tables, and notebooks.

📂 Upload CSV, Parquet, and JSON Files

Inside your Lakehouse, switch to the Files tab:

  • Click on Upload and select one or more files (CSV, Parquet, or JSON)
  • Uploaded files are stored in the /Files folder
  • You can preview and open these files in notebooks or convert them into managed Delta tables (see the sketch below)
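
One way to do that conversion from a notebook attached to the Lakehouse is sketched below (the file and table names are placeholders):

# Read an uploaded CSV from the Lakehouse Files area
df = spark.read.format("csv").option("header", "true").load("Files/raw/sales_2025_Q2.csv")

# Save it as a managed Delta table so it appears under Tables
df.write.format("delta").mode("overwrite").saveAsTable("sales_2025_q2")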

📊 Unifying Data Lake and Warehouse

Microsoft Fabric allows you to treat your Lakehouse like a warehouse using DirectLake and SQL endpoints:

  • Run SQL queries on files/tables using SQL analytics endpoint
  • Use Power BI for visualizations without importing data
  • Query Delta tables using Spark notebooks

💡 Tips to Structure Your Lakehouse

  • Use folders like /raw, /processed, and /curated to stage data
  • Convert CSV and JSON into Delta tables for analytics
  • Tag or name files consistently: e.g., sales_2025_Q2.csv

🎬 Watch the Video Tutorial

Blog created with help from ChatGPT and Gemini.

PySpark cast() vs astype() Explained | Convert String to Int, Float & Double in DataFrame | PySpark Tutorial

PySpark cast() vs astype() Explained | Convert String to Int, Float & Double

PySpark cast() vs astype() Explained

In this tutorial, we'll explore how to convert PySpark DataFrame columns from one type to another using cast() and astype(). You'll learn how to convert string columns to integers, floats, and doubles in a clean and efficient way.

1. Sample DataFrame

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("CastExample").getOrCreate()

data = [
    ("1", "Aamir", "50000.5"),
    ("2", "Ali", "45000.0"),
    ("3", "Bob", None),
    ("4", "Lisa", "60000.75")
]

columns = ["id", "name", "salary"]
df = spark.createDataFrame(data, columns)
df.printSchema()
df.show()

2. Using cast() Function

Convert id to integer and salary to float:

df_casted = df.withColumn("id", col("id").cast("int")) \
              .withColumn("salary", col("salary").cast("float"))
df_casted.printSchema()
df_casted.show()

3. Using astype() Function

This is an alias for cast() and used in the same way:

df_astype = df_casted.withColumn("salary", col("salary").astype("double"))
df_astype.printSchema()
df_astype.show()

Output:

Original DataFrame (all columns as strings):
+---+-----+--------+
| id| name| salary |
+---+-----+--------+
| 1 |Aamir|50000.5 |
| 2 | Ali |45000.0 |
| 3 | Bob |  null  |
| 4 |Lisa |60000.75|
+---+-----+--------+

After cast():
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: float (nullable = true)

After astype():
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: double (nullable = true)

📺 Watch the Full Tutorial

How to Use CTAS (Create Table As Select) in Azure Synapse Analytics | Azure Synapse Analytics Tutorial

How to Use CTAS in Azure Synapse Analytics to Create New Tables

How to Use CTAS (Create Table As Select) in Azure Synapse Analytics

📘 What is CTAS?

CTAS stands for Create Table As Select. It’s a T-SQL statement used in Azure Synapse Dedicated SQL Pools to create a new table by selecting data from one or more existing tables. CTAS is highly efficient for:

  • Creating aggregated or filtered datasets
  • Staging intermediate results
  • Transforming or reshaping data

CTAS leverages the Massively Parallel Processing (MPP) engine of Synapse to create new distributed tables efficiently.

✅ Syntax

CREATE TABLE [schema].[new_table]
WITH (DISTRIBUTION = HASH(column) | ROUND_ROBIN | REPLICATE)
AS
SELECT ... FROM existing_table;

🛠️ Real Examples

🔹 Scenario 1: Create Table with Subset of Columns

CREATE TABLE dbo.DimCustomerName
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT customer_id, fname, lastname
FROM dbo.Customers_CSV;

🔹 Scenario 2: Create Table with Aggregated Data

CREATE TABLE dbo.TotalSalesByProduct
WITH (DISTRIBUTION = HASH(ProductID))
AS
SELECT ProductID, SUM(Quantity) AS TotalQuantitySold
FROM dbo.SalesFact_CSV
GROUP BY ProductID;

🔹 Scenario 3: Create Table with Filtered Data

CREATE TABLE dbo.HighValueCustomers
WITH (DISTRIBUTION = REPLICATE)
AS
SELECT customer_id, fname, lastname
FROM dbo.Customers_CSV
WHERE customer_id IN (1, 3);

🔹 Scenario 4: Rename Columns and Cast Data Types

CREATE TABLE dbo.ProductPricing
WITH (DISTRIBUTION = HASH(ProductID_New))
AS
SELECT ProductID AS ProductID_New,
       ProductName AS Name,
       Category,
       CAST(Price AS DECIMAL(12, 4)) AS Price_Adjusted
FROM dbo.ProductInfo_Parquet;

🔹 Scenario 5: Create Table from JOIN

CREATE TABLE dbo.CustomerSales
WITH (DISTRIBUTION = HASH(customer_id))
AS
SELECT c.customer_id, c.fname, c.lastname,
       s.SaleID, s.ProductID, s.Quantity, s.SaleDate
FROM dbo.Customers_CSV c
JOIN dbo.SalesFact_CSV s
  ON c.customer_id = s.CustomerID;

🔹 Scenario 6: Use a Different Schema

CREATE TABLE staging.StagedProducts
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT ProductID, ProductName, Category, Price
FROM dbo.ProductInfo_Parquet
WHERE Category = 'Electronics';

🔹 Scenario 7: Create Table with New Data Types

CREATE TABLE dbo.SalesFactBigIntID
WITH (DISTRIBUTION = HASH(ProductID),
      CLUSTERED COLUMNSTORE INDEX)
AS
SELECT CAST(SaleID AS BIGINT) AS SaleID_BigInt,
       ProductID, Quantity, SaleDate, CustomerID
FROM dbo.SalesFact_CSV;

📌 CTAS Best Practices

  • Always choose the right DISTRIBUTION method (HASH for large fact tables, REPLICATE for small dimensions).
  • Use CTAS for performance-optimized reporting layers.
  • Drop and recreate staging tables during ELT to minimize fragmentation (see the sketch below).
  • Use appropriate data types to reduce storage size.
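
A minimal drop-and-recreate sketch for the staging table from Scenario 6:

-- Drop the staging table if it already exists, then rebuild it with CTAS
IF OBJECT_ID('staging.StagedProducts') IS NOT NULL
    DROP TABLE staging.StagedProducts;

CREATE TABLE staging.StagedProducts
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT ProductID, ProductName, Category, Price
FROM dbo.ProductInfo_Parquet
WHERE Category = 'Electronics';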

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

What Are Workspaces and How to Organize Fabric Items | Microsoft Fabric Tutorial for Data Engineers

What Are Workspaces and How to Organize Fabric Items | Microsoft Fabric Tutorial

🧱 What Are Workspaces and How to Organize Fabric Items

Microsoft Fabric Tutorial

📘 Introduction

In Microsoft Fabric, a Workspace is a container for organizing and managing your analytics items — such as Lakehouses, Warehouses, Notebooks, Reports, Pipelines, and more.

Workspaces help you group related data assets by project, department, or use case — making collaboration, security, and governance easier.

This tutorial covers:

  • What a workspace is
  • Key workspace features
  • How to organize items inside
  • Real-world use cases

🧾 What Is a Workspace?

A workspace is like a folder or project space in Microsoft Fabric.

You can think of it as your analytics sandbox where you build, store, and manage related data assets.

Each workspace includes:

  • 🔸 Access control (who can view/edit content)
  • 🔸 Linked capacity (defines compute power)
  • 🔸 Items like Lakehouses, Warehouses, Dashboards, Pipelines, Notebooks, etc.

📦 Types of Items You Can Organize

  • ๐Ÿž️ Lakehouses – Delta-based storage for structured + unstructured data
  • ๐Ÿข Warehouses – Traditional SQL interface on top of the same OneLake
  • ๐Ÿ““ Notebooks – Interactive development with PySpark
  • ๐Ÿ” Data Pipelines – ETL orchestration using visual tools
  • ๐Ÿ“Š Power BI Reports – For visualizations and analytics

🗂️ Organizing Strategy Tips

  • Create a separate workspace per department, project, or data domain
  • Use consistent naming conventions (e.g., Sales_Lakehouse_US, HR_Warehouse_EU)
  • Use folders inside workspaces to group Notebooks, Pipelines, and Files
  • Assign different roles (Admin, Member, Viewer) to control access securely

🎯 Real-World Example

In a large organization, you may have:

  • Marketing Workspace – Holds campaign data, dashboards, and reports
  • Sales Workspace – Includes Lakehouses for transactions, Warehouses for aggregations
  • IT Ops Workspace – Contains pipelines for ingestion, notebooks for automation

🎬 Watch the Video Tutorial

Blog created with help from ChatGPT and Gemini.

How to Drop All Tables from Lakehouse in Microsoft Fabric | Microsoft Fabric Tutorial for Beginners

How to Drop All Tables from Lakehouse in Microsoft Fabric

Microsoft Fabric Tutorial for Beginners

📘 Overview

If you're working with a Microsoft Fabric Lakehouse and want to clean up your environment, dropping all existing tables can help you reset your workspace. This tutorial shows how to do that programmatically using PySpark and the spark.catalog.

๐Ÿ” Step 1: List All Tables in the Lakehouse

Use the Spark catalog API to fetch all tables in the current Lakehouse:

# List all tables
tables = spark.catalog.listTables()

This returns metadata such as table name, database, and type (managed or external).

๐Ÿ” Step 2: Loop Through and Drop Each Table

Loop through each entry and run a DROP TABLE command:

for table in tables:
    table_name = table.name
    spark.sql(f"DROP TABLE IF EXISTS {table_name}")

This ensures that each table is removed cleanly from the catalog.

💡 Notes

  • This approach works only on tables in the current Lakehouse context.
  • Always double-check before running this in production — this operation is irreversible!
  • Works well during automated clean-up workflows or notebook resets.

🎬 Watch the Demo Video

Blog created with help from ChatGPT and Gemini.

PySpark Sorting Explained | ASC vs DESC | Handling NULLs with asc_nulls_first & desc_nulls_last | PySpark Tutorial

PySpark Sorting Explained: ASC vs DESC | Handling NULLs with asc_nulls_first & desc_nulls_last

In this tutorial, you’ll learn how to sort DataFrames in PySpark using ascending and descending order, while controlling NULL value placement using asc_nulls_first() and desc_nulls_last(). Real examples are provided for clarity.

1️⃣ Create Spark Session

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("SortingExamples").getOrCreate()

2️⃣ Create Sample Data

data = [
    (1, "Aamir", 50000),
    (2, "Ali", None),
    (3, "Bob", 45000),
    (4, "Lisa", 60000),
    (5, "Zara", None)
]
columns = ["id", "name", "salary"]
df = spark.createDataFrame(data, columns)
df.show()

3️⃣ Sort by ASC (default)

df.orderBy(col("salary").asc()).show()

4️⃣ Sort by ASC with NULLs First

df.orderBy(col("salary").asc_nulls_first()).show()

5️⃣ Sort by ASC with NULLs Last

df.orderBy(col("salary").asc_nulls_last()).show()

6️⃣ Sort by DESC

df.orderBy(col("salary").desc()).show()

7️⃣ DESC with NULLs First

df.orderBy(col("salary").desc_nulls_first()).show()

8️⃣ DESC with NULLs Last

df.orderBy(col("salary").desc_nulls_last()).show()
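
You can also pass several sort expressions to orderBy() and mix directions, for example name ascending and salary descending with NULLs pushed to the end:

df.orderBy(col("name").asc(), col("salary").desc_nulls_last()).show()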

📺 Watch the Full Tutorial

Create External Tables in Azure Synapse Dedicated Pool for CSV & Parquet Files | Azure Synapse Analytics Tutorial

Create External Tables in Azure Synapse Dedicated Pool for CSV & Parquet Files

📘 What is an External Table?

An External Table in Azure Synapse Dedicated SQL Pool allows you to query data stored outside the database — typically in Azure Data Lake Storage Gen2 — without importing it into the SQL pool itself. This enables cost-effective querying of large files like .csv or .parquet formats directly using T-SQL.

External tables are used in scenarios like raw zone exploration, staging, or ELT pipelines where data exists in a data lake but needs to be accessed for transformations or analytics.

๐Ÿ“ File Formats Supported

  • CSV: Commonly used for flat files
  • Parquet: Optimized, columnar format best for analytics

🔧 Pre-requisites

  • Azure Synapse workspace with a Dedicated SQL Pool
  • Data files (CSV/Parquet) stored in ADLS Gen2
  • Database scoped credential to access the storage account
  • External data source and file format definitions

🛠️ Step-by-Step Example for CSV

1. Create Database Scoped Credential

CREATE DATABASE SCOPED CREDENTIAL myadls_cred
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'sv=2024...';

2. Create External Data Source

CREATE EXTERNAL DATA SOURCE my_adls
WITH (
    LOCATION = 'abfss://datalake@yourstorage.dfs.core.windows.net/',
    CREDENTIAL = myadls_cred
);

3. Create External File Format for CSV

CREATE EXTERNAL FILE FORMAT csv_format
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',
        FIRST_ROW = 2
    )
);

4. Create External Table

CREATE EXTERNAL TABLE dbo.ExternalSalesCSV
(
    SaleID INT,
    ProductName VARCHAR(100),
    Quantity INT,
    Price MONEY
)
WITH (
    LOCATION = '/sales/csv/',
    DATA_SOURCE = my_adls,
    FILE_FORMAT = csv_format,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);

🛠️ Step-by-Step Example for Parquet

1. Create External File Format for Parquet

CREATE EXTERNAL FILE FORMAT parquet_format
WITH (
    FORMAT_TYPE = PARQUET
);

2. Create External Table

CREATE EXTERNAL TABLE dbo.ExternalProductsParquet
(
    ProductID INT,
    ProductName VARCHAR(100),
    Category VARCHAR(50),
    Price FLOAT
)
WITH (
    LOCATION = '/products/parquet/',
    DATA_SOURCE = my_adls,
    FILE_FORMAT = parquet_format
);
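
Once defined, an external table can be queried like any other table, or used as the source of a CTAS statement to materialize the data into an internal, distributed table (dbo.SalesInternal below is just an illustrative name):

-- Query the external table directly
SELECT TOP 10 * FROM dbo.ExternalSalesCSV;

-- Materialize it into an internal table for faster repeated queries
CREATE TABLE dbo.SalesInternal
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM dbo.ExternalSalesCSV;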

📌 Best Practices

  • Use Parquet format for large analytical queries to benefit from columnar compression.
  • Keep external table definitions consistent with the structure of the underlying file.
  • Use views to abstract external table logic for business users.
  • Use the REJECT_TYPE clause to handle malformed rows gracefully.

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

What Is OneLake? Unified Storage Layer Explained | Microsoft Fabric Tutorial #microsoftfabric

What Is OneLake? Unified Storage Layer Explained | Microsoft Fabric Tutorial

💧 What Is OneLake? Unified Storage Layer Explained

Microsoft Fabric Tutorial

📘 What is OneLake?

OneLake is Microsoft Fabric’s unified, enterprise-grade data lake that acts as the single storage layer for all data workloads in Fabric — including Lakehouse, Warehouse, Data Science, Real-Time Analytics, and Power BI.

It is built on top of Azure Data Lake Storage Gen2 (ADLS Gen2) but managed as part of the Fabric SaaS experience.

✅ Key Benefits of OneLake

  • 🧩 One Copy of Data: Store once, access across all workloads (SQL, Spark, Power BI, Dataflow)
  • 🔐 Centralized Governance: Integrated with Microsoft Purview and Entra ID
  • DirectLake Performance: Power BI can query Delta tables directly without import or refresh
  • 🌐 Global Namespace: All data is accessed under OneLake://<workspace>/<item>
  • 🗃️ Delta Format First: Built-in support for Delta Lake format (transactional + analytics)

๐ŸŒ How OneLake Integrates Across Fabric

OneLake acts as the foundation for:

  • Lakehouse: Built on top of OneLake using Apache Spark and Delta tables
  • Warehouse: Traditional SQL interface using the same underlying storage
  • Power BI DirectLake: Connects Power BI reports directly to OneLake-based Delta tables without needing refresh or dataset import
  • Data Science: Notebooks and ML workloads directly use OneLake-backed data (see the sketch below)
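
For example, from a Fabric notebook with a default Lakehouse attached, a Delta table stored in OneLake can be read directly; the table, workspace, and lakehouse names below are placeholders, and the ABFS path format follows the commonly documented OneLake convention:

# Read a Delta table from the attached Lakehouse
df = spark.read.table("sales")
df.show()

# The same table addressed through its OneLake ABFS path
path = "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Tables/sales"
df2 = spark.read.format("delta").load(path)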

🎬 Watch the Video Tutorial

Blog created with help from ChatGPT and Gemini.

How to Read Multiple CSV Files from Blob Storage and Write to Azure SQL Table with Filenames | PySpark Tutorial

How to Read Multiple CSV Files from Blob Storage and Write to Azure SQL Table with Filenames

In this PySpark tutorial, you’ll learn how to load multiple CSV files from Azure Blob Storage, extract the filename from each file, and write the combined data to an Azure SQL table for tracking and processing.

1️⃣ Step 1: Setup Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ReadCSV_WriteSQL_WithFilenames") \
    .getOrCreate()

2️⃣ Step 2: Set Azure Blob Storage Configuration

spark.conf.set(
    "fs.azure.sas.<container>.<storage_account>.blob.core.windows.net",
    "<your_sas_token>"
)

path = "wasbs://<container>@<storage_account>.blob.core.windows.net/input-folder/*.csv"

3️⃣ Step 3: Read All CSV Files and Add Filename Column

from pyspark.sql.functions import input_file_name, regexp_extract

df = spark.read.option("header", True).csv(path)
df = df.withColumn("filename", regexp_extract(input_file_name(), r"([^/]+$)", 1))
df.show()

4️⃣ Step 4: JDBC Configuration for Azure SQL

jdbcUrl = "jdbc:sqlserver://<your_server>.database.windows.net:1433;database=<your_db>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30"

connectionProperties = {
    "user": "<your_username>",
    "password": "<your_password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

5️⃣ Step 5: Write DataFrame to Azure SQL Table

df.write \
    .mode("append") \
    .jdbc(url=jdbcUrl, table="your_table_name", properties=connectionProperties)

print("✅ All CSV files uploaded to Azure SQL with filename column.")

📺 Watch the Full Tutorial

How to Use COPY INTO or Bulk Load CSV/Parquet Files into Dedicated SQL Pool | Azure Synapse Analytics Tutorial

How to Use COPY INTO or Bulk Load CSV/Parquet Files into Dedicated SQL Pool | Azure Synapse

📦 What is COPY INTO in Azure Synapse?

COPY INTO is a high-performance T-SQL command used in Azure Synapse Analytics to bulk load data from external files (CSV or Parquet) in Azure Data Lake Storage Gen2 into a table in a Dedicated SQL Pool.

This method is optimized for loading large datasets efficiently and supports common file formats like .csv and .parquet, using parallel processing under the hood.

🗂️ Supported File Formats

  • CSV — with or without headers
  • Parquet — columnar storage format for analytics

🛠️ Syntax of COPY INTO

COPY INTO target_table
FROM 'https://<storage_account>.dfs.core.windows.net/<container>/<path>/'
WITH (
    FILE_TYPE = 'CSV', -- or 'PARQUET'
    CREDENTIAL = (IDENTITY= 'Shared Access Signature', SECRET='<sas_token>'),
    FIELDTERMINATOR = ',',  -- CSV specific
    ROWTERMINATOR = '0x0A', -- CSV specific
    FIRSTROW = 2            -- skip header if present
);

✅ Example: Load CSV File into Sales Table

COPY INTO dbo.Sales
FROM 'https://techstorage.dfs.core.windows.net/raw/sales.csv'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (IDENTITY= 'Shared Access Signature', SECRET='sv=2024...'),
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '0x0A',
    FIRSTROW = 2
);

✅ Example: Load Parquet File into Product Table

COPY INTO dbo.Product
FROM 'https://techstorage.dfs.core.windows.net/raw/productdata/'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY= 'Shared Access Signature', SECRET='sv=2024...')
);

📌 Tips for Using COPY INTO

  • Make sure your Dedicated SQL Pool is resumed and ready to load.
  • Use Azure Data Lake Storage Gen2 for optimal performance.
  • Test COPY INTO with a small file before loading TBs of data.
  • Always provide correct SAS token or Managed Identity permissions for access.

🚀 When to Use COPY INTO

  • Bulk data ingestion from raw zones in your data lake
  • Loading files created by Synapse Pipelines or Azure Data Factory
  • Loading historical or batch data into a reporting database

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

Sign Up for Microsoft Fabric Free Trial Using a Personal Email | Microsoft Fabric Tutorial

Sign Up for Microsoft Fabric Free Trial Using a Personal Email

Microsoft Fabric Tutorial

🚀 What You’ll Learn

In this step-by-step tutorial, you’ll learn how to sign up for the Microsoft Fabric free trial using your personal Gmail, Outlook, or other email accounts.

  • Accessing Microsoft Fabric as an individual (non-corporate email)
  • Understanding licensing and trial limits
  • Creating a Fabric workspace
  • Verifying access to Power BI and other Fabric services

🎬 Watch the Video Tutorial

๐Ÿ“ Quick Steps to Sign Up

  1. Go to https://app.fabric.microsoft.com
  2. Click on "Try for free"
  3. Sign in with your personal email (Gmail, Outlook, Yahoo, etc.)
  4. Follow the prompts to complete setup
  5. Once in, create a new workspace and explore Fabric features like Lakehouse, Pipelines, and Notebooks

💡 Important Notes

  • This trial gives you access to Microsoft Fabric for 60 days
  • Some features may be limited based on region or account type
  • No credit card is required to start the trial

Blog created with the help of ChatGPT and Gemini.

How to Read data from Azure SQL Table and Write to JSON File in Blob Storage | PySpark Tutorial

How to Read Data from Azure SQL Table and Write to JSON File in Blob Storage | PySpark Tutorial

This PySpark tutorial demonstrates how to read data from an Azure SQL Table using JDBC and write the results to a JSON file in Azure Blob Storage using SAS token access.

1️⃣ Step 1: Set Up Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ReadSQL_WriteToBlob_JSON") \
    .getOrCreate()

2️⃣ Step 2: Define JDBC Connection

jdbcHostname = "yourserver.database.windows.net"
jdbcPort = 1433
jdbcDatabase = "yourdatabase"
jdbcUsername = "sqladmin"
jdbcPassword = "YourPassword123!"

jdbcUrl = f"jdbc:sqlserver://{jdbcHostname}:{jdbcPort};database={jdbcDatabase};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30"

connectionProperties = {
  "user": jdbcUsername,
  "password": jdbcPassword,
  "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

3️⃣ Step 3: Read Data from Azure SQL Table

df = spark.read.jdbc(
    url=jdbcUrl,
    table="your_table_name",
    properties=connectionProperties
)

df.show()

4️⃣ Step 4: Set SAS Token and Blob Storage Config

container_name = "yourcontainer"
storage_account_name = "yourstorageaccount"
sas_token = "<your_sas_token>"

spark.conf.set(
  f"fs.azure.sas.{container_name}.{storage_account_name}.blob.core.windows.net",
  sas_token
)

output_path = f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net/sql_to_json"

5️⃣ Step 5: Write to JSON Format in Blob Storage

df.coalesce(1).write.mode("overwrite").json(output_path)

print("✅ Data successfully written to Azure Blob Storage in JSON format.")

📺 Watch the Full Tutorial

How to Connect to Synapse Analytics Dedicated SQL Pool Using SSMS, Azure Data Studio, or VS Code | Azure Synapse Analytics Tutorial

How to Connect to Synapse Analytics Dedicated SQL Pool Using SSMS, Azure Data Studio, or VS Code

๐Ÿ” Overview

Azure Synapse Analytics offers a powerful Dedicated SQL Pool for high-performance data warehousing. To manage and query your dedicated SQL pool, you can use various tools like:

  • SQL Server Management Studio (SSMS)
  • Azure Data Studio
  • Visual Studio Code (VS Code)

These tools allow you to connect, write T-SQL queries, and manage your dedicated SQL pools efficiently outside of Synapse Studio.

🛠️ What You'll Need

  • Azure Synapse Workspace with a Dedicated SQL Pool
  • Fully Qualified Server Name (e.g., synapse-workspace.sql.azuresynapse.net)
  • SQL Admin Username and Password
  • SSMS, Azure Data Studio, or VS Code installed on your machine

🔗 How to Connect Using SSMS

  1. Open SQL Server Management Studio (SSMS).
  2. In the Connect to Server dialog:
    • Server type: Database Engine
    • Server name: yourworkspace.sql.azuresynapse.net
    • Authentication: SQL Server Authentication
    • Login: your SQL admin username
    • Password: your SQL password
  3. Click Connect.
  4. Select the dedicated SQL pool from the database dropdown to run your queries.

🔗 How to Connect Using Azure Data Studio

  1. Open Azure Data Studio.
  2. Click New Connection.
  3. Enter the connection info:
    • Server: yourworkspace.sql.azuresynapse.net
    • Database: (optional or specify your SQL pool name)
    • Authentication Type: SQL Login
    • Username/Password: Your SQL credentials
  4. Click Connect.

🔗 How to Connect Using VS Code

  1. Install the SQL Server (mssql) extension from the VS Code Marketplace.
  2. Open the Command Palette (Ctrl+Shift+P) and search for MS SQL: Connect.
  3. Enter the following details when prompted:
    • Server name: yourworkspace.sql.azuresynapse.net
    • Authentication Type: SQL Login
    • Username: your SQL admin user
    • Password: your password
  4. Select your Dedicated SQL Pool database after connection.

📌 Tips

  • Ensure your client IP address is added to the Synapse workspace’s firewall rules.
  • Use Active Directory authentication if your workspace is configured for Azure AD.
  • Always connect to the <workspace-name>.sql.azuresynapse.net endpoint for dedicated SQL pools.

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

What is Microsoft Fabric? | Microsoft Fabric Tutorial for Beginners to Advanced #microsoftfabric

What is Microsoft Fabric – Unified Analytics Platform

📘 What is Microsoft Fabric?

An Overview of the Unified Analytics Platform

๐Ÿ” Introduction

Microsoft Fabric is an all-in-one analytics platform that unifies data movement, data engineering, data science, real-time analytics, and business intelligence (BI) — all built on OneLake, a single, secure data lake for the entire organization.

It is the evolution of the modern data stack combining technologies from Power BI, Azure Synapse Analytics, and Azure Data Factory into a SaaS-based experience.

🧱 Core Components of Microsoft Fabric

Component           | Description
OneLake             | The unified data lake storage that underpins all Fabric workloads
Lakehouse           | Combines the scalability of a data lake with the usability of a data warehouse
Warehouse           | Traditional SQL data warehouse experience for BI workloads
Data Factory        | Visual data pipeline orchestration and transformation (ETL/ELT)
Notebooks           | Interactive development environment using Spark for data science and engineering
Real-Time Analytics | Analyze streaming data at low latency using KQL-based engines
Power BI            | Unified experience for data visualization and reporting

🎯 Key Benefits

  • Unified platform eliminates the need to stitch multiple tools
  • Built-in governance, security, and compliance
  • Deep integration with Microsoft 365 and Azure ecosystem
  • Easy collaboration across business and technical teams

🎬 Watch the Demo

Blog created with help from ChatGPT and Gemini.

How to Write DataFrame to Azure SQL Table Using PySpark | PySpark JDBC write() Function Tutorial

How to Write DataFrame to Azure SQL Table Using PySpark

In this tutorial, you'll learn how to save a PySpark DataFrame to an Azure SQL Database using the `write()` function and JDBC connector. We’ll cover all necessary steps including JDBC URL, credentials, and data formatting.

1️⃣ Step 1: Create Spark Session

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("WriteToAzureSQL").getOrCreate()

2️⃣ Step 2: Create Sample DataFrame

data = [
  (1, "Aamir Shahzad", "aamir@example.com", "USA"),
  (2, "Ali Raza", "ali@example.com", "Canada"),
  (3, "Bob", "bob@example.com", "UK"),
  (4, "Lisa", "lisa@example.com", "Germany")
]
columns = ["customer_id", "name", "email", "country"]

df = spark.createDataFrame(data, columns)
df.show()

3️⃣ Step 3: JDBC Configuration

jdbcHostname = "yourserver.database.windows.net"
jdbcPort = 1433
jdbcDatabase = "yourdatabase"
jdbcUsername = "sqladmin"
jdbcPassword = "YourPassword123!"

jdbcUrl = f"jdbc:sqlserver://{jdbcHostname}:{jdbcPort};database={jdbcDatabase};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30"

connectionProperties = {
  "user": jdbcUsername,
  "password": jdbcPassword,
  "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

4️⃣ Step 4: Write DataFrame to Azure SQL Table

df.write \
  .mode("append") \
  .jdbc(url=jdbcUrl, table="customer", properties=connectionProperties)

print("✅ Data successfully written to Azure SQL Database 'customer' table.")

📺 Watch the Full Tutorial

How to Write DataFrame to JSON File in Azure Blob Storage Using PySpark | PySpark Tutorial

How to Write DataFrame to JSON File in Azure Blob Storage | PySpark Tutorial

How to Write DataFrame to JSON File in Azure Blob Storage Using PySpark

This tutorial demonstrates how to use PySpark's write().json() function to export a DataFrame as a JSON file to Azure Blob Storage. You'll also see how to use coalesce(1) and secure your connection with a SAS token.

1️⃣ Step 1: Initialize Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WriteToJSONBlob").getOrCreate()

2️⃣ Step 2: Create a Sample DataFrame

data = [
  ("Aamir Shahzad", "Engineer", 35),
  ("Ali Raza", "Data Analyst", 28),
  ("Bob", "Manager", 40),
  ("Lisa", "Developer", 25)
]
columns = ["name", "designation", "age"]

df = spark.createDataFrame(data, schema=columns)
df.show()

3️⃣ Step 3: Configure Azure Blob Storage Access

# Replace with your actual values
spark.conf.set("fs.azure.sas.<container>.<storage_account>.blob.core.windows.net", "<sas_token>")

4️⃣ Step 4: Write DataFrame to JSON

output_path = "wasbs://<container>@<storage_account>.blob.core.windows.net/output-json"

df.coalesce(1).write \
  .mode("overwrite") \
  .json(output_path)

print("✅ Data written to Azure Blob Storage in JSON format.")

5️⃣ Step 5: Read Data Back from JSON

df_read = spark.read.json(output_path)
df_read.show()

print("✅ Data read back from Azure Blob Storage successfully.")

📺 Watch the Full Tutorial

How to Create Table in Dedicated SQL Pool with Distribution Keys | Azure Synapse Analytics Tutorial for Beginners and Advanced Users

How to Create Table in Dedicated SQL Pool with Distribution Keys | Azure Synapse Analytics

๐Ÿ” What is a Distribution Key?

In Azure Synapse Analytics' Dedicated SQL Pool, a distribution key is a column used to determine how table data is divided across 60 physical distributions (partitions). This enables parallel query execution using Massively Parallel Processing (MPP), improving performance for large datasets.

Choosing the correct distribution method and key helps reduce data movement and enhances performance, especially for joins and aggregations.

📦 Types of Distribution Methods

  • HASH: Rows are assigned based on the hash of a specified column. Best for large fact tables with predictable joins.
  • ROUND ROBIN: Rows are evenly and randomly spread across all distributions. Simple but not optimal for joins.
  • REPLICATE: Entire table is copied to each compute node. Best for small dimension tables used in frequent joins.

🛠️ Steps to Create a Table with Distribution Key

✅ Step 1: Open Synapse Studio

  • Go to your Synapse Workspace in Azure Portal.
  • Launch Synapse Studio from the workspace overview.

✅ Step 2: Create New SQL Script

  • In Synapse Studio, go to the Data tab.
  • Right-click your Dedicated SQL Pool and choose New SQL Script → Empty script.

✅ Step 3: Use CREATE TABLE Statement with Distribution

Use the following syntax to create a distributed table:

CREATE TABLE dbo.SalesFact
(
    SalesID INT NOT NULL,
    CustomerID INT,
    ProductID INT,
    Quantity INT,
    SalesAmount DECIMAL(10,2)
)
WITH
(
    DISTRIBUTION = HASH(CustomerID),
    CLUSTERED COLUMNSTORE INDEX
);

Explanation:

  • CustomerID is used as the distribution key to optimize joins with customer data.
  • CLUSTERED COLUMNSTORE INDEX is used for efficient compression and analytic performance.

📌 Tips for Choosing Distribution Key

  • Choose a column that is frequently used in joins and has high cardinality.
  • Avoid columns with many repeated values (low cardinality), as this leads to data skew (see the check below).
  • Use REPLICATE for small lookup tables and ROUND ROBIN for staging/temporary tables.
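
To verify how evenly rows landed across the 60 distributions, and which distribution policy a table uses, you can run a quick check like this after loading data:

-- Row and space usage per distribution; large differences indicate skew
DBCC PDW_SHOWSPACEUSED("dbo.SalesFact");

-- Distribution policy recorded for the table
SELECT t.name AS table_name, dp.distribution_policy_desc
FROM sys.tables t
JOIN sys.pdw_table_distribution_properties dp ON t.object_id = dp.object_id
WHERE t.name = 'SalesFact';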

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.