Views & Materialized Views in Azure Synapse Dedicated SQL Pool | Azure Synapse Analytics Tutorial

Views & Materialized Views in Azure Synapse Dedicated SQL Pool | Explained

📘 Introduction

Azure Synapse Analytics supports both Views and Materialized Views (similar to indexed views in SQL Server) in the Dedicated SQL Pool. While both help simplify complex queries, their behavior and performance implications differ significantly.

🔹 What is a View?

A View is a saved SQL SELECT query. It does not store data physically — every time you query the view, Synapse re-executes the underlying SQL logic. Views are useful for code reuse, abstraction, and simplifying joins or filters.

Example: Create a Simple View

CREATE VIEW dbo.vwTop100Sales1 AS
SELECT TOP (100)
    SaleID,
    ProductID,
    Quantity
FROM dbo.SalesFact_CSV
ORDER BY SaleDate DESC;

This view always returns the latest 100 sales records from the SalesFact_CSV table.

🔹 What is a Materialized View?

A Materialized View in Synapse stores the result set of a query physically and is automatically kept up to date by Synapse as the underlying data changes. It is useful for performance optimization — especially for repeated queries involving joins or aggregations.

Example: Create a Materialized View

CREATE MATERIALIZED VIEW dbo.mv_TotalQuantityByCustomer
WITH (
    DISTRIBUTION = HASH(CustomerID)
)
AS
SELECT
    CustomerID,
    COUNT_BIG(*) AS TotalOrders,
    SUM(ISNULL(Quantity, 0)) AS TotalQuantity
FROM dbo.SalesFact_CSV
GROUP BY CustomerID;

This view pre-aggregates the total quantity and order count by CustomerID and speeds up analytic queries significantly.

📌 Key Differences

  • Views are logical (no data storage).
  • Materialized Views are physical and improve performance by caching results.
  • Materialized Views are auto-refreshed in Synapse (no manual refresh needed).
  • Use Views for modularity and reuse; use Materialized Views for speed.

🧠 Best Practices

  • Use Views to encapsulate query logic for reuse across reports.
  • Use Materialized Views to accelerate queries on large datasets or frequent joins.
  • Choose appropriate DISTRIBUTION strategy (e.g., HASH for large aggregations).
  • Regularly monitor performance to determine when a view should be upgraded to a materialized view.

🔄 Demo Recap from SQL File

  • vwTop100Sales1 – Standard View showing top 100 latest sales.
  • mv_TotalQuantityByCustomer – Materialized View for preaggregated sales data by customer.
  • Materialized Views automatically reflect changes in the source data; no manual refresh is required.

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

Working with Structs and Nested Fields in PySpark | getField, getItem, withField, dropFields Explained | PySpark Tutorial

Working with Structs and Nested Fields in PySpark | getField, getItem, withField, dropFields

Working with Structs and Nested Fields in PySpark

In this PySpark tutorial, we’ll explore how to work with StructType and nested fields in DataFrames using powerful functions like getField(), getItem(), withField(), and dropFields().

Sample Data

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.appName("StructFieldsExample").getOrCreate()

data = [
  (1, {"first": "Aamir", "last": "Shahzad", "country": "Pakistan"}),
  (2, {"first": "Ali", "last": "Raza", "country": "USA"}),
  (3, {"first": "Lisa", "last": "Brown", "country": "UK"})
]

schema = "id INT, full_name STRUCT<first: STRING, last: STRING, country: STRING>"

df = spark.createDataFrame(data, schema=schema)
df.show(truncate=False)

getField() - Access specific nested field

df.select(
    col("full_name").getField("first").alias("first_name"),
    col("full_name").getField("last").alias("last_name"),
    col("full_name").getField("country").alias("country")
).show()

getItem() - Access fields like a dictionary

df.select(
    col("full_name").getItem("first").alias("first_name_item"),
    col("full_name").getItem("last").alias("last_name_item"),
    col("full_name").getItem("country").alias("country_item")
).show()

withField() - Add or update nested field

df_with_middle = df.withColumn(
    "full_name",
    col("full_name").withField("middle", lit("MiddleName."))
)

df_with_middle.select("id", "full_name").show(truncate=False)

dropFields() - Remove a nested field

df_dropped = df_with_middle.withColumn(
    "full_name",
    col("full_name").dropFields("last")
)

df_dropped.select("id", "full_name").show(truncate=False)
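
Tip: nested struct fields can also be read with plain dot notation inside select(); for structs this is equivalent to getField()/getItem(). A minimal sketch using the same df:

# Access struct fields with dot notation (equivalent to getField()/getItem())
df.select("id", "full_name.first", "full_name.country").show()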

🎥 Watch the Full Video Tutorial

Tags: PySpark, StructType, getField, getItem, withField, dropFields, nested fields in PySpark, access nested data, modify nested structure, PySpark DataFrame, PySpark tutorial, databricks tutorial, techbrothersit, data engineering

Hashtags: #PySpark #StructType #NestedFields #PySparkTutorial

Read, Clean, and Save Lakehouse to Delta Table Using Data Wrangler | Microsoft Fabric Tutorial for Beginners

Read, Clean, and Save Lakehouse to Delta Table Using Data Wrangler | Microsoft Fabric Tutorial

Read, Clean, and Save Lakehouse to Delta Table Using Data Wrangler

Microsoft Fabric Tutorial

📘 What is Data Wrangler in Microsoft Fabric?

Data Wrangler is a powerful UI-based tool in Microsoft Fabric that allows you to explore, clean, transform, and prepare your data visually before saving it to a Delta table. It is built with data engineers and analysts in mind, allowing no-code or low-code data shaping before analysis or modeling.

✅ What You'll Learn

  • How to launch Data Wrangler from your Lakehouse
  • Read raw files (CSV/Parquet/JSON) into a temporary table
  • Apply transformations such as filtering, renaming, changing data types, and cleaning nulls
  • Save the cleaned dataset as a managed Delta Table in the Lakehouse

🛠️ Step-by-Step Instructions

  1. Go to your Lakehouse workspace in Microsoft Fabric
  2. Navigate to the Files tab and locate a CSV or Parquet file
  3. Right-click the file and select Open in Data Wrangler
  4. Apply transformations using the UI (you’ll see Spark/Python code generated; a sketch of similar code follows after these steps)
  5. Click Export to Lakehouse and choose to save as a Delta Table
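
For reference, the code that Data Wrangler generates in step 4 typically resembles the following PySpark sketch. The file path, column names, and table name here are hypothetical, and the built-in spark session of a Fabric notebook is assumed:

from pyspark.sql.functions import col

# Read a raw CSV file from the Lakehouse Files area (hypothetical path)
df = spark.read.option("header", True).csv("Files/raw/sales.csv")

# Typical wrangling steps: drop rows with missing keys, rename a column, fix a data type
df_clean = (
    df.dropna(subset=["OrderID"])
      .withColumnRenamed("Cust_Name", "CustomerName")
      .withColumn("Amount", col("Amount").cast("double"))
)

# "Export to Lakehouse": save the cleaned result as a managed Delta table
df_clean.write.format("delta").mode("overwrite").saveAsTable("sales_clean")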

🎯 Best Practices

  • Preview your data to detect malformed rows before loading
  • Use column profiling to check distributions and nulls
  • Export frequently used clean datasets into Delta for better query performance
  • Rename columns and fix data types early in the wrangling process

🎬 Watch the Tutorial Video

Blog created with help from ChatGPT and Gemini.

Understanding IDENTITY Behavior in Azure Synapse SQL Pool | Non-Sequential IDs Explained | Azure Synapse Analytics Tutorial

Understanding IDENTITY Behavior in Azure Synapse SQL Pool | Non-Sequential IDs Explained

📘 What is a Surrogate Key?

A surrogate key is an artificial, system-generated unique identifier used to represent each row in a table. It is not derived from actual business data and is commonly used in data warehouse dimension tables to simplify joins with fact tables.

🔹 Using IDENTITY in Synapse SQL Pools

In Azure Synapse Dedicated SQL Pools, you can use the IDENTITY property to automatically generate surrogate keys. However, due to the distributed nature of Synapse, IDENTITY values are not guaranteed to be sequential.

✅ Create Table with IDENTITY

CREATE TABLE dbo.DimCustomer
(
    CustomerSK INT IDENTITY(1,1) NOT NULL,
    CustomerName NVARCHAR(100),
    Country NVARCHAR(50)
)
WITH (
    DISTRIBUTION = HASH(CustomerName),
    CLUSTERED COLUMNSTORE INDEX
);

🚀 Insert Sample Data

INSERT INTO dbo.DimCustomer (CustomerName, Country) VALUES ('Alice', 'USA');
INSERT INTO dbo.DimCustomer (CustomerName, Country) VALUES ('Bob', 'Canada');
INSERT INTO dbo.DimCustomer (CustomerName, Country) VALUES ('Charlie', 'UK');

🛠️ Insert Custom IDENTITY Value

Sometimes you might need to insert a specific surrogate key (e.g., -1 for "Unknown"). You can do that by temporarily enabling IDENTITY_INSERT:

SET IDENTITY_INSERT dbo.DimCustomer ON;
INSERT INTO dbo.DimCustomer (CustomerSK, CustomerName, Country)
VALUES (-1, 'Unknown', 'N/A');
SET IDENTITY_INSERT dbo.DimCustomer OFF;

🔎 Query Metadata About IDENTITY Column

SELECT sm.name AS schema_name,
       tb.name AS table_name,
       co.name AS column_name,
       CASE WHEN ic.column_id IS NOT NULL THEN 1 ELSE 0 END AS is_identity
FROM sys.schemas sm
JOIN sys.tables tb ON sm.schema_id = tb.schema_id
JOIN sys.columns co ON tb.object_id = co.object_id
LEFT JOIN sys.identity_columns ic ON co.object_id = ic.object_id AND co.column_id = ic.column_id
WHERE sm.name = 'dbo' AND tb.name = 'DimCustomer';

📏 Check Seed & Increment Values

SELECT ic.seed_value, ic.increment_value
FROM sys.identity_columns ic
JOIN sys.columns c ON ic.object_id = c.object_id AND ic.column_id = c.column_id
JOIN sys.tables t ON t.object_id = c.object_id
WHERE t.name = 'DimCustomer';

⚠️ IDENTITY Behavior Caveats

  • IDENTITY values may be skipped or allocated out of order due to parallelism.
  • Do not use IDENTITY on the distribution column.
  • CTAS and SELECT INTO do not preserve IDENTITY properties.
  • Use INSERT INTO ... SELECT when populating IDENTITY tables.

💡 Best Practices

  • Use IDENTITY columns for dimension surrogate keys.
  • Do not rely on IDENTITY values being sequential across rows.
  • For reporting or incremental loading, always use a MAX(ID) lookup strategy.

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

PySpark String Functions Explained | contains(), startswith(), substr(), endswith() with Examples | PySpark Tutorial

PySpark String Functions Explained | contains(), startswith(), substr(), endswith()

PySpark String Functions Explained

In this tutorial, you'll learn how to use PySpark string functions like contains(), startswith(), substr(), and endswith(). These functions are very helpful for filtering, searching, and extracting string data in PySpark DataFrames.

🔹 Sample Data

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("StringFunctionsExample").getOrCreate()

data = [
    (1, "Aamir"),
    (2, "Ali"),
    (3, "Bob"),
    (4, "Lisa"),
    (5, "Zara"),
    (6, "ALINA"),
    (7, "amrita"),
    (8, "Sana")
]
columns = ["id", "name"]
df = spark.createDataFrame(data, columns)
df.show()

🔹 contains() Function

Filter rows where name contains "a":

df.filter(col("name").contains("a")).show()

🔹 startswith() Function

Filter rows where name starts with "A":

df.filter(col("name").startswith("A")).show()

🔹 endswith() Function

Filter rows where name ends with "a":

df.filter(col("name").endswith("a")).show()

🔹 substr() Function

Extract first two characters from name:

df.withColumn("first_two", col("name").substr(1, 2)).show()

🎥 Watch the Full Tutorial

Full Notebook Interface Walkthrough with Hands-On Demo | Microsoft Fabric Tutorial for Beginners and Advanced Users / Data Engineers

Full Notebook Interface Walkthrough with Hands-On Demo | Microsoft Fabric Tutorial

🧠 Full Notebook Interface Walkthrough with Hands-On Demo

Microsoft Fabric Tutorial for Beginners and Data Engineers

📓 What Is a Notebook in Microsoft Fabric?

A notebook in Microsoft Fabric is an interactive development environment where you can run PySpark, Spark SQL, or R code, visualize results, and collaborate with your team. It includes advanced features like AI assistance, Git integration, scheduling, and Spark session management.

🏠 Home Tab

  • ▶️ Run All: Executes all cells in order
  • 💾 Save / Auto-Save: Keeps your work updated automatically
  • 🔗 Connect: Allows you to connect your notebook to a Spark runtime
  • 🧪 Kernel: Select or restart the notebook’s compute kernel

🧠 AI & Copilot Tab

  • Generate code with AI suggestions
  • Explain cell logic using natural language
  • Auto-generate visualizations and queries based on schema

🔢 Cell Operations Tab

  • Add code or markdown cells
  • Move cells up or down
  • Clear output or delete cell

📅 Schedule Tab

  • Convert notebook into scheduled pipeline activities
  • Trigger based on time or event
  • Integrate into Fabric Data Pipelines

📂 Data Tab

  • Browse Lakehouse files and tables directly within the notebook
  • Drag and drop tables to auto-generate read/load code

⚙️ Settings Tab

  • Select Spark pool and session timeout
  • Adjust language mode (PySpark, SQL, R)

🧪 Hands-On Demo Overview

In the video walkthrough below, we demonstrate how to:

  • Navigate and utilize each tab
  • Write and run code in code cells
  • Use markdown for documentation
  • Explore and query Lakehouse data
  • Leverage Spark SQL and visual output (a minimal cell example is sketched below)
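
For illustration, a typical code cell might contain something like the following minimal sketch; the table name sales_clean is hypothetical, and the spark session that Fabric notebooks provide automatically is assumed:

# Read a Lakehouse table into a DataFrame and preview a few rows
df = spark.read.table("sales_clean")
df.limit(10).show()

# The same table can be queried with Spark SQL from a notebook cell
spark.sql("SELECT COUNT(*) AS row_count FROM sales_clean").show()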

🎬 Watch the Full UI Tour & Demo

Blog created with help from ChatGPT and Gemini.

Install OneLake File Explorer to Access Fabric from Local Computer | Microsoft Fabric Tutorial for Beginners and Advanced Users

Install OneLake File Explorer to Access Fabric from Local Computer | Microsoft Fabric Tutorial

Install OneLake File Explorer to Access Fabric from Local Computer

Microsoft Fabric Tutorial

📘 Introduction

In this tutorial, you'll learn how to install and use OneLake File Explorer to access your Microsoft Fabric data (Lakehouse files) directly from your Windows file system.

OneLake File Explorer is a simple yet powerful tool that brings the data lake experience closer to local development by letting you browse, upload, and manage files stored in your Fabric Lakehouse.

🧭 Why Use OneLake File Explorer?

  • 🚀 Access Fabric files just like local folders
  • 📂 Drag-and-drop support for CSV, JSON, and Parquet
  • 🔒 Secure access using Microsoft Entra ID (Azure AD)
  • 🔄 Sync between your Lakehouse and desktop tools (VS Code, Excel, Power BI Desktop)

🛠️ How to Install OneLake File Explorer

  1. Go to the official Microsoft download page: Download OneLake File Explorer
  2. Download and run the installer
  3. Sign in with your Microsoft Fabric account (work or school)
  4. Once installed, OneLake will appear in your File Explorer under This PC

📂 Browse and Manage Lakehouse Data

After installation, you can:

  • Open OneLake in File Explorer
  • Navigate through your Fabric workspaces and Lakehouses
  • Drag and drop files into /Files or /Tables
  • Rename, move, or delete files like a normal folder

💡 Best Practices

  • Keep a folder structure like raw/, processed/, curated/
  • Ensure your files are in Parquet or Delta format for better performance
  • Use OneLake Explorer for quick uploads during development and testing

🎬 Watch the Video Tutorial

Blog created with help from ChatGPT and Gemini.

Indexing in Azure Synapse Dedicated SQL Pool Explained – Columnstore, Heap, Clustered & Rebuild Tips | Azure Synapse Analytics Tutorial

Indexing in Azure Synapse Dedicated SQL Pool Explained – Columnstore, Heap, Clustered & Rebuild Tips

📘 Introduction

Indexes in Azure Synapse Dedicated SQL Pool play a crucial role in optimizing query performance and storage. Synapse supports multiple types of indexes depending on your workload type—batch analytics, lookups, or staging data.

This blog covers:

  • Clustered Columnstore Index
  • Heap Tables
  • Clustered Index
  • Nonclustered Index
  • Index Rebuild Tips

🔹 1. Clustered Columnstore Index (Default & Best for Analytics)

Best suited for large, analytical workloads. This index compresses data and improves performance. It's the default if no index is specified.

CREATE TABLE dbo.SalesFact_Columnstore
(
    SaleID INT,
    CustomerID INT,
    Amount FLOAT,
    SaleDate DATE
)
WITH (
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH(CustomerID)
);

🔹 2. Heap Table (No Index)

No indexes are applied. Ideal for fast loading in staging or transient tables where performance tuning is not required yet.

CREATE TABLE dbo.SalesFact_Heap
(
    SaleID INT,
    CustomerID INT,
    Amount FLOAT,
    SaleDate DATE
)
WITH (
    HEAP,
    DISTRIBUTION = ROUND_ROBIN
);

🔹 3. Clustered Index (Rowstore)

Stores data in a sorted row-based format. Best for small dimension or lookup tables and selective filters.

CREATE TABLE dbo.Customer_Detail
(
    CustomerID INT NOT NULL,
    FullName NVARCHAR(100),
    Region NVARCHAR(50)
)
WITH (
    CLUSTERED INDEX (CustomerID),
    DISTRIBUTION = HASH(CustomerID)
);

🔹 4. Nonclustered Index

Creates a separate index on one or more columns for faster lookups. Use selectively as each index adds overhead.

CREATE INDEX idx_Region ON dbo.Customer_Detail(Region);

🔄 5. Rebuilding Indexes (Especially for Columnstore)

Rebuilding restores compression and performance. Ideal for maintenance schedules, especially after heavy loads or updates.

-- Rebuild all indexes on a table
ALTER INDEX ALL ON dbo.SalesFact_Columnstore REBUILD;

-- Rebuild a specific partition (if partitioned)
-- ALTER INDEX ALL ON dbo.SalesFact_Columnstore REBUILD PARTITION = 1;

-- Optional compression setting
-- ALTER INDEX ALL ON dbo.SalesFact_Columnstore 
--   REBUILD PARTITION = 1 WITH (DATA_COMPRESSION = COLUMNSTORE);

✅ Best Practices

  • Use CLUSTERED COLUMNSTORE for large fact tables and analytics.
  • Use HEAP for fast, staging loads or temporary tables.
  • Use CLUSTERED INDEX for small dimension or lookup tables.
  • Use NONCLUSTERED INDEX for tuning specific query filters.
  • Rebuild indexes regularly to maintain performance.

⚠️ Note: Index rebuilds are offline operations — schedule them during maintenance windows.

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

PySpark when() and otherwise() Explained | Apply If-Else Conditions to DataFrames #pysparktutorial

PySpark when() and otherwise() Explained | Apply If-Else Conditions to DataFrames

PySpark when() and otherwise() Explained

In this tutorial, you'll learn how to use the when() and otherwise() functions in PySpark to apply if-else style conditional logic directly to DataFrames. These functions are useful for transforming values in a column based on conditions.

🔹 Step 1: Create SparkSession & Sample Data

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.appName("WhenOtherwiseExample").getOrCreate()

data = [
    (1, "Aamir", 50000),
    (2, "Ali", None),
    (3, "Bob", 45000),
    (4, "Lisa", 60000),
    (5, "Zara", None),
    (6, "ALINA", 55000)
]
columns = ["id", "name", "salary"]
df = spark.createDataFrame(data, columns)
df.show()

🔹 Step 2: Apply when() and otherwise()

# Create a new column with conditional labels
df_with_label = df.withColumn(
    "salary_label",
    when(col("salary") >= 55000, "High")
    .otherwise("Low")
)

df_with_label.show()

🔹 Step 3: Apply Multiple Conditions

# Multiple when conditions
df_with_category = df.withColumn(
    "salary_category",
    when(col("salary") > 60000, "Very High")
    .when((col("salary") >= 50000) & (col("salary") <= 60000), "Medium")
    .when(col("salary") < 50000, "Low")
    .otherwise("Unknown")
)

df_with_category.show()

🎥 Watch Video Tutorial

Primary Key & Unique Constraints in Azure Synapse Dedicated SQL Pool | NOT ENFORCED Explained | Azure Synapse Analytics Tutorial

Primary Key & Unique Constraints in Azure Synapse Dedicated SQL Pool | NOT ENFORCED Explained

🔍 Introduction

In Azure Synapse Dedicated SQL Pools, constraints such as PRIMARY KEY and UNIQUE are allowed — but with a twist: they are NOT ENFORCED. This means they serve as metadata hints for the query optimizer rather than enforcing data integrity rules like in traditional SQL Server.

📌 Key Points

  • FOREIGN KEY constraints are not supported.
  • PRIMARY KEY and UNIQUE constraints must include NOT ENFORCED.
  • Duplicates are not prevented — you are responsible for ensuring data quality.
  • These constraints can help the query optimizer generate better plans.

🛠️ Demo Overview

Step 1: Create Demo Table

CREATE TABLE dbo.Products (
    ProductID INT NOT NULL,
    ProductName NVARCHAR(100),
    Category NVARCHAR(50)
)
WITH (DISTRIBUTION = ROUND_ROBIN);

Step 2: Insert Duplicate Values

INSERT INTO dbo.Products VALUES (1, 'Laptop', 'Electronics');
INSERT INTO dbo.Products VALUES (2, 'Tablet', 'Electronics');
INSERT INTO dbo.Products VALUES (1, 'Monitor', 'Electronics'); -- Duplicate ProductID
INSERT INTO dbo.Products VALUES (3, 'Desk', 'Furniture');

Step 3: Add UNIQUE Constraint (NOT ENFORCED)

ALTER TABLE dbo.Products
ADD CONSTRAINT UQ_ProductID UNIQUE (ProductID) NOT ENFORCED;

⚠️ The constraint will be created even though duplicates exist.

Step 4: Remove Duplicate Manually

DELETE FROM dbo.Products WHERE ProductName = 'Monitor';

Step 5: Replace UNIQUE with PRIMARY KEY (also NOT ENFORCED)

ALTER TABLE dbo.Products
DROP CONSTRAINT UQ_ProductID;

ALTER TABLE dbo.Products
ADD CONSTRAINT PK_ProductID PRIMARY KEY NONCLUSTERED (ProductID) NOT ENFORCED;

💡 Why Use These Constraints?

  • Helps Synapse generate optimized query execution plans.
  • Useful for BI/reporting layers that assume uniqueness.
  • Important for documentation and data governance.

🚫 Limitations

  • Constraints won't stop bad data — no runtime enforcement.
  • You must manually ensure data quality via queries or ELT logic.

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

PySpark Null & Comparison Functions: between(), isNull(), isin(), like(), rlike(), ilike() | PySpark Tutorial

PySpark Null & Comparison Functions Explained | between(), isNull(), isin(), like(), rlike(), ilike()

PySpark Null & Comparison Functions Explained

This PySpark tutorial explains how to use essential functions for handling nulls, filtering data, and performing pattern matching in DataFrames using:

  • between()
  • isNull() and isNotNull()
  • isin()
  • like(), rlike(), and ilike()

1. Create a Sample DataFrame

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("NullComparisonOps").getOrCreate()

data = [
    (1, "Aamir", 50000),
    (2, "Ali", None),
    (3, "Bob", 45000),
    (4, "Lisa", 60000),
    (5, "Zara", None),
    (6, "ALINA", 55000)
]

columns = ["id", "name", "salary"]
df = spark.createDataFrame(data, columns)
df.show()

2. Use between() Function

Select employees whose salary is between 45000 and 60000:

df.filter(col("salary").between(45000, 60000)).show()

3. Use isNull() and isNotNull()

Filter rows where salary is null:

df.filter(col("salary").isNull()).show()

Filter rows where salary is not null:

df.filter(col("salary").isNotNull()).show()

4. Use isin() Function

Filter names that are in the list ["Aamir", "Lisa"]:

df.filter(col("name").isin("Aamir", "Lisa")).show()

5. Use like(), rlike(), and ilike()

Names that start with 'A':

df.filter(col("name").like("A%")).show()

Names matching regex (e.g., all names ending in 'a'):

df.filter(col("name").rlike(".*a$")).show()

Case-insensitive LIKE (if using Spark 3.3+):

df.filter(col("name").ilike("ali%")).show()

📺 Watch Full Tutorial

CTAS vs SELECT INTO in Azure Synapse Analytics | Create Table in Dedicated SQL Pool | Azure Synapse Analytics Tutorial

CTAS vs SELECT INTO in Azure Synapse Analytics | Create Table in Dedicated SQL Pool

📘 Overview

In Azure Synapse Analytics, you have two common options for creating new tables based on the results of a query:

  • CTAS – Create Table As Select
  • SELECT INTO – Select data into a new table

While both approaches allow you to create a new table from existing data, there are important differences in features, performance, and support within Synapse Dedicated SQL Pools.

⚖️ Comparison Table: CTAS vs SELECT INTO

  • Supported in Dedicated SQL Pool – CTAS: ✅ Yes | SELECT INTO: ❌ No
  • Control Over Distribution Method – CTAS: ✅ Yes | SELECT INTO: ❌ No
  • Control Over Indexes – CTAS: ✅ Yes (e.g., columnstore index) | SELECT INTO: ❌ No
  • Used in Serverless SQL Pool – CTAS: ❌ No | SELECT INTO: ✅ Yes
  • Syntax Simplicity – CTAS: Moderate | SELECT INTO: Very Simple

✅ Example: CTAS in Dedicated SQL Pool

CREATE TABLE dbo.TotalSalesByProduct
WITH (
    DISTRIBUTION = HASH(ProductID),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT ProductID, SUM(Quantity) AS TotalQuantity
FROM dbo.SalesFact
GROUP BY ProductID;

This will create a distributed and indexed table optimized for analytics.

✅ Example: SELECT INTO in Serverless SQL Pool

SELECT ProductID, SUM(Quantity) AS TotalQuantity
INTO dbo.TotalSalesByProduct
FROM OPENROWSET(
    BULK 'https://storageaccount.dfs.core.windows.net/container/sales.parquet',
    FORMAT = 'PARQUET'
) AS sales
GROUP BY ProductID;

This method is easier and works well in exploratory scenarios in Serverless SQL Pools, but doesn't allow control over performance features.

📌 Key Takeaways

  • Use CTAS in Dedicated SQL Pool for production-grade performance.
  • Use SELECT INTO in Serverless SQL Pool for lightweight operations or temporary results.
  • CTAS gives you full control over distribution, storage format, and indexing.

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

Create Your First Lakehouse & Load CSV, Parquet, JSON Files | Microsoft Fabric Tutorial for Data Engineers

Create Your First Lakehouse & Load CSV, Parquet, JSON Files | Microsoft Fabric Tutorial

Create Your First Lakehouse & Load CSV, Parquet, JSON Files

Microsoft Fabric Tutorial

📘 What is a Lakehouse in Microsoft Fabric?

A Lakehouse in Microsoft Fabric combines the scalability and flexibility of a data lake with the structured querying power of a data warehouse. It stores files in Delta Lake format, enabling analytics directly over raw and structured data — using SQL, notebooks, or Power BI.

✅ What You'll Learn in This Tutorial

  • What a Lakehouse is and its role in Microsoft Fabric
  • Step-by-step process to create a new Lakehouse
  • How to upload and manage CSV, Parquet, and JSON files
  • How Microsoft Fabric unifies data lake and data warehouse capabilities
  • Practical tips to structure your Lakehouse for analytics workloads

🛠️ Step-by-Step: Creating Your First Lakehouse

  1. Log in to Microsoft Fabric Portal
  2. Go to your workspace, click + New, and select Lakehouse
  3. Give your Lakehouse a name and hit Create

Once created, you'll land in the Lakehouse explorer which allows you to manage files, tables, and notebooks.

📂 Upload CSV, Parquet, and JSON Files

Inside your Lakehouse, switch to the Files tab:

  • Click on Upload and select one or more files (CSV, Parquet, or JSON)
  • Uploaded files are stored in the /Files folder
  • You can preview and open these files in notebooks or convert them into managed Delta tables (a short code sketch of this conversion follows below)
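
As a rough illustration, converting an uploaded CSV into a managed Delta table from a notebook can look like the following sketch; the file name reuses the example from the tips below, the table name is hypothetical, and the notebook's built-in spark session is assumed:

# Read an uploaded CSV from the Lakehouse Files area
df = spark.read.option("header", True).csv("Files/sales_2025_Q2.csv")

# Save it as a managed Delta table for SQL and Power BI access
df.write.format("delta").mode("overwrite").saveAsTable("sales_2025_q2")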

📊 Unifying Data Lake and Warehouse

Microsoft Fabric allows you to treat your Lakehouse like a warehouse using DirectLake and SQL endpoints:

  • Run SQL queries on files/tables using SQL analytics endpoint
  • Use Power BI for visualizations without importing data
  • Query Delta tables using Spark notebooks

💡 Tips to Structure Your Lakehouse

  • Use folders like /raw, /processed, and /curated to stage data
  • Convert CSV and JSON into Delta tables for analytics
  • Tag or name files consistently: e.g., sales_2025_Q2.csv

🎬 Watch the Video Tutorial

Blog created with help from ChatGPT and Gemini.

PySpark cast() vs astype() Explained | Convert String to Int, Float & Double in DataFrame | PySpark Tutorial

PySpark cast() vs astype() Explained | Convert String to Int, Float & Double

PySpark cast() vs astype() Explained

In this tutorial, we'll explore how to convert PySpark DataFrame columns from one type to another using cast() and astype(). You'll learn how to convert string columns to integers, floats, and doubles in a clean and efficient way.

1. Sample DataFrame

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("CastExample").getOrCreate()

data = [
    ("1", "Aamir", "50000.5"),
    ("2", "Ali", "45000.0"),
    ("3", "Bob", None),
    ("4", "Lisa", "60000.75")
]

columns = ["id", "name", "salary"]
df = spark.createDataFrame(data, columns)
df.printSchema()
df.show()

2. Using cast() Function

Convert id to integer and salary to float:

df_casted = df.withColumn("id", col("id").cast("int")) \
              .withColumn("salary", col("salary").cast("float"))
df_casted.printSchema()
df_casted.show()

3. Using astype() Function

This is an alias for cast() and used in the same way:

df_astype = df_casted.withColumn("salary", col("salary").astype("double"))
df_astype.printSchema()
df_astype.show()

Output:

Original DataFrame (all columns as strings):
+---+-----+--------+
| id| name| salary |
+---+-----+--------+
| 1 |Aamir|50000.5 |
| 2 | Ali |45000.0 |
| 3 | Bob |  null  |
| 4 |Lisa |60000.75|
+---+-----+--------+

After cast():
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: float (nullable = true)

After astype():
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: double (nullable = true)

📺 Watch the Full Tutorial

How to Use CTAS (Create Table As Select) in Azure Synapse Analytics | Azure Synapse Analytics Tutorial

How to Use CTAS in Azure Synapse Analytics to Create New Tables

How to Use CTAS (Create Table As Select) in Azure Synapse Analytics

📘 What is CTAS?

CTAS stands for Create Table As Select. It’s a T-SQL statement used in Azure Synapse Dedicated SQL Pools to create a new table by selecting data from one or more existing tables. CTAS is highly efficient for:

  • Creating aggregated or filtered datasets
  • Staging intermediate results
  • Transforming or reshaping data

CTAS leverages the Massively Parallel Processing (MPP) engine of Synapse to create new distributed tables efficiently.

✅ Syntax

CREATE TABLE [schema].[new_table]
WITH (DISTRIBUTION = HASH(column) | ROUND_ROBIN | REPLICATE)
AS
SELECT ... FROM existing_table;

🛠️ Real Examples

🔹 Scenario 1: Create Table with Subset of Columns

CREATE TABLE dbo.DimCustomerName
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT customer_id, fname, lastname
FROM dbo.Customers_CSV;

🔹 Scenario 2: Create Table with Aggregated Data

CREATE TABLE dbo.TotalSalesByProduct
WITH (DISTRIBUTION = HASH(ProductID))
AS
SELECT ProductID, SUM(Quantity) AS TotalQuantitySold
FROM dbo.SalesFact_CSV
GROUP BY ProductID;

🔹 Scenario 3: Create Table with Filtered Data

CREATE TABLE dbo.HighValueCustomers
WITH (DISTRIBUTION = REPLICATE)
AS
SELECT customer_id, fname, lastname
FROM dbo.Customers_CSV
WHERE customer_id IN (1, 3);

🔹 Scenario 4: Rename Columns and Cast Data Types

CREATE TABLE dbo.ProductPricing
WITH (DISTRIBUTION = HASH(ProductID_New))
AS
SELECT ProductID AS ProductID_New,
       ProductName AS Name,
       Category,
       CAST(Price AS DECIMAL(12, 4)) AS Price_Adjusted
FROM dbo.ProductInfo_Parquet;

🔹 Scenario 5: Create Table from JOIN

CREATE TABLE dbo.CustomerSales
WITH (DISTRIBUTION = HASH(customer_id))
AS
SELECT c.customer_id, c.fname, c.lastname,
       s.SaleID, s.ProductID, s.Quantity, s.SaleDate
FROM dbo.Customers_CSV c
JOIN dbo.SalesFact_CSV s
  ON c.customer_id = s.CustomerID;

🔹 Scenario 6: Use a Different Schema

CREATE TABLE staging.StagedProducts
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT ProductID, ProductName, Category, Price
FROM dbo.ProductInfo_Parquet
WHERE Category = 'Electronics';

🔹 Scenario 7: Create Table with New Data Types

CREATE TABLE dbo.SalesFactBigIntID
WITH (DISTRIBUTION = HASH(ProductID),
      CLUSTERED COLUMNSTORE INDEX)
AS
SELECT CAST(SaleID AS BIGINT) AS SaleID_BigInt,
       ProductID, Quantity, SaleDate, CustomerID
FROM dbo.SalesFact_CSV;

📌 CTAS Best Practices

  • Always choose the right DISTRIBUTION method (HASH for large fact tables, REPLICATE for small dimensions).
  • Use CTAS for performance-optimized reporting layers.
  • Drop and recreate staging tables during ELT to minimize fragmentation.
  • Use appropriate data types to reduce storage size.

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

What Are Workspaces and How to Organize Fabric Items | Microsoft Fabric Tutorial for Data Engineers

What Are Workspaces and How to Organize Fabric Items | Microsoft Fabric Tutorial

🧱 What Are Workspaces and How to Organize Fabric Items

Microsoft Fabric Tutorial

📘 Introduction

In Microsoft Fabric, a Workspace is a container for organizing and managing your analytics items — such as Lakehouses, Warehouses, Notebooks, Reports, Pipelines, and more.

Workspaces help you group related data assets by project, department, or use case — making collaboration, security, and governance easier.

This tutorial covers:

  • What a workspace is
  • Key workspace features
  • How to organize items inside
  • Real-world use cases

🧾 What Is a Workspace?

A workspace is like a folder or project space in Microsoft Fabric.

You can think of it as your analytics sandbox where you build, store, and manage related data assets.

Each workspace includes:

  • 🔸 Access control (who can view/edit content)
  • 🔸 Linked capacity (defines compute power)
  • 🔸 Items like Lakehouses, Warehouses, Dashboards, Pipelines, Notebooks, etc.

📦 Types of Items You Can Organize

  • 🏞️ Lakehouses – Delta-based storage for structured + unstructured data
  • 🏢 Warehouses – Traditional SQL interface on top of the same OneLake
  • 📓 Notebooks – Interactive development with PySpark
  • 🔁 Data Pipelines – ETL orchestration using visual tools
  • 📊 Power BI Reports – For visualizations and analytics

🗂️ Organizing Strategy Tips

  • Create a separate workspace per department, project, or data domain
  • Use consistent naming conventions (e.g., Sales_Lakehouse_US, HR_Warehouse_EU)
  • Use folders inside workspaces to group Notebooks, Pipelines, and Files
  • Assign different roles (Admin, Member, Viewer) to control access securely

🎯 Real-World Example

In a large organization, you may have:

  • Marketing Workspace – Holds campaign data, dashboards, and reports
  • Sales Workspace – Includes Lakehouses for transactions, Warehouses for aggregations
  • IT Ops Workspace – Contains pipelines for ingestion, notebooks for automation

🎬 Watch the Video Tutorial

Blog created with help from ChatGPT and Gemini.

How to Drop All Tables from Lakehouse in Microsoft Fabric | Microsoft Fabric Tutorial for Beginners

How to Drop All Tables from Lakehouse in Microsoft Fabric

Microsoft Fabric Tutorial for Beginners

📘 Overview

If you're working with a Microsoft Fabric Lakehouse and want to clean up your environment, dropping all existing tables can help you reset your workspace. This tutorial shows how to do that programmatically using PySpark and the spark.catalog.

🔍 Step 1: List All Tables in the Lakehouse

Use the Spark catalog API to fetch all tables in the current Lakehouse:

# List all tables
tables = spark.catalog.listTables()

This returns metadata such as table name, database, and type (managed or external).

🔁 Step 2: Loop Through and Drop Each Table

Loop through each entry and run a DROP TABLE command:

for table in tables:
    table_name = table.name
    spark.sql(f"DROP TABLE IF EXISTS {table_name}")

This ensures that each table is removed cleanly from the catalog.

💡 Notes

  • This approach works only on tables in the current Lakehouse context.
  • Always double-check before running this in production — this operation is irreversible!
  • Works well during automated clean-up workflows or notebook resets.

🎬 Watch the Demo Video

Blog created with help from ChatGPT and Gemini.

PySpark Sorting Explained | ASC vs DESC | Handling NULLs with asc_nulls_first & desc_nulls_last | PySpark Tutorial

PySpark Sorting Explained: ASC vs DESC | Handling NULLs with asc_nulls_first & desc_nulls_last

In this tutorial, you’ll learn how to sort DataFrames in PySpark using ascending and descending order, while controlling NULL value placement using asc_nulls_first() and desc_nulls_last(). Real examples are provided for clarity.

1️⃣ Create Spark Session

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("SortingExamples").getOrCreate()

2️⃣ Create Sample Data

data = [
    (1, "Aamir", 50000),
    (2, "Ali", None),
    (3, "Bob", 45000),
    (4, "Lisa", 60000),
    (5, "Zara", None)
]
columns = ["id", "name", "salary"]
df = spark.createDataFrame(data, columns)
df.show()

3️⃣ Sort by ASC (default)

df.orderBy(col("salary").asc()).show()

4️⃣ Sort by ASC with NULLs First

df.orderBy(col("salary").asc_nulls_first()).show()

5️⃣ Sort by ASC with NULLs Last

df.orderBy(col("salary").asc_nulls_last()).show()

6️⃣ Sort by DESC

df.orderBy(col("salary").desc()).show()

7️⃣ DESC with NULLs First

df.orderBy(col("salary").desc_nulls_first()).show()

8️⃣ DESC with NULLs Last

df.orderBy(col("salary").desc_nulls_last()).show()

📺 Watch the Full Tutorial

Create External Tables in Azure Synapse Dedicated Pool for CSV & Parquet Files | Azure Synapse Analytics Tutorial

Create External Tables in Azure Synapse Dedicated Pool for CSV & Parquet Files

📘 What is an External Table?

An External Table in Azure Synapse Dedicated SQL Pool allows you to query data stored outside the database — typically in Azure Data Lake Storage Gen2 — without importing it into the SQL pool itself. This enables cost-effective querying of large files like .csv or .parquet formats directly using T-SQL.

External tables are used in scenarios like raw zone exploration, staging, or ELT pipelines where data exists in a data lake but needs to be accessed for transformations or analytics.

📁 File Formats Supported

  • CSV: Commonly used for flat files
  • Parquet: Optimized, columnar format best for analytics

🔧 Pre-requisites

  • Azure Synapse workspace with a Dedicated SQL Pool
  • Data files (CSV/Parquet) stored in ADLS Gen2
  • Database scoped credential to access the storage account
  • External data source and file format definitions

🛠️ Step-by-Step Example for CSV

1. Create Database Scoped Credential

CREATE DATABASE SCOPED CREDENTIAL myadls_cred
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'sv=2024...';

2. Create External Data Source

CREATE EXTERNAL DATA SOURCE my_adls
WITH (
    LOCATION = 'abfss://datalake@yourstorage.dfs.core.windows.net/',
    CREDENTIAL = myadls_cred
);

3. Create External File Format for CSV

CREATE EXTERNAL FILE FORMAT csv_format
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',
        FIRST_ROW = 2
    )
);

4. Create External Table

CREATE EXTERNAL TABLE dbo.ExternalSalesCSV
(
    SaleID INT,
    ProductName VARCHAR(100),
    Quantity INT,
    Price MONEY
)
WITH (
    LOCATION = '/sales/csv/',
    DATA_SOURCE = my_adls,
    FILE_FORMAT = csv_format,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);

🛠️ Step-by-Step Example for Parquet

1. Create External File Format for Parquet

CREATE EXTERNAL FILE FORMAT parquet_format
WITH (
    FORMAT_TYPE = PARQUET
);

2. Create External Table

CREATE EXTERNAL TABLE dbo.ExternalProductsParquet
(
    ProductID INT,
    ProductName VARCHAR(100),
    Category VARCHAR(50),
    Price FLOAT
)
WITH (
    LOCATION = '/products/parquet/',
    DATA_SOURCE = my_adls,
    FILE_FORMAT = parquet_format
);

📌 Best Practices

  • Use Parquet format for large analytical queries to benefit from columnar compression.
  • Keep external table definitions consistent with the structure of the underlying file.
  • Use views to abstract external table logic for business users.
  • Use the REJECT_TYPE clause to handle malformed rows gracefully.

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

What Is OneLake? Unified Storage Layer Explained | Microsoft Fabric Tutorial #microsoftfabric

What Is OneLake? Unified Storage Layer Explained | Microsoft Fabric Tutorial

💧 What Is OneLake? Unified Storage Layer Explained

Microsoft Fabric Tutorial

📘 What is OneLake?

OneLake is Microsoft Fabric’s unified, enterprise-grade data lake that acts as the single storage layer for all data workloads in Fabric — including Lakehouse, Warehouse, Data Science, Real-Time Analytics, and Power BI.

It is built on top of Azure Data Lake Storage Gen2 (ADLS Gen2) but managed as part of the Fabric SaaS experience.

✅ Key Benefits of OneLake

  • 🧩 One Copy of Data: Store once, access across all workloads (SQL, Spark, Power BI, Dataflow)
  • 🔐 Centralized Governance: Integrated with Microsoft Purview and Entra ID
  • DirectLake Performance: Power BI can query Delta tables directly without import or refresh
  • 🌍 Global Namespace: All data is accessed under OneLake://<workspace>/<item>
  • 🗃️ Delta Format First: Built-in support for Delta Lake format (transactional + analytics)

🌐 How OneLake Integrates Across Fabric

OneLake acts as the foundation for:

  • Lakehouse: Built on top of OneLake using Apache Spark and Delta tables
  • Warehouse: Traditional SQL interface using the same underlying storage
  • Power BI DirectLake: Connects Power BI reports directly to OneLake-based Delta tables without needing refresh or dataset import
  • Data Science: Notebooks and ML workloads directly use OneLake-backed data

🎬 Watch the Video Tutorial

Blog created with help from ChatGPT and Gemini.

How to Read Multiple CSV Files from Blob Storage and Write to Azure SQL Table with Filenames | PySpark Tutorial

How to Read Multiple CSV Files from Blob Storage and Write to Azure SQL Table with Filenames

In this PySpark tutorial, you’ll learn how to load multiple CSV files from Azure Blob Storage, extract the filename from each file, and write the combined data to an Azure SQL table for tracking and processing.

1️⃣ Step 1: Setup Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ReadCSV_WriteSQL_WithFilenames") \
    .getOrCreate()

2️⃣ Step 2: Set Azure Blob Storage Configuration

spark.conf.set(
    "fs.azure.sas.<container>.<storage_account>.blob.core.windows.net",
    "<your_sas_token>"
)

path = "wasbs://<container>@<storage_account>.blob.core.windows.net/input-folder/*.csv"

3️⃣ Step 3: Read All CSV Files and Add Filename Column

from pyspark.sql.functions import input_file_name, regexp_extract

df = spark.read.option("header", True).csv(path)
df = df.withColumn("filename", regexp_extract(input_file_name(), r"([^/]+$)", 1))
df.show()

4️⃣ Step 4: JDBC Configuration for Azure SQL

jdbcUrl = "jdbc:sqlserver://<your_server>.database.windows.net:1433;database=<your_db>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30"

connectionProperties = {
    "user": "<your_username>",
    "password": "<your_password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

5️⃣ Step 5: Write DataFrame to Azure SQL Table

df.write \
    .mode("append") \
    .jdbc(url=jdbcUrl, table="your_table_name", properties=connectionProperties)

print("✅ All CSV files uploaded to Azure SQL with filename column.")

📺 Watch the Full Tutorial

How to Use COPY INTO or Bulk Load CSV/Parquet Files into Dedicated SQL Pool | Azure Synapse Analytics Tutorial

How to Use COPY INTO or Bulk Load CSV/Parquet Files into Dedicated SQL Pool | Azure Synapse

📦 What is COPY INTO in Azure Synapse?

COPY INTO is a high-performance T-SQL command used in Azure Synapse Analytics to bulk load data from external files (CSV or Parquet) in Azure Data Lake Storage Gen2 into a table in a Dedicated SQL Pool.

This method is optimized for loading large datasets efficiently and supports common file formats like .csv and .parquet, using parallel processing under the hood.

🗂️ Supported File Formats

  • CSV — with or without headers
  • Parquet — columnar storage format for analytics

🛠️ Syntax of COPY INTO

COPY INTO target_table
FROM 'https://<storage_account>.dfs.core.windows.net/<container>/<path>/'
WITH (
    FILE_TYPE = 'CSV', -- or 'PARQUET'
    CREDENTIAL = (IDENTITY= 'Shared Access Signature', SECRET='<sas_token>'),
    FIELDTERMINATOR = ',',  -- CSV specific
    ROWTERMINATOR = '0x0A', -- CSV specific
    FIRSTROW = 2            -- skip header if present
);

✅ Example: Load CSV File into Sales Table

COPY INTO dbo.Sales
FROM 'https://techstorage.dfs.core.windows.net/raw/sales.csv'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (IDENTITY= 'Shared Access Signature', SECRET='sv=2024...'),
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '0x0A',
    FIRSTROW = 2
);

✅ Example: Load Parquet File into Product Table

COPY INTO dbo.Product
FROM 'https://techstorage.dfs.core.windows.net/raw/productdata/'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY= 'Shared Access Signature', SECRET='sv=2024...')
);

📌 Tips for Using COPY INTO

  • Make sure your Dedicated SQL Pool is resumed and ready to load.
  • Use Azure Data Lake Storage Gen2 for optimal performance.
  • Test COPY INTO with a small file before loading TBs of data.
  • Always provide correct SAS token or Managed Identity permissions for access.

🚀 When to Use COPY INTO

  • Bulk data ingestion from raw zones in your data lake
  • Loading files created by Synapse Pipelines or Azure Data Factory
  • Loading historical or batch data into a reporting database

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

Sign Up for Microsoft Fabric Free Trial Using a Personal Email | Microsoft Fabric Tutorial

Sign Up for Microsoft Fabric Free Trial Using a Personal Email

Microsoft Fabric Tutorial

🚀 What You’ll Learn

In this step-by-step tutorial, you’ll learn how to sign up for the Microsoft Fabric free trial using your personal Gmail, Outlook, or other email accounts.

  • Accessing Microsoft Fabric as an individual (non-corporate email)
  • Understanding licensing and trial limits
  • Creating a Fabric workspace
  • Verifying access to Power BI and other Fabric services

🎬 Watch the Video Tutorial

📝 Quick Steps to Sign Up

  1. Go to https://app.fabric.microsoft.com
  2. Click on "Try for free"
  3. Sign in with your personal email (Gmail, Outlook, Yahoo, etc.)
  4. Follow the prompts to complete setup
  5. Once in, create a new workspace and explore Fabric features like Lakehouse, Pipelines, and Notebooks

💡 Important Notes

  • This trial gives you access to Microsoft Fabric for 60 days
  • Some features may be limited based on region or account type
  • No credit card is required to start the trial

Blog created with the help of ChatGPT and Gemini.

How to Read data from Azure SQL Table and Write to JSON File in Blob Storage | PySpark Tutorial

How to Read Data from Azure SQL Table and Write to JSON File in Blob Storage | PySpark Tutorial

This PySpark tutorial demonstrates how to read data from an Azure SQL Table using JDBC and write the results to a JSON file in Azure Blob Storage using SAS token access.

1️⃣ Step 1: Set Up Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ReadSQL_WriteToBlob_JSON") \
    .getOrCreate()

2️⃣ Step 2: Define JDBC Connection

jdbcHostname = "yourserver.database.windows.net"
jdbcPort = 1433
jdbcDatabase = "yourdatabase"
jdbcUsername = "sqladmin"
jdbcPassword = "YourPassword123!"

jdbcUrl = f"jdbc:sqlserver://{jdbcHostname}:{jdbcPort};database={jdbcDatabase};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30"

connectionProperties = {
  "user": jdbcUsername,
  "password": jdbcPassword,
  "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

3️⃣ Step 3: Read Data from Azure SQL Table

df = spark.read.jdbc(
    url=jdbcUrl,
    table="your_table_name",
    properties=connectionProperties
)

df.show()

4️⃣ Step 4: Set SAS Token and Blob Storage Config

container_name = "yourcontainer"
storage_account_name = "yourstorageaccount"
sas_token = "<your_sas_token>"

spark.conf.set(
  f"fs.azure.sas.{container_name}.{storage_account_name}.blob.core.windows.net",
  sas_token
)

output_path = f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net/sql_to_json"

5️⃣ Step 5: Write to JSON Format in Blob Storage

df.coalesce(1).write.mode("overwrite").json(output_path)

print("✅ Data successfully written to Azure Blob Storage in JSON format.")

📺 Watch the Full Tutorial

How to Connect to Synapse Analytics Dedicated SQL Pool Using SSMS, Azure Data Studio, or VS Code | Azure Synapse Analytics Tutorial

How to Connect to Synapse Analytics Dedicated SQL Pool Using SSMS, Azure Data Studio, or VS Code

🔍 Overview

Azure Synapse Analytics offers a powerful Dedicated SQL Pool for high-performance data warehousing. To manage and query your dedicated SQL pool, you can use various tools like:

  • SQL Server Management Studio (SSMS)
  • Azure Data Studio
  • Visual Studio Code (VS Code)

These tools allow you to connect, write T-SQL queries, and manage your dedicated SQL pools efficiently outside of Synapse Studio.

🛠️ What You'll Need

  • Azure Synapse Workspace with a Dedicated SQL Pool
  • Fully Qualified Server Name (e.g., synapse-workspace.sql.azuresynapse.net)
  • SQL Admin Username and Password
  • SSMS, Azure Data Studio, or VS Code installed on your machine

🔗 How to Connect Using SSMS

  1. Open SQL Server Management Studio (SSMS).
  2. In the Connect to Server dialog:
    • Server type: Database Engine
    • Server name: yourworkspace.sql.azuresynapse.net
    • Authentication: SQL Server Authentication
    • Login: your SQL admin username
    • Password: your SQL password
  3. Click Connect.
  4. Select the dedicated SQL pool from the database dropdown to run your queries.

🔗 How to Connect Using Azure Data Studio

  1. Open Azure Data Studio.
  2. Click New Connection.
  3. Enter the connection info:
    • Server: yourworkspace.sql.azuresynapse.net
    • Database: (optional or specify your SQL pool name)
    • Authentication Type: SQL Login
    • Username/Password: Your SQL credentials
  4. Click Connect.

🔗 How to Connect Using VS Code

  1. Install the SQL Server (mssql) extension from the VS Code Marketplace.
  2. Open the Command Palette (Ctrl+Shift+P) and search for MS SQL: Connect.
  3. Enter the following details when prompted:
    • Server name: yourworkspace.sql.azuresynapse.net
    • Authentication Type: SQL Login
    • Username: your SQL admin user
    • Password: your password
  4. Select your Dedicated SQL Pool database after connection.

📌 Tips

  • Ensure your client IP address is added to the Synapse workspace’s firewall rules.
  • Use Active Directory authentication if your workspace is configured for Azure AD.
  • Always connect to the workspace's .sql.azuresynapse.net endpoint (e.g., yourworkspace.sql.azuresynapse.net) for dedicated SQL pools.

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

What is Microsoft Fabric? | Microsoft Fabric Tutorial for Beginners to Advanced #microsoftfabric

What is Microsoft Fabric – Unified Analytics Platform

📘 What is Microsoft Fabric?

An Overview of the Unified Analytics Platform

🔍 Introduction

Microsoft Fabric is an all-in-one analytics platform that unifies data movement, data engineering, data science, real-time analytics, and business intelligence (BI) — all built on OneLake, a single, secure data lake for the entire organization.

It is the evolution of the modern data stack, combining technologies from Power BI, Azure Synapse Analytics, and Azure Data Factory into a single SaaS-based experience.

🧱 Core Components of Microsoft Fabric

  • OneLake – The unified data lake storage that underpins all Fabric workloads
  • Lakehouse – Combines the scalability of a data lake with the usability of a data warehouse
  • Warehouse – Traditional SQL data warehouse experience for BI workloads
  • Data Factory – Visual data pipeline orchestration and transformation (ETL/ELT)
  • Notebooks – Interactive development environment using Spark for data science and engineering
  • Real-Time Analytics – Analyze streaming data at low latency using KQL-based engines
  • Power BI – Unified experience for data visualization and reporting

🎯 Key Benefits

  • Unified platform eliminates the need to stitch multiple tools
  • Built-in governance, security, and compliance
  • Deep integration with Microsoft 365 and Azure ecosystem
  • Easy collaboration across business and technical teams

🎬 Watch the Demo

Blog created with help from ChatGPT and Gemini.