T-SQL Notebooks in Fabric – Run Queries, Visualize Data & Save Views | Microsoft Fabric Tutorial

Microsoft Fabric now supports T-SQL Notebooks — a powerful, interactive workspace for data engineers, SQL developers, and analysts to write, execute, and visualize T-SQL code with documentation and rich outputs.

Key Features of T-SQL Notebooks

  • Write and run T-SQL code directly in notebook cells
  • Connect to Warehouses or SQL analytics endpoints
  • Visualize results using built-in charts and column profiling
  • Save SQL queries as Views
  • Supports cross-warehouse querying with 3-part naming
  • Schedule notebooks and use them in pipelines

Step 1: Add Warehouse as Data Source

  • Click + Data sources in the notebook toolbar
  • Select your desired Warehouse or SQL analytics endpoint
  • Set one as the Primary source

Example Queries

-- Query top records from a table
SELECT TOP 5 * FROM TechDWH.dbo.Customers;

Code Templates

Right-click a table in Object Explorer to generate SQL templates:

  • SELECT Top 100
  • Create Table
  • Drop Table
  • CTAS (Create Table As Select)

Save Query as View

Write your query and highlight it, then click “Save as View” in the toolbar:

SELECT * FROM TechDWH.dbo.Orders
WHERE OrderAmount > 1000;

Cross-Warehouse Queries

SELECT * FROM SalesDW.dbo.SalesCustomer;

Use three-part naming convention: Database.Schema.Table
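
For instance, a cross-warehouse join might look like the following sketch (the warehouse, table, and column names are illustrative and assume both warehouses are attached as data sources):

-- Join a table from the primary warehouse with a table from another warehouse
SELECT c.CustomerID, c.FirstName, s.SalesAmount
FROM TechDWH.dbo.Customers AS c
JOIN SalesDW.dbo.SalesCustomer AS s
  ON c.CustomerID = s.CustomerID;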

Visualizing Query Results

  • After query execution, go to the Inspect tab
  • Explore data distribution and column stats
  • Switch to charts (bar, pie, etc.) for quick insights

SELECT TOP 100 CustomerID, FirstName, LastName, Country FROM TechDWH.dbo.Customers;

Watch the Full Tutorial

Blog post written with the help of ChatGPT.

How to Read Lake Database Tables & Views in Azure Synapse Using PySpark | Azure Synapse Tutorial

Overview

In Azure Synapse Analytics, you can use PySpark to read tables and views from a Lake Database. This allows you to harness the power of Apache Spark for analytics on curated datasets in your Data Lake.

✅ Why Read Lake Database in PySpark?

  • Supports reading from Delta and Parquet formats
  • Leverages Spark for distributed processing
  • Easy integration with Synapse Notebooks

Step-by-Step Guide

Step 1: Read a Table from Lake Database

df = spark.read.table("LakeDBName.TableName")
df.show()

Replace LakeDBName and TableName with your actual Lake Database and table names.

Step 2: Read a View from Lake Database

view_df = spark.read.table("LakeDBName.ViewName")
display(view_df)

Step 3: Register as Temp View for SQL Queries

df.createOrReplaceTempView("temp_table")
spark.sql("SELECT * FROM temp_table WHERE column = 'value'").show()

Sample Use Case

  • Read curated sales data from Lake Database
  • Register as temp view
  • Join with another dataset using Spark SQL
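
A minimal sketch of that flow (the Lake Database, table, and column names are illustrative):

# Read curated sales data and a product dataset from the Lake Database
sales_df = spark.read.table("LakeDBName.Sales")
products_df = spark.read.table("LakeDBName.Products")

# Register both as temp views and join them with Spark SQL
sales_df.createOrReplaceTempView("sales")
products_df.createOrReplaceTempView("products")

joined_df = spark.sql("""
    SELECT s.SaleID, s.ProductID, p.ProductName, s.Amount
    FROM sales s
    JOIN products p ON s.ProductID = p.ProductID
""")
joined_df.show()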

Notes

  • Ensure the Spark pool has access to the Lake Database metadata
  • Use display() in notebooks to visualize the output

Watch the Full Video Tutorial

Credit: Content created with the help of ChatGPT and Gemini.

PySpark Tutorial: Extract Data with regexp_extract() in PySpark | Regex Patterns Made Easy #pyspark

The regexp_extract() function allows you to use regular expressions to extract substrings from string columns in PySpark. This is extremely useful when working with emails, logs, or structured patterns like phone numbers or dates.

Sample Data

data = [
  ("user1@example.com", "[INFO] Login failed at 2024-04-15 10:23:45", "(123) 456-7890"),
  ("john.doe@mail.org", "[ERROR] Disk full at 2024-04-15 12:00:00", "125-456-7890"),
  ("alice@company.net", "[WARN] High memory at 2024-04-15 14:30:10", "123.456.7890")
]

columns = ["email", "log", "phone"]
df = spark.createDataFrame(data, columns)
df.show()

Extract Domain from Email

from pyspark.sql.functions import regexp_extract, col

df = df.withColumn("domain", regexp_extract(col("email"), "@(.*)", 1))
df.select("email", "domain").show(truncate=False)

Output: Extracts everything after @ in the email.

Extract Log Level

df = df.withColumn("log_level", regexp_extract(col("log"), "\\[(.*?)\\]", 1))
df.select("log", "log_level").show(truncate=False)

Output: Extracts the text inside square brackets (e.g., INFO, ERROR, WARN).

Extract Date from Log

pattern_date = "(\\d{4}-\\d{2}-\\d{2})"
df = df.withColumn("log_date", regexp_extract(col("log"), pattern_date, 1))
df.select("log", "log_date").show(truncate=False)

Output: Captures date in the format YYYY-MM-DD from the log string.

Extract Area Code from Phone Number

df = df.withColumn("area_code", regexp_extract(col("phone"), "(\\d{3})", 1))
df.select("phone", "area_code").show(truncate=False)

Output: Captures the first 3 digits (area code) from the phone number.
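
The same pattern-based approach extends to other fields; for example, a sketch that pulls the time portion out of the log line (the pattern is illustrative):

pattern_time = "(\\d{2}:\\d{2}:\\d{2})"
df = df.withColumn("log_time", regexp_extract(col("log"), pattern_time, 1))
df.select("log", "log_time").show(truncate=False)

Output: Captures the time in HH:MM:SS format from the log string.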

Watch the Full Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

What Are Views in Lake Databases | Azure Synapse Analytics Tutorial

Overview

In Azure Synapse Analytics, views in Lake Databases allow users to define logical query layers on top of data without duplicating it. These views are stored in the metastore and can be queried using both Spark and Serverless SQL Pools, enabling seamless analytics across structured and semi-structured data.

What Is a View in Lake Database?

A view in a Lake Database is a virtual table created by saving a SQL query. It does not store data physically but references data from existing Delta or Parquet tables.

How to Create a View in Lake Database

✅ Step 1: Use Spark SQL to Create a View

%%spark
CREATE OR REPLACE VIEW lake.vw_customer_summary AS
SELECT country, COUNT(*) AS customer_count
FROM lake.customer
GROUP BY country;

✅ Step 2: Query the View

%%sql
SELECT * FROM lake.vw_customer_summary;

Benefits of Views

  • No data duplication — views reference base tables
  • Reusable logic — great for summarizing and reporting
  • Accessible from both Spark and Serverless SQL Pools
  • Can be used to hide complexity or standardize queries

Use Cases

  • Creating data marts from Lakehouse tables
  • Power BI models referencing summarized data
  • Abstracting joins, filters, and calculations

⚠️ Limitations

  • Views in Lake Databases are read-only
  • Cannot include procedural logic or parameters
  • Underlying tables must be Delta/Parquet and accessible to Spark and Serverless

Watch the Video Tutorial

Credit: Content created with the help of ChatGPT and Gemini.

PySpark Tutorial: Using find_in_set() in PySpark | Search String Position in a Delimited List

In this tutorial, you'll learn how to use the find_in_set() function in PySpark to search for a specific string inside a comma-separated list. The function returns the position (1-based index) of the string if found, or 0 if not found.

Sample Data

data = [
    ("apple", "apple,banana,grape"),
    ("car", "She went to the park. Then she read a book."),
    ("pen", "He is smart. Isn't he? Yes, he is.")
]

columns = ["word", "sentence"]
df = spark.createDataFrame(data, columns)
df.show()

Output:

+-----+-----------------------------+
|word |sentence                     |
+-----+-----------------------------+
|apple|apple,banana,grape           |
|car  |She went to the park...      |
|pen  |He is smart. Isn't he? Yes...|
+-----+-----------------------------+

Apply find_in_set() to Search Word Index

from pyspark.sql.functions import find_in_set, col

df = df.withColumn("word_index", find_in_set(col("word"), col("sentence")))
df.show(truncate=False)

Output:

+-----+--------------------------+-----------+
|word |sentence                  |word_index |
+-----+--------------------------+-----------+
|apple|apple,banana,grape        |1          |
|car  |She went to the park...   |0          |
|pen  |He is smart. Isn't he?... |0          |
+-----+--------------------------+-----------+

Watch the Full Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Cross Warehouse Queries in Fabric – Query Across Warehouses with T-SQL | Microsoft Fabric Tutorial

In Microsoft Fabric, you can execute cross warehouse queries to join or fetch data from multiple warehouses within the same workspace using standard T-SQL. This is useful for creating unified reports, performing analytics across departments, or consolidating data silos.

What Are Cross Warehouse Queries?

  • Allows querying across two or more Fabric Warehouses
  • Supports read-only access to external warehouse tables
  • Uses a three-part naming convention (Warehouse.Schema.Table) to reference objects in another warehouse
  • Enables you to join data without ETL duplication

✅ Prerequisites

  • Both warehouses must be in the same workspace
  • You must have read permissions on the external warehouse
  • SQL editor must be used within a Fabric Warehouse context

Example Query

Assume you are connected to SalesWarehouse and want to pull data from FinanceWarehouse:

SELECT o.OrderID, o.CustomerID, f.Balance
FROM dbo.Orders o
JOIN [FinanceWarehouse].[dbo].[CustomerAccounts] f
  ON o.CustomerID = f.CustomerID;

This query joins the Orders table from the current warehouse with the CustomerAccounts table from the FinanceWarehouse.

Use Cases

  • Central reporting across multiple business domains
  • Joining data from regional and global warehouses
  • Performing data validation across systems
  • Avoiding ETL and staging for short-term analytics

⚠️ Limitations

  • Write operations (INSERT, UPDATE) are not allowed on external warehouse tables
  • Performance depends on table sizes and Fabric capacity
  • Cross queries are scoped to the same workspace only
  • Only supported in SQL Editor, not in Dataflow Gen2 or Power BI visuals directly

Tips for Efficient Cross Warehouse Queries

  • Use TOP or WHERE clauses to limit data early
  • Apply indexing strategies on foreign keys in both warehouses
  • Use SELECT ... INTO to materialize external data into local staging tables if needed repeatedly
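
For the last tip, a minimal sketch of materializing external data into a local staging table (schema, column names, and the filter are illustrative):

SELECT f.CustomerID, f.Balance
INTO dbo.Staging_CustomerAccounts
FROM [FinanceWarehouse].[dbo].[CustomerAccounts] AS f
WHERE f.Balance > 1000;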

Watch the Full Tutorial

Blog post written with the help of ChatGPT.

How to Query Tables from Lake Database Using Serverless SQL Pool in Azure Synapse | Azure Synapse Analytics Tutorial

Overview

In Azure Synapse Analytics, a Lake Database created using Spark pools can be queried directly using Serverless SQL Pools. This allows seamless integration of big data stored in Data Lake with familiar T-SQL syntax for reporting and analytics without data movement.

Why Query Lake Databases with Serverless SQL?

  • No need to provision or manage compute resources
  • Familiar T-SQL interface for querying big data
  • Cost-effective pay-per-query model
  • Supports querying Delta, Parquet, and CSV formats

Step-by-Step: Querying Spark-Created Tables

✅ Step 1: Create Table in Lake Database via Spark (Optional)

%%spark
CREATE TABLE lakehouse.customer (
  id INT,
  name STRING,
  country STRING
)
USING DELTA
LOCATION 'abfss://datalake@youraccount.dfs.core.windows.net/lake/customer'

✅ Step 2: Confirm Table Appears in Lake Database

Go to Synapse Studio > Data > Lake Database and confirm the table is visible under the specified database.

✅ Step 3: Query the Table Using Serverless SQL

-- Use Serverless SQL Pool
SELECT TOP 10 * 
FROM [YourLakeDB].[dbo].[customer];

✅ The table is accessible through the built-in metadata and requires no external table definition.

Tips

  • Ensure Spark and SQL pools have access to the same storage and metadata
  • Use TOP or WHERE clauses for performance optimization
  • Lake DB tables should be created under Spark with valid Delta/Parquet paths

Use Cases

  • Ad-hoc querying of curated data lakes
  • Feeding Power BI dashboards directly from Lake DB
  • Joining lake tables with Serverless SQL objects like external tables or views

Watch the Video Tutorial

Credit: Content created with the help of ChatGPT and Gemini.

Split_part() in PySpark Explained | Split Strings by Delimiter and Extract Specific Part #pyspark | PySpark Tutorial

In this tutorial, you'll learn how to use the split_part() function in PySpark to extract specific substrings by a given delimiter, such as pulling username from an email, or ZIP code from a location string.

Sample Data

data = [
    ("john.doe@example.com", "NY-10001-USA"),
    ("alice.smith@domain.org", "CA-90001-USA"),
    ("bob99@company.net", "TX-73301-USA")
]

columns = ["email", "location"]

df = spark.createDataFrame(data, columns)
df.show()

Output:

+----------------------+--------------+
| email                | location     |
+----------------------+--------------+
| john.doe@example.com | NY-10001-USA |
| alice.smith@domain.org | CA-90001-USA |
| bob99@company.net    | TX-73301-USA |
+----------------------+--------------+

Extract Specific Parts Using split_part()

from pyspark.sql.functions import split_part, col

df = df.withColumn("username", split_part(col("email"), "@", 1)) \
       .withColumn("domain", split_part(col("email"), "@", 2)) \
       .withColumn("state", split_part(col("location"), "-", 1)) \
       .withColumn("zip", split_part(col("location"), "-", 2)) \
       .withColumn("country", split_part(col("location"), "-", 3))

df.select("email", "username", "domain", "location", "state", "zip", "country").show(truncate=False)

Output:

+----------------------+----------------+------------------+--------------+-----+-----+-------+
| email                | username       | domain           | location     |state| zip |country|
+----------------------+----------------+------------------+--------------+-----+-----+-------+
| john.doe@example.com | john.doe       | example.com      | NY-10001-USA | NY  |10001| USA   |
| alice.smith@domain.org| alice.smith   | domain.org       | CA-90001-USA | CA  |90001| USA   |
| bob99@company.net    | bob99          | company.net      | TX-73301-USA | TX  |73301| USA   |
+----------------------+----------------+------------------+--------------+-----+-----+-------+

Watch the Full Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Time Travel in Fabric Warehouse – Query Historical Data with T-SQL | Microsoft Fabric Tutorial

Microsoft Fabric's Time Travel feature allows you to query past versions of your data warehouse tables using T-SQL. It’s a powerful capability for historical analysis, reproducibility, and data recovery.

What is Time Travel?

  • Query data as it existed at a specific point in time
  • Retention up to 30 days
  • Uses the FOR TIMESTAMP AS OF clause in SELECT statements
  • Read-only – no updates, deletes, or inserts allowed

✅ Syntax Example

SELECT * 
FROM dbo.dimension_customer
OPTION (FOR TIMESTAMP AS OF '2024-03-13T19:39:35.280');

Use this to get the current UTC timestamp in the correct format:

SELECT CONVERT(varchar, SYSUTCDATETIME(), 126) AS TruncatedUTC;

Use Cases for Time Travel

  • Historical trend analysis and reporting
  • Machine learning model reproducibility
  • Root cause investigations and debugging
  • Safe verification during ETL pipeline development
  • Compliance and audit traceability

Permissions

  • Roles: Admin, Member, Contributor, Viewer
  • Security enforced: RLS, CLS, and DDM apply even in time travel

⚠️ Limitations

  • Only SELECT statements are supported
  • Not supported inside views or CTAS queries
  • Only UTC timestamps are allowed (no local timezones)
  • Supports up to 3-digit millisecond precision
  • Not usable on Lakehouse endpoints or Power BI datasets directly

Watch the Full Tutorial

Blog post written with the help of ChatGPT.

Delta Table Tutorial in Azure Synapse: Create, Insert, Update, Delete & Merge with Spark SQL | Azure Synapse Analytics Tutorial

Overview

Delta Lake is a powerful open-source storage layer that brings ACID transactions to Apache Spark. In Azure Synapse, Delta Tables enable you to perform CREATE, INSERT, UPDATE, DELETE, and MERGE operations directly within your Spark SQL workflows.

1. Create a Delta Table

%%spark
CREATE TABLE lake.delta_customers (
  id INT,
  name STRING,
  country STRING
)
USING DELTA
LOCATION 'abfss://data@yourstorageaccount.dfs.core.windows.net/delta/customers'

➕ 2. Insert Records

%%spark
INSERT INTO lake.delta_customers VALUES
  (1, 'Alice', 'USA'),
  (2, 'Bob', 'Canada');

✏️ 3. Update Records

%%spark
UPDATE lake.delta_customers
SET country = 'United States'
WHERE name = 'Alice';

❌ 4. Delete Records

%%spark
DELETE FROM lake.delta_customers
WHERE id = 2;

5. Merge (Upsert) Records

%%spark
MERGE INTO lake.delta_customers AS target
USING (
  SELECT 2 AS id, 'Robert' AS name, 'UK' AS country
) AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

6. Query the Delta Table

%%sql
SELECT * FROM lake.delta_customers;

Why Use Delta in Synapse?

  • ACID Transactions
  • Time Travel and Versioning (see the sketch below)
  • Efficient Upserts (MERGE)
  • Seamless integration with Spark SQL
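
As a quick illustration of the time travel bullet above, an earlier version of a Delta table can be read back with PySpark; a minimal sketch, assuming the same storage path used when the table was created:

%%pyspark
# versionAsOf = 0 reads the table as of its first commit
hist_df = spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("abfss://data@yourstorageaccount.dfs.core.windows.net/delta/customers")
hist_df.show()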

Use Cases

  • Change Data Capture (CDC) pipelines
  • Slowly Changing Dimensions (SCD)
  • Transactional lakehouse architectures

Watch the Full Video Tutorial

Credit: Content created with the help of ChatGPT and Gemini.

Split Strings in PySpark | split(str, pattern, limit) Function Explained with Examples | PySpark Tutorial

In this tutorial, you’ll learn how to use split(str, pattern[, limit]) to break strings into arrays. We'll cover email parsing, splitting full names, and handling pipe-delimited data.

Sample Data

data = [
  ("john.doe@example.com", "John Doe", "john|doe|35|NY"),
  ("alice.smith@mail.org", "Alice Smith", "alice|smith|29|CA"),
  ("bob.jones@test.net", "Bob Jones", "bob|jones|42|TX")
]

columns = ["email", "full_name", "user_data"]
df = spark.createDataFrame(data, columns)
df.show()

Output:

+----------------------+-----------+--------------------+
| email                | full_name | user_data          |
+----------------------+-----------+--------------------+
| john.doe@example.com | John Doe  | john|doe|35|NY      |
| alice.smith@mail.org | Alice Smith | alice|smith|29|CA |
| bob.jones@test.net   | Bob Jones | bob|jones|42|TX     |
+----------------------+-----------+--------------------+

Split Email into Username and Domain

from pyspark.sql.functions import split

df = df.withColumn("email_parts", split("email", "@"))
df.select("email", "email_parts").show(truncate=False)

Output: Shows username and domain in an array

Split Full Name into First and Last Name

df = df.withColumn("name_split", split("full_name", " "))
df.select("full_name", "name_split").show(truncate=False)

Output: ["John", "Doe"], ["Alice", "Smith"], etc.

Split Pipe-Delimited User Data

df = df.withColumn("user_fields", split("user_data", "\\|"))
df.select("user_data", "user_fields").show(truncate=False)

Output: ["john", "doe", "35", "NY"], etc.

Watch the Full Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Spark Table Formats in Azure Synapse: Parquet, CSV, Delta, JSON, Avro & ORC Explained | Azure Synapse Analytics Tutorial

Overview

Apache Spark in Azure Synapse supports multiple table formats, each optimized for different use cases like performance, compatibility, or schema evolution. This post explores how to create and use Parquet, Delta, CSV, JSON, Avro, and ORC table formats.

Common File Formats and Their Use Cases

  • Parquet – Columnar storage, best for analytics
  • Delta – ACID compliant, supports updates/deletes
  • CSV – Simple format for interoperability
  • JSON – Best for semi-structured data
  • Avro – Row-based format, good for schema evolution
  • ORC – Columnar, highly optimized for big data

Creating Tables in Different Formats

✅ 1. Parquet Table

%%spark
CREATE TABLE lake.parquet_table (
  id INT, name STRING
)
USING PARQUET
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/parquet'

✅ 2. Delta Table

%%spark
CREATE TABLE lake.delta_table (
  id INT, product STRING
)
USING DELTA
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/delta'

✅ 3. CSV Table

%%spark
CREATE TABLE lake.csv_table (
  id INT, region STRING
)
USING CSV
OPTIONS ('header'='true')
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/csv'

✅ 4. JSON Table

%%spark
CREATE TABLE lake.json_table
USING JSON
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/json'

✅ 5. Avro Table

%%spark
CREATE TABLE lake.avro_table
USING AVRO
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/avro'

✅ 6. ORC Table

%%spark
CREATE TABLE lake.orc_table
USING ORC
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/orc'

Format Comparison Table

+---------+-----------------------------+------------------+-------------------+----------------------------+
| Format  | Type                        | Schema Evolution | ACID Transactions | Best Use Case              |
+---------+-----------------------------+------------------+-------------------+----------------------------+
| Parquet | Columnar                    | Partial          | No                | Read-heavy analytics       |
| Delta   | Columnar + transaction logs | Yes              | Yes               | ETL with updates/deletes   |
| CSV     | Row                         | No               | No                | Simple data exchange       |
| JSON    | Row                         | No               | No                | Log data or API responses  |
| Avro    | Row                         | Yes              | No                | Data pipelines & evolution |
| ORC     | Columnar                    | Yes              | No                | Big data batch processing  |
+---------+-----------------------------+------------------+-------------------+----------------------------+

Best Practices

  • Use Delta for ACID-compliant workloads
  • Use Parquet or ORC for optimized analytical queries
  • Use CSV only when human readability or legacy system compatibility is needed

Watch the Full Tutorial

Credit: Content created with the help of ChatGPT and Gemini.

String Formatting in PySpark | concat_ws, format_number, printf, repeat, lpad, rpad Explained | PySpark Tutorial

This tutorial demonstrates how to use PySpark string functions like concat_ws, format_number, format_string, printf, repeat, lpad, and rpad for formatting, combining, and manipulating string values in DataFrames.

Sample Data Creation

from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder.appName("StringFormattingDemo").getOrCreate()

data = [
    ("John", 950000, "USD", "Smith"),
    ("Alice", 120500, "EUR", "Brown"),
    ("Bob", 87999, "INR", "Taylor")
]

columns = ["first_name", "salary", "country", "last_name"]
df = spark.createDataFrame(data, columns)
df.show()

Sample DataFrame Output

+-----------+--------+-------+----------+
|first_name | salary |country| last_name|
+-----------+--------+-------+----------+
| John      | 950000 | USD   | Smith    |
| Alice     | 120500 | EUR   | Brown    |
| Bob       |  87999 | INR   | Taylor   |
+-----------+--------+-------+----------+

concat_ws()

df.withColumn("full_name", concat_ws("-", "first_name", "country")).show()

Output: Creates a new column combining first name and country with hyphen

format_number()

df.withColumn("formatted_salary", format_number("salary", 2)).show()

Output: Adds a formatted string version of salary with 2 decimal places

format_string()

df.withColumn("greeting", format_string("Hello %s %s", col("first_name"), col("country"))).show()

Output: Hello John USD

printf()

df.withColumn("price_tag", printf("Amount = %.2f", col("salary"))).show()

Output: Amount = 950000.00

repeat()

df.withColumn("excited", repeat(lit("test"), 2)).show()

Output: testtest

⬅️➡️ lpad() and rpad()

df.withColumn("last_lpad", lpad("last_name", 10, "*")) \
  .withColumn("first_rpad", rpad("first_name", 10, "-")).show()

Output: Pads strings to given length with characters on left/right

Watch the Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Query Data & Use Data Preview in SQL Editor in Fabric | Microsoft Fabric Tutorial For Data Engineers

Microsoft Fabric offers a powerful and intuitive SQL Editor for data engineers to query, preview, and analyze data in real time. In this tutorial, you’ll learn how to effectively use the SQL Editor and Data Preview features to work with warehouse tables.

✅ How to Open the SQL Editor

  1. Navigate to your Warehouse or Lakehouse in Microsoft Fabric
  2. Click on the SQL Editor tab from the top menu
  3. You can also open an existing SQL script or create a new one

✍️ Writing and Running SQL Queries

Use the editor to write T-SQL queries just like in SSMS or Azure Data Studio. Example:

SELECT TOP 100 * FROM dbo.Orders
WHERE OrderAmount > 500
ORDER BY OrderDate DESC;

Click Run (▶️) or press Ctrl + Enter to execute.

Using Data Preview

  • Right-click any table from the object explorer
  • Select Preview data
  • Fabric shows the first 100 rows with basic metadata
  • You can use this to validate structure and sample content

Filter, Sort & Explore Result Sets

  • Click column headers to sort ascending or descending
  • Use the filter icon to apply basic text or numeric filters
  • Pin important columns for easier analysis

Tips for Efficient Querying

  • Use TOP or LIMIT to avoid large result sets during testing
  • Alias long table names for cleaner syntax
  • Click on “Format SQL” for auto-formatting queries
  • Use INFORMATION_SCHEMA.TABLES to find all tables:
SELECT * FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'dbo';

Watch the Full Tutorial

Blog post written with the help of ChatGPT.

PySpark Tutorial: String Search in PySpark - How to Use contains, startswith, endswith, like, rlike, locate

Learn how to perform string filtering and matching in PySpark using functions like contains(), startswith(), endswith(), like, rlike, and locate().

Sample DataFrame

+--------------+----------------------+----------+
|customer_name |email                 |status    |
+--------------+----------------------+----------+
|Aamir Khan    |aamir@example.com     |Active    |
|Bob Smith     |bob.smith@mail.com    |Inactive  |
|Lisa Jones    |lisa.j@data.org       |Active    |
|Charlie Brown |charlie@peanuts.net   |Pending   |
|Alice Wonder  |alice.w@wonder.org    |Active    |
|Eve H         |eve.h@allabout.eve    |Inactive  |
|John Doe      |john.doe@unknown.com  |Active    |
+--------------+----------------------+----------+

✅ contains()

Find customers whose email contains 'example'.

df.filter(col("email").contains("example")).show(truncate=False)

✅ startswith() and endswith()

Filter emails starting with 'a' and ending with '.org'.

df.filter(col("email").startswith("a")).show(truncate=False)
df.filter(col("email").endswith(".org")).show(truncate=False)

✅ like()

Use SQL-style wildcard match.

df.filter(col("customer_name").like("%is%")).show(truncate=False)

✅ rlike()

Use regex pattern match (e.g., find email domains starting with m or d).

df.filter(col("email").rlike("^[md]")).show(truncate=False)

✅ locate()

Find the position of a substring in the email (e.g., the '@' position).

df.select("email", locate("@", col("email")).alias("at_pos")).show(truncate=False)

Watch the Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Restore Warehouse In-Place in Fabric – Step-by-Step with Restore Point | Microsoft Fabric Tutorial

In this Microsoft Fabric tutorial, you'll learn how to restore your warehouse in-place using system or user-defined restore points. This capability is essential for disaster recovery, testing environments, or rolling back undesired changes.

What is In-Place Restore?

  • Restores a warehouse to a prior point-in-time
  • Overwrites the existing warehouse (name stays the same)
  • Restores all components: models, queries, tables
  • Useful for undoing accidental deletes, corruption, or failed deployments

Types of Restore Points

  • System-Created: Every 8 hours automatically (up to 180 kept)
  • User-Defined: Manually created before/after major changes
  • Retention for both: 30 days

✅ Common Use Cases

  • Rollback after a schema error
  • Reset test environments to a clean state
  • Recover deleted data or fix bad ETL runs

Step-by-Step: How to Restore

  1. Open Microsoft Fabric Portal
  2. Navigate to your Warehouse
  3. Click Settings → Restore Points
  4. Select a system or user-defined restore point
  5. Click Restore In-Place

⚠️ If you receive Error 5064, try again after a few moments.

Permissions Matrix

  • View Restore Points: Admin, Member, Contributor, Viewer
  • Create Restore Points: Admin, Member, Contributor
  • Perform Restore: Admin only

Billing & Cost

  • Storage: Charges apply for Parquet storage under OneLake
  • Compute: Charged when restore point is created or restored
  • Billing is against Fabric Capacity

⚠️ Limitations

  • Cannot rename warehouse on restore (same name only)
  • Restore points expire after 30 days — no extensions
  • Cannot restore across different warehouses
  • Restore does not include downstream linked pipelines

Watch the Full Video Tutorial

Blog post written with the help of ChatGPT.

Clone Tables in Fabric Warehouse – Zero Copy Cloning with T-SQL | Microsoft Fabric Tutorial

In this tutorial, we’ll walk through how to use zero-copy table cloning in Microsoft Fabric Warehouse using simple T-SQL commands. Cloning tables is useful for development, testing, historical reporting, or machine learning experimentation—all without duplicating data.

What is a Zero-Copy Clone?

  • Creates a new table referencing the same data files (Parquet in OneLake)
  • No physical data duplication
  • Cloned tables behave as independent tables
  • Extremely fast and storage-efficient

✅ Syntax to Clone a Table

Clone Current State

CREATE TABLE dbo.Sales_Clone AS CLONE OF dbo.Sales;

Clone Table as of Specific Time

CREATE TABLE dbo.Sales_Clone_History 
AS CLONE OF dbo.Sales 
WITH ( SNAPSHOT_TIME = '2025-05-20T10:00:00Z' );

Use Cases for Cloning

  • Development and Testing
  • Point-in-time Historical Reporting
  • Recovery and troubleshooting
  • Data Archiving for compliance
  • Reproducible ML Training Pipelines

⚠️ Limitations to Keep in Mind

  • Cloning is limited to tables within the same warehouse
  • Cannot clone tables from Lakehouse SQL endpoints
  • Cloning entire schema or warehouse is not supported
  • Time-travel cloning must be within 30-day retention
  • Schema changes in the source table may block time-based clones

Tip: Clones are disposable. Deleting a clone does not impact the original table.

Watch the Full Tutorial

Blog post written with the help of ChatGPT.

Managed vs External Tables in Spark SQL Lake Database | Azure Synapse Analytics Tutorial

Overview

In Azure Synapse Analytics, you can create both Managed and External Tables within a Lake Database using Spark SQL. Understanding the difference is essential for managing data, storage, and performance effectively.

What Are Managed Tables?

Managed tables are those where both the metadata and the data are controlled by Synapse Spark. When you drop the table, the data and metadata are both deleted.

✅ Create a Managed Table

%%spark
CREATE TABLE lakehouse.managed_employee (
    id INT,
    name STRING,
    salary FLOAT
)

✔️ No LOCATION clause is specified — Spark decides where to store the data internally.

What Are External Tables?

External tables store only the metadata in the Lake Database while the data resides externally in user-defined paths like ADLS Gen2.

✅ Create an External Table

%%spark
CREATE TABLE lakehouse.external_sales (
    id INT,
    product STRING,
    amount DOUBLE
)
USING DELTA
LOCATION 'abfss://data@yourstorageaccount.dfs.core.windows.net/sales/external/'

✔️ Data lives outside Spark's internal catalog. Deleting the table will NOT remove the data.

Key Differences

+-------------------------+--------------------+----------------------+
| Feature                 | Managed Table      | External Table       |
+-------------------------+--------------------+----------------------+
| Data location           | Managed by Synapse | Custom ADLS path     |
| Drop table deletes data | Yes                | No                   |
| Typical usage           | Quick prototyping  | Production pipelines |
+-------------------------+--------------------+----------------------+

Best Practices

  • Use external tables for production where you need to retain data
  • Use managed tables for temporary analysis or rapid testing
  • Prefer Delta format for ACID support and versioning

Example: Read External Table

%%sql
SELECT * FROM lakehouse.external_sales;
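
To confirm whether a given table is managed or external, Spark SQL's DESCRIBE TABLE EXTENDED reports the table type; for example, using the table created above:

%%sql
DESCRIBE TABLE EXTENDED lakehouse.external_sales;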

Watch the Full Video Explanation

Credit: Content created with the help of ChatGPT and Gemini.

Substring Functions in PySpark: substr(), substring(), overlay(), left(), right() with Real Examples | PySpark Tutorial

In this tutorial, you'll learn how to use PySpark string functions like substr(), substring(), overlay(), left(), and right() to manipulate string columns in DataFrames.

1️⃣ Initialize Spark Session

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SubstringTutorial").getOrCreate()

2️⃣ Sample DataFrame

data = [("EMP-0001-NC", "maria.fernandez@healthcare.org"),
        ("EMP-0002-TX", "john.davis@fintech.com"),
        ("EMP-0003-CA", "lucy.liu@retailmart.net"),
        ("EMP-0004-FL", "peter.parker@dailybugle.com"),
        ("EMP-0005-NY", "bruce.wayne@gotham.org")]

columns = ["employee_code", "email"]

df = spark.createDataFrame(data, columns)
df.show(truncate=False)

Output:

+-------------+-----------------------------+
|employee_code|email                        |
+-------------+-----------------------------+
|EMP-0001-NC  |maria.fernandez@healthcare.org|
|EMP-0002-TX  |john.davis@fintech.com       |
|EMP-0003-CA  |lucy.liu@retailmart.net      |
|EMP-0004-FL  |peter.parker@dailybugle.com  |
|EMP-0005-NY  |bruce.wayne@gotham.org       |
+-------------+-----------------------------+

3️⃣ substr() Function

Extract substring starting at a specific position with defined length.

from pyspark.sql.functions import col, substr
df_substr = df.withColumn("emp_id", substr("employee_code", 5, 4))
df_substr.show(truncate=False)

Output:

+-------------+-----------------------------+------+
|employee_code|email                        |emp_id|
+-------------+-----------------------------+------+
|EMP-0001-NC  |maria.fernandez@healthcare.org|0001  |
|EMP-0002-TX  |john.davis@fintech.com       |0002  |
|EMP-0003-CA  |lucy.liu@retailmart.net      |0003  |
|EMP-0004-FL  |peter.parker@dailybugle.com  |0004  |
|EMP-0005-NY  |bruce.wayne@gotham.org       |0005  |
+-------------+-----------------------------+------+

4️⃣ substring() Function

Alias for substr() - works the same way.

from pyspark.sql.functions import substring
df_substring = df.withColumn("emp_id", substring("employee_code", 5, 4))
df_substring.show(truncate=False)

5️⃣ overlay() Function

Replaces part of the string with another string.

from pyspark.sql.functions import overlay, lit
df_overlay = df.withColumn("masked_email", overlay("email", lit("XXXXXX"), 1, 5))
df_overlay.show(truncate=False)

6️⃣ left() and right() Functions

Extract leftmost or rightmost characters from string.

from pyspark.sql.functions import left, right

df_left = df.withColumn("left_code", left("employee_code", 3))
df_right = df.withColumn("right_code", right("employee_code", 2))

df_left.show(truncate=False)
df_right.show(truncate=False)

Watch the Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

User Defined Inline Table-Valued Functions with CROSS APPLY | Microsoft Fabric Tutorial

In this Microsoft Fabric tutorial, we’ll demonstrate how to create and use inline table-valued functions (iTVFs) and apply them with CROSS APPLY. These are great tools for reusable logic and dynamic row-by-row evaluation.

What is a Function in SQL?

A function is a reusable SQL object that returns either a value or a table. In Fabric Warehouse, only inline table-valued functions (iTVFs) are currently supported.

Not Supported:

  • Scalar functions
  • Multi-statement table-valued functions

✅ Supported:

  • Inline Table-Valued Functions (iTVFs)

Step 1: Create Sample Orders Table

CREATE TABLE dbo.Orders (
    OrderID INT,
    CustomerName VARCHAR(100),
    OrderDate DATE,
    OrderAmount DECIMAL(10, 2)
);

Step 2: Insert Sample Data

INSERT INTO dbo.Orders VALUES
(1, 'Aamir', '2024-01-10', 1200.00),
(2, 'Sara', '2024-02-15', 800.00),
(3, 'John', '2024-03-20', 300.00),
(4, 'Mia', '2024-03-22', 1500.00),
(5, 'Aamir', '2024-04-01', 200.00);

Step 3: Create Discount Function with CROSS APPLY

This iTVF applies a 10% discount to a given amount:

CREATE FUNCTION dbo.fn_ApplyDiscount (@Amount DECIMAL(10,2))
RETURNS TABLE
AS
RETURN (
    SELECT @Amount * 0.90 AS DiscountedAmount
);

✅ Using CROSS APPLY

SELECT 
    o.OrderID, o.CustomerName, o.OrderAmount,
    d.DiscountedAmount
FROM dbo.Orders o
CROSS APPLY dbo.fn_ApplyDiscount(o.OrderAmount) d;

Step 4: Create Function to Filter Orders

This function returns orders where the amount is greater than or equal to a specified value:

CREATE FUNCTION dbo.fn_GetHighValueOrders (@MinAmount DECIMAL(10,2))
RETURNS TABLE
AS
RETURN (
    SELECT OrderID, CustomerName, OrderAmount
    FROM dbo.Orders
    WHERE OrderAmount >= @MinAmount
);

✅ Usage Example:

SELECT * FROM dbo.fn_GetHighValueOrders(200.00);
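
The two functions can also be combined; a sketch that keeps only higher-value orders and then applies the discount to each returned row:

SELECT
    h.OrderID, h.CustomerName, h.OrderAmount,
    d.DiscountedAmount
FROM dbo.fn_GetHighValueOrders(500.00) AS h
CROSS APPLY dbo.fn_ApplyDiscount(h.OrderAmount) AS d;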

Optional Cleanup

-- DROP FUNCTION dbo.fn_ApplyDiscount;
-- DROP FUNCTION dbo.fn_GetHighValueOrders;
-- DROP TABLE dbo.Orders;

Watch the Full Tutorial

Blog post written with the help of ChatGPT.

Lake Database in Azure Synapse: Spark SQL vs Delta vs Parquet | Full Demo with Create & Insert | Azure Synapse Analytics Tutorial

Overview

Azure Synapse Lake Databases offer powerful data lake querying capabilities using Spark SQL. In this tutorial, we demonstrate how to:

  • Create and use Lake Databases
  • Compare Parquet, Delta, and Spark SQL tables
  • Insert, update, and read data using Apache Spark in Synapse

Step 1: Create Spark SQL Table (Unmanaged)

%%spark
CREATE TABLE IF NOT EXISTS lakehouse.spark_table (
    id INT, name STRING, age INT
)
USING PARQUET
LOCATION 'abfss://lake@storageaccount.dfs.core.windows.net/tables/spark_table'

Step 2: Create Delta Table

%%spark
CREATE TABLE lakehouse.delta_table (
    id INT, product STRING, quantity INT
)
USING DELTA
LOCATION 'abfss://lake@storageaccount.dfs.core.windows.net/tables/delta_table'

Step 3: Insert Data into Tables

%%spark
INSERT INTO lakehouse.spark_table VALUES (1, 'Ali', 30), (2, 'Zara', 25);
INSERT INTO lakehouse.delta_table VALUES (1, 'Phone', 10), (2, 'Tablet', 20);

Step 4: Query Tables

%%sql
SELECT * FROM lakehouse.spark_table;
SELECT * FROM lakehouse.delta_table;

✏️ Step 5: Update Delta Table

Delta supports ACID transactions, including UPDATE and DELETE operations.

%%spark
UPDATE lakehouse.delta_table SET quantity = 15 WHERE id = 1;
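
A DELETE against the same Delta table works the same way:

%%spark
DELETE FROM lakehouse.delta_table WHERE id = 2;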

Comparison: Parquet vs Delta vs Spark Tables

+------------------------+-----------+---------------------------------+-----------------+
| Feature                | Parquet   | Delta                           | Spark SQL Table |
+------------------------+-----------+---------------------------------+-----------------+
| Storage format         | Columnar  | Delta (based on Parquet + logs) | Varies          |
| ACID support           | No        | Yes                             | No              |
| Supports update/delete | No        | Yes                             | No              |
| Performance            | Fast read | Fast read + version control     | Simple          |
+------------------------+-----------+---------------------------------+-----------------+

Best Practices

  • Use Delta Lake for write-heavy transactional workloads
  • Use Parquet for optimized reads and archival
  • Validate permissions for Lake DB and storage paths

Watch the Full Demo

Credit: Content created with the help of ChatGPT and Gemini.

How to Clean Strings in PySpark | lower(), trim(), initcap() Explained with Real Data

What You’ll Learn

  • How to use lower() to convert text to lowercase
  • How to use trim() to remove leading/trailing spaces
  • How to use initcap() to capitalize first letter of each word
  • Chaining multiple string functions

Sample Data

data = [
    (" Aamir ",),
    ("LISA ",),
    ("  charLie   ",),
    ("BOB",),
    (" eli",)
]
columns = ["raw_name"]
df = spark.createDataFrame(data, columns)
df.show(truncate=False)

Output:
+-----------+
|raw_name   |
+-----------+
| Aamir     |
|LISA       |
|  charLie  |
|BOB        |
| eli       |
+-----------+

Cleaned Data using PySpark Functions

1️⃣ Apply trim()

from pyspark.sql.functions import trim
df_trimmed = df.withColumn("trimmed", trim("raw_name"))
df_trimmed.show(truncate=False)

2️⃣ Apply lower() and upper()

from pyspark.sql.functions import lower, upper
df_lower = df_trimmed.withColumn("lowercase", lower("trimmed"))
df_upper = df_trimmed.withColumn("uppercase", upper("trimmed"))
df_lower.show(truncate=False)
df_upper.show(truncate=False)

3️⃣ Apply initcap()

from pyspark.sql.functions import initcap
df_initcap = df_trimmed.withColumn("titlecase", initcap("trimmed"))
df_initcap.show(truncate=False)
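
The chaining mentioned at the start can be done in a single expression; a minimal sketch combining the three functions:

from pyspark.sql.functions import trim, lower, initcap
df_clean = df.withColumn("clean_name", initcap(lower(trim("raw_name"))))
df_clean.show(truncate=False)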

Watch the Full Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Stored Procedure in Fabric Warehouse – Input & Output Parameters SQL Demo | Microsoft Fabric Tutorial

In this Microsoft Fabric tutorial, you'll learn how to create and execute a stored procedure with input and output parameters. This is especially useful when you want to encapsulate query logic and reuse it across reports, data pipelines, and dashboards.

What is a Stored Procedure?

A stored procedure is a reusable set of SQL statements saved in the database. It can take input parameters, return output values, and perform operations like reading or modifying data.

✅ Benefits of Stored Procedures:

  • Encapsulates business logic
  • Promotes reuse and consistency
  • Supports input and output parameters
  • Can include conditional and complex logic

Step 1: Create a Sample Table

CREATE TABLE dbo.Orders (
    OrderID INT,
    CustomerName VARCHAR(100),
    OrderAmount DECIMAL(10,2)
);

Step 2: Insert Sample Data

INSERT INTO dbo.Orders VALUES
(1, 'Aamir', 500.00),
(2, 'Sara', 750.00),
(3, 'John', 1200.00);

⚙️ Step 3: Create the Stored Procedure

This procedure accepts a customer name as input and returns the total order amount as output:

CREATE OR ALTER PROCEDURE dbo.GetCustomerTotalOrders
    @CustomerName VARCHAR(100),
    @TotalAmount DECIMAL(10,2) OUTPUT
AS
BEGIN
    SELECT @TotalAmount = SUM(OrderAmount)
    FROM dbo.Orders
    WHERE CustomerName = @CustomerName;
END;

Step 4: Execute the Stored Procedure

-- Declare output variable
DECLARE @Total DECIMAL(10,2);

-- Execute with input and capture output
EXEC dbo.GetCustomerTotalOrders
    @CustomerName = 'Aamir',
    @TotalAmount = @Total OUTPUT;

-- Display the result
SELECT 'Total Order Amount for Aamir' AS Description, @Total AS TotalAmount;

Optional Cleanup

-- DROP PROCEDURE dbo.GetCustomerTotalOrders;
-- DROP TABLE dbo.Orders;

View Final Orders Table

SELECT * FROM dbo.Orders;

Watch the Full Tutorial

Blog post written with the help of ChatGPT.

Export Data from Synapse Dedicated SQL Pool to Single CSV File in ADLS Gen2 | Azure Synapse Analytics Tutorial

Overview

In Azure Synapse Analytics, it’s common to export data from a Dedicated SQL Pool to Azure Data Lake Storage Gen2 in CSV format for reporting, archival, or integration with downstream tools. This tutorial demonstrates how to do that and ensure the output is saved as a single CSV file.

Step-by-Step: Export to Single CSV File

✅ 1. Query Data Using Synapse Spark

%%pyspark
df = spark.read \
    .synapsesql("yourdedicatedpool.dbo.SalesData")
df.show()

✅ 2. Repartition to Single File

df_single = df.repartition(1)

This ensures only one output file is generated when writing to storage.

✅ 3. Write to ADLS Gen2 as CSV

df_single.write \
    .mode("overwrite") \
    .option("header", "true") \
    .csv("abfss://export@yourstorageaccount.dfs.core.windows.net/sales/csv/")

Notes

  • repartition(1) should only be used for small to medium datasets due to single-threaded write.
  • Azure Storage Explorer can be used to rename the output file (e.g., from part-00000.csv to sales_report.csv); a notebook-based alternative is sketched after this list.
  • Make sure Synapse workspace has access permissions (RBAC or SAS).
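
As an alternative to renaming in Azure Storage Explorer, the single part file can be renamed from the notebook itself; a minimal sketch, assuming the Synapse mssparkutils file-system utilities are available and using the illustrative output path from above:

%%pyspark
from notebookutils import mssparkutils

out_dir = "abfss://export@yourstorageaccount.dfs.core.windows.net/sales/csv/"

# Locate the single part-*.csv file written by Spark and move it to a friendly name
part_file = [f.path for f in mssparkutils.fs.ls(out_dir)
             if f.name.startswith("part-") and f.name.endswith(".csv")][0]
mssparkutils.fs.mv(part_file, out_dir + "sales_report.csv")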

Use Cases

  • Exporting summarized data for business intelligence
  • Staging datasets for Power BI or external tools
  • Archiving daily/weekly sales reports

Watch the Video Tutorial

Credit: Content created with the help of ChatGPT and Gemini.

PySpark Tutorial: Master PySpark Sorting: sort(), asc(), desc() Explained with Examples #pyspark

In this post, you'll learn how to use various sort functions in PySpark to order data by ascending/descending, and control the handling of nulls. This guide is perfect for anyone working with big data in Spark!

1️⃣ Setup

from pyspark.sql import SparkSession
from pyspark.sql.functions import asc, desc, asc_nulls_first, desc_nulls_last

spark = SparkSession.builder.appName("SortFunctionsDemo").getOrCreate()

2️⃣ Sample Data

sample_data = [
    ("Alice", 5000),
    ("Bob", None),
    ("Cara", 6200),
    ("Dan", None),
    ("Eli", 4500),
    ("Fay", 7000)
]

columns = ["employee", "sales"]
df = spark.createDataFrame(sample_data, columns)
df.show()

Output:

+--------+-----+
|employee|sales|
+--------+-----+
|   Alice| 5000|
|     Bob| null|
|    Cara| 6200|
|     Dan| null|
|     Eli| 4500|
|     Fay| 7000|
+--------+-----+

3️⃣ Sort by Ascending

df.orderBy(asc("sales")).show()

4️⃣ Sort with Nulls First

df.orderBy(asc_nulls_first("sales")).show()

5️⃣ Sort with Nulls Last

df.orderBy(desc_nulls_last("sales")).show()

6️⃣ Descending Order

df.orderBy(desc("sales")).show()

Video Tutorial

© 2024 Aamir Shahzad — All rights reserved.
Some of the contents in this website were created with assistance from ChatGPT and Gemini.

Insert, Update & Delete in Fabric Warehouse – Full SQL Demo for Beginners | Microsoft Fabric Tutorial

In this tutorial, you'll learn how to use the core DML (Data Manipulation Language) commands — INSERT, UPDATE, and DELETE — in Microsoft Fabric Warehouse. These are the foundation of interacting with structured data in your SQL environment.

What Are DML Commands?

  • INSERT: Adds new rows into a table
  • UPDATE: Modifies existing rows
  • DELETE: Removes rows

Fabric Warehouse fully supports these operations using standard T-SQL syntax.

Step 1: Create a Sample Table

CREATE TABLE dbo.Products (
    ProductID INT,
    ProductName VARCHAR(100),
    Category VARCHAR(50),
    Price DECIMAL(10, 2),
    Stock INT
);

Step 2: Insert Sample Data

INSERT INTO dbo.Products (ProductID, ProductName, Category, Price, Stock) VALUES
(1, 'Laptop', 'Electronics', 1200.00, 10),
(2, 'Smartphone', 'Electronics', 800.00, 25),
(3, 'Desk Chair', 'Furniture', 150.00, 40),
(4, 'Monitor', 'Electronics', 300.00, 15);

Verify Insert:

SELECT * FROM dbo.Products;

✏️ Step 3: Update Data

Increase the price of all Electronics items by 10%:

UPDATE dbo.Products
SET Price = Price * 1.10
WHERE Category = 'Electronics';

Verify Update:

SELECT * FROM dbo.Products;

❌ Step 4: Delete Data

Delete all products with stock less than 20:

DELETE FROM dbo.Products
WHERE Stock < 20;

Verify Delete:

SELECT * FROM dbo.Products;

Watch the Full Tutorial

Blog post written with the help of ChatGPT.

Read CSV File & Load to Dedicated SQL Pool Table Using Apache Spark Pool Notebook | Azure Synapse Analytics Tutorial

Overview

In this tutorial, you'll learn how to use Apache Spark Pools in Azure Synapse Analytics to read a CSV file from ADLS Gen2 and write that data into a Dedicated SQL Pool table. This is useful for ETL and data ingestion pipelines within your Synapse workspace.

Step-by-Step Workflow

✅ Step 1: Read CSV File from ADLS Gen2

%%pyspark
csv_df = spark.read.option("header", "true") \
    .csv("abfss://data@yourstorageaccount.dfs.core.windows.net/input/employees.csv")
csv_df.show()

✅ Step 2: Write Data to Dedicated SQL Pool

csv_df.write \
    .mode("overwrite") \
    .synapsesql("yourdedicatedpool.dbo.Employee")

Note: The `.synapsesql()` method is available by default when using Spark Pools in Synapse with built-in connectors.

✅ Step 3: Validate Inserted Data

%%pyspark
check_df = spark.read.synapsesql("yourdedicatedpool.dbo.Employee")
check_df.show()

Tips

  • Ensure the destination table exists before writing
  • Use overwrite or append mode depending on your use case
  • Review schema compatibility between DataFrame and SQL table

Use Cases

  • Bulk load historical data into data warehouse
  • ETL transformation using PySpark before loading
  • Automated data ingestion from CSV sources

Watch the Video Tutorial

Credit: Content created with the help of ChatGPT and Gemini.

PySpark Window Functions Explained | Rank, Dense_Rank, Lead, Lag, NTILE | Real-World Demo | PySpark Tutorial

This blog post will walk you through essential PySpark window functions like rank(), lag(), lead(), dense_rank(), and ntile() using practical examples. These functions are crucial for complex data analytics, ordering, and partitioning scenarios.

Sample Dataset

+--------+----------+-----+----------+
|Employee|Department|Sales|Sale_Date |
+--------+----------+-----+----------+
|Alice   |Sales     |5000 |2024-01-01|
|Ben     |Sales     |5500 |2024-01-01|
|Cara    |Marketing |4800 |2024-01-01|
|Dan     |HR        |5300 |2024-01-01|
|Alice   |Sales     |6200 |2024-01-02|
|Ben     |Sales     |5500 |2024-01-02|
|Cara    |Marketing |5100 |2024-01-02|
|Dan     |HR        |6000 |2024-01-02|
|Alice   |Sales     |7000 |2024-01-03|
|Ben     |Sales     |6400 |2024-01-03|
+--------+----------+-----+----------+

⚙️ Code Example

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number, rank, dense_rank, percent_rank, lag, lead, ntile
from pyspark.sql.window import Window

# Start Spark session
spark = SparkSession.builder.appName("WindowFunctionsDemo").getOrCreate()

# Sample DataFrame
data = [
    ("Alice", "Sales", 5000, "2024-01-01"),
    ("Ben", "Sales", 5500, "2024-01-01"),
    ("Cara", "Marketing", 4800, "2024-01-01"),
    ("Dan", "HR", 5300, "2024-01-01"),
    ("Alice", "Sales", 6200, "2024-01-02"),
    ("Ben", "Sales", 5500, "2024-01-02"),
    ("Cara", "Marketing", 5100, "2024-01-02"),
    ("Dan", "HR", 6000, "2024-01-02"),
    ("Alice", "Sales", 7000, "2024-01-03"),
    ("Ben", "Sales", 6400, "2024-01-03"),
]
schema = ["employee", "department", "sales", "sale_date"]
df = spark.createDataFrame(data, schema)
df.show()

Applying Window Functions

window_spec = Window.partitionBy("employee").orderBy("sale_date")

df.withColumn("rank", rank().over(window_spec)).show()
df.withColumn("dense_rank", dense_rank().over(window_spec)).show()
df.withColumn("percent_rank", percent_rank().over(window_spec)).show()
df.withColumn("lag", lag("sales", 1).over(window_spec)).show()
df.withColumn("lead", lead("sales", 1).over(window_spec)).show()
df.withColumn("ntile", ntile(2).over(window_spec)).show()

Watch Full Tutorial

Some of the contents in this website were created with assistance from ChatGPT and Gemini.

How to Write, Read & Append Data to Dedicated SQL Pool Table Using Notebook in Azure Synapse | Azure Synapse Analytics Tutorial

Overview

Azure Synapse Notebooks allow seamless interaction with Dedicated SQL Pools using Spark. You can create and manage SQL tables from PySpark DataFrames, perform bulk writes, and query them using Spark SQL or Synapse SQL on-demand.

Step-by-Step Examples

✅ 1. Create DataFrame

%%pyspark
data = [(1, "Aamir", 5000), (2, "Lisa", 6000)]
columns = ["id", "name", "salary"]
df = spark.createDataFrame(data, columns)

✅ 2. Write Data to Dedicated SQL Pool

Use the built-in synapsesql connector for Spark Pools:

df.write \
  .mode("overwrite") \
  .synapsesql("yourdedicatedpool.dbo.Employee")

✅ 3. Read Data from SQL Pool Table

df_read = spark.read \
  .synapsesql("yourdedicatedpool.dbo.Employee")

df_read.show()

✅ 4. Append Data to Existing Table

df_append = spark.createDataFrame([(3, "John", 7000)], ["id", "name", "salary"])
df_append.write \
  .mode("append") \
  .synapsesql("yourdedicatedpool.dbo.Employee")

Tips

  • Use overwrite to replace and append to add rows
  • Validate schema compatibility before appending
  • Check Synapse firewall rules for access

Use Cases

  • ETL workflows from Spark to Dedicated SQL Pool
  • Reading warehouse data for ML or transformation
  • Bulk appending logs or batch jobs

Watch the Video Tutorial

Credit: Content created with the help of ChatGPT and Gemini.