๐ How to Perform JOINs Between Files in Azure Synapse Serverless SQL Pool
In Azure Synapse Analytics Serverless SQL Pool, you can join multiple CSV, Parquet, or JSON files from a Data Lake without loading them into traditional tables. This is a powerful feature for data exploration and building reports on top of raw data.
๐ What is a JOIN?
A JOIN
clause is used to combine rows from two or more sources based on a related column between them. Common types include:
- INNER JOIN – returns only matching rows
- LEFT JOIN – returns all rows from the left source and matched rows from the right
- RIGHT JOIN – returns all rows from the right source and matched rows from the left
- FULL JOIN – returns all matched and unmatched rows from both sides
๐ Joining External Files in Synapse
You can use OPENROWSET
to read external files stored in Azure Data Lake and perform joins just like in regular T-SQL.
๐งช Example: INNER JOIN Between Two CSV Files
-- ==========================================
-- ๐ What is JOIN in SQL?
-- ==========================================
-- A JOIN clause is used to combine rows from two or more tables,
-- based on a related column between them.
-- ==========================================
-- ๐ฅ Types of Joins Supported in Serverless SQL Pool (with Definitions)
-- ==========================================
-- ✅ INNER JOIN: Returns only the matching rows from both tables based on the join condition.
-- ✅ LEFT JOIN (LEFT OUTER JOIN): Returns all rows from the left table and matching rows from the right table; fills NULLs if no match.
-- ✅ RIGHT JOIN (RIGHT OUTER JOIN): Returns all rows from the right table and matching rows from the left table; fills NULLs if no match.
-- ✅ FULL OUTER JOIN: Returns all rows when there is a match in either the left or right table; fills NULLs where there is no match.
-- ❗ CROSS JOIN (supported but rarely used in files): Returns the Cartesian product of the two tables (every row from first table combined with every row from second table).
-- ==========================================
-- ✅ How to JOIN using External Tables
-- ==========================================
-- Assume we have two external tables:
-- dbo.Customer (customer_id, fname, lastname)
-- dbo.Orders (order_id, customer_id, order_date)
SELECT
c.customer_id,
c.fname,
c.lastname,
o.order_id,
o.order_date
FROM dbo.Customer c
INNER JOIN dbo.Orders o
ON c.customer_id = o.customer_id;
-- ==========================================
-- ✅ How to JOIN using OPENROWSET Directly
-- ==========================================
SELECT
c.customer_id,
c.fname,
c.lastname,
o.order_id,
o.order_date
FROM
OPENROWSET(
BULK 'https://techbrotherssynapsestg.dfs.core.windows.net/synpasecontainer/input/customer.csv',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
HEADER_ROW=true
) AS c
LEFT JOIN
OPENROWSET(
BULK 'https://techbrotherssynapsestg.dfs.core.windows.net/synpasecontainer/input/orders.csv',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
HEADER_ROW=true
) AS o
ON
c.customer_id = o.customer_id;
-- ==========================================
-- ⚡ Notes:
-- - For OPENROWSET, everything is read as VARCHAR, so CAST if needed.
-- - FIRSTROW = 2 means skipping the CSV header.
-- - Ensure customer_id formats match when joining.
-- - Use LEFT JOIN or FULL OUTER JOIN depending on your data requirement.
⚙️ Best Practices
- Use filters to minimize scanned data
- Select only necessary columns
- Preview schemas before joining using
TOP 10
orsp_describe_first_result_set
๐บ Watch the Tutorial
Credit: This post was assisted by ChatGPT and Gemini.
No comments:
Post a Comment