How to Perform JOINs Between Files in Azure Synapse Serverless SQL Pool

🔗 How to Perform JOINs Between Files in Azure Synapse Serverless SQL Pool

In Azure Synapse Analytics Serverless SQL Pool, you can join multiple CSV, Parquet, or JSON files from a Data Lake without loading them into traditional tables. This is a powerful feature for data exploration and building reports on top of raw data.

🔍 What is a JOIN?

A JOIN clause is used to combine rows from two or more sources based on a related column between them. Common types include:

INNER JOIN – returns only matching rows
LEFT JOIN – returns all rows from the left source and matched rows from the right
RIGHT JOIN – returns all rows from the right source and matched rows from the left
FULL JOIN – returns all matched and unmatched rows from both sides

📂 Joining External Files in Synapse

You can use OPENROWSET to read external files stored in Azure Data Lake and perform joins just like in regular T-SQL.

🧪 Example: INNER JOIN Between Two CSV Files

-- ==========================================
-- 📚 What is JOIN in SQL?
-- ==========================================
-- A JOIN clause is used to combine rows from two or more tables, 
-- based on a related column between them.

-- ==========================================
-- 🔥 Types of Joins Supported in Serverless SQL Pool (with Definitions)
-- ==========================================
-- ✅ INNER JOIN: Returns only the matching rows from both tables based on the join condition.

-- ✅ LEFT JOIN (LEFT OUTER JOIN): Returns all rows from the left table and matching rows from the right table; fills NULLs if no match.

-- ✅ RIGHT JOIN (RIGHT OUTER JOIN): Returns all rows from the right table and matching rows from the left table; fills NULLs if no match.

-- ✅ FULL OUTER JOIN: Returns all rows when there is a match in either the left or right table; fills NULLs where there is no match.

-- ❗ CROSS JOIN (supported but rarely used in files): Returns the Cartesian product of the two tables (every row from first table combined with every row from second table).

-- ==========================================
-- ✅ How to JOIN using External Tables
-- ==========================================
-- Assume we have two external tables:
-- dbo.Customer (customer_id, fname, lastname)
-- dbo.Orders (order_id, customer_id, order_date)

SELECT 
    c.customer_id,
    c.fname,
    c.lastname,
    o.order_id,
    o.order_date
FROM dbo.Customer c
INNER JOIN dbo.Orders o
ON c.customer_id = o.customer_id;

-- ==========================================
-- ✅ How to JOIN using OPENROWSET Directly
-- ==========================================
SELECT
    c.customer_id,
    c.fname,
    c.lastname,
    o.order_id,
     o.order_date
FROM
    OPENROWSET(
        BULK 'https://techbrotherssynapsestg.dfs.core.windows.net/synpasecontainer/input/customer.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW=true
    ) AS c
LEFT JOIN
    OPENROWSET(
        BULK 'https://techbrotherssynapsestg.dfs.core.windows.net/synpasecontainer/input/orders.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW=true
    ) AS o
ON
    c.customer_id = o.customer_id;

-- ==========================================
-- ⚡ Notes:
-- - For OPENROWSET, everything is read as VARCHAR, so CAST if needed.
-- - FIRSTROW = 2 means skipping the CSV header.
-- - Ensure customer_id formats match when joining.
-- - Use LEFT JOIN or FULL OUTER JOIN depending on your data requirement.

⚙️ Best Practices

Use filters to minimize scanned data
Select only necessary columns
Preview schemas before joining using TOP 10 or sp_describe_first_result_set

📺 Watch the Tutorial

Credit: This post was assisted by ChatGPT and Gemini.

Welcome To TechBrothersIT

Label