How to Perform JOINs Between Files in Azure Synapse Serverless SQL Pool | Azure Synapse Analytics Tutorial

How to Perform JOINs Between Files in Azure Synapse Serverless SQL Pool

๐Ÿ”— How to Perform JOINs Between Files in Azure Synapse Serverless SQL Pool

In Azure Synapse Analytics Serverless SQL Pool, you can join multiple CSV, Parquet, or JSON files from a Data Lake without loading them into traditional tables. This is a powerful feature for data exploration and building reports on top of raw data.

๐Ÿ” What is a JOIN?

A JOIN clause is used to combine rows from two or more sources based on a related column between them. Common types include:

  • INNER JOIN – returns only matching rows
  • LEFT JOIN – returns all rows from the left source and matched rows from the right
  • RIGHT JOIN – returns all rows from the right source and matched rows from the left
  • FULL JOIN – returns all matched and unmatched rows from both sides

๐Ÿ“‚ Joining External Files in Synapse

You can use OPENROWSET to read external files stored in Azure Data Lake and perform joins just like in regular T-SQL.

๐Ÿงช Example: INNER JOIN Between Two CSV Files

-- ==========================================
-- ๐Ÿ“š What is JOIN in SQL?
-- ==========================================
-- A JOIN clause is used to combine rows from two or more tables, 
-- based on a related column between them.

-- ==========================================
-- ๐Ÿ”ฅ Types of Joins Supported in Serverless SQL Pool (with Definitions)
-- ==========================================
-- ✅ INNER JOIN: Returns only the matching rows from both tables based on the join condition.

-- ✅ LEFT JOIN (LEFT OUTER JOIN): Returns all rows from the left table and matching rows from the right table; fills NULLs if no match.

-- ✅ RIGHT JOIN (RIGHT OUTER JOIN): Returns all rows from the right table and matching rows from the left table; fills NULLs if no match.

-- ✅ FULL OUTER JOIN: Returns all rows when there is a match in either the left or right table; fills NULLs where there is no match.

-- ❗ CROSS JOIN (supported but rarely used in files): Returns the Cartesian product of the two tables (every row from first table combined with every row from second table).

-- ==========================================
-- ✅ How to JOIN using External Tables
-- ==========================================
-- Assume we have two external tables:
-- dbo.Customer (customer_id, fname, lastname)
-- dbo.Orders (order_id, customer_id, order_date)

SELECT 
    c.customer_id,
    c.fname,
    c.lastname,
    o.order_id,
    o.order_date
FROM dbo.Customer c
INNER JOIN dbo.Orders o
ON c.customer_id = o.customer_id;

-- ==========================================
-- ✅ How to JOIN using OPENROWSET Directly
-- ==========================================
SELECT
    c.customer_id,
    c.fname,
    c.lastname,
    o.order_id,
     o.order_date
FROM
    OPENROWSET(
        BULK 'https://techbrotherssynapsestg.dfs.core.windows.net/synpasecontainer/input/customer.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW=true
    ) AS c
LEFT JOIN
    OPENROWSET(
        BULK 'https://techbrotherssynapsestg.dfs.core.windows.net/synpasecontainer/input/orders.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW=true
    ) AS o
ON
    c.customer_id = o.customer_id;

-- ==========================================
-- ⚡ Notes:
-- - For OPENROWSET, everything is read as VARCHAR, so CAST if needed.
-- - FIRSTROW = 2 means skipping the CSV header.
-- - Ensure customer_id formats match when joining.
-- - Use LEFT JOIN or FULL OUTER JOIN depending on your data requirement.

⚙️ Best Practices

  • Use filters to minimize scanned data
  • Select only necessary columns
  • Preview schemas before joining using TOP 10 or sp_describe_first_result_set

๐Ÿ“บ Watch the Tutorial

Credit: This post was assisted by ChatGPT and Gemini.

No comments:

Post a Comment