How to Join Parquet File with CSV Using OPENROWSET in Serverless SQL Pool-Azure Synapse Tutorial

How to Join Parquet File with CSV Using OPENROWSET in Serverless SQL Pool - Azure Synapse

๐Ÿ”— How to Join Parquet File with CSV Using OPENROWSET in Serverless SQL Pool - Azure Synapse

Azure Synapse Analytics Serverless SQL Pool allows you to query data stored in various file formats such as CSV and Parquet directly from Azure Data Lake using the OPENROWSET function. This blog post shows how you can join data between a Parquet file and a CSV file without ingesting them into a database.

๐Ÿง  File Format Overview

  • CSV (Comma Separated Values): A flat file format used to store tabular data, often with headers and delimited values.
  • Parquet: A columnar storage file format optimized for analytical workloads — fast to read specific columns and space-efficient.

๐Ÿ“˜ JOINing CSV and Parquet Files using OPENROWSET

Both files are read directly from Azure Data Lake using their respective OPENROWSET definitions, then joined using T-SQL.

๐Ÿงช Example Query


SELECT 
    csv.CustomerID,
    csv.Name,
    prq.Country,
    prq.TotalSpent
FROM 
    OPENROWSET(
        BULK 'https://yourlakehouse.dfs.core.windows.net/sales/customer.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS csv
JOIN 
    OPENROWSET(
        BULK 'https://yourlakehouse.dfs.core.windows.net/sales/customer_spending.parquet',
        FORMAT = 'PARQUET'
    ) AS prq
ON csv.CustomerID = prq.CustomerID;

✅ Benefits of This Approach

  • No need to load files into tables
  • Low-cost, pay-per-query model
  • Supports ad hoc exploration of raw data

⚠️ Best Practices

  • Ensure that both files have compatible data types for join keys
  • Use SELECT only on required columns to reduce data scanned
  • Preview file schemas with TOP 10 or sp_describe_first_result_set before joining

๐Ÿ“บ Watch the Full Tutorial

Credit: This blog post was created with the help of ChatGPT and Gemini.

No comments:

Post a Comment