๐ How to Join Parquet File with CSV Using OPENROWSET in Serverless SQL Pool - Azure Synapse
Azure Synapse Analytics Serverless SQL Pool allows you to query data stored in various file formats such as CSV and Parquet directly from Azure Data Lake using the OPENROWSET
function. This blog post shows how you can join data between a Parquet file and a CSV file without ingesting them into a database.
๐ง File Format Overview
- CSV (Comma Separated Values): A flat file format used to store tabular data, often with headers and delimited values.
- Parquet: A columnar storage file format optimized for analytical workloads — fast to read specific columns and space-efficient.
๐ JOINing CSV and Parquet Files using OPENROWSET
Both files are read directly from Azure Data Lake using their respective OPENROWSET
definitions, then joined using T-SQL.
๐งช Example Query
SELECT
csv.CustomerID,
csv.Name,
prq.Country,
prq.TotalSpent
FROM
OPENROWSET(
BULK 'https://yourlakehouse.dfs.core.windows.net/sales/customer.csv',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
HEADER_ROW = TRUE
) AS csv
JOIN
OPENROWSET(
BULK 'https://yourlakehouse.dfs.core.windows.net/sales/customer_spending.parquet',
FORMAT = 'PARQUET'
) AS prq
ON csv.CustomerID = prq.CustomerID;
✅ Benefits of This Approach
- No need to load files into tables
- Low-cost, pay-per-query model
- Supports ad hoc exploration of raw data
⚠️ Best Practices
- Ensure that both files have compatible data types for join keys
- Use
SELECT
only on required columns to reduce data scanned - Preview file schemas with
TOP 10
orsp_describe_first_result_set
before joining
๐บ Watch the Full Tutorial
Credit: This blog post was created with the help of ChatGPT and Gemini.
No comments:
Post a Comment