Understanding IDENTITY Behavior in Azure Synapse SQL Pool | Non-Sequential IDs Explained
📘 What is a Surrogate Key?
A surrogate key is an artificial, system-generated unique identifier used to represent each row in a table. It is not derived from actual business data and is commonly used in data warehouse dimension tables to simplify joins with fact tables.
🔹 Using IDENTITY in Synapse SQL Pools
In Azure Synapse Dedicated SQL Pools, you can use the IDENTITY
property to automatically generate surrogate keys. However, due to the distributed nature of Synapse, IDENTITY values are not guaranteed to be sequential.
✅ Create Table with IDENTITY
CREATE TABLE dbo.DimCustomer
(
CustomerSK INT IDENTITY(1,1) NOT NULL,
CustomerName NVARCHAR(100),
Country NVARCHAR(50)
)
WITH (
DISTRIBUTION = HASH(CustomerName),
CLUSTERED COLUMNSTORE INDEX
);
🚀 Insert Sample Data
INSERT INTO dbo.DimCustomer (CustomerName, Country) VALUES ('Alice', 'USA');
INSERT INTO dbo.DimCustomer (CustomerName, Country) VALUES ('Bob', 'Canada');
INSERT INTO dbo.DimCustomer (CustomerName, Country) VALUES ('Charlie', 'UK');
🛠️ Insert Custom IDENTITY Value
Sometimes you might need to insert a specific surrogate key (e.g., -1
for "Unknown"). You can do that by temporarily enabling IDENTITY_INSERT
:
SET IDENTITY_INSERT dbo.DimCustomer ON;
INSERT INTO dbo.DimCustomer (CustomerSK, CustomerName, Country)
VALUES (-1, 'Unknown', 'N/A');
SET IDENTITY_INSERT dbo.DimCustomer OFF;
🔎 Query Metadata About IDENTITY Column
SELECT sm.name AS schema_name,
tb.name AS table_name,
co.name AS column_name,
CASE WHEN ic.column_id IS NOT NULL THEN 1 ELSE 0 END AS is_identity
FROM sys.schemas sm
JOIN sys.tables tb ON sm.schema_id = tb.schema_id
JOIN sys.columns co ON tb.object_id = co.object_id
LEFT JOIN sys.identity_columns ic ON co.object_id = ic.object_id AND co.column_id = ic.column_id
WHERE sm.name = 'dbo' AND tb.name = 'DimCustomer';
📏 Check Seed & Increment Values
SELECT ic.seed_value, ic.increment_value
FROM sys.identity_columns ic
JOIN sys.columns c ON ic.object_id = c.object_id AND ic.column_id = c.column_id
JOIN sys.tables t ON t.object_id = c.object_id
WHERE t.name = 'DimCustomer';
⚠️ IDENTITY Behavior Caveats
- IDENTITY values may skip or repeat due to parallelism.
- Do not use IDENTITY on the distribution column.
CTAS
andSELECT INTO
do not preserve IDENTITY properties.- Use
INSERT INTO ... SELECT
when populating IDENTITY tables.
💡 Best Practices
- Use IDENTITY columns for dimension surrogate keys.
- Do not rely on IDENTITY values being sequential across rows.
- For reporting or incremental loading, always use a
MAX(ID)
lookup strategy.
📺 Watch the Video Tutorial
📚 Credit: Content created with the help of ChatGPT and Gemini.
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.