Understanding IDENTITY Behavior in Azure Synapse SQL Pool | Non-Sequential IDs Explained | Azure Synapse Analytics Tutorial

Understanding IDENTITY Behavior in Azure Synapse SQL Pool | Non-Sequential IDs Explained

Understanding IDENTITY Behavior in Azure Synapse SQL Pool | Non-Sequential IDs Explained

📘 What is a Surrogate Key?

A surrogate key is an artificial, system-generated unique identifier used to represent each row in a table. It is not derived from actual business data and is commonly used in data warehouse dimension tables to simplify joins with fact tables.

🔹 Using IDENTITY in Synapse SQL Pools

In Azure Synapse Dedicated SQL Pools, you can use the IDENTITY property to automatically generate surrogate keys. However, due to the distributed nature of Synapse, IDENTITY values are not guaranteed to be sequential.

✅ Create Table with IDENTITY

CREATE TABLE dbo.DimCustomer
(
    CustomerSK INT IDENTITY(1,1) NOT NULL,
    CustomerName NVARCHAR(100),
    Country NVARCHAR(50)
)
WITH (
    DISTRIBUTION = HASH(CustomerName),
    CLUSTERED COLUMNSTORE INDEX
);

🚀 Insert Sample Data

INSERT INTO dbo.DimCustomer (CustomerName, Country) VALUES ('Alice', 'USA');
INSERT INTO dbo.DimCustomer (CustomerName, Country) VALUES ('Bob', 'Canada');
INSERT INTO dbo.DimCustomer (CustomerName, Country) VALUES ('Charlie', 'UK');

🛠️ Insert Custom IDENTITY Value

Sometimes you might need to insert a specific surrogate key (e.g., -1 for "Unknown"). You can do that by temporarily enabling IDENTITY_INSERT:

SET IDENTITY_INSERT dbo.DimCustomer ON;
INSERT INTO dbo.DimCustomer (CustomerSK, CustomerName, Country)
VALUES (-1, 'Unknown', 'N/A');
SET IDENTITY_INSERT dbo.DimCustomer OFF;

🔎 Query Metadata About IDENTITY Column

SELECT sm.name AS schema_name,
       tb.name AS table_name,
       co.name AS column_name,
       CASE WHEN ic.column_id IS NOT NULL THEN 1 ELSE 0 END AS is_identity
FROM sys.schemas sm
JOIN sys.tables tb ON sm.schema_id = tb.schema_id
JOIN sys.columns co ON tb.object_id = co.object_id
LEFT JOIN sys.identity_columns ic ON co.object_id = ic.object_id AND co.column_id = ic.column_id
WHERE sm.name = 'dbo' AND tb.name = 'DimCustomer';

📏 Check Seed & Increment Values

SELECT ic.seed_value, ic.increment_value
FROM sys.identity_columns ic
JOIN sys.columns c ON ic.object_id = c.object_id AND ic.column_id = c.column_id
JOIN sys.tables t ON t.object_id = c.object_id
WHERE t.name = 'DimCustomer';

⚠️ IDENTITY Behavior Caveats

  • IDENTITY values may skip or repeat due to parallelism.
  • Do not use IDENTITY on the distribution column.
  • CTAS and SELECT INTO do not preserve IDENTITY properties.
  • Use INSERT INTO ... SELECT when populating IDENTITY tables.

💡 Best Practices

  • Use IDENTITY columns for dimension surrogate keys.
  • Do not rely on IDENTITY values being sequential across rows.
  • For reporting or incremental loading, always use a MAX(ID) lookup strategy.

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.