How to Find & Delete Duplicate Records on All Tables in SQL Server Database - SQL Server / TSQL Tutorial

Scenario: Download Script

You are working as SQL Server developer / TSQL developer for Credit Card Company. You are working on one of the big project that requires you to find the duplicate records in all the tables in SQL Server Database and delete them. You started with writing query and soon realize that there are hundreds of tables and it will take you forever to write query for each of the table. You want to find the script that can find duplicate records for you and then log the information and delete them. If that is the case you are at right place! 


Solution:

As we need to find the duplicate records in all the tables, we need loop. In our case we will be using Cursor. We will loop through list of schema and table and then prepare our find duplicate query and run it and log the information in temp table #Result.

By doing that we will be able to provide details which records we have deleted. 

In second part, we will prepare our query to delete those duplicate records. We are using Common Table Expressions to find duplicate records. 

The log table will return us DataBase Name, Schema Name, Table Name, Column List, Duplicate Value, Total Row Count for Table and Duplicate Row Count.

Few things to consider before you run this script in production
1) As this script is going to run on all the tables in a database, this can be expensive process. Make sure you test in lower environments such as QA,UAT before run in Production.
2) As delete operation is involved, you want to make sure there is very less activity on database. Find a time such as weekend or off hours for this operation.


USE yourdbname 

go 

DECLARE @SchemaName VARCHAR(100) 
DECLARE @TableName VARCHAR(100) 
DECLARE @DatabaseName VARCHAR(100) 

--Create Temp Table to Save Results  
IF Object_id('tempdb..#Results') IS NOT NULL 
  DROP TABLE #results 

CREATE TABLE #results 
  ( 
     databasename       VARCHAR(100), 
     schemaname         VARCHAR(100), 
     tablename          VARCHAR(100), 
     columnlist         VARCHAR(max), 
     duplicatevalue     VARCHAR(max), 
     totaltablerowcount INT, 
     duplicaterowcnt    INT 
  ) 

DECLARE cur CURSOR FOR 
  SELECT table_catalog, 
         table_schema, 
         table_name 
  FROM   information_schema.tables 
  WHERE  table_type = 'BASE TABLE' 

OPEN cur 

FETCH next FROM cur INTO @DatabaseName, @SchemaName, @TableName 

WHILE @@FETCH_STATUS = 0 
  BEGIN 
      --Get List of the Columns from Table without Identity Column  
      DECLARE @ColumnList NVARCHAR(max)=NULL 

      SELECT @ColumnList = COALESCE(@ColumnList + '],[', '') + c.NAME 
      FROM   sys.columns c 
             INNER JOIN sys.tables t 
                     ON c.object_id = t.object_id 
      WHERE  Object_name(c.object_id) = @TableName 
             AND Schema_name(schema_id) = @SchemaName 
             AND is_identity = 0 

      SET @ColumnList='[' + @ColumnList + ']' 

      --Print @ColumnList  
      DECLARE @ColumnListConcat VARCHAR(max)=NULL 

      SET @ColumnListConcat=Replace(Replace(Replace(Replace(@ColumnList, '[', 
                                                    'ISNULL(Cast(['), ']', 
                                            '] AS VARCHAR(MAX)),''NULL'')'), 
                                            ',ISNULL', '+ISNULL'), '+', 
                            '+'',''+') 

      --Create Dynamic Query for Finding duplicate Records  
      DECLARE @DuplicateSQL NVARCHAR(max)=NULL 
      DECLARE @DeleteDuplicateSQL NVARCHAR(max)=NULL 

      SET @DuplicateSQL= ';With CTE as   (select  ''' 
                         + @DatabaseName + ''' AS DBName,' + '''' 
                         + @SchemaName + ''' AS SchemaName,' + '''' 
                         + @TableName + ''' AS TableName,' + '''' 
                         + @ColumnList + ''' AS ColumnList,' 
                         + @ColumnListConcat 
                         + ' AS ColumnConcat,    (Select count(*) from [' 
                         + @SchemaName + '].[' + @TableName 
                         + '] With (Nolock))             AS TotalTableRowCount    ,RN = row_number()      over(PARTITION BY ' + @ColumnList 
                         + '  order by ' + @ColumnList 
                         + ')             from [' + @SchemaName + '].[' 
                         + @TableName 
                         + ']  ) Select * From CTE WHERE RN>1' 
      -- PRINT @DuplicateSQL  
      --Prepate Delete Duplicate Records Query 
      SET @DeleteDuplicateSQL= 
      ';With CTE as   (select  RN = row_number()             over(PARTITION BY ' 
      + @ColumnList + '  order by ' + @ColumnList 
      + ')             from [' + @SchemaName + '].[' 
      + @TableName 
      + ']  ) Delete from  CTE WHERE RN>1' 

      --Log Duplicate Record information in Temp Table             
      INSERT INTO #results 
      EXEC(@DuplicateSQL) 

      --Run Dynamic SQL to Delete Duplicate 
      EXEC(@DeleteDuplicateSQL) 

      FETCH next FROM cur INTO @DatabaseName, @SchemaName, @TableName 
  END 

CLOSE cur 

DEALLOCATE cur 

SELECT * 
FROM   #results 
--drop table #Results  


I have execute above script on my database after inserting few duplicate records in my table and here is my output.
How to find and Delete duplicate records in all the Tables in SQL server Database - SQL Server / T-SQL Tutorial


Check out our other posts on Data Analysis / Data Validation / Data Cleansing
 

1 comment:

  1. Do you have video for this How to Find & Delete Duplicate Records on All Tables in SQL Server Database - SQL Server / TSQL Tutorial

    ReplyDelete