Remove Full Duplicates by Using row_number() Function and Common Table Expression

This blog article is inspired by another very informative one which explains really well the meaning of the new row_number() function introduced since SQL2005. Since the original article did not give a full answer to how to delete multiple fully duplicated rows in a table, I am putting my code here for the test.

The good thing about row_number() function is that it can be based on partition of columns. New row number count starts for each partition of full duplicates and those rows with row number bigger than 1 can be easily spotted and removed. Here is the example that explains everything I am talking about here.

First of all, create a table in a test db

USE [test_db]
GO
CREATE TABLE [dbo].[Employee](
[EMPID] [int] NULL,
[FNAME] [varchar](50) NULL,
[LNAME] [varchar](50) NULL
) ON [PRIMARY]

Then let’s insert many duplicated rows:

INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (2021110, 'MICHAEL', 'POLAND')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (2021110, 'MICHAEL', 'POLAND')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (2021115, 'JIM', 'KENNEDY')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (2021115, 'JIM', 'KENNEDY')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (2021115, 'JIM', 'KENNEDY')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (2121000, 'JAMES', 'SMITH')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (2121000, 'JAMES', 'SMITH')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (2011111, 'ADAM', 'ACKERMAN')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (3015670, 'MARTHA', 'LEDERER')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (3015670, 'MARTHA', 'LEDERER')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (3015670, 'MARTHA', 'LEDERER')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (3015670, 'MARTHA', 'LEDERER')
GO
INSERT INTO EMPLOYEE (EMPID, FNAME, LNAME)
VALUES (1021710, 'MARIAH', 'MANDEZ')
GO

If you selete everything from this table now, the following should be there:

EMPID FNAME LNAME
1021710 MARIAH MANDEZ
2011111 ADAM ACHERMAN
2021110 MICHAEL POLAND
2021110 MICHAEL POLAND
2021115 JIM KENNEDY
2021115 JIM KENNEDY
2021115 JIM KENNEDY
2121000 JAMES SMITH
2121000 JAMES SMITH
3015670 MARTHA LEDERER
3015670 MARTHA LEDERER
3015670 MARTHA LEDERER
3015670 MARTHA LEDERER

And here comes the de-dup script with the help of common table expression:

WITH temp_employee AS
(SELECT ROW_NUMBER() OVER (PARTITION BY EMPID, FNAME, LNAME ORDER BY EMPID) AS rowid, * FROM EMPLOYEE)
DELETE temp_employee WHERE rowid > 1

Hope this article is illustrative enough.

Advertisements

About Nick Xu

Not a DBA anymore
This entry was posted in Common Table Expression, T-SQL Script and tagged , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s