A few months ago I posted a little query about finding duplicate rows in a database table. I’m revisiting this because I helped out Doogie with a similar query last night but with some complications.
Let’s start with the original simple scenario of checking duplicates in a single column.
Some example data, a Users table:
+----+----------------+ | Id | Email | +----+----------------+ | 1 | joe@bloggs.com | | 2 | joe@bloggs.com | | 3 | joe@bloggs.com | | 4 | jane@doe.com | | 5 | jane@doe.com | | 6 | john@doe.com | +----+----------------+
You can see that joe@bloggs.com and jane@doe.com have been duplicated. This could have been prevented by putting a unique index on the Email column.
So to find what emails have duplicates in our table:
SELECT Email, COUNT(Email) AS Duplicates FROM `Users` GROUP BY Email HAVING ( Duplicates > 1 )
Results:
+----------------+------------+ | Email | Duplicates | +----------------+------------+ | jane@doe.com | 2 | | joe@bloggs.com | 3 | +----------------+------------+
So, to help us manually correct our data, what are the Ids of the duplicates? In MySQL (4.1+), we can use GROUP_CONCAT (after casting the numerical Id to a character string):
SELECT Email, COUNT(Email) AS Duplicates, GROUP_CONCAT( CAST(Id AS CHAR) ) AS Culprits FROM `Users` GROUP BY Email HAVING ( Duplicates > 1 )
Our results:
+----------------+------------+----------+ | Email | Duplicates | Culprits | +----------------+------------+----------+ | jane@doe.com | 2 | 4,5 | | joe@bloggs.com | 3 | 1,2,3 | +----------------+------------+----------+
That’s quite handy, but what about just a list of the duplicates we can go through, instead of these rows of comma-separated Ids?
This fugly query will do that for us: (I’m sure I could do this a better way but I’m tired and this works!)
SELECT Id, Email FROM `Users` WHERE Email IN (SELECT Email FROM `Users` GROUP BY Email HAVING ( COUNT(Email) > 1 )) ORDER BY Email
+----+----------------+ | Id | Email | +----+----------------+ | 4 | jane@doe.com | | 5 | jane@doe.com | | 1 | joe@bloggs.com | | 2 | joe@bloggs.com | | 3 | joe@bloggs.com | +----+----------------+
Now you can edit / delete the rows you want to get rid of if you ran the query in something like phpMyAdmin.
And don’t forget, after the clean-up job, add that index to prevent duplicates re-appearing:
ALTER TABLE `Users` ADD UNIQUE (`Email`)
Now, the new scenario. What about duplicates across multiple columns? For example, our Locations table:
+----+-------------+----------+--------+ | Id | CountryCode | AreaCode | Prefix | +----+-------------+----------+--------+ | 1 | 64 | 9 | 489 | | 2 | 64 | 9 | 489 | | 3 | 64 | 9 | 489 | | 4 | 64 | 3 | 942 | | 5 | 64 | 3 | 942 | | 6 | 64 | 9 | 536 | +----+-------------+----------+--------+
Here, we want to find duplicates that have the same values in the 3 columns. For example, you can see that 64-9-489 is duplicated three times, and 64-3-942 two times.
We can do this without much alteration to our original queries:
SELECT CountryCode, AreaCode, Prefix, COUNT(*) AS Duplicates FROM `Locations` GROUP BY CountryCode, AreaCode, Prefix HAVING ( Duplicates > 1 )
+-------------+----------+--------+------------+ | CountryCode | AreaCode | Prefix | Duplicates | +-------------+----------+--------+------------+ | 64 | 3 | 942 | 2 | | 64 | 9 | 489 | 3 | +-------------+----------+--------+------------+
Then to get the Ids:
SELECT CountryCode, AreaCode, Prefix, COUNT(*) AS Duplicates, GROUP_CONCAT( CAST(Id AS CHAR) ) AS Culprits FROM `Locations` GROUP BY CountryCode, AreaCode, Prefix HAVING ( Duplicates > 1 )
+-------------+----------+--------+------------+----------+ | CountryCode | AreaCode | Prefix | Duplicates | Culprits | +-------------+----------+--------+------------+----------+ | 64 | 3 | 942 | 2 | 4,5 | | 64 | 9 | 489 | 3 | 1,2,3 | +-------------+----------+--------+------------+----------+
I think you’re getting the point. Here’s to get the rows for the culprits:
SELECT Id, CountryCode, AreaCode, Prefix FROM `Locations` WHERE Id NOT IN (SELECT Id FROM `Locations` GROUP BY CountryCode, AreaCode, Prefix HAVING ( COUNT(CountryCode) = 1 )) ORDER BY CountryCode, AreaCode, Prefix
+----+-------------+----------+--------+ | Id | CountryCode | AreaCode | Prefix | +----+-------------+----------+--------+ | 4 | 64 | 3 | 942 | | 5 | 64 | 3 | 942 | | 1 | 64 | 9 | 489 | | 2 | 64 | 9 | 489 | | 3 | 64 | 9 | 489 | +----+-------------+----------+--------+
Again I’m sure there’d be an easier way to do that, but hey, it works, and for something that should be a one-off.
So how to prevent the duplicated data in our second scenario? Add a composite unique key on those columns:
ALTER TABLE `locations` ADD UNIQUE (`CountryCode`, `AreaCode`, `Prefix`)