<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>David&#039;s Blog &#187; column</title>
	<atom:link href="http://www.davidmoore.info/tag/column/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.davidmoore.info</link>
	<description>Computer says no</description>
	<lastBuildDate>Thu, 26 Jan 2012 21:49:25 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Finding duplicated data across one or more columns in a database table</title>
		<link>http://www.davidmoore.info/2009/02/28/finding-duplicated-data-across-one-or-more-columns-in-a-database-table/</link>
		<comments>http://www.davidmoore.info/2009/02/28/finding-duplicated-data-across-one-or-more-columns-in-a-database-table/#comments</comments>
		<pubDate>Fri, 27 Feb 2009 22:59:09 +0000</pubDate>
		<dc:creator>David</dc:creator>
				<category><![CDATA[How To]]></category>
		<category><![CDATA[Software Development]]></category>
		<category><![CDATA[column]]></category>
		<category><![CDATA[columns]]></category>
		<category><![CDATA[duplicate]]></category>
		<category><![CDATA[duplicated]]></category>
		<category><![CDATA[duplicates]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[sql]]></category>
		<category><![CDATA[table]]></category>

		<guid isPermaLink="false">http://davidmoore.info/?p=101</guid>
		<description><![CDATA[<p>A few months ago I posted a little query about finding duplicate rows in a database table. I&#8217;m revisiting it because last night I helped Doogie with a similar query that came with some complications.</p>
<p>Let&#8217;s start with the original simple scenario of checking duplicates in a single column.</p>
<p>Some example data, a Users table:</p>
<p><span style="color:#777"> . . . &#8594; Read More: <a href="http://www.davidmoore.info/2009/02/28/finding-duplicated-data-across-one-or-more-columns-in-a-database-table/">Finding duplicated data across one or more columns in a database table</a></span></p>]]></description>
			<content:encoded><![CDATA[<p>A few months ago I posted a little query about <a title="Finding duplicate rows in a database table" href="/2008/10/17/finding-duplicate-rows-in-the-database/" target="_blank">finding duplicate rows in a database table</a>. I&#8217;m revisiting it because last night I helped Doogie with a similar query that came with some complications.</p>
<p>Let&#8217;s start with the original simple scenario of checking duplicates in a single column.</p>
<p>Some example data, a Users table:</p>
<pre>+----+----------------+
| Id | Email          |
+----+----------------+
|  1 | joe@bloggs.com |
|  2 | joe@bloggs.com |
|  3 | joe@bloggs.com |
|  4 | jane@doe.com   |
|  5 | jane@doe.com   |
|  6 | john@doe.com   |
+----+----------------+</pre>
<p>You can see that joe@bloggs.com and jane@doe.com have been duplicated. This could have been prevented by putting a unique index on the Email column.</p>
<p>To find which emails are duplicated in our table:</p>
<p><span id="more-101"></span></p>
<pre>SELECT Email, COUNT(Email) AS Duplicates
FROM `Users`
GROUP BY Email
HAVING ( Duplicates &gt; 1 )</pre>
<p>Results:</p>
<pre>+----------------+------------+
| Email          | Duplicates |
+----------------+------------+
| jane@doe.com   |          2 |
| joe@bloggs.com |          3 |
+----------------+------------+</pre>
<p>So, to help us manually correct our data, what are the Ids of the duplicates? In MySQL (4.1+), we can use GROUP_CONCAT (after casting the numerical Id to a character string):</p>
<pre>SELECT Email, COUNT(Email) AS Duplicates, GROUP_CONCAT( CAST(Id AS CHAR) ) AS Culprits
FROM `Users`
GROUP BY Email
HAVING ( Duplicates &gt; 1 )</pre>
<p>Our results:</p>
<pre>+----------------+------------+----------+
| Email          | Duplicates | Culprits |
+----------------+------------+----------+
| jane@doe.com   |          2 | 4,5      |
| joe@bloggs.com |          3 | 1,2,3    |
+----------------+------------+----------+</pre>
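<p>A small variation, sketched here on the assumption that you want the Ids in a predictable order: GROUP_CONCAT accepts its own ORDER BY and SEPARATOR clauses. Long lists can be cut short by the group_concat_max_len setting (1024 by default), so on big tables treat Culprits as a quick overview rather than a guaranteed complete list:</p>
<pre>SELECT Email, COUNT(*) AS Duplicates,
       -- Ids sorted ascending and joined with ", "
       GROUP_CONCAT( CAST(Id AS CHAR) ORDER BY Id SEPARATOR ', ' ) AS Culprits
FROM `Users`
GROUP BY Email
HAVING ( Duplicates &gt; 1 )</pre>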
<p>That&#8217;s quite handy, but what about just a list of the duplicates we can go through, instead of these rows of comma-separated Ids?</p>
<p>This fugly query will do that for us: (I&#8217;m sure I could do this a better way but I&#8217;m tired and this works!)</p>
<pre>SELECT Id, Email FROM `Users` WHERE Email IN
(SELECT Email FROM `Users` GROUP BY Email HAVING ( COUNT(Email) &gt; 1 ))
ORDER BY Email</pre>
<pre>+----+----------------+
| Id | Email          |
+----+----------------+
|  4 | jane@doe.com   |
|  5 | jane@doe.com   |
|  1 | joe@bloggs.com |
|  2 | joe@bloggs.com |
|  3 | joe@bloggs.com |
+----+----------------+</pre>
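<p>One possible tidier version, offered as a sketch and not the definitive way: join the table to a derived table of the duplicated emails. The aliases u and d are just names picked for this example, and it should return the same rows as the query above:</p>
<pre>SELECT u.Id, u.Email
FROM `Users` AS u
INNER JOIN (
    -- every email that appears more than once
    SELECT Email
    FROM `Users`
    GROUP BY Email
    HAVING COUNT(*) &gt; 1
) AS d ON d.Email = u.Email
ORDER BY u.Email, u.Id</pre>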
<p>If you ran the query in something like phpMyAdmin, you can now edit or delete the rows you want to get rid of.</p>
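<p>If you&#8217;d rather clean up in one statement, here is a minimal sketch using MySQL&#8217;s multi-table DELETE. It assumes you want to keep the row with the lowest Id for each email and throw away the rest, which may not suit your data, so back up the table first. The aliases dupe and keeper are made up for this example:</p>
<pre>-- delete every row that shares its Email with a row that has a lower Id
DELETE dupe
FROM `Users` AS dupe
INNER JOIN `Users` AS keeper
    ON keeper.Email = dupe.Email
   AND keeper.Id &lt; dupe.Id</pre>
<p>On the example data this would remove Ids 2, 3 and 5, leaving one row per email.</p>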
<p>And don&#8217;t forget, after the clean-up job, add that index to prevent duplicates re-appearing:</p>
<pre>ALTER TABLE `Users` ADD UNIQUE (`Email`)</pre>
<p>Now, the new scenario. What about duplicates across multiple columns? For example, our Locations table:</p>
<pre>+----+-------------+----------+--------+
| Id | CountryCode | AreaCode | Prefix |
+----+-------------+----------+--------+
|  1 | 64          | 9        | 489    |
|  2 | 64          | 9        | 489    |
|  3 | 64          | 9        | 489    |
|  4 | 64          | 3        | 942    |
|  5 | 64          | 3        | 942    |
|  6 | 64          | 9        | 536    |
+----+-------------+----------+--------+</pre>
<p>Here, we want to find rows that have the same values in all three columns. For example, you can see that 64-9-489 appears three times and 64-3-942 twice.</p>
<p>We can do this without much alteration to our original queries:</p>
<pre>SELECT <strong>CountryCode, AreaCode, Prefix</strong>, COUNT(<strong>*</strong>) AS Duplicates
FROM `Locations`
GROUP BY <strong>CountryCode, AreaCode, Prefix</strong>
HAVING ( Duplicates &gt; 1 )</pre>
<pre>+-------------+----------+--------+------------+
| CountryCode | AreaCode | Prefix | Duplicates |
+-------------+----------+--------+------------+
| 64          | 3        | 942    |          2 |
| 64          | 9        | 489    |          3 |
+-------------+----------+--------+------------+</pre>
<p>Then to get the Ids:</p>
<pre>SELECT CountryCode, AreaCode, Prefix, COUNT(*) AS Duplicates, <strong>GROUP_CONCAT( CAST(Id AS CHAR) ) AS Culprits</strong>
FROM `Locations`
GROUP BY CountryCode, AreaCode, Prefix
HAVING ( Duplicates &gt; 1 )</pre>
<pre>+-------------+----------+--------+------------+----------+
| CountryCode | AreaCode | Prefix | Duplicates | Culprits |
+-------------+----------+--------+------------+----------+
| 64          | 3        | 942    |          2 | 4,5      |
| 64          | 9        | 489    |          3 | 1,2,3    |
+-------------+----------+--------+------------+----------+</pre>
<p>I think you&#8217;re getting the point. Here&#8217;s how to get the rows for the culprits:</p>
<pre>SELECT Id, CountryCode, AreaCode, Prefix FROM `Locations` WHERE Id NOT IN
-- MIN(Id) keeps the subquery valid under ONLY_FULL_GROUP_BY; a single-row group only has one Id anyway
(SELECT MIN(Id) FROM `Locations` GROUP BY CountryCode, AreaCode, Prefix HAVING ( COUNT(*) = 1 ))
ORDER BY CountryCode, AreaCode, Prefix, Id</pre>
<pre>+----+-------------+----------+--------+
| Id | CountryCode | AreaCode | Prefix |
+----+-------------+----------+--------+
|  4 | 64          | 3        | 942    |
|  5 | 64          | 3        | 942    |
|  1 | 64          | 9        | 489    |
|  2 | 64          | 9        | 489    |
|  3 | 64          | 9        | 489    |
+----+-------------+----------+--------+</pre>
<p>Again, I&#8217;m sure there&#8217;s an easier way to do that, but hey, it works, and it should only be a one-off job anyway.</p>
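<p>One candidate for that easier way, sketched under assumptions: the IN approach from the single-column example can be stretched to several columns with a row constructor, which MySQL supports. Rows with a NULL in any of the three columns would never match through IN, so this assumes the columns are NOT NULL:</p>
<pre>SELECT Id, CountryCode, AreaCode, Prefix
FROM `Locations`
WHERE (CountryCode, AreaCode, Prefix) IN (
    -- every combination that appears more than once
    SELECT CountryCode, AreaCode, Prefix
    FROM `Locations`
    GROUP BY CountryCode, AreaCode, Prefix
    HAVING COUNT(*) &gt; 1
)
ORDER BY CountryCode, AreaCode, Prefix, Id</pre>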
<p>So how do we prevent duplicated data in our second scenario? Add a composite unique key on those columns:</p>
<pre>ALTER TABLE `Locations` ADD UNIQUE (`CountryCode`, `AreaCode`, `Prefix`)</pre>
]]></content:encoded>
			<wfw:commentRss>http://www.davidmoore.info/2009/02/28/finding-duplicated-data-across-one-or-more-columns-in-a-database-table/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

