I do not think I’ve ever published in this blog the simple way of removing duplicates from the table. I’ve faced this problem multiple times in different environments, first coming up with a solution during my consulting engagement with the City of Chicago and later virtually everywhere I worked. I believe now is the time to finally document this solution.
First, let’s define the problem, and let’s specify more precisely what “duplicates” mean in this context. First, let’s assume that the table with duplicates has a primary key. Although the SQL standard does not require to have a PK constraint on each table, it’s considered a good practice to have one, even if it is just a sequential number assigned to the record on insertion. If all other attributes, except the PK are the same in two records, let’s consider them duplicates.
Most of the time duplicates are really not intended to happen and they occur due to some programmatic error in the insertion process. After the error is discovered and corrected, we usually want to remove the duplicates, leaving just one copy of each set of attributes. Yes, we can select distinct values into a temporal table, truncate and re-insert, but this process would initiate a long exclusive lock (possibly prohibitively long) on the table, which is not always possible in production environment. Besides, if there may be memory limitations, id the table is big enough.
To remove duplicates fast, I’ve developed a very efficient algorithm, which can be schematically represented as follows:
v_prev_record :={0,' ',' ',' ' ,' ');
FOR rec in (SELECT t.pk, t.attr1, t.attr2, t.attr3 FROM table_name t
ORDER BY 2,3,4,1 LOOP
IF v_prev_record.attr1=rec.attr1 AND
v_prev_record.attr2=rec.attr2 AND
v_prev_record.attr3=rec.attr3
THEN
DELETE table_name WHERE pk=rec.pk;
END IF;
v_prev_record.attr1:=rec.attr1;
v_prev_record.attr2:=rec.attr2;
v_prev_record.attr3:=rec.attr3;
END LOOP;
This way only “the first” of identical records will remain in the table.
When we are just removing dups, this might look like somewhat excessive amount of work, but often the table with the dups also serves as a reference in the foreign key constraint, and then we need to modify a dependent table as well, and this might REALLY take long time! Now check out, how a very small modification to the code above can make this work easy!
v_prev_record :={0,' ',' ',' ' ,' ');
v_pk_keep :=0;
FOR rec in (SELECT t.pk, t.attr1, t.attr2, t.attr3 FROM table_name t
ORDER BY 2,3,4,1 LOOP
IF v_prev_record.attr1=rec.attr1 AND
v_prev_record.attr2=rec.attr2 AND
v_prev_record.attr3=rec.attr3
THEN
UPDATE dependent_table SET fk_attr:=v_pk_to_keep
WHERE fk_attr=rec.pk; --we are about to delete this dup
DELETE table_name WHERE pk=rec.pk;
ELSE
v_pk_to_keep:=rec.pk;--this is a "new" value we keep
END IF;
v_prev_record.attr1:=rec.attr1;
v_prev_record.attr2:=rec.attr2;
v_prev_record.attr3:=rec.attr3;
END LOOP;
Isn’t it just beautiful?!