The OraTrainer: Deleting Duplicate Rows in Table

Deleting duplicate rows from tables is one of the common task oracle developers come across. The data might get duplicated because of missing primary/unique key on the table or batch file getting loaded multiple times. Here I have tried to summarize different ways of deleting this duplicated data. Please note that this is not an extensive list of all available methods but the ones I was able to figure out. This should serve as a handy reference while at work.

CREATE TABLE tbl_test(

     SER_NO NUMBER,

     FST_NM VARCHAR2(30),

     DEPTID NUMBER,

     CMNT   VARCHAR2(30));

INSERT INTO tbl_test VALUES(1, 'aaaaa', 2004, 'xxx');

INSERT INTO tbl_test VALUES(2, 'bbbbb', 2005, 'yyy');

INSERT INTO tbl_test VALUES(1, 'aaaaa', 2004, 'xxx');

INSERT INTO tbl_test VALUES(1, 'aaaaa', 2004, 'xxx');

INSERT INTO tbl_test VALUES(3, 'ccccc', 2005, 'zzz');

INSERT INTO tbl_test VALUES(2, 'bbbbb', 2005, 'yyy');

1. Using MIN(rowid) : The most common method of removing duplicate rows.

DELETE FROM tbl_test

      WHERE ROWID NOT IN (SELECT   MIN (ROWID)

                              FROM tbl_test

                          GROUP BY ser_no, fst_nm, deptid, cmnt);

Comment: This will take hours & hours if the table is large (records in million).

2. Using MIN(rowid) & Join: More or less the same as first one

DELETE FROM tbl_test t

      WHERE t.ROWID NOT IN (SELECT MIN (b.ROWID)

                              FROM tbl_test b

                             WHERE b.ser_no = t.ser_no

                               AND b.fst_nm = t.fst_nm

                               AND b.deptid = t.deptid

                               AND b.cmnt   = t.cmnt);

3. Using Subquery: This is an interesting one

DELETE FROM tbl_test

      WHERE ser_no IN (SELECT ser_no FROM tbl_test GROUP BY ser_no, fst_nm, deptid, cmnt HAVING COUNT (*) > 1)

        AND fst_nm IN (SELECT fst_nm FROM tbl_test GROUP BY ser_no, fst_nm, deptid, cmnt HAVING COUNT (*) > 1)

        AND deptid IN (SELECT deptid FROM tbl_test GROUP BY ser_no, fst_nm, deptid, cmnt HAVING COUNT (*) > 1)

        AND cmnt   IN (SELECT cmnt   FROM tbl_test GROUP BY ser_no, fst_nm, deptid, cmnt HAVING COUNT (*) > 1)

        AND ROWID NOT IN (SELECT   MIN (ROWID) 

                              FROM tbl_test

                          GROUP BY ser_no, fst_nm, deptid, cmnt

                            HAVING COUNT (*) > 1)

Comment: A complicated way of performing the same task. Not efficient.

4. Using Nested Subqueries:

DELETE FROM tbl_test a

      WHERE (a.ser_no, a.fst_nm, a.deptid, a.cmnt) IN (SELECT b.ser_no, b.fst_nm, b.deptid, b.cmnt

                                                     FROM tbl_test b

                                                    WHERE a.ser_no = b.ser_no

                                                      AND a.fst_nm = b.fst_nm

                                                      AND a.deptid = b.deptid

                                                      AND a.cmnt   = b.cmnt

                                                      AND a.ROWID  > b.ROWID);

Comment: Will work but for large tables, this is not efficient.

5. Using Analytic Fucntions:

DELETE FROM tbl_test

      WHERE ROWID IN (

               SELECT rid

                 FROM (SELECT ROWID rid,

                              ROW_NUMBER () OVER (PARTITION BY ser_no, fst_nm, deptid, cmnt ORDER BY ROWID) rn

                         FROM tbl_test)

                WHERE rn <> 1);

Comments: This is by far one of the best solutions if the table is really really large. Using the invaluable power of Analytics.

The OraTrainer

Wednesday, October 31, 2012

Deleting Duplicate Rows in Table

No comments:

Post a Comment

Total Pageviews

Linked In

Search This Blog

Wikipedia