Deleting duplicate rows from tables is one of the common task oracle developers come across. The data might get duplicated because of missing primary/unique key on the table or batch file getting loaded multiple times. Here I have tried to summarize different ways of deleting this duplicated data. Please note that this is not an extensive list of all available methods but the ones I was able to figure out. This should serve as a handy reference while at work.
CREATE TABLE tbl_test(
SER_NO NUMBER,
FST_NM VARCHAR2(30),
DEPTID NUMBER,
CMNT VARCHAR2(30));
INSERT INTO tbl_test VALUES (1, 'aaaaa' , 2004, 'xxx' );
INSERT INTO tbl_test VALUES (2, 'bbbbb' , 2005, 'yyy' );
INSERT INTO tbl_test VALUES (1, 'aaaaa' , 2004, 'xxx' );
INSERT INTO tbl_test VALUES (1, 'aaaaa' , 2004, 'xxx' );
INSERT INTO tbl_test VALUES (3, 'ccccc' , 2005, 'zzz' );
INSERT INTO tbl_test VALUES (2, 'bbbbb' , 2005, 'yyy' );
|
1. Using MIN(rowid) : The most common method of removing duplicate rows.
DELETE FROM tbl_test
WHERE ROWID NOT IN ( SELECT MIN (ROWID)
FROM tbl_test
GROUP BY ser_no, fst_nm, deptid, cmnt);
|
Comment: This will take hours & hours if the table is large (records in million).
2. Using MIN(rowid) & Join: More or less the same as first one
DELETE FROM tbl_test t
WHERE t.ROWID NOT IN ( SELECT MIN (b.ROWID)
FROM tbl_test b
WHERE b.ser_no = t.ser_no
AND b.fst_nm = t.fst_nm
AND b.deptid = t.deptid
AND b.cmnt = t.cmnt);
|
3. Using Subquery: This is an interesting one
DELETE FROM tbl_test
WHERE ser_no IN ( SELECT ser_no FROM tbl_test GROUP BY ser_no, fst_nm, deptid, cmnt HAVING COUNT (*) > 1)
AND fst_nm IN ( SELECT fst_nm FROM tbl_test GROUP BY ser_no, fst_nm, deptid, cmnt HAVING COUNT (*) > 1)
AND deptid IN ( SELECT deptid FROM tbl_test GROUP BY ser_no, fst_nm, deptid, cmnt HAVING COUNT (*) > 1)
AND cmnt IN ( SELECT cmnt FROM tbl_test GROUP BY ser_no, fst_nm, deptid, cmnt HAVING COUNT (*) > 1)
AND ROWID NOT IN ( SELECT MIN (ROWID)
FROM tbl_test
GROUP BY ser_no, fst_nm, deptid, cmnt
HAVING COUNT (*) > 1)
|
Comment: A complicated way of performing the same task. Not efficient.
4. Using Nested Subqueries:
DELETE FROM tbl_test a
WHERE (a.ser_no, a.fst_nm, a.deptid, a.cmnt) IN ( SELECT b.ser_no, b.fst_nm, b.deptid, b.cmnt
FROM tbl_test b
WHERE a.ser_no = b.ser_no
AND a.fst_nm = b.fst_nm
AND a.deptid = b.deptid
AND a.cmnt = b.cmnt
AND a.ROWID > b.ROWID);
|
Comment: Will work but for large tables, this is not efficient.
5. Using Analytic Fucntions:
DELETE FROM tbl_test
WHERE ROWID IN (
SELECT rid
FROM ( SELECT ROWID rid,
ROW_NUMBER () OVER (PARTITION BY ser_no, fst_nm, deptid, cmnt ORDER BY ROWID) rn
FROM tbl_test)
WHERE rn <> 1);
|
Comments: This is by far one of the best solutions if the table is really really large. Using the invaluable power of Analytics.
|
No comments:
Post a Comment