Deleting duplicate rows from tables is one of the common task oracle developers come across. The data might get duplicated because of missing primary/unique key on the table or batch file getting loaded multiple times. Here I have tried to summarize different ways of deleting this duplicated data. Please note that this is not an extensive list of all available methods but the ones I was able to figure out. This should serve as a handy reference while at work.
| 
CREATETABLEtbl_test( 
     SER_NO NUMBER, 
     FST_NM VARCHAR2(30), 
     DEPTID NUMBER, 
     CMNT   VARCHAR2(30)); 
 
INSERTINTOtbl_test VALUES(1, 'aaaaa', 2004, 'xxx'); 
INSERTINTOtbl_test VALUES(2, 'bbbbb', 2005, 'yyy'); 
INSERTINTOtbl_test VALUES(1, 'aaaaa', 2004, 'xxx'); 
INSERTINTOtbl_test VALUES(1, 'aaaaa', 2004, 'xxx'); 
INSERTINTOtbl_test VALUES(3, 'ccccc', 2005, 'zzz'); 
INSERTINTOtbl_test VALUES(2, 'bbbbb', 2005, 'yyy'); | 
 
 
| 
 
 
1. Using MIN(rowid) : The most common method of removing duplicate rows. 
| 
DELETEFROMtbl_test 
      WHEREROWID NOTIN(SELECTMIN(ROWID) 
                              FROMtbl_test 
                          GROUPBYser_no, fst_nm, deptid, cmnt); |  
Comment: This will take hours & hours if the table is large (records in million). 
2. Using MIN(rowid) & Join: More or less the same as first one 
| 
DELETEFROMtbl_test t 
      WHEREt.ROWID NOTIN(SELECTMIN(b.ROWID) 
                              FROMtbl_test b 
                             WHEREb.ser_no = t.ser_no 
                               ANDb.fst_nm = t.fst_nm 
                               ANDb.deptid = t.deptid 
                               ANDb.cmnt   = t.cmnt); |  
3. Using Subquery: This is an interesting one 
| 
DELETEFROMtbl_test 
      WHEREser_no IN(SELECTser_no FROMtbl_test GROUPBYser_no, fst_nm, deptid, cmnt HAVINGCOUNT(*) > 1) 
        ANDfst_nm IN(SELECTfst_nm FROMtbl_test GROUPBYser_no, fst_nm, deptid, cmnt HAVINGCOUNT(*) > 1) 
        ANDdeptid IN(SELECTdeptid FROMtbl_test GROUPBYser_no, fst_nm, deptid, cmnt HAVINGCOUNT(*) > 1) 
        ANDcmnt   IN(SELECTcmnt   FROMtbl_test GROUPBYser_no, fst_nm, deptid, cmnt HAVINGCOUNT(*) > 1) 
        ANDROWID NOTIN(SELECTMIN(ROWID)  
                              FROMtbl_test 
                          GROUPBYser_no, fst_nm, deptid, cmnt 
                            HAVINGCOUNT(*) > 1) |  
Comment: A complicated way of performing the same task. Not efficient. 
4. Using Nested Subqueries: 
| 
DELETEFROMtbl_test a 
      WHERE(a.ser_no, a.fst_nm, a.deptid, a.cmnt) IN(SELECTb.ser_no, b.fst_nm, b.deptid, b.cmnt 
                                                     FROMtbl_test b 
                                                    WHEREa.ser_no = b.ser_no 
                                                      ANDa.fst_nm = b.fst_nm 
                                                      ANDa.deptid = b.deptid 
                                                      ANDa.cmnt   = b.cmnt 
                                                      ANDa.ROWID  > b.ROWID); |  
Comment: Will work but for large tables, this is not efficient. 
5. Using Analytic Fucntions: 
| 
DELETEFROMtbl_test 
      WHEREROWID IN( 
               SELECTrid 
                 FROM(SELECTROWID rid, 
                              ROW_NUMBER () OVER (PARTITION BYser_no, fst_nm, deptid, cmnt ORDERBYROWID) rn 
                         FROMtbl_test) 
                WHERErn <> 1); |  
Comments: This is by far one of the best solutions if the table is really really large. Using the invaluable power of Analytics. | 
 
 
No comments:
Post a Comment