
Effective Progressive Algorithm for Duplicate Detection on Large Dataset

Authors (4): R. Ramesh Kannan, D. R. Abarna, G. Aswini, P. Hemavathy

Progressive duplicate detection aims to find duplicate records in a dataset in as short a time as possible. It does not alter or degrade the data itself, and it can be used as part of a data cleaning process. Its main advantages are efficiency and speed, and it scales to very large datasets. The system alerts the user about potential duplicates when the user tries to create new records or update existing ones. To maintain data quality, a duplicate detection job can be scheduled to check all records that match certain criteria, and the data can then be cleaned by deleting, deactivating, or merging the reported duplicates. We propose two novel progressive duplicate detection algorithms: the progressive sorted neighborhood method (PSNM), which performs best on small and almost clean datasets, and progressive blocking (PB), which performs best on large and very dirty datasets. The two algorithms expose different strengths, outperform current approaches, and improve the efficiency of duplicate detection even on very large datasets. We exhaustively evaluate our own and previous algorithms on several real-world datasets.
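The abstract names PSNM without describing its mechanics. As a rough illustration only, the general idea of a progressive sorted neighborhood search can be sketched as follows: sort the records by a key, then compare records at rank distance 1 first, then distance 2, and so on, so that the most likely duplicates are reported early. This is a minimal sketch under stated assumptions, not the authors' implementation; the function names `progressive_snm` and `similar`, the 0.85 threshold, the lowercase sorting key, and the sample records are all hypothetical choices made for the example.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterator, List, Tuple


def progressive_snm(
    records: List[str],
    sort_key: Callable[[str], str],
    is_duplicate: Callable[[str, str], bool],
    max_window: int,
) -> Iterator[Tuple[int, int]]:
    """Yield index pairs judged to be duplicates, nearest sort neighbors first.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Positions of the records in sorted order (the records are not moved).
    order = sorted(range(len(records)), key=lambda i: sort_key(records[i]))
    # Grow the rank distance step by step: 1, 2, ..., max_window - 1.
    # Close neighbors in the sort order are compared before distant ones.
    for distance in range(1, max_window):
        for pos in range(len(order) - distance):
            i, j = order[pos], order[pos + distance]
            if is_duplicate(records[i], records[j]):
                yield (min(i, j), max(i, j))


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Assumed similarity test: case-insensitive ratio from difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


# Made-up sample data, used only to exercise the sketch.
records = ["John Smith", "Mary Jones", "Jon Smith", "Mary Jones", "Peter Brown"]
pairs = list(progressive_snm(records, str.lower, similar, max_window=3))
print(pairs)
```

Because the function is a generator, a caller can stop consuming results at any point (for example, when a time budget runs out) and still keep the duplicates found so far, which is the sense in which such a search is progressive.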
Keywords: Data cleaning, Duplicate detection, Entity resolution, Progressiveness
Publication Details
  Published in : Volume 2 | Issue 2 | March-April 2016
  Date of Publication : 2016-04-30
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Page(s) : 105-110
Manuscript Number : IJSRST162252
Publisher : Technoscience Academy
PRINT ISSN : 2395-6011
ONLINE ISSN : 2395-602X
Cite This Article :
R. Ramesh Kannan, D. R. Abarna, G. Aswini, P. Hemavathy, "Effective Progressive Algorithm for Duplicate Detection on Large Dataset", International Journal of Scientific Research in Science and Technology (IJSRST), Print ISSN: 2395-6011, Online ISSN: 2395-602X, Volume 2, Issue 2, pp. 105-110, March-April 2016
URL : http://ijsrst.com/IJSRST162252