Effective Progressive Algorithm for Duplicate Detection on Large Dataset

Authors

  • R. Ramesh Kannan  Department of Information Technology, Dhanalakshmi College of Engineering, Chennai, Tamil Nadu, India
  • D. R. Abarna  Department of Information Technology, Dhanalakshmi College of Engineering, Chennai, Tamil Nadu, India
  • G. Aswini  Department of Information Technology, Dhanalakshmi College of Engineering, Chennai, Tamil Nadu, India
  • P. Hemavathy  

Keywords:

Data Cleaning, Duplicate Detection, Entity Resolution, Progressiveness

Abstract

Progressive duplicate detection on large datasets is the process of identifying duplicate records in a short time, without degrading the quality of the data, as part of the data-cleaning process. Its main advantages are efficiency and speed, and it scales to very large datasets. The system alerts the user about potential duplicates when the user tries to create new records or update existing ones. To maintain data quality, a duplicate detection job can be scheduled to check all records that match certain criteria. The data can then be cleaned by deleting, deactivating, or merging the duplicates reported by a duplicate detection job. We propose two novel, progressive duplicate detection algorithms: the progressive sorted neighborhood method (PSNM), which performs best on small and almost clean datasets, and progressive blocking (PB), which performs best on large and very dirty datasets. Both enhance the efficiency of duplicate detection even on very large datasets, expose different strengths, and outperform current approaches. We exhaustively evaluate our own and previous algorithms on several real-world datasets.
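
As a rough illustration of the progressive idea, the sketch below implements the sorted-neighborhood variant in Python: records are sorted once by a key, and candidate pairs are then compared in increasing rank distance, so the most promising comparisons happen first and duplicates surface long before the full pass completes. The helper names (sort_key, is_duplicate) and the toy records are illustrative assumptions, not the paper's actual interface; PB would analogously partition the sorted records into blocks and extend comparisons outward from the most duplicate-dense block pairs.

    # Minimal PSNM-style sketch; sort_key and is_duplicate are hypothetical
    # user-supplied callables, not code from the original paper.
    def psnm(records, sort_key, is_duplicate, max_window=10):
        """Yield likely duplicate pairs, most promising first."""
        ordered = sorted(records, key=sort_key)
        n = len(ordered)
        # Widen the comparison distance progressively: distance 1 compares
        # adjacent records, distance 2 the next-nearest pairs, and so on,
        # instead of sliding one fixed-size window to completion.
        for dist in range(1, max_window):
            for i in range(n - dist):
                a, b = ordered[i], ordered[i + dist]
                if is_duplicate(a, b):
                    yield a, b

    # Toy usage: records that sort close together are checked first.
    people = [
        {"id": 1, "name": "john smith"},
        {"id": 2, "name": "jon smith"},
        {"id": 3, "name": "mary jones"},
    ]
    for a, b in psnm(
        people,
        sort_key=lambda r: r["name"],
        is_duplicate=lambda a, b: a["name"].split()[-1] == b["name"].split()[-1],
    ):
        print(a["id"], "~", b["id"])  # prints: 1 ~ 2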

Published

2016-04-30

Section

Research Articles

How to Cite

[1]
R. Ramesh Kannan, D. R. Abarna, G. Aswini, P. Hemavathy, "Effective Progressive Algorithm for Duplicate Detection on Large Dataset," International Journal of Scientific Research in Science and Technology (IJSRST), Online ISSN: 2395-602X, Print ISSN: 2395-6011, Volume 2, Issue 2, pp. 105-110, March-April 2016.