A Survey on Deduplication Workload Resource for Big Data Applications

Authors

  • V. Manochitra, Department of Information Technology, Bon Secours College for Women, Thanjavur
  • B. Jackline Jose, Department of Information Technology, Bon Secours College for Women, Thanjavur

Keywords:

Big Data, Deduplication, Hash Indexing, Resource Allocation, Big Data Analysis

Abstract

Deduplication is a promising answer to the data explosion of the big data era for two reasons: 1) it slows data growth by removing redundant data, and 2) it relieves pressure on disk bandwidth by eliminating redundant IO accesses. However, deduplication also introduces overhead to the system. For example, hash indexing must be performed for every IO request to identify duplicates, which results in slower IO response time. In addition, extra CPU work is required to compute the hash values for each IO request, which leads to higher energy consumption. Since the volume of IO requests is enormous and still increasing in big data workloads, the overall performance and energy efficiency under different deduplication configurations is worth studying systematically.
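
As a rough illustration of the per-request hash indexing described above, the following minimal sketch hashes every incoming block, looks the fingerprint up in an index, and stores the block only if it is new. The class name `DedupStore`, the use of SHA-256, the in-memory dictionary index, and the 4 KiB block size are all assumptions made for this example; they are not taken from the paper.

```python
import hashlib


class DedupStore:
    """Toy in-memory block store with a hash index (hypothetical, for illustration)."""

    def __init__(self):
        self.index = {}           # fingerprint -> stored block
        self.logical_writes = 0   # write requests received
        self.physical_writes = 0  # blocks actually stored

    def write(self, block: bytes) -> str:
        # Hashing every request is the extra CPU work the abstract refers to.
        fingerprint = hashlib.sha256(block).hexdigest()
        self.logical_writes += 1
        # The index lookup on every request adds latency to the IO path.
        if fingerprint not in self.index:
            self.index[fingerprint] = block
            self.physical_writes += 1
        return fingerprint

    def read(self, fingerprint: str) -> bytes:
        return self.index[fingerprint]

    def dedup_ratio(self) -> float:
        # Logical writes per physical write; 1.0 means nothing was deduplicated.
        if self.physical_writes == 0:
            return 1.0
        return self.logical_writes / self.physical_writes


store = DedupStore()
blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096]
refs = [store.write(b) for b in blocks]
print(store.logical_writes, store.physical_writes, store.dedup_ratio())
```

The sketch makes the trade-off visible: the duplicate blocks are never stored again, but every write, duplicate or not, pays for one hash computation and one index lookup.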

Published

2017-04-30

Section

Research Articles

How to Cite

[1]
V. Manochitra, B. Jackline Jose, "A Survey on Deduplication Workload Resource for Big Data Applications," International Journal of Scientific Research in Science and Technology (IJSRST), Online ISSN: 2395-602X, Print ISSN: 2395-6011, Volume 3, Issue 5, pp. 41-48, May-June 2017.