A Survey on Deduplication Workload Resource for Big Data Applications

V. Manochitra; B. Jackline Jose

doi:10.32628/ICASCT2508

Authors

V. Manochitra, Department of Information Technology, Bon Secours College for Women, Thanjavur
B. Jackline Jose, Department of Information Technology, Bon Secours College for Women, Thanjavur

Keywords:

Big Data, Deduplication, Hash Indexing, Resource Allocation, Big Data Analysis

Abstract

Deduplication appears to be a suitable response to the data explosion of the big data era: it (1) slows data growth by removing redundant data, and (2) relieves pressure on disk bandwidth by eliminating redundant IO accesses. However, deduplication also introduces overhead to the system. For example, hash indexing must be performed for every IO request to identify duplicates, which slows IO response time. In addition, extra CPU power is required to compute the hash values for each IO request, which leads to higher energy consumption. Since the volume of IO requests in big data workloads is enormous and still increasing, the overall performance and energy efficiency under different deduplication configurations deserve systematic study.
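
To make the trade-off described in the abstract concrete, the following is a minimal sketch of block-level deduplication with a hash index. It is an illustration only, not the authors' system: the class name `DedupStore`, the 4096-byte block size, the use of SHA-256 as the fingerprint, and the in-memory dictionaries are all assumptions chosen for brevity.

```python
import hashlib


class DedupStore:
    """Toy in-memory block store that deduplicates by content hash (hypothetical)."""

    def __init__(self):
        self.index = {}         # fingerprint -> stored block (the hash index)
        self.logical = {}       # logical block address -> fingerprint
        self.writes_seen = 0    # write requests received
        self.blocks_stored = 0  # unique blocks actually kept

    def write(self, lba, data):
        # Overhead 1: extra CPU work to fingerprint every incoming block.
        fingerprint = hashlib.sha256(data).hexdigest()
        self.writes_seen += 1
        # Overhead 2: an index lookup on every request to detect duplicates.
        if fingerprint not in self.index:
            self.index[fingerprint] = data
            self.blocks_stored += 1
        # Duplicate blocks only add a reference, not a second copy.
        self.logical[lba] = fingerprint

    def read(self, lba):
        return self.index[self.logical[lba]]


store = DedupStore()
store.write(0, b"A" * 4096)
store.write(1, b"B" * 4096)
store.write(2, b"A" * 4096)  # same content as block 0
print(store.writes_seen, store.blocks_stored)  # prints: 3 2
```

Three write requests result in only two stored blocks, which is the capacity and bandwidth benefit; the hash computation and index lookup performed on each request are the response-time and energy costs the abstract refers to.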
