On Traffic-Aware Partition and Aggregation in Mapreduce for Big Data Applications

Authors

  • Shaik Inthiyaz  M.Tech Scholar, Department of Mechanical Engineering, SKD Engineering College, Gooty, Anantapur, Andhra Pradesh, India
  • S. G. Nawaz  M.Tech, HOD of CSE Department, SKD Engineering College, Gooty, Anantapur, Andhra Pradesh, India
  • Dr. R. Ramachandra  Principal & Professor, Department of Computer Science & Engineering, SKD Engineering College, Gooty, Anantapur, Andhra Pradesh, India

Keywords:

MapReduce, Hadoop, Bioinformatics, Cyber Security, Machine Learning, Big Data, Traffic Cost

Abstract

The MapReduce programming model simplifies large-scale data processing on commodity clusters by exploiting parallel map tasks and reduce tasks. Although many efforts have been made to improve the performance of MapReduce jobs, they ignore the network traffic generated in the shuffle phase, which plays a critical role in performance enhancement. Traditionally, a hash function is used to partition intermediate data among reduce tasks; this is not traffic-efficient, because neither the network topology nor the data size associated with each key is taken into consideration. In this paper, we study how to reduce the network traffic cost of a MapReduce job by designing a novel intermediate data partition scheme. Furthermore, we jointly consider the aggregator placement problem, where each aggregator can reduce the merged traffic from multiple map tasks. A decomposition-based distributed algorithm is proposed to deal with the large-scale optimization problem for big data applications, and an online algorithm is also designed to adjust data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce network traffic cost in both offline and online cases.
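The contrast between hash partitioning and a traffic-aware partition can be illustrated with a minimal sketch. It is not the paper's algorithm: the cost model (data size times hop distance), the toy `sizes` and `dist` values, and all function names are assumptions made for illustration, and load balancing and aggregator placement are deliberately left out.

```python
import zlib

def hash_partition(keys, num_reducers):
    """Baseline: pick a reducer from a deterministic hash of the key only."""
    return {k: zlib.crc32(k.encode("utf-8")) % num_reducers for k in keys}

def traffic_cost(sizes, dist, assignment):
    """Assumed cost model: sum of (data size x distance) over all (map task, key) pairs."""
    return sum(
        size * dist[m][assignment[k]]
        for m, per_key in enumerate(sizes)
        for k, size in per_key.items()
    )

def traffic_aware_partition(keys, sizes, dist, num_reducers):
    """Send each key to the reducer that minimizes that key's total shuffle cost.

    Keys are treated independently and reducer load is ignored (an assumption).
    """
    assignment = {}
    for k in keys:
        def cost_to(r):
            return sum(per_key.get(k, 0) * dist[m][r]
                       for m, per_key in enumerate(sizes))
        assignment[k] = min(range(num_reducers), key=cost_to)
    return assignment

# Hypothetical toy instance: 3 map tasks, 2 reduce tasks, 3 keys.
# sizes[m][k] = intermediate data that map task m emits for key k.
sizes = [{"a": 10, "b": 1}, {"a": 2, "b": 8}, {"b": 5, "c": 4}]
# dist[m][r] = network distance (e.g. hops) from map task m to reduce task r.
dist = [[1, 3], [3, 1], [2, 2]]
keys = ["a", "b", "c"]

hashed = hash_partition(keys, 2)
aware = traffic_aware_partition(keys, sizes, dist, 2)

print("hash partition cost:   ", traffic_cost(sizes, dist, hashed))
print("traffic-aware cost:    ", traffic_cost(sizes, dist, aware))
```

Under this simplified model the per-key choice can never cost more than the hash assignment, because each key's cost is minimized independently. A realistic scheme would also have to bound reducer load and account for aggregators that merge traffic from several map tasks before it crosses the network.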

Published

2018-02-28

Section

Research Articles

How to Cite

[1]
Shaik Inthiyaz, S. G. Nawaz, Dr. R. Ramachandra, "On Traffic-Aware Partition and Aggregation in Mapreduce for Big Data Applications," International Journal of Scientific Research in Science and Technology (IJSRST), Online ISSN: 2395-602X, Print ISSN: 2395-6011, Volume 4, Issue 2, pp. 134-138, January-February 2018.