Incremental-Parallel Data Stream Classification in Apache Spark Environment

Authors (2): S. Tamilarasi, Dr. KungumaRaj

In the era of big data, data stream classification is challenging because streams arrive at high velocity, are conceptually infinite, and have statistical properties that change over time. In this paper, we propose an Incremental Parallel Random Forest (IPRF) algorithm for data streams in a Spark cloud computing environment. The algorithm incrementally estimates the accuracy of classifying the data stream before the parallelization process, using a random sampling and filtering approach to reduce training and prediction time; this in turn improves the dynamic data-allocation and task-scheduling mechanisms in a cloud environment. From the perspective of dynamic data allocation, the data in a stream environment changes dynamically, so vertical data partitioning and data multiplexing are used to reduce communication cost and data volume. From the perspective of task scheduling, an incremental-parallel technique is applied in the training process of the Random Forest, and a task directed acyclic graph is built on resilient distributed data objects (static, redundant, and least data object appending) to reorganize the mapping between successor tasks and slaves. The details of the proposed mechanism and the results of evaluating it on benchmark datasets are presented in this paper.
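As an illustration of the vertical data-partitioning idea mentioned in the abstract, the following is a minimal sketch in plain Python, not the authors' implementation and not Spark code. It assumes each record is a (feature vector, label) pair and regroups a small batch into one subset per feature column, each entry keeping the row index and the class label. The names `vertical_partition`, `Row`, and `batch` are hypothetical and chosen only for this example.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed record layout for this sketch: (feature vector, class label).
Row = Tuple[Sequence[float], int]


def vertical_partition(rows: List[Row]) -> Dict[int, List[Tuple[int, float, int]]]:
    """Split row-oriented data into one subset per feature column.

    Each subset holds (row index, feature value, label) triples, so a worker
    that handles a single feature never needs the other columns.
    """
    if not rows:
        return {}
    n_features = len(rows[0][0])
    subsets: Dict[int, List[Tuple[int, float, int]]] = {
        j: [] for j in range(n_features)
    }
    for i, (features, label) in enumerate(rows):
        if len(features) != n_features:
            raise ValueError(f"row {i} has {len(features)} features, expected {n_features}")
        for j, value in enumerate(features):
            subsets[j].append((i, value, label))
    return subsets


# A tiny hypothetical batch taken from a stream: two features, binary label.
batch: List[Row] = [([5.1, 3.5], 0), ([6.2, 2.9], 1), ([4.7, 3.2], 0)]
subsets = vertical_partition(batch)
```

In a distributed setting, each per-feature subset could be placed on a different worker, so that evaluating a split on one feature would require no data from the other columns; how IPRF actually does this is described in the full paper, not here.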

Authors and Affiliations

S. Tamilarasi
Research Scholar, Department of Computer Science, Mother Teresa University, Kodaikanal, Tamilnadu, India
Dr. KungumaRaj
Head, Assistant Professor, Department of Computer Applications, Arulmigu Palaniandavar College for Women, Palani, Tamilnadu, India

Keywords : Big Data, Data Stream Classification, Incremental-Parallel Technique


Publication Details

Published in : Volume 3 | Issue 5 | May-June 2017
Date of Publication : 2017-04-30
License:  This work is licensed under a Creative Commons Attribution 4.0 International License.
Page(s) : 210-213
Manuscript Number : ICASCT2533
Publisher : Technoscience Academy

Print ISSN : 2395-6011, Online ISSN : 2395-602X

Cite This Article :

S. Tamilarasi, Dr. KungumaRaj, "Incremental-Parallel Data Stream Classification in Apache Spark Environment", International Journal of Scientific Research in Science and Technology (IJSRST), Print ISSN : 2395-6011, Online ISSN : 2395-602X, Volume 3, Issue 5, pp. 210-213, May-June 2017.