Incremental-Parallel Data Stream Classification in Apache Spark Environment

Authors

  • A. Anantha Babu, Department of Computer Science and Engineering, Anna University Regional Campus Coimbatore, Tamil Nadu, India
  • J. Preethi, Department of Computer Science and Engineering, Anna University Regional Campus Coimbatore, Tamil Nadu, India

Keywords:

Big Data, Data Stream Classification, Incremental-Parallel Technique

Abstract

In the big data era, data stream classification is challenged by high-velocity, conceptually infinite streams whose statistical properties change periodically. In this paper, we propose an Incremental Parallel Random Forest (IPRF) algorithm for classifying data streams in a Spark cloud computing environment. The algorithm incrementally estimates classification accuracy prior to the parallelization process, reducing training and prediction time through a random sampling and filtering approach that improves the dynamic data-allocation and task-scheduling mechanisms in the cloud environment. From the perspective of dynamic data allocation, because the data in a stream change dynamically, communication cost and data volume are reduced by vertical data partitioning and a data-multiplexing method. From the perspective of task scheduling, an incremental-parallel technique is applied to the training process of the Random Forest, and a task directed acyclic graph over resilient distributed dataset (RDD) objects, classified as static, redundant, or newly appended, re-organizes the mapping relationship between successor tasks and slave nodes. The details and results of evaluating the proposed mechanism on benchmark datasets are presented in this paper.
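As a concrete illustration of the incremental training loop described above, the following Scala sketch shows one way a Spark MLlib RandomForestClassifier could be refreshed as new micro-batches of a stream arrive. It is a minimal sketch, not the authors' IPRF implementation: the file paths, the feature columns f1-f3 and label, and the retrain-on-all-accumulated-data policy are assumptions made only for this example, whereas IPRF itself updates the ensemble incrementally and schedules tree-building tasks over a DAG of RDDs.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorAssembler

object IncrementalRFSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("iprf-sketch").getOrCreate()

    // Assumed schema: numeric feature columns f1, f2, f3 and a numeric "label".
    def load(path: String): DataFrame =
      spark.read.option("header", "true").option("inferSchema", "true").csv(path)

    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2", "f3"))
      .setOutputCol("features")

    // Seed with an initial batch, then fold in later micro-batches (hypothetical paths).
    var accumulated: DataFrame = load("hdfs:///streams/initial_batch.csv")

    for (batchPath <- Seq("hdfs:///streams/batch_1.csv", "hdfs:///streams/batch_2.csv")) {
      val batch = load(batchPath)

      // Retrain the forest on all data seen so far; this stands in for the
      // incremental tree updates performed by IPRF.
      val model = new RandomForestClassifier()
        .setLabelCol("label")
        .setFeaturesCol("features")
        .setNumTrees(50)
        .fit(assembler.transform(accumulated))

      // Prequential-style check: score the newly arrived batch with the model
      // trained on earlier data, then absorb the batch into the training set.
      model.transform(assembler.transform(batch))
        .select("label", "prediction")
        .show(5)

      accumulated = accumulated.union(batch)
    }

    spark.stop()
  }
}
```

In a real deployment, the micro-batches would typically arrive from a streaming source such as Kafka via Spark Streaming rather than from static CSV files.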

Published

2017-04-30

Issue

Volume 3, Issue 5 (May-June 2017)

Section

Research Articles

How to Cite

[1] A. Anantha Babu and J. Preethi, "Incremental-Parallel Data Stream Classification in Apache Spark Environment," International Journal of Scientific Research in Science and Technology (IJSRST), Online ISSN: 2395-602X, Print ISSN: 2395-6011, Volume 3, Issue 5, pp. 201-209, May-June 2017.