A Neoteric Approach Based on Multi Task Learning Network for Skeletal 3D Action Recognition

Authors

  • T. Seshagiri, Research Scholar, Rayalaseema University, Kurnool; Associate Professor, Shree Institute of Technical Education, Tirupati, India
  • S. Varadarajan, Professor, Department of Electronics & Communication Engineering, SVU Engineering College, Tirupati, India

Abstract

This paper presents a new representation of skeleton sequences for 3D action recognition. Existing methods based on hand-crafted features or recurrent neural networks cannot adequately capture the complex spatial structures and the long-term temporal dynamics of skeleton sequences, both of which are essential for recognizing actions. In this paper, we propose to transform each channel of the 3D coordinates of a skeleton sequence into a clip. Each frame of the generated clip represents the temporal information of the entire skeleton sequence and encodes one particular spatial relationship between the skeleton joints. The entire clip thus incorporates multiple frames with different spatial relationships, which provide useful structural information about the human skeleton. We also propose a Multi-task Learning Network (MTLN) to process the generated clips for action recognition. The proposed MTLN processes all the frames of the generated clips in parallel to exploit the spatial and temporal information of the skeleton sequences. The proposed method has been extensively evaluated on challenging benchmark datasets, and the experimental results consistently demonstrate its superiority over existing 3D action recognition techniques.
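To make the clip-generation step concrete, the following is a minimal NumPy sketch of the idea described in the abstract: each coordinate channel (x, y, z) becomes a clip, and each frame of that clip encodes the relative positions of all joints with respect to one reference joint over the entire sequence. The reference-joint indices, array shapes, and grayscale scaling below are illustrative assumptions rather than the paper's exact pipeline, which would additionally resize each frame to the input size of a CNN.

```python
import numpy as np

def skeleton_to_clips(seq, ref_joints=(0, 4, 8, 12)):
    """Turn a skeleton sequence into one clip per coordinate channel.

    seq        : float array of shape (T, J, 3) -- T frames, J joints, (x, y, z).
    ref_joints : hypothetical reference-joint indices; each one yields one
                 frame of the clip, i.e. one spatial relationship.
    Returns    : array of shape (3, len(ref_joints), J, T); every generated
                 frame spans the full temporal range of the sequence.
    """
    T, J, C = seq.shape
    clips = np.empty((C, len(ref_joints), J, T), dtype=np.float32)
    for c in range(C):                       # one clip per coordinate channel
        chan = seq[:, :, c]                  # (T, J)
        for r, ref in enumerate(ref_joints):
            # Relative position of every joint w.r.t. one reference joint;
            # transposing puts time on the horizontal axis, so each frame
            # covers the whole sequence.
            rel = (chan - chan[:, ref:ref + 1]).T          # (J, T)
            lo, hi = rel.min(), rel.max()
            clips[c, r] = (rel - lo) / (hi - lo + 1e-6)    # scale to [0, 1]
    return clips

# Toy usage: a 30-frame sequence with 20 joints.
clips = skeleton_to_clips(np.random.randn(30, 20, 3))
print(clips.shape)   # (3, 4, 20, 30)
```

Under this reading, the MTLN would treat the classification of each generated frame as one task: the frames are processed in parallel through shared layers, the per-frame classification losses are summed during training, and the per-frame class scores are combined for the final prediction.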

References

  1. X. Yang and Y. Tian, "Super normal vector for activity recognition using depth sequences," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 804-811.
  2. G. Zhang, J. Liu, Y. Liu, J. Zhao, L. Tian, and Y. Q. Chen, "Physical blob detector and multi-channel color shape descriptor for human detection," Journal of Visual Communication and Image Representation, 2018.
  3. G. Zhang, J. Liu, H. Li, Y. Q. Chen, and L. S. Davis, "Joint human detection and head pose estimation via multistream networks for RGB-D videos," IEEE Signal Processing Letters, vol. 24, no. 11, pp. 1666-1670, 2017.
  4. P. Wang, W. Li, J. Wan, P. Ogunbona, and X. Liu, "Cooperative training of deep aggregation networks for RGB-D action recognition," arXiv preprint arXiv:1801.01080, 2017.
  5. H. Rahmani, A. Mian, and M. Shah, "Learning a deep model for human action recognition from novel viewpoints," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  6. L. Xia, C.-C. Chen, and J. Aggarwal, "View invariant human action recognition using histograms of 3D joints," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2012, pp. 20-27.
  7. R. Vemulapalli, F. Arrate, and R. Chellappa, "Human action recognition by representing 3D skeletons as points in a Lie group," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 588-595.
  8. Y. Du, W. Wang, and L. Wang, "Hierarchical recurrent neural network for skeleton based action recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1110-1118.
  9. A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, "NTU RGB+D: A large scale dataset for 3D human activity analysis," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  10. W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, and X. Xie, "Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks," in AAAI Conference on Artificial Intelligence (AAAI), 2016.
  11. J. Liu, A. Shahroudy, D. Xu, and G. Wang, "Spatio-temporal LSTM with trust gates for 3D human action recognition," in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 816-833.
  12. P. Koniusz, A. Cherian, and F. Porikli, "Tensor representations via kernel linearization for action recognition from 3D skeletons," arXiv preprint arXiv:1604.00239, 2016.
  13. P. Wang, Z. Li, Y. Hou, and W. Li, "Action recognition based on joint trajectory maps using convolutional neural networks," in Proceedings of the 2016 ACM Multimedia Conference. ACM, 2016, pp. 102-106.
  14. Q. Ke, S. An, M. Bennamoun, F. Sohel, and F. Boussaid, "SkeletonNet: Mining deep part features for 3-D action recognition," IEEE Signal Processing Letters, vol. 24, no. 6, pp. 731-735, 2017.
  15. J. Liu, A. Shahroudy, D. Xu, A. K. Chichung, and G. Wang, "Skeleton-based action recognition using spatio-temporal LSTM network with trust gates," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  16. J. Liu, G. Wang, L.-Y. Duan, K. Abdiyeva, and A. C. Kot, "Skeleton-based human action recognition with global context-aware attention LSTM networks," IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 1586-1599, 2018.
  17. F. Han, B. Reily, W. Hoff, and H. Zhang, "Space-time representation of people based on 3D skeletal data: A review," arXiv preprint arXiv:1601.01006, 2016.
  18. M. Ye, Q. Zhang, L. Wang, J. Zhu, R. Yang, and J. Gall, "A survey on human motion analysis from depth data," in Time-of-Flight and Depth Imaging: Sensors, Algorithms, and Applications. Springer, 2013, pp. 149-187.
  19. F. Lv and R. Nevatia, "Recognition and segmentation of 3-D human action using HMM and multi-class AdaBoost," in European Conference on Computer Vision (ECCV), 2006, pp. 359-372.
  20. D. Wu and L. Shao, "Leveraging hierarchical parametric networks for skeletal joints based action segmentation and recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 724-731.
  21. C. Sminchisescu, A. Kanaujia, and D. Metaxas, "Conditional models for contextual human motion recognition," Computer Vision and Image Understanding, vol. 104, no. 2, pp. 210-220, 2006.
  22. J. Wang, Z. Liu, Y. Wu, and J. Yuan, "Mining actionlet ensemble for action recognition with depth cameras," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 1290-1297.
  23. A. Graves, "Neural networks," in Supervised Sequence Labelling with Recurrent Neural Networks. Springer, 2012, pp. 15-35.
  24. A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 6645-6649.
  25. V. Veeriah, N. Zhuang, and G.-J. Qi, "Differential recurrent neural networks for action recognition," in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4041-4049.
  26. J. Weston, S. Chopra, and A. Bordes, "Memory networks," arXiv preprint arXiv:1410.3916, 2014.
  27. J. Gu, G. Wang, and T. Chen, "Recurrent highway networks with language CNN for image captioning," arXiv preprint arXiv:1612.07086, 2016.
  28. T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 4580-4584.
  29. R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, "How to construct deep recurrent neural networks," arXiv preprint arXiv:1312.6026, 2013.
  30. Y. LeCun, Y. Bengio et al., "Convolutional networks for images, speech, and time series," The Handbook of Brain Theory and Neural Networks, vol. 3361, no. 10, 1995.
  31. K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, "Return of the devil in the details: Delving deep into convolutional nets," arXiv preprint arXiv:1405.3531, 2014.
  32. D. Ciregan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012, pp. 3642-3649.
  33. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
  34. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
  35. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1-9.
  36. Y. Xiong, K. Zhu, D. Lin, and X. Tang, "Recognize complex events from static images by fusing deep channels," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1600-1609.
  37. Q. Ke and Y. Li, "Is rotation a nuisance in shape recognition?" in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 4146-4153.
  38. Q. Ke, M. Bennamoun, S. An, F. Boussaid, and F. Sohel, "Human interaction prediction using deep temporal features," in European Conference on Computer Vision Workshops. Springer, 2016, pp. 403-414.
  39. Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid, "Leveraging structural context models and ranking score fusion for human interaction prediction," IEEE Transactions on Multimedia, 2017.
  40. W. Luo, Y. Li, R. Urtasun, and R. Zemel, "Understanding the effective receptive field in deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 4898-4906.
  41. K. Yun, J. Honorio, D. Chattopadhyay, T. L. Berg, and D. Samaras, "Two-person interaction detection using body-pose features and multiple instance learning," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2012, pp. 28-35.
  42. A. Yao, J. Gall, G. Fanelli, and L. Van Gool, "Does human action recognition benefit from pose estimation?" in Proceedings of the 22nd British Machine Vision Conference (BMVC), 2011.
  43. R. Caruana, "Multitask learning," in Learning to Learn. Springer, 1998, pp. 95-133.
  44. Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid, "A new representation of skeleton sequences for 3D action recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  45. C. Wang, Y. Wang, and A. L. Yuille, "An approach to pose-based action recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 915-922.
  46. R. Chaudhry, F. Ofli, G. Kurillo, R. Bajcsy, and R. Vidal, "Bio-inspired dynamic 3D discriminative skeletal features for human action recognition," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2013, pp. 471-478.
  47. M. E. Hussein, M. Torki, M. A. Gowayyed, and M. El-Saban, "Human action recognition using a temporal hierarchy of covariance descriptors on 3D joint locations," in IJCAI, vol. 13, 2013, pp. 2466-2472.

Published

2018-06-30

Issue

Volume 4, Issue 8 (May-June 2018)

Section

Research Articles

How to Cite

[1]
T. Seshagiri and S. Varadarajan, "A Neoteric Approach Based on Multi Task Learning Network for Skeletal 3D Action Recognition," International Journal of Scientific Research in Science and Technology (IJSRST), Online ISSN: 2395-602X, Print ISSN: 2395-6011, Volume 4, Issue 8, pp. 76-84, May-June 2018.