Image Description Using Deep Neural Network

Authors

  • Akanksha P. Deshmukh, PG Student, Department of Computer Engineering, Pune Institute of Computer Technology, Pune, India
  • Dr. A. S. Ghotkar, Associate Professor, Department of Computer Engineering, Pune Institute of Computer Technology, Pune, India

Keywords

Natural Language Processing, Neural Network, Torch, Convolutional Neural Network, Recurrent Neural Network

Abstract

Recent research in computer vision and machine learning has demonstrated impressive ability to detect and recognize objects in natural images. Image description is a good starting point for imparting artificial intelligence to machines, as it requires them to analyze and describe complex visual scenes. Computer software has recently become smart enough to recognize objects in pictures, but not to determine exactly what activities are happening in them. There is therefore a need for a system that can generate natural language descriptions from images. Such a system can be useful for childhood education, image retrieval, and visually impaired people. Automatic description of images is a challenging problem that draws interest from the domains of computer vision and natural language processing. The vision-based image description system uses a deep learning Convolutional Neural Network and a Recurrent Neural Network to generate descriptions of images. As a result, the neural network shows better results for image description, increasing the Bilingual Evaluation Understudy (BLEU) score to 0.64 and the Consensus-based Image Description Evaluation (CIDEr) score to 0.72 while reducing validation loss to 2.5.
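
The following sketch makes the abstract's pipeline concrete: a Convolutional Neural Network encodes the image into a feature vector, and a Recurrent Neural Network decodes that vector into a word sequence. The keyword list mentions Torch; this illustration uses PyTorch instead, and the tiny convolutional stack (standing in for a pretrained CNN), the layer sizes, and the dummy vocabulary are all assumptions chosen to keep the example self-contained, not the authors' configuration.

    import torch
    import torch.nn as nn

    class EncoderCNN(nn.Module):
        """Maps an image to a fixed-length feature vector. A small conv
        stack stands in for a pretrained CNN so the sketch runs as-is."""
        def __init__(self, embed_size):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),   # global pooling -> (B, 16, 1, 1)
            )
            self.fc = nn.Linear(16, embed_size)

        def forward(self, images):
            x = self.features(images).flatten(1)   # (B, 16)
            return self.fc(x)                      # (B, embed_size)

    class DecoderRNN(nn.Module):
        """Generates a caption conditioned on the image: the image
        embedding is fed to an LSTM language model as its first input."""
        def __init__(self, embed_size, hidden_size, vocab_size):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_size)
            self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
            self.fc = nn.Linear(hidden_size, vocab_size)

        def forward(self, image_feats, captions):
            # Prepend the image embedding to the caption's word embeddings.
            inputs = torch.cat([image_feats.unsqueeze(1),
                                self.embed(captions)], dim=1)
            hidden, _ = self.lstm(inputs)
            return self.fc(hidden)                 # (B, T+1, vocab_size)

    if __name__ == "__main__":
        vocab_size, embed_size, hidden_size = 1000, 256, 512
        encoder = EncoderCNN(embed_size)
        decoder = DecoderRNN(embed_size, hidden_size, vocab_size)
        images = torch.randn(2, 3, 224, 224)             # dummy RGB batch
        captions = torch.randint(0, vocab_size, (2, 10)) # dummy token ids
        logits = decoder(encoder(images), captions)
        print(logits.shape)   # torch.Size([2, 11, 1000])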
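
BLEU, the metric behind the reported 0.64, scores a generated caption by its n-gram overlap with human reference captions. A minimal example using NLTK's implementation follows; the candidate and reference sentences are invented purely for illustration.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    references = [
        "a man is riding a horse on the beach".split(),
        "a person rides a horse along the shore".split(),
    ]
    candidate = "a man rides a horse on the beach".split()

    # Smoothing avoids zero scores when a higher-order n-gram has no match.
    score = sentence_bleu(references, candidate,
                          smoothing_function=SmoothingFunction().method1)
    print(f"BLEU: {score:.2f}")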

Published

2017-10-31

Issue

Volume 3, Issue 7 (September-October 2017)

Section

Research Articles

How to Cite

[1]
Akanksha P. Deshmukh and Dr. A. S. Ghotkar, "Image Description Using Deep Neural Network", International Journal of Scientific Research in Science and Technology (IJSRST), Online ISSN: 2395-602X, Print ISSN: 2395-6011, Volume 3, Issue 7, pp. 508-513, September-October 2017.