[1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086, 2018.
[2] T. Yao, Y. Pan, Y. Li, and T. Mei, “Exploring visual relationship for image captioning,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 684–699, 2018.
[3] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, “Self-critical sequence training for image captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7008–7024, 2017.
[4] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei, “Boosting image captioning with attributes,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 4894–4902, 2017.
[5] D. Gurari, Y. Zhao, M. Zhang, and N. Bhattacharya, “Captioning images taken by people who are blind,” in European Conference on Computer Vision, pp. 417–434, Springer, 2020.
[6] P. Dognin, I. Melnyk, Y. Mroueh, I. Padhi, M. Rigotti, J. Ross, Y. Schiff, R. A. Young, and B. Belgodere, “Image captioning as an assistive technology: Lessons learned from VizWiz 2020 challenge,” Journal of Artificial Intelligence Research, vol. 73, pp. 437–459, 2022.
[7] D. W. Kim, J.-g. Hwang, S. H. Lim, and S. H. Lee, “An improved feature extraction approach to image captioning for visually impaired people.”
[8] X. Yang, Y. Liu, and X. Wang, “ReFormer: The relational transformer for image captioning,” arXiv preprint arXiv:2107.14178, 2021.
[9] L. Huang, W. Wang, J. Chen, and X.-Y. Wei, “Attention on attention for image captioning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4634–4643, 2019.
[10] P. Wang, A. Yang, R. Men, J. Lin, S. Bai, Z. Li, J. Ma, C. Zhou, J. Zhou, and H. Yang, “Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework,” arXiv preprint arXiv:2202.03052, 2022.
[11] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick, “Microsoft COCO captions: Data collection and evaluation server,” arXiv preprint arXiv:1504.00325, 2015.
[12] M. Z. Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga, “A comprehensive survey of deep learning for image captioning,” ACM Computing Surveys (CSUR), vol. 51, no. 6, pp. 1–36, 2019.
[13] G. Letarte, F. Paradis, P. Giguère, and F. Laviolette, “Importance of self-attention for sentiment analysis,” in Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 267–275, 2018.
[14] K. Tang, Y. Niu, J. Huang, J. Shi, and H. Zhang, “Unbiased scene graph generation from biased training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3716–3725, 2020.
[15] P. Ackland, S. Resnikoff, and R. Bourne, “World blindness and visual impairment: Despite many successes, the problem is growing,” Community Eye Health, vol. 30, pp. 71–73, 2017.
[16] M. Stefanini, M. Cornia, L. Baraldi, S. Cascianelli, G. Fiameni, and R. Cucchiara, “From show to tell: A survey on image captioning,” 2021.
[17] Q. Zhu and J. Luo, “Generative pre-trained transformer for design concept generation: An exploration,” arXiv preprint arXiv:2111.08489, 2021.
[18] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,” Transactions of the Association for Computational Linguistics, vol. 2, pp. 67–78, 2014.
[19] X. Yang, K. Tang, H. Zhang, and J. Cai, “Auto-encoding scene graphs for image captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10685–10694, 2019.
[20] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[22] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al., “Visual Genome: Connecting language and vision using crowdsourced dense image annotations,” International Journal of Computer Vision, vol. 123, no. 1, pp. 32–73, 2017.
[23] K. Erk, “Vector space models of word meaning and phrase meaning: A survey,” Language and Linguistics Compass, vol. 6, no. 10, pp. 635–653, 2012.
[24] Y. Goldberg and O. Levy, “word2vec explained: Deriving Mikolov et al.’s negative-sampling word-embedding method,” arXiv preprint arXiv:1402.3722, 2014.
[25] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014.
[26] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
[27] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” Advances in Neural Information Processing Systems, vol. 27, 2014.
[28] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
[29] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[30] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[31] K. Clark, U. Khandelwal, O. Levy, and C. D. Manning, “What does BERT look at? An analysis of BERT’s attention,” arXiv preprint arXiv:1906.04341, 2019.
[32] “What are the types of RNN?” https://www.educative.io/edpresso/what-are-the-types-of-rnn. Accessed: 2022-04-14.
[33] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164, 2015.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
[35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[36] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137, 2015.
[37] T. Yao, Y. Pan, Y. Li, and T. Mei, “Hierarchy parsing for image captioning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2621–2629, 2019.
[38] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” Advances in Neural Information Processing Systems, vol. 28, 2015.
[39] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[40] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318, 2002.
[41] S. Banerjee and A. Lavie, “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,” in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72, 2005.
[42] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “CIDEr: Consensus-based image description evaluation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575, 2015.
[43] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al., “From captions to visual concepts and back,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1473–1482, 2015.
[44] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634, 2015.
[45] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International Conference on Machine Learning, pp. 2048–2057, PMLR, 2015.
[46] M. Qi, W. Li, Z. Yang, Y. Wang, and J. Luo, “Attentive relational networks for mapping images to scene graphs,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3957–3966, 2019.
[47] “PaddleOCR: Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices).”
[48] M. Rahman, N. Mohammed, N. Mansoor, and S. Momen, “Chittron: An automatic Bangla image captioning system,” Procedia Computer Science, vol. 154, pp. 636–642, 2019.
[49] O. Pelka, S. Koitka, J. Rückert, F. Nensa, and C. M. Friedrich, “Radiology Objects in COntext (ROCO): A multimodal image dataset,” in Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis, pp. 180–189, Springer, 2018.
[50] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out, pp. 74–81, 2004.
[51] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “SPICE: Semantic propositional image caption evaluation,” in European Conference on Computer Vision, pp. 382–398, Springer, 2016.