dc.identifier.citation |
[1] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.
[2] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics, Jul. 2002, pp. 311–318. [Online]. Available: https://aclanthology.org/P02-1040
[3] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei, “Boosting image captioning with attributes,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4894–4902.
[4] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik, “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2641–2649.
[5] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755.
[6] S. Katiyar and S. K. Borgohain, “Comparative evaluation of CNN architectures for image caption generation,” arXiv preprint arXiv:2102.11506, 2021.
[7] Z. Yang and N. Okazaki, “Image caption generation for news articles,” in Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 1941–1951.
[8] C. Chen, S. Mu, W. Xiao, Z. Ye, L. Wu, and Q. Ju, “Improving image captioning with conditional generative adversarial nets,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 8142–8150.
[9] H. R. Tavakoli, R. Shetty, A. Borji, and J. Laaksonen, “Paying attention to descriptions generated by image captioning models,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2487–2496.
[10] Y. Huang, B. Liu, J. Fu, and Y. Lu, “A picture is worth a thousand words: A unified system for diverse captions and rich images generation,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 2792–2794.
[11] M. Tanti, A. Gatt, and K. P. Camilleri, “What is the role of recurrent neural networks (RNNs) in an image caption generator?” arXiv preprint arXiv:1708.02043, 2017.
[12] S. Katiyar and S. K. Borgohain, “Analysis of convolutional decoder for image caption generation,” arXiv preprint arXiv:2103.04914, 2021.
[13] R. Bernardi, R. Cakici, D. Elliott, A. Erdem, E. Erdem, N. Ikizler-Cinbis, F. Keller, A. Muscat, and B. Plank, “Automatic description generation from images: A survey of models, datasets, and evaluation measures,” Journal of Artificial Intelligence Research, vol. 55, pp. 409–442, 2016.
[14] T. Miyazaki and N. Shimizu, “Cross-lingual image caption generation,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1780–1790.
[15] M. Nikolaus, M. Abdou, M. Lamm, R. Aralikatte, and D. Elliott, “Compositional generalization in image captioning,” arXiv preprint arXiv:1909.04402, 2019.
[16] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, “Self-critical sequence training for image captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7008–7024.
[17] Y. Yoshikawa, Y. Shigeto, and A. Takeuchi, “STAIR captions: Constructing a large-scale Japanese image caption dataset,” arXiv preprint arXiv:1705.00823, 2017.
[18] V. Jindal, “Generating image captions in Arabic using root-word based recurrent neural networks and deep neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[19] R. Biswas, M. Barz, M. Hartmann, and D. Sonntag, “Improving German image captions using machine translation and transfer learning,” in Statistical Language and Speech Processing: 9th International Conference, SLSP 2021, Cardiff, UK, November 23–25, 2021, Proceedings 9. Springer, 2021, pp. 3–14.
[20] W. Zhao, B. Wang, J. Ye, M. Yang, Z. Zhao, R. Luo, and Y. Qiao, “A multi-task learning approach for image captioning,” in IJCAI, 2018, pp. 1205–1211.
[21] S. Kwon, B.-H. Go, and J.-H. Lee, “A text-based visual context modulation neural model for multimodal machine translation,” Pattern Recognition Letters, vol. 136, pp. 212–218, 2020.
[22] D. Elliott, S. Frank, and E. Hasler, “Multilingual image description with neural sequence models,” arXiv preprint arXiv:1510.04709, 2015.
[23] A. Mastropaolo, S. Scalabrino, N. Cooper, D. N. Palacio, D. Poshyvanyk, R. Oliveto, and G. Bavota, “Studying the usage of text-to-text transfer transformer to support code-related tasks,” in 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 2021, pp. 336–347.
[24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[25] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.
[26] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, “Every picture tells a story: Generating sentences from images,” in Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5–11, 2010, Proceedings, Part IV 11. Springer, 2010, pp. 15–29.
[27] A. Roberts, C. Raffel, K. Lee, M. Matena, N. Shazeer, P. J. Liu, S. Narang, W. Li, and Y. Zhou, “Exploring the limits of transfer learning with a unified text-to-text transformer,” 2019.
[28] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
[29] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, 2015. |
en_US |