Abstract:
Image captioning refers to the task of assigning a natural language description to an image based on its visual and cognitive information. It is a multi-modal task in which image understanding and natural language generation form the backbone. Real-life applications such as content-based image retrieval, self-driving-car navigation, assistance for visually impaired people, and visual question answering are areas where image captioning can be used. Even though a significant amount of research has been done on image captioning, there is still considerable room to improve the accuracy of image captioning systems, especially for visually challenging images. We explored the possibility of developing a more robust and accurate image captioning system that can handle motion blur, plain-text tokens, and partially visible objects in an image. We proposed a pipeline that includes a global feature extractor for capturing the overall pictorial information of the image, a scene graph for detecting objects and learning the relationships among them, an OCR token extractor for understanding any plain text present in the image, and an encoder-decoder based language model for feature-to-text translation. The main goal was to exploit these research opportunities and narrow the existing research gap. Finally, we examined our findings and performed a comparative analysis of our architecture against existing state-of-the-art work on the VizWiz-Captions dataset, whose images are taken by visually impaired people and are therefore more visually challenging.
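To make the proposed feature-fusion pipeline concrete, the sketch below illustrates, in PyTorch, how global image features, scene-graph object features, and OCR token embeddings could be projected into a shared space, concatenated into one encoder sequence, and decoded into caption tokens by a Transformer. This is a minimal illustration under stated assumptions, not the thesis's exact implementation: all module names, feature dimensions, and the concatenation-based fusion strategy are assumptions chosen for clarity.

```python
# Illustrative sketch only: module names, dimensions, and the fusion strategy
# are assumptions, not the exact architecture described in the thesis.
import torch
import torch.nn as nn


class MultiModalCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256):
        super().__init__()
        # Project each feature stream (global CNN features, scene-graph object
        # features, OCR token embeddings) into a shared d_model space.
        self.global_proj = nn.Linear(2048, d_model)   # e.g. pooled CNN features
        self.object_proj = nn.Linear(1024, d_model)   # e.g. scene-graph node features
        self.ocr_proj = nn.Linear(300, d_model)       # e.g. OCR token embeddings
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, global_feat, object_feats, ocr_feats, caption_tokens):
        # Build one encoder sequence from the three modalities.
        memory = torch.cat(
            [
                self.global_proj(global_feat).unsqueeze(1),  # (B, 1, d)
                self.object_proj(object_feats),              # (B, N_obj, d)
                self.ocr_proj(ocr_feats),                    # (B, N_ocr, d)
            ],
            dim=1,
        )
        tgt = self.word_emb(caption_tokens)                  # (B, T, d)
        # Causal mask so each caption position attends only to earlier words.
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        decoded = self.transformer(memory, tgt, tgt_mask=tgt_mask)
        return self.out(decoded)                             # (B, T, vocab)


if __name__ == "__main__":
    model = MultiModalCaptioner()
    logits = model(
        global_feat=torch.randn(2, 2048),        # pooled image features
        object_feats=torch.randn(2, 10, 1024),   # 10 detected objects
        ocr_feats=torch.randn(2, 5, 300),        # 5 OCR tokens (if any)
        caption_tokens=torch.randint(0, 10000, (2, 12)),
    )
    print(logits.shape)  # torch.Size([2, 12, 10000])
```

In practice the decoder's output distribution can also be extended with a copy mechanism over the OCR tokens so that scene text can appear verbatim in the caption; that refinement is omitted here for brevity.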
Description:
Supervised by
Dr. Md. Hasanul Kabir
Professor, Department of Computer Science and Engineering (CSE).
Islamic University of Technology (IUT)
Co-Supervisor
Sabbir Ahmed
Lecturer, Department of Computer Science and Engineering (CSE).
Islamic University of Technology (IUT)
This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2022.