Improving Zero-Shot Semantic Segmentation using Dynamic Kernels

Tajwar, Tauseef; Rahman, Muftiqur; Chowdhury, Taukir Azam

dc.contributor.author	Tajwar, Tauseef
dc.contributor.author	Rahman, Muftiqur
dc.contributor.author	Chowdhury, Taukir Azam
dc.date.accessioned	2024-08-30T09:18:27Z
dc.date.available	2024-08-30T09:18:27Z
dc.date.issued	2023-05-30
dc.identifier.uri	http://hdl.handle.net/123456789/2145
dc.description	Supervised by Dr. Md. Hasanul Kabir, Professor, and co-supervised by Sabbir Ahmed, Assistant Professor, Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh	en_US
dc.description.abstract	Zero-shot Semantic Segmentation (ZS3) is a daunting task since it requires segmenting items into classes that were never seen during training. One popular method is to divide ZS3 into two sub-tasks: creating mask suggestions and assigning class labels to individual pixels inside those regions. However, many existing approaches have difficulty producing masks with sufficient generalization capabilities, resulting in notable performance constraints, particularly on unseen classes. In this regard, we propose using “Dynamic Kernels” to improve object understanding within a ZS3 model during the training phase. We aim to produce superior mask suggestions that permit a more accurate representation of the objects by harnessing the intrinsic inductive biases of these kernels. These specialized agents, known as dynamic kernels, adjust based on data taken from seen classes, allowing them to obtain insights on unseen objects. In addition, for segment classification, our proposed system utilizes the Contrastive Language-Image Pre-Training (CLIP) architecture. This integration improves the model’s generalizability by utilizing its cross-modal training capabilities. The utilization of dynamic kernels in conjunction with CLIP proves to be advantageous as it allows for finer granularity in processing, enabling performance enhancements for both seen and unseen classes. Our proposed ZSK-Net surpasses the existing state-of-the-art methods by achieving a remarkable improvement of +10.4 and +0.9 in hIoU on the Pascal VOC and COCO-Stuff datasets, respectively.	en_US
dc.language.iso	en	en_US
dc.publisher	Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh	en_US
dc.title	Improving Zero-Shot Semantic Segmentation using Dynamic Kernels	en_US
dc.type	Thesis	en_US