TY - GEN
T1 - Injecting Semantic Concepts into End-to-End Image Captioning
AU - Fang, Zhiyuan
AU - Wang, Jianfeng
AU - Hu, Xiaowei
AU - Liang, Lin
AU - Gan, Zhe
AU - Wang, Lijuan
AU - Yang, Yezhou
AU - Liu, Zicheng
N1 - Funding Information:
In this paper, we propose ViTCAP, a detector-free image captioning model built on a full transformer architecture. Compared with existing captioning models, ViTCAP can be trained end-to-end using grid representations, without intermediate regional operations. Our proposed Concept Token Network learns broad semantic concepts and encodes them as concept tokens that substantially benefit the captioning task on a series of challenging captioning benchmarks. Extensive experiments indicate that ViTCAP achieves competitive performance, approaching that of most detector-based models. We anticipate that ViTCAP will lead to more future work in building efficient Vision and Language models. Acknowledgement. This work was supported by the National Science Foundation under Grants CMMI-1925403, IIS-2132724, and IIS-1750082.
Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Tremendous progress has been made in recent years in developing better image captioning models, yet most of them rely on a separate object detector to extract regional features. Recent vision-language studies are shifting towards a detector-free approach by leveraging grid representations for more flexible model training and faster inference speed. However, such development has primarily focused on image understanding tasks and remains less investigated for the caption generation task. In this paper, we are concerned with a better-performing detector-free image captioning model, and propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features. For improved performance, we introduce a novel Concept Token Network (CTN) to predict semantic concepts and then incorporate them into end-to-end captioning. In particular, the CTN is built on a vision transformer and is designed to predict concept tokens through a classification task; the rich semantic information they contain greatly benefits the captioning task. Compared with previous detector-based models, ViTCAP drastically simplifies the architecture and at the same time achieves competitive performance on various challenging image captioning datasets. In particular, ViTCAP reaches a CIDEr score of 138.1 on the COCO-caption Karpathy split, and CIDEr scores of 93.8 and 108.6 on the nocaps and Google-CC captioning datasets, respectively.
AB - Tremendous progress has been made in recent years in developing better image captioning models, yet most of them rely on a separate object detector to extract regional features. Recent vision-language studies are shifting towards a detector-free approach by leveraging grid representations for more flexible model training and faster inference speed. However, such development has primarily focused on image understanding tasks and remains less investigated for the caption generation task. In this paper, we are concerned with a better-performing detector-free image captioning model, and propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features. For improved performance, we introduce a novel Concept Token Network (CTN) to predict semantic concepts and then incorporate them into end-to-end captioning. In particular, the CTN is built on a vision transformer and is designed to predict concept tokens through a classification task; the rich semantic information they contain greatly benefits the captioning task. Compared with previous detector-based models, ViTCAP drastically simplifies the architecture and at the same time achieves competitive performance on various challenging image captioning datasets. In particular, ViTCAP reaches a CIDEr score of 138.1 on the COCO-caption Karpathy split, and CIDEr scores of 93.8 and 108.6 on the nocaps and Google-CC captioning datasets, respectively.
KW - Vision + language
KW - Vision applications and systems
UR - http://www.scopus.com/inward/record.url?scp=85138182406&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85138182406&partnerID=8YFLogxK
U2 - 10.1109/CVPR52688.2022.01748
DO - 10.1109/CVPR52688.2022.01748
M3 - Conference contribution
AN - SCOPUS:85138182406
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 17988
EP - 17998
BT - Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
PB - IEEE Computer Society
T2 - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Y2 - 19 June 2022 through 24 June 2022
ER -