Compressing Visual-linguistic Model via Knowledge Distillation

Zhiyuan Fang; Jianfeng Wang; Xiaowei Hu; Lijuan Wang; Yezhou Yang; Zicheng Liu

doi:10.1109/ICCV48922.2021.00146

Compressing Visual-linguistic Model via Knowledge Distillation

Zhiyuan Fang, Jianfeng Wang, Xiaowei Hu, Lijuan Wang, Yezhou Yang, Zicheng Liu

Engineering, Ira A. Fulton Schools of (IAFSE)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

30 Scopus citations

Abstract

Despite exciting progress in pre-training for visual-linguistic (VL) representations, very few aspire to a small VL model. In this paper, we study knowledge distillation (KD) to effectively compress a transformer based large VL model into a small VL model. The major challenge arises from the inconsistent regional visual tokens extracted from different detectors of Teacher and Student, resulting in the misalignment of hidden representations and attention distributions. To address the problem, we retrain and adapt the Teacher by using the same region proposals from Student's detector while the features are from Teacher's own object detector. With aligned network inputs, the adapted Teacher is capable of transferring the knowledge through the intermediate representations. Specifically, we use the mean square error loss to mimic the attention distribution inside the transformer block, and present a token-wise noise contrastive loss to align the hidden state by contrasting with negative representations stored in a sample queue. To this end, we show that our proposed distillation significantly improves the performance of small VL models on image captioning and visual question answering tasks. It reaches 120.8 in CIDEr score on COCO captioning, an improvement of 5.1 over its non-distilled counterpart; and an accuracy of 69.8 on VQA 2.0, a 0.8 gain from the baseline. Our extensive experiments and ablations confirm the effectiveness of VL distillation in both pre-training and fine-tuning stages.

Original language	English (US)
Title of host publication	Proceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	1408-1418
Number of pages	11
ISBN (Electronic)	9781665428125
DOIs	https://doi.org/10.1109/ICCV48922.2021.00146
State	Published - 2021
Event	18th IEEE/CVF International Conference on Computer Vision, ICCV 2021 - Virtual, Online, Canada Duration: Oct 11 2021 → Oct 17 2021

Publication series

Name	Proceedings of the IEEE International Conference on Computer Vision
ISSN (Print)	1550-5499

Conference

Conference	18th IEEE/CVF International Conference on Computer Vision, ICCV 2021
Country/Territory	Canada
City	Virtual, Online
Period	10/11/21 → 10/17/21

ASJC Scopus subject areas

Software
Computer Vision and Pattern Recognition

Access to Document

10.1109/ICCV48922.2021.00146

Cite this

Fang, Z., Wang, J., Hu, X., Wang, L., Yang, Y., & Liu, Z. (2021). Compressing Visual-linguistic Model via Knowledge Distillation. In Proceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021 (pp. 1408-1418). (Proceedings of the IEEE International Conference on Computer Vision). Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/ICCV48922.2021.00146

Compressing Visual-linguistic Model via Knowledge Distillation. / Fang, Zhiyuan; Wang, Jianfeng; Hu, Xiaowei et al.
Proceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021. Institute of Electrical and Electronics Engineers Inc., 2021. p. 1408-1418 (Proceedings of the IEEE International Conference on Computer Vision).

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Fang, Z, Wang, J, Hu, X, Wang, L, Yang, Y & Liu, Z 2021, Compressing Visual-linguistic Model via Knowledge Distillation. in Proceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021. Proceedings of the IEEE International Conference on Computer Vision, Institute of Electrical and Electronics Engineers Inc., pp. 1408-1418, 18th IEEE/CVF International Conference on Computer Vision, ICCV 2021, Virtual, Online, Canada, 10/11/21. https://doi.org/10.1109/ICCV48922.2021.00146

Fang Z, Wang J, Hu X, Wang L, Yang Y, Liu Z. Compressing Visual-linguistic Model via Knowledge Distillation. In Proceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021. Institute of Electrical and Electronics Engineers Inc. 2021. p. 1408-1418. (Proceedings of the IEEE International Conference on Computer Vision). doi: 10.1109/ICCV48922.2021.00146

@inproceedings{aee8ba1b47da48b19a98b8d58a3e0311,

title = "Compressing Visual-linguistic Model via Knowledge Distillation",

abstract = "Despite exciting progress in pre-training for visual-linguistic (VL) representations, very few aspire to a small VL model. In this paper, we study knowledge distillation (KD) to effectively compress a transformer based large VL model into a small VL model. The major challenge arises from the inconsistent regional visual tokens extracted from different detectors of Teacher and Student, resulting in the misalignment of hidden representations and attention distributions. To address the problem, we retrain and adapt the Teacher by using the same region proposals from Student's detector while the features are from Teacher's own object detector. With aligned network inputs, the adapted Teacher is capable of transferring the knowledge through the intermediate representations. Specifically, we use the mean square error loss to mimic the attention distribution inside the transformer block, and present a token-wise noise contrastive loss to align the hidden state by contrasting with negative representations stored in a sample queue. To this end, we show that our proposed distillation significantly improves the performance of small VL models on image captioning and visual question answering tasks. It reaches 120.8 in CIDEr score on COCO captioning, an improvement of 5.1 over its non-distilled counterpart; and an accuracy of 69.8 on VQA 2.0, a 0.8 gain from the baseline. Our extensive experiments and ablations confirm the effectiveness of VL distillation in both pre-training and fine-tuning stages.",

author = "Zhiyuan Fang and Jianfeng Wang and Xiaowei Hu and Lijuan Wang and Yezhou Yang and Zicheng Liu",

note = "Funding Information: Acknowledgements: Z. Fang and Y. Yang are partially supported by the US National Science Foundation project #1750082. Publisher Copyright: {\textcopyright} 2021 IEEE; 18th IEEE/CVF International Conference on Computer Vision, ICCV 2021 ; Conference date: 11-10-2021 Through 17-10-2021",

year = "2021",

doi = "10.1109/ICCV48922.2021.00146",

language = "English (US)",

series = "Proceedings of the IEEE International Conference on Computer Vision",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "1408--1418",

booktitle = "Proceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021",

}

TY - GEN

T1 - Compressing Visual-linguistic Model via Knowledge Distillation

AU - Fang, Zhiyuan

AU - Wang, Jianfeng

AU - Hu, Xiaowei

AU - Wang, Lijuan

AU - Yang, Yezhou

AU - Liu, Zicheng

PY - 2021

Y1 - 2021

N2 - Despite exciting progress in pre-training for visual-linguistic (VL) representations, very few aspire to a small VL model. In this paper, we study knowledge distillation (KD) to effectively compress a transformer based large VL model into a small VL model. The major challenge arises from the inconsistent regional visual tokens extracted from different detectors of Teacher and Student, resulting in the misalignment of hidden representations and attention distributions. To address the problem, we retrain and adapt the Teacher by using the same region proposals from Student's detector while the features are from Teacher's own object detector. With aligned network inputs, the adapted Teacher is capable of transferring the knowledge through the intermediate representations. Specifically, we use the mean square error loss to mimic the attention distribution inside the transformer block, and present a token-wise noise contrastive loss to align the hidden state by contrasting with negative representations stored in a sample queue. To this end, we show that our proposed distillation significantly improves the performance of small VL models on image captioning and visual question answering tasks. It reaches 120.8 in CIDEr score on COCO captioning, an improvement of 5.1 over its non-distilled counterpart; and an accuracy of 69.8 on VQA 2.0, a 0.8 gain from the baseline. Our extensive experiments and ablations confirm the effectiveness of VL distillation in both pre-training and fine-tuning stages.

AB - Despite exciting progress in pre-training for visual-linguistic (VL) representations, very few aspire to a small VL model. In this paper, we study knowledge distillation (KD) to effectively compress a transformer based large VL model into a small VL model. The major challenge arises from the inconsistent regional visual tokens extracted from different detectors of Teacher and Student, resulting in the misalignment of hidden representations and attention distributions. To address the problem, we retrain and adapt the Teacher by using the same region proposals from Student's detector while the features are from Teacher's own object detector. With aligned network inputs, the adapted Teacher is capable of transferring the knowledge through the intermediate representations. Specifically, we use the mean square error loss to mimic the attention distribution inside the transformer block, and present a token-wise noise contrastive loss to align the hidden state by contrasting with negative representations stored in a sample queue. To this end, we show that our proposed distillation significantly improves the performance of small VL models on image captioning and visual question answering tasks. It reaches 120.8 in CIDEr score on COCO captioning, an improvement of 5.1 over its non-distilled counterpart; and an accuracy of 69.8 on VQA 2.0, a 0.8 gain from the baseline. Our extensive experiments and ablations confirm the effectiveness of VL distillation in both pre-training and fine-tuning stages.

UR - http://www.scopus.com/inward/record.url?scp=85121660046&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85121660046&partnerID=8YFLogxK

U2 - 10.1109/ICCV48922.2021.00146

DO - 10.1109/ICCV48922.2021.00146

M3 - Conference contribution

AN - SCOPUS:85121660046

T3 - Proceedings of the IEEE International Conference on Computer Vision

SP - 1408

EP - 1418

BT - Proceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 18th IEEE/CVF International Conference on Computer Vision, ICCV 2021

Y2 - 11 October 2021 through 17 October 2021

ER -

Compressing Visual-linguistic Model via Knowledge Distillation

Abstract

Publication series

Conference

ASJC Scopus subject areas

Access to Document

Other files and links

Fingerprint

Cite this