Ensemble learning on deep neural networks for image caption generation

Harshitha Katpally; Ajay Bansal

doi:10.1109/ICSC.2020.00016

Software Engineering

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

13 Scopus citations

Abstract

Expressing the information in an image as a natural-language sentence is considered a difficult problem for computers. Image captioning involves not just detecting objects in an image but also understanding the interactions between those objects so that they can be translated into relevant captions. Expertise in both computer vision and natural language processing is therefore crucial. The traditional approach uses the sequence-to-sequence modelling strategy of deep neural networks to generate a sequence of words that together describe the image. However, these models suffer from high variance: they do not generalize well beyond the training data. The main focus of this paper is to reduce this variance in order to generate better captions. To achieve this, we explore ensemble learning techniques, which are known for addressing the high-variance problem in machine learning algorithms. Three ensemble techniques are evaluated in our work: k-fold ensemble, bootstrap aggregation ensemble, and boosting ensemble. For each technique, three approaches to combining the outputs are analyzed. Extensive experiments are conducted on the Flickr8k dataset, which contains 8,000 images with 5 different captions for each image. The BLEU score, which is considered a standard metric for evaluating generated text in natural language processing (NLP), is used to evaluate the predictions. Based on this metric, the analysis shows that ensemble learning performs significantly better and generates more meaningful captions than any of the individual models used.
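
The way two of the ensembles named in the abstract assemble their training data can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the captioning models themselves are left out, and majority voting is only an assumed stand-in, since the abstract does not say which three output combination approaches were used. Boosting is omitted because it depends on per-sample errors from previously trained members.

```python
import random
from collections import Counter


def kfold_index_sets(n_samples, k):
    """Return k training index lists; ensemble member i leaves out fold i."""
    indices = list(range(n_samples))
    # Fold i holds every k-th sample starting at position i.
    folds = [indices[i::k] for i in range(k)]
    return [
        sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        for i in range(k)
    ]


def bootstrap_index_sets(n_samples, n_members, seed=0):
    """Return n_members index lists, each drawn with replacement (bagging)."""
    rng = random.Random(seed)
    return [
        [rng.randrange(n_samples) for _ in range(n_samples)]
        for _ in range(n_members)
    ]


def majority_vote(captions):
    """Assumed combiner: the most frequent caption wins; ties go to the
    earliest member in the list."""
    counts = Counter(captions)
    best = max(counts.values())
    return next(c for c in captions if counts[c] == best)
```

Each index list would be used to train one captioning model; at prediction time, the captions produced by the members for the same image would be passed to a combiner such as `majority_vote`.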

Original language	English (US)
Title of host publication	Proceedings - 14th IEEE International Conference on Semantic Computing, ICSC 2020
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	61-68
Number of pages	8
ISBN (Electronic)	9781728163321
DOIs	https://doi.org/10.1109/ICSC.2020.00016
State	Published - Feb 2020
Event	14th IEEE International Conference on Semantic Computing, ICSC 2020 - San Diego, United States Duration: Feb 3 2020 → Feb 5 2020

Publication series

Name	Proceedings - 14th IEEE International Conference on Semantic Computing, ICSC 2020

Conference

Conference	14th IEEE International Conference on Semantic Computing, ICSC 2020
Country/Territory	United States
City	San Diego
Period	2/3/20 → 2/5/20

Keywords

Boosting
Bootstrap aggregation
Deep neural networks
Ensemble learning
Image captioning
K-fold ensemble

ASJC Scopus subject areas

Artificial Intelligence
Computer Science Applications
Computer Vision and Pattern Recognition

Cite this

Katpally, H., & Bansal, A. (2020). Ensemble learning on deep neural networks for image caption generation. In Proceedings - 14th IEEE International Conference on Semantic Computing, ICSC 2020 (pp. 61-68). Article 9031513 (Proceedings - 14th IEEE International Conference on Semantic Computing, ICSC 2020). Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/ICSC.2020.00016

@inproceedings{89bfa97eba5348d2825c6475a2cd313c,

title = "Ensemble learning on deep neural networks for image caption generation",

keywords = "Boosting, Bootstrap aggregation, Deep neural networks, Ensemble learning, Image captioning, K-fold ensemble",

author = "Harshitha Katpally and Ajay Bansal",

note = "Publisher Copyright: {\textcopyright} 2020 IEEE.; 14th IEEE International Conference on Semantic Computing, ICSC 2020 ; Conference date: 03-02-2020 Through 05-02-2020",

year = "2020",

month = feb,

doi = "10.1109/ICSC.2020.00016",

language = "English (US)",

series = "Proceedings - 14th IEEE International Conference on Semantic Computing, ICSC 2020",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "61--68",

booktitle = "Proceedings - 14th IEEE International Conference on Semantic Computing, ICSC 2020",

}

TY - GEN

T1 - Ensemble learning on deep neural networks for image caption generation

AU - Katpally, Harshitha

AU - Bansal, Ajay

PY - 2020/2

Y1 - 2020/2

KW - Boosting

KW - Bootstrap aggregation

KW - Deep neural networks

KW - Ensemble learning

KW - Image captioning

KW - K-fold ensemble

UR - http://www.scopus.com/inward/record.url?scp=85083468511&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85083468511&partnerID=8YFLogxK

U2 - 10.1109/ICSC.2020.00016

DO - 10.1109/ICSC.2020.00016

M3 - Conference contribution

AN - SCOPUS:85083468511

T3 - Proceedings - 14th IEEE International Conference on Semantic Computing, ICSC 2020

SP - 61

EP - 68

BT - Proceedings - 14th IEEE International Conference on Semantic Computing, ICSC 2020

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 14th IEEE International Conference on Semantic Computing, ICSC 2020

Y2 - 3 February 2020 through 5 February 2020

ER -
