Ensemble learning on deep neural networks for image caption generation

Harshitha Katpally, Ajay Bansal

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Capturing the information in an image as a natural language sentence is considered a difficult problem for computers. Image captioning involves not just detecting objects in images but also understanding the interactions between those objects so that they can be translated into relevant captions. Expertise in computer vision paired with natural language processing is therefore crucial for this purpose. The sequence-to-sequence modelling strategy of deep neural networks is the traditional approach to generating a sequence of words that together describe the image. But these models suffer from high variance: they do not generalize well beyond the training data. The main focus of this paper is to reduce that variance, which will help in generating better captions. To achieve this, ensemble learning techniques have been explored, which have a reputation for addressing the high-variance problem that occurs in machine learning algorithms. Three different ensemble techniques, namely k-fold ensemble, bootstrap aggregation ensemble and boosting ensemble, have been evaluated in our work. For each of these techniques, three output combination approaches have been analyzed. Extensive experiments have been conducted on the Flickr8k dataset, which is a collection of 8,000 images with 5 different captions for every image. The BLEU score, which is considered a standard metric for evaluating natural language processing (NLP) output, is used to evaluate the predictions. Based on this metric, the analysis shows that ensemble learning performs significantly better and generates more meaningful captions than any of the individual models used.
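
As a rough illustration of how two of the ensemble techniques named in the abstract prepare training data, the sketch below builds k-fold splits and bootstrap samples over image indices. It is not the authors' implementation: the function names, the fold count of 5, the choice of 3 bagged models and the fixed seed are all assumptions made for the example. Boosting and the three output combination approaches are not shown, because the abstract does not describe them in enough detail.

```python
import random


def k_fold_splits(n_items, k, seed=0):
    """Yield (train_idx, held_out_idx) pairs, one per fold.

    Hypothetical helper: each ensemble member would be trained on
    `train_idx`, i.e. on all folds except one.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Striding by k gives k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out


def bootstrap_samples(n_items, n_models, seed=0):
    """Yield one index sample per model, drawn with replacement.

    Hypothetical helper: each sample has n_items entries, so some
    images repeat and others are left out of a given model's data.
    """
    rng = random.Random(seed)
    for _ in range(n_models):
        yield [rng.randrange(n_items) for _ in range(n_items)]


if __name__ == "__main__":
    n_images = 8000  # Flickr8k size as stated in the abstract

    for train, held_out in k_fold_splits(n_images, k=5):
        print("k-fold member:", len(train), "train,", len(held_out), "held out")

    for sample in bootstrap_samples(n_images, n_models=3):
        print("bagging member:", len(set(sample)), "distinct images of", len(sample))
```

In both cases every ensemble member sees a different view of the training set, which is what lets averaging or voting over their outputs reduce variance.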

Original language: English (US)
Title of host publication: Proceedings - 14th IEEE International Conference on Semantic Computing, ICSC 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 61-68
Number of pages: 8
ISBN (Electronic): 9781728163321
State: Published - Feb 2020
Externally published: Yes
Event: 14th IEEE International Conference on Semantic Computing, ICSC 2020 - San Diego, United States
Duration: Feb 3 2020 to Feb 5 2020

Publication series

Name: Proceedings - 14th IEEE International Conference on Semantic Computing, ICSC 2020

Conference

Conference: 14th IEEE International Conference on Semantic Computing, ICSC 2020
Country: United States
City: San Diego
Period: 2/3/20 to 2/5/20

Keywords

  • Boosting
  • Bootstrap aggregation
  • Deep neural networks
  • Ensemble learning
  • Image captioning
  • K-fold ensemble

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Computer Vision and Pattern Recognition
