Abstract

With the continuing boom of multimedia data, Zero-shot Learning (ZSL) techniques have attracted much attention in recent years for their ability to train learning models that can handle 'unseen' categories. Existing ZSL algorithms mainly take advantage of attribute-based semantic spaces and focus only on static image data. In addition, most ZSL studies merely consider the semantically embedded labels and fail to address the domain shift problem. In this paper, we propose a deep two-output model for video ZSL and action recognition tasks by computing both spatial and temporal features from video content through distinct Convolutional Neural Networks (CNNs) and training a Multi-layer Perceptron (MLP) upon the extracted features to map videos to semantic embedding word vectors. Moreover, we introduce a domain adaptation strategy named 'ConSSEV', which combines the outputs from two distinct output layers of our MLP to improve the results of zero-shot learning. Our experiments on the UCF101 dataset demonstrate that the proposed model has more advantages associated with more complex video embedding schemes, and outperforms state-of-the-art zero-shot learning techniques.
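The core zero-shot step the abstract describes — mapping a video into a semantic word-vector space and labeling it with the nearest unseen class — can be sketched as follows. This is a minimal illustration, not the paper's actual model: the class names, vectors, and function names are hypothetical placeholders, and the real method uses CNN features, an MLP, and the ConSSEV combination strategy.

```python
# Sketch of zero-shot classification by nearest neighbor in a semantic
# embedding space: the predicted label is the unseen class whose word
# vector is most similar (by cosine similarity) to the video's embedding.
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_classify(video_embedding, class_word_vectors):
    """Return the unseen class whose word vector is closest to the
    predicted video embedding."""
    return max(class_word_vectors,
               key=lambda c: cosine_sim(video_embedding, class_word_vectors[c]))

# Toy example with made-up 3-d "word vectors" for two unseen actions.
classes = {
    "surfing": np.array([0.9, 0.1, 0.0]),
    "typing":  np.array([0.0, 0.2, 0.9]),
}
pred = np.array([0.8, 0.2, 0.1])   # hypothetical MLP output for one video
print(zero_shot_classify(pred, classes))  # → surfing
```

Because the class word vectors come from a pretrained word embedding rather than training labels, the same classifier applies unchanged to categories never seen during training, which is what makes the approach zero-shot.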

Original language: English (US)
Title of host publication: 2016 IEEE International Conference on Image Processing, ICIP 2016 - Proceedings
Publisher: IEEE Computer Society
Pages: 4195-4199
Number of pages: 5
Volume: 2016-August
ISBN (Electronic): 9781467399616
DOIs: 10.1109/ICIP.2016.7533150
State: Published - Aug 3 2016
Event: 23rd IEEE International Conference on Image Processing, ICIP 2016 - Phoenix, United States
Duration: Sep 25 2016 - Sep 28 2016

Other

Other: 23rd IEEE International Conference on Image Processing, ICIP 2016
Country: United States
City: Phoenix
Period: 9/25/16 - 9/28/16


Keywords

  • Action recognition
  • Convolutional neural network
  • Multi-layer perceptron
  • Zero-shot learning

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition
  • Signal Processing

Cite this

Li, Y., Hu, S. H., & Li, B. (2016). Recognizing unseen actions in a domain-adapted embedding space. In 2016 IEEE International Conference on Image Processing, ICIP 2016 - Proceedings (Vol. 2016-August, pp. 4195-4199). [7533150] IEEE Computer Society. https://doi.org/10.1109/ICIP.2016.7533150

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

@inproceedings{7fd0dbc3e0144188a623e35c81ae875b,
title = "Recognizing unseen actions in a domain-adapted embedding space",
abstract = "With the continuing boom of multimedia data, Zero-shot Learning (ZSL) techniques have attracted much attention in recent years for their ability to train learning models that can handle 'unseen' categories. Existing ZSL algorithms mainly take advantage of attribute-based semantic spaces and focus only on static image data. In addition, most ZSL studies merely consider the semantically embedded labels and fail to address the domain shift problem. In this paper, we propose a deep two-output model for video ZSL and action recognition tasks by computing both spatial and temporal features from video content through distinct Convolutional Neural Networks (CNNs) and training a Multi-layer Perceptron (MLP) upon the extracted features to map videos to semantic embedding word vectors. Moreover, we introduce a domain adaptation strategy named 'ConSSEV', which combines the outputs from two distinct output layers of our MLP to improve the results of zero-shot learning. Our experiments on the UCF101 dataset demonstrate that the proposed model has more advantages associated with more complex video embedding schemes, and outperforms state-of-the-art zero-shot learning techniques.",
keywords = "Action recognition, Convolutional neural network, Multi-layer perceptron, Zero-shot learning",
author = "Yikang Li and Hu, {Sheng Hung} and Baoxin Li",
year = "2016",
month = "8",
day = "3",
doi = "10.1109/ICIP.2016.7533150",
language = "English (US)",
volume = "2016-August",
pages = "4195--4199",
booktitle = "2016 IEEE International Conference on Image Processing, ICIP 2016 - Proceedings",
publisher = "IEEE Computer Society",
address = "United States",

}

TY - GEN

T1 - Recognizing unseen actions in a domain-adapted embedding space

AU - Li, Yikang

AU - Hu, Sheng Hung

AU - Li, Baoxin

PY - 2016/8/3

Y1 - 2016/8/3

N2 - With the continuing boom of multimedia data, Zero-shot Learning (ZSL) techniques have attracted much attention in recent years for their ability to train learning models that can handle 'unseen' categories. Existing ZSL algorithms mainly take advantage of attribute-based semantic spaces and focus only on static image data. In addition, most ZSL studies merely consider the semantically embedded labels and fail to address the domain shift problem. In this paper, we propose a deep two-output model for video ZSL and action recognition tasks by computing both spatial and temporal features from video content through distinct Convolutional Neural Networks (CNNs) and training a Multi-layer Perceptron (MLP) upon the extracted features to map videos to semantic embedding word vectors. Moreover, we introduce a domain adaptation strategy named 'ConSSEV', which combines the outputs from two distinct output layers of our MLP to improve the results of zero-shot learning. Our experiments on the UCF101 dataset demonstrate that the proposed model has more advantages associated with more complex video embedding schemes, and outperforms state-of-the-art zero-shot learning techniques.

AB - With the continuing boom of multimedia data, Zero-shot Learning (ZSL) techniques have attracted much attention in recent years for their ability to train learning models that can handle 'unseen' categories. Existing ZSL algorithms mainly take advantage of attribute-based semantic spaces and focus only on static image data. In addition, most ZSL studies merely consider the semantically embedded labels and fail to address the domain shift problem. In this paper, we propose a deep two-output model for video ZSL and action recognition tasks by computing both spatial and temporal features from video content through distinct Convolutional Neural Networks (CNNs) and training a Multi-layer Perceptron (MLP) upon the extracted features to map videos to semantic embedding word vectors. Moreover, we introduce a domain adaptation strategy named 'ConSSEV', which combines the outputs from two distinct output layers of our MLP to improve the results of zero-shot learning. Our experiments on the UCF101 dataset demonstrate that the proposed model has more advantages associated with more complex video embedding schemes, and outperforms state-of-the-art zero-shot learning techniques.

KW - Action recognition

KW - Convolutional neural network

KW - Multi-layer perceptron

KW - Zero-shot learning

UR - http://www.scopus.com/inward/record.url?scp=85006826677&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85006826677&partnerID=8YFLogxK

U2 - 10.1109/ICIP.2016.7533150

DO - 10.1109/ICIP.2016.7533150

M3 - Conference contribution

VL - 2016-August

SP - 4195

EP - 4199

BT - 2016 IEEE International Conference on Image Processing, ICIP 2016 - Proceedings

PB - IEEE Computer Society

ER -