TY - GEN
T1 - Recognizing unseen actions in a domain-adapted embedding space
AU - Li, Yikang
AU - Hu, Sheng Hung
AU - Li, Baoxin
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2016/8/3
Y1 - 2016/8/3
N2 - With the sustained boom of multimedia data, Zero-shot Learning (ZSL) techniques have attracted much attention in recent years for their ability to train learning models that can handle 'unseen' categories. Existing ZSL algorithms mainly take advantage of attribute-based semantic spaces and focus only on static image data. In addition, most ZSL studies consider only the semantically embedded labels and fail to address the domain shift problem. In this paper, we propose a deep two-output model for video ZSL and action recognition tasks by computing both spatial and temporal features from video content through distinct Convolutional Neural Networks (CNNs) and training a Multi-layer Perceptron (MLP) on the extracted features to map videos to semantic embedding word vectors. Moreover, we introduce a domain adaptation strategy named 'ConSSEV', which combines the outputs from two distinct output layers of our MLP to improve the results of zero-shot learning. Our experiments on the UCF101 dataset demonstrate that the proposed model benefits from more complex video embedding schemes and outperforms state-of-the-art zero-shot learning techniques.
AB - With the sustained boom of multimedia data, Zero-shot Learning (ZSL) techniques have attracted much attention in recent years for their ability to train learning models that can handle 'unseen' categories. Existing ZSL algorithms mainly take advantage of attribute-based semantic spaces and focus only on static image data. In addition, most ZSL studies consider only the semantically embedded labels and fail to address the domain shift problem. In this paper, we propose a deep two-output model for video ZSL and action recognition tasks by computing both spatial and temporal features from video content through distinct Convolutional Neural Networks (CNNs) and training a Multi-layer Perceptron (MLP) on the extracted features to map videos to semantic embedding word vectors. Moreover, we introduce a domain adaptation strategy named 'ConSSEV', which combines the outputs from two distinct output layers of our MLP to improve the results of zero-shot learning. Our experiments on the UCF101 dataset demonstrate that the proposed model benefits from more complex video embedding schemes and outperforms state-of-the-art zero-shot learning techniques.
KW - Action recognition
KW - Convolutional neural network
KW - Multi-layer perceptron
KW - Zero-shot learning
UR - http://www.scopus.com/inward/record.url?scp=85006826677&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85006826677&partnerID=8YFLogxK
U2 - 10.1109/ICIP.2016.7533150
DO - 10.1109/ICIP.2016.7533150
M3 - Conference contribution
AN - SCOPUS:85006826677
T3 - Proceedings - International Conference on Image Processing, ICIP
SP - 4195
EP - 4199
BT - 2016 IEEE International Conference on Image Processing, ICIP 2016 - Proceedings
PB - IEEE Computer Society
T2 - 23rd IEEE International Conference on Image Processing, ICIP 2016
Y2 - 25 September 2016 through 28 September 2016
ER -