Abstract

We propose to learn semantic spatio-temporal embeddings for videos to support high-level video analysis. The first step of the proposed embedding employs a deep architecture consisting of two channels of convolutional neural networks (capturing appearance and local motion) followed by their corresponding Gated Recurrent Unit encoders for capturing longer-term temporal structure of the CNN features. The resultant spatio-temporal representation (a vector) is used to learn a mapping via a multilayer perceptron to the word2vec semantic embedding space, leading to a semantic interpretation of the video vector that supports high-level analysis. We demonstrate the usefulness and effectiveness of this new video representation by experiments on action recognition, zero-shot video classification, and 'word-to-video' retrieval, using the UCF-101 dataset.
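For illustration, below is a minimal sketch of the pipeline the abstract describes, written in PyTorch. It is not the authors' implementation: the feature dimensions, hidden size, word2vec dimensionality (300), and the use of the final GRU hidden states as the video vector are assumptions, and frame-level CNN features (appearance and motion channels) are taken as precomputed inputs.

import torch
import torch.nn as nn


class Video2Vec(nn.Module):
    """Two-channel GRU encoder + MLP mapping to a word2vec-sized space (illustrative sketch)."""

    def __init__(self, appearance_dim=2048, motion_dim=2048,
                 hidden_dim=512, word2vec_dim=300):
        super().__init__()
        # One GRU encoder per CNN channel to capture longer-term temporal structure.
        self.app_gru = nn.GRU(appearance_dim, hidden_dim, batch_first=True)
        self.mot_gru = nn.GRU(motion_dim, hidden_dim, batch_first=True)
        # MLP mapping the fused video vector into the semantic embedding space.
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, word2vec_dim),
        )

    def forward(self, app_feats, mot_feats):
        # app_feats, mot_feats: (batch, time, feature_dim) sequences of CNN features.
        _, h_app = self.app_gru(app_feats)   # final hidden state summarizes appearance
        _, h_mot = self.mot_gru(mot_feats)   # final hidden state summarizes local motion
        video_vec = torch.cat([h_app[-1], h_mot[-1]], dim=1)
        return self.mlp(video_vec)           # semantic video embedding


# Usage example: embed 4 videos of 16 time steps each, then score them against a
# class-name word2vec vector by cosine similarity (zero-shot-classification style).
model = Video2Vec()
app = torch.randn(4, 16, 2048)
mot = torch.randn(4, 16, 2048)
emb = model(app, mot)                        # shape (4, 300)
class_vec = torch.randn(300)                 # placeholder for a real word2vec vector
scores = torch.cosine_similarity(emb, class_vec.unsqueeze(0), dim=1)

The same embedding supports the three tasks mentioned in the abstract: nearest-neighbor matching against class-name vectors gives recognition and zero-shot classification, and ranking videos by similarity to a query word vector gives 'word-to-video' retrieval.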

Original language: English (US)
Title of host publication: 2016 23rd International Conference on Pattern Recognition, ICPR 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 811-816
Number of pages: 6
ISBN (Electronic): 9781509048472
DOI: 10.1109/ICPR.2016.7899735
State: Published - Apr 13, 2017
Event: 23rd International Conference on Pattern Recognition, ICPR 2016 - Cancun, Mexico
Duration: Dec 4, 2016 - Dec 8, 2016

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition

Cite this

Hu, S. H., Li, Y., & Li, B. (2017). Video2vec: Learning semantic spatio-temporal embeddings for video representation. In 2016 23rd International Conference on Pattern Recognition, ICPR 2016 (pp. 811-816). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICPR.2016.7899735
