Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

Zhigang Tu; Hongyan Li; Dejun Zhang; Justin Dauwels; Baoxin Li; Junsong Yuan

doi:10.1109/TIP.2018.2890749

Research output: Contribution to journal › Article › peer-review

129 Scopus citations

Abstract

Despite outstanding performance in image recognition, convolutional neural networks (CNNs) do not yet achieve the same impressive results on action recognition in videos. This is partially due to the inability of CNNs to model long-range temporal structures, especially those involving individual action stages that are critical to human action recognition. In this paper, we propose a novel action-stage (ActionS) emphasized spatiotemporal vector of locally aggregated descriptors (ActionS-ST-VLAD) method to aggregate informative deep features across the entire video according to adaptive video feature segmentation and adaptive segment feature sampling (AVFS-ASFS). In our ActionS-ST-VLAD encoding approach, by using AVFS-ASFS, the keyframe features are chosen and the corresponding deep features are automatically split into segments with the features in each segment belonging to a temporally coherent ActionS. Then, based on the extracted keyframe feature in each segment, a flow-guided warping technique is introduced to detect and discard redundant feature maps, while the informative ones are aggregated by using our exploited similarity weight. Furthermore, we exploit an RGBF modality to capture motion-salient regions in the RGB images corresponding to action activity. Extensive experiments are conducted on four public benchmarks (HMDB51, UCF101, Kinetics, and ActivityNet) for evaluation. Results show that our method is able to effectively pool useful deep features spatiotemporally, leading to state-of-the-art performance for video-based action recognition.
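For readers unfamiliar with the VLAD encoding that the abstract builds on, the following is a minimal sketch of plain VLAD aggregation with an optional per-descriptor weight. It is not the authors' ActionS-ST-VLAD: the segmentation, keyframe sampling, flow-guided warping, and the specific similarity weight are not described in enough detail here to reproduce. The function name `vlad_encode`, its parameters, and the hard nearest-center assignment are assumptions chosen for illustration.

```python
import math


def vlad_encode(features, centers, weights=None):
    """Aggregate local descriptors into one L2-normalized VLAD vector.

    features: list of N descriptors, each a list of D floats
    centers:  list of K codebook centers (e.g. from k-means), each of length D
    weights:  optional list of N per-descriptor weights (hypothetical stand-in
              for a similarity weight); uniform weights are used if omitted
    """
    if weights is None:
        weights = [1.0] * len(features)
    dim = len(centers[0])
    residual_sums = [[0.0] * dim for _ in centers]

    for x, w in zip(features, weights):
        # Hard-assign the descriptor to its nearest center
        # (squared Euclidean distance).
        k = min(
            range(len(centers)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])),
        )
        # Accumulate the weighted residual for that center.
        for t in range(dim):
            residual_sums[k][t] += w * (x[t] - centers[k][t])

    # Concatenate the per-center sums and L2-normalize (skip if all zeros).
    flat = [v for row in residual_sums for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


if __name__ == "__main__":
    # Toy data: three 2-D descriptors and a two-center codebook.
    descriptors = [[1.0, 0.0], [0.0, 1.0], [10.0, 12.0]]
    codebook = [[0.0, 0.0], [10.0, 10.0]]
    print(vlad_encode(descriptors, codebook))
    print(vlad_encode(descriptors, codebook, weights=[2.0, 1.0, 1.0]))
```

The output has length K × D. Raising a descriptor's weight increases its share of the residual sum for its assigned center, which is the general idea behind emphasizing informative features over redundant ones.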

Original language	English (US)
Article number	8600333
Pages (from-to)	2799-2812
Number of pages	14
Journal	IEEE Transactions on Image Processing
Volume	28
Issue number	6
DOIs	https://doi.org/10.1109/TIP.2018.2890749
State	Published - Jun 2019

Keywords

Action recognition
ActionS-ST-VLAD
adaptive feature sampling
adaptive video feature segmentation
feature encoding

ASJC Scopus subject areas

Software
Computer Graphics and Computer-Aided Design

Access to Document

10.1109/TIP.2018.2890749

Cite this

@article{bfed440fb7e14b6c82cda4c245c75eec,

title = "Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition",

abstract = "Despite outstanding performance in image recognition, convolutional neural networks (CNNs) do not yet achieve the same impressive results on action recognition in videos. This is partially due to the inability of CNNs to model long-range temporal structures, especially those involving individual action stages that are critical to human action recognition. In this paper, we propose a novel action-stage (ActionS) emphasized spatiotemporal vector of locally aggregated descriptors (ActionS-ST-VLAD) method to aggregate informative deep features across the entire video according to adaptive video feature segmentation and adaptive segment feature sampling (AVFS-ASFS). In our ActionS-ST-VLAD encoding approach, by using AVFS-ASFS, the keyframe features are chosen and the corresponding deep features are automatically split into segments with the features in each segment belonging to a temporally coherent ActionS. Then, based on the extracted keyframe feature in each segment, a flow-guided warping technique is introduced to detect and discard redundant feature maps, while the informative ones are aggregated by using our exploited similarity weight. Furthermore, we exploit an RGBF modality to capture motion-salient regions in the RGB images corresponding to action activity. Extensive experiments are conducted on four public benchmarks (HMDB51, UCF101, Kinetics, and ActivityNet) for evaluation. Results show that our method is able to effectively pool useful deep features spatiotemporally, leading to state-of-the-art performance for video-based action recognition.",

keywords = "Action recognition, ActionS-ST-VLAD, adaptive feature sampling, adaptive video feature segmentation, feature encoding",

author = "Zhigang Tu and Hongyan Li and Dejun Zhang and Justin Dauwels and Baoxin Li and Junsong Yuan",

note = "Funding Information: Manuscript received July 8, 2018; revised November 12, 2018 and December 17, 2018; accepted December 26, 2018. Date of publication January 3, 2019; date of current version March 21, 2019. This work was supported in part by Wuhan University under Grant CXFW-18-413100063, in part by the National Key Research and Development Program of China under Grant 2016YFF0103501, in part by the Natural Science Foundation of China (NSFC) under Grant 61572012, and in part by the Natural Science Fund of Hubei Province under Grants 2017CFB598 and 2017CFB677. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Xiaochun Cao. (Corresponding author: Hongyan Li.) Z. Tu is with the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China (e-mail: tuzhigang@whu.edu.cn). Publisher Copyright: {\textcopyright} 1992-2012 IEEE.",

year = "2019",

month = jun,

doi = "10.1109/TIP.2018.2890749",

language = "English (US)",

volume = "28",

pages = "2799--2812",

journal = "IEEE Transactions on Image Processing",

issn = "1057-7149",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

number = "6",

}

TY - JOUR

T1 - Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

AU - Tu, Zhigang

AU - Li, Hongyan

AU - Zhang, Dejun

AU - Dauwels, Justin

AU - Li, Baoxin

AU - Yuan, Junsong

N1 - Funding Information: Manuscript received July 8, 2018; revised November 12, 2018 and December 17, 2018; accepted December 26, 2018. Date of publication January 3, 2019; date of current version March 21, 2019. This work was supported in part by Wuhan University under Grant CXFW-18-413100063, in part by the National Key Research and Development Program of China under Grant 2016YFF0103501, in part by the Natural Science Foundation of China (NSFC) under Grant 61572012, and in part by the Natural Science Fund of Hubei Province under Grants 2017CFB598 and 2017CFB677. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Xiaochun Cao. (Corresponding author: Hongyan Li.) Z. Tu is with the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China (e-mail: tuzhigang@whu.edu.cn). Publisher Copyright: © 1992-2012 IEEE.

PY - 2019/6

Y1 - 2019/6

N2 - Despite outstanding performance in image recognition, convolutional neural networks (CNNs) do not yet achieve the same impressive results on action recognition in videos. This is partially due to the inability of CNNs to model long-range temporal structures, especially those involving individual action stages that are critical to human action recognition. In this paper, we propose a novel action-stage (ActionS) emphasized spatiotemporal vector of locally aggregated descriptors (ActionS-ST-VLAD) method to aggregate informative deep features across the entire video according to adaptive video feature segmentation and adaptive segment feature sampling (AVFS-ASFS). In our ActionS-ST-VLAD encoding approach, by using AVFS-ASFS, the keyframe features are chosen and the corresponding deep features are automatically split into segments with the features in each segment belonging to a temporally coherent ActionS. Then, based on the extracted keyframe feature in each segment, a flow-guided warping technique is introduced to detect and discard redundant feature maps, while the informative ones are aggregated by using our exploited similarity weight. Furthermore, we exploit an RGBF modality to capture motion-salient regions in the RGB images corresponding to action activity. Extensive experiments are conducted on four public benchmarks (HMDB51, UCF101, Kinetics, and ActivityNet) for evaluation. Results show that our method is able to effectively pool useful deep features spatiotemporally, leading to state-of-the-art performance for video-based action recognition.

AB - Despite outstanding performance in image recognition, convolutional neural networks (CNNs) do not yet achieve the same impressive results on action recognition in videos. This is partially due to the inability of CNNs to model long-range temporal structures, especially those involving individual action stages that are critical to human action recognition. In this paper, we propose a novel action-stage (ActionS) emphasized spatiotemporal vector of locally aggregated descriptors (ActionS-ST-VLAD) method to aggregate informative deep features across the entire video according to adaptive video feature segmentation and adaptive segment feature sampling (AVFS-ASFS). In our ActionS-ST-VLAD encoding approach, by using AVFS-ASFS, the keyframe features are chosen and the corresponding deep features are automatically split into segments with the features in each segment belonging to a temporally coherent ActionS. Then, based on the extracted keyframe feature in each segment, a flow-guided warping technique is introduced to detect and discard redundant feature maps, while the informative ones are aggregated by using our exploited similarity weight. Furthermore, we exploit an RGBF modality to capture motion-salient regions in the RGB images corresponding to action activity. Extensive experiments are conducted on four public benchmarks (HMDB51, UCF101, Kinetics, and ActivityNet) for evaluation. Results show that our method is able to effectively pool useful deep features spatiotemporally, leading to state-of-the-art performance for video-based action recognition.

KW - Action recognition

KW - ActionS-ST-VLAD

KW - adaptive feature sampling

KW - adaptive video feature segmentation

KW - feature encoding

UR - http://www.scopus.com/inward/record.url?scp=85063468385&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85063468385&partnerID=8YFLogxK

U2 - 10.1109/TIP.2018.2890749

DO - 10.1109/TIP.2018.2890749

M3 - Article

AN - SCOPUS:85063468385

SN - 1057-7149

VL - 28

SP - 2799

EP - 2812

JO - IEEE Transactions on Image Processing

JF - IEEE Transactions on Image Processing

IS - 6

M1 - 8600333

ER -
