TY - JOUR
T1 - Semantic Cues Enhanced Multimodality Multistream CNN for Action Recognition
AU - Tu, Zhigang
AU - Xie, Wei
AU - Dauwels, Justin
AU - Li, Baoxin
AU - Yuan, Junsong
N1 - Funding Information:
Manuscript received January 17, 2018; revised April 1, 2018; accepted April 21, 2018. Date of publication April 25, 2018; date of current version May 3, 2019. This work was supported in part by the Singapore Ministry of Education Academic Research Fund Tier 2 under Grant MOE2015-T2-2-114, in part by the National Natural Science Foundation of China under Grant 61501198, in part by the Natural Science Foundation of Hubei Province under Grant 2014CFB461, and in part by the University at Buffalo. This paper was recommended by Associate Editor G.-J. Qi. (Corresponding author: Zhigang Tu.) Z. Tu is with the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China (e-mail: tuzhigang1986@gmail.com).
Publisher Copyright:
© 1991-2012 IEEE.
PY - 2019/5
Y1 - 2019/5
AB - This paper addresses video-based action recognition by exploiting an advanced multistream convolutional neural network (CNN) to fully use semantics-derived multiple modalities in both the spatial (appearance) and temporal (motion) domains, since the performance of CNN-based action recognition methods depends heavily on two factors: semantic visual cues and the network architecture. Our work consists of two major parts. First, to extract useful human-related semantics accurately, we propose a novel spatiotemporal saliency-based video object segmentation (STS) model. By fusing distinctive saliency maps, computed from the object signatures of complementary object detection approaches, a refined STS map can be obtained; in this way, various challenges in realistic videos can be handled jointly. Based on the estimated saliency maps, an energy function is constructed to segment two semantic cues: the actor and one distinctive acting part of the actor. Second, we modify the architecture of the two-stream network (TS-Net) to design a multistream network consisting of three TS-Nets, one for each extracted semantic cue, which is able to exploit deeper abstract visual features of multiple modalities at multiple spatiotemporal scales. Importantly, action recognition performance is significantly boosted when the captured human-related semantics are integrated into our framework. Experiments on four public benchmarks (JHMDB, HMDB51, UCF-Sports, and UCF101) demonstrate that the proposed method outperforms state-of-the-art algorithms.
KW - Action recognition
KW - multi-modalities
KW - multi-stream CNN
KW - semantic cues
KW - spatiotemporal saliency estimation
KW - video object detection
UR - http://www.scopus.com/inward/record.url?scp=85045994931&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85045994931&partnerID=8YFLogxK
DO - 10.1109/TCSVT.2018.2830102
M3 - Article
AN - SCOPUS:85045994931
SN - 1051-8215
VL - 29
SP - 1423
EP - 1437
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 5
M1 - 8347006
ER -