Abstract

This paper addresses video-based action recognition by exploiting an advanced multi-stream Convolutional Neural Network (CNN) to fully use semantics-derived multiple modalities in both the spatial (appearance) and temporal (motion) domains, since the performance of CNN-based action recognition methods depends heavily on two factors: semantic visual cues and the network architecture. Our work consists of two major parts. First, to extract useful human-related semantics accurately, we propose a novel spatiotemporal-saliency-based video object segmentation (STS-VOS) model. By fusing distinctive saliency maps, which are computed from the object signatures of complementary object detection approaches, a refined spatiotemporal saliency map can be obtained; in this way, the various challenges of realistic videos can be handled jointly. Based on the estimated saliency maps, an energy function is constructed to segment two semantic cues: the actor and one distinctive acting part of the actor. Second, we modify the architecture of the two-stream network (TS-Net) to design a multi-stream network (MS-Net) that consists of three TS-Nets, one per extracted semantic cue, and is thus able to exploit deeper abstract visual features of multiple modalities at multiple spatiotemporal scales. Importantly, the performance of action recognition is significantly boosted when the captured human-related semantics are integrated into our framework. Experiments on four public benchmarks (JHMDB, HMDB51, UCF Sports, and UCF101) demonstrate that the proposed method outperforms state-of-the-art algorithms.
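
To make the MS-Net composition described above concrete, the sketch below is an illustrative example only, not the authors' implementation: the tiny backbone, the 10-frame optical-flow stack, the equal fusion weights, and the late score-fusion rule are assumptions chosen for the example. It assembles three two-stream sub-networks, one for each semantic input (full frame, segmented actor, and segmented acting part), and fuses their class scores by a weighted sum.

# Illustrative sketch only (not the code released with the paper): a multi-stream
# network assembled from three two-stream sub-networks, one per semantic cue.
# Backbone, flow-stack depth, and fusion weights are assumptions for this example.
import torch
import torch.nn as nn


class StreamCNN(nn.Module):
    """One stream: a small CNN over either an RGB frame or a stack of optical flow fields."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))


class TwoStreamNet(nn.Module):
    """TS-Net: a spatial (appearance) stream plus a temporal (motion) stream, scores averaged."""

    def __init__(self, num_classes: int, flow_stack: int = 10):
        super().__init__()
        self.spatial = StreamCNN(3, num_classes)                # RGB frame
        self.temporal = StreamCNN(2 * flow_stack, num_classes)  # stacked (u, v) flow fields

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        return 0.5 * (self.spatial(rgb) + self.temporal(flow))


class MultiStreamNet(nn.Module):
    """MS-Net sketch: three TS-Nets fed with the full frame, the segmented actor,
    and the segmented acting part; their class scores are fused by a weighted sum."""

    def __init__(self, num_classes: int, weights=(1.0, 1.0, 1.0)):
        super().__init__()
        self.branches = nn.ModuleList([TwoStreamNet(num_classes) for _ in range(3)])
        self.weights = weights

    def forward(self, inputs):
        # inputs: three (rgb, flow) pairs, one per semantic cue
        scores = [w * branch(rgb, flow)
                  for w, branch, (rgb, flow) in zip(self.weights, self.branches, inputs)]
        return torch.stack(scores).sum(dim=0)


if __name__ == "__main__":
    net = MultiStreamNet(num_classes=21)      # e.g., the 21 JHMDB action classes
    rgb = torch.randn(1, 3, 112, 112)
    flow = torch.randn(1, 20, 112, 112)       # 10 stacked flow fields, 2 channels each
    logits = net([(rgb, flow)] * 3)           # identical inputs reused here only to check shapes
    print(logits.shape)                       # torch.Size([1, 21])

Late score fusion is used here purely for simplicity; a faithful reproduction would need to follow the paper's choices of backbone, input preparation from the segmented semantic cues, and branch weighting.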

Original language: English (US)
Journal: IEEE Transactions on Circuits and Systems for Video Technology
DOI: 10.1109/TCSVT.2018.2830102
State: Accepted/In press - Apr 24 2018

Keywords

  • Action Recognition
  • Feature extraction
  • Multi-Modalities
  • Multi-stream CNN
  • Object detection
  • Object segmentation
  • Semantic Cues
  • Semantics
  • Spatiotemporal phenomena
  • Spatiotemporal Saliency Estimation
  • Streaming media
  • Video Object Detection
  • Visualization

ASJC Scopus subject areas

  • Media Technology
  • Electrical and Electronic Engineering

Cite this

Semantic Cues Enhanced Multi-modality Multi-Stream CNN for Action Recognition. / Tu, Zhigang; Xie, Wei; Dauwels, Justin; Li, Baoxin; Yuan, Junsong.

In: IEEE Transactions on Circuits and Systems for Video Technology, 24.04.2018.

Research output: Contribution to journal › Article

@article{b6363a73f0f34ab8a5bbd2e9e4409609,
title = "Semantic Cues Enhanced Multi-modality Multi-Stream CNN for Action Recognition",
author = "Zhigang Tu and Wei Xie and Justin Dauwels and Baoxin Li and Junsong Yuan",
year = "2018",
month = "4",
day = "24",
doi = "10.1109/TCSVT.2018.2830102",
language = "English (US)",
journal = "IEEE Transactions on Circuits and Systems for Video Technology",
issn = "1051-8215",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}
