Abstract

In recent years, the most popular video-based human action recognition methods have relied on extracting feature representations using Convolutional Neural Networks (CNNs) and then using these representations to classify actions. In this work, we propose a fast and accurate video representation derived from the motion-salient region (MSR), which captures the features most useful for action labeling. By improving a well-performing foreground detection technique, the region of interest (ROI) corresponding to actors in the foreground can be detected in both the appearance and the motion field under various realistic challenges. Furthermore, we propose a complementary motion-saliency measure to select a secondary ROI - the major moving part of the human body. Accordingly, an MSR-based CNN descriptor (MSR-CNN) is formulated to recognize human actions, where the descriptor incorporates appearance and motion features along with tracks of the MSR. The computation is efficient for two reasons: 1) only part of the RGB image and the motion field needs to be processed; 2) less data is used as input for CNN feature extraction. Comparative evaluation on the JHMDB and UCF Sports datasets shows that our method outperforms the state of the art in both efficiency and accuracy.
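The record does not include the paper's implementation, but the core idea the abstract describes - restrict CNN feature extraction to a motion-salient ROI in both the RGB frame and the optical-flow field, then combine the two - can be illustrated with a minimal NumPy sketch. Everything here is hypothetical: `crop_roi`, `msr_descriptor`, the ROI box, and the toy mean-pooling function standing in for a CNN backbone are illustration only, not the authors' method.

```python
import numpy as np

def crop_roi(array, box):
    """Crop an (H, W, C) array to a bounding box given as (x, y, w, h)."""
    x, y, w, h = box
    return array[y:y + h, x:x + w]

def msr_descriptor(frame, flow, roi, feature_extractor):
    """Extract features only inside the motion-salient region, from both
    the appearance (RGB) and motion (optical-flow) channels, and
    concatenate them into a single descriptor."""
    appearance = feature_extractor(crop_roi(frame, roi))
    motion = feature_extractor(crop_roi(flow, roi))
    return np.concatenate([appearance, motion])

# Toy stand-in for a CNN backbone: per-channel mean pooling over the patch.
def toy_cnn(patch):
    return patch.mean(axis=(0, 1))

frame = np.random.rand(240, 320, 3)  # one RGB video frame
flow = np.random.rand(240, 320, 2)   # dense optical flow (u, v)
roi = (100, 60, 64, 128)             # hypothetical (x, y, w, h) MSR box

descriptor = msr_descriptor(frame, flow, roi, toy_cnn)
print(descriptor.shape)  # (5,): 3 appearance channels + 2 flow channels
```

The efficiency claim in the abstract follows directly from this structure: only the ROI crop, not the full frame and flow field, is passed to the feature extractor.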

Original language: English (US)
Title of host publication: 2016 23rd International Conference on Pattern Recognition, ICPR 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 3524-3529
Number of pages: 6
ISBN (Electronic): 9781509048472
DOI: 10.1109/ICPR.2016.7900180
State: Published - Apr 13 2017
Event: 23rd International Conference on Pattern Recognition, ICPR 2016 - Cancun, Mexico
Duration: Dec 4 2016 - Dec 8 2016



Keywords

  • Action recognition
  • Convolutional Neural Networks
  • Motion salient regions

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition

Cite this

Tu, Z., Cao, J., Li, Y., & Li, B. (2017). MSR-CNN: Applying motion salient region based descriptors for action recognition. In 2016 23rd International Conference on Pattern Recognition, ICPR 2016 (pp. 3524-3529). [7900180] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICPR.2016.7900180


Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

@inproceedings{b2d590e2444e4c8db8b12bd63992d4f0,
title = "MSR-CNN: Applying motion salient region based descriptors for action recognition",
abstract = "In recent years, the most popular video-based human action recognition methods have relied on extracting feature representations using Convolutional Neural Networks (CNNs) and then using these representations to classify actions. In this work, we propose a fast and accurate video representation derived from the motion-salient region (MSR), which captures the features most useful for action labeling. By improving a well-performing foreground detection technique, the region of interest (ROI) corresponding to actors in the foreground can be detected in both the appearance and the motion field under various realistic challenges. Furthermore, we propose a complementary motion-saliency measure to select a secondary ROI - the major moving part of the human body. Accordingly, an MSR-based CNN descriptor (MSR-CNN) is formulated to recognize human actions, where the descriptor incorporates appearance and motion features along with tracks of the MSR. The computation is efficient for two reasons: 1) only part of the RGB image and the motion field needs to be processed; 2) less data is used as input for CNN feature extraction. Comparative evaluation on the JHMDB and UCF Sports datasets shows that our method outperforms the state of the art in both efficiency and accuracy.",
keywords = "Action recognition, Convolutional Neural Networks, Motion salient regions",
author = "Zhigang Tu and Jun Cao and Yikang Li and Baoxin Li",
year = "2017",
month = "4",
day = "13",
doi = "10.1109/ICPR.2016.7900180",
language = "English (US)",
pages = "3524--3529",
booktitle = "2016 23rd International Conference on Pattern Recognition, ICPR 2016",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
address = "United States",
isbn = "9781509048472",
}

TY - GEN

T1 - MSR-CNN: Applying motion salient region based descriptors for action recognition

AU - Tu, Zhigang

AU - Cao, Jun

AU - Li, Yikang

AU - Li, Baoxin

PY - 2017/4/13

Y1 - 2017/4/13

AB - In recent years, the most popular video-based human action recognition methods have relied on extracting feature representations using Convolutional Neural Networks (CNNs) and then using these representations to classify actions. In this work, we propose a fast and accurate video representation derived from the motion-salient region (MSR), which captures the features most useful for action labeling. By improving a well-performing foreground detection technique, the region of interest (ROI) corresponding to actors in the foreground can be detected in both the appearance and the motion field under various realistic challenges. Furthermore, we propose a complementary motion-saliency measure to select a secondary ROI - the major moving part of the human body. Accordingly, an MSR-based CNN descriptor (MSR-CNN) is formulated to recognize human actions, where the descriptor incorporates appearance and motion features along with tracks of the MSR. The computation is efficient for two reasons: 1) only part of the RGB image and the motion field needs to be processed; 2) less data is used as input for CNN feature extraction. Comparative evaluation on the JHMDB and UCF Sports datasets shows that our method outperforms the state of the art in both efficiency and accuracy.

KW - Action recognition

KW - Convolutional Neural Networks

KW - Motion salient regions

UR - http://www.scopus.com/inward/record.url?scp=85019149204&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85019149204&partnerID=8YFLogxK

DO - 10.1109/ICPR.2016.7900180

M3 - Conference contribution

SP - 3524

EP - 3529

BT - 2016 23rd International Conference on Pattern Recognition, ICPR 2016

PB - Institute of Electrical and Electronics Engineers Inc.

ER -