ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language

Zhe Wang; Zhiyuan Fang; Jun Wang; Yezhou Yang

doi:10.1007/978-3-030-58610-2_24

Engineering, Ira A. Fulton Schools of (IAFSE)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

74 Scopus citations

Abstract

Person search by natural language aims to retrieve a specific person from a large-scale image pool given a textual description. While most current methods treat the task as holistic matching of visual and textual features, we approach it from an attribute-alignment perspective that grounds specific attribute phrases to the corresponding visual regions. This yields more robust feature learning and a performance boost, because the referred identity can be accurately pinned down by multiple attribute cues. Concretely, our Visual-Textual Attribute Alignment model (ViTAA) learns to disentangle the feature space of a person into sub-spaces corresponding to attributes using a light auxiliary attribute segmentation layer. It then aligns these visual features with the textual attributes parsed from the sentences via a novel contrastive learning loss. We validate the ViTAA framework through extensive experiments on person search by natural language and by attribute-phrase queries, on which our system achieves state-of-the-art performance. Code and models are available at https://github.com/Jarr0d/ViTAA.
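
The alignment step described in the abstract can be illustrated with a small sketch. This is not the loss proposed in the paper, which the abstract does not specify; it swaps in a generic InfoNCE-style contrastive objective applied separately to each attribute sub-space. The function names, the tensor layout (batch, attributes, dimension), and the temperature value are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def attribute_alignment_loss(visual, textual, temperature=0.1):
    """Generic InfoNCE-style loss per attribute (hypothetical, not the ViTAA loss).

    visual, textual: arrays of shape (batch, num_attributes, dim), where row i
    of each array is assumed to describe the same person.
    """
    v = l2_normalize(visual)
    t = l2_normalize(textual)
    # sim[k, i, j]: similarity of image i and text j within attribute sub-space k
    sim = np.einsum("ikd,jkd->kij", v, t) / temperature
    # Numerically stable log-softmax over the text candidates j
    sim = sim - sim.max(axis=2, keepdims=True)
    log_prob = sim - np.log(np.exp(sim).sum(axis=2, keepdims=True))
    # Matching image/text pairs sit on the diagonal of each (i, j) slice
    idx = np.arange(visual.shape[0])
    positives = log_prob[:, idx, idx]  # shape (num_attributes, batch)
    return float(-positives.mean())


rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3, 8))  # 4 people, 3 attributes, 8-dim features
aligned = attribute_alignment_loss(features, features)
shuffled = attribute_alignment_loss(features, features[::-1])
print(aligned < shuffled)
```

Under these assumptions, the loss is lower when each person's textual attribute features match the visual features of the same attribute, and higher when the text rows are paired with the wrong person.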

Original language	English (US)
Title of host publication	Computer Vision – ECCV 2020 - 16th European Conference, Proceedings
Editors	Andrea Vedaldi, Horst Bischof, Thomas Brox, Jan-Michael Frahm
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	402-420
Number of pages	19
ISBN (Print)	9783030586096
DOIs	https://doi.org/10.1007/978-3-030-58610-2_24
State	Published - 2020
Event	16th European Conference on Computer Vision, ECCV 2020 - Glasgow, United Kingdom; Duration: Aug 23 2020 → Aug 28 2020

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	12357 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	16th European Conference on Computer Vision, ECCV 2020
Country/Territory	United Kingdom
City	Glasgow
Period	8/23/20 → 8/28/20

Keywords

Metric learning
Person re-identification
Person search by natural language
Vision and language

ASJC Scopus subject areas

Theoretical Computer Science
General Computer Science

Access to Document

10.1007/978-3-030-58610-2_24

Cite this

Wang, Z., Fang, Z., Wang, J., & Yang, Y. (2020). ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language. In A. Vedaldi, H. Bischof, T. Brox, & J.-M. Frahm (Eds.), Computer Vision – ECCV 2020 - 16th European Conference, Proceedings (pp. 402-420). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12357 LNCS). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-58610-2_24

ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language. / Wang, Zhe; Fang, Zhiyuan; Wang, Jun et al.
Computer Vision – ECCV 2020 - 16th European Conference, Proceedings. ed. / Andrea Vedaldi; Horst Bischof; Thomas Brox; Jan-Michael Frahm. Springer Science and Business Media Deutschland GmbH, 2020. p. 402-420 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12357 LNCS).

Wang, Z, Fang, Z, Wang, J & Yang, Y 2020, ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language. in A Vedaldi, H Bischof, T Brox & J-M Frahm (eds), Computer Vision – ECCV 2020 - 16th European Conference, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 12357 LNCS, Springer Science and Business Media Deutschland GmbH, pp. 402-420, 16th European Conference on Computer Vision, ECCV 2020, Glasgow, United Kingdom, 8/23/20. https://doi.org/10.1007/978-3-030-58610-2_24

Wang Z, Fang Z, Wang J, Yang Y. ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language. In Vedaldi A, Bischof H, Brox T, Frahm JM, editors, Computer Vision – ECCV 2020 - 16th European Conference, Proceedings. Springer Science and Business Media Deutschland GmbH. 2020. p. 402-420. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-030-58610-2_24

Wang, Zhe ; Fang, Zhiyuan ; Wang, Jun et al. / ViTAA : Visual-Textual Attributes Alignment in Person Search by Natural Language. Computer Vision – ECCV 2020 - 16th European Conference, Proceedings. editor / Andrea Vedaldi ; Horst Bischof ; Thomas Brox ; Jan-Michael Frahm. Springer Science and Business Media Deutschland GmbH, 2020. pp. 402-420 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{c8310fc0d0bd4036afadb6d5bb0b161e,

title = "ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language",

abstract = "Person search by natural language aims at retrieving a specific person in a large-scale image pool that matches given textual descriptions. While most of the current methods treat the task as a holistic visual and textual feature matching one, we approach it from an attribute-aligning perspective that allows grounding specific attribute phrases to the corresponding visual regions. We achieve success as well as a performance boost by a robust feature learning that the referred identity can be accurately bundled by multiple attribute cues. To be concrete, our Visual-Textual Attribute Alignment model (dubbed as ViTAA) learns to disentangle the feature space of a person into sub-spaces corresponding to attributes using a light auxiliary attribute segmentation layer. It then aligns these visual features with the textual attributes parsed from the sentences via a novel contrastive learning loss. We validate our ViTAA framework through extensive experiments on tasks of person search by natural language and by attribute-phrase queries, on which our system achieves state-of-the-art performances. Codes and models are available at https://github.com/Jarr0d/ViTAA.",

keywords = "Metric learning, Person re-identification, Person search by natural language, Vision and language",

author = "Zhe Wang and Zhiyuan Fang and Jun Wang and Yezhou Yang",

note = "Funding Information: Acknowledgements. Visiting scholarship support for Z. Wang from the China Scholarship Council #201806020020 and Amazon AWS Machine Learning Research Award (MLRA) support are greatly appreciated. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of the sponsors. Publisher Copyright: {\textcopyright} 2020, Springer Nature Switzerland AG.; 16th European Conference on Computer Vision, ECCV 2020 ; Conference date: 23-08-2020 Through 28-08-2020",

year = "2020",

doi = "10.1007/978-3-030-58610-2_24",

language = "English (US)",

isbn = "9783030586096",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Science and Business Media Deutschland GmbH",

pages = "402--420",

editor = "Andrea Vedaldi and Horst Bischof and Thomas Brox and Jan-Michael Frahm",

booktitle = "Computer Vision – ECCV 2020 - 16th European Conference, Proceedings",

address = "Germany",

}

TY - GEN

T1 - ViTAA

T2 - 16th European Conference on Computer Vision, ECCV 2020

AU - Wang, Zhe

AU - Fang, Zhiyuan

AU - Wang, Jun

AU - Yang, Yezhou

N1 - Funding Information: Acknowledgements. Visiting scholarship support for Z. Wang from the China Scholarship Council #201806020020 and Amazon AWS Machine Learning Research Award (MLRA) support are greatly appreciated. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of the sponsors. Publisher Copyright: © 2020, Springer Nature Switzerland AG.

PY - 2020

Y1 - 2020

AB - Person search by natural language aims at retrieving a specific person in a large-scale image pool that matches given textual descriptions. While most of the current methods treat the task as a holistic visual and textual feature matching one, we approach it from an attribute-aligning perspective that allows grounding specific attribute phrases to the corresponding visual regions. We achieve success as well as a performance boost by a robust feature learning that the referred identity can be accurately bundled by multiple attribute cues. To be concrete, our Visual-Textual Attribute Alignment model (dubbed as ViTAA) learns to disentangle the feature space of a person into sub-spaces corresponding to attributes using a light auxiliary attribute segmentation layer. It then aligns these visual features with the textual attributes parsed from the sentences via a novel contrastive learning loss. We validate our ViTAA framework through extensive experiments on tasks of person search by natural language and by attribute-phrase queries, on which our system achieves state-of-the-art performances. Codes and models are available at https://github.com/Jarr0d/ViTAA.

KW - Metric learning

KW - Person re-identification

KW - Person search by natural language

KW - Vision and language

UR - http://www.scopus.com/inward/record.url?scp=85093074193&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85093074193&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-58610-2_24

DO - 10.1007/978-3-030-58610-2_24

M3 - Conference contribution

AN - SCOPUS:85093074193

SN - 9783030586096

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 402

EP - 420

BT - Computer Vision – ECCV 2020 - 16th European Conference, Proceedings

A2 - Vedaldi, Andrea

A2 - Bischof, Horst

A2 - Brox, Thomas

A2 - Frahm, Jan-Michael

PB - Springer Science and Business Media Deutschland GmbH

Y2 - 23 August 2020 through 28 August 2020

ER -
