Designing an Effective Metric Learning Pipeline for Speaker Diarization

Vivek Sivaraman Narayanaswamy; Jayaraman J. Thiagarajan; Huan Song; Andreas Spanias

doi:10.1109/ICASSP.2019.8682255

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

12 Scopus citations

Abstract

State-of-the-art speaker diarization systems utilize knowledge from external data, in the form of a pre-trained distance metric, to effectively determine relative speaker identities in unseen data. However, much of the recent focus has been on choosing the appropriate feature extractor, ranging from pre-trained i-vectors to representations learned via different sequence modeling architectures (e.g., 1D-CNNs, LSTMs, attention models), while adopting off-the-shelf metric learning solutions. In this paper, we argue that, regardless of the feature extractor, it is crucial to carefully design the metric learning pipeline, namely the loss function, the sampling strategy and the discriminative margin parameter, for building robust diarization systems. Furthermore, we propose to adopt a fine-grained validation process to obtain a comprehensive evaluation of the generalization power of metric learning pipelines. To this end, we measure diarization performance across speakers of different languages and across varying numbers of speakers in a recording. Through empirical studies, we provide insights into the effectiveness of different design choices and make recommendations.

Original language	English (US)
Title of host publication	2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	5806-5810
Number of pages	5
ISBN (Electronic)	9781479981311
DOIs	https://doi.org/10.1109/ICASSP.2019.8682255
State	Published - May 2019
Event	44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Brighton, United Kingdom
Duration	May 12 2019 → May 17 2019

Publication series

Name	ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume	2019-May
ISSN (Print)	1520-6149

Conference

Conference	44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019
Country/Territory	United Kingdom
City	Brighton
Period	5/12/19 → 5/17/19

Keywords

Speaker diarization
attention models
inverse distance weighted sampling
metric learning

ASJC Scopus subject areas

Software
Signal Processing
Electrical and Electronic Engineering

Cite this

Narayanaswamy, V. S., Thiagarajan, J. J., Song, H., & Spanias, A. (2019). Designing an Effective Metric Learning Pipeline for Speaker Diarization. In 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings (pp. 5806-5810). Article 8682255 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2019-May). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP.2019.8682255

@inproceedings{54d919857919491985f5f432b582e205,

title = "Designing an Effective Metric Learning Pipeline for Speaker Diarization",

keywords = "Speaker diarization, attention models, inverse distance weighted sampling, metric learning",

author = "Narayanaswamy, {Vivek Sivaraman} and Thiagarajan, {Jayaraman J.} and Huan Song and Andreas Spanias",

note = "Funding Information: This work was supported in part by the ASU SenSIP center, Arizona State University. Portions of this work were performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-PROC-767885. Publisher Copyright: {\textcopyright} 2019 IEEE.; 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 ; Conference date: 12-05-2019 Through 17-05-2019",

year = "2019",

month = may,

doi = "10.1109/ICASSP.2019.8682255",

language = "English (US)",

series = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "5806--5810",

booktitle = "2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings",

}

TY - GEN

T1 - Designing an Effective Metric Learning Pipeline for Speaker Diarization

AU - Narayanaswamy, Vivek Sivaraman

AU - Thiagarajan, Jayaraman J.

AU - Song, Huan

AU - Spanias, Andreas

N1 - Funding Information: This work was supported in part by the ASU SenSIP center, Arizona State University. Portions of this work were performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-PROC-767885. Publisher Copyright: © 2019 IEEE.

PY - 2019/5

Y1 - 2019/5

KW - Speaker diarization

KW - attention models

KW - inverse distance weighted sampling

KW - metric learning

UR - http://www.scopus.com/inward/record.url?scp=85068996908&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85068996908&partnerID=8YFLogxK

U2 - 10.1109/ICASSP.2019.8682255

DO - 10.1109/ICASSP.2019.8682255

M3 - Conference contribution

AN - SCOPUS:85068996908

T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

SP - 5806

EP - 5810

BT - 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019

Y2 - 12 May 2019 through 17 May 2019

ER -
