Abstract

In automatic speech processing systems, speaker diarization is a crucial front-end component to separate segments from different speakers. Inspired by the recent success of deep neural networks (DNNs) in semantic inferencing, triplet loss-based architectures have been successfully used for this problem. However, existing work utilizes conventional i-vectors as the input representation and builds simple fully connected networks for metric learning, thus not fully leveraging the modeling power of DNN architectures. This paper investigates the importance of learning effective representations from the sequences directly in metric learning pipelines for speaker diarization. More specifically, we propose to employ attention models to learn embeddings and the metric jointly in an end-to-end fashion. Experiments are conducted on the CALLHOME conversational speech corpus. The diarization results demonstrate that, besides providing a unified model, the proposed approach achieves improved performance when compared against existing approaches.
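The pipeline the abstract describes — pooling frame-level features into a segment embedding with attention, then training with a triplet loss so same-speaker segments sit closer than different-speaker ones — can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names, the dot-product scoring, and the pure-Python setting are all assumptions.

```python
import math

def attention_pool(frames, w):
    """Collapse a sequence of frame embeddings into one segment embedding.

    Each frame is scored against a learned query vector w (here a plain
    dot product), the scores are softmax-normalized into attention
    weights, and the frames are averaged under those weights.
    """
    scores = [sum(fi * wi for fi, wi in zip(f, w)) for f in frames]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]   # attention weights, sum to 1
    dim = len(frames[0])
    return [sum(a * f[d] for a, f in zip(alphas, frames)) for d in range(dim)]

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on squared Euclidean distances.

    Zero when the anchor is already closer to the positive than to the
    negative by at least the margin; positive otherwise.
    """
    def dist(x, y):
        return sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)
```

In an end-to-end system both the attention parameters and the embedding network would be updated jointly by backpropagating this loss; the sketch only shows the forward computation on plain lists.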

Original language: English (US)
Pages (from-to): 3608-3612
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2018-September
DOI: 10.21437/Interspeech.2018-2305
State: Published - Jan 1 2018
Event: 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 - Hyderabad, India
Duration: Sep 2 2018 - Sep 6 2018

Keywords

  • Attention models
  • Metric learning
  • Speaker diarization
  • Triplet network

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

Cite this

Triplet network with attention for speaker diarization. / Song, Huan; Willi, Megan; Thiagarajan, Jayaraman J.; Berisha, Visar; Spanias, Andreas.

In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2018-September, 01.01.2018, p. 3608-3612.

Research output: Contribution to journal › Conference article

@article{1d84ac82f3224f818ace1bed5402cd1a,
title = "Triplet network with attention for speaker diarization",
abstract = "In automatic speech processing systems, speaker diarization is a crucial front-end component to separate segments from different speakers. Inspired by the recent success of deep neural networks (DNNs) in semantic inferencing, triplet loss-based architectures have been successfully used for this problem. However, existing work utilizes conventional i-vectors as the input representation and builds simple fully connected networks for metric learning, thus not fully leveraging the modeling power of DNN architectures. This paper investigates the importance of learning effective representations from the sequences directly in metric learning pipelines for speaker diarization. More specifically, we propose to employ attention models to learn embeddings and the metric jointly in an end-to-end fashion. Experiments are conducted on the CALLHOME conversational speech corpus. The diarization results demonstrate that, besides providing a unified model, the proposed approach achieves improved performance when compared against existing approaches.",
keywords = "Attention models, Metric learning, Speaker diarization, Triplet network",
author = "Huan Song and Megan Willi and Thiagarajan, {Jayaraman J.} and Visar Berisha and Andreas Spanias",
year = "2018",
month = "1",
day = "1",
doi = "10.21437/Interspeech.2018-2305",
language = "English (US)",
volume = "2018-September",
pages = "3608--3612",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",
}

TY - JOUR
T1 - Triplet network with attention for speaker diarization
AU - Song, Huan
AU - Willi, Megan
AU - Thiagarajan, Jayaraman J.
AU - Berisha, Visar
AU - Spanias, Andreas
PY - 2018/1/1
Y1 - 2018/1/1
N2 - In automatic speech processing systems, speaker diarization is a crucial front-end component to separate segments from different speakers. Inspired by the recent success of deep neural networks (DNNs) in semantic inferencing, triplet loss-based architectures have been successfully used for this problem. However, existing work utilizes conventional i-vectors as the input representation and builds simple fully connected networks for metric learning, thus not fully leveraging the modeling power of DNN architectures. This paper investigates the importance of learning effective representations from the sequences directly in metric learning pipelines for speaker diarization. More specifically, we propose to employ attention models to learn embeddings and the metric jointly in an end-to-end fashion. Experiments are conducted on the CALLHOME conversational speech corpus. The diarization results demonstrate that, besides providing a unified model, the proposed approach achieves improved performance when compared against existing approaches.
AB - In automatic speech processing systems, speaker diarization is a crucial front-end component to separate segments from different speakers. Inspired by the recent success of deep neural networks (DNNs) in semantic inferencing, triplet loss-based architectures have been successfully used for this problem. However, existing work utilizes conventional i-vectors as the input representation and builds simple fully connected networks for metric learning, thus not fully leveraging the modeling power of DNN architectures. This paper investigates the importance of learning effective representations from the sequences directly in metric learning pipelines for speaker diarization. More specifically, we propose to employ attention models to learn embeddings and the metric jointly in an end-to-end fashion. Experiments are conducted on the CALLHOME conversational speech corpus. The diarization results demonstrate that, besides providing a unified model, the proposed approach achieves improved performance when compared against existing approaches.
KW - Attention models
KW - Metric learning
KW - Speaker diarization
KW - Triplet network
UR - http://www.scopus.com/inward/record.url?scp=85054959506&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85054959506&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2018-2305
DO - 10.21437/Interspeech.2018-2305
M3 - Conference article
VL - 2018-September
SP - 3608
EP - 3612
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SN - 2308-457X
ER -