Extractive summarization using cohesion network analysis and submodular set functions

Valentin Sergiu Cioaca; Mihai Dascalu; Danielle S. McNamara

doi:10.1109/SYNASC51798.2020.00035


Psychology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Scopus citations

Abstract

Numerous approaches have been introduced to automate text summarization, but only a few can be easily adapted to multiple languages. This paper introduces a multilingual text processing pipeline integrated into the open-source ReaderBench framework, which can be retrofitted to cover more than 50 languages. Given the need for extensibility and the lack of labeled training data for languages other than English, an unsupervised algorithm was preferred for extractive summarization (i.e., selecting the most representative sentences from the original document). Specifically, two different approaches relying on text cohesion were implemented: a) a graph-based text representation derived from Cohesion Network Analysis that extends TextRank, and b) a class of submodular set functions. Evaluations were performed on the DUC dataset, using the Gensim implementation of TextRank as the baseline. The results obtained with the submodular set functions outperform the baseline. In addition, two use cases in English and Romanian are presented, with corresponding graphical representations for the two methods.
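
To make the second approach more concrete, the following is a minimal sketch of extractive summarization by greedy maximization of a monotone submodular set function. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact functions used, so a saturated coverage objective is assumed here, and plain bag-of-words cosine similarity stands in for the cohesion measures the paper builds on. All names (tokenize, cosine, coverage, greedy_summary) and the parameters budget and alpha are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase word tokens; a crude stand-in for a real NLP pipeline."""
    return re.findall(r"[a-z']+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters (always >= 0)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def coverage(selected, sim, alpha):
    """Assumed objective: saturated coverage.

    Each sentence i is "covered" by the summary through its similarity to
    the selected sentences, capped at alpha times its total similarity to
    the whole document. A capped sum of non-negative terms is monotone
    and submodular.
    """
    total = 0.0
    for row in sim:
        covered = sum(row[j] for j in selected)
        total += min(covered, alpha * sum(row))
    return total


def greedy_summary(sentences, budget=2, alpha=0.5):
    """Pick up to `budget` sentences, each time adding the best marginal gain."""
    vectors = [Counter(tokenize(s)) for s in sentences]
    sim = [[cosine(u, v) for v in vectors] for u in vectors]
    selected = []
    while len(selected) < min(budget, len(sentences)):
        current = coverage(selected, sim, alpha)
        gains = {
            j: coverage(selected + [j], sim, alpha) - current
            for j in range(len(sentences))
            if j not in selected
        }
        best = max(gains, key=gains.get)
        if gains[best] <= 0:  # nothing left that improves coverage
            break
        selected.append(best)
    # Return the chosen sentences in their original document order.
    return [sentences[j] for j in sorted(selected)]


if __name__ == "__main__":
    document = [
        "The cat sat on the mat.",
        "A cat was sitting on a mat.",
        "Stocks fell sharply on Monday.",
        "Markets and stocks dropped on Monday.",
    ]
    for line in greedy_summary(document, budget=2):
        print(line)
```

Because the cap makes a second sentence on an already covered topic contribute little, the greedy step tends to spread its picks across topics. For monotone submodular objectives under a cardinality constraint, this greedy procedure is known to reach at least a (1 - 1/e) fraction of the optimal value, which is one reason such functions are attractive for unsupervised summarization.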

Original language	English (US)
Title of host publication	Proceedings - 2020 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2020
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	161-168
Number of pages	8
ISBN (Electronic)	9781728176284
DOIs	https://doi.org/10.1109/SYNASC51798.2020.00035
State	Published - Sep 2020
Event	22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2020 - Virtual, Timisoara, Romania Duration: Sep 1 2020 → Sep 4 2020

Publication series

Name	Proceedings - 2020 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2020

Conference

Conference	22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2020
Country/Territory	Romania
City	Virtual, Timisoara
Period	9/1/20 → 9/4/20

Keywords

Cohesion Network Analysis
Extractive summarization
SpaCy framework
Submodular functions
TextRank
Word Mover's Distance

ASJC Scopus subject areas

Computer Science Applications
Computational Mathematics
Modeling and Simulation
Numerical Analysis


Cite this

Cioaca, V. S., Dascalu, M., & McNamara, D. S. (2020). Extractive summarization using cohesion network analysis and submodular set functions. In Proceedings - 2020 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2020 (pp. 161-168). Article 9357072 (Proceedings - 2020 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2020). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/SYNASC51798.2020.00035


@inproceedings{8d5141b4ec8547ffa86f6a4c86b8fb81,
  title = "Extractive summarization using cohesion network analysis and submodular set functions",
  abstract = "Numerous approaches have been introduced to automate text summarization, but only a few can be easily adapted to multiple languages. This paper introduces a multilingual text processing pipeline integrated into the open-source ReaderBench framework, which can be retrofitted to cover more than 50 languages. Given the need for extensibility and the lack of labeled training data for languages other than English, an unsupervised algorithm was preferred for extractive summarization (i.e., selecting the most representative sentences from the original document). Specifically, two different approaches relying on text cohesion were implemented: a) a graph-based text representation derived from Cohesion Network Analysis that extends TextRank, and b) a class of submodular set functions. Evaluations were performed on the DUC dataset, using the Gensim implementation of TextRank as the baseline. The results obtained with the submodular set functions outperform the baseline. In addition, two use cases in English and Romanian are presented, with corresponding graphical representations for the two methods.",
  keywords = "Cohesion Network Analysis, Extractive summarization, SpaCy framework, Submodular functions, TextRank, Word Mover's Distance",
  author = "Cioaca, {Valentin Sergiu} and Dascalu, Mihai and McNamara, {Danielle S.}",
  note = "Publisher Copyright: {\textcopyright} 2020 IEEE.; 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2020 ; Conference date: 01-09-2020 Through 04-09-2020",
  year = "2020",
  month = sep,
  doi = "10.1109/SYNASC51798.2020.00035",
  language = "English (US)",
  series = "Proceedings - 2020 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2020",
  publisher = "Institute of Electrical and Electronics Engineers Inc.",
  pages = "161--168",
  booktitle = "Proceedings - 2020 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2020",
}

TY - GEN
T1 - Extractive summarization using cohesion network analysis and submodular set functions
AU - Cioaca, Valentin Sergiu
AU - Dascalu, Mihai
AU - McNamara, Danielle S.
PY - 2020/9
Y1 - 2020/9
AB - Numerous approaches have been introduced to automate text summarization, but only a few can be easily adapted to multiple languages. This paper introduces a multilingual text processing pipeline integrated into the open-source ReaderBench framework, which can be retrofitted to cover more than 50 languages. Given the need for extensibility and the lack of labeled training data for languages other than English, an unsupervised algorithm was preferred for extractive summarization (i.e., selecting the most representative sentences from the original document). Specifically, two different approaches relying on text cohesion were implemented: a) a graph-based text representation derived from Cohesion Network Analysis that extends TextRank, and b) a class of submodular set functions. Evaluations were performed on the DUC dataset, using the Gensim implementation of TextRank as the baseline. The results obtained with the submodular set functions outperform the baseline. In addition, two use cases in English and Romanian are presented, with corresponding graphical representations for the two methods.
KW - Cohesion Network Analysis
KW - Extractive summarization
KW - SpaCy framework
KW - Submodular functions
KW - TextRank
KW - Word Mover's Distance
UR - http://www.scopus.com/inward/record.url?scp=85102346431&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85102346431&partnerID=8YFLogxK
U2 - 10.1109/SYNASC51798.2020.00035
DO - 10.1109/SYNASC51798.2020.00035
M3 - Conference contribution
AN - SCOPUS:85102346431
T3 - Proceedings - 2020 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2020
SP - 161
EP - 168
BT - Proceedings - 2020 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, SYNASC 2020
Y2 - 1 September 2020 through 4 September 2020
ER -
