TY - GEN
T1 - Human-Machine Interaction for Improved Cybersecurity Named Entity Recognition Considering Semantic Similarity
AU - Kashihara, Kazuaki
AU - Shakarian, Jana
AU - Baral, Chitta
N1 - Publisher Copyright: © 2021, Springer Nature Switzerland AG.
PY - 2021
Y1 - 2021
AB - The automated and timely conversion or extraction of cybersecurity information from unstructured text in online sources is important and required for many applications. Named Entity Recognition (NER) is used to detect the relevant domain entities, such as product, attack name, malware name, and hacker group name. To train a new NER model for cybersecurity, traditional NER requires a training corpus annotated with cybersecurity entities, and state-of-the-art methods require time-consuming and labor-intensive feature engineering. We propose a Human-Machine Interaction method for semi-automatic labeling and corpus generation for cybersecurity entities. Our method evaluates the learned NER model on the sentences that we collected in the training process, and the user selects only the correct pairs of named entity and category for the next training iteration. Thus, each iteration gets a better training corpus to train the NER model. Some entities are ambiguous because the word or phrase has multiple meanings. We introduce a new semantic similarity measure and determine which category the word belongs to based on the semantic similarity of the entire sentence. The experimental evaluation shows that our method is better than existing methods at finding undiscovered keywords of given categories.
KW - Cybersecurity
KW - NER
KW - Named Entity Recognition
KW - Semantic similarity
UR - http://www.scopus.com/inward/record.url?scp=85090097390&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85090097390&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-55187-2_28
DO - 10.1007/978-3-030-55187-2_28
M3 - Conference contribution
AN - SCOPUS:85090097390
SN - 9783030551865
T3 - Advances in Intelligent Systems and Computing
SP - 347
EP - 361
BT - Intelligent Systems and Applications - Proceedings of the 2020 Intelligent Systems Conference IntelliSys Volume 2
A2 - Arai, Kohei
A2 - Kapoor, Supriya
A2 - Bhatia, Rahul
PB - Springer
T2 - Intelligent Systems Conference, IntelliSys 2020
Y2 - 3 September 2020 through 4 September 2020
ER -