Pharmacovigilance from social media: Mining adverse drug reaction mentions using sequence labeling with word embedding cluster features

Azadeh Nikfarjam; Abeed Sarker; Karen O'Connor; Rachel Ginn; Graciela Gonzalez

doi:10.1093/jamia/ocu041

Research output: Contribution to journal › Article › peer-review

407 Scopus citations

Abstract

Objective: Social media is becoming increasingly popular as a platform for sharing personal health-related information. This information can be utilized for public health monitoring tasks, particularly for pharmacovigilance, via the use of natural language processing (NLP) techniques. However, the language in social media is highly informal, and user-expressed medical concepts are often nontechnical, descriptive, and challenging to extract. There has been limited progress in addressing these challenges, and thus far, advanced machine learning-based NLP techniques have been underutilized. Our objective is to design a machine learning-based approach to extract mentions of adverse drug reactions (ADRs) from highly informal text in social media. Methods: We introduce ADRMine, a machine learning-based concept extraction system that uses conditional random fields (CRFs). ADRMine utilizes a variety of features, including a novel feature for modeling words' semantic similarities. The similarities are modeled by clustering words based on unsupervised, pretrained word representation vectors (embeddings) generated from unlabeled user posts in social media using a deep learning technique. Results: ADRMine outperforms several strong baseline systems in the ADR extraction task by achieving an F-measure of 0.82. Feature analysis demonstrates that the proposed word cluster features significantly improve extraction performance. Conclusion: It is possible to extract complex medical concepts, with relatively high performance, from informal, user-generated content. Our approach is particularly scalable, suitable for social media mining, as it relies on large volumes of unlabeled data, thus diminishing the need for large, annotated training data sets.
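The word cluster feature described in the abstract can be illustrated with a minimal sketch. This is not the ADRMine implementation: the two-dimensional vectors, the seed words, the plain k-means routine, and the names `kmeans`, `build_cluster_map`, and `token_features` are all assumptions made for illustration. The idea shown is only that words with nearby embeddings receive the same cluster ID, and that this ID becomes a categorical feature for each token and its neighbours, which a sequence labeler such as a CRF could then consume.

```python
from math import dist

# Toy 2-D vectors standing in for pretrained embeddings (invented values;
# real vectors would be learned from unlabeled social media posts).
EMBEDDINGS = {
    "headache": (0.9, 0.1),
    "migraine": (0.8, 0.2),
    "nausea": (0.7, 0.0),
    "took": (0.1, 0.9),
    "started": (0.0, 0.8),
    "taking": (0.2, 1.0),
}


def kmeans(vectors, centroids, iterations=10):
    """Plain k-means; returns final centroids and a cluster index per vector."""
    centroids = list(centroids)
    assignment = []
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda c: dist(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment


def build_cluster_map(embeddings, seed_words):
    """Map each word to a cluster ID such as 'c0' (seed words are hypothetical)."""
    words = list(embeddings)
    vectors = [embeddings[w] for w in words]
    seeds = [embeddings[w] for w in seed_words]
    _, assignment = kmeans(vectors, seeds)
    return {w: f"c{a}" for w, a in zip(words, assignment)}


def token_features(tokens, i, cluster_of):
    """Feature dict for token i: the word plus cluster IDs in a +/-1 window."""
    feats = {"word": tokens[i].lower()}
    for offset in (-1, 0, 1):
        j = i + offset
        if 0 <= j < len(tokens):
            cid = cluster_of.get(tokens[j].lower())
            if cid is not None:  # out-of-vocabulary words add no cluster feature
                feats[f"cluster[{offset}]"] = cid
    return feats


CLUSTER_OF = build_cluster_map(EMBEDDINGS, seed_words=["headache", "took"])
features = token_features(["took", "it", "migraine"], 2, CLUSTER_OF)
```

Because "migraine" and "headache" share a cluster ID in this toy setup, a model trained on one can generalize to the other even if it never saw that exact word labeled, which is the intuition behind relying on unlabeled data.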

Original language	English (US)
Pages (from-to)	671-681
Number of pages	11
Journal	Journal of the American Medical Informatics Association
Volume	22
Issue number	3
DOIs	https://doi.org/10.1093/jamia/ocu041
State	Published - 2015

Keywords

ADR
Adverse drug reaction
Deep learning word embeddings
Machine learning
Natural language processing
Pharmacovigilance
Social media mining

ASJC Scopus subject areas

Health Informatics

Access to Document

10.1093/jamia/ocu041

Cite this

@article{cc518aeb10fc475b85520080265ddd83,

title = "Pharmacovigilance from social media: Mining adverse drug reaction mentions using sequence labeling with word embedding cluster features",

abstract = "Objective: Social media is becoming increasingly popular as a platform for sharing personal health-related information. This information can be utilized for public health monitoring tasks, particularly for pharmacovigilance, via the use of natural language processing (NLP) techniques. However, the language in social media is highly informal, and user-expressed medical concepts are often nontechnical, descriptive, and challenging to extract. There has been limited progress in addressing these challenges, and thus far, advanced machine learning-based NLP techniques have been underutilized. Our objective is to design a machine learning-based approach to extract mentions of adverse drug reactions (ADRs) from highly informal text in social media. Methods: We introduce ADRMine, a machine learning-based concept extraction system that uses conditional random fields (CRFs). ADRMine utilizes a variety of features, including a novel feature for modeling words' semantic similarities. The similarities are modeled by clustering words based on unsupervised, pretrained word representation vectors (embeddings) generated from unlabeled user posts in social media using a deep learning technique. Results: ADRMine outperforms several strong baseline systems in the ADR extraction task by achieving an F-measure of 0.82. Feature analysis demonstrates that the proposed word cluster features significantly improve extraction performance. Conclusion: It is possible to extract complex medical concepts, with relatively high performance, from informal, user-generated content. Our approach is particularly scalable, suitable for social media mining, as it relies on large volumes of unlabeled data, thus diminishing the need for large, annotated training data sets.",

keywords = "ADR, Adverse drug reaction, Deep learning word embeddings, Machine learning, Natural language processing, Pharmacovigilance, Social media mining",

author = "Azadeh Nikfarjam and Abeed Sarker and Karen O'Connor and Rachel Ginn and Graciela Gonzalez",

note = "Publisher Copyright: {\textcopyright} The Author 2015.",

year = "2015",

doi = "10.1093/jamia/ocu041",

language = "English (US)",

volume = "22",

pages = "671--681",

journal = "Journal of the American Medical Informatics Association",

issn = "1067-5027",

publisher = "Oxford University Press",

number = "3",

}

TY - JOUR

T1 - Pharmacovigilance from social media

T2 - Mining adverse drug reaction mentions using sequence labeling with word embedding cluster features

AU - Nikfarjam, Azadeh

AU - Sarker, Abeed

AU - O'Connor, Karen

AU - Ginn, Rachel

AU - Gonzalez, Graciela

N1 - Publisher Copyright: © The Author 2015.

PY - 2015

Y1 - 2015

N2 - Objective: Social media is becoming increasingly popular as a platform for sharing personal health-related information. This information can be utilized for public health monitoring tasks, particularly for pharmacovigilance, via the use of natural language processing (NLP) techniques. However, the language in social media is highly informal, and user-expressed medical concepts are often nontechnical, descriptive, and challenging to extract. There has been limited progress in addressing these challenges, and thus far, advanced machine learning-based NLP techniques have been underutilized. Our objective is to design a machine learning-based approach to extract mentions of adverse drug reactions (ADRs) from highly informal text in social media. Methods: We introduce ADRMine, a machine learning-based concept extraction system that uses conditional random fields (CRFs). ADRMine utilizes a variety of features, including a novel feature for modeling words' semantic similarities. The similarities are modeled by clustering words based on unsupervised, pretrained word representation vectors (embeddings) generated from unlabeled user posts in social media using a deep learning technique. Results: ADRMine outperforms several strong baseline systems in the ADR extraction task by achieving an F-measure of 0.82. Feature analysis demonstrates that the proposed word cluster features significantly improve extraction performance. Conclusion: It is possible to extract complex medical concepts, with relatively high performance, from informal, user-generated content. Our approach is particularly scalable, suitable for social media mining, as it relies on large volumes of unlabeled data, thus diminishing the need for large, annotated training data sets.

AB - Objective: Social media is becoming increasingly popular as a platform for sharing personal health-related information. This information can be utilized for public health monitoring tasks, particularly for pharmacovigilance, via the use of natural language processing (NLP) techniques. However, the language in social media is highly informal, and user-expressed medical concepts are often nontechnical, descriptive, and challenging to extract. There has been limited progress in addressing these challenges, and thus far, advanced machine learning-based NLP techniques have been underutilized. Our objective is to design a machine learning-based approach to extract mentions of adverse drug reactions (ADRs) from highly informal text in social media. Methods: We introduce ADRMine, a machine learning-based concept extraction system that uses conditional random fields (CRFs). ADRMine utilizes a variety of features, including a novel feature for modeling words' semantic similarities. The similarities are modeled by clustering words based on unsupervised, pretrained word representation vectors (embeddings) generated from unlabeled user posts in social media using a deep learning technique. Results: ADRMine outperforms several strong baseline systems in the ADR extraction task by achieving an F-measure of 0.82. Feature analysis demonstrates that the proposed word cluster features significantly improve extraction performance. Conclusion: It is possible to extract complex medical concepts, with relatively high performance, from informal, user-generated content. Our approach is particularly scalable, suitable for social media mining, as it relies on large volumes of unlabeled data, thus diminishing the need for large, annotated training data sets.

KW - ADR

KW - Adverse drug reaction

KW - Deep learning word embeddings

KW - Machine learning

KW - Natural language processing

KW - Pharmacovigilance

KW - Social media mining

UR - http://www.scopus.com/inward/record.url?scp=84927943705&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84927943705&partnerID=8YFLogxK

U2 - 10.1093/jamia/ocu041

DO - 10.1093/jamia/ocu041

M3 - Article

C2 - 25755127

AN - SCOPUS:84927943705

SN - 1067-5027

VL - 22

SP - 671

EP - 681

JO - Journal of the American Medical Informatics Association

JF - Journal of the American Medical Informatics Association

IS - 3

ER -
