Pharmacovigilance from social media: Mining adverse drug reaction mentions using sequence labeling with word embedding cluster features

Azadeh Nikfarjam, Abeed Sarker, Karen O'Connor, Rachel Ginn, Graciela Gonzalez

    Research output: Contribution to journal › Article

    177 Citations (Scopus)

    Abstract

    Objective: Social media is becoming increasingly popular as a platform for sharing personal health-related information. This information can be utilized for public health monitoring tasks, particularly for pharmacovigilance, via the use of natural language processing (NLP) techniques. However, the language in social media is highly informal, and user-expressed medical concepts are often nontechnical, descriptive, and challenging to extract. There has been limited progress in addressing these challenges, and thus far, advanced machine learning-based NLP techniques have been underutilized. Our objective is to design a machine learning-based approach to extract mentions of adverse drug reactions (ADRs) from highly informal text in social media. Methods: We introduce ADRMine, a machine learning-based concept extraction system that uses conditional random fields (CRFs). ADRMine utilizes a variety of features, including a novel feature for modeling words' semantic similarities. The similarities are modeled by clustering words based on unsupervised, pretrained word representation vectors (embeddings) generated from unlabeled user posts in social media using a deep learning technique. Results: ADRMine outperforms several strong baseline systems in the ADR extraction task by achieving an F-measure of 0.82. Feature analysis demonstrates that the proposed word cluster features significantly improve extraction performance. Conclusion: It is possible to extract complex medical concepts, with relatively high performance, from informal, user-generated content. Our approach is particularly scalable, suitable for social media mining, as it relies on large volumes of unlabeled data, thus diminishing the need for large, annotated training data sets.
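
The word cluster feature described in the Methods can be illustrated with a small sketch. This is not ADRMine's code: the abstract does not name the clustering algorithm, so k-means is assumed here, and the two-dimensional toy vectors, the function names (`cluster_words`, `cluster_id`, `token_features`), and the feature keys are all hypothetical. The sketch only shows how cluster IDs can be turned into per-token features of the kind a CRF tagger consumes; it does not train a CRF.

```python
import math

# Toy 2-D "embeddings" (made up for illustration). Real vectors would be
# learned from a large collection of unlabeled user posts.
TOY_EMBEDDINGS = {
    "drowsy": (0.9, 0.1),
    "sleepy": (1.0, 0.2),
    "tired": (0.8, 0.0),
    "took": (0.1, 0.9),
    "started": (0.0, 1.0),
    "taking": (0.2, 0.8),
}


def cluster_words(vectors, k, iters=10):
    """Assumed clustering step: plain k-means, returning word -> cluster index."""
    words = sorted(vectors)
    # Deterministic farthest-first initialisation keeps the sketch reproducible.
    centroids = [list(vectors[words[0]])]
    while len(centroids) < k:
        far = max(words, key=lambda w: min(math.dist(vectors[w], c) for c in centroids))
        centroids.append(list(vectors[far]))
    assign = {}
    for _ in range(iters):
        # Assignment step: each word goes to its nearest centroid.
        for w in words:
            assign[w] = min(range(k), key=lambda c: math.dist(vectors[w], centroids[c]))
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [vectors[w] for w in words if assign[w] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def cluster_id(token, clusters):
    """Map a token to a cluster label, or 'UNK' if it has no embedding."""
    tok = token.lower()
    return "c%d" % clusters[tok] if tok in clusters else "UNK"


def token_features(tokens, i, clusters):
    """Hypothetical per-token feature dict: the word plus neighbouring cluster IDs."""
    feats = {"word": tokens[i].lower(), "cluster": cluster_id(tokens[i], clusters)}
    if i > 0:
        feats["prev_cluster"] = cluster_id(tokens[i - 1], clusters)
    if i + 1 < len(tokens):
        feats["next_cluster"] = cluster_id(tokens[i + 1], clusters)
    return feats


clusters = cluster_words(TOY_EMBEDDINGS, k=2)
print(token_features(["made", "me", "sleepy"], 2, clusters))
```

The design point is that "sleepy" and "drowsy" receive the same cluster label, so a sequence labeler that saw only one of them in annotated data can still generalize to the other through the shared feature.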

    Original language: English (US)
    Pages (from-to): 671-681
    Number of pages: 11
    Journal: Journal of the American Medical Informatics Association
    Volume: 22
    Issue number: 3
    DOIs: 10.1093/jamia/ocu041
    State: Published - 2015

    Fingerprint

    Social Media
    Pharmacovigilance
    Drug-Related Side Effects and Adverse Reactions
    Natural Language Processing
    Personal Health Records
    Complex Mixtures
    Semantics
    Cluster Analysis
    Language
    Public Health
    Learning
    Machine Learning

    Keywords

    • ADR
    • Adverse drug reaction
    • Deep learning word embeddings
    • Machine learning
    • Natural language processing
    • Pharmacovigilance
    • Social media mining

    ASJC Scopus subject areas

    • Health Informatics

    Cite this

    Pharmacovigilance from social media: Mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. / Nikfarjam, Azadeh; Sarker, Abeed; O'Connor, Karen; Ginn, Rachel; Gonzalez, Graciela.

    In: Journal of the American Medical Informatics Association, Vol. 22, No. 3, 2015, p. 671-681.

    Nikfarjam, Azadeh; Sarker, Abeed; O'Connor, Karen; Ginn, Rachel; Gonzalez, Graciela. / Pharmacovigilance from social media: Mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. In: Journal of the American Medical Informatics Association. 2015; Vol. 22, No. 3. pp. 671-681.
    @article{cc518aeb10fc475b85520080265ddd83,
    title = "Pharmacovigilance from social media: Mining adverse drug reaction mentions using sequence labeling with word embedding cluster features",
    abstract = "Objective: Social media is becoming increasingly popular as a platform for sharing personal health-related information. This information can be utilized for public health monitoring tasks, particularly for pharmacovigilance, via the use of natural language processing (NLP) techniques. However, the language in social media is highly informal, and user-expressed medical concepts are often nontechnical, descriptive, and challenging to extract. There has been limited progress in addressing these challenges, and thus far, advanced machine learning-based NLP techniques have been underutilized. Our objective is to design a machine learning-based approach to extract mentions of adverse drug reactions (ADRs) from highly informal text in social media. Methods: We introduce ADRMine, a machine learning-based concept extraction system that uses conditional random fields (CRFs). ADRMine utilizes a variety of features, including a novel feature for modeling words' semantic similarities. The similarities are modeled by clustering words based on unsupervised, pretrained word representation vectors (embeddings) generated from unlabeled user posts in social media using a deep learning technique. Results: ADRMine outperforms several strong baseline systems in the ADR extraction task by achieving an F-measure of 0.82. Feature analysis demonstrates that the proposed word cluster features significantly improve extraction performance. Conclusion: It is possible to extract complex medical concepts, with relatively high performance, from informal, user-generated content. Our approach is particularly scalable, suitable for social media mining, as it relies on large volumes of unlabeled data, thus diminishing the need for large, annotated training data sets.",
    keywords = "ADR, Adverse drug reaction, Deep learning word embeddings, Machine learning, Natural language processing, Pharmacovigilance, Social media mining",
    author = "Azadeh Nikfarjam and Abeed Sarker and Karen O'Connor and Rachel Ginn and Graciela Gonzalez",
    year = "2015",
    doi = "10.1093/jamia/ocu041",
    language = "English (US)",
    volume = "22",
    pages = "671--681",
    journal = "Journal of the American Medical Informatics Association : JAMIA",
    issn = "1067-5027",
    publisher = "Oxford University Press",
    number = "3",

    }

    TY - JOUR

    T1 - Pharmacovigilance from social media

    T2 - Mining adverse drug reaction mentions using sequence labeling with word embedding cluster features

    AU - Nikfarjam, Azadeh

    AU - Sarker, Abeed

    AU - O'Connor, Karen

    AU - Ginn, Rachel

    AU - Gonzalez, Graciela

    PY - 2015

    Y1 - 2015

    N2 - Objective: Social media is becoming increasingly popular as a platform for sharing personal health-related information. This information can be utilized for public health monitoring tasks, particularly for pharmacovigilance, via the use of natural language processing (NLP) techniques. However, the language in social media is highly informal, and user-expressed medical concepts are often nontechnical, descriptive, and challenging to extract. There has been limited progress in addressing these challenges, and thus far, advanced machine learning-based NLP techniques have been underutilized. Our objective is to design a machine learning-based approach to extract mentions of adverse drug reactions (ADRs) from highly informal text in social media. Methods: We introduce ADRMine, a machine learning-based concept extraction system that uses conditional random fields (CRFs). ADRMine utilizes a variety of features, including a novel feature for modeling words' semantic similarities. The similarities are modeled by clustering words based on unsupervised, pretrained word representation vectors (embeddings) generated from unlabeled user posts in social media using a deep learning technique. Results: ADRMine outperforms several strong baseline systems in the ADR extraction task by achieving an F-measure of 0.82. Feature analysis demonstrates that the proposed word cluster features significantly improve extraction performance. Conclusion: It is possible to extract complex medical concepts, with relatively high performance, from informal, user-generated content. Our approach is particularly scalable, suitable for social media mining, as it relies on large volumes of unlabeled data, thus diminishing the need for large, annotated training data sets.

    AB - Objective: Social media is becoming increasingly popular as a platform for sharing personal health-related information. This information can be utilized for public health monitoring tasks, particularly for pharmacovigilance, via the use of natural language processing (NLP) techniques. However, the language in social media is highly informal, and user-expressed medical concepts are often nontechnical, descriptive, and challenging to extract. There has been limited progress in addressing these challenges, and thus far, advanced machine learning-based NLP techniques have been underutilized. Our objective is to design a machine learning-based approach to extract mentions of adverse drug reactions (ADRs) from highly informal text in social media. Methods: We introduce ADRMine, a machine learning-based concept extraction system that uses conditional random fields (CRFs). ADRMine utilizes a variety of features, including a novel feature for modeling words' semantic similarities. The similarities are modeled by clustering words based on unsupervised, pretrained word representation vectors (embeddings) generated from unlabeled user posts in social media using a deep learning technique. Results: ADRMine outperforms several strong baseline systems in the ADR extraction task by achieving an F-measure of 0.82. Feature analysis demonstrates that the proposed word cluster features significantly improve extraction performance. Conclusion: It is possible to extract complex medical concepts, with relatively high performance, from informal, user-generated content. Our approach is particularly scalable, suitable for social media mining, as it relies on large volumes of unlabeled data, thus diminishing the need for large, annotated training data sets.

    KW - ADR

    KW - Adverse drug reaction

    KW - Deep learning word embeddings

    KW - Machine learning

    KW - Natural language processing

    KW - Pharmacovigilance

    KW - Social media mining

    UR - http://www.scopus.com/inward/record.url?scp=84927943705&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84927943705&partnerID=8YFLogxK

    U2 - 10.1093/jamia/ocu041

    DO - 10.1093/jamia/ocu041

    M3 - Article

    C2 - 25755127

    AN - SCOPUS:84927943705

    VL - 22

    SP - 671

    EP - 681

    JO - Journal of the American Medical Informatics Association : JAMIA

    JF - Journal of the American Medical Informatics Association : JAMIA

    SN - 1067-5027

    IS - 3

    ER -