Portable automatic text classification for adverse drug reaction detection via multi-corpus training

Abeed Sarker, Graciela Gonzalez

    Research output: Contribution to journal › Article

    133 Citations (Scopus)

    Abstract

    Objective: Automatic detection of adverse drug reaction (ADR) mentions from text has recently received significant interest in pharmacovigilance research. Current research focuses on various sources of text-based information, including social media, where enormous amounts of user-posted data are available and have the potential for use in pharmacovigilance if collected and filtered accurately. The aims of this study are: (i) to explore natural language processing (NLP) approaches for generating useful features from text, and utilizing them in optimized machine learning algorithms for automatic classification of ADR-assertive text segments; (ii) to present two data sets that we prepared for the task of ADR detection from user-posted internet data; and (iii) to investigate whether combining training data from distinct corpora can improve automatic classification accuracies.
    Methods: One of our three data sets contains annotated sentences from clinical reports, and the two other data sets, built in-house, consist of annotated posts from social media. Our text classification approach relies on generating a large set of features, representing semantic properties (e.g., sentiment, polarity, and topic), from short text nuggets. Importantly, using our expanded feature sets, we combine training data from different corpora in attempts to boost classification accuracies.
    Results: Our feature-rich classification approach performs significantly better than previously published approaches, with ADR class F-scores of 0.812 (previously reported best: 0.770), 0.538 and 0.678 for the three data sets. Combining training data from multiple compatible corpora further improves the ADR F-scores for the in-house data sets to 0.597 (improvement of 5.9 units) and 0.704 (improvement of 2.6 units), respectively.
    Conclusions: Our research results indicate that using advanced NLP techniques for generating information-rich features from text can significantly improve classification accuracies over existing benchmarks. Our experiments illustrate the benefits of incorporating various semantic features such as topics, concepts, sentiments, and polarities. Finally, we show that integration of information from compatible corpora can significantly improve classification performance. This form of multi-corpus training may be particularly useful in cases where data sets are heavily imbalanced (e.g., social media data), and may reduce the time and costs associated with the annotation of data in the future.
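
    The ADR class F-scores reported in the abstract can be illustrated with a minimal sketch. It assumes the usual balanced F-score (harmonic mean of precision and recall for the ADR class) and reads an "improvement of N units" as the F-score difference multiplied by 100. The label names `ADR` and `noADR`, the toy predictions, and all function names are hypothetical and are not taken from the paper.

```python
def class_counts(gold, predicted, target="ADR"):
    """Count true positives, false positives and false negatives for one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == target and p == target for g, p in pairs)
    fp = sum(g != target and p == target for g, p in pairs)
    fn = sum(g == target and p != target for g, p in pairs)
    return tp, fp, fn


def f_score(tp, fp, fn):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def improvement_units(before, after):
    """F-score gain expressed in hundredths (assumed meaning of 'units')."""
    return round((after - before) * 100, 1)


# Toy labels, invented purely for illustration.
gold = ["ADR", "noADR", "ADR", "noADR", "noADR", "ADR"]
pred = ["ADR", "noADR", "noADR", "ADR", "noADR", "ADR"]
tp, fp, fn = class_counts(gold, pred)
toy_f = f_score(tp, fp, fn)

# F-scores quoted in the abstract for the two in-house data sets.
gain_first = improvement_units(0.538, 0.597)
gain_second = improvement_units(0.678, 0.704)
print(toy_f, gain_first, gain_second)
```

    Under these assumptions, the two gains computed from the quoted F-scores match the 5.9 and 2.6 units stated in the Results.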

    Original language: English (US)
    Pages (from-to): 196-207
    Number of pages: 12
    Journal: Journal of Biomedical Informatics
    Volume: 53
    DOI: 10.1016/j.jbi.2014.11.002
    State: Published - Feb 1 2015

    Fingerprint

    Drug-Related Side Effects and Adverse Reactions
    Social Media
    Natural Language Processing
    Pharmacovigilance
    Semantics
    Research
    Benchmarking
    Processing
    Internet
    Learning algorithms
    Learning systems
    Datasets
    Costs and Cost Analysis
    Costs
    Experiments

    Keywords

    • Adverse drug reaction
    • Natural language processing
    • Pharmacovigilance
    • Social media monitoring
    • Text classification

    ASJC Scopus subject areas

    • Computer Science Applications
    • Health Informatics

    Cite this

    Portable automatic text classification for adverse drug reaction detection via multi-corpus training. / Sarker, Abeed; Gonzalez, Graciela.

    In: Journal of Biomedical Informatics, Vol. 53, 01.02.2015, p. 196-207.


    @article{35599cc231a74e0dae58dfb2429b1814,
    title = "Portable automatic text classification for adverse drug reaction detection via multi-corpus training",
    keywords = "Adverse drug reaction, Natural language processing, Pharmacovigilance, Social media monitoring, Text classification",
    author = "Abeed Sarker and Graciela Gonzalez",
    year = "2015",
    month = "2",
    day = "1",
    doi = "10.1016/j.jbi.2014.11.002",
    language = "English (US)",
    volume = "53",
    pages = "196--207",
    journal = "Journal of Biomedical Informatics",
    issn = "1532-0464",
    publisher = "Academic Press Inc.",

    }

    TY - JOUR

    T1 - Portable automatic text classification for adverse drug reaction detection via multi-corpus training

    AU - Sarker, Abeed

    AU - Gonzalez, Graciela

    PY - 2015/2/1

    Y1 - 2015/2/1

    KW - Adverse drug reaction

    KW - Natural language processing

    KW - Pharmacovigilance

    KW - Social media monitoring

    KW - Text classification

    UR - http://www.scopus.com/inward/record.url?scp=84924285421&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84924285421&partnerID=8YFLogxK

    U2 - 10.1016/j.jbi.2014.11.002

    DO - 10.1016/j.jbi.2014.11.002

    M3 - Article

    C2 - 25451103

    AN - SCOPUS:84924285421

    VL - 53

    SP - 196

    EP - 207

    JO - Journal of Biomedical Informatics

    JF - Journal of Biomedical Informatics

    SN - 1532-0464

    ER -