Challenges and solutions to employing natural language processing and machine learning to measure patients’ health literacy and physician writing complexity: The ECLIPPSE study

William Brown; Renu Balyan; Andrew J. Karter; Scott Crossley; Wagahta Semere; Nicholas D. Duran; Courtney Lyles; Jennifer Liu; Howard H. Moffet; Ryane Daniels; Danielle S. McNamara; Dean Schillinger

doi:10.1016/j.jbi.2020.103658

Research output: Contribution to journal › Comment/debate › peer-review

5 Scopus citations

Abstract

Objective: In the National Library of Medicine funded ECLIPPSE Project (Employing Computational Linguistics to Improve Patient-Provider Secure Emails exchange), we attempted to create novel, valid, and scalable measures of both patients’ health literacy (HL) and physicians’ linguistic complexity by employing natural language processing (NLP) techniques and machine learning (ML). We applied these techniques to > 400,000 patients’ and physicians’ secure messages (SMs) exchanged via an electronic patient portal, developing and validating an automated patient literacy profile (LP) and physician complexity profile (CP). Herein, we describe the challenges faced and the solutions implemented during this innovative endeavor.

Materials and methods: To describe challenges and solutions, we used two data sources: study documents and interviews with study investigators. Over the five years of the project, the team tracked their research process using a combination of Google Docs tools and an online team organization, tracking, and management tool (Asana). In year 5, the team convened a number of times to discuss, categorize, and code primary challenges and solutions.

Results: We identified 23 challenges and associated approaches that emerged from three overarching process domains: (1) Data Mining related to the SM corpus; (2) Analyses using NLP indices on the SM corpus; and (3) Interdisciplinary Collaboration. With respect to Data Mining, problems included cleaning SMs to enable analyses, removing hidden caregiver proxies (e.g., other family members) and Spanish language SMs, and culling SMs to ensure that only patients’ primary care physicians were included. With respect to Analyses, critical decisions needed to be made as to which computational linguistic indices and ML approaches should be selected; how to enable the NLP-based linguistic indices tools to run smoothly and to extract meaningful data from a large corpus of medical text; and how to best assess content and predictive validities of both the LP and the CP. With respect to the Interdisciplinary Collaboration, because the research required engagement between clinicians, health services researchers, biomedical informaticians, linguists, and cognitive scientists, continual effort was needed to identify and reconcile differences in scientific terminologies and resolve confusion; arrive at common understanding of tasks that needed to be completed and priorities therein; reach compromises regarding what represents “meaningful findings” in health services vs. cognitive science research; and address constraints regarding potential transportability of the final LP and CP to different health care settings.

Discussion: Our study represents a process evaluation of an innovative research initiative to harness “big linguistic data” to estimate patient HL and physician linguistic complexity. Any of the challenges we identified, if left unaddressed, would have either rendered impossible the effort to generate LPs and CPs, or invalidated analytic results related to the LPs and CPs. Investigators undertaking similar research in HL or using computational linguistic methods to assess patient-clinician exchange will face similar challenges and may find our solutions helpful when designing and executing their health communications research.

Original language	English (US)
Article number	103658
Journal	Journal of Biomedical Informatics
Volume	113
DOIs	https://doi.org/10.1016/j.jbi.2020.103658
State	Published - Jan 2021

Keywords

Diabetes health care quality
Digital health and health services research
Electronic health records
Health literacy
Machine learning
Natural language processing

ASJC Scopus subject areas

Health Informatics
Computer Science Applications

Cite this

Brown, W., Balyan, R., Karter, A. J., Crossley, S., Semere, W., Duran, N. D., Lyles, C., Liu, J., Moffet, H. H., Daniels, R., McNamara, D. S., & Schillinger, D. (2021). Challenges and solutions to employing natural language processing and machine learning to measure patients’ health literacy and physician writing complexity: The ECLIPPSE study. Journal of Biomedical Informatics, 113, Article 103658. https://doi.org/10.1016/j.jbi.2020.103658

@article{85056848a34e42ec98d5bfd19b508dca,

title = "Challenges and solutions to employing natural language processing and machine learning to measure patients{\textquoteright} health literacy and physician writing complexity: The ECLIPPSE study",

keywords = "Diabetes health care quality, Digital health and health services research, Electronic health records, Health literacy, Machine learning, Natural language processing",

author = "William Brown and Renu Balyan and Karter, {Andrew J.} and Scott Crossley and Wagahta Semere and Duran, {Nicholas D.} and Courtney Lyles and Jennifer Liu and Moffet, {Howard H.} and Ryane Daniels and McNamara, {Danielle S.} and Dean Schillinger",

note = "Publisher Copyright: {\textcopyright} 2020",

year = "2021",

month = jan,

doi = "10.1016/j.jbi.2020.103658",

language = "English (US)",

volume = "113",

journal = "Journal of Biomedical Informatics",

issn = "1532-0464",

publisher = "Academic Press Inc.",

}

TY - JOUR

T1 - Challenges and solutions to employing natural language processing and machine learning to measure patients’ health literacy and physician writing complexity

T2 - The ECLIPPSE study

AU - Brown, William

AU - Balyan, Renu

AU - Karter, Andrew J.

AU - Crossley, Scott

AU - Semere, Wagahta

AU - Duran, Nicholas D.

AU - Lyles, Courtney

AU - Liu, Jennifer

AU - Moffet, Howard H.

AU - Daniels, Ryane

AU - McNamara, Danielle S.

AU - Schillinger, Dean

PY - 2021/1

Y1 - 2021/1

KW - Diabetes health care quality

KW - Digital health and health services research

KW - Electronic health records

KW - Health literacy

KW - Machine learning

KW - Natural language processing

UR - http://www.scopus.com/inward/record.url?scp=85099202097&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85099202097&partnerID=8YFLogxK

U2 - 10.1016/j.jbi.2020.103658

DO - 10.1016/j.jbi.2020.103658

M3 - Comment/debate

C2 - 33316421

AN - SCOPUS:85099202097

SN - 1532-0464

VL - 113

JO - Journal of Biomedical Informatics

JF - Journal of Biomedical Informatics

M1 - 103658

ER -
