Project

The sublanguage factor: Modeling term variation in clinical records

Background, motivation and aim of the dissertation

Recent years have seen a steady increase in the production of health-related data, such as electronic health records, clinical notes and discharge summaries. Besides the primary function of the record, i.e. the monitoring of an individual patient’s progress, the collection of such files opens new opportunities for large-scale analyses, such as cohort identification or improved decision support systems. As these records consist largely of free text, the development of such applications depends on suitable methods of Natural Language Processing (NLP) to unlock the relevant information. Clinical records, though, are known to employ a language that is not only highly specialized (Friedman et al. 2002), but also deviates from standard language in lexis and grammar. Existing NLP modules therefore cannot be readily adopted, but must be modified to handle these particular features. The processing of clinical narrative is thus a vibrant area of research (Velupillai et al. 2015, Névéol & Zweigenbaum 2015), which has been fostered by the organization of shared tasks (CLEF eHealth, i2b2).

 

Clinical knowledge is classified through medical ontologies, which provide a controlled vocabulary to express established concepts (Ivanović & Budimac 2014). Text-to-concept mapping is thus a key task of clinical language processing. For the recognition of relevant entities from free text, recent approaches employ strategies of supervised and unsupervised learning (Pradhan et al. 2015). However, for languages other than English, the resources required for training are hard to obtain (Skeppstedt et al. 2014). Another crucial limitation lies in the strong dependence on canonical forms: as most recognizers rely on forms derived from existing ontologies for development and/or validation, non-standard variants of a term will remain undetected (Henriksson et al. 2014). These findings show a clear need for improved coverage of the different formal levels of clinical terminology. To this end, data-driven approaches may benefit from a structural description of lexical variation. Apart from the core task of named entity recognition, related applications, such as context extraction (Afzal et al. 2014), may be enhanced by the insights gained from a systematic linguistic analysis.

 

The proposed dissertation aims at an empirical investigation of lexical variation based on the analysis of a corpus of Dutch clinical records. Starting from the hypothesis that word choices in clinical narrative differ systematically from standardized medical terminology, and that these choices are triggered by socio-cognitive factors, we will identify the contexts in which such deviations occur and model them through predictive features. Following validation on our original dataset, these features may be generalized across medical subdomains. This study will thus lead to a usage-based characterization of lexical variation in clinical Dutch, which may serve as input for computational-linguistic approaches to clinical language processing.

Provisional description of the objectives and methodology of the proposal

During the first two years of the dissertation, we will carry out a case study on the terminology of diabetes and related concepts. In the term acquisition phase, we will search our set of clinical records, as well as a background corpus of diabetes-related text, for relevant terminological variants. Besides ontologies, scholarly journals and encyclopedias, web pages aimed at lay readers and e-Health communities have proven to be valuable resources for the extraction of informal term variants (Fahmi 2009, Elhadad & Zhang 2014). Following the semi-automatic alignment with medical concepts, the term/concept pairs will be corrected manually with the support of our project partners at UZ Leuven.

 

Based on this collection, we will abstract away from textual occurrences to develop a typology of variation. We will classify the observed instances regarding their type, such as orthographical and morphological alternation, (ad hoc) abbreviation and affixation, switches in register or substitution by neoclassical scientific terms. At the morpho-syntactic level, this typology will enable the formulation of transformational rules, which may be validated by computing the share of variation types relative to the total variation observed in a set of patient records.
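The share computation mentioned above can be illustrated with a minimal sketch. It assumes that every term variant found in a set of patient records has already been annotated with exactly one variation type; the label names and the sample data are hypothetical and are not taken from the project corpus.

```python
from collections import Counter


def variation_shares(labels):
    """Return each variation type's share of all observed variant instances."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        # No variants observed: there are no shares to report.
        return {}
    # most_common() orders the types from most to least frequent.
    return {label: count / total for label, count in counts.most_common()}


# Hypothetical annotations: one label per term variant found in the records.
observed = [
    "abbreviation", "abbreviation", "orthographic",
    "register_switch", "abbreviation", "morphological",
    "orthographic", "neoclassical",
]

shares = variation_shares(observed)
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

A type whose share stays close to zero across records would suggest that the corresponding transformational rule contributes little in practice, whereas a dominant type would indicate where such rules are likely to pay off first.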

 

Next, we will analyse the cognitive and sociolinguistic level by correlating the observed variation types with factors such as situational context and background knowledge of the speaker. Moreover, despite their normative nature, medical ontologies have been shown to allow for polysemous readings (Kreuzthaler & Schulz 2012). Another angle of analysis will thus explore how the semantic heterogeneity of the underlying concept affects the choice among associated variants of a term.

In the next phase, we will focus on the modelling of particular variation types. Combining ontological knowledge with distributional and morpho-syntactic properties, we will derive a set of lower-level features. Previous studies have shown that available tools for clinical language processing, most of which were developed for English, degrade in performance when transferred to a language with different morphological properties (Skeppstedt et al. 2014). In cooperation with our project partners at LIIR, we will evaluate how the incorporation of our enhanced feature set affects the performance of a state-of-the-art recognizer.

 

Terminological variation has been shown to pattern with contextual and diachronic factors and, in the case of the clinical sublanguage, even with the institution of origin (Lv et al. 2014). For a final evaluation of our dataset, we will include additional relations extracted by the project partners at LIIR. Through a multivariate analysis, we will identify those factors that significantly affect variation and use them for the re-training of a classifier.

 

Finally, we will conduct a transfer study to generalize our findings. The feature set derived from the diabetes records will be applied to other clinical subdomains, which may require a reconfiguration of the features. 

 

References

Afzal, Z., Pons, E., Kang, N., Sturkenboom, M., Schuemie, M. J., & Kors, J. A. (2014). ContextD: an algorithm to identify contextual properties of medical terms in a Dutch clinical corpus. BMC Bioinformatics, 15(1), 373.

Elhadad, N., & Zhang, S. (2014). Characterizing the Sublanguage of Online Breast Cancer Forums for Medications, Symptoms, and Emotions. AMIA Annual Symposium Proceedings, 2014, 516–525.

Fahmi, I. (2009). Automatic term and relation extraction for medical question answering system. Retrieved from http://dissertations.ub.rug.nl/faculties/arts/2009/i.fahmi/

Friedman, C., Kra, P., & Rzhetsky, A. (2002). Two biomedical sublanguages: A description based on the theories of Zellig Harris. Journal of Biomedical Informatics, 35(4), 222–235.

Henriksson, A., Moen, H., Skeppstedt, M., Daudaravičius, V., & Duneld, M. (2014). Synonym extraction and abbreviation expansion with ensembles of semantic spaces. Journal of Biomedical Semantics, 5(1), 6.

Ivanović, M., & Budimac, Z. (2014). An overview of ontologies and data resources in medical domains. Expert Systems with Applications, 41(11), 5158–5166. http://doi.org/10.1016/j.eswa.2014.02.045

Kreuzthaler, M., & Schulz, S. (2012). Metonymies in medical terminologies. A SNOMED CT case study. AMIA Annual Symposium Proceedings, 2012, 463–467.

Lv, X., Guan, Y., & Deng, B. (2014). Transfer learning based clinical concept extraction on data from multiple sources. Journal of Biomedical Informatics, 52, 55–64.

Névéol, A., & Zweigenbaum, P. (2015). Clinical Natural Language Processing in 2014: Foundational Methods Supporting Efficient Healthcare Topics in Clinical NLP. Yearbook of Medical Informatics, 194–198.

Pradhan, S., Elhadad, N., South, B. R., Martinez, D., Christensen, L., Vogel, A., & Savova, G. (2015). Evaluating the state of the art in disorder recognition and normalization of the clinical narrative. Journal of the American Medical Informatics Association, 22(1), 143–154.

Skeppstedt, M., Kvist, M., Nilsson, G. H., & Dalianis, H. (2014). Automatic recognition of disorders, findings, pharmaceuticals and body structures from clinical text: An annotation and machine learning study. Journal of Biomedical Informatics, 49, 148–158.

Velupillai, S., Mowery, D., South, B. R., Kvist, M., & Dalianis, H. (2015). Recent Advances in Clinical Natural Language Processing in Support of Semantic Analysis. Yearbook of Medical Informatics, 10(1), 183–193.

 

Date: 10 May 2016 → 16 Sep 2019
Keywords: lexical variation, medical language, clinical corpus
Disciplines: Education curriculum, Linguistics, Theory and methodology of linguistics, Other languages and literary studies
Project type: PhD project