Medical Evidence Interpretation for Risk Assessment (MOIRA)


The role of reliable data in medicine cannot be overestimated. This applies not only to information describing general population-level phenomena covered in scientific publications, but also to health service records describing individuals. Although text mining methods have been widely applied to the former category, the latter has attracted much less attention. One of the main reasons is that these data were previously stored in a format that made them less accessible for digital processing, i.e., as paper documents, which were frequently handwritten. However, the increasing adoption of digital solutions by both health service institutions and individual medical practitioners has started to change the picture. This new situation poses both challenges and opportunities for text mining methods, since there is potentially valuable knowledge contained in individual medical records. In this project, we aim to analyse medical reports using text mining techniques, with the specific goal of quantifying the risk associated with the evidence described.


This project is being undertaken by NaCTeM in cooperation with a commercial partner, Pacific Life Re. The main task is to analyse an individual's medical report and determine the level of risk associated with the conditions described. The main challenges include the following:

  • Medical reports are highly structured documents, containing many elements of different types, such as simple information (e.g., date of birth, gender, height and weight), enumerations (e.g., prescribed drugs), textual descriptions (e.g., outcomes of hospital visits) and references to external documents (e.g., test results).
  • Risk can be associated with entities of different types: diseases, symptoms, drugs, test results or habits.
  • The influence of a certain risk factor always depends on its context in the document, e.g., temporal context (since medical reports frequently cover many years of treatment) or accompanying gradable adjectives (e.g., severe).
  • The final risk is usually not a simple sum of the influences of individual factors, as some of them may strongly interact with each other, and thus have a significant impact on overall risk, e.g., family history and negative results of related tests.
  • External knowledge is necessary to interpret the document, as the importance of certain types of evidence (e.g., the fact that the individual has previously suffered from a particular disease) is considered to be implicitly understood by the reader, and hence is not explicitly written in the report.
  • The quality of language is frequently poor: reports may contain many (potentially non-standard) abbreviations and acronyms, incomplete sentences and correspondence with patients, which can pose significant challenges for text mining methods.
The goal of our work is not only to automatically compute a global risk per document, but also to allow humans to understand how the risk value was calculated, by automatically highlighting the parts of the text that contribute significantly to the result. This would allow them to make use of the text-mined information, even if the system was unable to detect all of the risks mentioned.
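To make the challenges above more concrete, the following Python fragment is a minimal sketch of one way a per-document risk score could be assembled from context-dependent factor contributions, a pairwise interaction term and a list of highlighted text spans. It is an illustration only and does not describe the method used in the project: the factor names, weights, the ten-year half-life, the interaction adjustment and the highlighting threshold are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    text: str               # span of the report that mentions the factor
    factor: str             # normalised factor name (hypothetical label)
    years_ago: float        # temporal context of the mention
    severity: float = 1.0   # multiplier from gradable adjectives, e.g. "severe"
    negated: bool = False   # True if the report states the factor is absent


# Hypothetical base weights per factor (not taken from the project).
BASE_WEIGHT = {
    "pneumonia": 2.0,
    "family_history_heart_disease": 1.0,
    "normal_cardiac_test": 0.0,
}

# Hypothetical pairwise interactions: adjustment applied when both are present.
INTERACTIONS = {
    frozenset({"family_history_heart_disease", "normal_cardiac_test"}): -0.5,
}


def contribution(ev: Evidence) -> float:
    """Contribution of one piece of evidence, scaled by its context."""
    if ev.negated:
        return 0.0
    decay = 0.5 ** (ev.years_ago / 10.0)  # assumed: weight halves every 10 years
    return BASE_WEIGHT.get(ev.factor, 0.0) * ev.severity * decay


def score_report(evidence, highlight_threshold=0.5):
    """Return (total risk, spans whose contribution reaches the threshold)."""
    contribs = [(ev, contribution(ev)) for ev in evidence]
    total = sum(c for _, c in contribs)
    present = {ev.factor for ev in evidence if not ev.negated}
    for pair, adjustment in INTERACTIONS.items():
        if pair <= present:  # both interacting factors occur in the report
            total += adjustment
    highlights = [(ev.text, c) for ev, c in contribs
                  if abs(c) >= highlight_threshold]
    return total, highlights


report = [
    Evidence("severe pneumonia", "pneumonia", years_ago=10.0, severity=1.5),
    Evidence("father had heart disease", "family_history_heart_disease", 0.0),
    Evidence("cardiac test normal", "normal_cardiac_test", 0.0),
]
total, highlights = score_report(report)
print(total, highlights)
```

In this sketch the total is deliberately not a plain sum: the interaction table lowers the score when a family history is accompanied by a normal result from a related test, and the returned spans show a reader which parts of the text drove the result.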

Related work

The project will build on NaCTeM's experience in the following relevant areas:

  • topic analysis,
  • coembeddings,
  • term extraction (TerMine),
  • biomedical named entity recognition and normalisation,
  • event attribute recognition (negation, confidence, etc.),
  • active learning,
  • document classification.
The project will also benefit from the expertise of experienced medical risk assessment specialists at Pacific Life Re.

Project team

Principal Investigator: Prof. Sophia Ananiadou
Researcher: Dr. Piotr Przybyła