Controllable readability corpus
Introduction
Owing to the highly technical nature of biomedical documents, the ease with which people can understand their content varies according to their level of domain knowledge. While existing biomedical document summarization systems are generally only able to produce highly technical summaries, it would be desirable for them also to produce plain language summaries (PLSs) that can be understood by lay people. To support the development of summarization systems that can achieve this goal, we have produced a corpus consisting of biomedical papers, accompanied both by their technical summaries and by PLSs written by the authors.
Corpus Description
The corpus consists of 28,124 peer-reviewed biomedical research papers, along with their technical summaries and PLSs, from six PLOS journals that cover a broad range of biomedical research subjects: PLOS Biology, PLOS Computational Biology, PLOS Genetics, PLOS Medicine, PLOS Neglected Tropical Diseases, and PLOS Pathogens.
The PLSs are taken from the Author Summary section of articles. This section consists of a short, non-technical summary of the article, which is distinct from the abstract, with the goal of making the research accessible both to scientists and non-scientists. This is achieved by highlighting where the work fits within a broader context, presenting the significance in a simple manner and avoiding the use of acronyms and complex terminology.
To construct the corpus, we downloaded the complete PLOS article dataset (as of 4th April 2022) and filtered out articles without an Author Summary section. We then extracted the full text, the abstract (as the technical summary), and the Author Summary (as the PLS) from the remaining papers. This resulted in a total of 28,124 document-technical summary-PLS triplets. We randomly sampled 1,000 triplets each to form the development and test sets, while the remaining 26,124 triplets constitute the training set.
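The split described above can be sketched as follows. This is only an illustration of a random hold-out split with the stated sizes, not the script used to build the corpus: the function name `split_triplets`, the seed, and the integer stand-ins for triplets are all assumptions.

```python
import random


def split_triplets(triplets, n_dev=1000, n_test=1000, seed=0):
    """Randomly hold out n_dev and n_test items; the rest form the training set.

    The seed is an arbitrary assumption; the seed used for the released
    corpus is not stated.
    """
    items = list(triplets)
    if n_dev + n_test > len(items):
        raise ValueError("not enough triplets for the requested split sizes")
    rng = random.Random(seed)
    rng.shuffle(items)
    dev = items[:n_dev]
    test = items[n_dev:n_dev + n_test]
    train = items[n_dev + n_test:]
    return train, dev, test


# Integers stand in for the 28,124 document-technical summary-PLS triplets.
train, dev, test = split_triplets(range(28124))
print(len(train), len(dev), len(test))  # 26124 1000 1000
```

The three subsets are disjoint by construction, since they are slices of a single shuffled list.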
Corpus Format
The corpus is provided in JSON Lines format. Separate files are provided containing the training (train_plos.jsonl), development (dev_plos.jsonl) and test (test_plos.jsonl) sets.
Each JSON object corresponds to an article and includes the following five fields:
- doi - DOI of the article
- title - Title of the article
- abstract - Abstract of the article
- plain language summary - PLS for the article (i.e., the content of the Author Summary section)
- article - The full text of the article
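A file in this format can be read one line at a time. The sketch below is a minimal example, assuming that the JSON keys are spelled exactly as in the list above (including the spaces in "plain language summary") and that the files sit in the current working directory. The helper names `read_jsonl` and `missing_fields` are illustrative and are not part of the corpus distribution.

```python
import json
from pathlib import Path

# Field names as listed above; the exact key spelling in the files is an assumption.
FIELDS = ("doi", "title", "abstract", "plain language summary", "article")


def read_jsonl(path):
    """Return a list with one dict per non-empty line of a JSON Lines file."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def missing_fields(record):
    """List the expected fields that are absent from a record."""
    return [name for name in FIELDS if name not in record]


# Only runs if the development set has been downloaded to this directory.
if Path("dev_plos.jsonl").exists():
    dev_records = read_jsonl("dev_plos.jsonl")
    print(len(dev_records), missing_fields(dev_records[0]))
```

Checking `missing_fields` on a few records is a quick way to confirm the key names before building a training pipeline around them.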
Availability
The corpus is available for download according to the terms of the licence below.
Related Publication
Luo, Z., Xie, Q. & Ananiadou, S. (2022). Readability Controllable Biomedical Document Summarization. arXiv. https://doi.org/10.48550/ARXIV.2210.04705
Licence
The corpus was constructed at the National Centre for Text Mining (NaCTeM), School of Computer Science, University of Manchester, UK. It is licensed under a Creative Commons Attribution 4.0 International License. Please attribute NaCTeM when using the corpus, and please cite the following article:
Luo, Z., Xie, Q. & Ananiadou, S. (2022). Readability Controllable Biomedical Document Summarization. arXiv. https://doi.org/10.48550/ARXIV.2210.04705