NaCTeM

Occupational exposure corpus

Introduction

An individual's likelihood of developing non-communicable diseases is often influenced by the types, intensities and duration of exposures at work. Job exposure matrices provide exposure estimates associated with different occupations. However, due to their time-consuming expert curation process, job exposure matrices currently cover only a subset of possible workplace exposures and may not be regularly updated. Scientific literature articles describing exposure studies provide important supporting evidence for developing and updating job exposure matrices, since they report on exposures in a variety of occupational scenarios. However, the constant growth of scientific literature is increasing the challenges of efficiently identifying relevant articles and important content within them.

We have developed a novel annotated corpus of scientific articles to support machine learning-based named entity recognition relevant to occupational substance exposures. Through incremental refinements to the annotation process, we demonstrate that expert annotators can attain high levels of agreement, and that the corpus can be used to train high-performance named entity recognition models. The corpus thus constitutes an important foundation for the wider development of natural language processing tools to support the study of occupational exposures.

Corpus Description

The corpus consists of selected sections (i.e., Abstract, Methods and Results) of scientific research articles concerning occupational exposures to two different types of substance: diesel exhaust (51 articles) and respirable crystalline silica (RCS) (50 articles). The article sections have been annotated by experts in the field with six categories of named entities (NEs) relevant to the assessment of occupational substance exposures, particularly in the context of Job Exposure Matrices (JEMs).

Named Entity Categories

The table below provides details and examples of the six categories of NEs that have been annotated in the corpus.

| Category | Definition | Examples |
|---|---|---|
| SubstanceOrExposureMeasured | Measured substance, chemical or pollutant | respirable quartz dust; elemental carbon |
| OccupationJobTitle | Job/occupation of subject(s) of exposure studies | carpenters; concrete workers; operators in the refinery |
| IndustryWorkplace | Workplace or industry involved in the sampling series | mining operations; diesel factory; four-lane motorway |
| JobTaskactivity | Physical activity/action forming part of workers' daily duties | welding; concrete pouring; mechanical mowing of weeds |
| OHMeasurementDevice | Device/apparatus used to measure workplace exposure levels | IOM samplers; Higgins Dewell cyclones; Dräger stain tubes |
| SampleTypePersonal | Phrases denoting that samples represent personal exposures | personal measurements; personal breathing zone sample |

Corpus Statistics

The table below provides annotation statistics for the corpus, using the following measures:

  • Total annotations - Total number of spans annotated in the indicated category.
  • Unique spans - Number of distinct spans annotated in the indicated category, after converting to lower case.
  • Unique span frequency - Average number of times that each unique span in the indicated category was annotated, i.e., total annotations divided by unique spans.

| Category | Total Annotations | Unique Spans | Unique Span Frequency |
|---|---|---|---|
| SubstanceOrExposureMeasured | 762 | 81 | 9.41 |
| OccupationJobTitle | 2159 | 644 | 3.35 |
| IndustryWorkplace | 2764 | 927 | 2.98 |
| JobTaskactivity | 1582 | 973 | 1.6 |
| OHMeasurementDevice | 896 | 412 | 2.17 |
| SampleTypePersonal | 517 | 115 | 4.5 |
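
The three measures defined above can be reproduced from any list of annotated spans. The following is a minimal sketch rather than the code used to build the table: the function name `span_statistics` and the toy input are hypothetical, and the input is assumed to be a sequence of (category, span text) pairs.

```python
from collections import defaultdict


def span_statistics(annotations):
    """Compute per-category annotation statistics.

    `annotations` is assumed to be an iterable of (category, span_text)
    pairs; this input shape is an illustration, not the corpus format.
    """
    totals = defaultdict(int)   # total annotations per category
    unique = defaultdict(set)   # distinct lower-cased spans per category
    for category, text in annotations:
        totals[category] += 1
        unique[category].add(text.lower())
    return {
        category: {
            "total": totals[category],
            "unique": len(unique[category]),
            # Average number of annotations per unique span.
            "frequency": round(totals[category] / len(unique[category]), 2),
        }
        for category in totals
    }


# Hypothetical toy data, not taken from the corpus.
sample = [
    ("JobTaskactivity", "Welding"),
    ("JobTaskactivity", "welding"),
    ("JobTaskactivity", "concrete pouring"),
    ("OccupationJobTitle", "carpenters"),
]
stats = span_statistics(sample)
```

Because spans are lower-cased before counting, "Welding" and "welding" count as one unique span, so the toy `JobTaskactivity` category has three annotations over two unique spans.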

Corpus Format

The corpus is available in two different formats:

  • brat standoff format - The text for each article is stored in its own file, and the corresponding NE annotations are stored in a separate file alongside it. The format is fully described in the brat documentation.
  • JSON - The complete corpus is stored in a single file. The file includes the text for each article, metadata regarding the source of the article and the NE annotations.
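
As an illustration of the standoff format, the sketch below reads entity annotations from the contents of a brat annotation file. In brat standoff, each text-bound annotation is a line of the form `ID<TAB>Type start end<TAB>text`, with discontinuous spans written as `start end;start end`. The function name `parse_brat_entities` and the example annotation lines are hypothetical and are not taken from the corpus. The JSON layout is not specified here, so it is not sketched.

```python
def parse_brat_entities(ann_text):
    """Parse text-bound (entity) annotations from brat standoff content.

    Only lines whose ID starts with "T" are entities; other annotation
    types (relations, attributes, notes, etc.) are skipped.
    """
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue
        ann_id, type_and_offsets, text = line.split("\t", 2)
        category, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous annotations list several "start end" fragments
        # separated by semicolons.
        fragments = [
            tuple(int(n) for n in fragment.split())
            for fragment in offsets.split(";")
        ]
        entities.append(
            {"id": ann_id, "category": category,
             "fragments": fragments, "text": text}
        )
    return entities


# Hypothetical annotation lines, for illustration only.
example_ann = (
    "T1\tJobTaskactivity 0 7\twelding\n"
    "T2\tOccupationJobTitle 20 30\tcarpenters\n"
)
entities = parse_brat_entities(example_ann)
```

Offsets are character positions into the corresponding article text file, so each fragment can be checked by slicing that text with its start and end values.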

Corpus Download

The corpus and associated annotation guidelines may be downloaded from the associated record on Zenodo.

NER models and code

NER models and associated code are available at: https://github.com/panagiotis-geo/Substance_Exposure_NER/

Related Publication

Thompson, P., Ananiadou, S., Basinas, I., Brinchmann, B. C., Cramer, C., Galea, K. S., Ge, C., Georgiadis, P., Kirkeleit, J., Kuijpers, E., Nguyen, N., Nuñez, R., Schlünssen, V., Stokholm, Z. A., Taher, E. A., Tinnerberg, H., Van Tongeren, M. and Xie, Q. (2024). Supporting the working life exposome: annotating occupational exposure for enhanced literature search. PLoS ONE 19(8): e030784.

Licence

The corpus was constructed at the National Centre for Text Mining (NaCTeM), School of Computer Science, University of Manchester, UK. It is licensed under a Creative Commons Attribution 4.0 International License. Please attribute NaCTeM when using the corpus, and please cite the following article:

Thompson, P., Ananiadou, S., Basinas, I., Brinchmann, B. C., Cramer, C., Galea, K. S., Ge, C., Georgiadis, P., Kirkeleit, J., Kuijpers, E., Nguyen, N., Nuñez, R., Schlünssen, V., Stokholm, Z. A., Taher, E. A., Tinnerberg, H., Van Tongeren, M. and Xie, Q. (2024). Supporting the working life exposome: annotating occupational exposure for enhanced literature search. PLoS ONE 19(8): e030784.