In this example, we will develop an NLP workflow that identifies biomedical named entities (e.g., gene and protein names) in 100 abstracts retrieved from PubMed.
Step 1: Adding a collection reader
The first NLP component in every UIMA/U-Compare workflow is a collection reader. A collection reader is responsible for retrieving documents from a collection (e.g., an online library such as PubMed) and preparing them as input for the subsequent components. In this example, we will use Kleio Search, which can be found in the component library under the Collection Readers category.
Kleio Search is a semantic search engine, developed by NaCTeM, which fetches MEDLINE abstracts matching a specific query. In our case, we use the query “content:CAT AND PROTEIN:"cat"”, which retrieves documents that are relevant to the protein cat.
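To make the role of a collection reader more concrete, here is a minimal sketch in Python. It is not Kleio Search and does not use its API or query language: the `Document` class, the `read_collection` function, the toy abstracts and the plain substring match are all assumptions made for illustration. It only shows the general idea of selecting matching documents from a collection, capping their number (100 in our example) and handing them on in a uniform shape.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterable, Iterator, Tuple


@dataclass
class Document:
    """Hypothetical container for one retrieved abstract."""
    doc_id: str
    text: str


def read_collection(abstracts: Iterable[Tuple[str, str]],
                    term: str,
                    limit: int = 100) -> Iterator[Document]:
    """Yield at most `limit` documents whose text contains `term`.

    This is a case-insensitive substring match, a crude stand-in for
    the semantic search a real collection reader would perform.
    """
    matches = (
        Document(doc_id, text)
        for doc_id, text in abstracts
        if term.lower() in text.lower()
    )
    return islice(matches, limit)


# Toy in-memory collection (invented text, not real MEDLINE abstracts).
toy_abstracts = [
    ("doc-1", "The CAT protein was purified from the sample."),
    ("doc-2", "We studied kinase signalling in yeast."),
    ("doc-3", "Expression of cat was measured in all strains."),
]

documents = list(read_collection(toy_abstracts, term="cat"))
print([d.doc_id for d in documents])
```

Returning an iterator rather than a list means the components that follow could start processing before the whole collection has been read.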
Step 2: Adding processing components
In this step, we need to add a biomedical Named Entity Recogniser (NER) component. From the available ones, we choose NeMine, which identifies genes and proteins in free text (ABNER, NLPBA or Genia Tagger with Tokenisation are also applicable NERs for this case).
Once we have placed NeMine in the workflow design panel, we immediately observe that this component expects sentences as input. Hence, we first need to split the documents retrieved by Kleio Search into sentences. To do so, we need to place a Sentence Splitter component before NeMine.
Our workflow is now valid (i.e., the input that each component expects is provided by the output of the components preceding it).
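The compatibility check can be sketched as follows. This is a simplified illustration, not how U-Compare implements validation: the `Component` class, the `first_unmet_input` function and the type names `Document`, `Sentence` and `Protein` are assumptions. Each component declares the annotation types it needs and the types it produces, and a workflow is treated as valid when every component's inputs have been produced by an earlier component.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Hypothetical description of a workflow component."""
    name: str
    inputs: frozenset = frozenset()   # annotation types required
    outputs: frozenset = frozenset()  # annotation types produced


def first_unmet_input(workflow):
    """Return (component name, missing types) for the first component
    whose inputs are not produced earlier in the workflow, or None if
    the workflow is valid."""
    available = set()
    for component in workflow:
        missing = component.inputs - available
        if missing:
            return component.name, sorted(missing)
        available |= component.outputs
    return None


reader = Component("Kleio Search", outputs=frozenset({"Document"}))
splitter = Component("Sentence Splitter",
                     inputs=frozenset({"Document"}),
                     outputs=frozenset({"Sentence"}))
ner = Component("NeMine",
                inputs=frozenset({"Sentence"}),
                outputs=frozenset({"Protein"}))

without_splitter = [reader, ner]
with_splitter = [reader, splitter, ner]

print(first_unmet_input(without_splitter))  # NeMine lacks Sentence input
print(first_unmet_input(with_splitter))     # None: workflow is valid
```

In this sketch, the workflow without the Sentence Splitter is rejected because nothing placed before NeMine produces sentences, which mirrors the situation described above.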
Step 3: Executing the workflow
We can execute the workflow by simply clicking the play button (it can be found under the workflow design panel). U-Compare will then execute the components in our workflow sequentially. Once U-Compare has finished processing the documents, a Session Results window will pop up, as illustrated in the following screenshot.
A Session Results window contains two tabs:
- Performance Statistics: Time needed to execute the workflow and the average time taken by each component.
- Annotations Statistics: A summary of the annotations (e.g., the number of annotations for each document).
We can view the annotations per document by clicking the corresponding show button in the Performance Statistics tab.
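As a rough illustration of what happens during execution and how the two kinds of statistics could be derived, the sketch below runs a toy workflow sequentially over a few documents. Everything in it is an assumption: `split_sentences` is a naive regular-expression splitter, `tag_entities` is a dictionary lookup over an invented lexicon standing in for a real NER such as NeMine, and `run_workflow` is a hypothetical driver, not U-Compare code.

```python
import re
import time
from collections import defaultdict


def split_sentences(doc):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    doc["sentences"] = [
        s for s in re.split(r"(?<=[.!?])\s+", doc["text"]) if s
    ]


def tag_entities(doc):
    # Toy stand-in for an NER: lookup in a hypothetical lexicon.
    lexicon = {"cat", "p53"}
    doc["entities"] = [
        token
        for sentence in doc["sentences"]
        for token in re.findall(r"\w+", sentence)
        if token.lower() in lexicon
    ]


def run_workflow(documents, components):
    """Run each component on each document, in order, and collect
    per-component average times and per-document annotation counts."""
    total_seconds = defaultdict(float)
    for doc in documents:
        for name, component in components:
            start = time.perf_counter()
            component(doc)
            total_seconds[name] += time.perf_counter() - start

    n_docs = max(len(documents), 1)  # avoid division by zero
    performance = {
        name: seconds / n_docs for name, seconds in total_seconds.items()
    }
    annotations = {
        doc["id"]: {
            "sentences": len(doc["sentences"]),
            "entities": len(doc["entities"]),
        }
        for doc in documents
    }
    return performance, annotations


# Invented example texts, not real abstracts.
docs = [
    {"id": "doc-1",
     "text": "The CAT protein was purified. Levels of p53 were unchanged."},
    {"id": "doc-2", "text": "No entities are mentioned here."},
]
workflow = [("Sentence Splitter", split_sentences), ("NER", tag_entities)]

performance, annotations = run_workflow(docs, workflow)
print(performance)   # average seconds per document, per component
print(annotations)   # annotation counts per document
```

The `performance` dictionary corresponds loosely to the Performance Statistics tab and `annotations` to the Annotations Statistics tab; the timings will vary from run to run.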