Evaluation for BioNLP'09 Shared Task
The evaluation for the BioNLP'09 Shared Task is based on the equality of events as defined below. That is, each submitted event will be judged either as correct or incorrect as a whole (as opposed to e.g. measuring each event argument assignment separately). Evaluation results are reported using the standard precision/recall/f-score metrics.
The evaluation thus places an emphasis on getting entire events right, as opposed to just those arguments that can be predicted most confidently.
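As a rough illustration of the reported metrics, the sketch below computes precision, recall and f-score from event-level counts. The function name `precision_recall_f` and its parameters are hypothetical, and it assumes the balanced F1 variant of the f-score and a one-to-one pairing of correct submitted events with gold events; the official scorer may count differently.

```python
def precision_recall_f(num_correct, num_submitted, num_gold):
    """Event-level precision, recall and balanced f-score.

    num_correct:   submitted events judged correct as a whole
    num_submitted: all submitted events
    num_gold:      all gold standard events
    (Assumes each correct submitted event pairs with exactly one gold event.)
    """
    precision = num_correct / num_submitted if num_submitted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

Because an event counts only when it is correct as a whole, a submission with one wrong argument adds to `num_submitted` without adding to `num_correct`.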
Event equality
There are several aspects to the equality of events, including event type, the identification of the words expressing the event (event trigger expression), the event participants and arguments, and, in turn, the correctness of the entities and events that these refer to. We will apply a number of different correctness criteria:
- strict equality: for an event to be correct, it must match an event of the gold standard annotation in all of the above-mentioned aspects.
- approximate boundary matching: the spans of identified entities and event trigger expressions are allowed to differ from the exact gold spans.
- approximate recursive matching: the requirement that for an event to be correct, events that it refers to must also be correct is relaxed.
Detailed definitions are given below. Note that all criteria require that the event type be correct and that all participants and arguments be correct. Combinations of the criteria may also be considered.
Strict equality
The strict equality criteria require that for a submitted event to match a gold standard event:
- 1) The event types are the same
- 2) The event trigger expressions are the same
- 3) For each event argument, there is a matching argument where the referenced entities/events match:
- 3.1) Types are the same (both entities and events)
- 3.2) The text spans (entities) / trigger expressions (events) are the same
- 3.3) The arguments of events are the same (recursively following this definition)
(In (3), "for each event argument" should be understood to refer to both the answer and gold, and "matching argument" to gold or answer (resp.): there can be no extra or missing arguments.)
Two entity / trigger expression spans (beg1, end1) and (beg2, end2), are the same iff beg1 = beg2 and end1 = end2.
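The strict criteria above can be sketched as a recursive comparison. This is a minimal illustration, not the official evaluation code: the `Entity` and `Event` classes, the `(role, referent)` argument representation and the function names are all assumptions made for the example, as is the requirement that argument role names be identical.

```python
from dataclasses import dataclass
from typing import Tuple, Union

Span = Tuple[int, int]  # (beg, end) offsets


@dataclass(frozen=True)
class Entity:  # hypothetical container for an entity
    type: str
    span: Span


@dataclass(frozen=True)
class Event:  # hypothetical container for an event
    type: str
    trigger: Span
    # ((role, referenced entity or event), ...)
    args: Tuple[Tuple[str, Union["Entity", "Event"]], ...] = ()


def same_span(a, b):
    # Spans are the same iff beg1 = beg2 and end1 = end2.
    return a[0] == b[0] and a[1] == b[1]


def strict_match(answer, gold):
    # Entities: same type (3.1) and same text span (3.2).
    if isinstance(answer, Entity) or isinstance(gold, Entity):
        return (isinstance(answer, Entity) and isinstance(gold, Entity)
                and answer.type == gold.type
                and same_span(answer.span, gold.span))
    # Events: same type (1) and same trigger expression (2).
    if answer.type != gold.type or not same_span(answer.trigger, gold.trigger):
        return False
    # (3) No extra or missing arguments: pair each answer argument
    # with a distinct gold argument that matches recursively (3.3).
    if len(answer.args) != len(gold.args):
        return False
    remaining = list(gold.args)
    for role, ref in answer.args:
        for i, (gold_role, gold_ref) in enumerate(remaining):
            if role == gold_role and strict_match(ref, gold_ref):
                del remaining[i]
                break
        else:
            return False
    return True
```

Pairing arguments greedily is sufficient here because strict matching is symmetric and transitive, so any matching gold argument is interchangeable with any other.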
Although strict equality serves as the basis of the evaluation criteria, it may be viewed as impractically strict given the complexity of the problem and some of the features of the data. We therefore also provide relaxed evaluation criteria, which are defined by considering the value of the extracted information from a practical point of view.
Approximate span matching
In detail (items 2 and 3.2 differ from the strict criteria):
- 1) The event types are the same
- 2) The given event trigger expression is equivalent to that of the gold standard
- 3) For each event argument, there is a matching argument where the referenced entities/events match:
- 3.1) Types are the same (both entities and events)
- 3.2) The given text span (entities) / trigger expression (events) is equivalent to that of the gold standard
- 3.3) The arguments of events are the same (recursively following this definition)
For approximate matching, equivalent is defined as follows: a given span is equivalent to a gold span if it is entirely contained within an extension of the gold span by one word both to the left and to the right, that is, beg1 >= ebeg2 and end1 <= eend2, where (beg1, end1) is the given span and (ebeg2, eend2) is the extended gold span.
Thus, for example, the given span (underlined) A plays role in [...] is equivalent to the (hypothetical) gold span A plays role in [...] as it is contained in the extended span A plays role in [...].
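The containment test can be sketched as follows. The definition does not say how words are delimited, so this example assumes whitespace-separated words and character offsets with an exclusive end; the function names and the sample sentence are hypothetical.

```python
import re


def extend_by_one_word(text, beg, end):
    """Extend the span (beg, end) by one word to the left and to the right.

    Assumption: words are maximal runs of non-whitespace characters.
    """
    words = [m.span() for m in re.finditer(r"\S+", text)]
    left = [b for b, e in words if e <= beg]    # words ending before the span
    right = [e for b, e in words if b >= end]   # words starting after the span
    ebeg = left[-1] if left else beg
    eend = right[0] if right else end
    return ebeg, eend


def approx_same_span(text, given, gold):
    """True if the given span lies within the extended gold span."""
    ebeg2, eend2 = extend_by_one_word(text, gold[0], gold[1])
    beg1, end1 = given
    return beg1 >= ebeg2 and end1 <= eend2


# Hypothetical sentence; the gold span (4, 11) covers "binding".
text = "the binding of TRAF2 to CD40"
print(approx_same_span(text, (4, 14), (4, 11)))   # "binding of"
print(approx_same_span(text, (4, 20), (4, 11)))   # "binding of TRAF2"
```

Under these assumptions the first call accepts "binding of", which stays inside the one-word extension, and the second rejects "binding of TRAF2", which reaches two words past the gold span.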
(Please note that we may still fine-tune this definition of approximate span equivalence to reduce the possibility of abuse.)
Approximate recursive matching
In detail (item 3.3 differs from the strict criteria):
- 1) The event types are the same
- 2) The event trigger expressions are the same
- 3) For each event argument, there must be a matching argument where the referenced entities/events match:
- 3.1) Types are the same (both entities and events)
- 3.2) The text spans (entities) / trigger expressions (events) are the same
- 3.3) The arguments of events match partially
For partial matching, only Theme arguments are considered. Referred events are thus considered to match even if they differ in non-Theme arguments.
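One way to read the partial matching rule is sketched below. The `Node` class, the function names and the argument representation are hypothetical. The sketch also assumes that only arguments whose role is exactly "Theme" count, and that Theme-only comparison continues at every deeper level of nesting; the text above does not settle either point.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical entity or event: type, span (or trigger span), arguments."""
    type: str
    span: Tuple[int, int]
    args: tuple = ()  # ((role, Node), ...); empty for entities


def args_match(answer_args, gold_args, ref_match):
    # Pair every answer argument with a distinct gold argument.
    if len(answer_args) != len(gold_args):
        return False
    remaining = list(gold_args)
    for role, ref in answer_args:
        for i, (gold_role, gold_ref) in enumerate(remaining):
            if role == gold_role and ref_match(ref, gold_ref):
                del remaining[i]
                break
        else:
            return False
    return True


def theme_args(node):
    # Assumption: only the role named exactly "Theme" is kept.
    return tuple(arg for arg in node.args if arg[0] == "Theme")


def partial_match(answer, gold):
    """Match for referenced entities/events: non-Theme arguments are ignored."""
    if answer.type != gold.type or answer.span != gold.span:
        return False
    return args_match(theme_args(answer), theme_args(gold), partial_match)


def approx_recursive_match(answer, gold):
    """Top-level events: all arguments are compared, referents only partially."""
    if answer.type != gold.type or answer.span != gold.span:
        return False
    return args_match(answer.args, gold.args, partial_match)
```

With this reading, a submitted regulation event whose Theme is a phosphorylation event missing a non-Theme argument still matches the gold regulation event, while the phosphorylation event itself, evaluated at top level, does not.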
Event Decomposition
In this mode, an event with more than one argument, e.g.
event-type:trigger-id arg1-type:arg1-id arg2-type:arg2-id ...
is decomposed into multiple predicate-argument relations, e.g.
event-type:trigger-id arg1-type:arg1-id
event-type:trigger-id arg2-type:arg2-id
...
Each relation is then evaluated as if it were a single-argument event.
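The decomposition can be sketched as a simple string transformation. The function name `decompose` is hypothetical, and the sketch assumes that an event is written on one line as whitespace-separated `type:id` pairs in the layout shown above; the identifiers in the usage note are made up for illustration.

```python
def decompose(event_line):
    """Split a multi-argument event into single-argument relations.

    Assumes the layout 'event-type:trigger-id arg1-type:arg1-id ...'.
    """
    head, *args = event_line.split()
    if len(args) <= 1:
        # Nothing to decompose: return the event unchanged.
        return [" ".join([head, *args])]
    return [f"{head} {arg}" for arg in args]
```

For example, `decompose("Binding:T5 Theme:T1 Theme2:T2")` yields `["Binding:T5 Theme:T1", "Binding:T5 Theme2:T2"]`, and each resulting relation can then be scored like any single-argument event.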