The annotations generated by GATE Developer are the result of the different processing resources (PRs) embedded in a pipeline. The PRs must run in a specific order, because each one produces annotations that later PRs depend on.
The first PR used in the pipeline is the Document Reset PR, which removes any previous annotations prior to processing. The second PR is the English Tokenizer, which parses the text into tokens for character strings and space tokens for the whitespace between them. The third PR is the Gazetteer, which contains lists of specific terms that are of interest. These terms are captured as lookup annotations. The fourth PR is the Sentence Splitter, which marks as a sentence each run of tokens and space tokens that ends with a split token, such as a period, question mark, or exclamation point. The fifth PR is the Part of Speech (POS) Tagger, which uses a modified version of the Brill tagger to identify the part of speech of each token. The resulting annotations are fed to the Named Entity (NE) Transducer and OrthoMatcher to form the location, organization, and person annotations. Finally, annotations that are not resolved through the NE Transducer and OrthoMatcher are tagged with the unknown annotation.
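To make the ordering dependency concrete, the following is a minimal sketch in Python of the first four stages. It is a toy stand-in, not GATE's implementation or API: the function names, the `doc` dictionary, the regular expressions, and the sample terms are all assumptions made for illustration. It only shows why the order matters: the sentence splitter here reads the Token annotations that the tokenizer produced, so it cannot run first.

```python
import re

def reset(doc):
    """Document Reset stand-in: drop annotations left over from an earlier run."""
    doc["annotations"] = []
    return doc

def tokenize(doc):
    """Tokenizer stand-in: Token for words and punctuation, SpaceToken for whitespace."""
    for m in re.finditer(r"\s+|\w+|[^\w\s]", doc["text"]):
        kind = "SpaceToken" if m.group().isspace() else "Token"
        doc["annotations"].append((kind, m.start(), m.end()))
    return doc

def gazetteer(doc, terms):
    """Gazetteer stand-in: add a Lookup annotation for each listed term found."""
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, doc["text"], re.IGNORECASE):
            doc["annotations"].append(("Lookup", m.start(), m.end()))
    return doc

def split_sentences(doc):
    """Sentence Splitter stand-in: close a sentence at each '.', '?' or '!' token.

    Depends on the Token annotations created by tokenize().
    """
    text = doc["text"]
    tokens = sorted(
        (a for a in doc["annotations"] if a[0] == "Token"), key=lambda a: a[1]
    )
    start = None
    for _, begin, end in tokens:
        if start is None:
            start = begin
        if text[begin:end] in {".", "?", "!"}:
            doc["annotations"].append(("Sentence", start, end))
            start = None
    if start is not None:  # trailing text without a closing split token
        doc["annotations"].append(("Sentence", start, tokens[-1][2]))
    return doc

def run_pipeline(text, terms):
    """Run the stages in the same order as the pipeline described above."""
    # Seed a stale annotation to show that reset() clears earlier results.
    doc = {"text": text, "annotations": [("Stale", 0, 1)]}
    stages = (reset, tokenize, lambda d: gazetteer(d, terms), split_sentences)
    for stage in stages:
        doc = stage(doc)
    return doc

if __name__ == "__main__":
    sample = "The chassis failed. Was the drivetrain inspected?"
    result = run_pipeline(sample, ["chassis", "drivetrain"])
    for kind, begin, end in result["annotations"]:
        if kind in ("Lookup", "Sentence"):
            print(kind, repr(sample[begin:end]))
```

The POS tagging and named entity stages are left out of the sketch because they rely on trained rules and grammars that a few lines cannot represent faithfully.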
This first evaluation demonstrates that significant work remains before the text can be evaluated effectively. Further, the need for a specified vocabulary is highlighted by the fact that specific terms related to the manufacturing and vehicle industries have been miscategorized.
Note: The annotated XML is loaded directly into this page and can be reviewed using your browser's Developer Tools.
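As an alternative to reading the XML by eye, a short script can summarize which annotation types it contains. The sketch below assumes the file follows GATE's own XML serialization, in which each annotation is an `Annotation` element with a `Type` attribute; if the XML on this page was saved in a different form (for example, inline tags), the element and attribute names would need to change. The sample document and the function name `count_annotation_types` are hypothetical and are not taken from the annotated output.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, hand-written sample in the assumed GATE XML layout.
SAMPLE_XML = """<GateDocument>
  <TextWithNodes><Node id="0"/>Acme<Node id="4"/> <Node id="5"/>builds<Node id="11"/> <Node id="12"/>axles<Node id="17"/></TextWithNodes>
  <AnnotationSet>
    <Annotation Id="1" Type="Organization" StartNode="0" EndNode="4"/>
    <Annotation Id="2" Type="Token" StartNode="5" EndNode="11"/>
    <Annotation Id="3" Type="Unknown" StartNode="12" EndNode="17"/>
  </AnnotationSet>
</GateDocument>"""

def count_annotation_types(xml_text):
    """Count annotations per Type attribute across all annotation sets."""
    root = ET.fromstring(xml_text)
    return Counter(a.get("Type") for a in root.iter("Annotation"))

if __name__ == "__main__":
    for ann_type, count in sorted(count_annotation_types(SAMPLE_XML).items()):
        print(ann_type, count)
```

A high count of unknown annotations relative to the named entity types would be one quick signal that domain terms are missing from the vocabulary.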