Named Entity Recognition (NER) is a central task in Natural Language Processing. Information extraction systems, chatbots and machine translation systems all benefit from good NER models. Despite this, no freely available Norwegian dataset for training such models has existed until now. The Norwegian Named Entities corpus (NorNE) was launched in June 2019. The corpus has been developed by Schibsted Media Group, the Language Technology Group at the University of Oslo and the Language Bank at the National Library of Norway, a national infrastructure service offering digital language resources for use in research and development of language technology.
NorNE is a corpus of running text in which all named entities are marked and classified into categories such as person, organization and location. The annotations in the corpus are in the public domain (CC-ZERO), and the dataset can be used for both research and commercial development. NorNE is added on top of an existing corpus with morphological and syntactic annotation, the Norwegian Dependency Treebank (NDT). NDT consists of approximately 600 000 tokens of running text from newspapers, blogs, parliamentary proceedings and reports, equally divided between Bokmål and Nynorsk, the two Norwegian written standards. There are several advantages to adding NorNE on top of NDT. Firstly, the license allows the corpus to be used in commercial development. Secondly, NER systems can potentially benefit from the grammatical analysis. Thirdly, the corpus can be extended with new layers of annotation, e.g. coreference annotation, which draws on both the grammatical and the NER layers.
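To make the idea of entities "marked and classified" on top of running text concrete, the sketch below groups token-level tags into entity spans. It assumes a BIO-style encoding (B- opens an entity, I- continues it, O is outside), which is a common convention for NER corpora but is not stated in this text. The function name extract_entities, the label names PER and ORG, and the example sentence are all invented for illustration.

```python
from typing import List, Tuple


def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, label) pairs.

    Assumption: tags look like "B-PER", "I-PER" or "O". An "I-" tag that
    does not continue an open entity of the same label is treated as
    outside any entity.
    """
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label = ""

    def close_current() -> None:
        # Store the entity collected so far, if there is one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; finish the previous one first.
            close_current()
            current_tokens = [token]
            current_label = tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_label:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # Outside tag (or a stray I- tag): close any open entity.
            close_current()
            current_tokens = []
            current_label = ""

    close_current()
    return entities


# Invented example sentence with hypothetical labels.
tokens = ["Erna", "Solberg", "besøkte", "Universitetet", "i", "Oslo", "."]
tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG", "O"]
print(extract_entities(tokens, tags))
```

Because the entity layer sits on the same tokens as the morphological and syntactic layers, a grouping like this could in principle be combined with part-of-speech or dependency information for each token.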
About half of NorNE was annotated by two human annotators, while the rest of the material was annotated by a single annotator following the guidelines developed during the double annotation. This annotation regime ensures high consistency, which is crucial for building NER systems of good quality.
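Consistency on doubly annotated material is usually quantified with an agreement score. The text does not say which measure, if any, was used for NorNE, so the sketch below uses Cohen's kappa on token-level labels purely as one common choice. The function name cohen_kappa and the two label sequences are invented for illustration and are not taken from the corpus.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same tokens."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty token sequence.")

    n = len(labels_a)

    # Observed agreement: share of tokens given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label if each
    # annotator labelled at random according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented labels for eight tokens; the annotators disagree on one token.
annotator_1 = ["PER", "PER", "O", "ORG", "O", "O", "LOC", "O"]
annotator_2 = ["PER", "PER", "O", "ORG", "O", "O", "ORG", "O"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters in NER because most tokens fall outside any entity and would inflate a plain percentage.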