In October 2011, the National Library initiated a new project: creating gold standard text corpora for Norwegian Bokmål and Nynorsk. “Gold standard” means that the annotation is checked manually. The project has a time frame of two years, and the work is carried out by Språkbanken in collaboration with the Text Laboratory at the Department of Linguistic and Scandinavian Studies at the University of Oslo.
To develop parsers and other language technology tools for automatic analysis of running text (for instance annotation with part of speech, inflection and sentence structure), it is necessary to have corpora which can be used for testing and training these tools while they are being developed. In an ideal world, these corpora are error-free, i.e., they adhere to a “gold standard”. To achieve this, the morphological and syntactic analyses are checked and interpreted manually by specialists, so that developers can use the resulting corpus to test the precision of the tools they develop.
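As a rough illustration of how a gold standard corpus can be used for testing, the sketch below compares the part-of-speech tags produced by a tool with manually checked tags and reports the share of tokens that agree. The function name `tagging_accuracy` and the tag labels are assumptions made for this example; they are not the project's tagset or evaluation procedure.

```python
def tagging_accuracy(gold, predicted):
    """Return the share of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        return 0.0  # convention chosen here for an empty sample
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Hypothetical tags for the four tokens of "Per buys red cars".
gold_tags = ["PROPN", "VERB", "ADJ", "NOUN"]   # manually checked
tool_tags = ["PROPN", "VERB", "NOUN", "NOUN"]  # output of a tool under test

print(tagging_accuracy(gold_tags, tool_tags))  # 0.75
```

Here the tool gets three of four tokens right. The same comparison can in principle be applied to the inflection and syntactic function layers, provided the tool output and the gold corpus are aligned token by token.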
Språkbanken has started a project to build up such corpora for Norwegian Bokmål and Nynorsk. The primary target group for these corpora is language technology developers, but they can also be a resource in linguistics research. The corpora will be annotated with part of speech for all words, morphosyntactic categories (inflection) and syntactic function (e.g., subject, object and adverbial).
The morphological analysis in the corpora follows Norsk referansegrammatikk (Norwegian Reference Grammar) by Jan Terje Faarlund, Svein Lie and Kjell Ivar Vannebo (1997), while the syntactic analysis is based on dependency grammar, a widely used model in computational linguistics.
In dependency grammar, the grammatical structure of a sentence is analysed as asymmetrical relations between words, and not between phrases as in traditional grammar. In a clause such as “Per buys red cars”, “buys” is the head of the clause and has the subject “Per” and the object “cars” as so-called dependents. In its turn, “cars” is the head of “red”.
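The analysis of “Per buys red cars” can be written down as a small data structure in which every word points to its head. This is a minimal sketch, not the project's annotation format: the `Token` class, the helper `dependents_of`, and the labels `root` and `modifier` are assumptions made for illustration, while `subject` and `object` are the functions named in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    index: int           # position in the sentence, starting at 1
    form: str            # the word as written
    head: Optional[int]  # index of the head word; None for the head of the clause
    relation: str        # syntactic function relative to the head


# "Per buys red cars": "buys" heads the clause, "cars" heads "red".
sentence = [
    Token(1, "Per", 2, "subject"),
    Token(2, "buys", None, "root"),     # label assumed for this sketch
    Token(3, "red", 4, "modifier"),     # label assumed for this sketch
    Token(4, "cars", 2, "object"),
]


def dependents_of(tokens, head_index):
    """Return the word forms that depend directly on the given head."""
    return [t.form for t in tokens if t.head == head_index]


print(dependents_of(sentence, 2))  # ['Per', 'cars']
print(dependents_of(sentence, 4))  # ['red']
```

Because each word stores only its own head, the relations are asymmetrical by construction: “red” depends on “cars”, but “cars” does not depend on “red”.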
The goal is to annotate one million words of running text each for Bokmål and Nynorsk. The text is primarily taken from news and factual prose, but other genres will also be included.
Per Erik Solberg and Pål Kristian Eriksen work as annotators on the project until October 2013. Kari Kinn worked on the project until August 2012.