Legal Documents from Norwegian Nynorsk Municipalities

The texts in this corpus have been collected with the web crawler Veidemann in collaboration with the National Library's Web Archive, based on a revised list of municipalities from the National Association of Nynorsk Municipalities (see lnk.no).

The web crawler was set to download documents in PDF format. The resulting collection of documents was then processed with Google's OCR (Optical Character Recognition) API. Although the OCR is generally of high quality, some errors remain in the material.

The resulting corpus is made up of 50,000 documents (including legal documents, minutes from meetings, etc.) and contains a total of some 127 million words. About 88.5 million of these are in Norwegian Nynorsk; the rest are mostly in Norwegian Bokmål. All the texts in the corpus are classified by language.

The corpus is currently published as a JSON object in which each key is an identifier (URN) for the Veidemann download, and each value is a list of lists representing the pages of the document, with associated page numbers and target form. A text file containing a list of the URNs in the corpus is also provided. These URNs refer to the website (URL) from which the document was downloaded.
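
As an illustration of the structure described above, the following Python sketch walks over such a JSON object. It is not an official loader: the sample URNs and texts are invented, and the assumption that each inner list has the shape [page number, text] is a guess at the layout that should be checked against the actual file.

```python
import json
from typing import Dict, Iterator, List, Tuple

# Hypothetical sample mimicking the described layout (URN -> list of pages).
# Assumption: each inner list is [page_number, page_text]; the real corpus
# file may order or extend these fields differently.
SAMPLE_JSON = """
{
  "URN:EXAMPLE:doc-1": [[1, "Kommunestyret vedtok planen."], [2, "Saka vart utsett."]],
  "URN:EXAMPLE:doc-2": [[1, "Vedtak i formannskapet."]]
}
"""

Corpus = Dict[str, List[list]]


def iter_pages(corpus: Corpus) -> Iterator[Tuple[str, int, str]]:
    """Yield (urn, page_number, text) for every page in the corpus."""
    for urn, pages in corpus.items():
        for page in pages:
            yield urn, page[0], page[1]


def words_per_document(corpus: Corpus) -> Dict[str, int]:
    """Count whitespace-separated tokens per document (a rough word count)."""
    counts: Dict[str, int] = {}
    for urn, _page_number, text in iter_pages(corpus):
        counts[urn] = counts.get(urn, 0) + len(text.split())
    return counts


# For the real corpus, replace json.loads(SAMPLE_JSON) with json.load(...)
# on the downloaded file.
corpus = json.loads(SAMPLE_JSON)
word_counts = words_per_document(corpus)
```

Because a corpus of this size yields a large JSON file, loading it in one call as above may need several gigabytes of memory; a streaming JSON parser could be substituted if that is a problem.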

The original PDF files and the OCR output are available from Språkbanken on request.

