Legal Documents from Norwegian Nynorsk Municipalities

The texts in this corpus have been collected with the web crawler Veidemann in collaboration with the National Library’s Web Archive, based on a revised list of municipalities from the National Association of Nynorsk Municipalities (see lnk.no).

The web crawler was set to download documents in PDF format. The resulting collection of documents was then scanned using Google’s OCR API. Although the OCR is generally of high quality, some errors remain in the material.

The resulting corpus is made up of 50,000 documents (legal documents, minutes from meetings etc.), and contains a total of some 127 million words. About 88.5 million of these are in Norwegian Nynorsk, the rest is mostly Norwegian Bokmål. All the texts in the corpus are classified by language.

The corpus is currently published as a JSON object in which each key is an identifier (URN) for the Veidemann download and each value is a list of lists representing the pages of the document, with associated page numbers and target form. A text file listing the URNs in the corpus is also provided. These URNs refer to the websites (URLs) from which the individual documents were downloaded.
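As a minimal sketch of how the JSON object described above might be read, the following Python counts the page entries per document. The file path passed to `load_corpus`, the sample URNs, and the exact layout of each inner list (page number followed by text) are assumptions made for illustration; the actual file in the archive may differ and should be inspected first.

```python
import json


def load_corpus(path):
    """Load the corpus JSON object (URN -> list of page lists) from a file.

    The path is supplied by the caller; the file name inside the
    downloaded archive is not stated in this description.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def pages_per_document(corpus):
    """Return a dict mapping each URN to the number of page entries it has."""
    return {urn: len(pages) for urn, pages in corpus.items()}


# Hypothetical sample mimicking the described structure. The inner layout
# [page_number, text] is an assumption, not a documented format.
sample = {
    "URN:example:1": [[1, "Møtebok for kommunestyret"], [2, "Sak 12/19"]],
    "URN:example:2": [[1, "Saksframlegg"]],
}

counts = pages_per_document(sample)
```

Because the function only relies on the outer structure (a mapping from URN to a list), it should keep working even if the inner page lists carry additional fields.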

The original PDF files and the OCR format are available upon request to Språkbanken. Please contact us at our e-mail address, sprakbanken@nb.no.

Download resources

sakspapir_nno_01.tar.gz

Extended metadata

Download metadata (CMDI XML) https://www.nb.no/sprakbanken/oai?verb=GetRecord&identifier=oai:nb.no:sbr-60&metadataPrefix=cmdi
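The metadata link above has the shape of an OAI-PMH `GetRecord` request. As a rough sketch, the same URL can be assembled from its parts with Python's standard library; the helper name `build_getrecord_url` is hypothetical, and the base URL and parameter values are taken directly from the link above.

```python
from urllib.parse import urlencode

# Base endpoint as it appears in the metadata link above.
BASE_URL = "https://www.nb.no/sprakbanken/oai"


def build_getrecord_url(identifier, metadata_prefix="cmdi"):
    """Assemble a GetRecord URL for the given record identifier.

    Colons are kept unescaped (safe=":") so the result matches the
    form of the link shown on the page.
    """
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    }
    return BASE_URL + "?" + urlencode(params, safe=":")


url = build_getrecord_url("oai:nb.no:sbr-60")
```

The resulting URL can then be fetched with any HTTP client to obtain the CMDI XML record.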

dc:type	corpus
dc:title	Legal Documents from Norwegian Nynorsk Municipalities
dc:identifier	oai:nb.no:sbr-60
dc:description	The texts in this corpus have been collected with the web crawler Veidemann in collaboration with the National Library's Web Archive, based on a revised list of municipalities from the National Association of Nynorsk Municipalities (see lnk.no). The web crawler was set to download documents in PDF format. The resulting collection of documents was then scanned using Google's OCR API. Although the OCR is generally of high quality, some errors remain in the material. The resulting corpus is made up of 50,000 documents (legal documents, minutes from meetings etc.), and contains a total of some 127 million words. About 88.5 million of these are in Norwegian Nynorsk, the rest is mostly Norwegian Bokmål. All the texts in the corpus are classified by language. The corpus is currently published as a JSON object in which each key is an identifier (URN) for the Veidemann download and each value is a list of lists representing the pages of the document, with associated page numbers and target form. A text file listing the URNs in the corpus is also provided. These URNs refer to the websites (URLs) from which the individual documents were downloaded. The original PDF files and the OCR format are available upon request to Språkbanken. Please contact us at our e-mail address, sprakbanken@nb.no.
dc:publisher
dc:format	downloadable
dc:date	2019-10-16
dc:date	2020-12-04
dc:rights	Public
dc:rights	Creative Commons (CC)
dc:rights	Creative_Commons-ZERO (CC-ZERO)
dc:rights	https://creativecommons.org/publicdomain/zero/1.0/
dc:creator	Andre Kåsen
dc:creator	National Library of Norway
dc:lang	Norwegian Nynorsk
dc:lang	Norwegian Bokmål
