Målfrid 2021 – Freely Available Documents from Norwegian State Institutions

This corpus consists of documents from 339 internet domains run by Norwegian state institutions, and comprises approximately 4.1 billion tokens (words and punctuation) in total, which makes it one of the largest freely available text resources for Norwegian Bokmål and Nynorsk. In addition to Norwegian, the corpus contains texts in Northern Sami, Lule Sami, Southern Sami and English.

The data were collected as part of the Målfrid project, in which the National Library of Norway, on behalf of the Ministry of Culture and in collaboration with the Language Council of Norway, collects and aggregates data for mapping the usage of Norwegian Bokmål and Norwegian Nynorsk in Norwegian state institutions.

The corpus is the result of a focused crawl conducted between December 11th 2020 and January 18th 2021, recursively downloading text documents (HTML, DOC(X)/ODT and PDF) from a set of domains (down to and including level 12), while obeying robots.txt and politeness restrictions.

The crawled documents were further processed according to their format: text was extracted from HTML using the boilerplate removal system Justext (http://corpus.tools/wiki/Justext), from Word/ODT documents using Textract (https://textract.readthedocs.io/en/stable/) and from PDFs using Google Cloud Vision OCR.

The extracted text was classified at document level using TextCat language identification (cf. https://www.let.rug.nl/~vannoord/TextCat/); the detected language is provided as part of the metadata. Exact duplicates were removed within each domain.
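The description does not say how exact duplicates were detected or how a domain was delimited. The following is only a minimal sketch of one common approach, assuming each document is a dict with the url and fulltext keys described in this entry, that the domain is the host part of the URL, and that two documents are exact duplicates when their joined paragraphs hash to the same SHA-256 digest. The function name deduplicate_by_domain is hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Set
from urllib.parse import urlsplit


def deduplicate_by_domain(documents: Iterable[dict]) -> Iterator[dict]:
    """Yield documents, skipping exact text duplicates within the same domain."""
    # Maps each domain to the set of text digests already seen for it.
    seen: Dict[str, Set[str]] = defaultdict(set)
    for doc in documents:
        # Assumption: the domain is the lowercased host part of the URL.
        domain = urlsplit(doc["url"]).netloc.lower()
        # Assumption: documents are compared on their joined paragraphs.
        text = "\n".join(doc["fulltext"])
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen[domain]:
            continue  # Same text already kept for this domain.
        seen[domain].add(digest)
        yield doc
```

Because the digests are tracked per domain, the same text published on two different domains is kept once for each of them, which is how "deduplicated on domain level" is read here.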

The corpus is provided as gzipped JSON lines (jsonl), one document per line. There is one JSONL file per combination of domain, language and content type. The files are encoded as UTF-8, with ASCII escape sequences. Each document contains the following keys:

– lang: language of the document (detected using TextCat)
– url: the URL of the document at crawl time
– date: crawl date
– mimetype: media type of the document (simplified): HTML, DOC or PDF
– fulltext: an array of strings, where each string represents one paragraph. An empty string denotes a new page in the PDF documents.
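As an illustration of the format described above, the following sketch reads one gzipped JSONL file with the Python standard library and groups the paragraphs of a document into pages. The function names read_documents and split_pages and the file name in the usage note are hypothetical; only the keys listed above are taken from this description.

```python
import gzip
import json
from typing import Iterator, List


def read_documents(path: str) -> Iterator[dict]:
    """Yield one document (a dict) per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                # json.loads also decodes \uXXXX escape sequences in strings.
                yield json.loads(line)


def split_pages(fulltext: List[str]) -> List[List[str]]:
    """Group paragraphs into pages, treating an empty string as a page break."""
    pages: List[List[str]] = [[]]
    for paragraph in fulltext:
        if paragraph == "":
            pages.append([])  # An empty string starts a new page.
        else:
            pages[-1].append(paragraph)
    return pages
```

A typical use, assuming a downloaded file named example.jsonl.gz, would be to iterate over read_documents("example.jsonl.gz") and call split_pages(doc["fulltext"]) for documents whose mimetype is PDF. For HTML and DOC documents the paragraphs can be used directly.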

Extended metadata

dc:type	corpus
dc:title	Målfrid 2021 – Freely Available Documents from Norwegian State Institutions
dc:identifier	oai:nb.no:sbr-69
dc:publisher
dc:format	downloadable
dc:date	2020-12-01
dc:date	2021-04-30
dc:rights	Public
dc:rights	DIFI
dc:rights	Norwegian Licence for Open Government Data (NLOD)
dc:rights	https://data.norge.no/nlod/en/2.0/
dc:creator	Magnus Breder Birkenes
dc:creator	Andre Kåsen
dc:lang	Norwegian Bokmål
dc:lang	Norwegian Nynorsk
dc:lang	Northern Sami
dc:lang	Southern Sami
dc:lang	Lule Sami
dc:lang	English

Download metadata (CMDI XML): https://www.nb.no/sprakbanken/oai?verb=GetRecord&identifier=oai:nb.no:sbr-69&metadataPrefix=cmdi
