Texts from Norwegian Wikipedia

This corpus is a dump, taken on approximately March 20, 2019, of all Wikipedia articles written in Norwegian Bokmål, Norwegian Nynorsk and Northern Sami. It contains 492,864 articles in Norwegian Bokmål, 139,927 articles in Norwegian Nynorsk and 7,626 articles in Northern Sami. Each file is structured as a JSON array of all the articles as they appear on the web. Each article is a structured element with a single level of “key:value” pairs containing text and metadata. There are eight such key:value pairs per article:

– bytelength: length of text in number of bytes
– pageid: text identifier
– title: title as in Wikipedia
– hiddencategories: metadata
– text: text as in Wikipedia
– revised: audit information
– contentcategories: metadata
– wikidata: other data

An example of the JSON format can be found in the documentation file.
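
As a rough illustration of the structure described above, the following Python sketch loads one file and checks each article against the eight listed keys. It is only a sketch under stated assumptions: the file name in the usage note is hypothetical, the helper names `load_articles` and `missing_keys` are invented for this example, and the value types of the individual fields are not specified here, so the code does not rely on them.

```python
import json
from pathlib import Path

# The eight keys named in the corpus description.
EXPECTED_KEYS = {
    "bytelength", "pageid", "title", "hiddencategories",
    "text", "revised", "contentcategories", "wikidata",
}


def load_articles(path):
    """Load one corpus file, assumed to hold a single JSON array of articles."""
    with Path(path).open(encoding="utf-8") as handle:
        articles = json.load(handle)
    if not isinstance(articles, list):
        raise ValueError("expected a JSON array at the top level")
    return articles


def missing_keys(article):
    """Return the documented keys that are absent from one article object."""
    return EXPECTED_KEYS - set(article)
```

With a downloaded file, a call such as `load_articles("wikipedia_nob.json")` (hypothetical name) would return a list of dictionaries, and `article["title"]` and `article["text"]` would give the title and body of each article. Note that `json.load` reads the whole array into memory, which may be heavy for the Bokmål file; a streaming JSON parser may be preferable there.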
