This corpus is a dump of all Wikipedia articles written in Norwegian Bokmål, Norwegian Nynorsk and Northern Sami, taken on approximately March 20, 2019. It contains 492,864 articles in Bokmål, 139,927 in Nynorsk and 7,626 in Northern Sami.
The articles are split into three files, one each for Bokmål (nob.wikipedia.json, 1.3 GB), Nynorsk (nno.wikipedia.json, 300 MB) and Northern Sami (sme.wikipedia.json, 10 MB). Each file is structured as a JSON array of all the articles as they appear on the web. Each article is a JSON object with a single level of key/value pairs containing text and metadata. There are eight such key/value pairs per article, including:
bytelength: length of text in number of bytes
pageid: text identifier
title: title as in Wikipedia
text: text as in Wikipedia
revised: audit information
wikidata: other data
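
As an illustration of the layout described above, here is a minimal Python sketch for reading one of the files. It assumes that each file parses as a single JSON array of objects and that bytelength holds an integer or a numeric string; neither the value types nor the helper names load_articles and summarize come from the corpus documentation. Note that json.load reads the whole file into memory, which may be impractical for the 1.3 GB Bokmål file.

```python
import json
from pathlib import Path


def load_articles(path):
    """Read one corpus file and return its articles as a list of dicts.

    Assumption: the file is a single JSON array, as the description states.
    """
    with Path(path).open(encoding="utf-8") as handle:
        articles = json.load(handle)
    if not isinstance(articles, list):
        raise ValueError("expected a JSON array of articles")
    return articles


def summarize(articles):
    """Return (number of articles, sum of their bytelength values).

    Assumption: bytelength is an int or a numeric string; a missing key
    is counted as 0.
    """
    total_bytes = sum(int(article.get("bytelength", 0)) for article in articles)
    return len(articles), total_bytes


# Hypothetical usage with the smallest file (Northern Sami):
#   articles = load_articles("sme.wikipedia.json")
#   count, total_bytes = summarize(articles)
#   print(count, total_bytes, articles[0].get("title"))
```

Starting with sme.wikipedia.json (10 MB) is a reasonable way to check the actual keys and value types before processing the larger files.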
An example of the JSON format can be found in the PDF file (2019_wikipedia.pdf).