Texts from Norwegian Wikipedia

CLARINO NB – Språkbanken

License: Creative Commons CC-BY-SA (CC-BY-SA)

Updated: 2019-06-19

This corpus is a dump of all Wikipedia articles written in Norwegian Bokmål, Norwegian Nynorsk and Northern Sami, taken on approximately March 20, 2019. It contains 492,864 articles in Bokmål, 139,927 in Nynorsk and 7,626 in Northern Sami.

The articles are split into three files, one each for Bokmål (nob.wikipedia.json, 1.3 GB), Nynorsk (nno.wikipedia.json, 300 MB) and Northern Sami (sme.wikipedia.json, 10 MB). Each file is a single JSON array containing all the articles as they appear on the web. Each article is a JSON object with one level of key/value pairs holding the text and its metadata. There are eight such key/value pairs per article:

bytelength: length of text in number of bytes

pageid: text identifier

title: title as in Wikipedia

hiddencategories: metadata

text: text as in Wikipedia

revised: audit information

contentcategories: metadata

wikidata: other data
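
The structure described above can be read with a standard JSON parser. The following is a minimal sketch in Python, not an official loader: the function names load_articles and missing_keys are hypothetical, UTF-8 encoding is assumed, and the value types of the eight keys are not specified here, so the sketch only checks that the keys are present.

```python
import json

# The eight keys listed in the description. Their value types are not
# documented there, so this sketch only checks for presence.
EXPECTED_KEYS = {
    "bytelength", "pageid", "title", "hiddencategories",
    "text", "revised", "contentcategories", "wikidata",
}


def load_articles(path):
    """Read one corpus file (a JSON array of article objects) into a list.

    Assumes the file is UTF-8 encoded. The whole array is loaded into
    memory at once.
    """
    with open(path, encoding="utf-8") as f:
        articles = json.load(f)
    if not isinstance(articles, list):
        raise ValueError("expected a JSON array of articles")
    return articles


def missing_keys(article):
    """Return the set of expected keys that an article object lacks."""
    return EXPECTED_KEYS - set(article)
```

For example, load_articles("sme.wikipedia.json") would return a list of article objects, and articles[0]["title"] would give the title of the first one. Because the whole array is held in memory, the smaller Northern Sami file is the easiest one to experiment with; the 1.3 GB Bokmål file may call for a streaming parser instead.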

An example of the JSON format can be found in the PDF file (2019_wikipedia.pdf).

