This corpus is a dump of discussion threads from the Norwegian Wikipedia, where authors discuss various issues regarding the publication of specific Wikipedia articles.
The material is split into two files, one for Norwegian Bokmål (nb.wikipedia.json) and one for Norwegian Nynorsk (nn.wikipedia.json). Each file is a JSON array in which every element is one discussion, stored as a single-level object that holds the text and its metadata. Each discussion has eight key/value pairs:
– title: title of the article under discussion
– pageid: page identifier
– revid: revision information
– wikidata: other data, if any
– contentcategories: metadata (content categories)
– hiddencategories: metadata (hidden categories)
– text: discussion text
– bytelength: length of the text in bytes
An example record is shown in the accompanying PDF file (2019_wikidisc.pdf).
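The file layout described above can be read with a few lines of Python. This is only a minimal sketch: the helper names load_discussions and missing_keys are hypothetical, the file is assumed to be UTF-8 encoded, and the value types of the individual fields are not specified here, so the code checks only that the eight keys are present.

```python
import json

# The eight keys listed in the corpus description.
EXPECTED_KEYS = {
    "title", "pageid", "revid", "wikidata",
    "contentcategories", "hiddencategories", "text", "bytelength",
}


def load_discussions(path):
    """Read one corpus file and return its discussions as a list of dicts.

    Assumption: the file is UTF-8 encoded and small enough to parse in memory.
    """
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array at the top level")
    return data


def missing_keys(record):
    """Return the expected keys that a record lacks (empty set if complete)."""
    return EXPECTED_KEYS - set(record)
```

Under these assumptions, load_discussions("nb.wikipedia.json") would return one dictionary per discussion, and missing_keys(record) would flag any element that departs from the documented eight-key layout.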
Extended metadata
resource Common Info:
resource Type: corpus
identification Info:
resource Name: Diskusjonstekster frå Wikipedia
resource Name: Discussions from Wikipedia
description: This corpus contains a dump of discussion threads from Wikipedia, in which authors discuss various issues connected with the publication of particular Wikipedia articles.
The articles are divided into two files, one for Bokmål (nb.wikipedia.json) and one for Nynorsk (nn.wikipedia.json). Each discussion is an element of a JSON array, with a single level that holds the text and assorted metadata. There are eight data fields per discussion:
– title: title of the article under discussion
– pageid: identifier of the article
– revid: revision information
– wikidata: other data, if any
– contentcategories: metadata
– hiddencategories: metadata
– text: discussion text
– bytelength: length of the text in bytes
An example can be found in the documentation file (2019_wikidisc.pdf).
multilinguality Type Details: Discussions of a similar kind in either Norwegian Bokmål or Norwegian Nynorsk
language Info:
language Id: nb
language Name: Norwegian Bokmål
size Per Language:
size Info:
size: 17000000
size Unit: words
size Info:
size: 31364
size Unit: entries
size Info:
size: 1
size Unit: files
size Info:
size: 126.4
size Unit: mb
language Variety Info:
language Variety Type: jargon
language Variety Name: Informal written language
language Info:
language Id: nn
language Name: Norwegian Nynorsk
size Per Language:
size Info:
size: 1400000
size Unit: words
size Info:
size: 5500
size Unit: entries
size Info:
size: 1
size Unit: files
size Info:
size: 10.3
size Unit: mb
language Variety Info:
language Variety Type: jargon
language Variety Name: Informal written language
modality Info:
modality Type: writtenLanguage
size Info:
size: 18400000
size Unit: words
size Info:
size: 36864
size Unit: entries
size Info:
size: 2
size Unit: files
size Info:
size: 136.7
size Unit: mb
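The overall size figures are the sums of the per-language figures. A small sketch of that arithmetic, using only the numbers stated in this record (the dictionary layout and the name sum_sizes are hypothetical, and decimal values are written with a point):

```python
# Per-language sizes as stated in the record (nb = Bokmål, nn = Nynorsk).
per_language = {
    "nb": {"words": 17_000_000, "entries": 31_364, "files": 1, "mb": 126.4},
    "nn": {"words": 1_400_000, "entries": 5_500, "files": 1, "mb": 10.3},
}

# Totals as stated for the corpus as a whole.
stated_total = {"words": 18_400_000, "entries": 36_864, "files": 2, "mb": 136.7}


def sum_sizes(sizes):
    """Add up each size unit across all languages."""
    totals = {}
    for figures in sizes.values():
        for unit, value in figures.items():
            totals[unit] = totals.get(unit, 0) + value
    return totals


computed = sum_sizes(per_language)

# Compare with rounding so that floating-point noise in the megabyte sum
# does not cause a spurious mismatch.
consistent = all(
    round(computed[unit], 1) == round(stated_total[unit], 1)
    for unit in stated_total
)
```

With the figures above, consistent evaluates to True: the word, entry, file and megabyte totals all match the per-language values.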
dc:type
corpus
dc:title
Discussions from Wikipedia
dc:identifier
oai:nb.no:sbr-66