Norwegian Newspaper Corpus

The Norwegian Newspaper Corpus was a project at the University of Bergen where news websites were crawled for news articles.

This version of the Norwegian Newspaper Corpus consists of text from 1998 to 2019. The corpus contains approximately 1.68 billion words of Norwegian Bokmål and about 68 million words of Norwegian Nynorsk.

There is also a simplified version of the corpus available (1998-2011), where duplicate sentences have been removed and the sentences are ordered alphabetically.
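
The simplification described above (dropping duplicate sentences, then ordering alphabetically) can be illustrated with a minimal Python sketch. This is not the project's actual pipeline: the function name `simplify_sentences`, the whitespace stripping, and the use of plain Unicode code point ordering are all assumptions made for illustration. Code point order in particular places å before æ and ø, which differs from the Norwegian alphabet, so the real corpus files may be sorted differently.

```python
def simplify_sentences(sentences):
    """Drop duplicate sentences and return the rest in sorted order.

    Hypothetical helper: the corpus documentation does not specify how
    sentences were normalized or collated, so this strips surrounding
    whitespace and sorts by Unicode code point as a stand-in.
    """
    # A set keeps exactly one copy of each distinct sentence.
    unique = {s.strip() for s in sentences if s.strip()}
    # sorted() on strings compares code points, not Norwegian collation.
    return sorted(unique)


sample = [
    "Dette er en setning.",
    "Avisen kom ut i dag.",
    "Dette er en setning.",
]
simplified = simplify_sentences(sample)
print(simplified)
```

Note that a simplified file of this kind no longer preserves the original article context, since sentences from different articles are interleaved by the sort.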

The texts from 1998-2011 are collected in a single downloadable file; the remaining data are structured as one file per year. See the documentation files for a description of the content and file formats.
