This page contains information about and links for downloading text resources distributed by Språkbanken.
Please send questions and feedback in connection with these resources to sprakbanken@nb.no.
Språkbanken's gold standard corpus - manually annotated text corpora for Norwegian Bokmål and Nynorsk - 2013-04-12
The gold standard corpus consists of two separate corpora, one for Norwegian Bokmål and one for Norwegian Nynorsk. The corpora are morphologically and syntactically annotated. The morphological analysis follows The Norwegian Reference Grammar, while dependency grammar is used for the syntactic analysis. Read more about the project by following this link.
Thus far we offer beta versions of the corpora. Both the Bokmål and the Nynorsk corpora can be downloaded by clicking the links below. Some documentation is available with each individual release. Versions 0.3 and 0.4 contain detailed annotation guidelines (in Norwegian; an English version will be published later).
- Download version 0.4 (5.3 MB) - 2013-04-12 - Annotation guidelines (pdf)
- Download version 0.3 (4.6 MB) - 2013-02-08
- Download version 0.2 (1.5 MB) - 2012-11-20
- Download version 0.1 (1.2 MB) - 2012-08-06
Read Per Erik Solberg's presentation of the corpora at NODALIDA 2013 in Oslo: paper (pdf), poster (pdf)
Fulltext version of The Norwegian Newspaper Corpus (Norsk aviskorpus), version 0.9
This fulltext version of The Norwegian Newspaper Corpus (http://avis.uib.no/) is not finished, and the texts are presented in three different formats. During 2012 and 2013, the texts will be further processed and made available in a unified xml-format. The texts in the corpus are current as of 28 December 2011.
Norsk aviskorpus is made available for the users of Språkbanken, and can only be used for language technology research and development. Users of the corpus are not allowed to disclose or publish any part of the texts, only knowledge gained and products developed on the basis of the texts.
- Norwegian Newspaper Corpus, description of format and content (pdf - Norwegian)
- Download Norwegian Newspaper Corpus (2.8 GB)
- Addendum: Bergens Tidende 2012 (33 MB) - 2012-12-21
- Addendum: VG 2012 (33 MB) - 2013-01-02
The corpus is also available in a simplified format, where all metadata have been removed. The corpus is tokenised at sentence level, and the sentences are ordered alphabetically. Sentences are delimited by <s> and </s> tags.
- Norwegian Newspaper Corpus (Nynorsk), plain text, tokenised at sentence level (115 MB)
- Norwegian Newspaper Corpus (Bokmål), plain text, tokenised at sentence level (2.6 GB)
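As a rough illustration of how the simplified format could be read, the following Python sketch extracts the text between <s> and </s> tags. The exact whitespace and line layout of the real files is not described here, so the regular expression and the sample string are assumptions; for the multi-gigabyte Bokmål file one would want to process the input in chunks rather than as a single string.

```python
import re

# Assumption: each sentence is wrapped as <s> ... </s>. The real files may
# lay the tags out differently (e.g. one sentence per line).
SENTENCE_RE = re.compile(r"<s>(.*?)</s>", re.DOTALL)

def extract_sentences(text):
    """Return the sentences found between <s> and </s>, whitespace-normalised."""
    return [" ".join(match.split()) for match in SENTENCE_RE.findall(text)]

# Invented sample input, not taken from the corpus.
sample = "<s>Dette er ei setning.</s>\n<s> Her er\nei til. </s>"
sentences = extract_sentences(sample)
print(sentences)
```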
N-grams for Norwegian Nynorsk
On the basis of the Norwegian Nynorsk texts in Norsk aviskorpus (see above), and the Norwegian Nynorsk texts that existed in the text corpus of Nordisk språkteknologi AS, Språkbanken has produced n-grams (1-gram, 2-gram, 3-gram, 4-gram, 5-gram and 6-gram) for a volume of approximately 60 million words of running text. The material is made available in two versions: a “light” version listing the 1000 most frequent n-grams, and a full version where all the n-grams are collected and sorted according to different criteria. See the description below for details.
Frequency lists (i.e., 1-grams) are also available as single files, sorted by frequency and alphabetically. See the description below for details.
The n-grams may be downloaded and used without restrictions for language technology research and development.
- N-grams for Norwegian Nynorsk, description of format and content (pdf – Norwegian)
- Frequency list of all words, sorted alphabetically (5 MB)
- Frequency list of words with frequency > 1, sorted by frequency (2 MB)
- Frequency list of words with frequency > 1, sorted alphabetically (2 MB)
- The 1000 most frequent n-grams (42 KB)
- The whole n-gram collection for Norwegian Nynorsk (1.8 GB)
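For readers unfamiliar with n-grams, the sketch below shows one simple way to count all 1-grams to 6-grams in a sequence of tokens. It is only an illustration: the whitespace tokenisation, the function name `count_ngrams` and the sample sentence are assumptions, and the tokenisation and sorting actually used for the published files are described in the format documentation (pdf).

```python
from collections import Counter

def count_ngrams(tokens, max_n=6):
    """Count every n-gram of length 1..max_n in a list of tokens."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts

# Invented sample text; naive whitespace tokenisation is an assumption.
tokens = "eg veit ikkje om eg veit det".split()
counts = count_ngrams(tokens)

print(counts[1].most_common(2))      # the most frequent 1-grams
print(counts[2][("eg", "veit")])     # frequency of one 2-gram
```

A "light" list like the one offered above would correspond to taking `most_common(1000)` for each n.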
N-grams for Norwegian Bokmål
On the basis of the Norwegian Bokmål texts in Norsk aviskorpus (see above) and the text corpus of Nordisk språkteknologi AS, Språkbanken has produced n-grams (1-gram, 2-gram, 3-gram, 4-gram, 5-gram and 6-gram) for a volume of approximately 1.1 billion words of running text. The material is made available in two versions: a “light” version listing the 1000 most frequent n-grams, and a full version where all the n-grams are collected and sorted according to different criteria. Frequency lists (i.e., 1-grams) are also available as single files, sorted by frequency and alphabetically. See the description below for details.
The n-grams may be downloaded and used without restrictions for language technology research and development.
- N-grams for Norwegian Bokmål, description of format and content (pdf – Norwegian)
- Text from Norsk aviskorpus and the NST text corpus (1175M words)
  - Frequency list of all words, sorted alphabetically (33 MB)
  - Frequency list of words with frequency > 1, sorted by frequency (14 MB)
  - Frequency list of words with frequency > 1, sorted alphabetically (14 MB)
  - The 1000 most frequent n-grams (48 KB)
  - The whole n-gram collection for Norwegian Bokmål (27 GB)
- Text from Norsk aviskorpus (665M words)
  - The 1000 most frequent n-grams (48 KB)
  - The whole n-gram collection (16 GB)
- Text from the NST text corpus (510M words)
  - The 1000 most frequent n-grams (47 KB)
  - The whole n-gram collection (13 GB)
N-grams for Danish and Swedish
On the basis of the Danish and Swedish texts in the text corpora of Nordisk Språkteknologi AS, Språkbanken has produced n-grams (1-gram, 2-gram, 3-gram, 4-gram, 5-gram and 6-gram) for a volume of approximately 290 million words of running text for Danish and 400 million words of running text for Swedish. The material is made available in two versions: a “light” version listing the 1000 most frequent n-grams, and a full version where all the n-grams are collected and sorted according to different criteria.
The Swedish material is also available in a third format, where one can choose which texts to base the statistics on. This version contains a few more texts and comprises 437 million words of running text. See the description below for details.
The n-grams may be downloaded and used without restrictions for language technology research and development.
- N-grams for Danish, description of format and content (pdf – Norwegian)
- N-grams for Swedish, description of format and content (pdf – Norwegian)
Digitised books in xml-format
The National Library is digitising its entire collection, including a vast amount of written material. This digitisation generates a large number of xml-files which contain all information about the digitised object (e.g., bibliographical metadata, structural analysis of the object (division into chapters, distribution of page numbers, sections), OCR analysis, etc.).
Below, you can download xml-versions of the digitised written material (mostly books) which is not protected by copyright. This includes older material which is now in the public domain, and official publications of a more recent date. As of today, the material consists of approximately 9000 titles.
The index gives an overview of the content. The index is in tab-separated plain text format, and contains the following information:
- Digibook_ID: This ID identifies a single title in the data files (for instance: digibook_2009073101106).
- Year of publication: The first four digits indicate the year of publication (for instance: 20011231). The last four digits are always 1231. If the first four digits are 9999, the year of publication is unknown.
- Title: The title of the publication.
- Author: Name of the author or institution responsible for the publication.
- Publisher: Publishing house or institution.
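The index fields described above could be read with a short Python sketch like the one below. It assumes that the columns appear in the order listed, that there is no quoting, and that lines with a different number of columns can be skipped; none of this is confirmed by the description, and the field names, the function names and the sample rows are invented for illustration.

```python
import csv
import io

# Assumed column order, following the list above; the names are hypothetical.
FIELDS = ["digibook_id", "date", "title", "author", "publisher"]

def parse_year(date_field):
    """Return the publication year as an int, or None if it is unknown."""
    prefix = date_field[:4]
    if not prefix.isdigit() or prefix == "9999":
        return None  # 9999 marks an unknown year; non-digits e.g. a header row
    return int(prefix)

def read_index(handle):
    """Yield one dict per well-formed line of the tab-separated index."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(FIELDS):
            continue  # skip lines that do not match the assumed layout
        record = dict(zip(FIELDS, row))
        record["year"] = parse_year(record["date"])
        yield record

# Invented sample rows; only the ID and date formats follow the description.
sample = (
    "digibook_2009073101106\t20011231\tEn tittel\tEn forfatter\tEt forlag\n"
    "digibook_0000000000000\t99991231\tUkjent utgivelsesår\tUkjent\tUkjent\n"
)
records = list(read_index(io.StringIO(sample)))
for record in records:
    print(record["digibook_id"], record["year"], record["title"])
```

With the real index one would pass a file opened with `encoding="utf-8"` instead of the `io.StringIO` sample.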
Files:
- Index (tab-separated plain text, Unicode (UTF-8))
- Compressed version (20 GB)
- Uncompressed version (130 GB)