Text Resources

This page contains information about and links for downloading text resources distributed by Språkbanken.
Please send questions and feedback about these resources to sprakbanken@nb.no.

 

Norwegian Dependency Treebank (NDT), version 1.0.1 - 2014-03-28

NDT consists of two separate treebanks (linguistically annotated text corpora), one for Norwegian Bokmål, and one for Norwegian Nynorsk. The treebanks are morphologically and syntactically annotated. The morphological analysis follows The Norwegian Reference Grammar, while dependency grammar is used for the syntactic analysis.

The final releases (versions 1.0 and 1.0.1) contain 300,000 tokens for each of the two language varieties. Version 1.0.1 makes no substantial changes, but it includes an English translation of the annotation guidelines.

Read Per Erik Solberg's presentation of the corpora at NODALIDA 2013 in Oslo: paper (pdf), poster (pdf)

 

Fulltext version of The Norwegian Newspaper Corpus (Norsk aviskorpus), version 0.9

This fulltext version of The Norwegian Newspaper Corpus (http://avis.uib.no/) is not finished, and the texts are presented in three different formats. During 2012 and 2013, the texts will be further processed and made available in a unified xml-format. The texts in the corpus are updated as of 28th December 2011.

Norsk aviskorpus is made available to the users of Språkbanken and may be used only for language technology research and development. Users of the corpus are not allowed to disclose or publish any part of the texts, only knowledge gained and products developed on the basis of the texts.

The corpus is also available in a simplified format, where all metadata have been removed. The corpus is split into sentences, and the sentences are ordered alphabetically. Each sentence is enclosed in <s> and </s> tags.
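
The <s> and </s> delimiters in the simplified format can be read with a few lines of Python. The sketch below is only an illustration: the sample string and the function name extract_sentences are made up, and the actual file layout (encoding, line breaks, one file or many) is an assumption that should be checked against the downloaded files.

```python
import re

# Matches the text between <s> and </s>, non-greedily, across line breaks.
SENTENCE_RE = re.compile(r"<s>(.*?)</s>", re.DOTALL)


def extract_sentences(text):
    """Return the strings found between <s> and </s> tags, stripped of whitespace."""
    return [match.strip() for match in SENTENCE_RE.findall(text)]


# Hypothetical sample, not taken from the corpus.
sample = "<s>Alle var der.</s>\n<s>Bilen er ny.</s>"
print(extract_sentences(sample))
```

Because the sentences are sorted alphabetically, the original document order cannot be recovered from this format, so it suits sentence-level statistics rather than discourse-level analysis.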

 

N-grams for Norwegian Nynorsk
On the basis of the Norwegian Nynorsk texts in Norsk aviskorpus (see above), and the Norwegian Nynorsk texts that existed in the text corpus of Nordisk språkteknologi AS, Språkbanken has produced n-grams (1-gram, 2-gram, 3-gram, 4-gram, 5-gram and 6-gram) for a volume of approximately 60 million words of running text. The material is made available in two versions: a “light” version listing the 1000 most frequent n-grams, and a full version where all the n-grams are collected and sorted according to different criteria. See the description below for details.

Frequency lists (i.e., 1-grams) are also available as single files, sorted by frequency and alphabetically. See the description below for details.

The n-grams may be downloaded and used without restrictions for language technology research and development.
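
To make the terminology concrete, the following Python sketch counts 1-grams through 6-grams in a short token list and keeps the most frequent ones, in the spirit of the “light” version. It is not Språkbanken's procedure: the whitespace tokenisation, the function names ngrams and top_ngrams, and the sample sentence are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(tokens, n, k=1000):
    """Return the k most frequent n-grams together with their counts."""
    return Counter(ngrams(tokens, n)).most_common(k)


# Hypothetical sample text, split on whitespace (an assumed tokenisation).
tokens = "eg har ei bok og eg har ei hytte".split()

for n in range(1, 7):  # 1-grams through 6-grams
    print(n, top_ngrams(tokens, n, k=2))
```

A 1-gram count of this kind is the same thing as a word frequency list.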

 

N-grams for Norwegian Bokmål
On the basis of the Norwegian Bokmål texts in Norsk aviskorpus (see above) and the text corpus of Nordisk språkteknologi AS, Språkbanken has produced n-grams (1-gram, 2-gram, 3-gram, 4-gram, 5-gram and 6-gram) for a volume of approximately 1.1 billion words of running text. The material is made available in two versions: a “light” version listing the 1000 most frequent n-grams, and a full version where all the n-grams are collected and sorted according to different criteria. Frequency lists (i.e., 1-grams) are also available as single files, sorted by frequency and alphabetically. See the description below for details.

The n-grams may be downloaded and used without restrictions for language technology research and development.

 

N-grams for Danish and Swedish
On the basis of the Danish and Swedish texts in the text corpora of Nordisk Språkteknologi AS, Språkbanken has produced n-grams (1-gram, 2-gram, 3-gram, 4-gram, 5-gram and 6-gram) for a volume of approximately 290 million words of running text for Danish and 400 million words of running text for Swedish. The material is made available in two versions: a “light” version listing the 1000 most frequent n-grams, and a full version where all the n-grams are collected and sorted according to different criteria.

The Swedish material is also available in a third format, where one can choose which texts to base the statistics on. This version contains a few more texts and comprises 437 million words of running text. See the description below for details.

The n-grams may be downloaded and used without restrictions for language technology research and development.

 

Digitised books in xml-format
The National Library is digitising its entire collection, including a vast amount of written material. This digitisation generates a large number of xml-files which contain all information about the digitised object (e.g., bibliographical metadata, structural analysis of the object (division into chapters, distribution of page numbers, sections), OCR-analysis, etc.).

Below, you can download xml-versions of the digitised written material (mostly books) which is not protected by copyright. This includes older material which is now in the public domain, and official publications of a more recent date. As of today, the material consists of approximately 9000 titles.

The index gives an overview of the content. The index is in tab-separated plain text format, and contains the following information:

  • Digibook_ID: This ID identifies a single title in the data files (for instance: digibook_2009073101106).
  • Year of publication: The first four digits indicate the year of publication (for instance: 20011231). The last four digits are always 1231. If the first four digits are 9999, the year of publication is unknown.
  • Title: The title of the publication.
  • Author: Name of the author or institution responsible for the publication.
  • Publisher: Publishing house or institution.
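
The index fields described above could be read roughly as follows. This is a minimal Python sketch under stated assumptions: the columns are taken to appear in the order listed, with no header row, and the names read_index and parse_year, the second sample ID, and the sample titles are hypothetical.

```python
import csv
import io


def parse_year(field):
    """Return the publication year as an int, or None if it is unknown (9999)."""
    year = int(field[:4])  # the first four digits hold the year
    return None if year == 9999 else year


def read_index(lines):
    """Yield one dict per index row. The column order is assumed, not confirmed."""
    columns = ["digibook_id", "date", "title", "author", "publisher"]
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        record = dict(zip(columns, row))
        record["year"] = parse_year(record["date"])
        yield record


# Hypothetical rows; only the first ID and date appear in the description above.
sample = io.StringIO(
    "digibook_2009073101106\t20011231\tExample title\tExample author\tExample publisher\n"
    "digibook_0000000000000\t99991231\tUntitled\tUnknown\tUnknown\n"
)
records = list(read_index(sample))
print(records[0]["year"], records[1]["year"])  # prints: 2001 None
```

For the real file, an open file handle (for example from open(path, encoding="utf-8", newline="")) can be passed in place of the StringIO object; the encoding is likewise an assumption.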

Files:

samlingen nettsidene