N-grams from NBdigital

This resource contains n-grams (unigrams, bigrams and trigrams) from all books and newspapers that had been digitized at the National Library of Norway up to September 2013. The n-grams were extracted from a collection of approximately 220,000 books and 540,000 newspapers.

The n-grams are available in two formats, CSV and SQLite. CSV is probably the more useful format for most developers, because the files are easy to import into standard applications. The SQLite files contain indexed databases, which are used by the NB N-gram service. Users who want to contribute to the development of NB N-gram can download the source code from GitHub and the SQLite databases from this page.
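As a rough illustration of working with the two formats, the sketch below reads frequency counts from a delimited text file and lists the tables in an SQLite database. The column layout of the CSV files and the schema of the databases are not described on this page, so the layout assumed here (n-gram text in the first column, an integer count in the last column, tab as delimiter) is a guess that should be checked against a real file. The function names top_ngrams and list_tables are hypothetical helpers, not part of NB N-gram.

```python
import csv
import sqlite3
from collections import Counter


def top_ngrams(lines, n=10, delimiter="\t"):
    """Return the n most frequent n-grams from an iterable of delimited rows.

    ASSUMPTION: the first column holds the n-gram text and the last column
    holds an integer frequency. Verify this against a real file.
    """
    counts = Counter()
    # QUOTE_NONE treats quotation marks as ordinary characters, since
    # n-grams taken from running text may contain them.
    for row in csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        try:
            counts[row[0]] += int(row[-1])
        except ValueError:
            continue  # skip header rows or non-numeric counts
    return counts.most_common(n)


def list_tables(conn):
    """Return the table names in an SQLite database.

    Uses SQLite's built-in sqlite_master catalogue, so no knowledge of the
    actual schema is needed.
    """
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]
```

With a downloaded file, one might call top_ngrams(open(path, encoding="utf-8", newline="")) and list_tables(sqlite3.connect(db_path)), where path and db_path are placeholders for local file names. Listing the tables first is a cautious way to discover the database schema before writing queries against it.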

A word count by source (books/newspapers) and language variety (Bokmål/Nynorsk) is given in the JSON file.

Download resources

Extended metadata


Download metadata (CMDI XML): https://www.nb.no/sprakbanken/oai?verb=GetRecord&identifier=oai:nb.no:sbr-35&metadataPrefix=cmdi

dc:type	corpus
dc:title	N-grams from NBdigital
dc:identifier	oai:nb.no:sbr-35
dc:description	This resource contains n-grams (unigrams, bigrams and trigrams) from all books and newspapers that had been digitized at the National Library of Norway up to September 2013. The n-grams were extracted from a collection of approximately 220,000 books and 540,000 newspapers. The n-grams are available in two formats, CSV and SQLite. CSV is probably the more useful format for most developers, because the files are easy to import into standard applications. The SQLite files contain indexed databases, which are used by the NB N-gram service. Users who want to contribute to the development of NB N-gram can download the source code from GitHub and the SQLite databases from this page. A word count by source (books/newspapers) and language variety (Bokmål/Nynorsk) is given in the JSON file.
dc:publisher
dc:format	downloadable
dc:date	2015-06-02
dc:rights	Public
dc:rights	Creative Commons (CC)
dc:rights	Creative_Commons-ZERO (CC-ZERO)
dc:rights	https://creativecommons.org/publicdomain/zero/1.0/
dc:creator	National Library of Norway
dc:lang	Norwegian Bokmål
dc:lang	Norwegian Nynorsk
