NST N-gram – Danish News Text

This resource contains n-grams derived from a 290-million-word corpus of Danish news text from the newspapers Berlingske Tidende, Ekstrabladet and Politiken. The texts cover the period 1995-1999. The corpus was originally developed by Nordic Language Technology (NST) between 1997 and 2003. The n-grams were generated by Uni Research for the National Library and the Language Bank.

Sequences of one to six words have been generated (i.e., unigrams, bigrams, trigrams, 4-grams, 5-grams and 6-grams) and ordered both by frequency and alphabetically. For convenience, a collection of the 1000 most frequent n-grams of all types listed above is also made available as a separate download.
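
As a rough illustration of what such lists contain, the following Python sketch counts sequences of one to six words in a token list and orders them both by frequency and alphabetically. It is not the procedure used to build this resource: the whitespace tokenization, the tie-breaking rule, the function names (`count_ngrams`, `by_frequency`, `alphabetically`, `top_k`) and the sample sentence are all assumptions made for the example.

```python
from collections import Counter


def count_ngrams(tokens, max_n=6):
    """Count every n-gram of length 1..max_n in a list of tokens."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def by_frequency(counter):
    """Most frequent first; ties broken alphabetically (an assumption)."""
    return sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))


def alphabetically(counter):
    """Order n-grams by their word sequence."""
    return sorted(counter.items(), key=lambda kv: kv[0])


def top_k(counter, k=1000):
    """The k most frequent n-grams of one length."""
    return by_frequency(counter)[:k]


# Hypothetical sample text; simple whitespace tokenization is assumed.
tokens = "det er en god dag og det er en lang dag".split()
counts = count_ngrams(tokens)

for ngram, freq in top_k(counts[2], k=3):
    print(" ".join(ngram), freq)
```

A real run over hundreds of millions of words would need streaming input and on-disk counting rather than in-memory counters; the sketch only shows the two orderings and the top-k selection.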
