Lexical Resources

This page contains information about and links for downloading lexical resources distributed by Språkbanken.

Currently, the following lexical resources are available:

  • SNORRE Wordlist - terms extracted from the terminology database SNORRE
  • Wordnets for Norwegian Bokmål and Nynorsk
  • Norsk ordbank (Norwegian Word Bank), full form lexica for Norwegian Bokmål and Nynorsk
  • SCARRIE, a full form lexicon for Norwegian Bokmål
  • Pronunciation lexicon for Norwegian
  • Pronunciation lexicon for Swedish
  • Pronunciation lexicon for Danish

Please send any questions or feedback about these resources to sprakbanken@nb.no.

 

Wordlist extracted from the SNORRE terminology database – 2013-03-12

SNORRE is a multilingual terminology database containing terminology from many subject areas. The database has been developed through standardization work by Standards Norway in cooperation with the Language Council of Norway. SNORRE contains terminology in Norwegian Bokmål, Norwegian Nynorsk, English, French and German, as well as synonyms and abbreviations. Språkbanken’s version of SNORRE contains only the parallelised terms, not the definitions. The downloadable file is tab-separated plain text. The total number of entries (unique terms, regardless of language) is approximately 53,500.
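As a rough illustration of working with the tab-separated file, the Python sketch below reads each line into a list of fields. The column layout is not documented on this page, so the sample rows and their placeholder values are assumptions made for the example, not the actual SNORRE format.

```python
import csv
import io


def read_rows(handle):
    """Return the non-empty rows of a tab-separated text stream."""
    # QUOTE_NONE keeps quotation marks inside terms as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]


# Placeholder data only: the real file's columns may differ.
sample = io.StringIO(
    "term_nb\tterm_nn\tterm_en\n"
    "\n"
    "term_nb_2\tterm_nn_2\tterm_en_2\n"
)
rows = read_rows(sample)
print(len(rows), rows[0])
```

For the real download, an open file object would be passed in place of the in-memory sample; the file name and character encoding have to be taken from the download itself.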

For information about full access to SNORRE, including the definitions, please consult the home page of Standards Norway.

Language             Terms     Synonyms   Abbreviations
Norwegian Bokmål     49,678    2,257      945
Norwegian Nynorsk    48,875    3,448      900
English              48,213    3,012      1,024
French               17,623    843        423
German               18,749    876        203

Språkbanken’s users may freely use the SNORRE wordlist, provided Standards Norway is cited as source and copyright holder. A notice stating that Standards Norway developed the wordlist from the terminology database SNORRE must always accompany the resource.

 

Wordnets for Norwegian Bokmål and Nynorsk - 2013-04-09
Kaldera språkteknologi AS has developed wordnets for Norwegian (Bokmål and Nynorsk) for Språkbanken. Version 1.1.0 is now available.

 

Norsk ordbank (Norwegian Word Bank), developed at the University of Oslo
Norsk ordbank (Norwegian Word Bank) is composed of a lemma list and a set of inflection patterns. One or more inflection patterns can be applied to each lemma in the list. An inflection pattern contains one line for every inflected word form of the lemma. Each line contains a transformation pattern together with the morphological category and morphological features of the word form. The transformation pattern shows how the lemma is expanded to the inflected form.

The data are stored in six tables. There is one set of tables for each of the written standards Bokmål and Nynorsk.

The table "lemma" contains all the entries in Bokmålsordboka and Nynorskordboka, respectively, with the specification of the entry number. (Bokmålsordboka and Nynorskordboka are dictionaries based on official spelling.) The table "fullformsliste" contains all inflected forms of the lemmas in accordance with official spelling.

The tables "lemma_paradigme", "paradigm", "paradigme_boying", "boyingsgruppe" and "boying" contain the information necessary for generating the full form list from the list of lemmas. In other words, these tables connect the lemmas with inflection patterns, rules and categorial information.
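The idea of expanding lemmas through inflection patterns can be sketched in Python as follows. This is only an illustration under stated assumptions: the rule representation (a suffix to strip, a suffix to add, and a tag string), the pattern name "noun_masc", the English tag labels and both dictionaries are hypothetical and do not reproduce the actual table layout or tag set of Norsk ordbank.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    strip: str  # suffix removed from the lemma (may be empty)
    add: str    # suffix appended to the remaining stem
    tags: str   # morphological category and features of the form


def inflect(lemma, rule):
    """Apply one transformation rule to a lemma."""
    if not lemma.endswith(rule.strip):
        raise ValueError(f"{lemma!r} does not end in {rule.strip!r}")
    stem = lemma[: len(lemma) - len(rule.strip)]
    return stem + rule.add


# Hypothetical pattern: one rule per inflected word form.
PATTERNS = {
    "noun_masc": [
        Rule("", "", "noun masc sg indef"),
        Rule("", "en", "noun masc sg def"),
        Rule("", "er", "noun masc pl indef"),
        Rule("", "ene", "noun masc pl def"),
    ],
}

# Hypothetical link table: each lemma may use one or more patterns.
LEMMA_PATTERNS = {"bil": ["noun_masc"]}


def full_forms(lemma):
    """Generate (word form, tags) pairs for every pattern of a lemma."""
    return [
        (inflect(lemma, rule), rule.tags)
        for name in LEMMA_PATTERNS[lemma]
        for rule in PATTERNS[name]
    ]


for form, tags in full_forms("bil"):
    print(form, tags)
```

In this sketch the link table plays the role of the tables that connect lemmas with patterns, and the generated pairs correspond to rows of a full form list.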

The conditions of use of Norsk ordbank are given in the form below. Please sign and return the form via email to sprakbanken@nb.no, or by regular mail to Språkbanken, The National Library of Norway, PO Box 2674 Solli, NO-0203 Oslo, Norway.

SCARRIE, full form lexicon for Norwegian Bokmål
This full form lexicon was created as part of the development of an automatic proofreading program for Norwegian Bokmål. The word forms are tagged with information about the basic form (lemma), standardisation, style level, morphosyntactic features and alternate forms.

The main word list contains words from the open parts of speech (adjectives, adverbs, nouns and verbs). In all, the list contains approximately 361,000 full forms (72,500 lemmas).

A short description of the SCARRIE lexicon (format, license, etc.) is given in the file below. The report SCARRIE Deliverable 3.3.1 gives a more comprehensive description of, among other things, the tag set that is used in the word list.

Pronunciation lexicon for Norwegian, originally produced by NST
This pronunciation lexicon was originally produced by Nordisk språkteknologi AS (NST) and contains about 785,000 entries. The lexicon was prepared specifically for the development of speech technology. It is based on the 100,000 most frequent word forms in NST’s Norwegian text corpora and was later supplemented with additional words.

The entire lexicon is stored in a single plain-text file. Each entry occupies one line, with 51 information fields per line, separated by semicolons. Not all fields are relevant for all purposes, but the format makes it easy to extract the information one needs. The information in the various fields includes the linear structure of compounds (segmentation) and a phonetic transcription. Some of the transcriptions (approximately 250,000 words) were made manually, but most were generated automatically using various lexical tools. These tools, mostly written in Perl, can also be downloaded (see link below).
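Extracting selected fields from such a file might look like the Python sketch below. Only the field count (51) and the semicolon separator come from the description above; the column positions in WANTED and the synthetic test line are hypothetical placeholders, and the real positions have to be looked up in the lexicon documentation.

```python
EXPECTED_FIELDS = 51  # field count stated in the lexicon description


def split_entry(line):
    """Split one lexicon line into its semicolon-separated fields."""
    fields = line.rstrip("\r\n").split(";")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(fields)}")
    return fields


def pick(fields, wanted):
    """Keep only the named columns; `wanted` maps a name to a 0-based index."""
    return {name: fields[index] for name, index in wanted.items()}


# Hypothetical column positions, chosen only for this example.
WANTED = {"word": 0, "transcription": 11}

# Synthetic line with placeholder values f0 ... f50.
line = ";".join(f"f{i}" for i in range(EXPECTED_FIELDS)) + "\n"
print(pick(split_entry(line), WANTED))
```

The plain split assumes that no field contains a literal semicolon; the length check makes a violation of that assumption visible instead of silently shifting columns.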

The words are transcribed using the Speech Assessment Methods Phonetic Alphabet (SAMPA). See http://www.phon.ucl.ac.uk/home/sampa/index.html for further information about this transcription format.

The lexicon and the accompanying tools may be downloaded and used without restrictions for language technology research and development. The documentation below describes the database in more detail. Note that it is so far available only in Norwegian.

Pronunciation lexicon for Swedish, originally produced by NST
This pronunciation lexicon was originally produced by Nordisk språkteknologi AS (NST) and contains about 927,000 entries. The lexicon was prepared specifically for the development of speech technology. It is based on the 100,000 most frequent word forms in NST’s Swedish text corpus and was later supplemented with additional words.

The entire lexicon is stored in a single plain-text file. Each entry occupies one line, with 51 information fields per line, separated by semicolons. Not all fields are relevant for all purposes, but the format makes it easy to extract the information one needs. The information in the various fields includes the linear structure of compounds (segmentation) and a phonetic transcription. Some of the transcriptions (approximately 250,000 words) were made manually, but most were generated automatically using various lexical tools. These tools, mostly written in Perl, can also be downloaded (see link below).

The words are transcribed using the Speech Assessment Methods Phonetic Alphabet (SAMPA). See http://www.phon.ucl.ac.uk/home/sampa/index.html for further information about this transcription format.

The lexicon and the accompanying tools may be downloaded and used without restrictions for language technology research and development. The documentation below describes the database in more detail. Note that it is so far available only in Norwegian.

Pronunciation lexicon for Danish, originally produced by NST
This pronunciation lexicon was originally produced by Nordisk språkteknologi AS (NST) and contains about 238,000 entries. The lexicon was prepared specifically for the development of speech technology. It is based on the 100,000 most frequent word forms in NST’s Norwegian text corpora and was later supplemented with additional words.

The entire lexicon is stored in a single plain-text file. Each entry occupies one line, with 51 information fields per line, separated by semicolons. Not all fields are relevant for all purposes, but the format makes it easy to extract the information one needs. The information in the various fields includes the linear structure of compounds (segmentation) and a phonetic transcription. The transcriptions were made manually and, unlike the Norwegian and Swedish lexica, the Danish lexicon contains no machine-generated entries. Still, some lexical tools, mostly written in Perl, have also been developed for Danish (see link below).

The words are transcribed using the Speech Assessment Methods Phonetic Alphabet (SAMPA). See http://www.phon.ucl.ac.uk/home/sampa/index.html for further information about this transcription format.

The lexicon and the accompanying tools may be downloaded and used without restrictions for language technology research and development. The documentation below describes the database in more detail. Note that it is so far available only in Norwegian.
