Norwegian Parliamentary Speech Corpus

This is a beta release (version 0.2) of the Norwegian Parliamentary Speech Corpus (NPSC). The corpus is being developed by the Norwegian Language Bank at the National Library of Norway. The project was initiated in 2019 and is still ongoing. The NPSC consists of audio recordings of debates in Stortinget (the Norwegian parliament), corresponding orthographic transcriptions in either Norwegian Bokmål or Norwegian Nynorsk, and various metadata about the speakers. The official proceedings from the meetings are also included in the corpus for reference.

Transcription was first done automatically; subsequently, the output of the automatic process was manually checked and corrected by trained linguists or philologists. Finally, all transcriptions were proofread to ensure consistency and accuracy.

NPSC is primarily intended as an open-source dataset for ASR development.

The audio files in the corpus contain the speech of entire days of plenary meetings from 2017 and 2018 (or, if a meeting lasts more than six hours, the first six hours of that day). Since these audio files are quite large, individual audio files for each sentence are also included.

A stable and substantially larger version of the NPSC will be launched later in 2021. To make the stable version as good as possible, we would greatly appreciate any feedback and suggestions for improvement on this beta version. Please send them to our e-mail address, sprakbanken@nb.no.
