NB Samtale is a speech corpus made by the Language Bank at the National Library of Norway. The corpus contains orthographically transcribed speech from podcasts and recordings of live events at the National Library. The corpus is intended as an open source dataset for Automatic Speech Recognition (ASR) development, and is specifically aimed at improving ASR systems’ handling of conversational speech.
The corpus consists of 12,080 segments, a total of 24 hours of transcribed speech from 69 speakers. The corpus covers variation in both gender and dialect, with speakers from five broad dialect areas represented. Both Bokmål and Nynorsk transcriptions are present in the corpus, with Nynorsk making up approximately 25% of the transcriptions.
We greatly appreciate feedback and suggestions for improvements. Please contact us at sprakbanken@nb.no.
Extended metadata
resource Common Info:
resource Type: corpus
identification Info:
resource Name: NB Samtale
resource Name: Norwegian Conversation Speech Corpus
description: NB Samtale is a speech corpus of orthographically transcribed audio taken from podcasts and recordings of events at the National Library of Norway. The corpus contains conversations between several people, and the speech is spontaneous, with typical features of spoken language. The audio was selected with a view to a good gender balance and good dialect variation, and the corpus has transcriptions in both Bokmål and Nynorsk.
NB Samtale is intended as an open source dataset for training automatic speech recognition, specifically recognition of spontaneous speech between several people in conversation. In total there are 24 hours of transcribed speech from 69 speakers, divided into 12,080 segments, each of which is an individual WAV file. The metadata include, among other things, the source file, timecode and duration of each segment, as well as the speakers' gender, dialect and written standard (målform).
NB Samtale was developed by the Language Bank (Språkbanken) at the National Library of Norway. We greatly appreciate feedback and suggestions for improvements. Contact us at sprakbanken@nb.no.
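The segment-level metadata described above (source file, timecode, duration, speaker gender, dialect and written standard) lends itself to simple per-group summaries. The following is a minimal sketch only: the `Segment` class, its field names, the helper functions and the example rows are all hypothetical and invented for illustration. The actual file format and column names of the corpus metadata are not stated here and should be checked against the corpus documentation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    # Hypothetical field names; the real metadata columns may differ.
    source_file: str
    start_seconds: float
    duration_seconds: float
    speaker_gender: str
    speaker_dialect: str   # placeholder labels, not the corpus's dialect areas
    written_standard: str  # "bokmål" or "nynorsk"


def seconds_by(segments, key):
    """Sum segment durations in seconds, grouped by the named attribute."""
    totals = defaultdict(float)
    for segment in segments:
        totals[getattr(segment, key)] += segment.duration_seconds
    return dict(totals)


def share_of_segments(segments, key, value):
    """Fraction of segments whose attribute `key` equals `value`."""
    if not segments:
        return 0.0
    matching = sum(1 for s in segments if getattr(s, key) == value)
    return matching / len(segments)


# Made-up rows for demonstration; these are not real corpus data.
example = [
    Segment("episode_a.wav", 0.0, 6.0, "female", "area_1", "bokmål"),
    Segment("episode_a.wav", 6.0, 8.0, "male", "area_2", "bokmål"),
    Segment("episode_b.wav", 0.0, 7.5, "female", "area_3", "nynorsk"),
    Segment("episode_b.wav", 7.5, 4.5, "male", "area_1", "bokmål"),
]

duration_per_standard = seconds_by(example, "written_standard")
nynorsk_share = share_of_segments(example, "written_standard", "nynorsk")
```

A summary like this can be used to check the gender, dialect and Bokmål/Nynorsk balance of a training or test split before using it for ASR development.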
description: NB Samtale is a speech corpus made by the Language Bank at the National Library of Norway. The corpus contains orthographically transcribed speech from podcasts and recordings of live events at the National Library. The corpus is intended as an open source dataset for Automatic Speech Recognition (ASR) development, and is specifically aimed at improving ASR systems' handling of conversational speech.
The corpus consists of 12,080 segments, a total of 24 hours of transcribed speech from 69 speakers. The corpus covers variation in both gender and dialect, with speakers from five broad dialect areas represented. Both Bokmål and Nynorsk transcriptions are present in the corpus, with Nynorsk making up approximately 25% of the transcriptions.
We greatly appreciate feedback and suggestions for improvements. Please contact us at sprakbanken@nb.no.
annotation Mode Details: Automatic annotation followed by manual correction and proofreading by two linguists.
annotation Start Date: 01.07.2022
annotation End Date: 18.08.2023
dc:type
corpus
dc:title
Norwegian Conversation Speech Corpus
dc:identifier
oai:nb.no:sbr-85
dc:description
NB Samtale is a speech corpus made by the Language Bank at the National Library of Norway. The corpus contains orthographically transcribed speech from podcasts and recordings of live events at the National Library. The corpus is intended as an open source dataset for Automatic Speech Recognition (ASR) development, and is specifically aimed at improving ASR systems' handling of conversational speech.
The corpus consists of 12,080 segments, a total of 24 hours of transcribed speech from 69 speakers. The corpus covers variation in both gender and dialect, with speakers from five broad dialect areas represented. Both Bokmål and Nynorsk transcriptions are present in the corpus, with Nynorsk making up approximately 25% of the transcriptions.
We greatly appreciate feedback and suggestions for improvements. Please contact us at sprakbanken@nb.no.