In January 2012, Lingit AS in Trondheim started building an acoustic-phonetic speech database for Språkbanken. The database will be a welcome resource for developers working with speech technology, but will also be valuable for phonetic research.
The speech database will cover phonological variation in Norwegian. It will not be possible to cover every detail of the Norwegian dialects within the scope of the project, but the main features will be documented, in a way similar to the TIMIT corpus for American English, and it will be possible to supplement the database with additional speakers later. The delivery will cover the fundamental Norwegian phoneme inventory and the most central intonation patterns.
The recorded material will consist of sentences extracted from, among other sources, Norsk aviskorpus (The Norwegian Newspaper Corpus). The sentences are read in the way that is most natural for the individual speaker. Much effort will be put into preparing manuscripts that cover as much as possible of the phonological variation found in Norwegian. Acoustic variation is also essential, so the selection of speakers will be balanced with regard to age and sex. The selection of speakers is concentrated around larger cities that cover the defined phoneme inventory. For dialects that are close to the Nynorsk written norm, the effort will be concentrated in core areas in Hordaland, Sogn og Fjordane, and Møre og Romsdal.
The recordings take place in professional sound studios, using both a close-talk microphone and a studio microphone. This gives good control of the relevant acoustic parameters, primarily with a view to minimising background noise. The recordings are transcribed in X-SAMPA (see http://www.phon.ucl.ac.uk/home/sampa/x-sampa.htm).
The most time-consuming work is the time-alignment of the recordings: the sound signal will be aligned with the transcription at the phoneme and word level. The average deviation of the time-alignment will be at most 20 milliseconds. For transitions that involve stops, the deviation will typically be much less than 20 ms, while for transitions between vowels (for instance Eia) and certain consonant-vowel combinations (for instance ha), it will be larger.
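As an illustration of how an average alignment deviation could be checked against the 20 ms limit, the following Python sketch compares boundary times from a reference segmentation with those from an alignment. The function name, the boundary values, and the use of the mean absolute difference as the deviation measure are assumptions made for this example; the project's own evaluation procedure is not described here.

```python
from statistics import mean


def mean_boundary_deviation_ms(reference, aligned):
    """Mean absolute difference in ms between paired boundary times (seconds)."""
    if len(reference) != len(aligned):
        raise ValueError("boundary lists must have the same length")
    if not reference:
        raise ValueError("at least one boundary is required")
    return mean(abs(r - a) * 1000.0 for r, a in zip(reference, aligned))


# Hypothetical boundary times in seconds, invented for illustration only.
reference_boundaries = [0.120, 0.310, 0.455, 0.700]
aligned_boundaries = [0.125, 0.300, 0.470, 0.690]

deviation = mean_boundary_deviation_ms(reference_boundaries, aligned_boundaries)

# Assumed tolerance: the 20 ms average deviation stated in the text.
within_tolerance = deviation <= 20.0
print(f"mean deviation: {deviation:.1f} ms, within tolerance: {within_tolerance}")
```

Averaging over all boundaries means that small deviations at stop transitions can offset the larger ones expected at vowel-vowel transitions.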
The audio files will be stored in uncompressed WAVE format (Waveform Audio File Format) with a 48 kHz sampling frequency and at least 16-bit resolution. This format can easily be converted to other audio formats. Metadata will be coded in XML or an XML-compatible format.
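A file can be checked against the stated audio format with Python's standard wave module, as in the minimal sketch below. The function names and the demo file are hypothetical and not part of the delivery; the sketch assumes plain PCM WAVE files, which is what the wave module reads.

```python
import os
import tempfile
import wave


def check_wav_format(path, expected_rate=48000, min_bits=16):
    """Return (ok, sample_rate, bits_per_sample) for a PCM WAVE file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
    return (rate == expected_rate and bits >= min_bits, rate, bits)


def write_silent_wav(path, rate=48000, sample_width=2, n_frames=480):
    """Hypothetical helper: write a short mono file of silence for the demo."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * sample_width * n_frames)


# Demo on a temporary file; a real check would point at a delivered recording.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.wav")
    write_silent_wav(demo_path)
    ok, rate, bits = check_wav_format(demo_path)

print(f"format ok: {ok} ({rate} Hz, {bits}-bit)")
```

The resolution test uses "at least" rather than "exactly" 16 bits, matching the wording of the specification above.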