The NORINT Corpus

Clarino - Textlab


Updated: 2017-08-17

The NORINT Corpus consists of speech from 47 and written texts from 116 adult learners of Norwegian as a second language, all of whom were taking advanced Norwegian courses (approximately CEFR level B2) at the University of Oslo during the summers of 2014 and 2015.

The NORINT Corpus is divided into three sub-parts:

- NORINT Speech: The speech part of the corpus consists of interviews and conversations, 140,000 words in total. In the interviews, a teacher asks L2 learners general questions about their background, studies, work, and future plans. In addition, the same L2 learners converse in pairs about topics of their choice, such as culture, leisure, travel, or life in Norway. Both the interviews and the conversations are available as audio and video recordings.

The recordings are transcribed orthographically with the transcription tool Elan.

- NORINT Recited: 57 L2 learners, 47 of whom contributed to the NORINT Speech sub-part, recite a short story, as well as 60 non-contextualized sentences. This part of the corpus has been audio-recorded.

- NORINT Text: The text part of the corpus consists of 53,247 words from 116 exam papers written by adult L2 learners taking their Norwegian exams. The informants partly overlap with those in NORINT Speech and NORINT Recited, but participants cannot be identified in the corpus for privacy reasons.

The texts are available in three formats: the original handwritten version in PDF format, a digital copy of the original version, and a version in which all orthographic errors are corrected. The original text version and the corrected version are linked together.

The corpus is searchable in the search interface Glossa, and the transcriptions are linked to audio and video files.
