Extended metadata
- resource Common Info
- resource Type: corpus
- identification Info
- resource Name: Translation Memory from Doffin
- resource Name: Omsetjingsminne frå Doffin
- description: This corpus contains data from Doffin, the Norwegian web-based database for notices of public procurement and procurement in the utility sector, managed by the Norwegian Agency for Public and Financial Management. The Language Bank received the data as an XML database dump. The dump consisted of 41,143 document pairs (original and translation), 40,631 of which were translations from Norwegian to English. Only the latter are included in the corpus. Of the originally Norwegian documents, 39,893 were in Norwegian Bokmål and 736 in Norwegian Nynorsk. Originals and translations were first aligned at document level using an internal document identifier; the sentences were then extracted with the NLTK Punkt Sentence Tokenizer and aligned with Hunalign. Exact duplicate translations were discarded. This yielded a total of 293,649 translation units (TUs) for Norwegian Bokmål to English and 6,342 TUs for Norwegian Nynorsk to English. A TU is a translation pair consisting of an original text and its aligned translation, and corresponds to a more or less meaningful linguistic unit, typically a clause, a heading, etc. A TU may also consist of a single word or of several clauses. The translation units for the two language pairs are distributed as two separate files, both in TMX 1.4 format (an XML-based format).
- description: Dette korpuset inneheld data frå Doffin, den nasjonale kunngjeringsbasen for offentlege anskaffingar, forvalta av Direktoratet for Forvaltning og Økonomistyring (DFØ). Språkbanken fekk dataa i form av ein dump av ein XML-database. Dumpen bestod av 41.143 dokumentpar (originalar og omsetjingar). 40.631 av desse var omsetjingar frå norsk til engelsk. Berre dei sistnemnde er inkluderte i korpuset. Av dei opphavleg norske dokumenta er 39.893 på bokmål og 736 på nynorsk. Original og omsetjing vart først parallelliserte på dokumentnivå ved hjelp av ein intern dokumentidentifikator, deretter vart setningane identifiserte med NLTK Punkt Sentence Tokenizer og parallelliserte ved å nytte Hunalign. Dupliserte omsetjingar (eksakte duplikat) vart kasserte. Totalt fann me 293.649 omsetjingseiningar (Translation Units – TU) for bokmål til engelsk, og 6.342 TUar for nynorsk til engelsk. Ein TU er eit omsetjingspar med ei originaltekst og ei parallellisert omsetjing, og svarar til ei meir eller mindre meiningsberande språkleg eining, typisk ei setning, overskrift eller liknande. Ein TU kan òg bestå av eit enkeltord eller fleire setningar. Omsetjingseiningane for bokmål og nynorsk vert distribuerte som to separate filer, begge i TMX 1.4-format (ein variant av XML).
- url: https://www.nb.no/sprakbanken/resource/translation-memory-from-doffin/
- identifier: sbr-63
- distribution Info
- licence Info
- user Category: Public
- distribution Access Medium: downloadable
- download Location: https://www.nb.no/sprakbanken/wp-json/resource/v1/sbr-63
- execution Location:
- attribution Text:
- licence
- licence Family: Creative Commons (CC)
- licence Name: Creative_Commons-ZERO (CC-ZERO)
- licence Url: https://creativecommons.org/publicdomain/zero/1.0/
- conditions Of Use:
- non Standard Conditions Of Use:
- distribution Rights Holder
- actor Info
- actor Type: organization
- role: Distribution Rights Holder
- organization Info
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- communication Info
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- actor Info
- licensor:
- actor Info
- actor Type: organization
- role: Licensor
- organization Info
- organization Name: Norwegian Agency for Public and Financial Management
- organization Name: Direktoratet for Forvaltning og Økonomistyring
- organization Short Name: DFØ
- organization Short Name: DFØ
- department Name: Doffin
- department Name: Doffin
- licence Info
- ipr Holder
- contact
- actor Info
- actor Type: organization
- role: Contact
- organization Info
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- department Name: The Language Bank
- department Name: Språkbanken
- communication Info
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- actor Info
- metadata Info
- metadata Creation Date: 18.12.2020
- metadata Language Name: English
- metadata Language Id: eng
- metadata Last Date Updated: 18.12.2020
- metadata Creator
- actor Info
- actor Type: person
- role: Metadata Creator
- person Info
- surname: Lindstad
- given Name: Arne Martinus
- affiliation:
- organization Info
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- communication Info
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- actor Info
- version: 2020
- revision:
- last Date Updated: 04.11.2020
- validated: yes
- validation Type: formal
- validation Mode: automatic
- validation Mode Details: sentences extracted using the NLTK Punkt Sentence Tokenizer and aligned using Hunalign
- validation Extent: full
- validator:
- actor Info
- actor Type: person
- role: Resource Validator
- person Info
- surname: Birkenes
- given Name: Magnus Breder
- affiliation:
- organization Info
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- communication Info
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- documentation Unstructured
- role: documentation
- document Unstructured: readme file
- creation Start Date:
- creation End Date: 04.11.2020
- resource Creator
- actor Info
- actor Type: organization
- role: Resource Creator
- organization Info
- organization Name: Norwegian Agency for Public and Financial Management
- organization Name: Direktoratet for Forvaltning og Økonomistyring
- organization Short Name: DFØ
- organization Short Name: DFØ
- department Name: Doffin
- department Name: Doffin
- actor Info
- actor Type: person
- role: Resource Creator
- person Info
- surname: Birkenes
- given Name: Magnus Breder
- affiliation:
- organization Info
- organization Name: National Library of Norway
- organization Name: Nasjonalbiblioteket
- organization Short Name: NLN
- organization Short Name: NB
- communication Info
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- actor Info
- corpus Info
- corpus Type: Written Corpus
- corpus Part Info
- media Type: text
- corpus Audio Info
- corpus Text Info
- text Format Info
- mime Type: application/x-tmx+xml
- size Per Text Format
- size Info
- size: 299991
- size Unit: units
- size Info
- size: 2
- size Unit: files
- size Info
- character Encoding Info
- character Encoding: UTF-8
- text Format Info
- corpus Text Ngram Info
- ngram Info
- base Item:
- order:
- ngram Info
- corpus Part General Info
- linguality Info
- linguality Type: multilingual
- multilinguality Type: parallel
- multilinguality Type Details: translation memory
- language Info
- language Id: nob
- language Name: Norwegian Bokmål
- language Variety Info
- language Variety Type: other
- language Variety Name: technical
- language Info
- language Id: nno
- language Name: Norwegian Nynorsk
- language Variety Info
- language Variety Type: other
- language Variety Name: technical
- language Info
- language Id: eng
- language Name: English
- language Variety Info
- language Variety Type: other
- language Variety Name: technical
- modality Info
- modality Type: writtenLanguage
- modality Type Details:
- size Info
- size: 299991
- size Unit: units
- size Info
- size: 2
- size Unit: files
- annotation Info
- annotation Type: alignment
- segmentation Level: sentence
- annotation Mode: automatic
- geographic Coverage Info
- geographic Coverage: nor
- linguality Info
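The description in this record states that exact duplicate translations were discarded, without saying how. The following is a minimal sketch of one way to do that, assuming each translation is held as a (source, target) tuple; the function name and the sample pairs are invented for illustration and are not taken from the corpus or from the Language Bank's pipeline.

```python
def drop_exact_duplicates(pairs):
    """Return the pairs with exact repeats removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for pair in pairs:
        # Only byte-for-byte identical (source, target) tuples count as duplicates.
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


# Hypothetical sample data, not real corpus content.
pairs = [
    ("Tildeling av kontrakt", "Contract award"),
    ("Frist", "Deadline"),
    ("Tildeling av kontrakt", "Contract award"),
]
unique = drop_exact_duplicates(pairs)
```

Whether the original deduplication was applied per document pair or per translation unit is not stated in the record, so this sketch should not be read as a reproduction of it.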
dc:type | corpus |
dc:title | Translation Memory from Doffin |
dc:identifier | oai:nb.no:sbr-63 |
dc:description | This corpus contains data from Doffin, the Norwegian web-based database for notices of public procurement and procurement in the utility sector, managed by the Norwegian Agency for Public and Financial Management. The Language Bank received the data as an XML database dump. The dump consisted of 41,143 document pairs (original and translation), 40,631 of which were translations from Norwegian to English. Only the latter are included in the corpus. Of the originally Norwegian documents, 39,893 were in Norwegian Bokmål and 736 in Norwegian Nynorsk. Originals and translations were first aligned at document level using an internal document identifier; the sentences were then extracted with the NLTK Punkt Sentence Tokenizer and aligned with Hunalign. Exact duplicate translations were discarded. This yielded a total of 293,649 translation units (TUs) for Norwegian Bokmål to English and 6,342 TUs for Norwegian Nynorsk to English. A TU is a translation pair consisting of an original text and its aligned translation, and corresponds to a more or less meaningful linguistic unit, typically a clause, a heading, etc. A TU may also consist of a single word or of several clauses. The translation units for the two language pairs are distributed as two separate files, both in TMX 1.4 format (an XML-based format). |
dc:publisher | |
dc:format | downloadable |
dc:date | |
dc:date | 2020-11-04 |
dc:rights | Public |
dc:rights | Creative Commons (CC) |
dc:rights | Creative_Commons-ZERO (CC-ZERO) |
dc:rights | https://creativecommons.org/publicdomain/zero/1.0/ |
dc:creator | Norwegian Agency for Public and Financial Management |
dc:creator | Magnus Breder Birkenes |
dc:lang | Norwegian Bokmål |
dc:lang | Norwegian Nynorsk |
dc:lang | English |
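
Since the corpus is distributed as TMX 1.4 files, a short sketch of how such a file might be read with the Python standard library may be useful. It assumes the usual TMX layout of `tu` elements containing `tuv` elements with an `xml:lang` attribute and a `seg` child. The helper name `read_tmx`, the language codes `nb` and `en`, and the sample content are all invented for illustration; the codes and header values used in the actual files are not stated in this record.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx(source):
    """Yield one dict per <tu>, mapping language code to segment text."""
    tree = ET.parse(source)
    for tu in tree.getroot().iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            # Fall back to a plain "lang" attribute, used by older TMX versions.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text nested inside inline markup.
                unit[lang] = "".join(seg.itertext()).strip()
        yield unit


# Hypothetical sample, not real corpus content.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="nb" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="nb"><seg>Kunngjøring av konkurranse</seg></tuv>
      <tuv xml:lang="en"><seg>Contract notice</seg></tuv>
    </tu>
  </body>
</tmx>""".encode("utf-8")

units = list(read_tmx(io.BytesIO(SAMPLE)))
```

For the real files, `read_tmx` can be given a file path instead of the in-memory sample. Note that `ET.parse` loads the whole document into memory; for very large files, `ET.iterparse` may be a better fit.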