N-grams from NBdigital 2022
Extended metadata
- resource Common Info
- resource Type: corpus
- identification Info
- resource Name: N-grammer fra NBdigital 2022
- resource Name: N-grams from NBdigital 2022
- description: This resource contains n-grams (unigrams, bigrams and trigrams) from all books and newspapers that had been digitized at the National Library of Norway as of 15 July 2022. The n-grams were extracted from approximately 610,000 books and 4,000,000 newspapers, about 138.5 billion tokens (words and punctuation) in total. The n-grams are provided as UTF-8-encoded CSV files.
  Columns in the n-gram CSV files:
  - first: the first word of the n-gram (in unigrams, bigrams and trigrams)
  - second: the second word of the n-gram (in bigrams and trigrams)
  - third: the third word of the n-gram (in trigrams)
  - lang: the language code of the n-gram (books only; newspapers currently have no language classification)
  - freq: the total frequency of the n-gram in the collection of books or newspapers
  - json: a dictionary with the raw frequency per year
  totals.json contains total frequencies per year for the book and newspaper corpora. With these, relative frequencies can be calculated so that frequencies can be compared across years, as in NB N-gram. metadata-digibok.csv and metadata-digavis.csv contain basic metadata for all books and newspapers included in the book and newspaper corpora. For more extensive metadata, see Oria or the National Library's APIs at https://api.nb.no/. See the documentation files for further information.
- resource Short Name: NBngram2022
- url: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-76/
- PID: hdl:21.11146/76
- identifier: sbr-76
- distribution Info
- licence Info
- user Category: Public
- distribution Access Medium: downloadable
- download Location: https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-76/
- licence
- licence Family: Creative Commons (CC)
- licence Name: Creative_Commons-ZERO (CC-ZERO)
- licence Url: https://creativecommons.org/publicdomain/zero/1.0/
- licensor:
- actor Info
- actor Type: organization
- role: Licensor
- organization Info
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- communication Info
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- contact
- actor Info
- actor Type: organization
- role: Contact
- organization Info
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- communication Info
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- metadata Info
- metadata Creation Date: 21.12.2022
- metadata Language Name: English
- metadata Language Id: en
- metadata Last Date Updated: 21.12.2022
- metadata Creator
- actor Info
- actor Type: person
- role: Metadata Creator
- person Info
- surname: Lindstad
- given Name: Arne Martinus
- affiliation:
- organization Info
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- communication Info
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- version: 2022
- last Date Updated: 21.12.2022
- documentation Unstructured
- role: documentation
- document Unstructured: Documentation files in English and Norwegian. Metadata files accompanying the data.
- creation Start Date: 15.07.2022
- creation End Date: 21.12.2022
- resource Creator
- actor Info
- actor Type: person
- role: Resource Creator
- person Info
- surname: Birkenes
- given Name: Magnus Breder
- affiliation:
- organization Info
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- communication Info
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- actor Info
- actor Type: person
- role: Resource Creator
- person Info
- surname: Johnsen
- given Name: Lars
- affiliation:
- organization Info
- organization Name: Nasjonalbiblioteket
- organization Name: National Library of Norway
- organization Short Name: NB
- organization Short Name: NLN
- department Name: Språkbanken
- department Name: The Language Bank
- communication Info
- email: sprakbanken@nb.no
- url: https://www.nb.no/sprakbanken/
- address: P.O. Box 2674 Solli
- zip Code: 0203
- city: Oslo
- region: Oslo
- country: Norway
- corpus Info
- corpus Type: Ngram Corpus
- corpus Part Info
- media Type: textNgram
- corpus Text Ngram Info
- ngram Info
- base Item: word
- order: 3
- text Format Info
- mime Type: text/csv
- size Per Text Format
- size Info
- size: 8
- size Unit: files
- size Info
- size: 138446410995
- size Unit: tokens
- size Info
- size: 47.6
- size Unit: gb
- text Format Info
- mime Type: application/json
- size Per Text Format
- size Info
- size: 1
- size Unit: files
- size Info
- size: 21
- size Unit: kb
- character Encoding Info
- character Encoding: UTF-8
- corpus Part General Info
- linguality Info
- linguality Type: multilingual
- multilinguality Type: other
- multilinguality Type Details: N-grams extracted from text in several languages. Text from digitized books of varying genre and newspapers.
- language Info
- language Id: nb
- language Name: Norwegian Bokmål
- language Info
- language Id: nn
- language Name: Norwegian Nynorsk
- language Info
- language Id: se
- language Name: Northern Sami
- language Info
- language Id: sma
- language Name: Southern Sami
- language Info
- language Id: smj
- language Name: Lule Sami
- language Info
- language Id: fkv
- language Name: Kven
- modality Info
- modality Type: writtenLanguage
- modality Type Details: Text from digitized books and newspapers.
- size Per Modality
- size Info
- size: 138446410995
- size Unit: tokens
- size Info
- size: 4610000
- size Unit: other
- time Coverage Info
- time Coverage: 1800-2022
dc:type | corpus |
dc:title | N-grams from NBdigital 2022 |
dc:identifier | oai:nb.no:sbr-76 |
dc:description | This corpus contains n-grams (unigrams, bigrams and trigrams) from all books and newspapers that had been digitized at the National Library of Norway as of 15 July 2022. The n-grams are based on approximately 610,000 books and 4,000,000 newspapers, about 138.5 billion tokens (words and punctuation) in total. The n-grams are provided as UTF-8-encoded CSV files. The columns in the n-gram CSV files are: first, the first word of the n-gram (in unigrams, bigrams and trigrams); second, the second word (in bigrams and trigrams); third, the third word (in trigrams); lang, the language code of the n-gram (books only; newspapers currently have no language classification); freq, the total frequency of the n-gram in the collection of books or newspapers; json, a dictionary with the raw frequency per year. totals.json contains total frequencies per year for the book and newspaper corpora; with these, relative frequencies can easily be calculated for comparison across years, as in NB N-gram. metadata-digibok.csv and metadata-digavis.csv contain basic metadata for all books and newspapers included in the book and newspaper corpora. For more detailed metadata, see Oria or the National Library's APIs at https://api.nb.no/. See the documentation files for more information. |
dc:publisher | |
dc:format | downloadable |
dc:date | 2022-07-15 |
dc:date | 2022-12-21 |
dc:rights | Public |
dc:rights | Creative Commons (CC) |
dc:rights | Creative_Commons-ZERO (CC-ZERO) |
dc:rights | https://creativecommons.org/publicdomain/zero/1.0/ |
dc:creator | Magnus Breder Birkenes |
dc:creator | Lars Johnsen |
dc:lang | Norwegian Bokmål |
dc:lang | Norwegian Nynorsk |
dc:lang | Northern Sami |
dc:lang | Southern Sami |
dc:lang | Lule Sami |
dc:lang | Kven |
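The description notes that the per-year totals in totals.json can be used to turn raw frequencies into relative frequencies. The Python sketch below shows one way that calculation might look. It assumes, without confirmation from the documentation, that both the json column of an n-gram row and the relevant part of totals.json can be read as dictionaries mapping a year string to a count. The function name `relative_frequencies`, the `per` parameter and all numbers are made up for illustration.

```python
from typing import Dict


def relative_frequencies(
    raw_by_year: Dict[str, int],
    totals_by_year: Dict[str, int],
    per: int = 1_000_000,
) -> Dict[str, float]:
    """Scale raw yearly counts to occurrences per `per` tokens."""
    result: Dict[str, float] = {}
    for year, count in raw_by_year.items():
        total = totals_by_year.get(year)
        if not total:  # no total (or a zero total) for this year: skip it
            continue
        result[year] = count * per / total
    return result


# Made-up numbers: the same raw count in a small and a large year.
raw = {"1950": 30, "2000": 30}
totals = {"1950": 1_000_000, "2000": 3_000_000}  # assumed shape of the totals
rel = relative_frequencies(raw, totals)
print(rel)
```

The same raw count gives a lower relative frequency in the year with more tokens, which is why relative rather than raw frequencies are used when comparing across years. Years that have no total are skipped rather than guessed.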
Download resources
- ngram-2022-digavis-unigram.csv.gz
- ngram-2022-digavis-bigram.csv.gz
- ngram-2022-digavis-trigram.csv.gz
- ngram-2022-digibok-unigram.csv.gz
- ngram-2022-digibok-bigram.csv.gz
- ngram-2022-digibok-trigram.csv.gz
- ngram-2022-metadata-digavis.csv.gz
- ngram-2022-metadata-digibok.csv.gz
- ngram-2022-totals.json
- ngram-2022-README-eng.md
- ngram-2022-README-nob.md
- 2022_NBngram.pdf
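The n-gram files listed above carry a .csv.gz extension, so they can presumably be streamed row by row without unpacking them first. The sketch below is one way to do that with the Python standard library. It assumes a comma-separated file whose first row holds the column names (here first, lang, freq and json, as a book unigram file might look); the README files should be checked for the actual layout. The function name `read_ngrams`, the sample file and its values are hypothetical.

```python
import csv
import gzip
import json
import os
import tempfile
from typing import Iterator, List, Optional


def read_ngrams(path: str, columns: Optional[List[str]] = None) -> Iterator[dict]:
    """Yield one dict per n-gram row from a gzip-compressed CSV file.

    If `columns` is None, the first row is taken as the header (an
    assumption about the file layout); otherwise `columns` names the
    fields in order and every row is treated as data.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle, fieldnames=columns):
            row["freq"] = int(row["freq"])         # total frequency
            row["json"] = json.loads(row["json"])  # raw frequency per year
            yield row


# Write a tiny made-up sample file, then read it back.
sample = 'first,lang,freq,json\nog,nob,5,"{""1950"": 2, ""2000"": 3}"\n'
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample-unigram.csv.gz")
    with gzip.open(path, mode="wt", encoding="utf-8", newline="") as out:
        out.write(sample)
    rows = list(read_ngrams(path))

print(rows[0]["first"], rows[0]["freq"], rows[0]["json"])
```

Because the function is a generator, a large file is read one row at a time instead of being loaded into memory. If the real files turn out to have no header row, the column names can be passed explicitly through `columns`.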