Training corpus ssj500kv1.2

The ssj500k training corpus is based on two training corpora built within the JOS project. It contains the entire jos100k corpus and an additional 400,000 words from the million-word jos1M corpus. To build a training corpus, the text, a sequence of characters (letters, numbers, spaces, symbols, etc.), must first be divided into meaningful units such as paragraphs, sentences, words and punctuation. These steps are called segmentation (sentence identification) and tokenization (identification of tokens, i.e. words and punctuation). Two further types of information are then attributed to each word: a base form, or lemma (jagodam, jagodami -> jagoda), and a morphosyntactic tag. The tag is formed as an acronym that encodes the word class and related morphosyntactic features, for example Somei = samostalnik (noun), občno ime (common noun), moški spol (masculine gender), ednina (singular), imenovalnik (nominative). The ssj500k corpus uses the JOS tagset, which contains exactly 1,902 tags combining categories and features according to the specifications of the JOS project.
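The way a tag such as Somei packs its features can be illustrated with a minimal Python sketch. It assumes that each letter position carries one feature, in the order given by the example above. The lookup tables and the function name decode_tag are hypothetical and cover only the letters of that single example, not the full JOS tagset.

```python
# Hypothetical, partial lookup tables: they cover only the letters of the
# example tag "Somei" from the description, not the real JOS tagset.
WORD_CLASS = {"S": "samostalnik (noun)"}

# Assumed feature order for nouns, taken from the example:
# type, gender, number, case.
NOUN_FEATURES = [
    ("type", {"o": "občno ime (common noun)"}),
    ("gender", {"m": "moški spol (masculine gender)"}),
    ("number", {"e": "ednina (singular)"}),
    ("case", {"i": "imenovalnik (nominative)"}),
]


def decode_tag(tag):
    """Expand a positional tag into a dict of readable feature values."""
    if not tag or tag[0] not in WORD_CLASS:
        raise ValueError(f"unsupported tag: {tag!r}")
    decoded = {"word class": WORD_CLASS[tag[0]]}
    # Each remaining letter is looked up in the table for its position.
    for letter, (name, values) in zip(tag[1:], NOUN_FEATURES):
        decoded[name] = values.get(letter, f"unknown ({letter})")
    return decoded


if __name__ == "__main__":
    for feature, value in decode_tag("Somei").items():
        print(f"{feature}: {value}")
```

Letters outside the example are reported as unknown rather than guessed, since their meaning is defined only in the JOS specifications.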

Extended metadata

Download metadata (CMDI XML) http://hdl.handle.net/11495/D8A2-CFB1-49F7-1

Go to resource page http://hdl.handle.net/11495/DB26-0437-026E-4

dc:type	corpus
dc:title	Training corpus ssj500kv1.2
dc:identifier	oai:clarino.uib.no:slv-ssj500k-dep
dc:description	The ssj500k training corpus is based on two training corpora built within the JOS project. It contains the entire jos100k corpus and an additional 400,000 words from the million-word jos1M corpus. To build a training corpus, the text, a sequence of characters (letters, numbers, spaces, symbols, etc.), must first be divided into meaningful units such as paragraphs, sentences, words and punctuation. These steps are called segmentation (sentence identification) and tokenization (identification of tokens, i.e. words and punctuation). Two further types of information are then attributed to each word: a base form, or lemma (jagodam, jagodami -> jagoda), and a morphosyntactic tag. The tag is formed as an acronym that encodes the word class and related morphosyntactic features, for example Somei = samostalnik (noun), občno ime (common noun), moški spol (masculine gender), ednina (singular), imenovalnik (nominative). The ssj500k corpus uses the JOS tagset, which contains exactly 1,902 tags combining categories and features according to the specifications of the JOS project.
dc:publisher
dc:format	accessibleThroughInterface
dc:date
dc:rights	Public
dc:rights	Creative Commons (CC)
dc:rights	Creative_Commons-BY-NC-SA (CC-BY-NC-SA)
dc:rights	https://creativecommons.org/licenses/by-nc-sa/4.0/
dc:lang	Slovenian
