Parallel Corpus of documents from the Technical Regulations Information System for German-Spanish

CLARINO UiB - Corpuscle

License: Creative_Commons-BY-NC-SA (CC-BY-NC-SA)

Updated: 2016-03-17

The TRIS corpus (Parallel Corpus of documents from the Technical Regulations Information System for German-Spanish) is a specialized parallel corpus of German-Spanish texts (DE-AT and DE-DE; ES-ES) from the European Commission, dated between 1997 and 2010. The texts are technical regulations in a variety of domains. This third and final version is sentence aligned and is available in TMX and TEI formats. The TMX files are sentence aligned, while the TEI-encoded files hold the sentence alignment information in stand-off annotation. Every sentence includes information about its domain, its year, the file it belongs to, and its sentence number. The corpus contains files from three domains, written in Austria and translated into European Spanish:

- B00: Construction (205 files; 70,648 sentences; 1,563,000 words; time frame: 1999-2010)

- C00A: Agriculture, Fishing and Foodstuffs (12 files; 4,879 sentences; 137,354 words; time frame: 1999-2001)

- H00: Domestic and Leisure Equipment (12 files; 1,229 sentences; 58,328 words; time frame: 2005-2010)

Additionally, the corpus has been part-of-speech (POS) tagged with the TreeTagger POS tagger, and the POS-tagged files are also available.

TRIS version 0.3 is the final version. It subsumes versions 0.1 and 0.2 and corrects some errors that were present in those first two versions. TRIS v0.3 is encoded in TEI P5 and includes files from two domains not included in versions 0.1 and 0.2: C00A (Agriculture, Fishing and Foodstuffs), which is currently under alignment, and H00 (Domestic and Leisure Equipment), which includes all files available in the database up to 2010.
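Because the TMX files are sentence aligned, aligned German-Spanish pairs can in principle be read with a standard XML parser. The following is a minimal sketch in Python, not a tool shipped with the corpus. It assumes the files follow the usual TMX layout (`tu` units containing `tuv` variants with an `xml:lang` attribute and a `seg` child). The function name `read_tmx_pairs` and the sample content are invented for illustration, and the exact language codes and any extra per-sentence properties in the TRIS files should be checked against the actual data.

```python
import io
import xml.etree.ElementTree as ET

# Fully qualified name of the xml:lang attribute as ElementTree reports it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_prefix="de", tgt_prefix="es"):
    """Yield (source_text, target_text) pairs from a TMX file or file-like object.

    Language matching is by prefix, so "de" matches both DE-AT and DE-DE.
    """
    tree = ET.parse(source)
    for tu in tree.getroot().iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            # TMX 1.4 uses xml:lang; some older files use a plain "lang" attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                segments[lang] = "".join(seg.itertext()).strip()
        src = next((t for l, t in segments.items() if l.startswith(src_prefix)), None)
        tgt = next((t for l, t in segments.items() if l.startswith(tgt_prefix)), None)
        if src is not None and tgt is not None:
            yield src, tgt


# Invented sample for demonstration only; it is not taken from the corpus.
SAMPLE = """<tmx version="1.4"><header/><body>
<tu>
  <tuv xml:lang="DE-AT"><seg>Dies ist ein Test.</seg></tuv>
  <tuv xml:lang="ES-ES"><seg>Esto es una prueba.</seg></tuv>
</tu>
</body></tmx>"""

pairs = list(read_tmx_pairs(io.StringIO(SAMPLE)))
for de, es in pairs:
    print(de, "|", es)
```

To read a downloaded file instead of the in-memory sample, pass its path to `read_tmx_pairs`. The TEI files would need a different approach, since their alignment is stored in stand-off annotation rather than inline.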

