Saturday, 26 January 2008

The DGT Multilingual Translation Memory

In November 2007, the European Commission's Directorate-General for Translation (DGT) made its multilingual Translation Memory for the Acquis Communautaire (the body of EU law) publicly accessible. It is a collection of parallel texts (texts and their translations, also referred to as bi-texts) in 22 languages. This page is for technical users: it gives a summary of this unique resource and instructions on where to download it and how to produce bilingual aligned corpora for any of the 231 language pair combinations (462 language pair directions). For an example of one sentence translated into all 22 languages, click here.

Note that - if you are a non-technical user - you may be more interested in our freely accessible news analysis applications, which you can find at http://emm.jrc.it/overview.html.

The release of this linguistic resource follows the public release - in May 2006 - of the JRC-Acquis multilingual parallel corpus with sentence alignment for 231 language pairs. Version 3.0 of the JRC-Acquis, which now also contains Bulgarian as a 22nd language and comprises a total of over 1 billion words, was made available in April 2007. The data releases of DGT and JRC are in line with the general effort of the European Commission to support multilingualism, language diversity and the re-use of Commission information.

The Acquis Communautaire is the entire body of European legislation, including all the treaties, regulations and directives adopted by the European Union (EU) and the rulings of the European Court of Justice (see the Wikipedia entry). Since each new country joining the EU is required to accept the whole Acquis Communautaire, this body of legislation is translated into 22 official languages. As a result, the Acquis now exists as parallel texts in the following 22 languages: Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish and Swedish. The Acquis is not translated on a regular basis into Irish, the 23rd official EU language.
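The figure of 231 language pairs follows from counting unordered pairs among 22 languages (22 × 21 / 2); counting each translation direction separately doubles it to 462. A minimal Python check, using the language names listed in this paragraph:

```python
from itertools import combinations, permutations

LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]

# Unordered pairs: (English, French) is the same pair as (French, English).
pairs = list(combinations(LANGUAGES, 2))

# Ordered pairs: English->French and French->English are counted separately.
directions = list(permutations(LANGUAGES, 2))

print(len(LANGUAGES), len(pairs), len(directions))
```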

A translation memory is a collection of small text segments and their translations. These segments can be sentences or parts of sentences. Translation memories support translators by ensuring that pieces of text that have already been translated do not need to be translated again.
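As a rough illustration of the idea, a translation memory can be modelled as a lookup table from source segments to stored translations. The sketch below is an assumption made for illustration only: the class name, the normalisation rule and the sample segments are all hypothetical, and it only handles exact matches, whereas real translation memory tools typically also offer fuzzy matching.

```python
class TranslationMemory:
    """Toy exact-match translation memory (illustrative only)."""

    def __init__(self):
        self._segments = {}

    @staticmethod
    def _normalise(segment):
        # Collapse whitespace and ignore case so trivial differences still match.
        return " ".join(segment.split()).lower()

    def add(self, source, target):
        # Store (or overwrite) the translation for a source segment.
        self._segments[self._normalise(source)] = target

    def lookup(self, source):
        # Return the stored translation, or None if the segment is new.
        return self._segments.get(self._normalise(source))


tm = TranslationMemory()
# Hypothetical sample segment pair, not taken from the DGT data.
tm.add("This Regulation shall enter into force.",
       "Le présent règlement entre en vigueur.")

print(tm.lookup("this regulation  shall enter into force."))
print(tm.lookup("A segment that was never translated."))
```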

Both translation memories and parallel texts are important linguistic resources that can be used for a variety of purposes, including:

  • training automatic systems for Statistical Machine Translation (SMT);
  • producing monolingual or multilingual lexical and semantic resources such as dictionaries and ontologies;
  • training and testing multilingual information extraction software;
  • checking translation consistency automatically;
  • testing and benchmarking alignment software (for sentences, words, etc.).

Generally speaking, parallel corpora are useful for all types of cross-lingual research. The value of a parallel corpus grows with its size and with the number of languages for which translations exist. While parallel corpora are abundant for some language pairs, there are few or none for most others. To our knowledge, the Acquis Communautaire is the biggest parallel corpus in existence, if we take into consideration both its size and the large number of languages involved. The most outstanding advantage of the Acquis Communautaire - apart from being freely available - is the number of rare language pair combinations (e.g. Maltese-Estonian, Slovene-Finnish, etc.).

The distribution consists of 12 zip files (Volume_1.zip, ... Volume_12.zip), each approximately 100 MB in size. Each zip file contains dozens of TMX files, identified by the EUR-Lex number of the underlying Acquis documents, and a plain-text file list specifying the languages in which the documents are available.
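For readers who want to inspect the data directly, the following Python sketch shows one way bilingual segment pairs could be read from the TMX files inside a volume. It is not the official extraction program and makes several assumptions: that the files follow the usual TMX layout (`tu` units containing `tuv` variants with a `seg` child), that the language is given in a `lang` or `xml:lang` attribute, and that language codes begin with a two-letter prefix such as `EN` or `PT`. The sample document and the function names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical miniature TMX document used only for the demonstration below.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4"><header/><body>
<tu><tuv lang="EN-GB"><seg>Article 1</seg></tuv>
<tuv lang="PT-PT"><seg>Artigo 1</seg></tuv></tu>
<tu><tuv lang="EN-GB"><seg>Only in English</seg></tuv></tu>
</body></tmx>"""


def tuv_language(tuv):
    # Accept either xml:lang or a plain lang attribute (assumed variants).
    return (tuv.get(XML_LANG) or tuv.get("lang") or "").upper()


def extract_pairs(tmx_bytes, src_prefix, tgt_prefix):
    """Yield (source, target) segment pairs from one TMX document."""
    root = ET.fromstring(tmx_bytes)
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segments[tuv_language(tuv)] = "".join(seg.itertext()).strip()
        src = next((t for c, t in segments.items() if c.startswith(src_prefix)), None)
        tgt = next((t for c, t in segments.items() if c.startswith(tgt_prefix)), None)
        # Skip units that lack one of the two requested languages.
        if src and tgt:
            yield src, tgt


def pairs_from_volume(zip_file, src_prefix, tgt_prefix):
    """Yield pairs from every .tmx member of a zip (path or file-like object)."""
    with zipfile.ZipFile(zip_file) as volume:
        for name in volume.namelist():
            if name.lower().endswith(".tmx"):
                yield from extract_pairs(volume.read(name), src_prefix, tgt_prefix)


# Demonstration on an in-memory zip, so no download is needed to try it.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as volume:
    volume.writestr("sample.tmx", SAMPLE)
buffer.seek(0)
demo_pairs = list(pairs_from_volume(buffer, "EN", "PT"))
print(demo_pairs)
```

With a downloaded volume, the same function could be called with the path to the zip file, for example `pairs_from_volume("Volume_1.zip", "EN", "PT")`, provided the assumptions above hold for the actual files.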

Get it!

Part 1
Part 2
Part 3
Part 4
Part 5
Part 6
Part 7
Part 8
Part 9
Part 10
Part 11
Part 12

You also need to download the extraction program and copy its files into the same directory as the zip files containing the data. The program consists of two files:
Program File
Library DLL

Enjoy it!

