Non-freely distributed corpora
Presentation: This double CD-ROM contains extensive spoken and written corpora in more than 21 languages of Western, Central and Eastern Europe, including Lithuanian, Polish, Hungarian and Slovene. The corpora are available in plain text and SGML encoding, and have been successfully aligned. Also available are various tools, including a corpus query language, a concordancer, alignment tools, software, POS taggers, lexica in 6 languages and samples of research work involving the data.
Multilingual Corpus I (ECI/MCI). Distributed by ELSNET. Over 98 million words, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, Malay and more.
The EMILLE/CIIL Corpus consists of monolingual corpora containing approximately 92,799,000 words for 14 South Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu), including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu, and a parallel corpus of 200,000 words in English with translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. Annotations include Urdu monolingual and parallel corpora annotated for parts of speech, and 20 written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode.
The EMILLE Lancaster Corpus consists of monolingual corpora containing approximately 58,880,000 words for seven South Asian languages (Bengali, Gujarati, Hindi, Punjabi, Sinhala, Tamil and Urdu), including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu, and a parallel corpus of 200,000 words in English with translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. Annotations include Urdu monolingual and parallel corpora annotated for parts of speech, and 20 written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode.
Official Journal of the European Community (JOC). Annotation: parts of speech. Languages: English, French, German, Italian, Spanish.
Corpus available at UCREL (University Centre for Computer Corpus Research on Language, Lancaster).
Official EC documents on telecommunications. Aligned (1,250,000 words for each language). Languages: English, French. Annotation: parts of speech, lemmatization.
Aligned corpus: Multilingual Corpora and Contrastive Linguistics. Languages: English/French, English/German. Corpus of German and English translations. The corpus is not available for copyright reasons.