Uncovering Machine Translationese Using Corpus Analysis Techniques to Distinguish between Original and Machine-Translated French
HAL (Le Centre pour la Communication Scientifique Directe) (2021)
Abstract
This paper investigates the linguistic characteristics of English-to-French machine-translated texts in comparison with original, untranslated French texts in order to uncover what has been called “machine translationese”. In the same vein as corpus-based translation studies, which have focused on human-translated texts, and using a corpus-based statistical approach (Principal Component Analysis), we analyzed a ca. 1.8-million-word corpus of English-to-French translations of press texts, corresponding to the output of four machine translation systems: one statistical (SMT) and three neural (NMT) systems, namely DeepL, Google Translate, and the European Commission’s eTranslation MT tool, in both its SMT and NMT versions. In particular, to complement a previous study on language-specific features in French (e.g. derived adverbs, existential constructions, coordinator et, preposition avec), a series of language-independent linguistic features were extracted for each text in our corpus, ranging from superficial text characteristics such as average word and sentence length to frequencies of closed-class lexical categories and measures of lexical diversity. Our results, which compare the machine-translated data with a corpus of untranslated French data, allow us to uncover linguistic features in French machine-translated texts that clearly deviate from the observed norms in original French (e.g. average sentence length, n-gram features, lexical diversity), and which might serve as information for the post-editing process in order to optimize translation quality.
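As an illustration of the kind of language-independent surface features mentioned in the abstract (average word length, average sentence length, lexical diversity), the following is a minimal sketch. The function name `surface_features`, the regex-based tokenization, and the use of a plain type-token ratio as the diversity measure are assumptions made for this example; the abstract does not specify the tools or exact measures used in the study.

```python
import re

def surface_features(text):
    """Return a few language-independent surface features for one text.

    Hypothetical helper for illustration only; not the study's pipeline.
    """
    # Naive sentence split on ., ! and ? (an assumption for this sketch).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive word tokenization: runs of letters/digits, optionally joined
    # by an internal apostrophe or hyphen (e.g. "aujourd'hui").
    words = re.findall(r"[^\W_]+(?:['’-][^\W_]+)*", text.lower())
    if not words or not sentences:
        return {"avg_word_len": 0.0, "avg_sent_len": 0.0, "ttr": 0.0}
    return {
        # Mean number of characters per word token.
        "avg_word_len": sum(len(w) for w in words) / len(words),
        # Mean number of word tokens per sentence.
        "avg_sent_len": len(words) / len(sentences),
        # Type-token ratio: distinct word forms divided by total tokens.
        "ttr": len(set(words)) / len(words),
    }

features = surface_features("Le chat dort. Le chien mange.")
print(features)
```

Computed for every text in a corpus, feature vectors of this kind could then be stacked into a matrix and passed to a Principal Component Analysis, the statistical approach the abstract names. Note that a plain type-token ratio is sensitive to text length, so comparisons are only meaningful across texts of similar size.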
Keywords
machine translationese, corpus analysis, corpus analysis techniques