Uncovering Machine Translationese Using Corpus Analysis Techniques to Distinguish between Original and Machine-Translated French

HAL (Le Centre pour la Communication Scientifique Directe) (2021)

Abstract
This paper investigates the linguistic characteristics of English-to-French machine-translated texts in comparison with French original, untranslated texts in order to uncover what has been called “machine translationese”. In the same vein as corpus-based translation studies which have focused on human-translated texts, and using a corpus-based statistical approach (Principal Component Analysis), we analyzed a ca. 1.8-million-word corpus of English-to-French translations of press texts, corresponding to the output of four machine translation systems: one statistical (SMT) and three neural (NMT) systems, namely DeepL, Google Translate, and the European Commission’s eTranslation MT tool, in both its SMT and NMT versions. In particular, to complement a previous study on language-specific features in French (e.g. derived adverbs, existential constructions, coordinator et, preposition avec), a series of language-independent linguistic features were extracted for each text in our corpus, ranging from superficial text characteristics such as average word and sentence length to frequencies of closed-class lexical categories and measures of lexical diversity. Our results, which compare the machine-translated data with a corpus of French untranslated data, allow us to uncover linguistic features in French machine-translated texts that clearly deviate from the observed norms in original French (e.g. average sentence length, n-gram features, lexical diversity), and which might serve as information for the post-editing process in order to optimize translation quality.
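To make the idea of language-independent surface features concrete, the sketch below computes three of the kinds of measures the abstract mentions (average word length, average sentence length, and a simple lexical diversity score) for a single text. It is only an illustration under stated assumptions: the function name `text_features`, the regex-based sentence and word segmentation, and the use of the type-token ratio as the diversity measure are all hypothetical choices, not the feature definitions or tooling used in the paper. Feature vectors of this kind, one per text, would then be the input to a Principal Component Analysis, which is not shown here.

```python
import re


def text_features(text):
    """Return a few surface features for one text (illustrative only)."""
    # Assumption: sentences end with ., ! or ? followed by whitespace.
    # Real corpora would need a proper sentence segmenter.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Assumption: a word is a run of Unicode word characters (letters,
    # digits, underscore), so "l'été" counts as two tokens.
    words = re.findall(r"\w+", text.lower())
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    return {
        # Mean number of characters per word token.
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Mean number of word tokens per sentence.
        "avg_sentence_length": len(words) / len(sentences),
        # Type-token ratio: distinct word forms divided by total tokens.
        "type_token_ratio": len(set(words)) / len(words),
    }


features = text_features("Le chat dort. Le chien mange.")
```

Note that the type-token ratio is sensitive to text length, so comparing texts of very different sizes with it would likely require a length-normalized diversity measure instead.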
Keywords
machine translationese, corpus analysis, corpus analysis techniques