Measuring Lexical Distance between Parallel Corpora: The Case of AI-Generated News Translation
| DC Field | Value | Language |
| --- | --- | --- |
| dc.contributor.author | Conway, Kyle | |
| dc.contributor.author | Gramaccia, Julie Alice | |
| dc.contributor.author | Scholz, Nikita | |
| dc.contributor.author | Averbeck, Téana | |
| dc.date.accessioned | 2025-11-28T19:23:09Z | |
| dc.date.available | 2025-11-28T19:23:09Z | |
| dc.date.issued | 2025-11-25 | |
| dc.description | This is an Accepted Manuscript of an article published by Taylor & Francis in Perspectives: Studies in Translation Theory and Practice, on 25 Nov 2025, available at: https://doi.org/10.1080/0907676X.2025.2590066. | en |
| dc.description.abstract | Since the University of Warwick's news translation project in the mid-2000s, it has been a truism that journalists rarely translate whole articles but instead compose stories using texts in other languages as one source among others. However, the development of AI-based machine translation has brought about a shift in journalistic practices. Increasingly, multilingual news agencies are using these tools to produce similar stories in multiple languages. One consequence has been that researchers can now compile parallel corpora of translated stories. This article proposes a method to characterize such corpora by measuring the distance between source and target texts, a method it applies to stories published in English and French on the website SwissInfo.ch. It describes the mechanics of corpus-building, article vectorization, and the creation of a lexical substitution list that makes measurement possible. It then proposes three measures -- Euclidean, Jaccard, and cosine -- which have complementary strengths and weaknesses. The value of these measurement tools is heuristic: they make it possible to identify patterns that can be investigated using other methods more familiar to news translation researchers, such as interviews or direct observation. | |
| dc.description.sponsorship | This project was undertaken thanks to funding from IVADO and the Canada First Research Excellence Fund. | |
| dc.identifier.citation | Conway, K., Gramaccia, J.A., Scholz, N., & Averbeck, T. (2025). Measuring lexical distance between parallel corpora: The case of AI-generated news translation. Perspectives: Studies in Translation Theory and Practice. https://doi.org/10.1080/0907676X.2025.2590066. | en |
| dc.identifier.doi | 10.1080/0907676X.2025.2590066 | |
| dc.identifier.issn | 2643-7791 | |
| dc.identifier.uri | https://doi.org/10.1080/0907676X.2025.2590066 | |
| dc.identifier.uri | http://hdl.handle.net/10393/51108 | |
| dc.language.iso | en | |
| dc.relation.uri | https://doi.org/10.5683/SP3/MFZTWZ | en |
| dc.rights | Attribution-NonCommercial-ShareAlike 4.0 International | en |
| dc.rights.uri | http://creativecommons.org/licenses/by-nc-sa/4.0/ | |
| dc.subject | Artificial intelligence | |
| dc.subject | Corpus-based translation studies | |
| dc.subject | Large language models | |
| dc.subject | Lexicometry | |
| dc.subject | Machine translation | |
| dc.subject | Methodology | |
| dc.title | Measuring Lexical Distance between Parallel Corpora: The Case of AI-Generated News Translation | |
| dc.type | Article |
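
The abstract names three distance measures (Euclidean, Jaccard, and cosine) applied to vectorized articles after a lexical substitution step. The following is only a minimal sketch of how such measures are commonly computed on bag-of-words vectors; it is not taken from the article. The tokenizer, the toy `SUBSTITUTIONS` mapping, and all function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy substitution list (French token -> English token).
# The article's actual list is not reproduced in this record.
SUBSTITUTIONS = {"le": "the", "chat": "cat", "dort": "sleeps"}


def vectorize(text, substitutions=None):
    """Return a bag-of-words count vector for a text.

    Tokens are lowercased word characters; if a substitution mapping is
    given, each token is replaced by its mapped form when one exists.
    """
    tokens = re.findall(r"\w+", text.lower())
    if substitutions:
        tokens = [substitutions.get(t, t) for t in tokens]
    return Counter(tokens)


def euclidean_distance(a, b):
    """Straight-line distance between two count vectors."""
    vocab = set(a) | set(b)
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in vocab))


def jaccard_distance(a, b):
    """One minus the share of distinct words the two texts have in common."""
    words_a, words_b = set(a), set(b)
    union = words_a | words_b
    if not union:
        return 0.0  # two empty texts are treated as identical
    return 1.0 - len(words_a & words_b) / len(union)


def cosine_distance(a, b):
    """One minus the cosine of the angle between two count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: an empty text is maximally distant
    # Clamp to guard against tiny negative values from float rounding.
    return max(0.0, 1.0 - dot / (norm_a * norm_b))


if __name__ == "__main__":
    source = vectorize("The cat sleeps")
    target = vectorize("Le chat dort", SUBSTITUTIONS)
    print(euclidean_distance(source, target))
    print(jaccard_distance(source, target))
    print(cosine_distance(source, target))
```

In this sketch the three measures react to different things: Euclidean distance grows with raw count differences and so with text length, Jaccard distance looks only at which words are present, and cosine distance compares the relative proportions of words. That is one way to read the abstract's remark that the measures have complementary strengths and weaknesses.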
