
Measuring Lexical Distance between Parallel Corpora: The Case of AI-Generated News Translation

dc.contributor.author: Conway, Kyle
dc.contributor.author: Gramaccia, Julie Alice
dc.contributor.author: Scholz, Nikita
dc.contributor.author: Averbeck, Téana
dc.date.accessioned: 2025-11-28T19:23:09Z
dc.date.available: 2025-11-28T19:23:09Z
dc.date.issued: 2025-11-25
dc.description: This is an Accepted Manuscript of an article published by Taylor & Francis in Perspectives: Studies in Translation Theory and Practice, on 25 Nov 2025, available at: https://doi.org/10.1080/0907676X.2025.2590066. [en]
dc.description.abstract: Since the University of Warwick's news translation project in the mid-2000s, it has been a truism that journalists rarely translate whole articles but instead compose stories using texts in other languages as one source among others. However, the development of AI-based machine translation has brought about a shift in journalistic practices. Increasingly, multilingual news agencies are using these tools to produce similar stories in multiple languages. One consequence has been that researchers can now compile parallel corpora of translated stories. This article proposes a method to characterize such corpora by measuring the distance between source and target texts, a method it applies to stories published in English and French on the website SwissInfo.ch. It describes the mechanics of corpus-building, article vectorization, and the creation of a lexical substitution list that makes measurement possible. It then proposes three measures -- Euclidean, Jaccard, and cosine -- which have complementary strengths and weaknesses. The value of these measurement tools is heuristic: they make it possible to identify patterns that can be investigated using other methods more familiar to news translation researchers, such as interviews or direct observation.
dc.description.sponsorship: This project was undertaken thanks to funding from IVADO and the Canada First Research Excellence Fund.
dc.identifier.citation: Conway, K., Gramaccia, J.A., Scholz, N., & Averbeck, T. (2025). Measuring lexical distance between parallel corpora: The case of AI-generated news translation. Perspectives: Studies in Translation Theory and Practice. https://doi.org/10.1080/0907676X.2025.2590066. [en]
dc.identifier.doi: 10.1080/0907676X.2025.2590066
dc.identifier.issn: 2643-7791
dc.identifier.uri: https://doi.org/10.1080/0907676X.2025.2590066
dc.identifier.uri: http://hdl.handle.net/10393/51108
dc.language.iso: en
dc.relation.uri: https://doi.org/10.5683/SP3/MFZTWZ [en]
dc.rights: Attribution-NonCommercial-ShareAlike 4.0 International [en]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.subject: Artificial intelligence
dc.subject: Corpus-based translation studies
dc.subject: Large language models
dc.subject: Lexicometry
dc.subject: Machine translation
dc.subject: Methodology
dc.title: Measuring Lexical Distance between Parallel Corpora: The Case of AI-Generated News Translation
dc.type: Article
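
Illustrative note on the abstract: the three measures it names (Euclidean, Jaccard, and cosine) can be sketched over simple word-count vectors. The sketch below is not the authors' implementation. The whitespace tokenizer, the tiny SUBSTITUTIONS mapping, and all function names are assumptions made for illustration only; the article's own vectorization and substitution list may differ substantially.

```python
import math
from collections import Counter

# Hypothetical substitution list (an assumption, not the article's data):
# maps French tokens onto English equivalents so both texts share a vocabulary.
SUBSTITUTIONS = {
    "gouvernement": "government",
    "suisse": "swiss",
    "loi": "law",
}


def vectorize(text, substitutions=None):
    """Turn a text into a bag-of-words count vector (naive whitespace tokens)."""
    tokens = text.lower().split()
    if substitutions is not None:
        tokens = [substitutions.get(token, token) for token in tokens]
    return Counter(tokens)


def euclidean_distance(a, b):
    """Straight-line distance between count vectors; sensitive to text length."""
    vocabulary = set(a) | set(b)
    return math.sqrt(sum((a[term] - b[term]) ** 2 for term in vocabulary))


def jaccard_distance(a, b):
    """One minus shared vocabulary over total vocabulary; ignores frequencies."""
    terms_a, terms_b = set(a), set(b)
    union = terms_a | terms_b
    if not union:
        return 0.0
    return 1.0 - len(terms_a & terms_b) / len(union)


def cosine_distance(a, b):
    """One minus the cosine of the angle between count vectors."""
    dot = sum(a[term] * b[term] for term in set(a) & set(b))
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here for empty texts
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    english = vectorize("swiss government vote law")
    french = vectorize("gouvernement suisse vote loi", SUBSTITUTIONS)
    print(euclidean_distance(english, french))
    print(jaccard_distance(english, french))
    print(cosine_distance(english, french))
```

In this sketch the Euclidean measure grows with differences in text length, the Jaccard measure looks only at which words appear, and the cosine measure compares the relative proportions of words, which is one way to read the abstract's remark that the measures have complementary strengths and weaknesses.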
