BackTranslation2.0
A Linguistically Motivated Metric to Assess
Sign Language Production

ECCV 2026
Oliver Cory¹, Maksym Ivashechkin¹, Oline Ranum¹, Jianhe Low¹, Edward Fish¹, Anton Pelykh¹, Karahan Sahin¹, Ozge Mercanoglu Sincan¹, Richard Bowden¹
¹CVSSP, University of Surrey, United Kingdom
[Figure: BackTranslation2.0 pipeline overview]

Overview of the BackTranslation2.0 (BT2) pipeline. Given a source sentence and generated sign output, multi-modal extraction produces a structured sample for segment- and sequence-level analysis. Phase 1 base tools extract lexical, spatial, phonological, motion, and visual evidence stored in a shared memory trace. Phase 2 comparison tools cross-reference this evidence against linguistic expectations to produce grounded judgements. Deterministic aggregation yields per-dimension scores and an overall understandability score, and a final LLM stage writes a grounded natural-language assessment.

Abstract

Sign Languages (SLs) are the primary means of communication for millions of Deaf individuals, yet existing evaluation metrics for generated SL remain simplistic and poorly aligned with human judgements. We introduce BackTranslation2.0, a linguistically grounded evaluation metric for text-to-sign translation that moves beyond naïve backtranslation. Our approach adopts an agentic framework in which a deterministic pipeline orchestrates a suite of specialised tools to assess four scoring dimensions — grammatical correctness, phonological accuracy, motion fluency, and generation fidelity — aligned with human rater assessments.

Tool outputs are not treated independently: a set of LLM-based cross-referential comparison modules evaluates consistency across tools and checks outputs against linguistic expectations, enabling structured reasoning over grammatical, phonological, and motion-level evidence. Final dimension scores are computed through deterministic weighted formulas over validated tool outputs. To validate BackTranslation2.0, we introduce and evaluate on a British Sign Language (BSL) dataset annotated by native Deaf raters across the same quality dimensions, benchmarking against six baseline metrics. Our method demonstrates strong correlation with human judgements across all dimensions, providing a more comprehensive, interpretable, and linguistically principled evaluation framework for sign language production systems.


Contributions

1. We present BackTranslation2.0, a comprehensive linguistically motivated metric for evaluating text-to-sign translation that is superior to naïve backtranslation.
2. We propose an agentic evaluation framework that integrates specialised tools with cross-referential LLM-based reasoning for structured, interpretable assessment.
3. We introduce a specialised BSL evaluation dataset annotated by native Deaf raters across multiple quality dimensions.
4. We benchmark BackTranslation2.0 against six baseline metrics, demonstrating strong correlation with human judgement across all dimensions and providing a more interpretable and linguistically grounded evaluation paradigm for sign language production systems.

Method

BackTranslation2.0 (BT2) is a deterministic agentic framework for linguistically grounded evaluation of text-to-sign translation. Rather than relying on a single end-to-end metric, BT2 runs a fixed two-phase pipeline of specialised tools that assess grammatical correctness, phonological accuracy, motion fluency, and generation fidelity.

Phase 1: Base Evidence Extraction

Multi-modal feature extraction produces a structured sample at segment and sequence level. Specialised tools extract lexical, spatial, phonological, motion, and visual evidence, stored in a shared Tool-Output Memory.

Lexical Structure · Spatial Grammar · Phonological Form · Motion Fluency Scorer · Visual Quality Scorer
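
The shared Tool-Output Memory described above can be pictured as an append-only trace that every Phase 1 tool writes to. The sketch below is only an illustration of that idea: the class names, field names, tool identifiers, and payload values are all hypothetical and are not taken from the BT2 implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ToolOutput:
    """One piece of evidence written by a Phase 1 tool (hypothetical schema)."""
    tool: str                            # e.g. "lexical_structure"
    level: str                           # "segment" or "sequence"
    payload: Dict[str, Any]              # tool-specific evidence
    segment_index: Optional[int] = None  # None for sequence-level evidence


@dataclass
class ToolOutputMemory:
    """Append-only trace shared by all tools (illustrative only)."""
    entries: List[ToolOutput] = field(default_factory=list)

    def write(self, entry: ToolOutput) -> None:
        # Entries are only ever appended, so the trace doubles as an audit log.
        self.entries.append(entry)

    def by_tool(self, tool: str) -> List[ToolOutput]:
        # Return every entry a given tool has written, in insertion order.
        return [e for e in self.entries if e.tool == tool]


# Placeholder values, chosen only to show the shape of the trace.
memory = ToolOutputMemory()
memory.write(ToolOutput("lexical_structure", "segment", {"gloss": "HELLO"}, segment_index=0))
memory.write(ToolOutput("motion_fluency", "sequence", {"smoothness": 0.9}))
print([e.payload for e in memory.by_tool("lexical_structure")])
```

Keeping the trace append-only means later stages can read any tool's evidence without being able to alter it, which is one simple way to support the auditability the pipeline aims for.
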
Phase 2: Cross-Reference Comparison

LLM-based modules evaluate consistency across Phase 1 tool outputs and check them against linguistic expectations, enabling structured reasoning over grammatical, phonological, and motion-level evidence. An audit trail supports deterministic scoring and grounded reasoning.

Spot-Gloss Comparison · Manual Features Comparison · Non-Manual Features Comparison · Directionality Comparison
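
As a rough sketch of how a comparison module might record a judgement together with the evidence it was based on, the example below wraps a pluggable judge function. It assumes, purely for illustration, that a spot-gloss comparison checks expected glosses against spotted glosses; the `Judgement` record, the function names, and the exact-match judge (a deterministic stand-in for an LLM call) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Judgement:
    """A comparison verdict plus the inputs it was based on (hypothetical)."""
    module: str
    verdict: str                     # "consistent" or "inconsistent"
    evidence: Dict[str, List[str]]   # kept so the verdict can be audited later


def spot_gloss_comparison(
    expected: List[str],
    spotted: List[str],
    judge: Callable[[List[str], List[str]], bool],
) -> Judgement:
    # The judge is injected so an LLM-backed function could be swapped in.
    ok = judge(expected, spotted)
    return Judgement(
        module="spot_gloss_comparison",
        verdict="consistent" if ok else "inconsistent",
        evidence={"expected": expected, "spotted": spotted},
    )


def exact_order_judge(expected: List[str], spotted: List[str]) -> bool:
    # Deterministic stand-in for an LLM call: same glosses in the same order.
    return expected == spotted


result = spot_gloss_comparison(["ME", "GO", "SHOP"], ["ME", "GO", "SHOP"], exact_order_judge)
print(result.verdict)
```

Returning the evidence alongside the verdict keeps each judgement traceable to the tool outputs that produced it.
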
Output: Deterministic Dimension Scoring & Final Grounded Assessment

Final dimension scores are computed via deterministic weighted formulas over validated tool outputs. A final LLM stage receives the shared trace and deterministic scores to produce an auditable natural-language assessment.

Grammatical Correctness · Phonological Accuracy · Motion Fluency · Generation Fidelity · Overall Score
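
A deterministic weighted formula of the kind described above could look like the following minimal sketch. The component names, weights, and the use of a normalised weighted mean are assumptions made for illustration; the page does not state the actual formulas or how the overall score is combined.

```python
from typing import Mapping


def weighted_score(components: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Normalised weighted mean of component scores (illustrative formula)."""
    total_weight = sum(weights[name] for name in components)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(components[name] * weights[name] for name in components) / total_weight


# Hypothetical component scores in [0, 1] and hypothetical weights.
grammar_components = {"spatial_grammar": 0.8, "directionality": 0.6}
grammar_weights = {"spatial_grammar": 2.0, "directionality": 1.0}
grammar = weighted_score(grammar_components, grammar_weights)

# Assumption: the overall score is an unweighted mean of the four dimensions.
dimensions = {
    "grammatical_correctness": grammar,
    "phonological_accuracy": 0.7,
    "motion_fluency": 0.9,
    "generation_fidelity": 0.5,
}
overall = sum(dimensions.values()) / len(dimensions)
print(round(grammar, 3), round(overall, 3))
```

Because the function involves no randomness or model calls, the same validated tool outputs always yield the same score.
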

Scoring Dimensions

BT2 evaluates four dimensions aligned with how native Deaf raters assess sign language quality.

Grammatical Correctness: Adherence to target sign language syntax, including use of signing space, pronominal indexing, and directional verb agreement (directionality).

Phonological Accuracy: Correctness of sub-lexical components, including handshape, movement, location, orientation, and non-manual features.

Motion Fluency: Naturalness and temporal smoothness of signing motion, capturing the fluid, continuous articulation expected in native signing.

Generation Fidelity: Visual quality and anatomical plausibility of the generated output, ensuring the signer representation is realistic and interpretable.
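
Alignment between a metric and human raters on a given dimension is commonly summarised with a rank correlation. The sketch below computes Spearman's coefficient in plain Python as one possible choice; the page does not say which correlation measure BT2 reports, and the scores and ratings shown are made-up placeholders.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    # Spearman's rho is the Pearson correlation of the ranks.
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder data: metric scores and human ratings for five samples.
metric_scores = [0.2, 0.5, 0.4, 0.9, 0.7]
human_ratings = [1, 3, 2, 5, 4]
print(spearman(metric_scores, human_ratings))
```
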

BibTeX

@inproceedings{backtranslation2_2026,
  title     = {BackTranslation2.0: A Linguistically Motivated Metric
               to Assess Sign Language Production},
  author    = {Cory, Oliver and Ivashechkin, Maksym and Ranum, Oline and
               Low, Jianhe and Fish, Edward and Pelykh, Anton and
               Sahin, Karahan and Mercanoglu Sincan, Ozge and
               Bowden, Richard},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026},
}