BackTranslation2.0: A Linguistically Motivated Metric to Assess Sign Language Production

Overview of the BackTranslation2.0 (BT2) pipeline. Given a source sentence and generated sign output, multi-modal extraction produces a structured sample for segment- and sequence-level analysis. Phase 1 base tools extract lexical, spatial, phonological, motion, and visual evidence stored in a shared memory trace. Phase 2 comparison tools cross-reference this evidence against linguistic expectations to produce grounded judgements. Deterministic aggregation yields per-dimension scores and an overall understandability score, and a final LLM stage writes a grounded natural-language assessment.

Abstract

Sign Languages (SLs) are the primary means of communication for millions of Deaf individuals, yet existing evaluation metrics for generated SL remain simplistic and poorly aligned with human judgements. We introduce BackTranslation2.0, a linguistically grounded evaluation metric for text-to-sign translation that moves beyond naïve backtranslation. Our approach adopts an agentic framework in which a deterministic pipeline orchestrates a suite of specialised tools to assess four scoring dimensions — grammatical correctness, phonological accuracy, motion fluency, and generation fidelity — aligned with human rater assessments.

Tool outputs are not treated independently: a set of LLM-based cross-referential comparison modules evaluates consistency across tools and checks outputs against linguistic expectations, enabling structured reasoning over grammatical, phonological, and motion-level evidence. Final dimension scores are computed through deterministic weighted formulas over validated tool outputs. To validate BackTranslation2.0, we introduce and evaluate on a British Sign Language (BSL) dataset annotated by native Deaf raters across the same quality dimensions, benchmarking against six baseline metrics. Our method demonstrates strong correlation with human judgements across all dimensions, providing a more comprehensive, interpretable, and linguistically principled evaluation framework for sign language production systems.

Contributions

1. We present BackTranslation2.0, a comprehensive linguistically motivated metric for evaluating text-to-sign translation that is superior to naïve backtranslation.

2. We propose an agentic evaluation framework that integrates specialised tools with cross-referential LLM-based reasoning for structured, interpretable assessment.

3. We introduce a specialised BSL evaluation dataset annotated by native Deaf raters across multiple quality dimensions.

4. We benchmark BackTranslation2.0 against six baseline metrics, demonstrating strong correlation with human judgement across all dimensions and providing a more interpretable and linguistically grounded evaluation paradigm for sign language production systems.

Method

BackTranslation2.0 (BT2) is a deterministic agentic framework for linguistically grounded evaluation of text-to-sign translation. Rather than relying on a single end-to-end metric, BT2 runs a fixed two-phase pipeline of specialised tools that assess grammatical correctness, phonological accuracy, motion fluency, and generation fidelity.

Phase 1

Base Evidence Extraction

Multi-modal feature extraction produces a structured sample at segment and sequence level. Specialised tools extract lexical, spatial, phonological, motion, and visual evidence, stored in a shared Tool-Output Memory.

Tools: Lexical Structure, Spatial Grammar, Phonological Form, Motion Fluency Scorer, Visual Quality Scorer
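
To make the idea of a shared Tool-Output Memory concrete, the following is a minimal sketch of an append-only trace that Phase 1 tools could write to and later stages could read from. The class names, field names, tool identifiers, and example values are all hypothetical; the page does not specify the actual schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToolOutput:
    """One piece of evidence written by a Phase 1 tool (hypothetical schema)."""
    tool: str                # e.g. "phonological_form"
    segment: int             # segment index; -1 marks sequence-level evidence
    payload: Dict[str, Any]  # tool-specific evidence


@dataclass
class ToolOutputMemory:
    """Append-only trace shared by all tools (hypothetical structure)."""
    entries: List[ToolOutput] = field(default_factory=list)

    def write(self, tool: str, segment: int, payload: Dict[str, Any]) -> None:
        # Append a new record; existing records are never modified.
        self.entries.append(ToolOutput(tool, segment, payload))

    def read(self, tool: str) -> List[ToolOutput]:
        # Return every record written by the named tool, in write order.
        return [e for e in self.entries if e.tool == tool]


# Illustrative values only; they are not taken from the paper.
memory = ToolOutputMemory()
memory.write("lexical_structure", 0, {"gloss": "BOOK"})
memory.write("phonological_form", 0, {"handshape": "flat"})
memory.write("motion_fluency_scorer", -1, {"smoothness": 0.8})
```

Keeping the trace append-only is one simple way to let later stages audit exactly what each tool reported.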

Phase 2

Cross-Reference Comparison

LLM-based modules evaluate consistency across Phase 1 tool outputs and check them against linguistic expectations, enabling structured reasoning over grammatical, phonological, and motion-level evidence. An audit trail supports deterministic scoring and grounded reasoning.

Tools: Spot-Gloss Comparison, Manual Features Comparison, Non-Manual Features Comparison, Directionality Comparison
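
The cross-referencing step can be sketched as a comparison function that checks observed evidence against an expectation and logs its judgement to an audit trail. In BT2 the comparison modules are LLM-based; the sketch below swaps in a simple rule-based stand-in so that it runs without a model. The `Judgement` type, the `Judge` signature, and the gloss lists are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Judgement:
    module: str     # which comparison produced this judgement
    score: float    # agreement in [0, 1]
    rationale: str  # explanation kept in the audit trail


# A judge maps (expected, observed) evidence to a Judgement. In BT2 this role
# is played by an LLM-based module; the signature here is an assumption.
Judge = Callable[[List[str], List[str]], Judgement]


def overlap_judge(expected: List[str], observed: List[str]) -> Judgement:
    """Rule-based stand-in: fraction of expected glosses that were spotted."""
    if not expected:
        return Judgement("spot_gloss", 1.0, "no glosses expected")
    missing = [g for g in expected if g not in observed]
    score = 1.0 - len(missing) / len(expected)
    return Judgement("spot_gloss", score, f"missing: {missing}")


def run_comparison(judge: Judge, expected: List[str], observed: List[str],
                   audit_trail: List[Judgement]) -> Judgement:
    judgement = judge(expected, observed)
    audit_trail.append(judgement)  # every judgement is logged for later review
    return judgement


# Illustrative glosses only; they are not taken from the dataset.
audit_trail: List[Judgement] = []
result = run_comparison(
    overlap_judge, ["BOOK", "GIVE", "YOU", "ME"], ["BOOK", "GIVE", "ME"], audit_trail
)
```

Passing the judge in as a callable keeps the orchestration fixed while allowing the comparison logic itself to be replaced, for example by a model-backed implementation.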

Output

Deterministic Dimension Scoring & Final Grounded Assessment

Final dimension scores are computed via deterministic weighted formulas over validated tool outputs. A final LLM stage receives the shared trace and deterministic scores to produce an auditable natural-language assessment.

Scores: Grammatical Correctness, Phonological Accuracy, Motion Fluency, Generation Fidelity, Overall Score
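
As an illustration of what a deterministic weighted formula over tool outputs might look like, the sketch below computes a dimension score as a normalised weighted mean, $s_d = \sum_i w_i x_i / \sum_i w_i$, and then combines the four dimensions into an overall score. The weights, the input values, and the equal-weight overall mean are assumptions; the page does not give the actual formulas.

```python
from typing import Dict


def weighted_score(values: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of inputs in [0, 1]; weights need not sum to 1."""
    total = sum(weights[name] for name in values)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * values[name] for name in values) / total


# Illustrative numbers only; neither inputs nor weights come from the paper.
grammar_inputs = {"spot_gloss": 0.75, "directionality": 0.5}
grammar_weights = {"spot_gloss": 1.0, "directionality": 1.0}
grammar = weighted_score(grammar_inputs, grammar_weights)

dimension_scores = {
    "grammatical_correctness": grammar,
    "phonological_accuracy": 0.5,
    "motion_fluency": 1.0,
    "generation_fidelity": 0.75,
}
# Assumption: the overall score is an equal-weight mean of the four dimensions.
overall = weighted_score(dimension_scores, {name: 1.0 for name in dimension_scores})
```

Because the function is pure arithmetic with no randomness or model calls, the same validated inputs always yield the same scores.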

Scoring Dimensions

BT2 evaluates four dimensions aligned with how native Deaf raters assess sign language quality.

📐

Grammatical Correctness

Adherence to target sign language syntax, including use of signing space, pronominal indexing, and directional verb agreement (directionality).

🤲

Phonological Accuracy

Correctness of sub-lexical components, including handshape, movement, location, orientation, and non-manual features.

🌊

Motion Fluency

Naturalness and temporal smoothness of signing motion, capturing the fluid, continuous articulation expected in native signing.

🎯

Generation Fidelity

Visual quality and anatomical plausibility of the generated output, ensuring the signer representation is realistic and interpretable.

BibTeX

@inproceedings{backtranslation2_2026,
  title     = {BackTranslation2.0: A Linguistically Motivated Metric
               to Assess Sign Language Production},
  author    = {Cory, Oliver and Ivashechkin, Maksym and Ranum, Oline and
               Low, Jianhe and Fish, Edward and Pelykh, Anton and
               Sahin, Karahan and Mercanoglu Sincan, Ozge and
               Bowden, Richard},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026},
}
