Verifiable Natural Language to Linear Temporal Logic Translation: A Benchmark Dataset and Evaluation Suite

William English1, Chase Walker1, Dominic Simon1, Sumit Jha2, Rickard Ewetz1
1University of Florida   2Florida International University

The NL-to-LTL translation pipeline evaluated by VLTL-Bench. Our benchmark provides ground truth annotations at each stage—lifting, grounding, translation, and verification—enabling isolated evaluation of every pipeline component.

Abstract

Current benchmarks for natural language (NL) to temporal logic (TL) translation report near-perfect accuracy on logic translation alone but neglect grounding atomic propositions in new environments, a step essential for actual verification. We introduce VLTL-Bench, a comprehensive benchmark that measures both the verification performance and the verifiability of automated translation systems.

VLTL-Bench provides a dataset consisting of four unique state spaces and thousands of diverse natural language specifications with corresponding formal temporal logic expressions. Crucially, the benchmark includes intermediate evaluation points with ground truths for lifting, grounding, translation, and verification substeps, as well as sample traces for validating temporal logic expressions.

Our empirical evaluation reveals that state-of-the-art frameworks achieving near-perfect lifted translation accuracy suffer dramatic performance drops when grounding is required, exposing a critical gap between reported benchmark results and real-world deployment capabilities. These findings demonstrate that reliable deployment requires domain-general systems capable of handling extensible grounding scenarios, not just translating logical structure.

Key Contributions

Unified Benchmark

First benchmark supporting evaluation of all NL-to-TL pipeline components with ground truth annotations at every stage: lifting, grounding, translation, and verification.

Verification Evaluation

Novel verification evaluation methodology using satisfying and violating execution traces, enabling assessment of whether translated formulas are actually usable for runtime verification.
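Trace-based verification of this kind can be illustrated with a small finite-trace LTL checker. The formula encoding, proposition names, and traces below are invented for exposition and are not the benchmark's actual format:

```python
# A minimal finite-trace LTL checker: formulas are nested tuples, and a
# trace is a list of sets of atomic propositions true at each time step.
# (Hypothetical representation -- not VLTL-Bench's actual data format.)

def holds(formula, trace, i=0):
    """Evaluate an LTL formula at position i of a finite trace."""
    op = formula[0]
    if op == "ap":                       # atomic proposition
        return formula[1] in trace[i]
    if op == "not":
        return not holds(formula[1], trace, i)
    if op == "and":
        return holds(formula[1], trace, i) and holds(formula[2], trace, i)
    if op == "F":                        # eventually
        return any(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == "G":                        # always
        return all(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == "U":                        # until
        return any(
            holds(formula[2], trace, j)
            and all(holds(formula[1], trace, k) for k in range(i, j))
            for j in range(i, len(trace))
        )
    raise ValueError(f"unknown operator: {op}")

# "Avoid the obstacle until the goal is reached": (!obstacle) U goal
spec = ("U", ("not", ("ap", "obstacle")), ("ap", "goal"))
satisfying = [set(), {"scan"}, {"goal"}]
violating = [set(), {"obstacle"}, {"goal"}]
print(holds(spec, satisfying))   # True
print(holds(spec, violating))    # False
```

A translated formula that accepts the satisfying traces and rejects the violating ones is evidence that it is actually usable for runtime verification, which is precisely what this evaluation stage measures.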

Exposing Failure Modes

Empirical study revealing critical failure modes in current methods—particularly in grounding tasks—where reported near-perfect accuracy drops dramatically in end-to-end evaluation.

Evaluation Pipeline

VLTL-Bench evaluates four complementary stages of the NL-to-LTL translation pipeline.

1 Lifting
2 Translation
3 Grounding
4 Verification
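The lifting, translation, and grounding stages can be sketched end to end on a single specification. Every name, pattern, and the grounding map below is an illustrative assumption, not the benchmark's API; a lookup table stands in for an LLM-based translator:

```python
import re

def lift(nl):
    """Lifting: replace concrete phrases with abstract placeholders."""
    lifted = re.sub(r"the red valve", "prop_1", nl)
    return re.sub(r"the alarm", "prop_2", lifted)

def translate(lifted_nl):
    """Translation: map a lifted sentence to a lifted LTL formula.
    (A toy lookup table stands in for a learned translator.)"""
    table = {"always close prop_1 before prop_2 sounds": "G (prop_2 -> prop_1)"}
    return table[lifted_nl]

def ground(lifted_ltl, env):
    """Grounding: bind placeholders to the environment's atomic propositions."""
    out = lifted_ltl
    for placeholder, ap in env.items():
        out = out.replace(placeholder, ap)
    return out

env = {"prop_1": "valve_red_closed", "prop_2": "alarm_on"}
nl = "always close the red valve before the alarm sounds"
grounded = ground(translate(lift(nl)), env)
print(grounded)   # G (alarm_on -> valve_red_closed)
```

Verification, the fourth stage, then checks the grounded formula against satisfying and violating execution traces. The sketch makes the failure mode concrete: a perfect translator is useless if `env` is bound incorrectly.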
Comparison of datasets by supported pipeline stages (lifting, translation, grounding, verification): Cleanup World, GLTL, Navi, Conformal, and VLTL-Bench (Ours). VLTL-Bench is the only dataset providing ground truth for all four stages.

Dataset Overview

32K+ specifications · 4 state spaces · 25K unique LTL formulas · 4 evaluation stages

Data synthesis pipeline: expert-crafted templates are instantiated with scenario-specific atomic propositions to generate diverse NL-LTL pairs with full annotation.
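Template-based synthesis of this kind can be sketched in a few lines. The template strings and the scenario's atomic propositions below are invented for illustration and are not drawn from the benchmark:

```python
import itertools

# Expert-crafted (here: toy) templates pair an NL pattern with an LTL pattern.
templates = [
    ("eventually {a}", "F {a_ap}"),
    ("never {a} until {b}", "(! {a_ap}) U {b_ap}"),
]

# Scenario-specific vocabulary: NL phrase -> grounded atomic proposition.
scenario = {
    "pick up the mug": "holding_mug",
    "reach the sink": "at_sink",
}

# Instantiate every template with every ordered pair of phrases,
# yielding aligned NL-LTL pairs with grounding known by construction.
pairs = []
for (nl_t, ltl_t), (a, b) in itertools.product(
        templates, itertools.permutations(scenario, 2)):
    pairs.append((
        nl_t.format(a=a, b=b),
        ltl_t.format(a_ap=scenario[a], b_ap=scenario[b]),
    ))

for nl, ltl in pairs:
    print(f"{nl!r} -> {ltl}")
```

Because each pair is generated from a template plus a known grounding map, the lifted form, the grounding, and the final formula are all available as ground truth annotations, which is what enables the per-stage evaluation.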

🍳 Kitchen Assistant: 10,000 specifications · 376 unique words · 7,385 unique LTL formulas · challenge: ambiguous action/argument references

🚦 Traffic Light: 7,319 specifications · 217 unique words · 6,196 unique LTL formulas · challenge: large argument library, few actions

🛩 Search & Rescue: 7,304 specifications · 245 unique words · 5,425 unique LTL formulas · challenge: multi-step temporal dependencies

📦 Warehouse: 7,457 specifications · 1,028 unique words · 5,991 unique LTL formulas · challenge: high semantic variability (80 COCO objects)

Benchmark Results


Comprehensive results across all evaluation stages. Systems achieving near-perfect lifted translation accuracy exhibit dramatic performance drops on grounded end-to-end evaluation.

Key Findings

Lifted translation is “solved”: NL2TL achieves 99.9–100% accuracy on lifted translation when given ground-truth atomic propositions, masking critical downstream failures.

Grounding is the bottleneck: few-shot grounding accuracy ranges from only 56.9–82.3% per atomic proposition, even with scenario-informed prompting. This is the primary failure mode.

End-to-end collapse: NL2TL drops to 54.4–60.1% and Lang2LTL to 37.9–72.1% on end-to-end grounded translation, far below real-world requirements.

Verification gap: systems that perform well on lifted translation see 20–30 percentage point declines in verification accuracy when using grounded formulas.

BibTeX

@article{english2025vltlbench,
  title={Verifiable Natural Language to Linear Temporal Logic Translation: A Benchmark Dataset and Evaluation Suite},
  author={English, William and Walker, Chase and Simon, Dominic and Jha, Sumit and Ewetz, Rickard},
  journal={arXiv preprint arXiv:2507.00877},
  year={2025}
}