Verifiable Natural Language to Linear Temporal Logic Translation: A Benchmark Dataset and Evaluation Suite
Abstract
Current benchmarks for natural language (NL) to temporal logic (TL) translation report near-perfect accuracy on logic translation alone but neglect grounding atomic propositions to new environments—a step essential for actual verification. We introduce VLTL-Bench, a comprehensive benchmark that measures both the correctness and the verifiability of automated NL-to-TL translation systems.
VLTL-Bench provides a dataset consisting of four unique state spaces and thousands of diverse natural language specifications with corresponding formal temporal logic expressions. Crucially, the benchmark includes intermediate evaluation points with ground truths for lifting, grounding, translation, and verification substeps, as well as sample traces for validating temporal logic expressions.
Our empirical evaluation reveals that state-of-the-art frameworks that achieve near-perfect lifted translation accuracy suffer dramatic performance drops when grounding is required, exposing a critical gap between reported benchmark results and real-world deployment capabilities. These findings demonstrate that comprehensive evaluation requires domain-general systems capable of handling extensible grounding scenarios, not just translating logical structure.
Key Contributions
Unified Benchmark
First benchmark supporting evaluation of all NL-to-TL pipeline components with ground truth annotations at every stage: lifting, grounding, translation, and verification.
Verification Evaluation
Novel verification evaluation methodology using satisfying and violating execution traces, enabling assessment of whether translated formulas are actually usable for runtime verification.
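The trace-based evaluation idea can be made concrete with a small finite-trace LTL checker: a translated formula is only usable for runtime verification if it accepts the satisfying traces and rejects the violating ones. This is an illustrative sketch; the formula AST, trace encoding, and propositions below are hypothetical, not VLTL-Bench's actual schema.

```python
def holds(formula, trace, i=0):
    """Evaluate an LTL formula over a finite trace (a list of sets of
    atomic propositions), starting at position i, under finite-trace
    semantics (X is false at the final position)."""
    op = formula[0]
    if op == "ap":                       # atomic proposition
        return formula[1] in trace[i]
    if op == "not":
        return not holds(formula[1], trace, i)
    if op == "and":
        return holds(formula[1], trace, i) and holds(formula[2], trace, i)
    if op == "or":
        return holds(formula[1], trace, i) or holds(formula[2], trace, i)
    if op == "implies":
        return (not holds(formula[1], trace, i)) or holds(formula[2], trace, i)
    if op == "X":                        # next
        return i + 1 < len(trace) and holds(formula[1], trace, i + 1)
    if op == "G":                        # globally: every remaining step
        return all(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == "F":                        # finally: some remaining step
        return any(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == "U":                        # until
        return any(holds(formula[2], trace, j) and
                   all(holds(formula[1], trace, k) for k in range(i, j))
                   for j in range(i, len(trace)))
    raise ValueError(f"unknown operator: {op}")

# G(door_open -> F door_closed): every opening is eventually followed by a close.
spec = ("G", ("implies", ("ap", "door_open"), ("F", ("ap", "door_closed"))))
satisfying = [{"door_open"}, set(), {"door_closed"}]
violating  = [{"door_open"}, set(), set()]
print(holds(spec, satisfying))  # True
print(holds(spec, violating))   # False
```

A formula that is syntactically valid but grounded to the wrong propositions will misclassify such traces, which is exactly the failure this evaluation stage is designed to surface.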
Exposing Failure Modes
Empirical study revealing critical failure modes in current methods—particularly in grounding tasks—where reported near-perfect accuracy drops dramatically in end-to-end evaluation.
Evaluation Pipeline
VLTL-Bench evaluates four complementary stages of the NL-to-LTL translation pipeline.
| Dataset | Lifting | Translation | Grounding | Verification |
|---|---|---|---|---|
| Cleanup World | | | | |
| GLTL | | | | |
| Navi | | | | |
| Conformal | | | | |
| VLTL-Bench (Ours) | ✓ | ✓ | ✓ | ✓ |
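The lifting, translation, and grounding stages can be sketched as a toy pipeline. The function names, phrase table, and placeholder scheme below are illustrative assumptions, not the benchmark's actual API; in a real system the translation step would be an LLM or seq2seq model rather than a hard-coded formula.

```python
def lift(nl, phrase_to_symbol):
    """Lifting: abstract concrete phrases into placeholder symbols."""
    for phrase, sym in phrase_to_symbol.items():
        nl = nl.replace(phrase, sym)
    return nl

def ground(lifted_formula, symbol_to_ap):
    """Grounding: bind placeholders to the target environment's
    atomic propositions."""
    for sym, ap in symbol_to_ap.items():
        lifted_formula = lifted_formula.replace(sym, ap)
    return lifted_formula

# Worked example in the style of a kitchen-assistant specification.
phrases = {"the oven is on": "A", "the door is closed": "B"}
nl = "always if the oven is on then eventually the door is closed"
lifted_nl = lift(nl, phrases)            # "always if A then eventually B"

# Translation (stubbed here) maps lifted NL to a lifted LTL formula:
lifted_ltl = "G (A -> F B)"

# Grounding binds placeholders to the environment's atomic propositions:
grounded = ground(lifted_ltl, {"A": "oven_on", "B": "door_closed"})
print(grounded)  # G (oven_on -> F door_closed)
```

The benchmark's key point is that each arrow in this pipeline gets its own ground truth, so a system can be scored on lifting, translation, and grounding separately rather than only end to end.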
Dataset Overview
Data synthesis pipeline: expert-crafted templates are instantiated with scenario-specific atomic propositions to generate diverse NL-LTL pairs with full annotation.
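The template-instantiation idea can be sketched as follows: a single expert-crafted NL/LTL template pair, filled with scenario-specific atomic propositions, yields many aligned NL-LTL examples. The template text and propositions here are illustrative, not drawn from the dataset.

```python
import itertools

# One expert-crafted template pair with aligned placeholders.
template_nl  = "whenever {a} happens, {b} must eventually follow"
template_ltl = "G ({a} -> F {b})"

# Scenario-specific atomic propositions (traffic-light style scenario).
props = ["light_red", "car_stopped", "pedestrian_crossing"]

# Instantiate the template with every ordered pair of distinct propositions.
pairs = [(template_nl.format(a=a, b=b), template_ltl.format(a=a, b=b))
         for a, b in itertools.permutations(props, 2)]

print(len(pairs))   # 6 NL-LTL pairs from a single template
print(pairs[0][1])  # G (light_red -> F car_stopped)
```

Because the template fixes the logical structure, every generated pair comes with its lifted form, grounding map, and formula annotation for free.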
- **Kitchen Assistant:** ambiguous action and argument references
- **Traffic Light:** large argument library, few actions
- **Search & Rescue:** multi-step temporal dependencies
- **Warehouse:** high semantic variability (80 COCO object classes)
Benchmark Results
Comprehensive results across all evaluation stages. Systems achieving near-perfect lifted translation accuracy exhibit dramatic performance drops on grounded end-to-end evaluation.
Key Findings
| Finding | Details |
|---|---|
| Lifted translation is “solved” | NL2TL achieves 99.9–100% accuracy on lifted translation when given ground-truth atomic propositions, masking critical downstream failures. |
| Grounding is the bottleneck | Few-shot grounding accuracy ranges from only 56.9–82.3% per atomic proposition, even with scenario-informed prompting. This is the primary failure mode. |
| End-to-end collapse | NL2TL drops to 54.4–60.1% and Lang2LTL to 37.9–72.1% on end-to-end grounded translation, far below real-world requirements. |
| Verification gap | Systems that perform well on lifted translation see 20–30 percentage point declines in verification accuracy when using grounded formulas. |
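One way to see why modest per-proposition grounding accuracy collapses end-to-end accuracy: if per-proposition grounding errors were independent (an assumption made here for illustration only, not a claim from the benchmark), a formula with k atomic propositions would be fully grounded correctly with probability p^k.

```python
# Back-of-the-envelope decay of formula-level grounding accuracy,
# assuming independent per-proposition errors (illustrative only).
for p in (0.569, 0.823):        # reported per-AP grounding accuracy range
    for k in (1, 2, 3):         # atomic propositions per formula
        print(f"p={p:.3f}, k={k}: formula-level accuracy ~= {p**k:.3f}")
```

Even at the top of the reported range (p = 0.823), a three-proposition formula grounds fully correctly only a little over half the time, which is consistent with the end-to-end collapse observed above.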
BibTeX
@article{english2025vltlbench,
title={Verifiable Natural Language to Linear Temporal Logic Translation: A Benchmark Dataset and Evaluation Suite},
author={English, William and Walker, Chase and Simon, Dominic and Jha, Sumit and Ewetz, Rickard},
journal={arXiv preprint arXiv:2507.00877},
year={2025}
}