PolicyBench

A public benchmark for natural-language policy compilation quality. Measures how reliably tools and frontier LLMs convert plain-English policy statements into OPA Rego that compiles, parses, and behaves correctly against fixtures.

7 entries are evaluated on the v1.0 corpus (147 natural-language statements). Entry-level fixture pass rates range from 25.2% to 90.5% (median 46.9%). Median per-call latency ranges from 1.93 s to 29.89 s.

Leaderboard

All entries are evaluated against the same fixture corpus by the same harness. LLMs are run at Tier 3 (full context: schema + worked examples); purpose-built tools are run via their own API. New entries from any provider are welcome; see "Submit your tool" for the process.

Compilation tools

Purpose-built natural-language → policy compilers.

Entry	Compile rate	Entry-level pass	Fixture-level pass	Median latency
PolicyAsLanguage	99.3%	90.5%	96.1%	1.93 s

Frontier LLMs

The strongest models each provider currently ships.

Entry	Compile rate	Entry-level pass	Fixture-level pass	Median latency
Gemini 3 Pro	100.0%	57.8%	82.8%	18.26 s
Claude Opus 4.7	100.0%	46.9%	78.1%	3.26 s
GPT-5	83.0%	36.7%	76.5%	29.89 s

Cost-tier LLMs

Smaller / cheaper variants of the frontier models, same Tier 3 prompt.

Entry	Compile rate	Entry-level pass	Fixture-level pass	Median latency
Gemini 3 Flash	100.0%	54.4%	80.5%	11.75 s
Claude Sonnet 4.6	81.6%	36.1%	78.2%	5.38 s
GPT-5 mini	58.5%	25.2%	75.9%	19.21 s

See methodology for definitions of each metric and the source repository for raw verdict data.
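
The summary figures at the top of the page can be rechecked from the leaderboard rows. The following is a minimal Python sketch that uses only the entry-level pass rates and median latencies listed in the tables above; the variable names are illustrative and are not part of the PolicyBench harness.

```python
from statistics import median

# (entry-level pass rate in %, median latency in s), copied from the leaderboard tables.
LEADERBOARD = {
    "PolicyAsLanguage": (90.5, 1.93),
    "Gemini 3 Pro": (57.8, 18.26),
    "Claude Opus 4.7": (46.9, 3.26),
    "GPT-5": (36.7, 29.89),
    "Gemini 3 Flash": (54.4, 11.75),
    "Claude Sonnet 4.6": (36.1, 5.38),
    "GPT-5 mini": (25.2, 19.21),
}

pass_rates = [rate for rate, _ in LEADERBOARD.values()]
latencies = [latency for _, latency in LEADERBOARD.values()]

summary = {
    "entries": len(LEADERBOARD),
    "pass_min": min(pass_rates),
    "pass_max": max(pass_rates),
    "pass_median": median(pass_rates),  # middle value of the 7 sorted rates
    "latency_min": min(latencies),
    "latency_max": max(latencies),
}
print(summary)
```

With seven entries the median is simply the fourth value once the rates are sorted, which is why it coincides with a single leaderboard row.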

What PolicyBench measures

Every entry in the corpus is a natural-language policy statement (e.g. "Reject pods that run as root", "S3 buckets must have versioning enabled", "Only admins can delete production resources") paired with hand-crafted fixtures that exercise its intent.

For each tool, the harness compiles the natural-language statement into Rego, runs opa parse on the output, runs opa eval against every fixture, and records both what the tool self-reported and what the harness independently verified. Divergence between the two is itself a measurement.
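
The verification steps described above might look roughly like the following Python sketch. It is not the harness's actual code: the function names, the `Verdict` record, and the idea that each fixture carries a single expected value are assumptions made for illustration. The `opa parse` and `opa eval` invocations use standard OPA CLI flags (`--data`, `--input`, `--format json`).

```python
import json
import subprocess
from dataclasses import dataclass
from typing import Any, List


def opa_parse_cmd(policy_path: str) -> List[str]:
    """Command that checks whether a Rego file parses."""
    return ["opa", "parse", policy_path]


def opa_eval_cmd(policy_path: str, input_path: str, query: str) -> List[str]:
    """Command that evaluates `query` against one fixture input."""
    return ["opa", "eval", "--format", "json",
            "--data", policy_path, "--input", input_path, query]


@dataclass
class Verdict:
    """Hypothetical record pairing the tool's claim with the harness's finding."""
    self_reported_ok: bool
    verified_ok: bool

    @property
    def diverges(self) -> bool:
        return self.self_reported_ok != self.verified_ok


def first_value(eval_stdout: str) -> Any:
    """Extract the first expression value from `opa eval --format json` output."""
    doc = json.loads(eval_stdout)
    results = doc.get("result", [])
    if not results:
        return None  # the query was undefined for this input
    return results[0]["expressions"][0]["value"]


def run_fixture(policy_path: str, input_path: str, query: str, expected: Any) -> bool:
    """Parse, then evaluate one fixture. Requires the `opa` binary on PATH."""
    parsed = subprocess.run(opa_parse_cmd(policy_path), capture_output=True)
    if parsed.returncode != 0:
        return False
    proc = subprocess.run(opa_eval_cmd(policy_path, input_path, query),
                          capture_output=True, text=True)
    return proc.returncode == 0 and first_value(proc.stdout) == expected
```

A `Verdict(self_reported_ok=True, verified_ok=False)` is the divergent case: the tool claimed success but the independent check disagreed.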

All compile checks and fixture evaluations use the same OPA binary, the same extraction logic, and the same package/rule discovery, so any tool that emits valid Rego with a recognisable decision rule is evaluated on equal footing.
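
To make "package/rule discovery" concrete, here is a small Python sketch of one way such discovery could work. It is an assumption, not the harness's real logic: the list of recognised rule names (`allow`, `deny`, `violation`) and the regex-based approach are both hypothetical.

```python
import re
from typing import Optional

# Matches a Rego package declaration such as `package k8s.security`.
PACKAGE_RE = re.compile(r"^\s*package\s+([A-Za-z_][\w.]*)", re.MULTILINE)

# Hypothetical set of rule names treated as decision rules, in priority order.
DECISION_RULES = ("allow", "deny", "violation")


def discover_query(rego_source: str) -> Optional[str]:
    """Return an OPA query like `data.<package>.<rule>`, or None if not found."""
    pkg = PACKAGE_RE.search(rego_source)
    if pkg is None:
        return None
    for name in DECISION_RULES:
        # Rule heads such as `allow if {`, `deny[msg] {`, or `default allow := false`.
        head = re.compile(rf"^\s*(?:default\s+)?{name}\b", re.MULTILINE)
        if head.search(rego_source):
            return f"data.{pkg.group(1)}.{name}"
    return None


example = """package k8s.security

deny contains msg if {
    input.spec.securityContext.runAsUser == 0
    msg := "pods must not run as root"
}
"""
print(discover_query(example))  # data.k8s.security.deny
```

Applying the same discovery function to every entry's output is what keeps the comparison even: a tool is not penalised for choosing a different package name, only for emitting no recognisable decision rule at all.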

Reproducibility

The repository at github.com/jbeaven/policybench contains the full evaluator, every runner script, the prompts used for each tier, and the published verdicts. Anyone can clone the repo, populate .env for one runner, and reproduce that runner's numbers.

Test fixtures are private in v1.0 (see methodology). The harness itself is open, and reviewers can supply their own fixtures via $PB_FIXTURES_PATH.
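
As an illustration of how a runner might pick up reviewer-supplied fixtures, the Python sketch below resolves $PB_FIXTURES_PATH and lists the files under it. Only the variable name comes from this page; the `fixture_files` helper and the assumption that fixtures are stored as `.json` files are hypothetical, so consult the methodology and repository for the actual layout.

```python
import os
from pathlib import Path
from typing import List, Mapping


def fixture_files(env: Mapping[str, str] = os.environ) -> List[Path]:
    """List JSON files under $PB_FIXTURES_PATH (the file layout is an assumption)."""
    root = env.get("PB_FIXTURES_PATH")
    if not root:
        raise RuntimeError("PB_FIXTURES_PATH is not set")
    root_path = Path(root)
    if not root_path.is_dir():
        raise RuntimeError(f"PB_FIXTURES_PATH is not a directory: {root}")
    # Sort so that every run evaluates fixtures in the same order.
    return sorted(root_path.rglob("*.json"))
```

Failing loudly when the variable is unset avoids silently evaluating against an empty fixture set, which would otherwise produce misleading pass rates.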