PolicyBench
A public benchmark for natural-language policy compilation quality. Measures how reliably tools and frontier LLMs convert plain-English policy statements into OPA Rego that compiles, parses, and behaves correctly against fixtures.
Seven entries (one purpose-built compilation tool and six LLMs) are evaluated on the v1.0 corpus of 147 natural-language statements. Entry-level fixture pass rates range from 25.2% to 90.5% (median 46.9%), and median per-call latency ranges from 1.93 s to 29.89 s.
Leaderboard
All entries are evaluated against the same fixture corpus by the same harness. LLMs are run at Tier 3 (full context: schema + worked examples); purpose-built tools are run via their own API. New entries from any provider are welcome; see Submit your tool for the process.
Compilation tools
| Entry | Compile rate | Entry-level pass | Fixture-level pass | Median latency |
|---|---|---|---|---|
| PolicyAsLanguage | 99.3% | 90.5% | 96.1% | 1.93 s |
Frontier LLMs
| Entry | Compile rate | Entry-level pass | Fixture-level pass | Median latency |
|---|---|---|---|---|
| Gemini 3 Pro | 100.0% | 57.8% | 82.8% | 18.26 s |
| Claude Opus 4.7 | 100.0% | 46.9% | 78.1% | 3.26 s |
| GPT-5 | 83.0% | 36.7% | 76.5% | 29.89 s |
Cost-tier LLMs
| Entry | Compile rate | Entry-level pass | Fixture-level pass | Median latency |
|---|---|---|---|---|
| Gemini 3 Flash | 100.0% | 54.4% | 80.5% | 11.75 s |
| Claude Sonnet 4.6 | 81.6% | 36.1% | 78.2% | 5.38 s |
| GPT-5 mini | 58.5% | 25.2% | 75.9% | 19.21 s |
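The headline figures can be re-derived from the three tables above. The snippet below is a small sanity check, not part of the harness; the row data is copied from the leaderboard, and the names `ROWS` and `summarize` are illustrative only.

```python
from statistics import median

# Leaderboard rows copied from the tables above:
# (entry, entry-level pass in %, median latency in seconds)
ROWS = [
    ("PolicyAsLanguage", 90.5, 1.93),
    ("Gemini 3 Pro", 57.8, 18.26),
    ("Claude Opus 4.7", 46.9, 3.26),
    ("GPT-5", 36.7, 29.89),
    ("Gemini 3 Flash", 54.4, 11.75),
    ("Claude Sonnet 4.6", 36.1, 5.38),
    ("GPT-5 mini", 25.2, 19.21),
]


def summarize(rows):
    """Return the range and median of the pass rates and the latency range."""
    passes = [pass_rate for _, pass_rate, _ in rows]
    latencies = [seconds for _, _, seconds in rows]
    return {
        "entries": len(rows),
        "pass_min": min(passes),
        "pass_max": max(passes),
        "pass_median": median(passes),
        "latency_min": min(latencies),
        "latency_max": max(latencies),
    }


summary = summarize(ROWS)
for key, value in summary.items():
    print(f"{key}: {value}")
```

With an odd number of entries, the median is simply the middle row once the pass rates are sorted.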
What PolicyBench measures
Every entry in the corpus is a natural-language policy statement (e.g. "Reject pods that run as root", "S3 buckets must have versioning enabled", "Only admins can delete production resources") paired with hand-crafted fixtures that exercise its intent.
For each tool, the harness has the tool compile the natural-language statement into Rego, runs opa parse on the output, runs opa eval against every fixture, and records both what the tool self-reported and what the harness independently verified. Divergence between the two is itself a measurement.
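To make the scoring concrete, here is a minimal sketch of how per-statement results could be rolled up. It rests on assumptions that this page does not spell out: that a statement counts toward the entry-level pass rate only when all of its fixtures pass, that the fixture-level rate is passed fixtures over all fixtures run, and that "divergence" means the tool's self-report disagrees with the verified outcome. The `StatementResult` type and the toy data are hypothetical and are not real PolicyBench results.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StatementResult:
    statement_id: str
    self_reported_ok: bool       # what the tool claimed about its own output
    fixture_results: List[bool]  # what the harness observed, one flag per fixture

    @property
    def verified_ok(self) -> bool:
        # Assumption: a statement passes only if it has fixtures and all pass.
        return bool(self.fixture_results) and all(self.fixture_results)


def score(results):
    """Return (entry-level rate, fixture-level rate, divergent statement ids)."""
    total_fixtures = sum(len(r.fixture_results) for r in results)
    passed_fixtures = sum(sum(r.fixture_results) for r in results)
    entry_rate = sum(r.verified_ok for r in results) / len(results)
    fixture_rate = passed_fixtures / total_fixtures if total_fixtures else 0.0
    divergent = [
        r.statement_id for r in results if r.self_reported_ok != r.verified_ok
    ]
    return entry_rate, fixture_rate, divergent


# Toy data for illustration only.
TOY_RESULTS = [
    StatementResult("s1", True, [True, True, True]),
    StatementResult("s2", True, [True, False]),   # tool said OK, harness disagrees
    StatementResult("s3", False, [False, False]),
    StatementResult("s4", True, [True]),
]

entry_rate, fixture_rate, divergent = score(TOY_RESULTS)
print(f"entry-level pass:   {entry_rate:.1%}")
print(f"fixture-level pass: {fixture_rate:.1%}")
print(f"divergent:          {divergent}")
```

Under these assumptions a single failing fixture sinks the whole statement, which would explain why entry-level rates on the leaderboard sit well below fixture-level rates.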
All compile checks and fixture evaluations use the same OPA binary, the same extraction logic, and the same package/rule discovery, so any tool that emits valid Rego with a recognisable decision rule is evaluated on equal footing.
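As a rough illustration of what a shared check pipeline might look like, the sketch below discovers the package and decision rule with regular expressions and builds the `opa parse` and `opa eval` command lines. It is not the harness's actual extraction logic: the candidate rule names (`allow`, `deny`, `violation`), the regexes, and every function name are assumptions made for this example. Only the `opa parse <file>` subcommand and the `opa eval` flags `--data`, `--input`, and `--format` are taken from the OPA CLI.

```python
import json
import re
import subprocess

# Assumption: simple dotted package names and a fixed set of candidate rule names.
PACKAGE_RE = re.compile(r"^\s*package\s+([A-Za-z_][\w.]*)", re.MULTILINE)
RULE_RE = re.compile(r"^\s*(?:default\s+)?(allow|deny|violation)\b", re.MULTILINE)

SAMPLE_REGO = """package kubernetes.admission

import rego.v1

deny contains msg if {
    some container in input.request.object.spec.containers
    container.securityContext.runAsUser == 0
    msg := "containers must not run as root"
}
"""


def discover_decision_rule(rego_source):
    """Return (package, rule) or raise ValueError if either is missing."""
    package = PACKAGE_RE.search(rego_source)
    rule = RULE_RE.search(rego_source)
    if package is None or rule is None:
        raise ValueError("no package or recognisable decision rule found")
    return package.group(1), rule.group(1)


def parse_command(policy_path, opa_bin="opa"):
    return [opa_bin, "parse", policy_path]


def eval_command(policy_path, fixture_path, package, rule, opa_bin="opa"):
    query = f"data.{package}.{rule}"
    return [opa_bin, "eval", "--format", "json",
            "--data", policy_path, "--input", fixture_path, query]


def run_fixture(policy_path, fixture_path, package, rule, opa_bin="opa"):
    """Parse, then evaluate one fixture. Requires an OPA binary on PATH."""
    parsed = subprocess.run(parse_command(policy_path, opa_bin),
                            capture_output=True, text=True)
    if parsed.returncode != 0:
        return {"parsed": False, "result": None}
    evaluated = subprocess.run(
        eval_command(policy_path, fixture_path, package, rule, opa_bin),
        capture_output=True, text=True)
    if evaluated.returncode != 0:
        return {"parsed": True, "result": None}
    return {"parsed": True, "result": json.loads(evaluated.stdout)}


package, rule = discover_decision_rule(SAMPLE_REGO)
print(" ".join(eval_command("policy.rego", "fixture.json", package, rule)))
```

Routing every entry through the same command builders and the same binary is one way to keep results comparable across tools; `run_fixture` is defined but not called here, since it needs OPA installed.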
Reproducibility
The repository at github.com/jbeaven/policybench contains the full evaluator, every runner script, the prompts used for each tier, and the published verdicts. Anyone can clone the repo, populate .env for one runner, and reproduce that runner's numbers.
Test fixtures are private in v1.0 (see methodology). The harness itself is open, and reviewers can supply their own fixtures via $PB_FIXTURES_PATH.
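For reviewers preparing their own fixtures, the sketch below shows one way a script might resolve `PB_FIXTURES_PATH` and load what it finds. The on-disk format is an assumption: this example treats the variable as a directory containing one JSON file per fixture, and the `input`/`expect` keys are hypothetical. Check the repository for the format the harness actually expects.

```python
import json
import os
import tempfile
from pathlib import Path


def resolve_fixture_dir(env=None):
    """Return the fixture directory named by PB_FIXTURES_PATH."""
    env = os.environ if env is None else env
    raw = env.get("PB_FIXTURES_PATH")
    if not raw:
        raise RuntimeError("PB_FIXTURES_PATH is not set")
    path = Path(raw).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"{path} is not a directory")
    return path


def load_fixtures(fixture_dir):
    """Load every *.json file in the directory, keyed by file stem (assumed layout)."""
    return {
        file.stem: json.loads(file.read_text(encoding="utf-8"))
        for file in sorted(fixture_dir.glob("*.json"))
    }


# Demonstration with a throwaway directory and one hypothetical fixture.
with tempfile.TemporaryDirectory() as tmp:
    fixture = {"input": {"securityContext": {"runAsUser": 0}}, "expect": "deny"}
    Path(tmp, "no_root_pods.json").write_text(json.dumps(fixture), encoding="utf-8")

    fixture_dir = resolve_fixture_dir({"PB_FIXTURES_PATH": tmp})
    loaded = load_fixtures(fixture_dir)

print(sorted(loaded))
```

Passing the environment in as a mapping keeps the lookup easy to exercise without touching the real process environment.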