PolicyBench Methodology v1.0

Status: skeleton - sections will be filled in as v1.0 is built.

1. Purpose and scope

PolicyBench measures the quality of natural-language → OPA Rego compilation across LLMs and policy tools. It does not measure runtime enforcement, multi-policy composition, or compilation to non-Rego targets.

2. The corpus

The natural-language corpus lives in data/policybench-nl-v1.0.md. Categories: kubernetes_admission, iac_scanning, application_authz.

The v1.0 corpus draws from four public upstream rule sources - Checkov, Kyverno, Gatekeeper-library, and the CIS Kubernetes Benchmark - plus a small set of corpus-native starter specs. NL inputs are used verbatim from upstream where possible; PolicyBench does not rewrite descriptions to make them more compilable.

For the full provenance methodology - mining sources, NL-handling principles, fixture-authoring conventions, the per-policy-shape coverage matrix, and the categorisation rules - see docs/corpus-provenance.md.

3. The three tiers

Tier 1 - Bare NL (prompts/tier1-bare-nl.md): minimal context. Baseline LLM capability.
Tier 2 - With Context (prompts/tier2-with-context.md): NL plus category and runtime target. The realistic developer comparison.
Tier 3 - Full Context (prompts/tier3-full-context.md): NL plus schema and worked examples. Best-effort baseline.

Each prompt's hash (first 8 hex chars of SHA-256) is recorded in every result file, so results produced with different prompt versions cannot be silently conflated.
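A minimal sketch of how such a prompt hash could be computed. It assumes the hash is taken over the UTF-8 bytes of the prompt text exactly as stored, with no whitespace normalisation; the function name prompt_hash is hypothetical and not taken from the repo.

```python
import hashlib


def prompt_hash(prompt_text: str) -> str:
    """Return the first 8 hex characters of the SHA-256 digest of a prompt.

    Assumption: the prompt is hashed as UTF-8 bytes, unmodified.
    """
    digest = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()
    return digest[:8]
```

Any edit to a prompt file, even a single character, yields a different 8-character prefix with overwhelming probability, which is what keeps result files from different prompt versions apart.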

4. The runners

[List of participating tools/models, version pins - TBD.]

Each runner is a self-contained directory under runners/. It reads the corpus, applies one or more tier prompts, calls the model/service, and writes result-<tier>-<prompthash>.json files. See runners/README.md.
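The file-naming step can be sketched as follows. Only the result-<tier>-<prompthash>.json pattern comes from the text above; the tier token ("tier3"), the helper names, and the payload fields are illustrative assumptions, and runners/README.md remains the authority on the real result schema.

```python
import json
from pathlib import Path


def result_path(out_dir: Path, tier: str, prompt_hash: str) -> Path:
    """Build the result file path following result-<tier>-<prompthash>.json."""
    return out_dir / f"result-{tier}-{prompt_hash}.json"


def write_result(out_dir: Path, tier: str, prompt_hash: str, entries: list) -> Path:
    """Write one result file and return its path.

    The payload layout below is hypothetical, not the documented schema.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    path = result_path(out_dir, tier, prompt_hash)
    payload = {"tier": tier, "prompt_hash": prompt_hash, "entries": entries}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path
```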

5. The evaluator

A single shared script: evaluator/evaluate.py. It reads runner result files and produces verdict files:

Rego extraction (evaluator/lib/extract_rego.py) - handles fenced blocks, package-prefixed blocks, and ambiguous extraction (flagged).
Compile check (evaluator/lib/compile_check.py) - opa parse --v0-compatible subprocess. Pinned to OPA 1.16.x for v1.0; --v0-compatible accepts both Rego v0 and v1 syntax so models trained on pre-2025 examples are not penalised purely for the v0→v1 syntax migration.
Fixture evaluation (evaluator/lib/fixture_runner.py) - opa eval against per-entry fixtures. Inputs are wrapped per-category to match the conventions the Rego is written against: kubernetes_admission fixtures are wrapped as {"review": {"object": <entity>}} (Gatekeeper convention); iac_scanning and application_authz fixtures pass through unchanged.
Quality flags (evaluator/lib/quality_flags.py) - advisory signals, not pass/fail.
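The per-category input wrapping described for fixture evaluation can be sketched like this. The function name wrap_fixture and the choice to raise on an unknown category are assumptions; the wrapping shapes themselves follow the description above.

```python
from typing import Any


def wrap_fixture(category: str, entity: Any) -> Any:
    """Wrap a fixture entity into the input shape the Rego is written against."""
    if category == "kubernetes_admission":
        # Gatekeeper convention: the object under review sits at input.review.object.
        return {"review": {"object": entity}}
    if category in ("iac_scanning", "application_authz"):
        # These categories pass through unchanged.
        return entity
    # Assumption: an unrecognised category is treated as an error.
    raise ValueError(f"unknown category: {category!r}")
```

For example, a Pod fixture in kubernetes_admission becomes {"review": {"object": <pod>}}, so generated rules that read input.review.object see the entity where they expect it.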

6. Verdict state machine

harness_verdict is one of: RETURNED_NOTHING, RETURNED_NON_REGO, COMPILE_ERROR, COMPILE_OK_NO_FIXTURES, FIXTURE_PASS, FIXTURE_PARTIAL, FIXTURE_FAIL, EVALUATION_ERROR.

[Full state machine with examples - TBD.]
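Until the full state machine is written up, the verdict set can be sketched as an enumeration. The eight names come from the list above; the fixture_verdict helper and its rule (all fixtures pass, some pass, none pass) are an assumption about what FIXTURE_PASS, FIXTURE_PARTIAL, and FIXTURE_FAIL mean, not the evaluator's documented behaviour.

```python
from enum import Enum


class HarnessVerdict(str, Enum):
    RETURNED_NOTHING = "RETURNED_NOTHING"
    RETURNED_NON_REGO = "RETURNED_NON_REGO"
    COMPILE_ERROR = "COMPILE_ERROR"
    COMPILE_OK_NO_FIXTURES = "COMPILE_OK_NO_FIXTURES"
    FIXTURE_PASS = "FIXTURE_PASS"
    FIXTURE_PARTIAL = "FIXTURE_PARTIAL"
    FIXTURE_FAIL = "FIXTURE_FAIL"
    EVALUATION_ERROR = "EVALUATION_ERROR"


def fixture_verdict(passed: int, total: int) -> HarnessVerdict:
    """Map fixture counts to a verdict for Rego that already compiled.

    Assumed rule: no fixtures -> COMPILE_OK_NO_FIXTURES; all pass -> FIXTURE_PASS;
    none pass -> FIXTURE_FAIL; anything in between -> FIXTURE_PARTIAL.
    """
    if total < 0 or not 0 <= passed <= max(total, 0):
        raise ValueError("passed must satisfy 0 <= passed <= total")
    if total == 0:
        return HarnessVerdict.COMPILE_OK_NO_FIXTURES
    if passed == total:
        return HarnessVerdict.FIXTURE_PASS
    if passed == 0:
        return HarnessVerdict.FIXTURE_FAIL
    return HarnessVerdict.FIXTURE_PARTIAL
```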

7. Reproducibility

See README.md.

8. Limitations

Single-shot per (model, tier, entry). No retry. A user in the wild is unlikely to keep re-submitting the NL until the Rego looks correct, so PolicyBench measures first-shot quality.
Fixtures private in v1.0. Methodologically weaker than full transparency; see section 10, Disclosure boundary, for the rationale.
Tier coverage. v1.0 publishes Tier 3 (full context: schema + worked examples) results only. Tier 1 (bare NL) and Tier 2 (with-context, no examples) prompts and runners are in the repo and ready, but not run for v1.0. They will be exercised in v1.1 to surface prompting sensitivity as a separate measurement.
Model choice - frontier only. v1.0 evaluates frontier-tier LLMs (claude-opus-4-7, gpt-5, gemini-3-pro) - the "ceiling" comparison. This answers "what can the best LLMs do?" but not "what should I actually deploy?" - most production systems would use a cost-tier model (Sonnet, GPT-5 mini, Gemini Flash) at 10-30× lower cost. A cost-tier comparison axis is queued as a v1.0.x follow-up. The thinking-on-vs-thinking-off ablation is queued separately.
Model nondeterminism. Documented per runner. Temperature pinned to 0 where the API supports it.
Prompt sensitivity. Tier 3 is one operating point on a band; Tier 1 / Tier 2 will surface sensitivity in v1.1.
Rego version. OPA 1.x defaults to Rego v1, which requires the if keyword before rule bodies and the contains keyword for partial set rules. PolicyBench evaluates with --v0-compatible so v0-style outputs are not failed on syntax migration alone. A future legacy_syntax quality flag may surface entries that only compile under v0 mode.

9. Versioning

Top-level VERSION pins corpus, prompts, evaluator, and fixtures versions together.

Corpus updates → may invalidate verdicts for changed entries.
Prompt updates → invalidate result files (different hash).
Evaluator updates → invalidate verdicts only; result files remain valid.
Fixture updates → invalidate verdicts only; result files remain valid.
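The invalidation rules above can be written down as a small lookup table. This is a sketch: the component and artefact names are hypothetical labels, and the corpus rule is coarser than the text, which limits invalidation to verdicts for changed entries.

```python
from typing import Iterable, Set

# Which artefacts a change to each versioned component invalidates.
INVALIDATES = {
    "corpus": {"verdicts"},     # per the text: only for changed entries
    "prompts": {"results"},     # new prompt hash, so result files must be regenerated
    "evaluator": {"verdicts"},  # result files remain valid
    "fixtures": {"verdicts"},   # result files remain valid
}


def stale_artifacts(changed: Iterable[str]) -> Set[str]:
    """Return the set of artefact kinds invalidated by the changed components."""
    stale: Set[str] = set()
    for component in changed:
        if component not in INVALIDATES:
            raise ValueError(f"unknown component: {component!r}")
        stale |= INVALIDATES[component]
    return stale
```

Under this table, changing only the evaluator or the fixtures means re-running evaluation against existing result files, while a prompt change means re-running the runners.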

10. Disclosure boundary

Disclosed (in repo): corpus, prompts, runner code, evaluator code, verdicts, reports.

Not disclosed in v1.0: test fixtures. Kept private to enable a paid evaluation service. Reviewers can supply their own fixtures via PB_FIXTURES_PATH.
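A sketch of how a harness might resolve reviewer-supplied fixtures. Only the PB_FIXTURES_PATH variable name comes from the text; the helper name, the assumption that the variable points at a directory, and the None fallback when it is unset are all hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def fixtures_dir(env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return the fixtures location from PB_FIXTURES_PATH, or None if unset.

    Assumption: an unset or empty variable means no fixtures are available.
    """
    env = os.environ if env is None else env
    raw = env.get("PB_FIXTURES_PATH", "")
    return Path(raw) if raw else None
```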

PolicyBench evaluates outputs, not implementations. Every entry on the leaderboard is treated as a black box: the harness sees only the Rego the entry produced and the metadata fields the runner self-reports. Internal architecture, model routing, intermediate representations, retry logic, and pipeline composition are out of scope and are not stored in any public artefact. Tools that integrate via a thin runner adapter - calling their hosted compilation API and producing a result file - are welcome additions, with no expectation that they expose anything beyond their Rego output.

11. Change log

v1.0 (early May 2026) - initial release.
