PolicyBench Methodology v1.0
Status: skeleton - sections will be filled in as v1.0 is built.
1. Purpose and scope
PolicyBench measures the quality of natural-language → OPA Rego compilation across LLMs and policy tools. It does not measure runtime enforcement, multi-policy composition, or compilation to non-Rego targets.
2. The corpus
The natural-language corpus lives in data/policybench-nl-v1.0.md. Categories: kubernetes_admission, iac_scanning, application_authz.
The v1.0 corpus draws from four public upstream rule sources - Checkov, Kyverno, Gatekeeper-library, and the CIS Kubernetes Benchmark - plus a small set of corpus-native starter specs. NL inputs are used verbatim from upstream where possible; PolicyBench does not rewrite descriptions to make them more compilable.
For the full provenance methodology - mining sources, NL-handling principles, fixture-authoring conventions, the per-policy-shape coverage matrix, and the categorisation rules - see docs/corpus-provenance.md.
3. The three tiers
- Tier 1 - Bare NL (prompts/tier1-bare-nl.md): minimal context. Baseline LLM capability.
- Tier 2 - With Context (prompts/tier2-with-context.md): NL plus category and runtime target. The realistic developer comparison.
- Tier 3 - Full Context (prompts/tier3-full-context.md): NL plus schema and worked examples. Best-effort baseline.
Each prompt's hash (first 8 hex chars of SHA-256) is recorded in every result file so that results produced by different prompt versions are not conflated.
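As a rough illustration of the prompt hash, the sketch below derives an 8-character identifier from a prompt's text and builds a file name in the result-<tier>-<prompthash>.json pattern used by runners. The function names, the UTF-8 encoding step, and the "tier3" label are assumptions made for illustration; the actual runners may read bytes differently or normalise whitespace first.

```python
import hashlib


def prompt_hash(prompt_text: str) -> str:
    """Return the first 8 hex characters of the SHA-256 digest of the prompt.

    Assumption: the prompt is hashed as UTF-8 bytes with no normalisation.
    """
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:8]


def result_filename(tier: str, prompt_text: str) -> str:
    """Build a result file name; the tier label format is hypothetical."""
    return f"result-{tier}-{prompt_hash(prompt_text)}.json"


if __name__ == "__main__":
    print(result_filename("tier3", "Compile the following policy to Rego."))
```

Because the hash is part of the file name, editing a prompt by even one character yields a different file name rather than overwriting earlier results.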
4. The runners
[List of participating tools/models, version pins - TBD.]
Each runner is a self-contained directory under runners/. It reads the corpus, applies one or more tier prompts, calls the model/service, and writes result-<tier>-<prompthash>.json files. See runners/README.md.
5. The evaluator
A single shared script, evaluator/evaluate.py, reads runner result files and produces verdict files in four stages:
- Rego extraction (evaluator/lib/extract_rego.py) - handles fenced blocks, package-prefixed blocks, and ambiguous extraction (flagged).
- Compile check (evaluator/lib/compile_check.py) - opa parse --v0-compatible subprocess. Pinned to OPA 1.16.x for v1.0; --v0-compatible accepts both Rego v0 and v1 syntax so models trained on pre-2025 examples are not penalised purely for the v0→v1 syntax migration.
- Fixture evaluation (evaluator/lib/fixture_runner.py) - opa eval against per-entry fixtures. Inputs are wrapped per-category to match the conventions the Rego is written against: kubernetes_admission fixtures are wrapped as {"review": {"object": <entity>}} (Gatekeeper convention); iac_scanning and application_authz fixtures pass through unchanged.
- Quality flags (evaluator/lib/quality_flags.py) - advisory signals, not pass/fail.
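The per-category input wrapping described for fixture evaluation can be sketched as follows. This is a minimal illustration, not the contents of fixture_runner.py: the function name wrap_input and the choice to raise on an unknown category are assumptions.

```python
from typing import Any

# Categories whose fixtures are handed to opa eval unchanged.
PASS_THROUGH = {"iac_scanning", "application_authz"}


def wrap_input(category: str, entity: Any) -> Any:
    """Shape a fixture entity into the input document the Rego expects."""
    if category == "kubernetes_admission":
        # Gatekeeper convention: the object sits under input.review.object.
        return {"review": {"object": entity}}
    if category in PASS_THROUGH:
        return entity
    # Assumption: an unrecognised category is an error, not a silent pass-through.
    raise ValueError(f"unknown category: {category!r}")


if __name__ == "__main__":
    pod = {"kind": "Pod", "metadata": {"name": "demo"}}
    print(wrap_input("kubernetes_admission", pod))
```

A policy written against input.review.object.kind would then see "Pod" for the fixture above, while an iac_scanning policy reads its fields directly from input.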
6. Verdict state machine
harness_verdict is one of: RETURNED_NOTHING, RETURNED_NON_REGO, COMPILE_ERROR, COMPILE_OK_NO_FIXTURES, FIXTURE_PASS, FIXTURE_PARTIAL, FIXTURE_FAIL, EVALUATION_ERROR.
[Full state machine with examples - TBD.]
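Until the full state machine is written up, one plausible reading of the eight verdicts is sketched below. The ordering of the checks and the pass/partial/fail thresholds (all fixtures pass, some pass, none pass) are assumptions for illustration only, as are the function and parameter names; the evaluator's actual rules may differ.

```python
from enum import Enum
from typing import Optional


class HarnessVerdict(str, Enum):
    RETURNED_NOTHING = "RETURNED_NOTHING"
    RETURNED_NON_REGO = "RETURNED_NON_REGO"
    COMPILE_ERROR = "COMPILE_ERROR"
    COMPILE_OK_NO_FIXTURES = "COMPILE_OK_NO_FIXTURES"
    FIXTURE_PASS = "FIXTURE_PASS"
    FIXTURE_PARTIAL = "FIXTURE_PARTIAL"
    FIXTURE_FAIL = "FIXTURE_FAIL"
    EVALUATION_ERROR = "EVALUATION_ERROR"


def classify(raw_output: str, rego: Optional[str], compiled: bool,
             passed: int = 0, total: int = 0,
             eval_error: bool = False) -> HarnessVerdict:
    """Map one entry's pipeline outcome to a verdict (hypothetical ordering)."""
    if not raw_output.strip():
        return HarnessVerdict.RETURNED_NOTHING
    if rego is None:  # extraction found no Rego in the output
        return HarnessVerdict.RETURNED_NON_REGO
    if not compiled:
        return HarnessVerdict.COMPILE_ERROR
    if total == 0:  # nothing to evaluate against
        return HarnessVerdict.COMPILE_OK_NO_FIXTURES
    if eval_error:  # the harness could not complete evaluation
        return HarnessVerdict.EVALUATION_ERROR
    if passed == total:
        return HarnessVerdict.FIXTURE_PASS
    if passed == 0:
        return HarnessVerdict.FIXTURE_FAIL
    return HarnessVerdict.FIXTURE_PARTIAL
```

Under these assumptions, an entry that compiles and passes 3 of 5 fixtures would be FIXTURE_PARTIAL, and an entry that compiles with no fixtures available would stop at COMPILE_OK_NO_FIXTURES.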
7. Reproducibility
See README.md.
8. Limitations
- Single-shot per (model, tier, entry). No retry. A user in the wild does not keep resubmitting the NL until the Rego looks correct, so PolicyBench measures first-shot quality.
- Fixtures private in v1.0. Methodologically weaker than full transparency; see Disclosure boundary below for rationale.
- Tier coverage. v1.0 publishes Tier 3 (full context: schema + worked examples) results only. Tier 1 (bare NL) and Tier 2 (with-context, no examples) prompts and runners are in the repo and ready, but not run for v1.0. They will be exercised in v1.1 to surface prompting sensitivity as a separate measurement.
- Model choice - frontier only. v1.0 evaluates frontier-tier LLMs (claude-opus-4-7, gpt-5, gemini-3-pro) - the "ceiling" comparison. This answers "what can the best LLMs do?" but not "what should I actually deploy?" - most production systems would use a cost-tier model (Sonnet, GPT-5 mini, Gemini Flash) at 10-30× lower cost. A cost-tier comparison axis is queued as a v1.0.x follow-up. The thinking-on-vs-thinking-off ablation is queued separately.
- Model nondeterminism. Documented per runner. Temperature pinned to 0 where the API supports it.
- Prompt sensitivity. Tier 3 is one operating point on a band; Tier 1 / Tier 2 will surface sensitivity in v1.1.
- Rego version. OPA 1.x defaults to Rego v1 (requires if keyword, contains for partial set rules). PolicyBench evaluates with --v0-compatible so v0-style outputs are not failed on syntax migration alone. A future legacy_syntax quality flag may surface entries that only compile under v0 mode.
9. Versioning
Top-level VERSION pins corpus, prompts, evaluator, and fixtures versions together.
- Corpus updates → may invalidate verdicts for changed entries.
- Prompt updates → invalidate result files (different hash).
- Evaluator updates → invalidate verdicts only; result files remain valid.
- Fixture updates → invalidate verdicts only; result files remain valid.
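The invalidation rules above can be expressed as a small lookup. This is an illustrative sketch, not part of the repository: the component and artefact labels are hypothetical, and the entry for prompts adds one assumption beyond the list, namely that verdicts derived from invalidated result files are stale as well.

```python
# Which artefacts go stale when a versioned component changes.
INVALIDATES = {
    "corpus": {"verdicts"},     # only for entries that changed
    # Assumption: verdicts computed from stale result files are stale too.
    "prompts": {"results", "verdicts"},
    "evaluator": {"verdicts"},  # result files remain valid
    "fixtures": {"verdicts"},   # result files remain valid
}


def stale_artifacts(changed_components: set[str]) -> set[str]:
    """Return the union of artefact kinds invalidated by the given changes."""
    stale: set[str] = set()
    for component in changed_components:
        if component not in INVALIDATES:
            raise KeyError(f"unknown component: {component!r}")
        stale |= INVALIDATES[component]
    return stale


if __name__ == "__main__":
    # An evaluator-only release needs re-evaluation, not new model calls.
    print(stale_artifacts({"evaluator"}))
```

The practical consequence is that only a prompt change forces runners to call the models again; evaluator and fixture changes can be absorbed by re-running the evaluator over existing result files.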
10. Disclosure boundary
Disclosed (in repo): corpus, prompts, runner code, evaluator code, verdicts, reports.
Not disclosed in v1.0: test fixtures. Kept private to enable a paid evaluation service. Reviewers can supply their own fixtures via PB_FIXTURES_PATH.
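For reviewers supplying their own fixtures, a harness might resolve the fixture directory along these lines. Only the PB_FIXTURES_PATH variable name comes from this document; the function name, the treatment of an unset or empty variable as "no fixtures", and the home-directory expansion are assumptions.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_fixtures_dir(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the reviewer-supplied fixtures directory, or None if not set."""
    raw = env.get("PB_FIXTURES_PATH", "").strip()
    if not raw:
        # Assumption: without fixtures, evaluation stops after the compile check.
        return None
    return Path(raw).expanduser()


if __name__ == "__main__":
    fixtures_dir = resolve_fixtures_dir()
    print(fixtures_dir if fixtures_dir else "no fixtures configured")
```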
PolicyBench evaluates outputs, not implementations. Every entry on the leaderboard is treated as a black box: the harness sees only the Rego the entry produced and the metadata fields the runner self-reports. Internal architecture, model routing, intermediate representations, retry logic, and pipeline composition are out of scope and are not stored in any public artefact. Tools that integrate via a thin runner adapter - calling their hosted compilation API and producing a result file - are welcome additions, with no expectation that they expose anything beyond their Rego output.
11. Change log
- v1.0 (early-May 2026) - initial release.