PolicyBench

PolicyBench Methodology v1.0

Status: skeleton - sections will be filled in as v1.0 is built.

1. Purpose and scope

PolicyBench measures the quality of natural-language → OPA Rego compilation across LLMs and policy tools. It does not measure runtime enforcement, multi-policy composition, or compilation to non-Rego targets.

2. The corpus

The natural-language corpus lives in data/policybench-nl-v1.0.md. Categories: kubernetes_admission, iac_scanning, application_authz.

The v1.0 corpus draws from four public upstream rule sources - Checkov, Kyverno, Gatekeeper-library, and the CIS Kubernetes Benchmark - plus a small set of corpus-native starter specs. NL inputs are used verbatim from upstream where possible; PolicyBench does not rewrite descriptions to make them more compilable.

For the full provenance methodology - mining sources, NL-handling principles, fixture-authoring conventions, the per-policy-shape coverage matrix, and the categorisation rules - see docs/corpus-provenance.md.

3. The three tiers

Each prompt's hash (first 8 hex chars of SHA-256) is recorded in every result file so prompt versions can never be conflated.
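As a minimal sketch of the hashing rule above: the helper name `prompt_hash` is hypothetical, and hashing the prompt as UTF-8 bytes with no whitespace normalisation is an assumption, since the methodology only specifies "first 8 hex chars of SHA-256".

```python
import hashlib


def prompt_hash(prompt_text: str) -> str:
    """Return the first 8 hex characters of the SHA-256 digest of a prompt.

    Assumption: the prompt is hashed as UTF-8 bytes, exactly as stored,
    with no trimming or newline normalisation.
    """
    digest = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()
    return digest[:8]
```

Because any edit to the prompt text changes the digest, two result files with the same hash were, under these assumptions, produced from byte-identical prompts.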

4. The runners

[List of participating tools/models, version pins - TBD.]

Each runner is a self-contained directory under runners/. It reads the corpus, applies one or more tier prompts, calls the model/service, and writes result-<tier>-<prompthash>.json files. See runners/README.md.
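The file-naming convention can be sketched as follows. Only the `result-<tier>-<prompthash>.json` pattern comes from this document; the function names, the tier label, and the JSON layout (`tier`, `prompt_hash`, `results`) are hypothetical placeholders, and runners/README.md remains the authority on the real schema.

```python
import json
from pathlib import Path


def result_path(out_dir: Path, tier: str, prompt_hash: str) -> Path:
    """Build the result file path following result-<tier>-<prompthash>.json."""
    return out_dir / f"result-{tier}-{prompt_hash}.json"


def write_result(out_dir: Path, tier: str, prompt_hash: str, records: list) -> Path:
    """Write one result file and return its path.

    Assumption: the payload layout below is illustrative only and is not
    the schema defined by the project.
    """
    path = result_path(out_dir, tier, prompt_hash)
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"tier": tier, "prompt_hash": prompt_hash, "results": records}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path
```

Keeping the tier and the prompt hash in the file name means a directory listing alone shows which prompt version produced each result.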

5. The evaluator

A single shared script: evaluator/evaluate.py. It reads runner result files and produces verdict files.

6. Verdict state machine

harness_verdict is one of: RETURNED_NOTHING, RETURNED_NON_REGO, COMPILE_ERROR, COMPILE_OK_NO_FIXTURES, FIXTURE_PASS, FIXTURE_PARTIAL, FIXTURE_FAIL, EVALUATION_ERROR.
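Until the full state machine is written up, the listed values can be captured as an enumeration. The sketch below is illustrative: the class name `HarnessVerdict` and the helper `fixture_verdict` are hypothetical, and the rule it applies (all fixtures pass, some pass, none pass) is an assumption about how the three FIXTURE_* states are distinguished, not the evaluator's documented behaviour.

```python
from enum import Enum


class HarnessVerdict(str, Enum):
    """The eight harness_verdict values listed in this section."""

    RETURNED_NOTHING = "RETURNED_NOTHING"
    RETURNED_NON_REGO = "RETURNED_NON_REGO"
    COMPILE_ERROR = "COMPILE_ERROR"
    COMPILE_OK_NO_FIXTURES = "COMPILE_OK_NO_FIXTURES"
    FIXTURE_PASS = "FIXTURE_PASS"
    FIXTURE_PARTIAL = "FIXTURE_PARTIAL"
    FIXTURE_FAIL = "FIXTURE_FAIL"
    EVALUATION_ERROR = "EVALUATION_ERROR"


def fixture_verdict(passed: int, total: int) -> HarnessVerdict:
    """Map fixture counts for a policy that already compiled to a verdict.

    Assumption: PASS means every fixture passed, FAIL means none did,
    and PARTIAL covers everything in between.
    """
    if total < 0 or not 0 <= passed <= max(total, 0):
        raise ValueError("expected 0 <= passed <= total")
    if total == 0:
        return HarnessVerdict.COMPILE_OK_NO_FIXTURES
    if passed == total:
        return HarnessVerdict.FIXTURE_PASS
    if passed == 0:
        return HarnessVerdict.FIXTURE_FAIL
    return HarnessVerdict.FIXTURE_PARTIAL
```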

[Full state machine with examples - TBD.]

7. Reproducibility

See README.md.

8. Limitations

9. Versioning

The top-level VERSION file pins the corpus, prompt, evaluator, and fixture versions together.

10. Disclosure boundary

Disclosed (in repo): corpus, prompts, runner code, evaluator code, verdicts, reports.

Not disclosed in v1.0: test fixtures, which are kept private to enable a paid evaluation service. Reviewers can supply their own fixtures via PB_FIXTURES_PATH.
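One way a script might resolve reviewer-supplied fixtures is sketched below. Only the variable name PB_FIXTURES_PATH comes from this document; the helper `fixtures_dir`, the assumption that the variable points at a directory, and the choice to return `None` when it is unset are all illustrative and may differ from what the evaluator actually does.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def fixtures_dir(env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Resolve the fixtures directory from PB_FIXTURES_PATH, if set.

    Assumption: the variable names a directory. Returns None when the
    variable is unset or empty, so the caller can skip fixture evaluation.
    """
    source = os.environ if env is None else env
    raw = source.get("PB_FIXTURES_PATH", "").strip()
    if not raw:
        return None
    path = Path(raw).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"PB_FIXTURES_PATH is not a directory: {path}")
    return path
```

Passing the environment as a parameter keeps the helper easy to exercise without mutating the real process environment.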

PolicyBench evaluates outputs, not implementations. Every entry on the leaderboard is treated as a black box: the harness sees only the Rego the entry produced and the metadata fields the runner self-reports. Internal architecture, model routing, intermediate representations, retry logic, and pipeline composition are out of scope and are not stored in any public artefact. Tools that integrate via a thin runner adapter - calling their hosted compilation API and producing a result file - are welcome additions, with no expectation that they expose anything beyond their Rego output.

11. Change log