Submit your tool
PolicyBench is open to any tool that compiles natural language into OPA Rego. New entries make the benchmark stronger; submissions from any provider are welcome.
What we evaluate
Anything that takes a natural-language policy statement and emits Rego. Examples include purpose-built compilation tools, hosted services, frontier LLMs, smaller or cost-tier models, and hand-prompted assistant pipelines. There is no requirement that the tool be commercial, open-source, locally hosted, or hosted at all - only that it produces Rego from natural language.
What gets published
- The Rego your tool produced for each natural-language entry in the corpus.
- Aggregate scores: compile rate, entry-level fixture pass rate, fixture-level pass rate, and median latency.
- A per-tool report page on this site, generated automatically from the verdict files.
- An embeddable badge with your tool's score that you can put on your own site.
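As a rough illustration of how the aggregate scores listed above relate to one another, the sketch below computes them from per-entry records. The record fields (compiled, fixtures_passed, fixtures_total, latency_ms) are hypothetical and are not the canonical verdict schema. Treating an entry as passing only when it compiled and every one of its fixtures passed is also an assumption; the methodology document is authoritative.

```python
from statistics import median


def aggregate(records):
    """Compute illustrative aggregate scores from per-entry records.

    Each record is a dict with hypothetical keys:
      compiled (bool), fixtures_passed (int),
      fixtures_total (int), latency_ms (float)
    """
    n = len(records)
    if n == 0:
        raise ValueError("no records to aggregate")

    compiled = sum(1 for r in records if r["compiled"])

    # Assumption: an entry passes only if it compiled and all fixtures passed.
    entry_pass = sum(
        1
        for r in records
        if r["compiled"]
        and r["fixtures_total"] > 0
        and r["fixtures_passed"] == r["fixtures_total"]
    )

    fixtures_total = sum(r["fixtures_total"] for r in records)
    fixtures_passed = sum(r["fixtures_passed"] for r in records)

    return {
        "compile_rate": compiled / n,
        "entry_pass_rate": entry_pass / n,
        "fixture_pass_rate": fixtures_passed / fixtures_total if fixtures_total else 0.0,
        "median_latency_ms": median(r["latency_ms"] for r in records),
    }
```

The entry-level and fixture-level rates can diverge: a tool that passes most fixtures on every entry but rarely all of them would show a high fixture-level rate and a low entry-level rate.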
What is not published
PolicyBench evaluates outputs, not implementations. Your internal architecture, model routing, prompts, intermediate representations, retry logic, and pipeline composition are out of scope and never appear in any public artefact. The harness sees only the Rego your runner submits.
The runner
A runner is a small directory under runners/<your-tool>/ in the source repository. For LLM-style entries, cloning an existing runner is config-only: copy a sibling directory, edit .env.example to set the model id, and the run script picks it up. For purpose-built tools, the runner is a thin adapter that calls your hosted API and writes a single result file per the canonical schema.
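To give a feel for how thin such an adapter can be, here is a minimal sketch. Everything in it is an assumption made for illustration: the function name run_corpus, the injected compile_fn callable (which in a real adapter would wrap the call to your hosted API), and the JSON layout of the result file. The canonical schema in the repository takes precedence over the field names used here.

```python
import json
import time
from pathlib import Path


def run_corpus(entries, compile_fn, out_path):
    """Call the tool once per corpus entry and write one result file.

    entries:    iterable of (entry_id, natural_language_text) pairs
    compile_fn: callable taking NL text and returning Rego source
                (hypothetical stand-in for the tool's API call)
    out_path:   path of the JSON result file to write
    """
    results = []
    for entry_id, text in entries:
        start = time.perf_counter()
        try:
            rego, error = compile_fn(text), None
        except Exception as exc:
            # Record the failure for this entry instead of aborting the run.
            rego, error = None, str(exc)
        latency_ms = (time.perf_counter() - start) * 1000.0
        results.append(
            {"id": entry_id, "rego": rego, "error": error, "latency_ms": latency_ms}
        )

    # Hypothetical layout, not the canonical schema.
    Path(out_path).write_text(
        json.dumps({"results": results}, indent=2), encoding="utf-8"
    )
    return results
```

Passing the API call in as a function keeps the adapter free of tool-specific details, so it can be exercised with a stub before any real endpoint or credentials are involved.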
The harness - extraction, compile check, fixture evaluation, scoring - is shared across every entry. There is no per-tool special-casing.
How to submit
- Open an issue at https://github.com/jbeaven/policybench/issues/new describing the tool - name, type (compilation tool / LLM / other), contact, and whether you'd like to open the PR yourself or have us add the runner on your behalf.
- Agree on the runner shape. For most tools this takes one back-and-forth: confirm the API endpoint, auth model, and any quirks around how your tool surfaces its Rego output.
- The runner lands as a PR. We run the full corpus once, verify the result file is well-formed, and merge. Reports and the leaderboard auto-update on the next site rebuild.
Typical turn-around is 1-2 weeks from first contact to leaderboard publication. Re-runs after model updates take a few hours.
Cost and timeline
Your API costs
Hosted-API tools and LLM entries pay for their own corpus runs (147 calls per evaluation, plus any re-runs). API-side costs are typically a few cents to a few dollars per full pass, depending on your tool's pricing.
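For budgeting, the arithmetic is simple enough to sketch. The 147-call figure comes from the paragraph above; the per-call price and the helper name estimate_run_cost are hypothetical, and the sketch assumes each re-run is a full pass over the corpus.

```python
CORPUS_CALLS = 147  # calls per full evaluation, as stated above


def estimate_run_cost(price_per_call_usd, reruns=0):
    """Estimate API spend for one evaluation plus any full re-runs."""
    if price_per_call_usd < 0 or reruns < 0:
        raise ValueError("price and reruns must be non-negative")
    return CORPUS_CALLS * (1 + reruns) * price_per_call_usd
```

At an assumed price of $0.01 per call, one full pass comes to about $1.47, and one additional re-run brings the total to about $2.94.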
Inclusion fee
Adding a new entry takes PolicyBench maintainer time: writing or reviewing the runner adapter, running the full corpus through the harness, validating the result file, producing the per-tool report and badge, and adding the entry to the leaderboard. We charge a one-time inclusion fee to cover this work, and a smaller maintenance fee for re-runs after model updates.
Pricing depends on the entry shape (LLM submission, hosted compilation API, on-prem) and is shared on first contact via GitHub issues.
Disclosure and neutrality
PolicyBench is vendor-neutral. The methodology, the corpus, the evaluator, and every runner's source are public. No tool is privileged over another in the leaderboard structure or in any individual metric. See the methodology document for the verdict state machine, the package/rule discovery logic, and the input-shape wrapping conventions that apply uniformly to every entry.
If you spot anything that looks like favouritism - in framing, in scoring, in the corpus, anywhere - please flag it. Neutrality is the project's most important property, and we treat any concrete report of bias as a fix-before-next-publish issue.
Contact
Primary channel: GitHub issues.
Issues marked with the submission label are tracked for inclusion on the leaderboard.