PolicyBench

GPT-5 mini - PolicyBench v1.0 Results

37 of 147 policies (25.2%) fully verified; 75.9% fixture-level pass rate.

Tier: Cost-tier LLM
Model version: gpt-5-mini
Evaluated: 2026-05-06
Evaluator: 1.0
OPA: 1.16.1
Corpus: 147 entries

Headline

Self-reported VALID | Compile OK | Fixture PASS (entry-level) | Fixture pass rate (fixture-level) | Self-report agreement
100.0% | 58.5% | 25.2% | 75.9% | 0.252
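The gap between the entry-level and fixture-level numbers is easier to see with a small sketch. The code below is illustrative only: the `EntryResult` record, its field names, and the sample data are hypothetical, and it assumes that an entry counts as Fixture PASS only when it compiles and every one of its fixtures passes, that the fixture-level rate is taken over fixtures of entries that compiled, and that self-report agreement is the fraction of entries whose self-reported status matches the harness verdict. The page does not confirm any of these definitions.

```python
from dataclasses import dataclass


@dataclass
class EntryResult:
    """Hypothetical per-entry record; field names are assumptions."""
    entry_id: str
    self_reported_valid: bool
    compiled: bool
    fixtures_passed: int
    fixtures_total: int

    @property
    def fully_verified(self) -> bool:
        # Assumption: entry-level PASS requires compiling and passing every fixture.
        return (
            self.compiled
            and self.fixtures_total > 0
            and self.fixtures_passed == self.fixtures_total
        )


def summarize(results):
    n = len(results)
    # Assumption: only fixtures of entries that compiled are counted.
    ran = [r for r in results if r.compiled]
    fixtures_total = sum(r.fixtures_total for r in ran)
    fixtures_passed = sum(r.fixtures_passed for r in ran)
    return {
        "self_reported_valid": sum(r.self_reported_valid for r in results) / n,
        "compile_ok": sum(r.compiled for r in results) / n,
        "fixture_pass_entry": sum(r.fully_verified for r in results) / n,
        "fixture_pass_rate": fixtures_passed / fixtures_total if fixtures_total else 0.0,
        # Assumption: agreement = share of entries where the claim matches the verdict.
        "self_report_agreement": sum(
            r.self_reported_valid == r.fully_verified for r in results
        ) / n,
    }


# Made-up sample: four entries, all self-reported as VALID.
sample = [
    EntryResult("a", True, True, 4, 4),   # compiles, all fixtures pass
    EntryResult("b", True, True, 2, 4),   # compiles, some fixtures fail
    EntryResult("c", True, False, 0, 4),  # does not compile
    EntryResult("d", True, True, 3, 4),   # compiles, one fixture fails
]
summary = summarize(sample)
print(summary)
```

In this sample the fixture-level rate (0.75) is well above the entry-level rate (0.25), because one failing fixture is enough to fail an entry. Under these assumptions, when every entry is self-reported VALID the agreement score equals the entry-level pass fraction, which matches the 0.252 and 25.2% reported here.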

By category

Category | Entries | Compile OK | Fixture PASS (entry) | Fixture pass rate
application_authz | 3 | 100.0% | 66.7% | 83.3%
iac_scanning | 65 | 56.9% | 20.0% | 74.2%
kubernetes_admission | 79 | 58.2% | 27.8% | 77.0%
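As a sanity check, the per-category rows can be recombined into the headline entry-level figures. The sketch below uses only the numbers in the table above; rounding each product to the nearest whole entry is an assumption about how the percentages were derived.

```python
# (entries, Compile OK rate, entry-level Fixture PASS rate) from the category table.
categories = {
    "application_authz": (3, 1.000, 0.667),
    "iac_scanning": (65, 0.569, 0.200),
    "kubernetes_admission": (79, 0.582, 0.278),
}

total_entries = sum(n for n, _, _ in categories.values())
# Recover whole-entry counts from the rounded percentages.
compiled = sum(round(n * rate) for n, rate, _ in categories.values())
verified = sum(round(n * rate) for n, _, rate in categories.values())

compile_pct = round(100 * compiled / total_entries, 1)
verified_pct = round(100 * verified / total_entries, 1)

print(total_entries, compiled, verified)  # 147 86 37
print(compile_pct, verified_pct)          # 58.5 25.2
```

The fixture-level pass rate cannot be recombined the same way, because the page does not give the number of fixtures per category.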

Performance

Per-call latency from the runner's recorded latency_ms.

Calls timed | p50 | mean | p95 | max | Total
147 | 19.21 s | 21.01 s | 47.70 s | 60.46 s | 51.5 min
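For readers who want to reproduce a summary like this from their own run, here is a minimal sketch. It assumes a plain list of per-call `latency_ms` values and uses the nearest-rank percentile method; the page does not say which percentile method the harness uses, and the function names and sample values are hypothetical.

```python
import math
from statistics import mean


def percentile(sorted_vals, p):
    """Nearest-rank percentile (an assumed method, not necessarily the harness's)."""
    rank = max(1, math.ceil(p / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]


def latency_summary(latencies_ms):
    vals = sorted(latencies_ms)
    return {
        "calls": len(vals),
        "p50_s": percentile(vals, 50) / 1000,
        "mean_s": mean(vals) / 1000,
        "p95_s": percentile(vals, 95) / 1000,
        "max_s": vals[-1] / 1000,
        "total_min": sum(vals) / 1000 / 60,
    }


# Made-up latencies in milliseconds, not taken from the actual run.
example = latency_summary([12000, 18000, 20000, 24000, 61000])
print(example)

# Consistency check on the reported row: calls x mean should give the total.
reported_total_min = round(147 * 21.01 / 60, 1)
print(reported_total_min)  # 51.5
```

The last line confirms that the reported total (51.5 min) is the call count multiplied by the mean latency, which suggests calls were timed individually and summed rather than measured as wall-clock time for the whole run.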

Quality flag distribution

No quality flags raised.

Notable disagreements

Entries where the runner's self-reported status disagrees with the harness verdict. These are the most informative entries for understanding model blind spots.

Entry | Harness verdict | Disagreement
sp_iac_01_rds_backup | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_02_s3_public | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_ckv_ckv_aws_116 | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_ckv_ckv_aws_118 | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_ckv_ckv_aws_129 | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_ckv_ckv_aws_13 | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_ckv_ckv_aws_16 | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_ckv_ckv_aws_17 | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_ckv_ckv_aws_20 | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_ckv_ckv_aws_21 | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_ckv_ckv_aws_211 | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_ckv_ckv_aws_247 | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_ckv_ckv_aws_248 | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_ckv_ckv_aws_252 | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_ckv_ckv_aws_317 | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_ckv_ckv_aws_318 | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_ckv_ckv_aws_324 | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_ckv_ckv_aws_325 | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_ckv_ckv_aws_326 | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_ckv_ckv_aws_327 | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_ckv_ckv_aws_35 | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_ckv_ckv_aws_354 | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_ckv_ckv_aws_5 | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_ckv_ckv_aws_50 | COMPILE_ERROR | claimed VALID but Rego does not compile
sp_iac_ckv_ckv_aws_57 | COMPILE_ERROR | claimed VALID but Rego does not compile

…and 36 more.

Badge

Embed this on your site to show your PolicyBench score:

HTML

<a href="https://policybench.dev/models/gpt-5-mini.html">
  <img src="https://policybench.dev/badges/gpt-5-mini.svg" alt="PolicyBench: 25.2%" />
</a>

Markdown

[![PolicyBench: 25.2%](https://policybench.dev/badges/gpt-5-mini.svg)](https://policybench.dev/models/gpt-5-mini.html)

Direct link: https://policybench.dev/badges/gpt-5-mini.svg. The badge is regenerated whenever the underlying score changes.

Other tools