GPT-5 mini - PolicyBench v1.0 Results

37 of 147 policies (25.2%) fully verified; 75.9% fixture-level pass rate.

Tier	Model version	Evaluated	Evaluator	OPA	Corpus
Cost-tier LLM	gpt-5-mini	2026-05-06	1.0	1.16.1	147 entries

Headline

Self-reported VALID	Compile OK	Fixture PASS (entry-level)	Fixture pass rate (fixture-level)	Self-report agreement
100.0%	58.5%	25.2%	75.9%	0.252
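
The headline percentages can be reproduced from the counts stated on this page. The sketch below is illustrative only: the variable names are made up, and the definition of self-report agreement (the share of entries whose self-reported status matches the harness verdict) is an assumption, since the page does not define it. Under that assumption, because every entry was self-reported VALID, agreement equals the entry-level pass rate.

```python
# Counts taken from this report; names are illustrative, not the harness's own.
TOTAL_ENTRIES = 147
FULLY_VERIFIED = 37          # entries whose fixtures all passed
SELF_REPORTED_VALID = 147    # 100.0% of entries claimed VALID


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place, as shown in the tables."""
    return round(100.0 * part / whole, 1)


entry_pass_pct = pct(FULLY_VERIFIED, TOTAL_ENTRIES)
self_report_pct = pct(SELF_REPORTED_VALID, TOTAL_ENTRIES)

# ASSUMPTION: agreement = matching entries / all entries.
# Every entry claimed VALID, so the matches are exactly the passing entries.
agreement = round(FULLY_VERIFIED / TOTAL_ENTRIES, 3)

print(entry_pass_pct, self_report_pct, agreement)
```
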

By category

Category	Entries	Compile OK	Fixture PASS (entry)	Fixture pass rate
application_authz	3	100.0%	66.7%	83.3%
iac_scanning	65	56.9%	20.0%	74.2%
kubernetes_admission	79	58.2%	27.8%	77.0%
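
As a consistency check, the per-category percentages can be turned back into entry counts and summed against the headline row. This is a rough sketch: the counts are recovered by rounding, which assumes each published percentage was rounded from a whole number of entries.

```python
# (entries, compile OK %, entry-level fixture PASS %) copied from the table above.
rows = {
    "application_authz": (3, 100.0, 66.7),
    "iac_scanning": (65, 56.9, 20.0),
    "kubernetes_admission": (79, 58.2, 27.8),
}


def to_count(entries: int, percent: float) -> int:
    """Recover a whole-entry count from a rounded percentage (assumption)."""
    return round(entries * percent / 100)


total_entries = sum(entries for entries, _, _ in rows.values())
compiled = sum(to_count(entries, ok) for entries, ok, _ in rows.values())
passed = sum(to_count(entries, ok) for entries, _, ok in rows.values())

compile_pct = round(100 * compiled / total_entries, 1)
pass_pct = round(100 * passed / total_entries, 1)

print(total_entries, compiled, passed, compile_pct, pass_pct)
```

The recovered totals match the headline figures of 147 entries, 58.5% compile OK, and 25.2% entry-level pass.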

Performance

Per-call latency, converted to seconds from the latency_ms value the runner recorded for each call.

Calls timed	p50	mean	p95	max	Total
147	19.21 s	21.01 s	47.70 s	60.46 s	51.5 min
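
A summary like the one above could be computed from a list of latency_ms values roughly as follows. This is a sketch under stated assumptions: the nearest-rank percentile is one common convention and may not be the method the harness uses, the function names are hypothetical, and the sample latencies are made up. The final lines only cross-check the reported total against calls times mean.

```python
import math
import statistics


def percentile(sorted_ms: list[float], q: float) -> float:
    """Nearest-rank percentile (ASSUMPTION: the harness's method is not stated)."""
    rank = max(1, math.ceil(q / 100 * len(sorted_ms)))
    return sorted_ms[rank - 1]


def summarize(latency_ms: list[float]) -> dict[str, float]:
    """Summarize per-call latencies given in milliseconds."""
    ordered = sorted(latency_ms)
    return {
        "calls": len(ordered),
        "p50_s": percentile(ordered, 50) / 1000,
        "mean_s": statistics.fmean(ordered) / 1000,
        "p95_s": percentile(ordered, 95) / 1000,
        "max_s": ordered[-1] / 1000,
        "total_min": sum(ordered) / 60_000,
    }


# Made-up sample, NOT data from this run.
summary = summarize([1000, 2000, 3000, 4000, 10000])

# Cross-check of the report: 147 calls at a 21.01 s mean.
reported_total_min = round(147 * 21.01 / 60, 1)
print(summary, reported_total_min)
```
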

Quality flag distribution

No quality flags raised.

Notable disagreements

Entries where the runner's self-reported status disagrees with the harness verdict. These are the most informative entries for understanding model blind spots.

Entry	Harness verdict	Disagreement
`sp_iac_01_rds_backup`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_02_s3_public`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_116`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_118`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_129`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_13`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_16`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_17`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_20`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_21`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_211`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_247`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_248`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_252`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_317`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_318`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_324`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_325`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_326`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_327`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_35`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_354`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_5`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_50`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_57`	COMPILE_ERROR	claimed VALID but Rego does not compile

…and 36 more.
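
The comparison behind this table can be sketched as a join between what the runner claimed and what the harness recorded. Everything about the data shape here is an assumption: the real JSON files may use different field names and verdict labels, the function name is hypothetical, and `hypothetical_passing_entry` is an invented ID used only to show a non-disagreeing case.

```python
def find_disagreements(
    runner_claims: dict[str, str],
    harness_verdicts: dict[str, str],
) -> list[tuple[str, str]]:
    """Return (entry, harness verdict) pairs where a VALID claim was not a PASS."""
    disagreements = []
    for entry_id in sorted(runner_claims):
        claim = runner_claims[entry_id]
        # An entry missing from the harness output is treated as a disagreement.
        verdict = harness_verdicts.get(entry_id, "MISSING")
        if claim == "VALID" and verdict != "PASS":
            disagreements.append((entry_id, verdict))
    return disagreements


# Two entry IDs from the table above plus one invented passing entry.
claims = {
    "sp_iac_01_rds_backup": "VALID",
    "sp_iac_02_s3_public": "VALID",
    "hypothetical_passing_entry": "VALID",
}
verdicts = {
    "sp_iac_01_rds_backup": "COMPILE_ERROR",
    "sp_iac_02_s3_public": "COMPILE_ERROR",
    "hypothetical_passing_entry": "PASS",
}

found = find_disagreements(claims, verdicts)
sample_agreement = round((len(claims) - len(found)) / len(claims), 3)
print(found, sample_agreement)
```
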

Badge

Embed this on your site to show your PolicyBench score:

HTML

<a href="https://policybench.dev/models/gpt-5-mini.html">
  <img src="https://policybench.dev/badges/gpt-5-mini.svg" alt="PolicyBench: 25.2%" />
</a>

Markdown

[![PolicyBench: 25.2%](https://policybench.dev/badges/gpt-5-mini.svg)](https://policybench.dev/models/gpt-5-mini.html)

Direct link: https://policybench.dev/badges/gpt-5-mini.svg. The badge is regenerated whenever the underlying score changes.

Source

Harness verdict (JSON) - what PolicyBench's evaluator recorded
Runner result (JSON) - the raw output the runner captured from the model
Runner source - the script that called the model

Other tools

PolicyAsLanguage - 90.5% entry-level pass
Gemini 3 Pro - 57.8% entry-level pass
Gemini 3 Flash - 54.4% entry-level pass
Claude Opus 4.7 - 46.9% entry-level pass
GPT-5 - 36.7% entry-level pass
Claude Sonnet 4.6 - 36.1% entry-level pass