GPT-5 - PolicyBench v1.0 Results

54 of 147 policies (36.7%) fully verified; fixture-level pass rate 76.5%.

Tier	Model version	Evaluated	Evaluator	OPA	Corpus
Frontier LLM	gpt-5	2026-05-06	1.0	1.16.1	147 entries

Headline

Self-reported VALID	Compile OK	Fixture PASS (entry-level)	Fixture pass rate (fixture-level)	Self-report agreement
100.0%	83.0%	36.7%	76.5%	0.367
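
The self-report agreement figure matches the entry-level pass rate, which is what one would expect if agreement is the share of entries where the model's claim and the harness verdict coincide. That definition is an assumption, not something this page states. A minimal sketch under that assumption, using only the counts reported here (147 entries, all self-reported VALID, 54 fully verified); the function name `agreement` is hypothetical:

```python
def agreement(pairs):
    """Share of entries where the self-report matches the harness verdict.

    pairs: iterable of (claimed_valid, fully_verified) booleans.
    This definition is an assumption, not taken from the PolicyBench evaluator.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no entries to compare")
    matches = sum(claimed == verified for claimed, verified in pairs)
    return matches / len(pairs)


# Counts from this page: every entry claimed VALID, 54 of 147 fully verified.
pairs = [(True, True)] * 54 + [(True, False)] * (147 - 54)
score = agreement(pairs)
print(round(score, 3))  # 0.367
```

Because every entry was self-reported VALID, a claim can only agree with the harness when the entry is fully verified, so under this reading the two numbers coincide.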

By category

Category	Entries	Compile OK	Fixture PASS (entry)	Fixture pass rate
application_authz	3	66.7%	33.3%	75.0%
iac_scanning	65	86.2%	24.6%	74.6%
kubernetes_admission	79	81.0%	46.8%	78.6%
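
The per-category percentages can be checked against the headline by converting each one back to a whole-entry count and summing. The sketch below does that with the rows from the table above; the helper names are hypothetical, and the rounding step assumes each percentage was derived from an integer count of entries.

```python
# Rows copied from the category table:
# (category, entries, compile_ok_pct, fixture_pass_entry_pct)
ROWS = [
    ("application_authz", 3, 66.7, 33.3),
    ("iac_scanning", 65, 86.2, 24.6),
    ("kubernetes_admission", 79, 81.0, 46.8),
]


def implied_count(entries, pct):
    """Recover the whole-entry count behind a rounded percentage."""
    return round(entries * pct / 100)


def rollup(rows):
    """Sum entries, compiled entries, and fully verified entries across categories."""
    total = sum(r[1] for r in rows)
    compiled = sum(implied_count(r[1], r[2]) for r in rows)
    passed = sum(implied_count(r[1], r[3]) for r in rows)
    return total, compiled, passed


total, compiled, passed = rollup(ROWS)
print(f"{total} entries, {compiled} compile OK ({compiled / total:.1%}), "
      f"{passed} fully verified ({passed / total:.1%})")
# 147 entries, 122 compile OK (83.0%), 54 fully verified (36.7%)
```

The fixture-level pass rate cannot be rolled up the same way, because the table does not give the number of fixtures per category.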

Performance

Per-call latency from the runner's recorded latency_ms.

Calls timed	p50	mean	p95	max	Total
147	29.89 s	33.54 s	66.68 s	148.54 s	82.2 min
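
The total is consistent with the mean: 147 calls × 33.54 s ≈ 4,930 s ≈ 82.2 min. As a rough sketch of how such a summary could be derived from a list of `latency_ms` values, assuming linear-interpolated percentiles (the harness may use a different method) and using hypothetical sample values rather than data from this run:

```python
from statistics import mean


def percentile(sorted_vals, q):
    """Linear-interpolated percentile for q in [0, 100]; method is an assumption."""
    if not sorted_vals:
        raise ValueError("no latencies recorded")
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def summarize(latencies_ms):
    """Convert per-call latencies (ms) into the columns shown in the table."""
    secs = sorted(ms / 1000 for ms in latencies_ms)
    return {
        "calls": len(secs),
        "p50_s": percentile(secs, 50),
        "mean_s": mean(secs),
        "p95_s": percentile(secs, 95),
        "max_s": secs[-1],
        "total_min": sum(secs) / 60,
    }


# Hypothetical latency_ms values, not taken from the GPT-5 run.
sample = [12000, 30000, 18000, 60000, 24000]
summary = summarize(sample)
print(summary)
```

A mean above the median with a long maximum, as in the table, points to a small number of slow calls pulling the average up.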

Quality flag distribution

No quality flags raised.

Notable disagreements

Entries where the runner's self-reported status disagrees with the harness verdict. These are the most informative entries for understanding model blind spots.

Entry	Harness verdict	Disagreement
`sp_authz_01_admin_delete`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_115`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_117`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_129`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_20`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_23`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_272`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_317`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_318`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_iac_ckv_ckv_aws_84`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_k8s_cis_5_1_3`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_k8s_cis_5_2_8`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_k8s_gk_k8sblockendpointeditdefaultrole_block-endpoint-default-role`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_k8s_gk_k8sblocknodeport_block-nodeport-services`	FIXTURE_FAIL	claimed VALID but every fixture fails
`sp_k8s_gk_k8sdisallowanonymous_disallow-anonymous`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_k8s_gk_k8sdisallowanonymous_disallow-authenticated`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_k8s_gk_k8sexternalips_allowed-ip`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_k8s_gk_k8spspfsgroup_fsgroup`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_k8s_gk_k8spsphostnamespace_host-namespace`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_k8s_gk_k8spsphostnetworkingports_port-range-with-host-network-forbidden`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_k8s_gk_k8spsphostprocess_host-process-disallowed`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_k8s_gk_k8spspseccompv2_seccomp-restricted`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_k8s_gk_noupdateserviceaccount`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_k8s_gk_verifydeprecatedapi_verifydeprecatedapi-1_29`	COMPILE_ERROR	claimed VALID but Rego does not compile
`sp_k8s_kyv_block-updates-deletes`	COMPILE_ERROR	claimed VALID but Rego does not compile

…and 1 more.
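
The two disagreement types in the table can be expressed as a simple rule over each entry's self-report and harness outcome. The sketch below is illustrative only: the `Entry` fields, the sample entry IDs, and the classification order are assumptions, while the verdict labels and messages are copied from the table.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Entry:
    # All field names here are hypothetical, not the harness's JSON schema.
    entry_id: str
    self_reported: str      # e.g. "VALID"
    compiles: bool          # did the generated Rego compile?
    fixtures_passed: int
    fixtures_total: int


def disagreement(e: Entry) -> Optional[Tuple[str, str]]:
    """Return (harness verdict, message) when a VALID claim is contradicted."""
    if e.self_reported != "VALID":
        return None
    if not e.compiles:
        return ("COMPILE_ERROR", "claimed VALID but Rego does not compile")
    if e.fixtures_total > 0 and e.fixtures_passed == 0:
        return ("FIXTURE_FAIL", "claimed VALID but every fixture fails")
    return None


# Hypothetical entries for illustration, not rows from this report.
entries = [
    Entry("example_compile_error", "VALID", False, 0, 4),
    Entry("example_all_fixtures_fail", "VALID", True, 0, 3),
    Entry("example_partial_pass", "VALID", True, 2, 3),
    Entry("example_full_pass", "VALID", True, 3, 3),
]

flagged = {e.entry_id: disagreement(e) for e in entries if disagreement(e)}
for entry_id, (verdict, message) in flagged.items():
    print(f"{entry_id}\t{verdict}\t{message}")
```

Under this rule an entry that passes only some of its fixtures is not flagged, even though it does not count as fully verified.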

Badge

Embed this on your site to show your PolicyBench score:

HTML

<a href="https://policybench.dev/models/gpt-5.html">
  <img src="https://policybench.dev/badges/gpt-5.svg" alt="PolicyBench: 36.7%" />
</a>

Markdown

[![PolicyBench: 36.7%](https://policybench.dev/badges/gpt-5.svg)](https://policybench.dev/models/gpt-5.html)

Direct link: https://policybench.dev/badges/gpt-5.svg. The badge is regenerated whenever the underlying score changes.

Source

Harness verdict (JSON) - what PolicyBench's evaluator recorded
Runner result (JSON) - the raw output the runner captured from the model
Runner source - the script that called the model

Other tools

PolicyAsLanguage - 90.5% entry-level pass
Gemini 3 Pro - 57.8% entry-level pass
Gemini 3 Flash - 54.4% entry-level pass
Claude Opus 4.7 - 46.9% entry-level pass
Claude Sonnet 4.6 - 36.1% entry-level pass
GPT-5 mini - 25.2% entry-level pass