PolicyBench

GPT-5 - PolicyBench v1.0 Results

54 of 147 (36.7%) policies fully verified, 76.5% fixture-level pass rate.

Tier: Frontier LLM
Model version: gpt-5
Evaluated: 2026-05-06
Evaluator: 1.0
OPA: 1.16.1
Corpus: 147 entries

Headline

| Self-reported VALID | Compile OK | Fixture PASS (entry-level) | Fixture pass rate (fixture-level) | Self-report agreement |
| --- | --- | --- | --- | --- |
| 100.0% | 83.0% | 36.7% | 76.5% | 0.367 |
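
The self-report agreement figure matches the entry-level fixture pass rate. One reading that fits is that agreement is the fraction of entries whose self-reported status matches the harness verdict. The sketch below works through that reading. The function name, the `PASS` label for a fully verified entry, and the `NOT_PASS` placeholder are assumptions made for illustration, not the evaluator's actual definitions.

```python
def self_report_agreement(records):
    """Fraction of entries whose self-reported status matches the harness verdict.

    Each record is (self_status, harness_verdict). "PASS" as the label for a
    fully verified entry is an assumption; "NOT_PASS" is a placeholder for any
    other verdict.
    """
    agree = sum(
        1
        for self_status, verdict in records
        if (self_status == "VALID") == (verdict == "PASS")
    )
    return agree / len(records)


# All 147 entries self-report VALID; 54 are fully verified by the harness.
records = [("VALID", "PASS")] * 54 + [("VALID", "NOT_PASS")] * 93
agreement = self_report_agreement(records)
print(round(agreement, 3))  # 0.367
```

Under this reading, a model that always claims VALID can never score higher on agreement than its entry-level pass rate.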

By category

| Category | Entries | Compile OK | Fixture PASS (entry) | Fixture pass rate |
| --- | --- | --- | --- | --- |
| application_authz | 3 | 66.7% | 33.3% | 75.0% |
| iac_scanning | 65 | 86.2% | 24.6% | 74.6% |
| kubernetes_admission | 79 | 81.0% | 46.8% | 78.6% |
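
The category rows can be cross-checked against the headline figures by converting each percentage back to an entry count and summing. The sketch below does this for the two entry-level columns. The implied counts are derived by rounding and are not published on this page. The fixture-level pass rate is left out because the number of fixtures per category is not given here.

```python
# (entries, compile_ok_pct, fixture_pass_entry_pct) copied from the category table.
rows = {
    "application_authz": (3, 66.7, 33.3),
    "iac_scanning": (65, 86.2, 24.6),
    "kubernetes_admission": (79, 81.0, 46.8),
}


def implied_count(entries, pct):
    """Recover the whole-entry count behind a rounded percentage."""
    return round(entries * pct / 100)


total = sum(n for n, _, _ in rows.values())
compile_ok = sum(implied_count(n, c) for n, c, _ in rows.values())
fixture_pass = sum(implied_count(n, f) for n, _, f in rows.values())

print(total, compile_ok, fixture_pass)
print(round(100 * compile_ok / total, 1), round(100 * fixture_pass / total, 1))
```

The recovered totals reproduce the headline values of 83.0% Compile OK and 36.7% Fixture PASS over 147 entries.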

Performance

Per-call latency from the runner's recorded latency_ms.

| Calls timed | p50 | mean | p95 | max | Total |
| --- | --- | --- | --- | --- | --- |
| 147 | 29.89 s | 33.54 s | 66.68 s | 148.54 s | 82.2 min |
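
A summary like this can be derived from a list of `latency_ms` values roughly as follows. This is a minimal sketch: the function name is hypothetical, and the nearest-rank p95 is an assumption, since the page does not say which percentile method the evaluator uses.

```python
import math
from statistics import mean, median


def summarize_latency(latency_ms):
    """Summarize per-call latencies given in milliseconds."""
    secs = sorted(ms / 1000.0 for ms in latency_ms)
    # Nearest-rank p95: an assumption; the evaluator may interpolate instead.
    rank = max(1, math.ceil(0.95 * len(secs)))
    return {
        "calls": len(secs),
        "p50_s": median(secs),
        "mean_s": mean(secs),
        "p95_s": secs[rank - 1],
        "max_s": secs[-1],
        "total_min": sum(secs) / 60.0,
    }


# Consistency check on the published row: total should equal calls x mean.
reported_total_min = 147 * 33.54 / 60.0
print(round(reported_total_min, 1))  # 82.2
```

The published total is consistent with 147 sequential calls at the reported mean, which suggests the entries were run one at a time rather than in parallel.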

Quality flag distribution

No quality flags raised.

Notable disagreements

Entries where the runner's self-reported status disagrees with the harness verdict. These are the most informative entries for understanding model blind spots.

| Entry | Harness verdict | Disagreement |
| --- | --- | --- |
| sp_authz_01_admin_delete | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_iac_ckv_ckv_aws_115 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_iac_ckv_ckv_aws_117 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_iac_ckv_ckv_aws_129 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_iac_ckv_ckv_aws_20 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_iac_ckv_ckv_aws_23 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_iac_ckv_ckv_aws_272 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_iac_ckv_ckv_aws_317 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_iac_ckv_ckv_aws_318 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_iac_ckv_ckv_aws_84 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_cis_5_1_3 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_cis_5_2_8 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_k8sblockendpointeditdefaultrole_block-endpoint-default-role | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_k8sblocknodeport_block-nodeport-services | FIXTURE_FAIL | claimed VALID but every fixture fails |
| sp_k8s_gk_k8sdisallowanonymous_disallow-anonymous | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_k8sdisallowanonymous_disallow-authenticated | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_k8sexternalips_allowed-ip | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_k8spspfsgroup_fsgroup | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_k8spsphostnamespace_host-namespace | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_k8spsphostnetworkingports_port-range-with-host-network-forbidden | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_k8spsphostprocess_host-process-disallowed | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_k8spspseccompv2_seccomp-restricted | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_noupdateserviceaccount | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_verifydeprecatedapi_verifydeprecatedapi-1_29 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_kyv_block-updates-deletes | COMPILE_ERROR | claimed VALID but Rego does not compile |

…and 1 more.
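
The list is much shorter than the set of entries that are not fully verified. This suggests that an entry is flagged only when it fails to compile or when none of its fixtures pass, and that partial fixture failures are not listed. The sketch below encodes that apparent rule. The function name and its parameters are hypothetical, and the rule itself is inferred from the table rather than taken from the harness.

```python
def describe_disagreement(self_status, compiled, fixtures_passed, fixtures_total):
    """Return (harness_verdict, message) for a notable disagreement, else None.

    Inferred rule, not the harness's documented behaviour: only entries that
    claim VALID and either fail to compile or pass zero fixtures are flagged.
    """
    if self_status != "VALID":
        return None
    if not compiled:
        return ("COMPILE_ERROR", "claimed VALID but Rego does not compile")
    if fixtures_total > 0 and fixtures_passed == 0:
        return ("FIXTURE_FAIL", "claimed VALID but every fixture fails")
    # Partial fixture failures lower the pass rate but are not listed here.
    return None


print(describe_disagreement("VALID", False, 0, 4))
print(describe_disagreement("VALID", True, 0, 4))
print(describe_disagreement("VALID", True, 3, 4))
```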

Badge

Embed this on your site to show your PolicyBench score:

PolicyBench score badge for GPT-5

HTML

<a href="https://policybench.dev/models/gpt-5.html">
  <img src="https://policybench.dev/badges/gpt-5.svg" alt="PolicyBench: 36.7%" />
</a>

Markdown

[![PolicyBench: 36.7%](https://policybench.dev/badges/gpt-5.svg)](https://policybench.dev/models/gpt-5.html)

Direct link: https://policybench.dev/badges/gpt-5.svg. The badge is regenerated whenever the underlying score changes.
