GPT-5 - PolicyBench v1.0 Results
54 of 147 policies (36.7%) fully verified; 76.5% fixture-level pass rate.
Headline
| Self-reported VALID | Compile OK | Fixture PASS (entry-level) | Fixture pass rate (fixture-level) | Self-report agreement |
|---|---|---|---|---|
| 100.0% | 83.0% | 36.7% | 76.5% | 0.367 |
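The gap between the entry-level figure (36.7%) and the fixture-level figure (76.5%) comes from how each one is counted. The sketch below is a minimal illustration, assuming an entry counts as a PASS only when it compiles and every one of its fixtures passes, while the fixture-level rate pools all fixtures across entries. The `EntryResult` type, its field names, and the sample data are hypothetical and are not PolicyBench's actual schema. Whether fixtures of non-compiling entries count in the fixture-level denominator is also an assumption here.

```python
from dataclasses import dataclass


@dataclass
class EntryResult:
    # Hypothetical record shape; the real harness JSON may differ.
    entry_id: str
    compiled: bool
    fixtures_passed: int
    fixtures_total: int


def entry_level_pass_rate(results):
    """Share of entries that compile and pass every one of their fixtures."""
    passed = sum(
        1
        for r in results
        if r.compiled
        and r.fixtures_total > 0
        and r.fixtures_passed == r.fixtures_total
    )
    return passed / len(results)


def fixture_level_pass_rate(results):
    """Share of individual fixtures that pass, pooled across all entries."""
    total = sum(r.fixtures_total for r in results)
    passed = sum(r.fixtures_passed for r in results)
    return passed / total


# Made-up sample: one clean pass, one partial pass, one compile failure.
sample = [
    EntryResult("a", True, 4, 4),
    EntryResult("b", True, 3, 4),
    EntryResult("c", False, 0, 4),
]

print(f"entry-level:   {entry_level_pass_rate(sample):.1%}")
print(f"fixture-level: {fixture_level_pass_rate(sample):.1%}")
```

Under these assumptions a single failing fixture drops the whole entry from the entry-level count, which is why that number can sit far below the fixture-level rate.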
By category
| Category | Entries | Compile OK | Fixture PASS (entry) | Fixture pass rate |
|---|---|---|---|---|
| application_authz | 3 | 66.7% | 33.3% | 75.0% |
| iac_scanning | 65 | 86.2% | 24.6% | 74.6% |
| kubernetes_admission | 79 | 81.0% | 46.8% | 78.6% |
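The per-category percentages can be cross-checked against the headline numbers by converting them back to entry counts. This is a small arithmetic sketch using only the values in the table above; the rounding step assumes each percentage was derived from a whole number of entries. The fixture pass rate column cannot be recomputed this way because per-category fixture counts are not listed.

```python
# (category, entries, compile_ok_pct, fixture_pass_entry_pct) copied from the table above
rows = [
    ("application_authz", 3, 66.7, 33.3),
    ("iac_scanning", 65, 86.2, 24.6),
    ("kubernetes_admission", 79, 81.0, 46.8),
]


def to_count(entries, pct):
    """Recover the whole-number entry count behind a rounded percentage."""
    return round(entries * pct / 100)


total_entries = sum(n for _, n, _, _ in rows)
compile_ok = sum(to_count(n, c) for _, n, c, _ in rows)
fixture_pass = sum(to_count(n, p) for _, n, _, p in rows)

print(f"Compile OK:   {compile_ok}/{total_entries} = {compile_ok / total_entries:.1%}")
print(f"Fixture PASS: {fixture_pass}/{total_entries} = {fixture_pass / total_entries:.1%}")
```

The recovered totals are 122 of 147 entries compiling (83.0%) and 54 of 147 passing at entry level (36.7%), which agrees with the Headline table.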
Performance
Per-call latency, taken from the latency_ms value the runner records for each call.
| Calls timed | p50 | mean | p95 | max | Total |
|---|---|---|---|---|---|
| 147 | 29.89 s | 33.54 s | 66.68 s | 148.54 s | 82.2 min |
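A summary row like this one could be produced from a list of latency_ms values roughly as follows. This is a sketch only: the percentile method the runner actually uses is not stated, so a nearest-rank percentile is assumed here, and the function names and sample values are hypothetical.

```python
import math
import statistics


def nearest_rank(sorted_ms, pct):
    """Nearest-rank percentile (an assumed method, not necessarily the runner's)."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_ms)))
    return sorted_ms[rank - 1]


def summarize(latencies_ms):
    """Return the table's columns: seconds for per-call stats, minutes for the total."""
    ordered = sorted(latencies_ms)
    return {
        "calls": len(ordered),
        "p50_s": nearest_rank(ordered, 50) / 1000,
        "mean_s": statistics.fmean(ordered) / 1000,
        "p95_s": nearest_rank(ordered, 95) / 1000,
        "max_s": ordered[-1] / 1000,
        "total_min": sum(ordered) / 60000,
    }


# Made-up sample values, not data from this run.
summary = summarize([1000, 2000, 3000, 4000, 10000])
print(summary)
```

As a consistency check on the reported row, 147 calls at a mean of 33.54 s come to about 82.2 min, which matches the Total column.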
Quality flag distribution
No quality flags raised.
Notable disagreements
Entries where the runner's self-reported status disagrees with the harness verdict. These are the most informative entries for understanding model blind spots.
| Entry | Harness verdict | Disagreement |
|---|---|---|
| sp_authz_01_admin_delete | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_iac_ckv_ckv_aws_115 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_iac_ckv_ckv_aws_117 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_iac_ckv_ckv_aws_129 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_iac_ckv_ckv_aws_20 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_iac_ckv_ckv_aws_23 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_iac_ckv_ckv_aws_272 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_iac_ckv_ckv_aws_317 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_iac_ckv_ckv_aws_318 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_iac_ckv_ckv_aws_84 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_cis_5_1_3 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_cis_5_2_8 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_k8sblockendpointeditdefaultrole_block-endpoint-default-role | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_k8sblocknodeport_block-nodeport-services | FIXTURE_FAIL | claimed VALID but every fixture fails |
| sp_k8s_gk_k8sdisallowanonymous_disallow-anonymous | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_k8sdisallowanonymous_disallow-authenticated | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_k8sexternalips_allowed-ip | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_k8spspfsgroup_fsgroup | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_k8spsphostnamespace_host-namespace | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_k8spsphostnetworkingports_port-range-with-host-network-forbidden | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_k8spsphostprocess_host-process-disallowed | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_k8spspseccompv2_seccomp-restricted | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_noupdateserviceaccount | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_verifydeprecatedapi_verifydeprecatedapi-1_29 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_kyv_block-updates-deletes | COMPILE_ERROR | claimed VALID but Rego does not compile |
…and 1 more.
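The Disagreement column pairs a self-reported status with a harness verdict. One way such a description could be derived is sketched below, assuming the verdict strings COMPILE_ERROR and FIXTURE_FAIL as they appear in the table, and assuming that only compile failures and all-fixtures-failed entries are surfaced as notable. The function name, its parameters, and the PASS verdict string used in the usage lines are hypothetical.

```python
from typing import Optional


def describe_disagreement(
    self_reported: str,
    verdict: str,
    fixtures_passed: int,
    fixtures_total: int,
) -> Optional[str]:
    """Return a description for a notable disagreement, or None if there is none."""
    if self_reported != "VALID":
        # Only over-confident self-reports are treated as notable in this sketch.
        return None
    if verdict == "COMPILE_ERROR":
        return "claimed VALID but Rego does not compile"
    if verdict == "FIXTURE_FAIL" and fixtures_total > 0 and fixtures_passed == 0:
        return "claimed VALID but every fixture fails"
    # Partial fixture failures and passes are not listed as notable here (assumption).
    return None


print(describe_disagreement("VALID", "COMPILE_ERROR", 0, 4))
print(describe_disagreement("VALID", "FIXTURE_FAIL", 0, 4))
print(describe_disagreement("VALID", "PASS", 4, 4))
```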
Badge
Embed this on your site to show your PolicyBench score:
HTML
<a href="https://policybench.dev/models/gpt-5.html">
<img src="https://policybench.dev/badges/gpt-5.svg" alt="PolicyBench: 36.7%" />
</a>
Markdown
[![PolicyBench: 36.7%](https://policybench.dev/badges/gpt-5.svg)](https://policybench.dev/models/gpt-5.html)
Source
- Harness verdict (JSON) - what PolicyBench's evaluator recorded
- Runner result (JSON) - the raw output the runner captured from the model
- Runner source - the script that called the model
Other tools
- PolicyAsLanguage - 90.5% entry-level pass
- Gemini 3 Pro - 57.8% entry-level pass
- Gemini 3 Flash - 54.4% entry-level pass
- Claude Opus 4.7 - 46.9% entry-level pass
- Claude Sonnet 4.6 - 36.1% entry-level pass
- GPT-5 mini - 25.2% entry-level pass