PolicyBench

Claude Sonnet 4.6 - PolicyBench v1.0 Results

53 of 147 policies (36.1%) fully verified; fixture-level pass rate 78.2%.

Tier: Cost-tier LLM
Model version: claude-sonnet-4-6
Evaluated: 2026-05-06
Evaluator: 1.0
OPA: 1.16.1
Corpus: 147 entries

Headline

| Self-reported VALID | Compile OK | Fixture PASS (entry-level) | Fixture pass rate (fixture-level) | Self-report agreement |
|---|---|---|---|---|
| 84.4% | 81.6% | 36.1% | 78.2% | 0.517 |
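The gap between the entry-level figure (36.1%) and the fixture-level figure (78.2%) is easier to see with a small example. The sketch below assumes that an entry counts as passing only when every one of its fixtures passes, while the fixture-level rate pools all fixtures across entries. That reading is an assumption based on the column names, not the evaluator's documented definition, and the entry names and outcomes are made up for illustration.

```python
# Hypothetical fixture outcomes per entry: True means the fixture passed.
# Entry names and results are illustrative, not PolicyBench data.
results = {
    "entry_a": [True, True, True],
    "entry_b": [True, False, True, True],
    "entry_c": [False, False],
}


def entry_level_pass_rate(results):
    """Fraction of entries whose fixtures ALL pass (assumed definition)."""
    passed = sum(1 for fixtures in results.values() if fixtures and all(fixtures))
    return passed / len(results)


def fixture_level_pass_rate(results):
    """Fraction of individual fixtures that pass, pooled across all entries."""
    outcomes = [ok for fixtures in results.values() for ok in fixtures]
    return sum(outcomes) / len(outcomes)


entry_rate = entry_level_pass_rate(results)
fixture_rate = fixture_level_pass_rate(results)
print(f"entry-level:   {entry_rate:.1%}")
print(f"fixture-level: {fixture_rate:.1%}")

# The headline percentage follows directly from the reported counts.
headline_pct = round(53 / 147 * 100, 1)
print(f"headline:      {headline_pct}%")
```

Under this reading, a single failing fixture is enough to fail an entry, so an entry-level rate well below the fixture-level rate is expected whenever failures are spread across many entries.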

By category

| Category | Entries | Compile OK | Fixture PASS (entry) | Fixture pass rate |
|---|---|---|---|---|
| application_authz | 3 | 100.0% | 66.7% | 83.3% |
| iac_scanning | 65 | 96.9% | 33.8% | 76.8% |
| kubernetes_admission | 79 | 68.4% | 36.7% | 79.9% |
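As a consistency check, the per-category percentages can be converted back to entry counts and summed to see whether they reproduce the headline figures. The sketch below uses only the numbers in the category table; the rounding step assumes each percentage was derived from an integer count and rounded to one decimal place.

```python
# (entries, compile_ok_pct, fixture_pass_entry_pct) copied from the category table.
categories = {
    "application_authz": (3, 100.0, 66.7),
    "iac_scanning": (65, 96.9, 33.8),
    "kubernetes_admission": (79, 68.4, 36.7),
}


def to_count(entries, pct):
    """Recover an integer count from a percentage rounded to one decimal."""
    return round(entries * pct / 100)


total = sum(n for n, _, _ in categories.values())
compiled = sum(to_count(n, c) for n, c, _ in categories.values())
passed = sum(to_count(n, p) for n, _, p in categories.values())

print(f"entries:      {total}")
print(f"compile OK:   {compiled} ({compiled / total:.1%})")
print(f"fixture PASS: {passed} ({passed / total:.1%})")
```

The recovered totals (120 compiling, 53 passing, out of 147) match the headline 81.6% and 36.1%. They also show that 25 of the 27 compile failures fall in kubernetes_admission.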

Performance

Per-call latency from the runner's recorded latency_ms.

| Calls timed | p50 | mean | p95 | max | Total |
|---|---|---|---|---|---|
| 147 | 5.38 s | 11.78 s | 48.95 s | 63.36 s | 28.9 min |
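The mean (11.78 s) is more than twice the median (5.38 s), which points to a long tail of slow calls. The total is simply the mean multiplied by the number of calls. A minimal sketch of how such a summary could be computed from a list of `latency_ms` values follows. The nearest-rank percentile method and the sample values are assumptions, since the report does not say which method the evaluator uses.

```python
import math


def percentile(sorted_ms, p):
    """Nearest-rank percentile of an ascending list (p in 0..100)."""
    rank = max(1, math.ceil(p / 100 * len(sorted_ms)))
    return sorted_ms[rank - 1]


def summarize(latencies_ms):
    """Summarize per-call latencies (milliseconds) in the units the report uses."""
    ordered = sorted(latencies_ms)
    total_ms = sum(ordered)
    return {
        "calls": len(ordered),
        "p50_s": percentile(ordered, 50) / 1000,
        "mean_s": total_ms / len(ordered) / 1000,
        "p95_s": percentile(ordered, 95) / 1000,
        "max_s": ordered[-1] / 1000,
        "total_min": total_ms / 60000,
    }


# Hypothetical latencies, not taken from the PolicyBench run.
sample_ms = [1200, 3400, 5100, 5600, 9800, 48000]
summary = summarize(sample_ms)
print(summary)

# Cross-check of the reported row: 147 calls at a mean of 11.78 s each.
reported_total_min = round(147 * 11.78 / 60, 1)
print(reported_total_min)
```

The cross-check reproduces the reported total of 28.9 min.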

Quality flag distribution

No quality flags raised.

Notable disagreements

Entries where the runner's self-reported status disagrees with the harness verdict. These are the most informative entries for understanding model blind spots.

| Entry | Harness verdict | Disagreement |
|---|---|---|
| sp_k8s_cis_5_2_11 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_cis_5_2_14 | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_k8spsphostnetworkingports_port-range-with-host-network-allowed | COMPILE_ERROR | claimed VALID but Rego does not compile |
| sp_k8s_gk_k8spspseccompv2_seccomp-restricted | COMPILE_ERROR | claimed VALID but Rego does not compile |
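A disagreement of this kind can be found by comparing each entry's self-reported status with the harness verdict. The sketch below is one possible way to do that and is not the evaluator's actual code. The `EntryResult` record, its field names, and the `PASS` verdict label are hypothetical; only `VALID`, `COMPILE_ERROR`, and the description string appear in the report.

```python
from dataclasses import dataclass


@dataclass
class EntryResult:
    entry_id: str
    self_reported: str    # status the model claimed, e.g. "VALID"
    harness_verdict: str  # status the harness measured, e.g. "COMPILE_ERROR"


def describe_disagreement(result):
    """Return a description if the self-report contradicts the harness, else None."""
    if result.self_reported == "VALID" and result.harness_verdict == "COMPILE_ERROR":
        return "claimed VALID but Rego does not compile"
    return None


results = [
    EntryResult("sp_k8s_cis_5_2_11", "VALID", "COMPILE_ERROR"),
    EntryResult("hypothetical_entry", "VALID", "PASS"),  # illustrative only
]

disagreements = []
for result in results:
    description = describe_disagreement(result)
    if description is not None:
        disagreements.append((result.entry_id, description))

for entry_id, description in disagreements:
    print(f"{entry_id}: {description}")
```

Only the overconfident case is covered here (a claim of VALID that fails to compile), because that is the only kind of disagreement listed in the table.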

Badge

Embed this on your site to show your PolicyBench score:

HTML

<a href="https://policybench.dev/models/claude-sonnet-4-6.html">
  <img src="https://policybench.dev/badges/claude-sonnet-4-6.svg" alt="PolicyBench: 36.1%" />
</a>

Markdown

[![PolicyBench: 36.1%](https://policybench.dev/badges/claude-sonnet-4-6.svg)](https://policybench.dev/models/claude-sonnet-4-6.html)

Direct link: https://policybench.dev/badges/claude-sonnet-4-6.svg. The badge is regenerated whenever the underlying score changes.
