PolicyBench

Gemini 3 Pro - PolicyBench v1.0 Results

85 of 147 (57.8%) policies fully verified, 82.8% fixture-level pass rate.

Tier: Frontier LLM
Model version: gemini-3-pro-preview
Evaluated: 2026-05-05
Evaluator: 1.0
OPA: 1.16.1
Corpus: 147 entries

Headline

| Self-reported VALID | Compile OK | Fixture PASS (entry-level) | Fixture pass rate (fixture-level) | Self-report agreement |
| --- | --- | --- | --- | --- |
| 100.0% | 100.0% | 57.8% | 82.8% | 0.578 |
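The two fixture metrics use different denominators, which is why they diverge. The sketch below assumes that an entry counts as passing only when all of its fixtures pass (the reading suggested by "fully verified"), while the fixture-level rate pools every fixture across entries. The `EntryResult` type and its field names are hypothetical and not taken from the PolicyBench harness.

```python
from dataclasses import dataclass


@dataclass
class EntryResult:
    # Hypothetical record shape; the real harness output may differ.
    entry_id: str
    fixtures_passed: int
    fixtures_total: int


def entry_level_pass_rate(results):
    """Share of entries whose fixtures ALL pass (assumed definition)."""
    passed = sum(
        1
        for r in results
        if r.fixtures_total > 0 and r.fixtures_passed == r.fixtures_total
    )
    return passed / len(results)


def fixture_level_pass_rate(results):
    """Share of individual fixtures that pass, pooled across entries."""
    total = sum(r.fixtures_total for r in results)
    passed = sum(r.fixtures_passed for r in results)
    return passed / total


# Illustrative data only, not PolicyBench results.
results = [
    EntryResult("a", 4, 4),  # fully verified
    EntryResult("b", 3, 4),  # one failing fixture fails the whole entry
    EntryResult("c", 0, 2),  # every fixture fails
]

entry_rate = entry_level_pass_rate(results)
fixture_rate = fixture_level_pass_rate(results)
```

Under this definition a single failing fixture fails the whole entry, so the entry-level rate can never exceed the fixture-level rate.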

By category

| Category | Entries | Compile OK | Fixture PASS (entry) | Fixture pass rate |
| --- | --- | --- | --- | --- |
| application_authz | 3 | 100.0% | 66.7% | 83.3% |
| iac_scanning | 65 | 100.0% | 43.1% | 77.1% |
| kubernetes_admission | 79 | 100.0% | 69.6% | 88.4% |
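As a consistency check, the per-category entry-level percentages can be converted back to entry counts and summed. The sketch below uses only the figures from the table and assumes each percentage is a rounded ratio of whole entries. The result matches the headline of 85 of 147 (57.8%).

```python
# (entries, entry-level Fixture PASS %) copied from the category table.
categories = {
    "application_authz": (3, 66.7),
    "iac_scanning": (65, 43.1),
    "kubernetes_admission": (79, 69.6),
}

# Recover whole-entry counts from the rounded percentages.
passing = {
    name: round(entries * pct / 100)
    for name, (entries, pct) in categories.items()
}

total_entries = sum(entries for entries, _ in categories.values())
total_passing = sum(passing.values())
overall_pct = round(100 * total_passing / total_entries, 1)
```

Here `passing` works out to 2, 28, and 55 entries respectively.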

Performance

Per-call latency from the runner's recorded latency_ms.

| Calls timed | p50 | mean | p95 | max | Total |
| --- | --- | --- | --- | --- | --- |
| 147 | 18.26 s | 20.50 s | 43.31 s | 72.59 s | 50.2 min |
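A summary like the one above could be derived from per-call `latency_ms` values roughly as follows. This is a sketch: the report does not say which percentile method it uses, so nearest-rank is an assumption, and `summarize` and `nearest_rank` are hypothetical helper names.

```python
import math
import statistics


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile (assumed method; the report does not specify one)."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(latencies_ms):
    """Summarize per-call latencies given in milliseconds."""
    secs = sorted(ms / 1000 for ms in latencies_ms)
    return {
        "calls": len(secs),
        "p50_s": nearest_rank(secs, 50),
        "mean_s": statistics.fmean(secs),
        "p95_s": nearest_rank(secs, 95),
        "max_s": secs[-1],
        "total_min": sum(secs) / 60,
    }


# Illustrative latencies only, not the recorded PolicyBench data.
summary = summarize([12000, 18000, 20000, 25000, 45000])

# Cross-check of the published row: calls x mean should equal the total.
published_total_min = round(147 * 20.50 / 60, 1)
```

The cross-check reproduces the published total: 147 calls at a mean of 20.50 s is about 3013.5 s, or 50.2 min.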

Quality flag distribution

No quality flags raised.

Notable disagreements

Entries where the runner's self-reported status disagrees with the harness verdict. These are the most informative entries for understanding model blind spots.

| Entry | Harness verdict | Disagreement |
| --- | --- | --- |
| sp_iac_ckv_ckv_aws_248 | FIXTURE_FAIL | claimed VALID but every fixture fails |
| sp_k8s_gk_k8sdisallowanonymous_disallow-anonymous | FIXTURE_FAIL | claimed VALID but every fixture fails |
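All 147 entries self-report VALID but only 85 pass at entry level, so 62 entries disagree with the harness in some way. The table appears to list only those where no fixture passes at all. The sketch below shows one way such a filter could be written; the selection rule is inferred from the "claimed VALID but every fixture fails" wording, and the `Entry` fields are hypothetical rather than the harness's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    # Hypothetical record shape; not the harness's actual schema.
    entry_id: str
    self_reported: str
    fixtures_passed: int
    fixtures_total: int


def notable_disagreements(entries):
    """IDs of entries that claim VALID while every fixture fails (inferred rule)."""
    return [
        e.entry_id
        for e in entries
        if e.self_reported == "VALID"
        and e.fixtures_total > 0
        and e.fixtures_passed == 0
    ]


# Illustrative entries only; the fixture counts are made up.
entries = [
    Entry("all_pass", "VALID", 5, 5),
    Entry("partial_fail", "VALID", 3, 5),  # disagrees, but not "every fixture fails"
    Entry("total_fail", "VALID", 0, 4),
]

flagged = notable_disagreements(entries)
```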

Badge

Embed this on your site to show your PolicyBench score:

[Image: PolicyBench score badge for Gemini 3 Pro]

HTML

<a href="https://policybench.dev/models/gemini-3-pro.html">
  <img src="https://policybench.dev/badges/gemini-3-pro.svg" alt="PolicyBench: 57.8%" />
</a>

Markdown

[![PolicyBench: 57.8%](https://policybench.dev/badges/gemini-3-pro.svg)](https://policybench.dev/models/gemini-3-pro.html)

Direct link: https://policybench.dev/badges/gemini-3-pro.svg. The badge is regenerated whenever the underlying score changes.

Other tools