PolicyAsLanguage - PolicyBench v1.0 Results
133 of 147 (90.5%) policies fully verified, 96.1% fixture-level pass rate.
Headline
| Self-reported VALID | Compile OK | Fixture PASS (entry-level) | Fixture pass rate (fixture-level) | Self-report agreement |
|---|---|---|---|---|
| 99.3% | 99.3% | 90.5% | 96.1% | 0.912 |
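The two fixture columns measure different things, which is why they differ (90.5% vs. 96.1%). The page does not define them, so the sketch below assumes that an entry passes only when every one of its fixtures passes, while the fixture-level rate pools all fixtures across entries. The `EntryResult` record and its field names are hypothetical and are not PolicyBench's actual schema.

```python
from dataclasses import dataclass


@dataclass
class EntryResult:
    """Hypothetical per-entry record; not PolicyBench's real schema."""
    name: str
    fixtures_passed: int
    fixtures_total: int


def entry_level_pass_rate(results):
    """Share of entries whose fixtures ALL pass (assumed definition)."""
    if not results:
        return 0.0
    passed = sum(
        1 for r in results
        if r.fixtures_total > 0 and r.fixtures_passed == r.fixtures_total
    )
    return passed / len(results)


def fixture_level_pass_rate(results):
    """Share of individual fixtures that pass, pooled over all entries."""
    total = sum(r.fixtures_total for r in results)
    if total == 0:
        return 0.0
    return sum(r.fixtures_passed for r in results) / total


# Made-up data for illustration only.
sample = [
    EntryResult("a", 4, 4),
    EntryResult("b", 3, 4),  # one failing fixture fails the whole entry
    EntryResult("c", 4, 4),
    EntryResult("d", 2, 4),
]
entry_rate = entry_level_pass_rate(sample)
fixture_rate = fixture_level_pass_rate(sample)
```

Under this reading the entry-level figure can never exceed the fixture-level figure, because a single failing fixture costs the entry its pass while the other fixtures still count toward the pooled rate.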
By category
| Category | Entries | Compile OK | Fixture PASS (entry) | Fixture pass rate |
|---|---|---|---|---|
| application_authz | 3 | 100.0% | 100.0% | 100.0% |
| iac_scanning | 65 | 100.0% | 90.8% | 96.1% |
| kubernetes_admission | 79 | 98.7% | 89.9% | 96.0% |
Performance
Per-call latency, taken from the latency_ms values recorded by the runner.
| Calls timed | p50 | mean | p95 | max | Total |
|---|---|---|---|---|---|
| 146 | 1.93 s | 1.98 s | 2.98 s | 4.75 s | 4.8 min |
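A summary row like the one above could be derived from a list of per-call `latency_ms` values roughly as follows. This is a minimal sketch: the page does not say which percentile method the harness uses, so nearest-rank is assumed here, and the function names and output keys are illustrative.

```python
import math
import statistics


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (assumed method)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize_latency(latencies_ms):
    """Summarize per-call latencies given in milliseconds."""
    values = sorted(latencies_ms)
    if not values:
        raise ValueError("no latency samples")
    return {
        "calls": len(values),
        "p50_s": nearest_rank(values, 50) / 1000,
        "mean_s": statistics.fmean(values) / 1000,
        "p95_s": nearest_rank(values, 95) / 1000,
        "max_s": values[-1] / 1000,
        "total_min": sum(values) / 60000,
    }


# Made-up samples, not the benchmark's data.
sample_ms = [1000, 2000, 3000, 4000]
summary = summarize_latency(sample_ms)
```

As a sanity check on the reported row, the total is consistent with calls times mean: 146 × 1.98 s ≈ 289 s ≈ 4.8 min.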
Quality flag distribution
No quality flags raised.
Notable disagreements
Entries where the runner's self-reported status disagrees with the harness verdict. These are the most informative entries for understanding model blind spots.
No notable self-report / harness disagreements.
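The comparison described in this section can be sketched as a simple match between two booleans per entry. The page does not state how the self-report agreement score is computed or what makes a disagreement "notable", so the plain fraction of matching entries used below is an assumption, and the tuple layout is hypothetical.

```python
def self_report_agreement(records):
    """Fraction of entries where self-report and harness verdict match.

    records: iterable of (name, self_reported_valid, harness_passed).
    The simple-match definition is an assumption, not PolicyBench's.
    """
    records = list(records)
    if not records:
        return 0.0
    matches = sum(1 for _, reported, verdict in records if reported == verdict)
    return matches / len(records)


def disagreements(records):
    """Names of entries where the self-report and the harness differ."""
    return [name for name, reported, verdict in records if reported != verdict]


# Made-up entries for illustration only.
sample = [
    ("a", True, True),
    ("b", True, False),   # model claimed VALID, harness fixtures failed
    ("c", False, False),
    ("d", True, True),
]
agreement = self_report_agreement(sample)
mismatched = disagreements(sample)
```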
Tool homepage: https://policyaslanguage.com
Badge
Embed this on your site to show your PolicyBench score:
HTML
<a href="https://policybench.dev/models/policyaslanguage.html">
<img src="https://policybench.dev/badges/policyaslanguage.svg" alt="PolicyBench: 90.5%" />
</a>
Markdown
[![PolicyBench: 90.5%](https://policybench.dev/badges/policyaslanguage.svg)](https://policybench.dev/models/policyaslanguage.html)
Source
- Harness verdict (JSON) - what PolicyBench's evaluator recorded
- Runner result (JSON) - the raw output the runner captured from the model
- Runner source - the script that called the model
Other tools
- Gemini 3 Pro - 57.8% entry-level pass
- Gemini 3 Flash - 54.4% entry-level pass
- Claude Opus 4.7 - 46.9% entry-level pass
- GPT-5 - 36.7% entry-level pass
- Claude Sonnet 4.6 - 36.1% entry-level pass
- GPT-5 mini - 25.2% entry-level pass