§ Evaluability surface [ schema 2026-05-14.3 ]

What this audits — and what it doesn’t.

A per-call security and safety audit for LLM applications, sold via an MPP-priced HTTP endpoint. Anchored on NIST AI RMF and AI 600-1, with OWASP LLM Top 10 and MITRE ATLAS as engineer cross-references. Every number on this page is published in the machine-readable companion at /eval.json.

§ 01 [ NIST AI RMF coverage matrix ]

One row per applicable NIST AI RMF subcategory, with one column per tier of the ladder. ● covered, ◐ partial, · not in tier. Subcategories not covered at any tier are listed below as permanently out of scope, each with an honest reason; we do not claim coverage we do not have.

Subcategory	surface	active	compliance	Description
MAP-2.1	●	●	●	Tasks, methods, and side-effect characterization of the AI system.
MAP-2.2	●	●	●	Knowledge limits, scope-of-competence, refusal-on-uncertainty, and instruction-precedence in the system prompt.
MAP-3.3	●	●	●	Targeted application scope vs. tool capability blast radius.
MAP-3.4	●	●	●	Operator proficiency requirements vs. supporting affordances; autonomy interrupt/budget gates.
MAP-4.1	·	◐	●	Component and legal-risk surface, including dependencies. Compliance tier adds code-side dependency and source findings.
MEASURE-1.1	●	●	●	Evidence that measurement approaches are appropriate.
MEASURE-2.5	·	◐	●	AI system is valid and reliable for the intended task.
MEASURE-2.6	·	·	●	AI system is safe — harm avoidance and dangerous-output gating.
MEASURE-2.7	·	●	●	AI system is secure and resilient — adversarial probes. Active tier adds budget-capped adversarial runs against declared endpoints. Surface uses static analysis only.
MEASURE-2.8	·	◐	●	Transparency and accountability characterization.
MEASURE-2.9	·	●	●	Model explanation and output interpretation evidence.
MEASURE-2.10	·	●	●	Privacy risk — PII leakage, training-data extraction.
MEASURE-2.11	·	·	●	Fairness and harmful-bias evaluation.
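As a rough illustration of how an agent might read this matrix, the sketch below parses a few rows copied from the table above and answers "what is the lowest tier that fully covers this subcategory?". The function names `parse_matrix` and `lowest_tier` are hypothetical, and the tab-separated row format is an assumption taken from how the table appears on this page, not from /eval.json.

```python
# Legend symbols as defined above the matrix.
LEGEND = {"●": "covered", "◐": "partial", "·": "none"}
TIERS = ("surface", "active", "compliance")

# A few rows copied from the matrix (description column omitted).
ROWS = """\
MAP-2.1\t●\t●\t●
MAP-4.1\t·\t◐\t●
MEASURE-2.6\t·\t·\t●
MEASURE-2.7\t·\t●\t●
"""


def parse_matrix(text):
    """Map each subcategory to {tier: coverage level}."""
    matrix = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        subcategory, *marks = line.split("\t")
        matrix[subcategory] = {
            tier: LEGEND[mark] for tier, mark in zip(TIERS, marks)
        }
    return matrix


def lowest_tier(matrix, subcategory, level="covered"):
    """Return the first tier in ladder order with the given coverage level."""
    for tier in TIERS:
        if matrix[subcategory][tier] == level:
            return tier
    return None


matrix = parse_matrix(ROWS)
print(lowest_tier(matrix, "MEASURE-2.7"))  # active
print(lowest_tier(matrix, "MAP-4.1"))      # compliance
```

A caller that only needs adversarial-probe evidence (MEASURE-2.7) could use a lookup like this to avoid paying for a higher tier than necessary.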

Permanently out of scope

GOVERN	Organizational policy and accountability; not testable per-call.
MANAGE	Risk response and remediation processes; not testable per-call.
MAP-1	Context establishment beyond the technical surface.
MAP-5	Impact assessment beyond the technical surface.
MEASURE-2.12	Environmental impact and sustainability; not testable per-call.
MEASURE-2.13	TEVV effectiveness; process-level over time, not per-call.

§ 02 [ AI 600-1 risk-category mapping ]

The risk taxonomy from the NIST GenAI Profile (AI 600-1, Jul 2024). A risk is marked as covered at a tier when it appears in the union of cross_references.nist_ai_600_1 tags emitted by the rubric files at that tier.

AI 600-1 risk	surface	active	compliance
Confabulation	●	●	●
Dangerous, Violent, or Hateful Content	·	●	●
Data Privacy	·	●	●
Environmental Impact	·	·	·
Harmful Bias and Homogenization	·	·	●
Human-AI Configuration	●	●	●
Information Integrity	●	●	●
Information Security	●	●	●
Intellectual Property	·	·	●
Obscene, Degrading, and/or Abusive Content	·	●	●
Toxicity, Bias, and Homogenization	·	·	●
Value Chain / Component Integration	●	●	●
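The union described above can be sketched as follows. Only the key path cross_references.nist_ai_600_1 comes from this page; the record shape, the `tier` field, and the assumption that each tier inherits the tags of the tiers below it are hypothetical and chosen for illustration.

```python
from collections import defaultdict

TIERS = ("surface", "active", "compliance")

# Hypothetical rubric records. Real rubric files may be shaped differently.
RUBRICS = [
    {"tier": "surface",
     "cross_references": {"nist_ai_600_1": ["Confabulation", "Information Security"]}},
    {"tier": "active",
     "cross_references": {"nist_ai_600_1": ["Data Privacy", "Information Security"]}},
    {"tier": "compliance",
     "cross_references": {"nist_ai_600_1": ["Intellectual Property"]}},
]


def risks_by_tier(rubrics):
    """Union the AI 600-1 tags per tier, accumulating up the ladder."""
    own = defaultdict(set)
    for rubric in rubrics:
        own[rubric["tier"]].update(rubric["cross_references"]["nist_ai_600_1"])

    covered, running = {}, set()
    for tier in TIERS:
        # Assumption: a higher tier also runs every rubric of the lower tiers.
        running = running | own[tier]
        covered[tier] = running
    return covered


coverage = risks_by_tier(RUBRICS)
for tier in TIERS:
    print(tier, sorted(coverage[tier]))
```

With these sample records, Data Privacy first appears at the active tier and Intellectual Property only at compliance, which matches the corresponding rows of the table.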

§ 03 [ Pricing per call ]

surface [ live ]

$25.00 / call

Static categorization + system-prompt analysis. No live calls.

Categorize the system, run rubric-driven static analysis on the system prompt + tool list, return a NIST-tagged findings report. No outbound traffic to your endpoints.

active [ limited rollout ]

$250.00 / call

Attacker-agent runs against your declared endpoint, N=50.

Surface tier + budget-capped attacker agent against your declared endpoint with reproductions. Available in limited rollout while the runner hardens.

compliance [ planned ]

$800.00 / call

Active + code-scan (SAST) + compliance evidence pack.

Active tier + MEASURE-2.6/2.11 probes + code-scan findings + an evidence pack suitable for SOC 2 / customer questionnaires.
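The Surface-tier challenge in § 04 quotes its amount as the string "2500" with currency "usd", which suggests amounts are expressed in minor units (cents). The sketch below converts the listed prices under that assumption; the active and compliance strings it produces are inferred, not published values, and `to_minor_units` is a hypothetical helper.

```python
from decimal import Decimal

# Per-call prices as listed above, in US dollars.
PRICES_USD = {
    "surface": Decimal("25.00"),
    "active": Decimal("250.00"),
    "compliance": Decimal("800.00"),
}


def to_minor_units(price_usd):
    """Convert a dollar price to a minor-unit (cents) string.

    Assumption: challenge amounts are integer cents encoded as strings,
    as the Surface-tier example payload suggests.
    """
    return str(int(price_usd * 100))


for tier, price in PRICES_USD.items():
    print(tier, to_minor_units(price))
```

Using Decimal rather than float keeps the conversion exact, which matters when a client compares a challenge amount against its own budget before paying.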

§ 04 [ 402 challenge — example payload ]

The Surface-tier endpoint at https://buildpilled.io/agent-audit returns 402 with the inlined challenge below. The WWW-Authenticate header carries the same content for clients that follow the wire protocol; the inline body.challenge mirror exists because Google’s Front End strips that header on Cloud Run egress. Either source produces a valid SPT mint.

HTTP/1.1 402 Payment Required
content-type: application/problem+json
www-authenticate: Payment realm="buildpilled.io", method="stripe", intent="charge", id="<challenge-id>", request="<base64-encoded-PaymentRequest>"

{
  "type": "https://paymentauth.org/problems/payment-required",
  "title": "Payment Required",
  "status": 402,
  "detail": "Payment is required (BuildPilled agent-audit · Surface tier).",
  "challengeId": "<challenge-id>",
  "challenge": {
    "id": "<challenge-id>",
    "method": "stripe",
    "intent": "charge",
    "realm": "buildpilled.io",
    "request": {
      "amount": "2500",
      "currency": "usd",
      "methodDetails": {
        "networkId": "buildpilled",
        "paymentMethodTypes": [
          "card"
        ]
      }
    },
    "description": "BuildPilled agent-audit · Surface tier",
    "expires": "<iso8601-expiry>"
  }
}
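A client that wants to tolerate a stripped WWW-Authenticate header could read the challenge from either source. The sketch below is one way to do that, not the reference client: `extract_challenge` and `challenge_from_header` are hypothetical names, preferring the body over the header is an arbitrary choice, and decoding the header's request parameter as base64url-encoded JSON is an assumption, since this page only says it is base64-encoded.

```python
import base64
import json
import re

# Matches key="value" auth-params in the WWW-Authenticate header.
_PARAM = re.compile(r'(\w+)="([^"]*)"')


def challenge_from_header(value):
    """Parse a `Payment ...` WWW-Authenticate value into a dict."""
    scheme, _, params = value.partition(" ")
    if scheme.lower() != "payment":
        return None
    fields = dict(_PARAM.findall(params))
    raw = fields.get("request", "")
    try:
        # Assumption: base64url JSON, possibly without padding.
        padded = raw + "=" * (-len(raw) % 4)
        fields["request"] = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError:
        pass  # Leave the raw string in place if it does not decode.
    return fields


def extract_challenge(status, headers, body):
    """Return the challenge from body.challenge, else from the header."""
    if status != 402:
        return None
    try:
        doc = json.loads(body)
    except ValueError:
        doc = None
    if isinstance(doc, dict) and doc.get("challenge"):
        return doc["challenge"]
    lowered = {name.lower(): value for name, value in headers.items()}
    header = lowered.get("www-authenticate")
    return challenge_from_header(header) if header else None
```

Given the example response above, `extract_challenge(402, headers, body)["request"]["amount"]` would be "2500" whether or not the header survived the proxy.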

§ 05 [ Sandbox verification ]

Probing-before-paying is a first-class agent affordance. The Surface endpoint accepts sandbox Shared Payment Tokens end-to-end; receipts identify whether the call settled on the sandbox or live payment rail. No signup wall, no waitlist gate — agents can call the endpoint, read /eval.json, and decide.

Sandbox SPTs mint with a sandbox card payment method; the same flow is covered by automated verification before deployment.

§ 06 [ Latency ]

Tier	Latency
surface	measurement pending
active	not yet measured
compliance	not yet measured

Latency instrumentation for the Surface tier is in progress; numbers will land here once we have a representative measurement window. Manual single-call timing is currently sub-second.

§ 07 [ Eval baselines ]

Three real public LLM apps audited at pinned commits. Each report is summarized here and in the machine-readable companion, including finding counts, maximum severity, source commit, license, and aggregate coverage.

App	Category	findings	max severity	commit	license
Anthropic computer-use	autonomous-loop, high-risk	5	MEDIUM	4b2549e8	MIT
Aider editblock coder	developer-facing, medium-risk	9	HIGH	3ec8ec5a	Apache-2.0
MCP filesystem (modelcontextprotocol/servers)	file-management, medium-risk	12	MEDIUM	4503e2d1	MIT

Each report is summarized in /eval.json.

§ 08 [ Machine-readable companion ]

Everything on this page, plus the canonical 402 response shape and the runtime endpoint URLs, available as JSON for crawlers and agents:

GET /eval.json →