ASA Standard

Benchmark Methodology

Cross-Module · Priority: P1

How We Measure

The Trust Score is based on 32 automated safety checks (Phase 1) across five modules — Billing (8), Auth (8), Admin (4), Architecture (8), and Foundation (4). Each check produces a deterministic result: PASS, FAIL, UNKNOWN, or NOT APPLICABLE. The same codebase always produces the same score.
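The deterministic result model can be illustrated with a minimal Python sketch. The actual scoring formula is proprietary and is not reproduced here; the check IDs and the simple pass-ratio below are hypothetical, shown only to make "deterministic and reproducible" concrete:

```python
from enum import Enum

class CheckResult(Enum):
    """Possible outcomes of one automated safety check."""
    PASS = "pass"
    FAIL = "fail"
    UNKNOWN = "unknown"
    NOT_APPLICABLE = "not_applicable"

def trust_score(results: dict[str, CheckResult]) -> float:
    """Illustrative score: passed checks / applicable checks.

    UNKNOWN counts against the score (it is not a PASS);
    NOT_APPLICABLE checks are excluded from the denominator.
    The real formula is proprietary.
    """
    applicable = [r for r in results.values() if r != CheckResult.NOT_APPLICABLE]
    if not applicable:
        return 0.0
    passed = sum(1 for r in applicable if r == CheckResult.PASS)
    return passed / len(applicable)

# Deterministic: the same results always yield the same score.
results = {
    "BIL-01": CheckResult.PASS,            # hypothetical check IDs
    "BIL-02": CheckResult.FAIL,
    "AUTH-01": CheckResult.PASS,
    "ADM-01": CheckResult.NOT_APPLICABLE,  # excluded from the denominator
}
print(round(trust_score(results), 2))  # → 0.67 (2 passes / 3 applicable)
```

Because the function is a pure mapping from check results to a number, scanning the same codebase twice produces the same score, which is the property the paragraph above describes.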

Checks are executed locally via the Vibecodiq CLI. No source code leaves your machine. The scanner analyzes file patterns, configuration, database schema, and API route structure — it does not upload code to any server.
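The detection methodology itself is proprietary, but local-only analysis of this kind can be sketched. The glob patterns below are hypothetical examples for a Next.js-style layout, not the scanner's actual rules; the point is that everything happens on the local filesystem, with no network calls:

```python
from pathlib import Path

def find_api_routes(project_root: str) -> list[str]:
    """Purely local analysis: walk the project tree for API route files.

    Illustrative Next.js-style patterns only -- nothing is uploaded
    or sent over the network.
    """
    root = Path(project_root)
    patterns = ["app/api/**/route.ts", "pages/api/**/*.ts"]  # hypothetical patterns
    routes: list[str] = []
    for pattern in patterns:
        routes.extend(str(p.relative_to(root)) for p in root.glob(pattern))
    return sorted(routes)
```

A scanner built this way inspects file layout and configuration on disk and reports structure (e.g., which routes exist), which is consistent with the "no source code leaves your machine" guarantee above.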


What We Publish

When we publish benchmark data (e.g., "73% of AI-built apps fail this check"), the data comes exclusively from anonymized, aggregated, opt-in scans. Here's how:

Opt-In at Scan Time

When you run the CLI scanner, you see:

"Your scan results are private. We may use anonymized, aggregated data to publish industry benchmarks. No individual app, company, or user is ever identified."

You can opt out. Only opt-in scans contribute to benchmark data.

What We Collect (Opt-In Only)

Data                                       Collected?   Published?
Check results (PASS/FAIL per check)        ✅            Aggregated only
Tech stack (Next.js, Supabase, Stripe)     ✅            Stack-level aggregates
Trust Score grade                          ❌            Never individually
File paths, code snippets                  ❌            Never
Company name, user email                   ❌            Never
Repository URL                             ❌            Never
Builder tool (Lovable, Bolt, etc.)         ❌            Never per-tool

What We Never Collect

  • Source code, file contents, or code snippets
  • Personally identifiable information (PII)
  • Company names or project names
  • API keys, secrets, or credentials
  • Anything that could identify a specific application

Minimum Sample Sizes

We do not publish benchmark claims until sufficient data exists to make them statistically meaningful.

Claim Type                           Minimum Sample        Example
Overall check prevalence             n ≥ 200 scans         "73% of AI-built apps fail BIL-02"
Per-stack prevalence                 n ≥ 50 per stack      "Next.js + Supabase: 81% fail AUTH-02"
Trend data (quarter-over-quarter)    n ≥ 100 per quarter   "Improving ↓ (was 85% in Q1)"
"Most common" rankings               n ≥ 300 scans         "Top 5 most failed checks"

Every published benchmark includes the sample size: "(n=Y)".
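These thresholds amount to a simple publication gate. A sketch in Python — the claim-type keys and the sample counts in the demo are illustrative, not taken from a real pipeline:

```python
# Minimum sample sizes per claim type, matching the table above.
MIN_SAMPLES = {
    "overall_prevalence": 200,
    "per_stack_prevalence": 50,   # per stack
    "trend_quarterly": 100,       # per quarter
    "most_common_ranking": 300,
}

def can_publish(claim_type: str, n: int) -> bool:
    """A benchmark claim is publishable only once its sample threshold is met."""
    return n >= MIN_SAMPLES[claim_type]

def format_claim(text: str, n: int) -> str:
    """Every published benchmark carries its sample size."""
    return f"{text} (n={n})"

print(can_publish("per_stack_prevalence", 48))   # → False: below the n ≥ 50 bar
print(format_claim("73% of AI-built apps fail BIL-02", 412))  # hypothetical n
```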


Privacy Rules

  1. k-anonymity: No data slice is published where fewer than 10 distinct apps contributed
  2. No single-customer dominance: No single customer accounts for >20% of any published data cell
  3. No per-tool shaming: We never publish "Lovable apps fail X% of checks." Stack-level aggregates only
  4. Bias warning: Every benchmark includes: "Based on N scans of AI-built applications. Self-selection bias may apply."
  5. Anonymization at ingestion: Repository URLs, company names, and user emails are stripped before data enters the benchmark pipeline
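Rules 1 and 2 amount to a per-cell filter that could be sketched as follows, assuming each contributing scan carries an anonymized app/customer ID (the real pipeline's implementation is not public):

```python
from collections import Counter

K_ANONYMITY = 10   # rule 1: at least 10 distinct apps per published data cell
MAX_SHARE = 0.20   # rule 2: no single customer above 20% of a cell

def publishable(cell_contributions: list[str]) -> bool:
    """Decide whether one data cell may be published.

    cell_contributions holds one anonymized ID per contributing scan.
    The cell passes only if it satisfies both k-anonymity and the
    single-customer dominance cap.
    """
    if len(set(cell_contributions)) < K_ANONYMITY:
        return False
    counts = Counter(cell_contributions)
    return max(counts.values()) / len(cell_contributions) <= MAX_SHARE

cell = [f"app-{i}" for i in range(12)]        # 12 distinct apps, one scan each
print(publishable(cell))                      # → True
print(publishable(cell + ["app-0"] * 10))     # → False: app-0 dominates the cell
```

Because the IDs are anonymized at ingestion (rule 5), a filter like this never needs to see a repository URL, company name, or email.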

What Appears on Check Pages

When benchmark thresholds are met, individual check pages may display:

  • Prevalence bar: "73% of AI-built apps fail this check" (n=Y)
  • Stack comparison: "Next.js + Supabase: 81% fail | Remix + Supabase: 62% fail"
  • Trend arrow: "Improving ↓ (was 85% in Q1 2026)"

These are always linked back to this methodology page.


What the Trust Score Is Not

Trust Score evaluates production-readiness patterns across the billing, auth, admin, architecture, and foundation modules, as detected at scan time. It is not a guarantee, certification, or full security audit. Results reflect the state of the codebase at the moment of scanning.

The Trust Score:

  • ✅ Measures specific, known safety patterns in five modules
  • ✅ Is deterministic and reproducible
  • ✅ Is based on an open standard (ASA Standard)
  • ❌ Does not replace penetration testing
  • ❌ Does not guarantee security or compliance
  • ❌ Does not evaluate business logic correctness
  • ❌ Does not cover all possible vulnerabilities

Open Standard

The safety checks are based on the ASA Standard — an open architecture standard for AI-generated codebases. The check definitions, their rationale, and remediation guidance are published on this site. The detection methodology (how the scanner identifies issues) is proprietary.

What's open: Check names, descriptions, why they matter, remediation guidance, code examples.

What's proprietary: Detection patterns, scoring formulas, false-positive reduction, combination logic.


Want to contribute to the benchmark? Run Free Scan →