ASA Standard

Benchmark Methodology

Cross-Module · Priority: P1

How We Measure

The Trust Score is based on 32 automated safety checks (Phase 1) across five modules — Billing (8), Auth (8), Admin (4), Architecture (8), and Foundation (4). Each check produces a deterministic result: PASS, FAIL, UNKNOWN, or NOT APPLICABLE. The same codebase always produces the same score.
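The deterministic result model can be illustrated with a minimal Python sketch. The actual scoring formula is proprietary and is not reproduced here; the check IDs and the simple pass-ratio below are hypothetical, shown only to make "deterministic and reproducible" concrete:

```python
from enum import Enum

class CheckResult(Enum):
    """Possible outcomes of one automated safety check."""
    PASS = "pass"
    FAIL = "fail"
    UNKNOWN = "unknown"
    NOT_APPLICABLE = "not_applicable"

def trust_score(results: dict[str, CheckResult]) -> float:
    """Illustrative score: passed checks / applicable checks.

    UNKNOWN counts against the score (it is not a PASS);
    NOT_APPLICABLE checks are excluded from the denominator.
    The real formula is proprietary.
    """
    applicable = [r for r in results.values() if r != CheckResult.NOT_APPLICABLE]
    if not applicable:
        return 0.0
    passed = sum(1 for r in applicable if r == CheckResult.PASS)
    return passed / len(applicable)

# Deterministic: the same results always yield the same score.
results = {
    "BIL-01": CheckResult.PASS,            # hypothetical check IDs
    "BIL-02": CheckResult.FAIL,
    "AUTH-01": CheckResult.PASS,
    "ADM-01": CheckResult.NOT_APPLICABLE,  # excluded from the denominator
}
print(round(trust_score(results), 2))  # → 0.67 (2 passes / 3 applicable)
```

Because the function is a pure mapping from check results to a number, scanning the same codebase twice produces the same score, which is the property the paragraph above describes.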

Checks are executed locally via the Vibecodiq CLI. No source code leaves your machine. The scanner analyzes file patterns, configuration, database schema, and API route structure — it does not upload code to any server.
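The detection methodology itself is proprietary, but local-only analysis of this kind can be sketched. The glob patterns below are hypothetical examples for a Next.js-style layout, not the scanner's actual rules; the point is that everything happens on the local filesystem, with no network calls:

```python
from pathlib import Path

def find_api_routes(project_root: str) -> list[str]:
    """Purely local analysis: walk the project tree for API route files.

    Illustrative Next.js-style patterns only -- nothing is uploaded
    or sent over the network.
    """
    root = Path(project_root)
    patterns = ["app/api/**/route.ts", "pages/api/**/*.ts"]  # hypothetical patterns
    routes: list[str] = []
    for pattern in patterns:
        routes.extend(str(p.relative_to(root)) for p in root.glob(pattern))
    return sorted(routes)
```

A scanner built this way inspects file layout and configuration on disk and reports structure (e.g., which routes exist), which is consistent with the "no source code leaves your machine" guarantee above.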


What We Publish

When we publish benchmark data (e.g., "73% of AI-built apps fail this check"), the data comes exclusively from anonymized, aggregated, opt-in scans. Here's how:

Opt-In at Scan Time

When you run the CLI scanner, you see:

"Your scan results are private. We may use anonymized, aggregated data to publish industry benchmarks. No individual app, company, or user is ever identified."

You can opt out. Only opt-in scans contribute to benchmark data.

What We Collect (Opt-In Only)

Data                                       Collected?   Published?
Check results (PASS/FAIL per check)        ✅            Aggregated only
Tech stack (Next.js, Supabase, Stripe)     ✅            Stack-level aggregates
Trust Score grade                          ❌            Never individually
File paths, code snippets                  ❌            Never
Company name, user email                   ❌            Never
Repository URL                             ❌            Never
Builder tool (Lovable, Bolt, etc.)         ❌            Never per-tool

What We Never Collect

  • Source code, file contents, or code snippets
  • Personally identifiable information (PII)
  • Company names or project names
  • API keys, secrets, or credentials
  • Anything that could identify a specific application

Minimum Sample Sizes

We do not publish benchmark claims until sufficient data exists to make them statistically meaningful.

Claim Type                           Minimum Sample        Example
Overall check prevalence             n ≥ 200 scans         "73% of AI-built apps fail BIL-02"
Per-stack prevalence                 n ≥ 50 per stack      "Next.js + Supabase: 81% fail AUTH-02"
Trend data (quarter-over-quarter)    n ≥ 100 per quarter   "Improving ↓ (was 85% in Q1)"
"Most common" rankings               n ≥ 300 scans         "Top 5 most failed checks"

Every published benchmark includes the sample size: "(n=Y)".
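These thresholds amount to a simple publication gate. A sketch in Python — the claim-type keys and the sample counts in the demo are illustrative, not taken from a real pipeline:

```python
# Minimum sample sizes per claim type, matching the table above.
MIN_SAMPLES = {
    "overall_prevalence": 200,
    "per_stack_prevalence": 50,   # per stack
    "trend_quarterly": 100,       # per quarter
    "most_common_ranking": 300,
}

def can_publish(claim_type: str, n: int) -> bool:
    """A benchmark claim is publishable only once its sample threshold is met."""
    return n >= MIN_SAMPLES[claim_type]

def format_claim(text: str, n: int) -> str:
    """Every published benchmark carries its sample size."""
    return f"{text} (n={n})"

print(can_publish("per_stack_prevalence", 48))   # → False: below the n ≥ 50 bar
print(format_claim("73% of AI-built apps fail BIL-02", 412))  # hypothetical n
```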


Privacy Rules

  1. k-anonymity: No data slice is published where fewer than 10 distinct apps contributed
  2. No single-customer dominance: No single customer accounts for >20% of any published data cell
  3. No per-tool shaming: We never publish "Lovable apps fail X% of checks." Stack-level aggregates only
  4. Bias warning: Every benchmark includes: "Based on N scans of AI-built applications. Self-selection bias may apply."
  5. Anonymization at ingestion: Repository URLs, company names, and user emails are stripped before data enters the benchmark pipeline
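Rules 1 and 2 amount to a per-cell filter that could be sketched as follows, assuming each contributing scan carries an anonymized app/customer ID (the real pipeline's implementation is not public):

```python
from collections import Counter

K_ANONYMITY = 10   # rule 1: at least 10 distinct apps per published data cell
MAX_SHARE = 0.20   # rule 2: no single customer above 20% of a cell

def publishable(cell_contributions: list[str]) -> bool:
    """Decide whether one data cell may be published.

    cell_contributions holds one anonymized ID per contributing scan.
    The cell passes only if it satisfies both k-anonymity and the
    single-customer dominance cap.
    """
    if len(set(cell_contributions)) < K_ANONYMITY:
        return False
    counts = Counter(cell_contributions)
    return max(counts.values()) / len(cell_contributions) <= MAX_SHARE

cell = [f"app-{i}" for i in range(12)]        # 12 distinct apps, one scan each
print(publishable(cell))                      # → True
print(publishable(cell + ["app-0"] * 10))     # → False: app-0 dominates the cell
```

Because the IDs are anonymized at ingestion (rule 5), a filter like this never needs to see a repository URL, company name, or email.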

What Appears on Check Pages

When benchmark thresholds are met, individual check pages may display:

  • Prevalence bar: "73% of AI-built apps fail this check" (n=Y)
  • Stack comparison: "Next.js + Supabase: 81% fail | Remix + Supabase: 62% fail"
  • Trend arrow: "Improving ↓ (was 85% in Q1 2026)"

These are always linked back to this methodology page.


What the Trust Score Is Not

Trust Score evaluates production-readiness patterns across the billing, auth, admin, architecture, and foundation modules, as detected at scan time. It is not a guarantee, certification, or full security audit. Results reflect the state of the codebase at the moment of scanning.

The Trust Score:

  • ✅ Measures specific, known safety patterns in five modules
  • ✅ Is deterministic and reproducible
  • ✅ Is based on an open standard (ASA Standard)
  • ❌ Does not replace penetration testing
  • ❌ Does not guarantee security or compliance
  • ❌ Does not evaluate business logic correctness
  • ❌ Does not cover all possible vulnerabilities

Open Standard

The safety checks are based on the ASA Standard — an open architecture standard for AI-generated codebases. The check definitions, their rationale, and remediation guidance are published on this site. The detection methodology (how the scanner identifies issues) is proprietary.

What's open: Check names, descriptions, why they matter, remediation guidance, code examples.

What's proprietary: Detection patterns, scoring formulas, false-positive reduction, combination logic.


Want to contribute to the benchmark? Run Free Scan →