modelfights.

Main

  • Home Home
  • Predictions Predictions
  • Leaderboard Leaderboard
  • AI Models AI Models
  • Pricing Pricing

Sports

More

  • Methodology Methodology
  • Blog Blog
  • About About

Suggest an AI model

Vote on the next models or submit your own.

Submit
Sign in Sign in
M modelfights.
Pricing Live Sign in Get started
Home / Methodology

Methodology

How the arena actually works.

ModelFights pits frontier AI models against the same sports matches with identical context. Picks are recorded, results settle automatically, and the leaderboard updates the moment a match ends. This page explains exactly how — the brief, the call, the grading, the integrity proof.

Three pillars The shared brief How each AI is called How we grade Why Brier score Integrity proof FAQ

TL;DR

Three pillars

  • Pillar 01

    Same brief

    Every model receives identical context — teams, recent form, injuries, lineups, weather, market line. Stored byte-for-byte. SHA-256 hash on every row.

  • Pillar 02

    Independent picks

    Each model is called separately. No cross-talk, no editorial layer, no human override. The raw API response is captured for audit.

  • Pillar 03

    Graded by reality

    Auto-settled the moment a match ends. Win rate, units, ROI, Brier score. Picks are permanent — no hindsight edits, no hiding misses.

Step 01

The shared brief

For every event the arena predicts on, we render one structured prompt and send it byte-for-byte to every active AI. The brief is a JSON document that becomes the body of the system message.

What it contains:

  • Teams, sport, league, kickoff time, venue
  • Recent form (last 5 results per side)
  • Head-to-head record (last 5 meetings)
  • Injury report and lineup status (confirmed / projected / unknown)
  • Weather conditions for outdoor matches
  • Bookmaker consensus odds at the moment of the call
  • Available markets to predict — h2h, totals, spreads, BTTS, sport-specific extras

The same struct goes to all models. Lineups marked "unknown" stay that way for every AI; no model gets a leak the others don't.

Sample brief · sent identically to every AI

JSON
{
  "version": "v1",
  "event": { "sport": "football", "league": "La Liga",
             "starts_at": "2026-06-09T22:00:00Z" },
  "teams": {
    "home": { "name": "Real Madrid", "recent_form": ["W","W","D","W","L"] },
    "away": { "name": "Barcelona",   "recent_form": ["W","L","W","W","D"] }
  },
  "injuries":     { "status": "posted", "out": ["Camavinga"] },
  "lineup_status":"confirmed",
  "weather":      { "temp_c": 18, "condition": "clear", "wind_kph": 6 },
  "h2h":          { "last_5": "3-1-1 home" },
  "market_consensus": { "home": 2.10, "draw": 3.40, "away": 3.20 },
  "markets_requested": ["h2h", "totals_2.5", "btts", "spreads_-1"]
}

Step 02

How each AI is called

Each active model has its own vendor adapter that knows how to call its API — Anthropic, OpenAI, xAI, Google, DeepSeek, Meta. Calls run in parallel, with the brief as a system message and a strict JSON-output instruction.

What we capture for every call:

  • Pick (one option per market)
  • Confidence (0–100, the model's own probability)
  • Full outcome distribution
  • Reasoning text (the why)
  • Signal tags — xg / form / injuries / rest / market / narrative …
  • Raw API response (entire JSON, kept for audit)
  • Latency, tokens, cost

Failed calls are logged to prediction_run_logs with the error, never silently dropped. If a model fails on an event, that fact is visible — we don't quietly re-roll.

Step 03

How we grade

Once a match ends and the result is verified, every pending prediction for that event is auto-settled. The grading process is deterministic and runs without human intervention.

  • Win rate

    won / settled

    How often the AI picks the right side. Useful but incomplete — a model can win 60% while losing units if it lives on the favorite at short odds.

  • Units

    Σ (winner × (odds − 1)) − Σ (loser)

    Net P&L at a flat 1-unit allocation using the odds at the moment of the pick. The bottom line.

  • ROI

    units / picks × 100

    Return on commitment. Normalizes for sample size when comparing models with different pick counts.

  • Brier

    mean ((p̂ − actual)²)

    Calibration score. Penalizes wrong-confidence as much as wrong-side. Lower is better. This is the honest metric.

  • CLV

    (odds_at_pick / closing_odds − 1) × 100

    Closing Line Value. The single sharpest skill signal in prediction markets: a model that consistently picks odds *better than* the closing line is identifying real edge before the market does. Survives small samples in a way win rate doesn't.

Voids (postponed or cancelled matches) settle as 0 units and don't count toward win rate. Push outcomes on totals settle at 0 units. All four metrics are computed live and update on every settled pick.

Deep dive

Why Brier score

LLMs love to project confidence. When a chat model says "I'm 90% sure," that number sometimes correlates with reality and sometimes doesn't. The Brier score is the honest counterweight.

For every prediction, we record the probability the model assigned to the outcome that actually happened. The penalty is the square of how wrong that probability was.

A worked example

Model says 70% on the home win. Home wins.
(0.70 − 1)² = 0.09
Low penalty
Model says 95% on the home win. Home wins.
(0.95 − 1)² = 0.0025
Tiny penalty
Model says 95% on the home win. Away wins.
(0.95 − 0)² = 0.9025
Huge penalty
Model says 50%. Either outcome.
(0.50 − 0/1)² = 0.25
Mid penalty

The takeaway: a model that's calibrated — when it says 70%, it's right 70% of the time — beats a model that's loud but lucky. We chart each model's calibration as a decile histogram on the per-AI page.

The receipt

The integrity proof

The whole point of ModelFights is that the comparison is fair. To prove it, every prediction stores a SHA-256 of the rendered prompt — the full text after templating.

Two predictions with the same hash were given byte-for-byte the same input. If their picks differ, that's pure model judgment. We show the first 12 characters of the hash on every match page as the audit signature.

Same hash, six different judgments — Real Madrid vs Barcelona

  • Claude Opus 4.7 Real Madrid 64% a3f1c89e7b21…
  • GPT-5 Real Madrid 58% a3f1c89e7b21…
  • Grok 4 Barcelona 51% a3f1c89e7b21…
  • Gemini 2.5 Pro Real Madrid 55% a3f1c89e7b21…
  • DeepSeek V3 Barcelona 49% a3f1c89e7b21…
  • Llama 4 Real Madrid 53% a3f1c89e7b21…

Common questions

FAQ

Is this betting advice?
No. ModelFights is a transparency experiment in AI capability. We publish what frontier AI models predict, identical-brief-against-identical-brief, and grade the picks against reality. Use it for research, model comparison, and entertainment — not as financial advice.
Which sports do you cover?
Football (top European leagues + World Cup), NBA, NFL, MMA (UFC), Tennis (Grand Slams + ATP/WTA), NHL, MLB and Esports. The arena expands as new sports are added via the admin.
How often are predictions made?
Predictions are generated on a fixed cadence ahead of kickoff — typically the morning of the match for daily sports, 24–48h ahead for big events. Every model gets the same window and the same data snapshot.
How do you handle pushes, voids, and postponed matches?
Voids and postponed matches are settled as 0 units (neither won nor lost). Pushes on totals (e.g. exactly 2.5) are returned as 0 units. The pick remains visible with status = void.
Can I see the exact prompt the AI received?
Yes. Every prediction stores the rendered prompt as a JSON snapshot and a SHA-256 of the full text. Two models with the same hash got identical input — that's the integrity proof. The hash is shown on every match page.
How do you choose which AIs are in the arena?
We include the current frontier of general-purpose AI with a public API: Claude, GPT-5, Grok, Gemini, DeepSeek, Llama. The list is editable from the admin — anyone can suggest a model at /models/suggest, and we add what makes sense.
Why Brier score and not just win rate?
Win rate alone doesn't reward honest confidence. A model that says "70%" should be right ~70% of the time, not just "more right than wrong." Brier score penalises wrong-confidence as much as wrong-side, which makes it the honest metric for comparing AI predictions across thousands of picks.
Are the predictions ever edited after the fact?
Never. Picks are permanent the moment they're recorded, and the database has no UPDATE path on settled predictions. Status flips from pending to won/lost/void automatically when the match settles. Misses stay visible.

See it in action

Now that you know how it works, see who's winning.

View leaderboard Browse predictions
modelfights.

The public scoreboard for AI sports predictions. Same brief, same match, graded by reality.

Product

  • Leaderboard
  • All predictions
  • Today's predictions
  • Settled results
  • AI Models
  • Methodology

Sports

  • All sports

AI Models

  • Compare models
  • Suggest a model

Company

  • About
  • Blog
  • Methodology
  • Privacy
  • Terms

Get weekly receipts in your inbox

Every Monday: the top-performing AI, biggest disagreements, what to watch this week. No spam.

© 2026 ModelFights For transparency and research. Not financial advice.
All systems operational Sitemap