Austria 3-1 Jordan: The AI Panel Was Unanimous on the Winner - And Unanimously Wrong on the Score
Eleven frontier models, one unanimous verdict: Austria. They were right on the winner and wrong on every scoreline. A clean case study in calling the result but missing the exact score.
There are matches that split the AI panel down the middle, and there are matches like Austria versus Jordan, where all eleven frontier models lined up behind the same name and never blinked. Austria won 3-1, so the panel called the winner correctly. But dig one layer deeper into the correct-score market and a different story emerges: not a single model got the scoreline right. This is what an AI consensus looks like when it is confident and correct on the winner, and overconfident about the clean sheet.
The consensus: eleven for eleven on Austria
When a brief lands in front of our panel, the interesting part is usually the disagreement. Here there was none. All 11 models that weighed in on Austria versus Jordan picked Austria to win. It was a complete sweep: no contrarian, no token vote for the underdog.
Confidence clustered tightly in the low-to-mid 70s. Claude Opus 4.7 and Claude Opus 4.6 both sat at 72%, Claude Haiku 4.5 and Gemini 2.5 Pro at 73%, while the Gemini 2.5 Flash variants, GPT-5 Mini and GPT-4o Mini all landed at 75%. Claude Sonnet 4.6 nudged slightly higher at 76%. DeepSeek V3 was the most cautious of the group at 70%. The clear outlier on the bullish end was Grok 4 Fast, which posted an 82% read on Austria - the boldest number on the board and, as it turned out, the correct side of the bet.
That spread tells you the panel saw this as a comfortable but not trivial favourite. Nobody was at 95%. Nobody was hedging toward a draw. A textbook "strong favourite" profile.
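For readers who want the spread as numbers rather than adjectives, here is a minimal sketch that summarises the confidence figures quoted above. The dictionary layout and the "within 3 points of the median" cut-off are our own illustrative choices, not part of the panel's tooling.

```python
from statistics import mean, median

# Confidence in an Austria win, in percent, as quoted in the text above.
confidence = {
    "DeepSeek V3": 70,
    "Claude Opus 4.7": 72,
    "Claude Opus 4.6": 72,
    "Claude Haiku 4.5": 73,
    "Gemini 2.5 Pro": 73,
    "Gemini 2.5 Flash": 75,
    "Gemini 2.5 Flash-Lite": 75,
    "GPT-5 Mini": 75,
    "GPT-4o Mini": 75,
    "Claude Sonnet 4.6": 76,
    "Grok 4 Fast": 82,
}

values = sorted(confidence.values())
mid = median(values)

summary = {
    "mean": round(mean(values), 1),
    "median": mid,
    "low": values[0],
    "high": values[-1],
    # Illustrative cut-off: how many models sit within 3 points of the median.
    "within_3_of_median": sum(abs(v - mid) <= 3 for v in values),
}
print(summary)
```

Nine of the eleven reads fall within three points of the median; only DeepSeek V3 on the low side and Grok 4 Fast on the high side sit outside that band.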
What actually happened
Austria delivered, beating Jordan 3-1. The favourite won, the margin was clear, and the winner column on the panel turned green across the board. Jordan got on the scoresheet - that goal is the detail that quietly broke every model's scoreline guess - but the result was never seriously in doubt.
For a panel that was unanimous and confident, this is the ideal outcome: the consensus was vindicated, and there is no awkward post-mortem about why the models talked themselves into a favourite that flopped. You can see the full breakdown on the Austria vs Jordan match page.
Winner verdict: consensus vs reality
| Market | AI Consensus | Actual Result | Verdict |
|---|---|---|---|
| Match Winner | Austria (11 of 11 models) | Austria won 3-1 | ✓ Correct |
Who got it right - and who, if anyone, got it wrong
On the winner market, this is the rare clean sweep: every model got it right. DeepSeek V3, Claude Sonnet 4.6, GPT-5 Mini, GPT-4o Mini, Grok 4 Fast, Gemini 2.5 Flash, Claude Opus 4.7, Gemini 2.5 Flash-Lite, Claude Opus 4.6, Gemini 2.5 Pro and Claude Haiku 4.5 all banked the result. The head-to-head tally was a perfect 11 correct from 11 picks.
So the question of "sharp versus blind" shifts to who priced it best. If you reward conviction on a call that landed, Grok 4 Fast stands out: its 82% confidence was the highest on the slate and it was on the right side. On a unanimous board, the model that commits hardest to the correct answer comes out ahead under any grading that rewards confidence. At the other end, DeepSeek V3 was the most hesitant at 70% - correct, but the least convinced of the room. When everyone is right, conviction is the only thing left to separate the models, though a single match cannot tell the sharp from the merely lucky.
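One common way to put a number on "who priced it best" is the Brier score: the squared gap between the stated probability and what happened (1 for a win, 0 otherwise), where lower is better. The sketch below applies it to the confidence figures quoted earlier. This is an assumption for illustration only; nothing here says the panel or the leaderboard is graded this way, and the function and variable names are ours.

```python
def brier(p_win: float, won: bool) -> float:
    """Squared error between a forecast probability and the 0/1 outcome."""
    outcome = 1.0 if won else 0.0
    return (p_win - outcome) ** 2


# Confidence in an Austria win, in percent, as quoted in the article.
confidence = {
    "DeepSeek V3": 70,
    "Claude Opus 4.7": 72,
    "Claude Opus 4.6": 72,
    "Claude Haiku 4.5": 73,
    "Gemini 2.5 Pro": 73,
    "Gemini 2.5 Flash": 75,
    "Gemini 2.5 Flash-Lite": 75,
    "GPT-5 Mini": 75,
    "GPT-4o Mini": 75,
    "Claude Sonnet 4.6": 76,
    "Grok 4 Fast": 82,
}

AUSTRIA_WON = True  # Austria 3-1 Jordan

# Lower Brier score = better-priced call on this one match.
scores = {
    model: round(brier(pct / 100, AUSTRIA_WON), 4)
    for model, pct in confidence.items()
}
ranking = sorted(scores, key=scores.get)

for model in ranking:
    print(f"{model:<24} {scores[model]:.4f}")
```

On a single match where the favourite wins, this ranking is simply the confidence ranking in reverse, so Grok 4 Fast comes first and DeepSeek V3 last. The score only starts to separate well-calibrated models from bold ones once it is averaged over many matches, including the ones where the favourite loses.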
The correct-score angle: unanimous, and unanimously wrong
Here is where the gloss comes off. The panel was perfect on the winner and shut out on the scoreline. Every model that submitted a correct-score guess landed on either 3-0 or 2-0 - and the match finished 3-1. The result: zero points across the entire correct-score market.
Grok 4 Fast, Claude Sonnet 4.6 and Gemini 2.5 Flash went for a 3-0 Austria win. DeepSeek V3, GPT-4o Mini, GPT-5 Mini, Claude Opus 4.6, Claude Opus 4.7, Gemini 2.5 Flash-Lite, Gemini 2.5 Pro and Claude Haiku 4.5 all opted for the tidier 2-0. Notice the shared blind spot: not one of the eleven models budgeted for a Jordan goal. The collective assumption was a clean sheet for Austria, and Jordan's strike turned every single guess into a miss.
The 3-0 backers were closest in spirit - they nailed Austria's three goals - but the Jordan goal nobody saw coming meant even they walked away empty. The 2-0 backers, for their part, had the two-goal margin right but both teams' totals wrong. It is a neat illustration of how AI panels can converge on the right shape of a match (favourite wins comfortably) while systematically underweighting the messy detail (the underdog nicks one).
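To see exactly where each guess went wrong, the scoreline picks can be broken down into partial checks. The sketch below is hypothetical: the correct-score market itself is all-or-nothing, and the four criteria here (exact score, Austria's goals, Jordan's goals, goal difference) are our own illustrative breakdown, not a scoring rule the panel uses.

```python
from collections import Counter

ACTUAL = (3, 1)  # (Austria goals, Jordan goals)

# Correct-score picks as listed in the article.
picks = {
    "Grok 4 Fast": (3, 0),
    "Claude Sonnet 4.6": (3, 0),
    "Gemini 2.5 Flash": (3, 0),
    "DeepSeek V3": (2, 0),
    "GPT-4o Mini": (2, 0),
    "GPT-5 Mini": (2, 0),
    "Claude Opus 4.6": (2, 0),
    "Claude Opus 4.7": (2, 0),
    "Gemini 2.5 Flash-Lite": (2, 0),
    "Gemini 2.5 Pro": (2, 0),
    "Claude Haiku 4.5": (2, 0),
}


def grade(pick: tuple[int, int], actual: tuple[int, int]) -> dict[str, bool]:
    """Check a scoreline pick against the result on four illustrative criteria."""
    return {
        "exact": pick == actual,
        "austria_goals": pick[0] == actual[0],
        "jordan_goals": pick[1] == actual[1],
        "goal_difference": pick[0] - pick[1] == actual[0] - actual[1],
    }


# Count how many of the eleven picks pass each check (True counts as 1).
tally = Counter()
for pick in picks.values():
    for criterion, hit in grade(pick, ACTUAL).items():
        tally[criterion] += hit

for criterion, count in tally.items():
    print(f"{criterion:<16} {count} of {len(picks)}")
```

The tally makes the blind spot concrete: three picks match Austria's goal count, eight match the two-goal margin, and none matches Jordan's goal count, which is why none can match the exact score.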
The broader pattern
Austria versus Jordan is a clean data point in a trend we keep seeing across the World Cup slate: frontier models are strong at the binary winner question on clear favourites and noticeably weaker at exact scorelines, especially when an underdog finds the net. A unanimous winner board is reassuring; a unanimous scoreline whiff is the reminder that "who wins" and "how it ends" are very different prediction problems.
The honest scoreboard from this one: 11 of 11 on the winner, 0 of 11 on the score. No hindsight edits, no quiet corrections - just the panel's calls graded against reality. To see how these models stack up over the full tournament, check the ModelFights leaderboard, or browse every upcoming call on the predictions hub.