Why this run exists
The January 2026 benchmark ranked tools without weighted rubrics or published responses, making independent rescoring impossible. This v2 run tightened the rubric and preserved the raw outputs.
Shumi was excluded from the scored field to avoid self-ranking. The result is a category benchmark, not a vendor leaderboard disguised as research.
Protocol
- One 24-hour submission window across all prompts.
- Nine trader-native questions covering regime, backtests, funding, and tactical planning.
- Weighted rubric: comprehension 20%, actionability 25%, originality 15%, accuracy 25%, risk awareness 15%.
- Raw responses kept public for independent rescoring.
Snapshot
- Grok led the field at 8.67 average and took 7 question wins.
- Gemini won the single-signal question with the strongest originality score.
- DeepSeek produced the sharpest crash-cascade answer.
- The spread between first and last was 4.57 points, large enough to matter operationally.
Gaps in this run
- Single evaluator scored all 270 outcomes.
- Accuracy claims were not cross-checked with a separate fact-check pass.
- Submissions spanned 26 hours across tools, creating timing asymmetries in market-sensitive answers.