Shumi
Benchmark · Mar 11, 2026 · 3 min read

Crypto AI Benchmark v2

Six AI tools given identical trader questions. Weighted rubric: comprehension, actionability, originality, accuracy, risk awareness. Raw responses published for rescoring.

Why this run exists

The January 2026 benchmark ranked tools without weighted rubrics or published responses, making independent rescoring impossible. This v2 run tightened the rubric and preserved the raw outputs.

Shumi was excluded from the scored field to avoid self-ranking. The result is a category benchmark, not a vendor leaderboard disguised as research.

Protocol

  • One 24-hour submission window across all prompts.
  • Nine trader-native questions covering regime, backtests, funding, and tactical planning.
  • Weighted rubric: comprehension 20%, actionability 25%, originality 15%, accuracy 25%, risk awareness 15%.
  • Raw responses kept public for independent rescoring.
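
For anyone rescoring the raw responses, the weighted rubric above can be sketched as a small function. The weights come from the protocol; the 0-10 scale per criterion, the key names, and the example scores are assumptions made for illustration, not values from the run.

```python
# Rubric weights as listed in the protocol (they sum to 100%).
WEIGHTS = {
    "comprehension": 0.20,
    "actionability": 0.25,
    "originality": 0.15,
    "accuracy": 0.25,
    "risk_awareness": 0.15,
}


def weighted_score(criterion_scores):
    """Combine per-criterion scores into one weighted score.

    Assumes every criterion is scored on the same scale (0-10 here,
    which is an assumption, not something the protocol states).
    """
    missing = set(WEIGHTS) - set(criterion_scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * criterion_scores[name] for name in WEIGHTS)


# Hypothetical scores for one response to one question.
example = {
    "comprehension": 8,
    "actionability": 9,
    "originality": 7,
    "accuracy": 8,
    "risk_awareness": 6,
}

print(round(weighted_score(example), 2))  # 7.8
```

Because actionability and accuracy carry half the total weight between them, a response that is original but wrong or unusable cannot score well under this scheme.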

Snapshot

  • Grok led the field with an 8.67 average and won 7 of the 9 questions.
  • Gemini won the single-signal question with the strongest originality score.
  • DeepSeek produced the sharpest crash-cascade answer.
  • The spread between first and last was 4.57 points, large enough to matter operationally.
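
The snapshot figures (average, question wins, spread) can be recomputed from per-question weighted scores. The sketch below is one possible way to do that, assuming the spread is the gap between the highest and lowest tool averages. The tool names and scores are placeholders, not results from this run.

```python
from statistics import mean


def summarize(scores):
    """Summarize a {tool: [weighted score per question]} table.

    Returns (averages, wins, spread). Ties on a question go to the
    tool listed first, which is a simplification.
    """
    averages = {tool: mean(vals) for tool, vals in scores.items()}
    n_questions = len(next(iter(scores.values())))
    wins = {tool: 0 for tool in scores}
    for q in range(n_questions):
        best = max(scores, key=lambda tool: scores[tool][q])
        wins[best] += 1
    # Assumed definition: best average minus worst average.
    spread = max(averages.values()) - min(averages.values())
    return averages, wins, spread


# Placeholder data: two hypothetical tools, three questions.
placeholder = {
    "tool_a": [8.0, 9.0, 7.0],
    "tool_b": [6.0, 9.5, 5.0],
}

averages, wins, spread = summarize(placeholder)
print(wins)  # {'tool_a': 2, 'tool_b': 1}
```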

Gaps in this run

  • A single evaluator scored all 270 outcomes (6 tools × 9 questions × 5 rubric criteria).
  • Accuracy claims were not cross-checked with a separate fact-check pass.
  • Submissions spanned 26 hours across tools, exceeding the 24-hour window set in the protocol and creating timing asymmetries in market-sensitive answers.
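
Two of these gaps are easy to check mechanically when rescoring. The sketch below confirms the outcome count from the stated field size and flags a submission span that runs past the protocol window. The timestamps are placeholders chosen to reproduce a 26-hour span, not the actual submission times.

```python
from datetime import datetime, timedelta

# Field size as described in the study.
N_TOOLS, N_QUESTIONS, N_CRITERIA = 6, 9, 5


def scored_outcomes(tools=N_TOOLS, questions=N_QUESTIONS, criteria=N_CRITERIA):
    """Number of individual scores a single evaluator has to assign."""
    return tools * questions * criteria


def submission_span(timestamps):
    """Elapsed time between the earliest and latest submission."""
    return max(timestamps) - min(timestamps)


def exceeds_window(timestamps, window=timedelta(hours=24)):
    """True if submissions ran longer than the protocol window."""
    return submission_span(timestamps) > window


# Placeholder timestamps (hypothetical): first and last submissions 26 hours apart.
start = datetime(2026, 1, 1, 0, 0)
placeholder_times = [start, start + timedelta(hours=10), start + timedelta(hours=26)]

print(scored_outcomes())                  # 270
print(exceeds_window(placeholder_times))  # True
```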