AI Benchmarks
AI benchmark rankings, model scores, and performance data.
Track live AI benchmark rankings, coding scores, math scores, and benchmark results across leading models from OpenAI, Anthropic, Google, Meta, DeepSeek, and more.
| # | Model | Org | Intelligence | Coding | Math | MMLU Pro | GPQA | LiveCodeBench | AIME 2025 | MATH 500 | SciCode | IFBench | HLE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) | Anthropic | 64.9 | 62.0 | — | — | 9260.0 | — | — | — | 6020.0 | 6346.9 | 5330.0 |
| 2 | Claude Opus 4.8 (Adaptive Reasoning, Max Effort) | Anthropic | 61.4 | 56.7 | — | — | 9200.0 | — | — | — | 5350.0 | 6224.5 | 4570.0 |
| 3 | GPT-5.5 (xhigh) | OpenAI | 60.2 | 59.1 | — | — | 9350.0 | — | — | — | 5610.0 | 7585.0 | 4430.0 |
| 4 | GPT-5.5 (high) | OpenAI | 58.9 | 58.5 | — | — | 9320.0 | — | — | — | 5590.0 | 7163.3 | 4300.0 |
| 5 | Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | Anthropic | 57.3 | 52.5 | — | — | 9140.0 | — | — | — | 5450.0 | 5863.9 | 3960.0 |
| 6 | Gemini 3.1 Pro Preview | Google | 57.2 | 55.5 | — | — | 9410.0 | — | — | — | 5890.0 | 7714.3 | 4470.0 |
| 7 | GPT-5.4 (xhigh) | OpenAI | 56.8 | 57.2 | — | — | 9200.0 | — | — | — | 5660.0 | 7394.6 | 4160.0 |
| 8 | GPT-5.5 (medium) | OpenAI | 56.7 | 56.2 | — | — | 9260.0 | — | — | — | 5350.0 | 7095.2 | 4060.0 |
| 9 | Qwen3.7 Max | Alibaba | 56.6 | 50.1 | — | — | 9230.0 | — | — | — | 4880.0 | 8054.4 | 3810.0 |
| 10 | Gemini 3.5 Flash (high) | Google | 55.3 | 45.0 | — | — | 9220.0 | — | — | — | 5310.0 | 7632.7 | 4100.0 |
| 11 | Gemini 3.5 Flash (medium) | Google | 54.8 | 43.9 | — | — | 9210.0 | — | — | — | 5300.0 | 7455.8 | 3990.0 |
| 12 | MiniMax-M3 | MiniMax | 54.7 | 43.4 | — | — | 9290.0 | — | — | — | 4540.0 | 8285.7 | 3710.0 |
| 13 | Kimi K2.6 | Kimi | 53.9 | 47.1 | — | — | 9110.0 | — | — | — | 5350.0 | 7598.6 | 3590.0 |
| 14 | MiMo-V2.5-Pro | Xiaomi | 53.8 | 45.5 | — | — | 8660.0 | — | — | — | 5020.0 | 7986.4 | 3380.0 |
| 15 | GPT-5.3 Codex (xhigh) | OpenAI | 53.6 | 53.1 | — | — | 9150.0 | — | — | — | 5320.0 | 7537.4 | 3990.0 |
| 16 | Qwen3.7 Plus | Alibaba | 53.3 | 46.5 | — | — | 9000.0 | — | — | — | 4550.0 | 7795.9 | 3340.0 |
| 17 | Grok 4.3 (high) | xAI | 53.2 | 41.0 | — | — | 9010.0 | — | — | — | 4730.0 | 8129.3 | 3500.0 |
| 18 | Claude Opus 4.6 (Adaptive Reasoning, Max Effort) | Anthropic | 52.9 | 48.1 | — | — | 8960.0 | — | — | — | 5190.0 | 5312.9 | 3670.0 |
| 19 | Muse Spark | Meta | 52.2 | 47.5 | — | — | 8840.0 | — | — | — | 5150.0 | 7591.8 | 3990.0 |
| 20 | Qwen3.6 Max Preview | Alibaba | 51.8 | 44.9 | — | — | 8880.0 | — | — | — | 4690.0 | 7659.9 | 2890.0 |
| 21 | Claude Opus 4.7 (Non-reasoning, High Effort) | Anthropic | 51.8 | 53.1 | — | — | 8850.0 | — | — | — | 5010.0 | 4360.5 | 3120.0 |
| 22 | Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) | Anthropic | 51.7 | 50.9 | — | — | 8750.0 | — | — | — | 4680.0 | 5659.9 | 3000.0 |
| 23 | DeepSeek V4 Pro (Reasoning, Max Effort) | DeepSeek | 51.5 | 47.5 | — | — | 8880.0 | — | — | — | 5000.0 | 7646.3 | 3590.0 |
| 24 | GLM-5.1 (Reasoning) | Z AI | 51.4 | 43.4 | — | — | 8680.0 | — | — | — | 4380.0 | 7625.9 | 2800.0 |