AI Benchmarks

AI benchmark rankings, model scores, and performance data.

Track live AI benchmark rankings, including coding and math scores, across leading models from OpenAI, Anthropic, Google, Meta, DeepSeek, and more.

24 of 24 models

#	Model	Org	Intelligence ↓	Coding	Math	MMLU Pro	GPQA	LiveCodeBench	AIME 2025	MATH 500	SciCode	IFBench	HLE
1	Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)	Anthropic	64.9	62.0	—	—	92.6	—	—	—	60.2	63.469	53.3
2	Claude Opus 4.8 (Adaptive Reasoning, Max Effort)	Anthropic	61.4	56.7	—	—	92.0	—	—	—	53.5	62.245	45.7
3	GPT-5.5 (xhigh)	OpenAI	60.2	59.1	—	—	93.5	—	—	—	56.1	75.850	44.3
4	GPT-5.5 (high)	OpenAI	58.9	58.5	—	—	93.2	—	—	—	55.9	71.633	43.0
5	Claude Opus 4.7 (Adaptive Reasoning, Max Effort)	Anthropic	57.3	52.5	—	—	91.4	—	—	—	54.5	58.639	39.6
6	Gemini 3.1 Pro Preview	Google	57.2	55.5	—	—	94.1	—	—	—	58.9	77.143	44.7
7	GPT-5.4 (xhigh)	OpenAI	56.8	57.2	—	—	92.0	—	—	—	56.6	73.946	41.6
8	GPT-5.5 (medium)	OpenAI	56.7	56.2	—	—	92.6	—	—	—	53.5	70.952	40.6
9	Qwen3.7 Max	Alibaba	56.6	50.1	—	—	92.3	—	—	—	48.8	80.544	38.1
10	Gemini 3.5 Flash (high)	Google	55.3	45.0	—	—	92.2	—	—	—	53.1	76.327	41.0
11	Gemini 3.5 Flash (medium)	Google	54.8	43.9	—	—	92.1	—	—	—	53.0	74.558	39.9
12	MiniMax-M3	MiniMax	54.7	43.4	—	—	92.9	—	—	—	45.4	82.857	37.1
13	Kimi K2.6	Kimi	53.9	47.1	—	—	91.1	—	—	—	53.5	75.986	35.9
14	MiMo-V2.5-Pro	Xiaomi	53.8	45.5	—	—	86.6	—	—	—	50.2	79.864	33.8
15	GPT-5.3 Codex (xhigh)	OpenAI	53.6	53.1	—	—	91.5	—	—	—	53.2	75.374	39.9
16	Qwen3.7 Plus	Alibaba	53.3	46.5	—	—	90.0	—	—	—	45.5	77.959	33.4
17	Grok 4.3 (high)	xAI	53.2	41.0	—	—	90.1	—	—	—	47.3	81.293	35.0
18	Claude Opus 4.6 (Adaptive Reasoning, Max Effort)	Anthropic	52.9	48.1	—	—	89.6	—	—	—	51.9	53.129	36.7
19	Muse Spark	Meta	52.2	47.5	—	—	88.4	—	—	—	51.5	75.918	39.9
20	Qwen3.6 Max Preview	Alibaba	51.8	44.9	—	—	88.8	—	—	—	46.9	76.599	28.9
21	Claude Opus 4.7 (Non-reasoning, High Effort)	Anthropic	51.8	53.1	—	—	88.5	—	—	—	50.1	43.605	31.2
22	Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)	Anthropic	51.7	50.9	—	—	87.5	—	—	—	46.8	56.599	30.0
23	DeepSeek V4 Pro (Reasoning, Max Effort)	DeepSeek	51.5	47.5	—	—	88.8	—	—	—	50.0	76.463	35.9
24	GLM-5.1 (Reasoning)	Z AI	51.4	43.4	—	—	86.8	—	—	—	43.8	76.259	28.0