LLM Leaderboard

    Comparing top-tier general-purpose models on key reasoning and language benchmarks, alongside leading open-source models.

    Top Performers

    1st Place


    ChatGPT 5.4

    OpenAI's frontier reasoning model combining thinking capabilities with advanced coding from GPT-5.3-Codex.

    Overall Score

    98.8

    2nd Place


    Gemini 3.1 Pro

    Google DeepMind's most advanced model with breakthrough reasoning, multimodal capabilities and 1M token context window.

    Overall Score

    97.9

    3rd Place


    Claude 4.6 Opus

    Anthropic's most powerful model with exceptional reasoning and creative capabilities.

    Overall Score

    94.9
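    The podium ordering above follows directly from the overall scores. A minimal sketch of that ranking step (the names and scores are taken from the cards above; the dictionary structure is an illustrative assumption, not the leaderboard's actual data format):

    ```python
    # Rank leaderboard entries by overall score.
    # Names and scores come from the Top Performers cards above;
    # the dict layout is an assumption for illustration only.
    entries = {
        "ChatGPT 5.4": 98.8,
        "Gemini 3.1 Pro": 97.9,
        "Claude 4.6 Opus": 94.9,
    }

    # Sort descending by score to recover the podium order.
    ranking = sorted(entries.items(), key=lambda kv: kv[1], reverse=True)
    for place, (name, score) in enumerate(ranking, start=1):
        print(f"{place}. {name}: {score}")
    ```

    The same sort applies unchanged to the full listing below once every entry is given a name and score.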

    Full Leaderboard

    OpenAI's frontier reasoning model combining thinking capabilities with advanced coding from GPT-5.3-Codex.

    Top Tier Reasoning
    Overall Score: 98.8
    MMLU: 91.2
    MMMU: 84.0
    GPQA: 92.8
    Coding: 77.5
    TAU-Bench: 84.0
    Multilingual: 92.0
    AIME 2025: 96.0

    Google DeepMind's most advanced model with breakthrough reasoning, multimodal capabilities and 1M token context window.

    Top Tier Reasoning
    Overall Score: 97.9
    MMLU: 91.4
    MMMU: 88.5
    GPQA: 94.3
    Coding: 74.5
    TAU-Bench: 85.0
    Multilingual: 92.6
    AIME 2025: 91.2

    Anthropic's most powerful model with exceptional reasoning and creative capabilities.

    Top Tier Reasoning
    Overall Score: 94.9
    MMLU: 90.8
    MMMU: 80.9
    GPQA: 91.3
    Coding: 73.1
    TAU-Bench: 82.4
    Multilingual: 91.1
    AIME 2025: 87.0

    Google's flagship model with exceptional multimodal capabilities and massive context window.

    Top Tier Reasoning
    Overall Score: 94.0
    MMLU: 93.4
    MMMU: 87.6
    GPQA: 91.9
    Coding: 65.2
    TAU-Bench: 80.0
    Multilingual: 91.8
    AIME 2025: 95.0

    OpenAI's most advanced reasoning model with breakthrough performance in complex problem-solving.

    Top Tier Reasoning
    Overall Score: 93.4
    MMLU: 84.6
    MMMU: 80.4
    GPQA: 92.4
    Coding: 68.8
    TAU-Bench: 82.0
    Multilingual: 91.0
    AIME 2025: 94.0

    Anthropic's most powerful model with exceptional reasoning and creative capabilities.

    Top Tier Reasoning
    Overall Score: 93.0
    MMLU: 90.9
    MMMU: 80.7
    GPQA: 87.0
    Coding: 70.1
    TAU-Bench: 82.7
    Multilingual: 90.0
    AIME 2025: 87.0

    OpenAI's most advanced reasoning model with breakthrough performance in complex problem-solving.

    Great for Creative Tasks
    Overall Score: 89.7
    MMLU: 84.6
    MMMU: 80.4
    GPQA: 85.7
    Coding: 62.0
    TAU-Bench: 81.1
    Multilingual: 91.0
    AIME 2025: 94.0

    Anthropic's most powerful model with exceptional reasoning and creative capabilities.

    Top Tier Reasoning
    Overall Score: 89.3
    MMLU: 89.1
    MMMU: 77.8
    GPQA: 83.4
    Coding: 63.6
    TAU-Bench: 81.4
    Multilingual: 90.0
    AIME 2025: 87.0

    OpenAI's most advanced reasoning model with breakthrough performance in complex problem-solving.

    Top Tier Reasoning
    Overall Score: 87.6
    MMLU: 84.6
    MMMU: 84.2
    GPQA: 85.7
    Coding: 56.5
    TAU-Bench: 81.1
    Multilingual: 88.8
    AIME 2025: 92.6

    Google's flagship model with exceptional multimodal capabilities and massive context window.

    Top Tier Reasoning
    Overall Score: 86.4
    MMLU: 89.8
    MMMU: 84.0
    GPQA: 88.4
    Coding: 52.3
    TAU-Bench: 80.0
    Multilingual: 89.0
    AIME 2025: 89.0

    Anthropic's most powerful model with exceptional reasoning and creative capabilities.

    Great for Creative Tasks
    Overall Score: 85.6
    MMLU: 88.8
    MMMU: 77.1
    GPQA: 79.6
    Coding: 58.9
    TAU-Bench: 82.4
    Multilingual: 89.5
    AIME 2025: 78.0

    Anthropic's most powerful model with exceptional reasoning and creative capabilities.

    Great for Creative Tasks
    Overall Score: 83.8
    MMLU: 88.8
    MMMU: 76.5
    GPQA: 79.6
    Coding: 55.9
    TAU-Bench: 81.4
    Multilingual: 88.8
    AIME 2025: 75.5

    OpenAI's most advanced reasoning model with breakthrough performance in complex problem-solving.

    Overall Score: 82.8
    MMLU: 85.6
    MMMU: 82.9
    GPQA: 83.3
    Coding: 49.6
    TAU-Bench: 70.4
    Multilingual: 88.8
    AIME 2025: 88.9

    OpenAI's reasoning model optimized for complex problem-solving, mathematics, and coding tasks.

    Overall Score: 80.7
    MMLU: 89.3
    MMMU: 78.2
    GPQA: 78.0
    Coding: 50.5
    TAU-Bench: 73.5
    Multilingual: 82.1
    AIME 2025: 79.3

    Anthropic's balanced model offering excellent performance across all domains.

    Great for Creative Tasks
    Overall Score: 79.2
    MMLU: 74.4
    MMMU: 74.4
    GPQA: 75.4
    Coding: 54.1
    TAU-Bench: 80.5
    Multilingual: 86.5
    AIME 2025: 70.5
    Qwen3 480B
    Open Source

    Alibaba's most powerful Qwen3 model with state-of-the-art performance across all benchmarks.

    Overall Score: 79.0
    MMLU: 82.3
    MMMU: 82.4
    GPQA: 78.3
    Coding: 47.1
    TAU-Bench: 70.9
    Multilingual: 80.8
    AIME 2025: 83.6

    Anthropic's most powerful model with exceptional reasoning and creative capabilities.

    Great for Creative Tasks
    Overall Score: 78.2
    MMLU: 88.8
    MMMU: 75.0
    GPQA: 68.0
    Coding: 52.8
    TAU-Bench: 81.2
    Multilingual: 83.2
    AIME 2025: 61.3

    Google's optimized model balancing speed and performance for efficient deployment.

    Top Tier Reasoning
    Overall Score: 77.7
    MMLU: 88.4
    MMMU: 79.7
    GPQA: 82.8
    Coding: 42.6
    TAU-Bench: 72.3
    Multilingual: 87.2
    AIME 2025: 72.0

    Mistral AI's most advanced model with superior multilingual and coding performance.

    Great for Creative Tasks
    Overall Score: 75.8
    MMLU: 81.3
    MMMU: 73.8
    GPQA: 80.2
    Coding: 40.9
    TAU-Bench: 78.9
    Multilingual: 86.4
    AIME 2025: 72.6

    OpenAI's omni-modal model with native audio, vision, and text capabilities.

    Great for Creative Tasks
    Overall Score: 75.6
    MMLU: 88.7
    MMMU: 69.1
    GPQA: 53.6
    Coding: 46.7
    TAU-Bench: 78.0
    Multilingual: 90.1
    AIME 2025: 76.6

    OpenAI's enhanced multimodal model with improved reasoning and efficiency.

    Overall Score: 72.7
    MMLU: 74.8
    MMMU: 71.8
    GPQA: 66.3
    Coding: 42.5
    TAU-Bench: 68.0
    Multilingual: 83.7
    AIME 2025: 79.5

    Anthropic's most capable model, excelling at coding, writing, and complex reasoning tasks.

    Overall Score: 69.1
    MMLU: 88.7
    MMMU: 68.3
    GPQA: 59.4
    Coding: 54.6
    TAU-Bench: 71.5
    Multilingual: 79.2
    AIME 2025: 16.0
    DeepSeek-V3
    Open Source

    DeepSeek's advanced model with strong coding and reasoning capabilities.

    Overall Score: 68.7
    MMLU: 84.1
    MMMU: 65.2
    GPQA: 70.9
    Coding: 48.6
    TAU-Bench: 70.0
    Multilingual: 71.8
    AIME 2025: 32.0

    Google's advanced model with 2M token context window and strong multimodal capabilities.

    Great for Creative Tasks
    Overall Score: 66.4
    MMLU: 85.9
    MMMU: 62.2
    GPQA: 63.9
    Coding: 36.5
    TAU-Bench: 75.7
    Multilingual: 88.0
    AIME 2025: 40.0
    Qwen2.5 72B
    Open Source

    Alibaba's flagship open-source model with exceptional multilingual and coding capabilities.

    Overall Score: 64.5
    MMLU: 72.3
    MMMU: 75.2
    GPQA: 49.8
    Coding: 40.8
    TAU-Bench: 72.4
    Multilingual: 85.0
    AIME 2025: 37.0
    Llama 3.1 70B
    Open Source

    Meta's efficient large model offering strong performance with lower computational requirements.

    Overall Score: 64.1
    MMLU: 79.6
    MMMU: 68.9
    GPQA: 46.7
    Coding: 34.5
    TAU-Bench: 73.8
    Multilingual: 60.3
    AIME 2025: 70.2
    DeepSeek-V3
    Open Source

    DeepSeek's advanced model with strong coding and reasoning capabilities.

    Overall Score: 57.0
    MMLU: 81.7
    MMMU: 65.2
    GPQA: 43.9
    Coding: 29.4
    TAU-Bench: 70.0
    Multilingual: 71.8
    AIME 2025: 32.0

    xAI's most advanced model with real-time information access and enhanced reasoning.

    Top Tier Reasoning
    Overall Score: 52.4
    MMLU: 87.6
    MMMU: 72.1
    GPQA: 87.5
    Coding: -
    TAU-Bench: -
    Multilingual: 83.2
    AIME 2025: 74.5

    xAI's improved model with enhanced conversational abilities and real-time data access.

    Overall Score: 52.3
    MMLU: 85.7
    MMMU: 76.0
    GPQA: 80.2
    Coding: -
    TAU-Bench: -
    Multilingual: 80.1
    AIME 2025: 83.0

    Anthropic's fastest model, optimized for speed while maintaining strong capabilities.

    Overall Score: 48.3
    MMLU: 75.2
    MMMU: 46.4
    GPQA: 33.3
    Coding: 26.0
    TAU-Bench: 61.8
    Multilingual: 65.4
    AIME 2025: 23.0
    DeepSeek R1
    Open Source

    DeepSeek's advanced model with strong coding and reasoning capabilities.

    Overall Score: 39.5
    MMLU: -
    MMMU: 76.0
    GPQA: 71.5
    Coding: 25.2
    TAU-Bench: -
    Multilingual: -
    AIME 2025: 79.8
    Llama 4 405B
    Open Source

    Meta's next-generation open-source model with state-of-the-art capabilities.

    Overall Score: 39.5
    MMLU: 85.5
    MMMU: 73.4
    GPQA: 69.8
    Coding: -
    TAU-Bench: -
    Multilingual: 84.6
    AIME 2025: -
    Last updated: 2026-03-29