LLM Leaderboard

    Comparing top-tier general-purpose models on key reasoning and language benchmarks, alongside leading open-source models.

    Top Performers

    1st Place


    ChatGPT 5.4

    OpenAI's frontier reasoning model combining thinking capabilities with advanced coding from GPT-5.3-Codex.

    Overall Score

    98.8

    2nd Place


    Gemini 3.1 Pro

    Google DeepMind's most advanced model with breakthrough reasoning, multimodal capabilities and 1M token context window.

    Overall Score

    97.9

    3rd Place


    Claude 4.6 Opus

    Anthropic's most powerful model with exceptional reasoning and creative capabilities.

    Overall Score

    94.9
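    The podium ordering above follows directly from the overall scores. A minimal sketch of that ranking step (the names and scores are taken from the cards above; the dictionary structure is an illustrative assumption, not the leaderboard's actual data format):

    ```python
    # Rank leaderboard entries by overall score.
    # Names and scores come from the Top Performers cards above;
    # the dict layout is an assumption for illustration only.
    entries = {
        "ChatGPT 5.4": 98.8,
        "Gemini 3.1 Pro": 97.9,
        "Claude 4.6 Opus": 94.9,
    }

    # Sort descending by score to recover the podium order.
    ranking = sorted(entries.items(), key=lambda kv: kv[1], reverse=True)
    for place, (name, score) in enumerate(ranking, start=1):
        print(f"{place}. {name}: {score}")
    ```

    The same sort applies unchanged to the full listing below once every entry is given a name and score.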

    Full Leaderboard

    OpenAI's frontier reasoning model combining thinking capabilities with advanced coding from GPT-5.3-Codex.

    Top Tier Reasoning
    Overall Score: 98.8
    MMLU: 91.2
    MMMU: 84.0
    GPQA: 92.8
    Coding: 77.5
    TAU-Bench: 84.0
    Multilingual: 92.0
    AIME 2025: 96.0

    Google DeepMind's most advanced model with breakthrough reasoning, multimodal capabilities and 1M token context window.

    Top Tier Reasoning
    Overall Score: 97.9
    MMLU: 91.4
    MMMU: 88.5
    GPQA: 94.3
    Coding: 74.5
    TAU-Bench: 85.0
    Multilingual: 92.6
    AIME 2025: 91.2

    Anthropic's most powerful model with exceptional reasoning and creative capabilities.

    Top Tier Reasoning
    Overall Score: 94.9
    MMLU: 90.8
    MMMU: 80.9
    GPQA: 91.3
    Coding: 73.1
    TAU-Bench: 82.4
    Multilingual: 91.1
    AIME 2025: 87.0

    Google's flagship model with exceptional multimodal capabilities and massive context window.

    Top Tier Reasoning
    Overall Score: 94.0
    MMLU: 93.4
    MMMU: 87.6
    GPQA: 91.9
    Coding: 65.2
    TAU-Bench: 80.0
    Multilingual: 91.8
    AIME 2025: 95.0

    OpenAI's most advanced reasoning model with breakthrough performance in complex problem-solving.

    Top Tier Reasoning
    Overall Score: 93.4
    MMLU: 84.6
    MMMU: 80.4
    GPQA: 92.4
    Coding: 68.8
    TAU-Bench: 82.0
    Multilingual: 91.0
    AIME 2025: 94.0

    Anthropic's most powerful model with exceptional reasoning and creative capabilities.

    Top Tier Reasoning
    Overall Score: 93.0
    MMLU: 90.9
    MMMU: 80.7
    GPQA: 87.0
    Coding: 70.1
    TAU-Bench: 82.7
    Multilingual: 90.0
    AIME 2025: 87.0

    OpenAI's most advanced reasoning model with breakthrough performance in complex problem-solving.

    Great for Creative Tasks
    Overall Score: 89.7
    MMLU: 84.6
    MMMU: 80.4
    GPQA: 85.7
    Coding: 62.0
    TAU-Bench: 81.1
    Multilingual: 91.0
    AIME 2025: 94.0

    Anthropic's most powerful model with exceptional reasoning and creative capabilities.

    Top Tier Reasoning
    Overall Score: 89.3
    MMLU: 89.1
    MMMU: 77.8
    GPQA: 83.4
    Coding: 63.6
    TAU-Bench: 81.4
    Multilingual: 90.0
    AIME 2025: 87.0

    OpenAI's most advanced reasoning model with breakthrough performance in complex problem-solving.

    Top Tier Reasoning
    Overall Score: 87.6
    MMLU: 84.6
    MMMU: 84.2
    GPQA: 85.7
    Coding: 56.5
    TAU-Bench: 81.1
    Multilingual: 88.8
    AIME 2025: 92.6

    Google's flagship model with exceptional multimodal capabilities and massive context window.

    Top Tier Reasoning
    Overall Score: 86.4
    MMLU: 89.8
    MMMU: 84.0
    GPQA: 88.4
    Coding: 52.3
    TAU-Bench: 80.0
    Multilingual: 89.0
    AIME 2025: 89.0

    Anthropic's most powerful model with exceptional reasoning and creative capabilities.

    Great for Creative Tasks
    Overall Score: 85.6
    MMLU: 88.8
    MMMU: 77.1
    GPQA: 79.6
    Coding: 58.9
    TAU-Bench: 82.4
    Multilingual: 89.5
    AIME 2025: 78.0

    Anthropic's most powerful model with exceptional reasoning and creative capabilities.

    Great for Creative Tasks
    Overall Score: 83.8
    MMLU: 88.8
    MMMU: 76.5
    GPQA: 79.6
    Coding: 55.9
    TAU-Bench: 81.4
    Multilingual: 88.8
    AIME 2025: 75.5

    OpenAI's most advanced reasoning model with breakthrough performance in complex problem-solving.

    Overall Score: 82.8
    MMLU: 85.6
    MMMU: 82.9
    GPQA: 83.3
    Coding: 49.6
    TAU-Bench: 70.4
    Multilingual: 88.8
    AIME 2025: 88.9

    OpenAI's reasoning model optimized for complex problem-solving, mathematics, and coding tasks.

    Overall Score: 80.7
    MMLU: 89.3
    MMMU: 78.2
    GPQA: 78.0
    Coding: 50.5
    TAU-Bench: 73.5
    Multilingual: 82.1
    AIME 2025: 79.3

    Anthropic's balanced model offering excellent performance across all domains.

    Great for Creative Tasks
    Overall Score: 79.2
    MMLU: 74.4
    MMMU: 74.4
    GPQA: 75.4
    Coding: 54.1
    TAU-Bench: 80.5
    Multilingual: 86.5
    AIME 2025: 70.5
    Qwen3 480B
    Open Source

    Alibaba's most powerful Qwen3 model with state-of-the-art performance across all benchmarks.

    Overall Score: 79.0
    MMLU: 82.3
    MMMU: 82.4
    GPQA: 78.3
    Coding: 47.1
    TAU-Bench: 70.9
    Multilingual: 80.8
    AIME 2025: 83.6

    Anthropic's most powerful model with exceptional reasoning and creative capabilities.

    Great for Creative Tasks
    Overall Score: 78.2
    MMLU: 88.8
    MMMU: 75.0
    GPQA: 68.0
    Coding: 52.8
    TAU-Bench: 81.2
    Multilingual: 83.2
    AIME 2025: 61.3

    Google's optimized model balancing speed and performance for efficient deployment.

    Top Tier Reasoning
    Overall Score: 77.7
    MMLU: 88.4
    MMMU: 79.7
    GPQA: 82.8
    Coding: 42.6
    TAU-Bench: 72.3
    Multilingual: 87.2
    AIME 2025: 72.0

    Mistral AI's most advanced model with superior multilingual and coding performance.

    Great for Creative Tasks
    Overall Score: 75.8
    MMLU: 81.3
    MMMU: 73.8
    GPQA: 80.2
    Coding: 40.9
    TAU-Bench: 78.9
    Multilingual: 86.4
    AIME 2025: 72.6

    OpenAI's omni-modal model with native audio, vision, and text capabilities.

    Great for Creative Tasks
    Overall Score: 75.6
    MMLU: 88.7
    MMMU: 69.1
    GPQA: 53.6
    Coding: 46.7
    TAU-Bench: 78.0
    Multilingual: 90.1
    AIME 2025: 76.6

    OpenAI's enhanced multimodal model with improved reasoning and efficiency.

    Overall Score: 72.7
    MMLU: 74.8
    MMMU: 71.8
    GPQA: 66.3
    Coding: 42.5
    TAU-Bench: 68.0
    Multilingual: 83.7
    AIME 2025: 79.5

    Anthropic's most capable model, excelling at coding, writing, and complex reasoning tasks.

    Overall Score: 69.1
    MMLU: 88.7
    MMMU: 68.3
    GPQA: 59.4
    Coding: 54.6
    TAU-Bench: 71.5
    Multilingual: 79.2
    AIME 2025: 16.0
    DeepSeek-V3
    Open Source

    DeepSeek's advanced model with strong coding and reasoning capabilities.

    Overall Score: 68.7
    MMLU: 84.1
    MMMU: 65.2
    GPQA: 70.9
    Coding: 48.6
    TAU-Bench: 70.0
    Multilingual: 71.8
    AIME 2025: 32.0

    Google's advanced model with 2M token context window and strong multimodal capabilities.

    Great for Creative Tasks
    Overall Score: 66.4
    MMLU: 85.9
    MMMU: 62.2
    GPQA: 63.9
    Coding: 36.5
    TAU-Bench: 75.7
    Multilingual: 88.0
    AIME 2025: 40.0
    Qwen2.5 72B
    Open Source

    Alibaba's flagship open-source model with exceptional multilingual and coding capabilities.

    Overall Score: 64.5
    MMLU: 72.3
    MMMU: 75.2
    GPQA: 49.8
    Coding: 40.8
    TAU-Bench: 72.4
    Multilingual: 85.0
    AIME 2025: 37.0
    Llama 3.1 70B
    Open Source

    Meta's efficient large model offering strong performance with lower computational requirements.

    Overall Score: 64.1
    MMLU: 79.6
    MMMU: 68.9
    GPQA: 46.7
    Coding: 34.5
    TAU-Bench: 73.8
    Multilingual: 60.3
    AIME 2025: 70.2
    DeepSeek-V3
    Open Source

    DeepSeek's advanced model with strong coding and reasoning capabilities.

    Overall Score: 57.0
    MMLU: 81.7
    MMMU: 65.2
    GPQA: 43.9
    Coding: 29.4
    TAU-Bench: 70.0
    Multilingual: 71.8
    AIME 2025: 32.0

    xAI's most advanced model with real-time information access and enhanced reasoning.

    Top Tier Reasoning
    Overall Score: 52.4
    MMLU: 87.6
    MMMU: 72.1
    GPQA: 87.5
    Coding: -
    TAU-Bench: -
    Multilingual: 83.2
    AIME 2025: 74.5

    xAI's improved model with enhanced conversational abilities and real-time data access.

    Overall Score: 52.3
    MMLU: 85.7
    MMMU: 76.0
    GPQA: 80.2
    Coding: -
    TAU-Bench: -
    Multilingual: 80.1
    AIME 2025: 83.0

    Anthropic's fastest model, optimized for speed while maintaining strong capabilities.

    Overall Score: 48.3
    MMLU: 75.2
    MMMU: 46.4
    GPQA: 33.3
    Coding: 26.0
    TAU-Bench: 61.8
    Multilingual: 65.4
    AIME 2025: 23.0
    DeepSeek R1
    Open Source

    DeepSeek's advanced model with strong coding and reasoning capabilities.

    Overall Score: 39.5
    MMLU: -
    MMMU: 76.0
    GPQA: 71.5
    Coding: 25.2
    TAU-Bench: -
    Multilingual: -
    AIME 2025: 79.8
    Llama 4 405B
    Open Source

    Meta's next-generation open-source model with state-of-the-art capabilities.

    Overall Score: 39.5
    MMLU: 85.5
    MMMU: 73.4
    GPQA: 69.8
    Coding: -
    TAU-Bench: -
    Multilingual: 84.6
    AIME 2025: -
    Last updated: 2026-03-29