🧪 SAGE: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning

SAGE (Scientific Advanced General Evaluation) is a large-scale, high-difficulty, cross-disciplinary benchmark developed by Shanghai AI Laboratory for evaluating frontier scientific reasoning capabilities of Large Language Models (LLMs).

Benchmark Overview

SAGE evaluates models across seven core scientific fields (57 sub-fields in total), covering the key domains of AI for Science (AI4S):

  • Mathematics - Abstract algebra, analysis, differential equations, and computational mathematics
  • Physics - Classical mechanics, electrodynamics, quantum mechanics, thermodynamics, and astrophysics
  • Chemistry - Physical chemistry, inorganic chemistry, organic chemistry, and analytical chemistry
  • Biology - Genetics, immunology, molecular biology, biophysics, and ecology
  • Computer Science - Computer architecture, artificial intelligence, and software fundamentals
  • Earth Science - Geography, geodesy, atmospheric chemistry, marine science, and geology
  • Materials Science - Composite materials, metal materials, organic polymer materials, and material synthesis

Evaluation Metrics

  • Accuracy (%): Overall correctness of predictions across all domains, judged by LLM-as-Judge (OpenAI o4-mini / Qwen3-235B-A22B)
  • mG-Pass@2: Multi-generation Pass rate for 2 predictions (measures consistency of model outputs)
  • mG-Pass@4: Multi-generation Pass rate for 4 predictions (measures stability of reasoning capabilities)

The leaderboard displays model performance sorted by average accuracy, with domain-specific scores reflecting strengths in different scientific fields. All metrics are derived from the SAGE validation/test set (≈800 expert-created original problems).
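The precise mG-Pass@k formula is not spelled out on this page, so the following Python sketch only illustrates the consistency idea behind the metric under one simple reading: a question counts as passed at k if all k sampled answers are judged correct. The function name and the example judge outputs are assumptions for illustration, not the official implementation.

# Illustrative sketch only: the official mG-Pass@k definition may differ.
# A question "passes" at k here if all k sampled answers are judged correct,
# which captures the consistency/stability idea described above.

from typing import List

def multi_gen_pass_at_k(judgments: List[List[bool]], k: int) -> float:
    """judgments[i][j] = whether the j-th generation for question i was judged correct."""
    passed = 0
    for per_question in judgments:
        gens = per_question[:k]
        if len(gens) == k and all(gens):
            passed += 1
    return passed / len(judgments) if judgments else 0.0

# Example: 3 questions, 4 generations each (hypothetical judge outputs).
judgments = [
    [True, True, True, True],     # consistent -> passes
    [True, False, True, True],    # one failure -> does not pass
    [False, False, False, False], # never correct -> does not pass
]
print(multi_gen_pass_at_k(judgments, k=4))  # 0.333...
print(multi_gen_pass_at_k(judgments, k=2))  # 0.333... (first two generations per question)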

🏆 SAGE Benchmark Results

The leaderboard currently contains 11 results.

Submit Your SAGE Results

Submit your evaluation outputs as a single JSON file. Each submission must include predictions and reasoning content for all test questions.

Required JSON Format:

{
    "submission_org": "Your Organization",
    "submission_email": "contact@example.com",
    "predictions": [
        {
            "original_question_id": 0,
            "content": ["answer1", "answer2", "answer3", "answer4"],
            "reasoning_content": ["reasoning1", "reasoning2", "reasoning3", "reasoning4"]
        }
    ]
}

Submission Guidelines:

  • Each prediction must include exactly 4 content items and 4 reasoning items
  • Question IDs should match the official SAGE test set
  • Provide clear scientific reasoning for each prediction
  • Ensure the JSON format is valid and complete (a minimal local check is sketched below)
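The sketch below is a minimal local sanity check against the format and guidelines above, assuming only the fields shown in the sample JSON; the script and file names are illustrative, and the official evaluation may apply additional server-side checks.

# Minimal local sanity check for a SAGE submission file (illustrative only).
# The default file name "submission.json" is just an example.

import json
import sys

REQUIRED_TOP_LEVEL = ("submission_org", "submission_email", "predictions")

def validate_submission(path: str) -> bool:
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)  # raises an error if the JSON is malformed

    ok = all(key in data for key in REQUIRED_TOP_LEVEL)
    for pred in data.get("predictions", []):
        has_id = isinstance(pred.get("original_question_id"), int)
        # Each prediction must carry exactly 4 answers and 4 reasoning traces.
        four_answers = len(pred.get("content", [])) == 4
        four_reasons = len(pred.get("reasoning_content", [])) == 4
        if not (has_id and four_answers and four_reasons):
            print(f"Invalid prediction: {pred.get('original_question_id')}")
            ok = False
    return ok

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "submission.json"
    print("OK" if validate_submission(path) else "FAILED")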

Your submission will be automatically evaluated across all scientific domains and added to the leaderboard.

📋 Submission Requirements

  • File format: Upload a JSON file that follows the SAGE format
  • Organization info: Provide the exact organization name (it will be displayed on the leaderboard)
  • Contact email: Provide a valid email address for result notifications
  • Automatic evaluation: After submission, LLM-based evaluation runs automatically and the leaderboard is updated