🧪 SAGE: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
SAGE (Scientific Advanced General Evaluation) is a large-scale, high-difficulty, cross-disciplinary benchmark developed by Shanghai AI Laboratory for evaluating frontier scientific reasoning capabilities of Large Language Models (LLMs).
Benchmark Overview
SAGE evaluates models across seven core scientific fields (57 sub-fields in total), covering the key domains of AI for Science (AI4S):
- Mathematics - Abstract algebra, analysis, differential equations, and computational mathematics
- Physics - Classical mechanics, electrodynamics, quantum mechanics, thermodynamics, and astrophysics
- Chemistry - Physical chemistry, inorganic chemistry, organic chemistry, and analytical chemistry
- Biology - Genetics, immunology, molecular biology, biophysics, and ecology
- Computer Science - Computer architecture, artificial intelligence, and software fundamentals
- Earth Science - Geography, geodesy, atmospheric chemistry, marine science, and geology
- Materials Science - Composite materials, metal materials, organic polymer materials, and material synthesis
Evaluation Metrics
- Accuracy (%): Overall correctness of predictions across all domains, judged by LLM-as-Judge (OpenAI o4-mini / Qwen3-235B-A22B)
- mG-Pass@2: Multi-generation Pass rate for 2 predictions (measures consistency of model outputs)
- mG-Pass@4: Multi-generation Pass rate for 4 predictions (measures stability of reasoning capabilities)

The leaderboard displays model performance sorted by average accuracy, with domain-specific scores reflecting strengths in different scientific fields. All metrics are derived from the SAGE validation/test set (≈800 expert-created original problems).
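As a rough illustration only (not the official scoring code), the sketch below computes accuracy from an LLM judge's per-generation verdicts, plus an "all generations correct" consistency rate as one plausible reading of a multi-generation pass metric; the function names, the `judgments` input format, and that interpretation of mG-Pass@k are all assumptions.

```python
from typing import List

def accuracy(judgments: List[List[bool]]) -> float:
    """Fraction of questions whose first generation is judged correct.
    judgments[q][g] is the LLM-judge verdict for generation g of question q."""
    return sum(j[0] for j in judgments) / len(judgments)

def mg_pass_at_k(judgments: List[List[bool]], k: int) -> float:
    """Illustrative consistency metric: a question counts as passed only if
    all of its first k generations are judged correct. The official mG-Pass@k
    definition may differ; this is just one simple reading."""
    return sum(all(j[:k]) for j in judgments) / len(judgments)

# Example: 3 questions, 4 generations each, judged by the LLM judge.
judgments = [
    [True, True, True, True],     # consistently correct
    [True, False, True, False],   # unstable
    [False, False, False, False], # consistently wrong
]
print(accuracy(judgments))         # 0.667
print(mg_pass_at_k(judgments, 2))  # 0.333
print(mg_pass_at_k(judgments, 4))  # 0.333
```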
🏆 SAGE Benchmark Results
| Model | Organization | Source | Accuracy (%) | mG-Pass@2 | mG-Pass@4 | Submitted |
|---|---|---|---|---|---|---|
| Doubao-Seed-1.6-thinking | SH AI Lab | User Submission | 43.8 | 34.2 | 33.5 | 2025-09-09 |
Submit Your SAGE Results
Results are submitted as a JSON file of evaluation outputs. Each submission must include predictions and reasoning content for all test questions.
Required JSON Format:
{
  "submission_org": "Your Organization",
  "submission_email": "contact@example.com",
  "predictions": [
    {
      "original_question_id": 0,
      "content": ["answer1", "answer2", "answer3", "answer4"],
      "reasoning_content": ["reasoning1", "reasoning2", "reasoning3", "reasoning4"]
    }
  ]
}
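For example, a submission file with this structure can be assembled in Python roughly as follows; only the field names come from the format above, while the question IDs, answer strings, and file name are placeholders.

```python
import json

# Assemble a SAGE submission: 4 answers and 4 reasoning traces per question,
# mirroring the required fields shown above (all values are placeholders).
submission = {
    "submission_org": "Your Organization",
    "submission_email": "contact@example.com",
    "predictions": [
        {
            "original_question_id": qid,
            "content": [f"answer for q{qid}, run {g}" for g in range(4)],
            "reasoning_content": [f"reasoning for q{qid}, run {g}" for g in range(4)],
        }
        for qid in range(3)  # replace with the official SAGE test-set question IDs
    ],
}

with open("sage_submission.json", "w", encoding="utf-8") as f:
    json.dump(submission, f, ensure_ascii=False, indent=2)
```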
Submission Guidelines:
- Each prediction must include exactly 4 content items and 4 reasoning items
- Question IDs should match the official SAGE test set
- Provide clear scientific reasoning for each prediction
- Ensure JSON format is valid and complete
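Before uploading, it can help to sanity-check the file against these guidelines. The sketch below is a hedged validator that enforces only the rules stated above (required fields present, exactly 4 content and 4 reasoning items per prediction); it is not the official checker, and the file name is an assumption.

```python
import json

def validate_submission(path: str) -> list:
    """Return a list of problems found; an empty list means the basic checks passed."""
    errors = []
    with open(path, encoding="utf-8") as f:
        data = json.load(f)  # raises if the JSON itself is malformed

    # Top-level fields required by the SAGE submission format.
    for field in ("submission_org", "submission_email", "predictions"):
        if field not in data:
            errors.append(f"missing top-level field: {field}")

    # Each prediction needs a question ID and exactly 4 answers + 4 reasoning traces.
    for i, pred in enumerate(data.get("predictions", [])):
        if "original_question_id" not in pred:
            errors.append(f"prediction {i}: missing original_question_id")
        for key in ("content", "reasoning_content"):
            items = pred.get(key, [])
            if len(items) != 4:
                errors.append(f"prediction {i}: expected 4 {key} items, got {len(items)}")
    return errors

if __name__ == "__main__":
    problems = validate_submission("sage_submission.json")
    print("OK" if not problems else "\n".join(problems))
```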
Your submission will be automatically evaluated across all scientific domains and added to the leaderboard.
📋 Submission Requirements
- File format: upload a JSON file that follows the SAGE format
- Organization info: provide an accurate organization name (it will be displayed on the leaderboard)
- Contact email: provide a valid email address for result notifications
- Automatic evaluation: after submission, LLM-based evaluation runs automatically and the leaderboard is updated