DeepSeek-V3 Comprehensive Capabilities

| Benchmark (Metric) | DeepSeek-V3 | DeepSeek-V2.5-0905 | Qwen2.5-72B-Inst | Llama-3.1-405B-Inst | Claude-3.5-Sonnet-1022 | GPT-4o-0513 |
|---|---|---|---|---|---|---|
| Architecture | MoE | MoE | Dense | Dense | - | - |
| # Activated Params | 37B | 21B | 72B | 405B | - | - |
| # Total Params | 671B | 236B | 72B | 405B | - | - |
| **English** | | | | | | |
| MMLU (EM) | 88.5 | 80.6 | 85.3 | 88.6 | 88.3 | 87.2 |
| MMLU-Redux (EM) | 89.1 | 80.3 | 85.6 | 86.2 | 88.9 | 88.0 |
| MMLU-Pro (EM) | 75.9 | 66.2 | 71.6 | 73.3 | 78.0 | 72.6 |
| DROP (3-shot F1) | 91.6 | 87.8 | 76.7 | 88.7 | 88.3 | 83.7 |
| IF-Eval (Prompt Strict) | 86.1 | 80.6 | 84.1 | 86.0 | 86.5 | 84.3 |
| GPQA-Diamond (Pass@1) | 59.1 | 41.3 | 49.0 | 51.1 | 65.0 | 49.9 |
| SimpleQA (Correct) | 24.9 | 10.2 | 9.1 | 17.1 | 28.4 | 38.2 |
| FRAMES (Acc.) | 73.3 | 65.4 | 69.8 | 70.0 | 72.5 | 80.5 |
| LongBench v2 (Acc.) | 48.7 | 35.4 | 39.4 | 36.1 | 41.0 | 48.1 |
| **Code** | | | | | | |
| HumanEval-Mul (Pass@1) | 82.6 | 77.4 | 77.3 | 77.2 | 81.7 | 80.5 |
| LiveCodeBench (Pass@1-COT) | 40.5 | 29.2 | 31.1 | 28.4 | 36.3 | 33.4 |
| LiveCodeBench (Pass@1) | 37.6 | 28.4 | 28.7 | 30.1 | 32.8 | 34.2 |
| Codeforces (Percentile) | 51.6 | 35.6 | 24.8 | 25.3 | 20.3 | 23.6 |
| SWE Verified (Resolved) | 42.0 | 22.6 | 23.8 | 24.5 | 50.8 | 38.8 |
| Aider-Edit (Acc.) | 79.7 | 71.6 | 65.4 | 63.9 | 84.2 | 72.9 |
| Aider-Polyglot (Acc.) | 49.6 | 18.2 | 7.6 | 5.8 | 45.3 | 16.0 |
| **Math** | | | | | | |
| AIME 2024 (Pass@1) | 39.2 | 16.7 | 23.3 | 23.3 | 16.0 | 9.3 |
| MATH-500 (EM) | 90.2 | 74.7 | 80.0 | 73.8 | 78.3 | 74.6 |
| CNMO 2024 (Pass@1) | 43.2 | 10.8 | 15.9 | 6.8 | 13.1 | 10.8 |
| **Chinese** | | | | | | |
| CLUEWSC (EM) | 90.9 | 90.4 | 91.4 | 84.7 | 85.4 | 87.9 |
| C-Eval (EM) | 86.5 | 79.5 | 86.1 | 61.5 | 76.7 | 76.0 |
| C-SimpleQA (Correct) | 64.1 | 54.1 | 48.4 | 50.4 | 51.3 | 59.3 |