logo

TReB

A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of LLMs
Leaderboard (LLM-as-a-judge)
The results are generated through the native TReB evaluation process, taking the highest score among the three modes: TCoT, PoT, and ICoT. (We also encourage submitters to use custom workflows or inference modes.)
Model Table Understanding
(TU)
Table Basic Operation
(TBO)
Table Computational Operation
(TCO)
Data Analysis
(DA)
Advanced Data Analysis
(ADA)
Overall Score
May 18, 2026 Kimi-K2.5
Moonshot AI
85.862 94.69 85.63 94.39 63.78 84.18
May 18, 2026 gemini-3-pro
Google
82.968 93.69 80.165 95.235 64.44333333 82.91
May 18, 2026 JT-DA
JIUTIAN Research
84.574 91.805 83.5075 94.065 61.11333333 82.31
May 18, 2026 Qwen3.6-27B
Alibaba
86.12 88.845 83.84 94.845 62.92666667 81.31
May 18, 2026 Qwen3.5-397B-A17B
Alibaba
84.142 88.8725 82.66 89.5125 63.17333333 81.03
May 18, 2026 MiniMax27
MiniMax
85.636 92.565 80.515 94.05 53.195 80.5
May 18, 2026 Qwen3-235B-A22B-Thinking
Alibaba
83.114 89.47 83.225 89.275 60.54833333 80.4
May 18, 2026 Qwen3.6-35B-A3B
Alibaba
86.081 90.49 83.165 88.59 60.51833333 80.34
May 18, 2026 Qwen3.5-35B-A3B
Alibaba
85.076 91.895 82.97 92.735 58.555 80.13
May 18, 2026 Qwen3.5-27B
Alibaba
85.698 93.04 82.82 86.205 59.415 79.92
May 18, 2026 Qwen3.5-122B-A10B
Alibaba
85.07 92.3 83.2 91.46 61.025 79.15
Feb 3, 2026 doubao-1.5-pro
ByteDance
79.996 87.12 79.9 90.5 54.818 78.47
Sep 1, 2025 QwQ-32B
Alibaba
83.22 81.98 83.275 83.275 58.595 78.15
Sep 1, 2025 Qwen3-32B
Alibaba
81.084 83.06 82.78 84.415 57.003 77.67
Feb 3, 2026 gpt-4o
OpenAI
81.09 80.21 81.145 89.288 56.22 77.59
Sep 1, 2025 Qwen2.5-72B-Instruct
Alibaba
80.462 80.17 80.225 85.093 53.555 75.9
Sep 1, 2025 Qwen3-14B
Alibaba
81.506 84.6 82.17 76.405 52.225 75.38
Feb 3, 2026 gpt-4o-mini
OpenAI
75.496 76.88 78.91 84.85 49.062 73.04
Sep 1, 2025 Deepseek-R1-Distill-Qwen-32B
Deepseek
76.914 77.85 77.41 81.545 49.948 72.73
Sep 1, 2025 Llama-3.1-70B-Instruct
Meta
68.012 78.9 77.595 81.06 56.223 72.36
Sep 1, 2025 Qwen3-8B
Alibaba
79.052 80.03 78.465 71.32 49.098 71.59
Sep 1, 2025 TableGPT2-7B
Zhejiang University
71.632 71.00 74.685 85.12 48.212 70.13
Sep 1, 2025 Deepseek-R1-Distill-Qwen-14B
Deepseek
75.028 74.12 72.76 73.73 39.18 66.96
Sep 1, 2025 Table-R1-Zero-7B
Yale
73.53 60.655 73.655 80.07 45.853 66.75
Dec 2, 2025 DeepSeek-R1-0528-Qwen3-8B
DeepSeek
77.292 78.21 69.035 64.273 40.387 65.84
Sep 1, 2025 Qwen2.5-Coder-7B-Instruct
Alibaba
67.612 73.99 68.02 80.025 38.832 65.7
Sep 1, 2025 Qwen2.5-7B-Instruct
Alibaba
69.052 68.23 67.3 79.755 41.915 65.25
Sep 1, 2025 Seed-Coder-8B-Instruct
ByteDance
64.698 71.16 66.17 82.085 41.582 65.14
Sep 1, 2025 Qwen2.5-Math-72B-Instruct
Alibaba
74.416 66.24 69.24 75.573 38.897 64.87
Sep 1, 2025 Llama-3.1-8B-Instruct
Meta
59.4 70.13 63.81 76.445 35.35 61.03
Sep 1, 2025 Deepseek-R1-Distill-Qwen-7B
Deepseek
60.72 58.58 53.855 71.463 29.313 54.79
Sep 1, 2025 Deepseek-R1-Distill-Llama-8B
Deepseek
62.689 56.52 47.8 50.773 19.97 47.55
Sep 1, 2025 Yi-Coder-9B-Chat
01-AI
43.97 57.05 47.01 56.095 33.172 47.46
Sep 1, 2025 Table-R1-SFT-7B
Yale
60.404 23.46 24.1 43.01 37.603 37.72

About TReB

TReB is a comprehensive table reasoning evolution benchmark, which measures both shallow table understanding abilities and deep table reasoning abilities. Overall, we construct a high quality dataset to evaluate 5 core skills of LLMs: Table Understanding (TU), Table Basic Operation (TBO), Table Computational Operation (TCO), Data Analysis (DA), and Advanced Data Analysis (ADA). Accordingly, we propose a total of 20 sub-tasks. The evaluation framework supports 3 distinct inference modes, TCoT, PoT and ICoT, encouraging more robust reasoning.

News

  • Feb. 3, 2026:

    We have updated the evaluation results of some models on the leaderboard and removed the display of the Rouge-L metric, retaining only the LLM-as-a-judge metric (which is relatively fairer).

    Jun. 18, 2025:

    πŸ”₯πŸ”₯ The benchmark paper, code and dataset are all released! Please check and submit your result to leaderboard!!

    Thank you for all the feedback!!!


Challenges from TReB

intro_framework

Submission

πŸ€—πŸ€— We warmly welcome submissions to our leaderboard, including both your own methods and contributions showcasing the latest model performance! TReB features two separate leaderboards. Please refer to the Submission Guidelines below for details, and submit your results as instructed to jttreb2025@gmail.com.

Citation

@misc{li2025trebcomprehensivebenchmarkevaluating,
title={TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models}, 
author={Ce Li and Xiaofan Liu and Zhiyan Song and Ce Chi and Chen Zhao and Jingjing Yang and Zhendong Wang and Kexin Yang and Boshen Shi and Xing Wang and Chao Deng and Junlan Feng},
year={2025},
eprint={2506.18421},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.18421}, 
}