
TReB

A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of LLMs
Leaderboard (LLM-as-a-judge)
The results are generated through the native TReB evaluation pipeline, taking the highest score among the three inference modes: TCoT, PoT, and ICoT; a code sketch of this aggregation follows the table. (We also encourage submitters to use custom workflows or inference modes.)
| Date | Model | Organization | Table Understanding (TU) | Table Basic Operation (TBO) | Table Computational Operation (TCO) | Data Analysis (DA) | Advanced Data Analysis (ADA) | Overall Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Oct 28, 2025 | JT-DA | JIUTIAN Research | 80.704 | 88.88 | 83.335 | 83.335 | 53.05 | 78.64 |
| Feb 3, 2026 | doubao-1.5-pro | ByteDance | 79.996 | 87.12 | 79.9 | 90.5 | 54.818 | 78.47 |
| Sep 1, 2025 | QwQ-32B | Alibaba | 83.22 | 81.98 | 83.275 | 83.275 | 58.595 | 78.15 |
| Sep 1, 2025 | Qwen3-32B | Alibaba | 81.084 | 83.06 | 82.78 | 84.415 | 57.003 | 77.67 |
| Feb 3, 2026 | gpt-4o | OpenAI | 81.09 | 80.21 | 81.145 | 89.288 | 56.22 | 77.59 |
| Sep 1, 2025 | Qwen2.5-72B-Instruct | Alibaba | 80.462 | 80.17 | 80.225 | 85.093 | 53.555 | 75.9 |
| Sep 1, 2025 | Qwen3-14B | Alibaba | 81.506 | 84.6 | 82.17 | 76.405 | 52.225 | 75.38 |
| Feb 3, 2026 | gpt-4o-mini | OpenAI | 75.496 | 76.88 | 78.91 | 84.85 | 49.062 | 73.04 |
| Sep 1, 2025 | DeepSeek-R1-Distill-Qwen-32B | DeepSeek | 76.914 | 77.85 | 77.41 | 81.545 | 49.948 | 72.73 |
| Sep 1, 2025 | Llama-3.1-70B-Instruct | Meta | 68.012 | 78.9 | 77.595 | 81.06 | 56.223 | 72.36 |
| Sep 1, 2025 | Qwen3-8B | Alibaba | 79.052 | 80.03 | 78.465 | 71.32 | 49.098 | 71.59 |
| Sep 1, 2025 | TableGPT2-7B | Zhejiang University | 71.632 | 71.00 | 74.685 | 85.12 | 48.212 | 70.13 |
| Sep 1, 2025 | DeepSeek-R1-Distill-Qwen-14B | DeepSeek | 75.028 | 74.12 | 72.76 | 73.73 | 39.18 | 66.96 |
| Sep 1, 2025 | Table-R1-Zero-7B | Yale | 73.53 | 60.655 | 73.655 | 80.07 | 45.853 | 66.75 |
| Dec 2, 2025 | DeepSeek-R1-0528-Qwen3-8B | DeepSeek | 77.292 | 78.21 | 69.035 | 64.273 | 40.387 | 65.84 |
| Sep 1, 2025 | Qwen2.5-Coder-7B-Instruct | Alibaba | 67.612 | 73.99 | 68.02 | 80.025 | 38.832 | 65.7 |
| Sep 1, 2025 | Qwen2.5-7B-Instruct | Alibaba | 69.052 | 68.23 | 67.3 | 79.755 | 41.915 | 65.25 |
| Sep 1, 2025 | Seed-Coder-8B-Instruct | ByteDance | 64.698 | 71.16 | 66.17 | 82.085 | 41.582 | 65.14 |
| Sep 1, 2025 | Qwen2.5-Math-72B-Instruct | Alibaba | 74.416 | 66.24 | 69.24 | 75.573 | 38.897 | 64.87 |
| Sep 1, 2025 | Llama-3.1-8B-Instruct | Meta | 59.4 | 70.13 | 63.81 | 76.445 | 35.35 | 61.03 |
| Sep 1, 2025 | DeepSeek-R1-Distill-Qwen-7B | DeepSeek | 60.72 | 58.58 | 53.855 | 71.463 | 29.313 | 54.79 |
| Sep 1, 2025 | DeepSeek-R1-Distill-Llama-8B | DeepSeek | 62.689 | 56.52 | 47.8 | 50.773 | 19.97 | 47.55 |
| Sep 1, 2025 | Yi-Coder-9B-Chat | 01-AI | 43.97 | 57.05 | 47.01 | 56.095 | 33.172 | 47.46 |
| Sep 1, 2025 | Table-R1-SFT-7B | Yale | 60.404 | 23.46 | 24.1 | 43.01 | 37.603 | 37.72 |
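
For concreteness, here is a minimal sketch of the best-of-modes aggregation described above, assuming the maximum is taken per skill. The per-mode numbers are invented for illustration, and the official overall score is presumably aggregated over the 20 sub-tasks rather than being a plain mean of the five skill scores:

```python
# Hedged sketch of the leaderboard aggregation: each skill reports the best
# LLM-as-a-judge score among the TCoT, PoT, and ICoT inference modes.
# The per-mode numbers below are invented for illustration.

SKILLS = ["TU", "TBO", "TCO", "DA", "ADA"]

# scores[mode][skill] -> judge score for that mode and skill
scores = {
    "TCoT": {"TU": 79.1, "TBO": 83.0, "TCO": 80.2, "DA": 82.4, "ADA": 50.1},
    "PoT":  {"TU": 77.5, "TBO": 85.2, "TCO": 83.3, "DA": 84.0, "ADA": 53.0},
    "ICoT": {"TU": 80.7, "TBO": 84.1, "TCO": 82.0, "DA": 83.3, "ADA": 52.2},
}

# Best-of-three-modes per skill, per the leaderboard note above.
best = {skill: max(scores[mode][skill] for mode in scores) for skill in SKILLS}

# CAUTION: the official overall score is presumably computed over the 20
# sub-tasks; the plain mean of skill scores below is only an approximation.
overall = sum(best.values()) / len(best)

print(best)
print(f"approx. overall: {overall:.2f}")
```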

About TReB

TReB is a comprehensive table reasoning evaluation benchmark that measures both shallow table understanding and deep table reasoning abilities. We construct a high-quality dataset to evaluate five core skills of LLMs: Table Understanding (TU), Table Basic Operation (TBO), Table Computational Operation (TCO), Data Analysis (DA), and Advanced Data Analysis (ADA), covering 20 sub-tasks in total. The evaluation framework supports three distinct inference modes, Textual Chain-of-Thought (TCoT), Program-of-Thought (PoT), and Interleaved Chain-of-Thought (ICoT), encouraging more robust reasoning.
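
To make the three modes concrete, the sketch below shows how they differ in practice. The prompt wording, function names, and control flow are illustrative assumptions, not TReB's actual implementation:

```python
# Hedged sketch of the three TReB inference modes. Prompt wording, function
# names, and control flow here are illustrative assumptions, not the
# benchmark's actual code.

def tcot_prompt(table: str, question: str) -> str:
    """TCoT: the model reasons step by step in natural language only."""
    return (f"{table}\n\nQuestion: {question}\n"
            "Reason step by step in text, then state the final answer.")

def pot_prompt(table: str, question: str) -> str:
    """PoT: the model writes a program whose execution yields the answer."""
    return (f"{table}\n\nQuestion: {question}\n"
            "Write Python code that computes the answer; the code's output "
            "is taken as the final answer.")

def icot_prompt(table: str, question: str) -> str:
    """ICoT: the model interleaves textual reasoning with code execution."""
    return (f"{table}\n\nQuestion: {question}\n"
            "Alternate between short textual reasoning steps and executable "
            "code cells, using each execution result to plan the next step.")

MODES = {"TCoT": tcot_prompt, "PoT": pot_prompt, "ICoT": icot_prompt}

if __name__ == "__main__":
    table = "city | population\nA | 10\nB | 20"
    question = "Which city has the larger population?"
    for name, build in MODES.items():
        print(f"--- {name} ---\n{build(table, question)}\n")
```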

News

  • Feb. 3, 2026:

    We have updated the evaluation results of some models on the leaderboard and removed the ROUGE-L metric from the display, retaining only the LLM-as-a-judge metric (which is comparatively fairer). A rough sketch of such a judge follows this list.

  • Jun. 18, 2025:

    🔥🔥 The benchmark paper, code, and dataset are all released! Please check them out and submit your results to the leaderboard!

    Thank you for all the feedback!
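
As a rough illustration of the LLM-as-a-judge metric mentioned above: the sketch below scores a model answer against a reference answer with a judge model. The prompt, the 0-100 rubric, and the `chat()` helper are hypothetical stand-ins; TReB's actual judging prompt and rubric live in the released code.

```python
# Hedged sketch of an LLM-as-a-judge scorer. The prompt, the 0-100 rubric,
# and the chat() helper are hypothetical stand-ins, not TReB's actual setup.

def chat(prompt: str) -> str:
    """Stand-in for a call to a judge LLM; replace with a real API call."""
    return "87"  # canned reply so the sketch runs end to end

JUDGE_PROMPT = """You are grading a table-reasoning answer.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Reply with a single integer score from 0 to 100."""

def judge(question: str, reference: str, candidate: str) -> int:
    reply = chat(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    return int(reply.strip())  # assumes the judge follows the output format

print(judge("Which city has the larger population?", "B", "City B (pop. 20)"))
```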


Challenges from TReB

(Figure: overview of the TReB framework.)

Submission

πŸ€—πŸ€— We warmly welcome submissions to our leaderboard, including both your own methods and contributions showcasing the latest model performance! TReB features two separate leaderboards. Please refer to the Submission Guidelines below for details, and submit your results as instructed to jttreb2025@gmail.com.

Citation

@misc{li2025trebcomprehensivebenchmarkevaluating,
  title={TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models},
  author={Ce Li and Xiaofan Liu and Zhiyan Song and Ce Chi and Chen Zhao and Jingjing Yang and Zhendong Wang and Kexin Yang and Boshen Shi and Xing Wang and Chao Deng and Junlan Feng},
  year={2025},
  eprint={2506.18421},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.18421},
}