
TReB

A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of LLMs
Leaderboard (LLM-as-a-judge)
Scores are averaged over the TCoT, PoT, and ICoT inference modes.

| Date | Model | Type | TU | TBO | TCO | DA | ADA | Avg. |
|---|---|---|---|---|---|---|---|---|
| Sep 1, 2025 | Qwen3-32B | open source | 83.12 | 84.12 | 77.57 | 73.15 | 76.84 | 72.96 |
| Sep 1, 2025 | Qwen3-14B | open source | 81.30 | 83.13 | 78.48 | 68.56 | 46.23 | 71.54 |
| Sep 1, 2025 | Qwen3-8B | open source | 81.70 | 81.56 | 75.36 | 64.30 | 43.15 | 69.21 |
| Sep 1, 2025 | Qwen2.5-72B-Instruct | open source | 81.90 | 74.85 | 72.65 | 71.03 | 45.29 | 69.15 |
| Sep 1, 2025 | DeepSeek-R1-Distill-Qwen-32B | open source | 80.00 | 81.46 | 70.59 | 65.74 | 40.32 | 67.62 |
| Sep 1, 2025 | DeepSeek-R1-0528-Qwen3-8B | open source | 77.49 | 82.38 | 70.34 | 56.69 | 36.42 | 64.66 |
| Sep 1, 2025 | Llama-3.1-70B-Instruct | open source | 75.46 | 69.69 | 69.69 | 63.72 | 40.91 | 63.81 |
| Sep 1, 2025 | DeepSeek-R1-Distill-Qwen-14B | open source | 79.57 | 75.05 | 66.88 | 57.83 | 30.92 | 62.05 |
| Sep 1, 2025 | TableGPT2-7B | open source | 75.07 | 64.07 | 64.35 | 59.90 | 37.14 | 60.10 |
| Sep 1, 2025 | Seed-Coder-8B-Instruct | open source | 65.49 | 60.10 | 57.24 | 66.94 | 39.15 | 57.78 |
| Sep 1, 2025 | Table-R1-Zero-7B | open source | 73.24 | 57.97 | 60.79 | 61.58 | 33.93 | 57.50 |
| Sep 1, 2025 | Qwen2.5-Coder-7B-Instruct | open source | 69.09 | 65.88 | 57.99 | 59.64 | 32.66 | 57.05 |
| Sep 1, 2025 | Qwen2.5-7B-Instruct | open source | 70.22 | 58.71 | 58.06 | 62.12 | 32.48 | 56.32 |
| Sep 1, 2025 | Qwen2.5-Math-72B-Instruct | open source | 68.31 | 62.83 | 58.74 | 58.10 | 31.79 | 55.95 |
| Sep 1, 2025 | Llama-3.1-8B-Instruct | open source | 62.44 | 54.30 | 51.86 | 58.47 | 26.77 | 50.77 |
| Sep 1, 2025 | DeepSeek-R1-Distill-Qwen-7B | open source | 57.03 | 59.14 | 50.17 | 54.63 | 22.45 | 48.68 |
| Sep 1, 2025 | Yi-Coder-9B-Chat | open source | 50.87 | 56.67 | 48.54 | 53.91 | 30.33 | 48.07 |
| Sep 1, 2025 | Table-R1-SFT-7B | open source | 57.63 | 52.04 | 47.42 | 43.56 | 36.92 | 47.51 |
| Sep 1, 2025 | DeepSeek-Coder-33B-Instruct | open source | 46.26 | 55.38 | 44.68 | 51.28 | 27.82 | 45.08 |
| Sep 1, 2025 | DeepSeek-R1-Distill-Llama-8B | open source | 48.39 | 52.23 | 44.16 | 35.46 | 15.05 | 41.06 |
| Sep 1, 2025 | Qwen2.5-Math-7B-Instruct | open source | 33.26 | 37.07 | 27.97 | 29.03 | 11.35 | 27.74 |
| Sep 1, 2025 | Kimina-Prover-Preview-Distill-7B | open source | 19.20 | 15.57 | 14.37 | 13.49 | 6.08 | 13.74 |
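To make the metric behind this leaderboard concrete, the sketch below shows one way an LLM-as-a-judge scoring loop can be structured. It is only a hypothetical illustration: TReB's actual judge model, prompt, and scoring rubric are defined in the paper and code, and `JUDGE_PROMPT` and `call_judge_model` here are placeholders for whatever prompt and LLM client you use.

```python
# Hypothetical sketch of an LLM-as-a-judge scoring loop (not TReB's actual
# judge prompt or rubric). Plug your own LLM client into call_judge_model.

JUDGE_PROMPT = """You are a strict grader. Given a question, a reference
answer, and a model answer, reply with a single score from 0 to 100.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Score:"""

def call_judge_model(prompt: str) -> str:
    # Placeholder: call your judge LLM here and return its raw text reply.
    raise NotImplementedError("plug in your own LLM client")

def judge_score(question: str, reference: str, prediction: str) -> float:
    reply = call_judge_model(JUDGE_PROMPT.format(
        question=question, reference=reference, prediction=prediction))
    return float(reply.strip())  # assumes the judge replies with a bare number

def judge_dataset(examples: list[dict]) -> float:
    """Average judge score over {question, reference, prediction} examples."""
    scores = [judge_score(e["question"], e["reference"], e["prediction"])
              for e in examples]
    return sum(scores) / len(scores)
```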
Leaderboard (ROUGE-L)
Scores are averaged over the TCoT, PoT, and ICoT inference modes.

| Date | Model | Type | TU | TBO | TCO | DA | ADA | Avg. |
|---|---|---|---|---|---|---|---|---|
| Sep 1, 2025 | Qwen3-32B | open source | 53.07 | 38.26 | 50.10 | 27.93 | 25.44 | 38.96 |
| Sep 1, 2025 | Qwen2.5-72B-Instruct | open source | 53.84 | 33.00 | 47.60 | 27.97 | 23.48 | 37.18 |
| Sep 1, 2025 | Qwen3-8B | open source | 51.52 | 37.95 | 47.20 | 25.96 | 22.51 | 37.03 |
| Sep 1, 2025 | DeepSeek-R1-Distill-Qwen-32B | open source | 52.84 | 35.52 | 46.40 | 27.51 | 21.81 | 36.82 |
| Sep 1, 2025 | Qwen3-14B | open source | 51.15 | 39.00 | 48.04 | 23.47 | 22.33 | 36.80 |
| Sep 1, 2025 | DeepSeek-R1-Distill-Qwen-14B | open source | 50.04 | 28.66 | 42.51 | 26.04 | 17.00 | 32.85 |
| Sep 1, 2025 | Qwen2.5-7B-Instruct | open source | 45.63 | 28.58 | 43.69 | 25.26 | 18.69 | 32.37 |
| Sep 1, 2025 | Qwen2.5-Coder-7B-Instruct | open source | 44.15 | 29.54 | 42.93 | 24.02 | 17.81 | 31.69 |
| Sep 1, 2025 | Table-R1-Zero-7B | open source | 44.84 | 27.42 | 41.42 | 25.35 | 17.76 | 31.36 |
| Sep 1, 2025 | TableGPT2-7B | open source | 43.25 | 31.83 | 42.08 | 16.82 | 15.56 | 29.91 |
| Sep 1, 2025 | Seed-Coder-8B-Instruct | open source | 36.65 | 28.60 | 38.52 | 26.97 | 17.21 | 29.59 |
| Sep 1, 2025 | DeepSeek-R1-0528-Qwen3-8B | open source | 35.39 | 26.80 | 33.79 | 16.58 | 14.48 | 25.41 |
| Sep 1, 2025 | DeepSeek-R1-Distill-Qwen-7B | open source | 32.20 | 25.14 | 31.41 | 23.78 | 11.22 | 24.75 |
| Sep 1, 2025 | Qwen2.5-Math-72B-Instruct | open source | 31.11 | 24.92 | 30.53 | 18.89 | 16.53 | 24.40 |
| Sep 1, 2025 | Llama-3.1-70B-Instruct | open source | 30.98 | 20.59 | 29.91 | 10.49 | 11.39 | 20.67 |
| Sep 1, 2025 | DeepSeek-R1-Distill-Llama-8B | open source | 30.71 | 21.49 | 24.94 | 13.66 | 7.65 | 19.69 |
| Sep 1, 2025 | Table-R1-SFT-7B | open source | 25.17 | 16.95 | 23.34 | 14.24 | 6.20 | 17.18 |
| Sep 1, 2025 | Llama-3.1-8B-Instruct | open source | 20.00 | 19.88 | 22.96 | 12.44 | 9.96 | 17.05 |
| Sep 1, 2025 | DeepSeek-Coder-33B-Instruct | open source | 22.09 | 20.42 | 17.75 | 14.10 | 8.33 | 16.54 |
| Sep 1, 2025 | Yi-Coder-9B-Chat | open source | 22.89 | 17.90 | 19.02 | 9.56 | 8.92 | 15.66 |
| Sep 1, 2025 | Qwen2.5-Math-7B-Instruct | open source | 9.26 | 13.53 | 13.26 | 6.00 | 6.22 | 9.65 |
| Sep 1, 2025 | Kimina-Prover-Preview-Distill-7B | open source | 2.49 | 4.30 | 3.39 | 2.19 | 1.75 | 2.82 |
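To make the metric behind this leaderboard concrete, here is a minimal sketch of sentence-level ROUGE-L, the LCS-based F-measure. TReB's exact tokenization and ROUGE configuration may differ (e.g. recall-weighted F-beta instead of F1); this only illustrates what the numbers above measure.

```python
# Minimal sketch of sentence-level ROUGE-L (LCS-based F1), using naive
# whitespace tokenization. Not the exact scorer used by TReB.

def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference: str, prediction: str) -> float:
    ref, pred = reference.split(), prediction.split()
    if not ref or not pred:
        return 0.0
    lcs = lcs_length(ref, pred)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(pred)
    return 2 * precision * recall / (precision + recall)

# Example: LCS is "the cat on the mat" (5 tokens) -> F1 = 5/6 ~ 83.33
print(round(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat") * 100, 2))
```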

About TReB

TReB is a comprehensive table reasoning evaluation benchmark that measures both shallow table understanding and deep table reasoning abilities. We construct a high-quality dataset to evaluate five core skills of LLMs: Table Understanding (TU), Table Basic Operation (TBO), Table Computational Operation (TCO), Data Analysis (DA), and Advanced Data Analysis (ADA), organized into a total of 20 sub-tasks. The evaluation framework supports three distinct inference modes, TCoT, PoT, and ICoT, enabling a more robust assessment of reasoning.
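As a rough illustration of how a leaderboard row is aggregated (this is a sketch, not the official TReB evaluation code), each skill score is averaged over the three inference modes, and the Avg. column is the mean of the five skill scores. The input numbers below are made up purely to show the shape of the data.

```python
# Sketch of leaderboard aggregation: per-skill scores averaged over the three
# inference modes (TCoT, PoT, ICoT), then averaged across the five skills.
# Not the official TReB code; the example numbers are illustrative only.

SKILLS = ["TU", "TBO", "TCO", "DA", "ADA"]
MODES = ["TCoT", "PoT", "ICoT"]

def aggregate(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """scores[skill][mode] -> per-skill means plus the overall Avg."""
    row = {s: sum(scores[s][m] for m in MODES) / len(MODES) for s in SKILLS}
    row["Avg."] = sum(row[s] for s in SKILLS) / len(SKILLS)
    return row

example = {s: {"TCoT": 70.0, "PoT": 65.0, "ICoT": 60.0} for s in SKILLS}
print(aggregate(example))  # every skill -> 65.0, Avg. -> 65.0
```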

News

  • Jun. 18, 2025: 🔥🔥 The benchmark paper, code, and dataset are all released! Please check them out and submit your results to the leaderboard. Thank you for all the feedback!


Challenges from TReB

(Figure: overview of the TReB framework.)

Submission

🤗🤗 We warmly welcome submissions to our leaderboard, including results from your own methods as well as the latest model releases! TReB maintains two separate leaderboards. Please refer to the Submission Guidelines below for details, and submit your results as instructed to jttreb2025@gmail.com.

Citation

@misc{li2025trebcomprehensivebenchmarkevaluating,
  title={TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models},
  author={Ce Li and Xiaofan Liu and Zhiyan Song and Ce Chi and Chen Zhao and Jingjing Yang and Zhendong Wang and Kexin Yang and Boshen Shi and Xing Wang and Chao Deng and Junlan Feng},
  year={2025},
  eprint={2506.18421},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.18421},
}
⚙ This website is an improved version based on the original source code from Bird-Bench!