
TReB

A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of LLMs
Leaderboard (LLM-as-a-judge)
The results are generated through the native TReB evaluation pipeline, taking the highest score among the three inference modes: TCoT, PoT, and ICoT; a code sketch of this aggregation follows the table. (We also encourage submitters to use custom workflows or inference modes.)
| Date | Model | Organization | Table Understanding (TU) | Table Basic Operation (TBO) | Table Computational Operation (TCO) | Data Analysis (DA) | Advanced Data Analysis (ADA) | Overall Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Oct 28, 2025 | JT-DA | JIUTIAN Research | 80.704 | 88.88 | 83.335 | 83.335 | 53.05 | 78.64 |
| Feb 3, 2026 | doubao-1.5-pro | ByteDance | 79.996 | 87.12 | 79.9 | 90.5 | 54.818 | 78.47 |
| Sep 1, 2025 | QwQ-32B | Alibaba | 83.22 | 81.98 | 83.275 | 83.275 | 58.595 | 78.15 |
| Sep 1, 2025 | Qwen3-32B | Alibaba | 81.084 | 83.06 | 82.78 | 84.415 | 57.003 | 77.67 |
| Feb 3, 2026 | gpt-4o | OpenAI | 81.09 | 80.21 | 81.145 | 89.288 | 56.22 | 77.59 |
| Sep 1, 2025 | Qwen2.5-72B-Instruct | Alibaba | 80.462 | 80.17 | 80.225 | 85.093 | 53.555 | 75.9 |
| Sep 1, 2025 | Qwen3-14B | Alibaba | 81.506 | 84.6 | 82.17 | 76.405 | 52.225 | 75.38 |
| Feb 3, 2026 | gpt-4o-mini | OpenAI | 75.496 | 76.88 | 78.91 | 84.85 | 49.062 | 73.04 |
| Sep 1, 2025 | DeepSeek-R1-Distill-Qwen-32B | DeepSeek | 76.914 | 77.85 | 77.41 | 81.545 | 49.948 | 72.73 |
| Sep 1, 2025 | Llama-3.1-70B-Instruct | Meta | 68.012 | 78.9 | 77.595 | 81.06 | 56.223 | 72.36 |
| Sep 1, 2025 | Qwen3-8B | Alibaba | 79.052 | 80.03 | 78.465 | 71.32 | 49.098 | 71.59 |
| Sep 1, 2025 | TableGPT2-7B | Zhejiang University | 71.632 | 71.00 | 74.685 | 85.12 | 48.212 | 70.13 |
| Sep 1, 2025 | DeepSeek-R1-Distill-Qwen-14B | DeepSeek | 75.028 | 74.12 | 72.76 | 73.73 | 39.18 | 66.96 |
| Sep 1, 2025 | Table-R1-Zero-7B | Yale | 73.53 | 60.655 | 73.655 | 80.07 | 45.853 | 66.75 |
| Dec 2, 2025 | DeepSeek-R1-0528-Qwen3-8B | DeepSeek | 77.292 | 78.21 | 69.035 | 64.273 | 40.387 | 65.84 |
| Sep 1, 2025 | Qwen2.5-Coder-7B-Instruct | Alibaba | 67.612 | 73.99 | 68.02 | 80.025 | 38.832 | 65.7 |
| Sep 1, 2025 | Qwen2.5-7B-Instruct | Alibaba | 69.052 | 68.23 | 67.3 | 79.755 | 41.915 | 65.25 |
| Sep 1, 2025 | Seed-Coder-8B-Instruct | ByteDance | 64.698 | 71.16 | 66.17 | 82.085 | 41.582 | 65.14 |
| Sep 1, 2025 | Qwen2.5-Math-72B-Instruct | Alibaba | 74.416 | 66.24 | 69.24 | 75.573 | 38.897 | 64.87 |
| Sep 1, 2025 | Llama-3.1-8B-Instruct | Meta | 59.4 | 70.13 | 63.81 | 76.445 | 35.35 | 61.03 |
| Sep 1, 2025 | DeepSeek-R1-Distill-Qwen-7B | DeepSeek | 60.72 | 58.58 | 53.855 | 71.463 | 29.313 | 54.79 |
| Sep 1, 2025 | DeepSeek-R1-Distill-Llama-8B | DeepSeek | 62.689 | 56.52 | 47.8 | 50.773 | 19.97 | 47.55 |
| Sep 1, 2025 | Yi-Coder-9B-Chat | 01-AI | 43.97 | 57.05 | 47.01 | 56.095 | 33.172 | 47.46 |
| Sep 1, 2025 | Table-R1-SFT-7B | Yale | 60.404 | 23.46 | 24.1 | 43.01 | 37.603 | 37.72 |
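
For concreteness, here is a minimal sketch of the best-of-modes aggregation described above, assuming the maximum is taken per skill. The per-mode numbers are invented for illustration, and the official overall score is presumably aggregated over the 20 sub-tasks rather than being a plain mean of the five skill scores:

```python
# Hedged sketch of the leaderboard aggregation: each skill reports the best
# LLM-as-a-judge score among the TCoT, PoT, and ICoT inference modes.
# The per-mode numbers below are invented for illustration.

SKILLS = ["TU", "TBO", "TCO", "DA", "ADA"]

# scores[mode][skill] -> judge score for that mode and skill
scores = {
    "TCoT": {"TU": 79.1, "TBO": 83.0, "TCO": 80.2, "DA": 82.4, "ADA": 50.1},
    "PoT":  {"TU": 77.5, "TBO": 85.2, "TCO": 83.3, "DA": 84.0, "ADA": 53.0},
    "ICoT": {"TU": 80.7, "TBO": 84.1, "TCO": 82.0, "DA": 83.3, "ADA": 52.2},
}

# Best-of-three-modes per skill, per the leaderboard note above.
best = {skill: max(scores[mode][skill] for mode in scores) for skill in SKILLS}

# CAUTION: the official overall score is presumably computed over the 20
# sub-tasks; the plain mean of skill scores below is only an approximation.
overall = sum(best.values()) / len(best)

print(best)
print(f"approx. overall: {overall:.2f}")
```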

About TReB

TReB is a comprehensive table reasoning evaluation benchmark that measures both shallow table understanding and deep table reasoning abilities. We construct a high-quality dataset to evaluate five core skills of LLMs: Table Understanding (TU), Table Basic Operation (TBO), Table Computational Operation (TCO), Data Analysis (DA), and Advanced Data Analysis (ADA), covering 20 sub-tasks in total. The evaluation framework supports three distinct inference modes, Textual Chain-of-Thought (TCoT), Program-of-Thought (PoT), and Interleaved Chain-of-Thought (ICoT), encouraging more robust reasoning.
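
To make the three modes concrete, the sketch below shows how they differ in practice. The prompt wording, function names, and control flow are illustrative assumptions, not TReB's actual implementation:

```python
# Hedged sketch of the three TReB inference modes. Prompt wording, function
# names, and control flow here are illustrative assumptions, not the
# benchmark's actual code.

def tcot_prompt(table: str, question: str) -> str:
    """TCoT: the model reasons step by step in natural language only."""
    return (f"{table}\n\nQuestion: {question}\n"
            "Reason step by step in text, then state the final answer.")

def pot_prompt(table: str, question: str) -> str:
    """PoT: the model writes a program whose execution yields the answer."""
    return (f"{table}\n\nQuestion: {question}\n"
            "Write Python code that computes the answer; the code's output "
            "is taken as the final answer.")

def icot_prompt(table: str, question: str) -> str:
    """ICoT: the model interleaves textual reasoning with code execution."""
    return (f"{table}\n\nQuestion: {question}\n"
            "Alternate between short textual reasoning steps and executable "
            "code cells, using each execution result to plan the next step.")

MODES = {"TCoT": tcot_prompt, "PoT": pot_prompt, "ICoT": icot_prompt}

if __name__ == "__main__":
    table = "city | population\nA | 10\nB | 20"
    question = "Which city has the larger population?"
    for name, build in MODES.items():
        print(f"--- {name} ---\n{build(table, question)}\n")
```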

News

  • Feb. 3, 2026:

    We have updated the evaluation results of some models on the leaderboard and removed the ROUGE-L metric from the display, retaining only the LLM-as-a-judge metric (which is comparatively fairer). A rough sketch of such a judge follows this list.

  • Jun. 18, 2025:

    🔥🔥 The benchmark paper, code, and dataset are all released! Please check them out and submit your results to the leaderboard!

    Thank you for all the feedback!
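
As a rough illustration of the LLM-as-a-judge metric mentioned above: the sketch below scores a model answer against a reference answer with a judge model. The prompt, the 0-100 rubric, and the `chat()` helper are hypothetical stand-ins; TReB's actual judging prompt and rubric live in the released code.

```python
# Hedged sketch of an LLM-as-a-judge scorer. The prompt, the 0-100 rubric,
# and the chat() helper are hypothetical stand-ins, not TReB's actual setup.

def chat(prompt: str) -> str:
    """Stand-in for a call to a judge LLM; replace with a real API call."""
    return "87"  # canned reply so the sketch runs end to end

JUDGE_PROMPT = """You are grading a table-reasoning answer.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Reply with a single integer score from 0 to 100."""

def judge(question: str, reference: str, candidate: str) -> int:
    reply = chat(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    return int(reply.strip())  # assumes the judge follows the output format

print(judge("Which city has the larger population?", "B", "City B (pop. 20)"))
```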


Challenges from TReB

(Figure: overview of the TReB framework.)

Submission

πŸ€—πŸ€— We warmly welcome submissions to our leaderboard, including both your own methods and contributions showcasing the latest model performance! TReB features two separate leaderboards. Please refer to the Submission Guidelines below for details, and submit your results as instructed to jttreb2025@gmail.com.

Citation

@misc{li2025trebcomprehensivebenchmarkevaluating,
  title={TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models},
  author={Ce Li and Xiaofan Liu and Zhiyan Song and Ce Chi and Chen Zhao and Jingjing Yang and Zhendong Wang and Kexin Yang and Boshen Shi and Xing Wang and Chao Deng and Junlan Feng},
  year={2025},
  eprint={2506.18421},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.18421},
}