- On this page, press Ctrl+F (Windows) or Command+F (Mac).
- Enter the keyword you want to search for.
- Open the paper from its link. (If you prefer the command line, a small script sketch follows this list.)
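For scripted lookups, the minimal sketch below filters the paper tables by keyword. It assumes the list is saved as a local markdown file; the `README.md` path in the usage comment is a placeholder, not a file confirmed by this page.

```python
# Minimal sketch: grep the markdown paper tables for a keyword.
# Assumes this list is stored locally as a markdown file (path is a placeholder).
import sys

def search_papers(path: str, keyword: str) -> None:
    """Print every markdown table row whose text contains the keyword."""
    needle = keyword.lower()
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Table rows start with '|'; skip the '---' header-separator rows.
            if line.startswith("|") and "---" not in line and needle in line.lower():
                print(line.rstrip())

if __name__ == "__main__":
    # Usage (hypothetical file name): python search_papers.py README.md contamination
    search_papers(sys.argv[1], sys.argv[2])
```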
| Author | Title | Proceeding | Link |
|---|---|---|---|
| Jiatong Li, et al. | PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations | NeurIPS 2024 | https://arxiv.org/abs/2405.19740 |
| Jingnan Zheng, et al. | ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation | NeurIPS 2024 | https://arxiv.org/abs/2405.14125 |
| Jinhao Duan, et al. | GTBench: Uncovering the Strategic Reasoning Capabilities of LLMs via Game-Theoretic Evaluations | NeurIPS 2024 | https://arxiv.org/abs/2402.12348 |
| Felipe Maia Polo, et al. | Efficient multi-prompt evaluation of LLMs | NeurIPS 2024 | https://arxiv.org/abs/2405.17202 |
| Fan Lin, et al. | IDGen: Item Discrimination Induced Prompt Generation for LLM Evaluation | NeurIPS 2024 | https://arxiv.org/abs/2409.18892 |
| Jinjie Ni, et al. | MixEval: Fast and Dynamic Human Preference Approximation with LLM Benchmark Mixtures | NeurIPS 2024 | https://nips.cc/virtual/2024/poster/96545 |
| Percy Liang, et al. | Holistic Evaluation of Language Models | TMLR | https://arxiv.org/abs/2211.09110 |
| Felipe Maia Polo, et al. | tinyBenchmarks: evaluating LLMs with fewer examples | ICML 2024 | https://openreview.net/forum?id=qAml3FpfhG |
| Miltiadis Allamanis, et al. | Unsupervised Evaluation of Code LLMs with Round-Trip Correctness | ICML 2024 | https://icml.cc/virtual/2024/poster/33761 |
| Wei-Lin Chiang, et al. | Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference | ICML 2024 | https://arxiv.org/abs/2403.04132 |
| Yonatan Oren, et al. | Proving Test Set Contamination in Black-Box Language Models | ICLR 2024 | https://arxiv.org/abs/2310.17623 |
| Kaijie Zhu, et al. | DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks | ICLR 2024 | https://arxiv.org/abs/2309.17167 |
| Seonghyeon Ye, et al. | FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets | ICLR 2024 | https://openreview.net/forum?id=CYmF38ysDa |
| Shahriar Golchin, et al. | Time Travel in LLMs: Tracing Data Contamination in Large Language Models | ICLR 2024 | https://openreview.net/forum?id=2Rwq6c3tvr |
| Gati Aher, et al. | Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies | ICML 2023 | https://proceedings.mlr.press/v202/aher23a/aher23a.pdf |

| Author | Title | Proceeding | Link |
|---|---|---|---|
| Dan Hendrycks, et al. | Measuring Massive Multitask Language Understanding | ICLR 2021 | https://arxiv.org/abs/2009.03300 |
| Yuzhen Huang, et al. | C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models | NeurIPS 2023 | https://arxiv.org/abs/2305.08322 |
| Zhexin Zhang, et al. | SafetyBench: Evaluating the Safety of Large Language Models | ACL 2024 | https://aclanthology.org/2024.acl-long.830/ |
| Haoran Li, et al. | PrivLM-Bench: A Multi-level Privacy Evaluation Benchmark for Language Models | ACL 2024 | https://aclanthology.org/2024.acl-long.4/ |

| Author | Title | Proceeding | Link |
|---|---|---|---|
| Yupeng Chang, et al. | A Survey on Evaluation of Large Language Models | TIST | https://dl.acm.org/doi/full/10.1145/3641289 |
| Zishan Guo, et al. | Evaluating Large Language Models: A Comprehensive Survey | Preprint (arXiv) | https://arxiv.org/abs/2310.19736 |
| Ziyu Zhuang, et al. | Through the Lens of Core Competency: Survey on Evaluation of Large Language Models | CCL 2023 | https://aclanthology.org/2023.ccl-2.8/ |
| Isabel O. Gallegos, et al. | Bias and Fairness in Large Language Models: A Survey | CL 2024 | https://aclanthology.org/2024.cl-3.8/ |