- On this page, press Ctrl+F (Windows) or Command+F (Mac).
- Enter the keyword you want to search for.
- Open the paper from its link. (If you prefer the command line, a small script sketch follows this list.)
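For scripted lookups, the minimal sketch below filters the paper tables by keyword. It assumes the list is saved as a local markdown file; the `README.md` path in the usage comment is a placeholder, not a file confirmed by this page.

```python
# Minimal sketch: grep the markdown paper tables for a keyword.
# Assumes this list is stored locally as a markdown file (path is a placeholder).
import sys

def search_papers(path: str, keyword: str) -> None:
    """Print every markdown table row whose text contains the keyword."""
    needle = keyword.lower()
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Table rows start with '|'; skip the '---' header-separator rows.
            if line.startswith("|") and "---" not in line and needle in line.lower():
                print(line.rstrip())

if __name__ == "__main__":
    # Usage (hypothetical file name): python search_papers.py README.md contamination
    search_papers(sys.argv[1], sys.argv[2])
```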
| Author | Title | Proceeding | Link |
|---|---|---|---|
| Jiatong Li, et al. | PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations | NeurIPS 2024 | https://arxiv.org/abs/2405.19740 |
| Jingnan Zheng, et al. | ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation | NeurIPS 2024 | https://arxiv.org/abs/2405.14125 |
| Jinhao Duan, et al. | GTBench: Uncovering the Strategic Reasoning Capabilities of LLMs via Game-Theoretic Evaluations | NeurIPS 2024 | https://arxiv.org/abs/2402.12348 |
| Felipe Maia Polo, et al. | Efficient multi-prompt evaluation of LLMs | NeurIPS 2024 | https://arxiv.org/abs/2405.17202 |
| Fan Lin, et al. | IDGen: Item Discrimination Induced Prompt Generation for LLM Evaluation | NeurIPS 2024 | https://arxiv.org/abs/2409.18892 |
| Jinjie Ni, et al. | MixEval: Fast and Dynamic Human Preference Approximation with LLM Benchmark Mixtures | NeurIPS 2024 | https://nips.cc/virtual/2024/poster/96545 |
| Percy Liang, et al. | Holistic Evaluation of Language Models | TMLR | https://arxiv.org/abs/2211.09110 |
| Felipe Maia Polo, et al. | tinyBenchmarks: evaluating LLMs with fewer examples | ICML 2024 | https://openreview.net/forum?id=qAml3FpfhG |
| Miltiadis Allamanis, et al. | Unsupervised Evaluation of Code LLMs with Round-Trip Correctness | ICML 2024 | https://icml.cc/virtual/2024/poster/33761 |
| Wei-Lin Chiang, et al. | Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference | ICML 2024 | https://arxiv.org/abs/2403.04132 |
| Yonatan Oren, et al. | Proving Test Set Contamination in Black-Box Language Models | ICLR 2024 | https://arxiv.org/abs/2310.17623 |
| Kaijie Zhu, et al. | DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks | ICLR 2024 | https://arxiv.org/abs/2309.17167 |
| Seonghyeon Ye, et al. | FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets | ICLR 2024 | https://openreview.net/forum?id=CYmF38ysDa |
| Shahriar Golchin, et al. | Time Travel in LLMs: Tracing Data Contamination in Large Language Models | ICLR 2024 | https://openreview.net/forum?id=2Rwq6c3tvr |
| Gati Aher, et al. | Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies | ICML 2023 | https://proceedings.mlr.press/v202/aher23a/aher23a.pdf |

| Author | Title | Proceeding | Link |
|---|---|---|---|
| Dan Hendrycks, et al. | Measuring Massive Multitask Language Understanding | ICLR 2021 | https://arxiv.org/abs/2009.03300 |
| Yuzhen Huang, et al. | C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models | NeurIPS 2023 | https://arxiv.org/abs/2305.08322 |
| Zhexin Zhang, et al. | SafetyBench: Evaluating the Safety of Large Language Models | ACL 2024 | https://aclanthology.org/2024.acl-long.830/ |
| Haoran Li, et al. | PrivLM-Bench: A Multi-level Privacy Evaluation Benchmark for Language Models | ACL 2024 | https://aclanthology.org/2024.acl-long.4/ |

| Author | Title | Proceeding | Link |
|---|---|---|---|
| Yupeng Chang, et al. | A Survey on Evaluation of Large Language Models | TIST | https://dl.acm.org/doi/full/10.1145/3641289 |
| Zishan Guo, et al. | Evaluating Large Language Models: A Comprehensive Survey | Preprint (arXiv) | https://arxiv.org/abs/2310.19736 |
| Ziyu Zhuang, et al. | Through the Lens of Core Competency: Survey on Evaluation of Large Language Models | CCL 2023 | https://aclanthology.org/2023.ccl-2.8/ |
| Isabel O. Gallegos, et al. | Bias and Fairness in Large Language Models: A Survey | CL 2024 | https://aclanthology.org/2024.cl-3.8/ |