Hi! I've been using EleutherAI's LM Evaluation Harness, and I'd like to also run some code tasks with your Big Code Evaluation Harness. We need the score for each sample in each benchmark, and the LM Evaluation Harness has a helpful `log_samples` flag that enables logging the per-sample scores.
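
For context, this is roughly how we invoke the LM Evaluation Harness today (the model and task names here are just placeholders):

```bash
lm_eval --model hf \
    --model_args pretrained=my-org/my-model \
    --tasks hellaswag \
    --output_path results/ \
    --log_samples
```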
As best I can tell (and please correct me if I'm wrong), the Big Code Evaluation Harness doesn't have an equivalent flag. If my understanding is correct, could one please be added?