Hi! I've been using EleutherAI's LM Evaluation Harness, and I'd like to also run some code tasks with your Big Code Evaluation Harness. We need the score for each sample in each benchmark, and the LM Evaluation Harness has a helpful `log_samples` flag that enables logging the per-sample scores.
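
For context, this is roughly how we invoke the LM Evaluation Harness today (the model and task names here are just placeholders):

```bash
lm_eval --model hf \
    --model_args pretrained=my-org/my-model \
    --tasks hellaswag \
    --output_path results/ \
    --log_samples
```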
As best I can tell (and please correct me if I'm wrong), the Big Code Evaluation Harness doesn't have an equivalent flag. If my understanding is correct, could one please be added?