More stable MBPP evaluation #2111
Merged
Zhudongsheng75 merged 4 commits into open-compass:main on Jun 23, 2025
Collaborator: Thanks for your contribution. Please help review this PR. @Zhudongsheng75
Collaborator: Please fix the lint issue.
Contributor (Author): Hi @tonysy, I cannot make pre-commit work. The lint issue should be resolved.
Collaborator: Thanks for the contribution. This PR may incur performance BC. We will test it and then merge it in the next version.
Collaborator: Please check the evaluation performance before/after this PR. cc @Zhudongsheng75
Collaborator: Also please pay attention to this PR @zhulinJulia24 @MaiziXiao. Thanks.
zyc140345 pushed a commit to zyc140345/opencompass that referenced this pull request on Oct 23, 2025:
* timed re.search and _executor made global
* TimeOutError exception handling
* added missing blank lines
* isort import
Co-authored-by: Francesco Bertolotti
iamkaia pushed a commit with the same message to iamkaia/opencompass that referenced this pull request on Feb 4, 2026.
This PR improves the stability of the evaluation process for the MBPP dataset.
Motivation
There were two key issues affecting the robustness and reliability of the evaluation:
Regex performance:
The regex pattern [r'(.*)\s*```.*'](https://github.com/open-compass/opencompass/blob/aa2b89b6f8b7c5448e47ed1aa3f12b04da1ff123/opencompass/datasets/mbpp.py#L327) occasionally hangs when processing predictions containing excessive trailing whitespace. This is likely due to catastrophic backtracking. To mitigate this, I replaced the standard `re` module with the `regex` module, which supports timeouts, and set a hard timeout of 10 seconds.
Multiprocessing issue:
The following error was encountered:
`AttributeError: Can't pickle local object 'execution.<locals>._execution'`
This occurs because `ProcessPoolExecutor` must pickle the functions it sends to worker processes, and local (non-global) functions cannot be pickled. The fix is to move the `_execution` function to the global scope so it can be properly serialized.
Modifications
- Replaced `re` with `regex` and added a 10-second timeout to [regex.search](https://github.com/open-compass/opencompass/blob/aa2b89b6f8b7c5448e47ed1aa3f12b04da1ff123/opencompass/datasets/mbpp.py#L333).
- Moved the `_execution` function [from here](https://github.com/open-compass/opencompass/blob/aa2b89b6f8b7c5448e47ed1aa3f12b04da1ff123/opencompass/datasets/mbpp.py#L403C1-L418C33) to the global scope to resolve the pickling issue.
Backward Compatibility
This change is not breaking. However, note that introducing a regex timeout may cause differences in evaluation results for cases where the timeout is triggered.
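As a rough illustration of the timeout fix described under Motivation, the sketch below wraps a search in a helper that gives up after a fixed number of seconds. The helper name `search_with_timeout`, the `DOTALL` flag, and the fallback to `re` when `regex` is not installed are assumptions made for this example, not the code merged in this PR. It relies on the `timeout` argument that the third-party `regex` package accepts on its matching functions, which raises `TimeoutError` when the limit is exceeded.

```python
import re

try:
    import regex  # third-party package; its matching functions accept timeout=
except ImportError:
    regex = None  # fallback so the sketch still runs, but without any timeout

# Build the fence marker programmatically to keep this example easy to embed.
FENCE = "`" * 3
TRAILING_FENCE = r"(.*)\s*" + FENCE + r".*"


def search_with_timeout(pattern, text, timeout_s=10.0):
    """Return a match object, or None if nothing matched or the search timed out.

    Hypothetical helper: the name and the DOTALL flag are illustrative choices.
    """
    if regex is None:
        # Plain re has no timeout support; this path can still hang on bad input.
        return re.search(pattern, text, re.DOTALL)
    try:
        return regex.search(pattern, text, regex.DOTALL, timeout=timeout_s)
    except TimeoutError:
        # Treat a timed-out search like a non-match.
        return None


prediction = "def f():\n    return 1\n" + FENCE + "\nextra text"
match = search_with_timeout(TRAILING_FENCE, prediction)
print(match.group(1) if match else "no match or timed out")
```

Returning `None` on timeout treats a timed-out prediction like a non-match, which is one way evaluation results could differ in the cases mentioned above.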