-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed improvement for gp_levenshtein(). #1408
Conversation
…improvements On longer strings (~500char) splitting these into an array before processing it character-by-character results in a 2x speed increase. Takes the average run in my testing from an average of 705ms to 341ms on my real-world random data set.
…at's not required. Might be slightly faster.
… loop to a foreach, which again is slightly faster. 341ms => 310ms
Just noting that there's limited unit tests for this code branch currently: https://github.com/GlotPress/GlotPress/blob/develop/tests/phpunit/testcases/test_strings.php No logical changes were made here though, and before/after return values appear to be the same still, so I haven't dug into adding proper unit tests, mostly due to personal time constraints. |
…'s function declaration.
Co-authored-by: Dion Hulse <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
gp_levenshtein():
mb_substr()
for significant speed improvementsOn longer strings (~500char) splitting these into an array before processing it character-by-character results in a 2x speed increase.
On my random WordPress.org page dataset, this results in a speed improvement of ~700ms ~= 310ms per call of
gp_string_similarity()
, which is basically justgp_levenshtein()
.Non-scientific measurement though, re-running a script with ~200 calls to gp_string_similarity() multiple times with changing the code back and forth.
Some of these changes are not strictly likely to be faster in all scenario's, but at least here seems that
foreach
is more efficient thanfor
with an additional assignment here.This was found by a
.po
import taking 100% CPU for several minutes, caused by a project having a significant number of obsoleted originals, most of which were long strings, so the C implementation wasn't being used.Note: I also went looking for more optimized variants of this algorithm, there's probably something more that can be done by removing
gp_string_similarity()
's percentage calculation and passing through the raw distance to future calls to abort early in the string if it's a long-way-away from the source string and a closer match has already been found.The algorithm in use here is this: https://en.wikipedia.org/wiki/Levenshtein_distance#Iterative_with_two_matrix_rows