Make the glossary regex more deterministic #1801
Merged
Problem
@amieiro noticed that with the Spanish glossary, a term that exists in the glossary was not matched. Portuguese, on the other hand, which has the same competing terms, got it right.

The problem arises from the fact that we now group the regex by suffixes. For each English term, the suffix is determined and the term is sorted into the matching bin; the regex is then built from those bins.
"Add-on" and "Add" fall into different bins (the first with just an "s" suffix, the second with the possible suffixes "s", "ed", or "ing"). In Spanish the second bin is created first because, in the sequence of glossary terms, the word "troubleshooting", which starts that bin, is processed before "customization", which creates the "s" bin. As a result, "add" is tried before "add-on" and matches the "Add" part of "Add-on". In Portuguese it's reversed, more or less by chance, because "customization" is processed before "downgrading".
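The order dependence described above can be reproduced with a small, self-contained sketch. It is written in Python for illustration only and is not the project's actual code: the function `build_regex`, the `sort_bins` flag, the stem/suffix pairs, and the exact regex layout are all assumptions. It only shows that, when bins are emitted in creation order, the position of "troubleshoot" relative to "customization" decides whether "add" or "add-on" wins.

```python
import re

def build_regex(terms, sort_bins=False):
    """Group (stem, suffixes) pairs into bins keyed by their suffix set
    and build one alternation from the bins (hypothetical helper)."""
    bins = {}
    for stem, suffixes in terms:
        key = "|".join(suffixes)
        # dicts keep insertion order, so a bin's position depends on
        # which term happened to create it first
        bins.setdefault(key, []).append(re.escape(stem))
    keys = sorted(bins) if sort_bins else list(bins)
    parts = ["(?:%s)(?:%s)?" % ("|".join(bins[k]), k) for k in keys]
    return re.compile(r"\b(?:" + "|".join(parts) + r")\b", re.IGNORECASE)

FULL = ("s", "ed", "ing")
# Assumed processing order for the Spanish glossary
spanish = [("troubleshoot", FULL), ("customization", ("s",)),
           ("add", FULL), ("add-on", ("s",))]
# Assumed processing order for the Portuguese glossary
portuguese = [("customization", ("s",)), ("downgrad", FULL),
              ("add", FULL), ("add-on", ("s",))]

text = "Install the Add-on"
spanish_match = build_regex(spanish).search(text).group(0)
portuguese_match = build_regex(portuguese).search(text).group(0)
sorted_match = build_regex(spanish, sort_bins=True).search(text).group(0)
```

In this sketch the Spanish order matches "Add", while the Portuguese order and the sorted variant match "Add-on". Sorting the bin keys only makes the outcome independent of the glossary order; it does not by itself guarantee that the longer term is always preferred.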
Solution
The proposed solution is to make the regex more deterministic so that there are no differences between languages. In this case it fixes the problem, but there could be other occurrences where a krsort would make it work. We need to
Testing Instructions
Create a language with a glossary that contains the words "troubleshooting", "customization", "add", and "add-on", and an original that contains the word "Add-on". Before this PR, only "add" is matched.