Support equivalent words in license detection #4190 #4215
Merged
AyanSinhaMahapatra merged 28 commits into develop on Apr 22, 2025
Handle similar words in license detection by allowing multiple "legalese words" to have the same token id.
Regenerate the token ids accordingly.
Convert Index.tokens_by_tid to a computed property, available on demand. Convert tokens_by_tid from a list to a dictionary. Ensure that all code relying on tokens_by_tid is updated as needed. All such locations were used only for testing and debugging.
Deprecate all rules that are duplicated under this new regime, where tokens like "license" and "licence" are now treated as identical.
Update the test suite to test the detection of all deprecated licenses and rules as a sanity check. A deprecated rule with "relevance" set to 0 is not tested, as some rules are deprecated because they are false positives and should no longer be detected. Also improve the validation and loading of rule relevance, including the case of zero relevance.
Update ambiguous or conflicting rules as needed. In particular, ensure that all rules in the style of "MIT or GPL" without a GPL version are now reported consistently as "mit or gpl-1.0-plus".
Add new rules as needed to resolve failing tests and improve accuracy.
Improve deprecation support for rules and licenses by adding a new "replaced_by" list attribute that lists the new expressions that must be detected when scanning the deprecated license or rule text.
Reference: #4190
Signed-off-by: Philippe Ombredanne <[email protected]>
Use correct function names
Remove duplicated license
Remove duplicated rules
Update failed merges
Adjust and rename rules as needed
Signed-off-by: Philippe Ombredanne <[email protected]>
1394a76 to 1b508c8
pombredanne
commented
Apr 14, 2025
pombredanne (Member, Author) left a comment
@AyanSinhaMahapatra here are the notes from our review
This was referenced Apr 14, 2025
Use a JSON assertion on full scan results
Signed-off-by: Philippe Ombredanne <[email protected]>
Reporting a full stack trace and reraising an exception is helpful in debug mode. Signed-off-by: Philippe Ombredanne <[email protected]>
This was making alpine test fail massively Signed-off-by: Philippe Ombredanne <[email protected]>
Provide details on each step of the Alpine expression cleanups Signed-off-by: Philippe Ombredanne <[email protected]>
There are upcoming PRs in develop that would use the same rule file names. Signed-off-by: Philippe Ombredanne <[email protected]>
Update rule to adopt the "replaced_by" attribute
Update tests after renaming MIT license rule files
Signed-off-by: Philippe Ombredanne <[email protected]>
Member, Author
@AyanSinhaMahapatra all feedback is addressed. Follow-up PRs are pending and will be submitted as soon as this is merged.
AyanSinhaMahapatra (Member) approved these changes on Apr 22, 2025 and left a comment
Thanks++ @pombredanne, LGTM!
See the review comments for a few small issues; these will be handled separately.
@@ -76,7 +76,7 @@ def add_sequence(automaton, tids, rid, start=0, with_duplicates=False):
MATCH_AHO_EXACT = '2-aho'
Member
It does not make sense to change MATCH_AHO_EXACT_ORDER from 2 to 1 while keeping the MATCH_AHO_EXACT value as '2-aho'; this is quite confusing. We should either rename the MATCH_AHO_EXACT value to '1-aho' or remove the numbers from these values entirely.
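The second option in this comment can be sketched as follows. This is only an illustration, not the ScanCode code: the number-free label 'aho', the helper matcher_sort_key and the second matcher 'other' are hypothetical names made up for this example. Only MATCH_AHO_EXACT and MATCH_AHO_EXACT_ORDER come from the comment.

```python
# Hypothetical sketch: the matcher label carries no number, and the ordering
# lives in a separate constant, so the two can never disagree.
MATCH_AHO_EXACT = 'aho'      # hypothetical number-free label
MATCH_AHO_EXACT_ORDER = 1    # order value mentioned in the review comment


def matcher_sort_key(matcher, order):
    """Return a sort key that uses the numeric order first, then the label."""
    return (order, matcher)


# 'other' is a made-up second matcher, present only to demonstrate sorting.
matchers = [('other', 2), (MATCH_AHO_EXACT, MATCH_AHO_EXACT_ORDER)]
sorted_matchers = sorted(matchers, key=lambda m: matcher_sort_key(*m))
```

With this layout, changing the order constant does not require renaming the label.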
@@ -1,2 +1,2 @@
license_expressions:
- odc-by-1.0
- ppl
Member
This is potentially a regression. Maybe this rule was not added correctly, @pombredanne?
sschuberth added a commit to oss-review-toolkit/ort that referenced this pull request on Jun 27, 2025
See [1]. Adjust a license score which now has higher confidence (probably due to [2]).
[1]: https://github.com/aboutcode-org/scancode-toolkit/releases/tag/v32.4.0
[2]: aboutcode-org/scancode-toolkit#4215
Signed-off-by: Sebastian Schuberth <[email protected]>
This PR improves license detection with multiple new features, driven by treating the alternative spellings "license" and "licence" as equivalent.
Handle similar words in license detection by allowing multiple "legalese words" to have the same token id.
Tag interesting similar words with the same token id, including license/licence and more.
Regenerate the token ids accordingly.
Convert Index.tokens_by_tid to a computed property, available on demand. Convert tokens_by_tid from a list to a dictionary. Ensure that all code relying on tokens_by_tid is updated as needed. All such locations were used only for testing and debugging.
Deprecate all rules that are duplicated under this new regime, where tokens like "license" and "licence" are now treated as identical.
Update the test suite to test the detection of all deprecated licenses and rules as a sanity check. A deprecated rule with "relevance" set to 0 is not tested, as some rules are deprecated because they are false positives and should no longer be detected. Also improve the validation and loading of rule relevance, including the case of zero relevance.
Update ambiguous or conflicting rules as needed. In particular, ensure that all rules in the style of "MIT or GPL" without a GPL version are now reported consistently as "mit or gpl-1.0-plus".
Add new rules as needed to resolve failing tests and improve accuracy.
Improve deprecation support for rules and licenses by adding a new "replaced_by" list attribute that lists the new expressions that must be detected when scanning the deprecated license or rule text.
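The shared token id idea and the tokens_by_tid change can be pictured with a minimal sketch. It is a toy model, not the ScanCode implementation: EQUIVALENT_WORDS, build_dictionary and this small Index class are hypothetical, and using functools.cached_property is just one assumed way to get a property computed on demand. It also shows why a dictionary is convenient here: once several words share an id, one id maps to several words.

```python
from functools import cached_property

# Hypothetical groups of equivalent "legalese" words. Only license/licence
# is named in the PR description.
EQUIVALENT_WORDS = [
    ('license', 'licence'),
]


def build_dictionary(words, equivalents=EQUIVALENT_WORDS):
    """Return a word -> token id mapping where equivalent words share one id."""
    # Map every word of a group to the first word of that group.
    canonical = {}
    for group in equivalents:
        for word in group:
            canonical[word] = group[0]

    dictionary = {}
    ids_by_canonical = {}
    for word in words:
        key = canonical.get(word, word)
        if key not in ids_by_canonical:
            ids_by_canonical[key] = len(ids_by_canonical)
        dictionary[word] = ids_by_canonical[key]
    return dictionary


class Index:
    """Toy index for illustration; not the real ScanCode index class."""

    def __init__(self, words):
        self.dictionary = build_dictionary(words)

    @cached_property
    def tokens_by_tid(self):
        # Computed only when first accessed. Several words can share a token
        # id, so each id maps to a list of words.
        by_tid = {}
        for word, tid in self.dictionary.items():
            by_tid.setdefault(tid, []).append(word)
        return by_tid


idx = Index(['license', 'licence', 'copyright'])
```

In this sketch idx.dictionary gives "license" and "licence" the same id, and idx.tokens_by_tid is built lazily the first time a test or debugging helper asks for it.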
Reference: #4190
Fixes: #4190
Tasks
Run tests locally to check for errors.
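The sanity check on deprecated rules described in this PR can be sketched as below. This is an assumed shape only: the Rule dataclass, should_test_deprecated, check_deprecated_rule, the fake_detect placeholder and the rule identifiers are all hypothetical, and only the "relevance" and "replaced_by" attribute names come from the description.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """Toy stand-in for a rule; only relevance and replaced_by are named in the PR."""
    identifier: str
    text: str
    relevance: int = 100
    is_deprecated: bool = False
    replaced_by: list = field(default_factory=list)


def should_test_deprecated(rule):
    """Deprecated rules with zero relevance are skipped: they may be false positives."""
    return rule.is_deprecated and rule.relevance != 0


def check_deprecated_rule(rule, detect):
    """Return True if scanning the rule text yields its replaced_by expressions.

    Return None when the rule is not subject to this check.
    `detect` is a hypothetical callable: text -> list of license expressions.
    """
    if not should_test_deprecated(rule):
        return None
    return detect(rule.text) == rule.replaced_by


def fake_detect(text):
    # Placeholder detector for this example only; not a real detection call.
    return ['mit or gpl-1.0-plus'] if 'MIT or GPL' in text else []


# Hypothetical rules used to exercise the check.
tested = Rule(
    identifier='example_mit_or_gpl.RULE',
    text='Licensed under MIT or GPL',
    is_deprecated=True,
    replaced_by=['mit or gpl-1.0-plus'],
)
skipped = Rule(
    identifier='example_false_positive.RULE',
    text='some false positive text',
    relevance=0,
    is_deprecated=True,
)
tested_result = check_deprecated_rule(tested, fake_detect)
skipped_result = check_deprecated_rule(skipped, fake_detect)
```

Returning None for skipped rules keeps "not tested" distinct from "tested and failed" in this sketch.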