Allow license rules to require the presence of certain defining keywords#2773
Allow license rules to require the presence of certain defining keywords#2773pombredanne merged 28 commits intoaboutcode-org:developfrom softsense:issue-2637-allow-license-rules-to-require-the-presence-of-certain-defining-keywords
Conversation
tests/licensedcode/data/datadriven/lic1/boto-boto3-cw-example-creating-alarm.txt.yml
Show resolved
Hide resolved
pombredanne
left a comment
There was a problem hiding this comment.
Thanks!
See a few comments in line.
tests/licensedcode/data/datadriven/lic1/boto-boto3-cw-example-creating-alarm.txt.yml
Show resolved
Hide resolved
pombredanne
left a comment
There was a problem hiding this comment.
This is overall awesome!
I'd like to check if the impact on indexing is not too bad... but this is not a critical process anymore.
| """ | ||
| Return a filtered list of kept LicenseMatch matches and a list of | ||
| discardable matches by removing all matches that do not contain all key | ||
| phrases required by the rule. |
There was a problem hiding this comment.
What if there is only one match and this does not contain the keyphrases?
Would you discard it, therefore leaving no matches?
There was a problem hiding this comment.
If the only match does not contain the key phrase wouldn't that indicate a high chance of being a false positive? If so, for my use-case I would rather have no results, but maybe that's not in-line with Scancode philosophy(e.g. not having false negatives)?
There was a problem hiding this comment.
If the only match does not contain the key phrase wouldn't that indicate a high chance of being a false positive?
yes in the general case, but let's take a practical example:
- in query I have
licensed under the AGPL 3.0and in rule1 I haveAGPL 3.0 - in rule2 I have
licensed under the {{AGPL 3.0 license}} - the query will initially match both rules
- the match to rule1 may be discarded as contained in the rule2 match
- the match to rule2 may be discarded as missing a key phrase
I am left with no matches yet licensed under the AGPL 3.0 was very clearly a license notice.
On the other hand if the match to rule2 is discarded as missing a key phrase early and before the containment filter, then we are at least left with the match to rule1?
There was a problem hiding this comment.
Okay, that makes sense. But unfortunately I was unable to recreate this scenario in a unit test.
Instead I experimented and moved the key phrase filter up (before the containment filter) and 5 of the datadriven tests I've added broke. Looking into that now. E.g. tests/licensedcode/test_detection_datadriven1.py::TestLicenseDataDriven1::test_detection_datadriven_lic1_alexa_skills_kit_sdk_for_java_txt now detects [apache-2.0, apache-2.0, unknown-license-reference] as opposed to only apache-2.0.
Also, are you suggesting to skip if there's only one rule like some of the other filters?
if len(matches) < 2:
return matches
pombredanne
left a comment
There was a problem hiding this comment.
I added some more comments for your review.
src/licensedcode/match.py
Outdated
| all_discarded.extend(discarded) | ||
| _log(matches, discarded, 'HIGH ENOUGH SCORE') | ||
|
|
||
| matches, discarded = filter_key_phrase_spans(matches) |
There was a problem hiding this comment.
I think this filter should be positioned much earlier in the process otherwise here is the risk:
- there was a smalller match that was contained
- it is filtered because of containment in a larger match
- the larger is discarded because of not having keyphrases
- there is no match left for this region and we now have a false negative
There was a problem hiding this comment.
I've moved the key phrase filter up in 2c377b3, to right before the first containment filter. I was unable to reproduce this scenario exactly, but some datadriven tests at least demonstrated the effectiveness (tests/licensedcode/test_detection_datadriven1.py::TestLicenseDataDriven1::test_detection_datadriven_lic1_alexa_skills_kit_sdk_for_java_txt)
tests/licensedcode/data/datadriven/lic1/alexa-skills-kit-sdk-for-java.txt.yml
Outdated
Show resolved
Hide resolved
A key phrase can be defined in a rule by surrounding one or more
words (tokens) with the `{{` and `}}` characters. Key phrases are
words that **must** be present in order to successfully match.
The key phrases are parsed for each rule during indexing. Then after
matching using the standard matching algorithm a filter removes all
matches that do not have the key phrases present.
---
At the moment sometimes the `ispan` alone can not be relied upon to determine
if key phrases are present.
At one case stopwords and unknown tokens made it appear as if the `ispan` was
more expansive than it should be. This has been solved by checking if the
`qspan` corresponding with the `ispan` intersect with any unknown tokens or
stopwords.
Another problem that is not solved is with the example boto3 example which
should match with `cc-by-nc-sa-4.0` but it now matching with several other
CC licenses. For some reason `Creative Commons Attribution 4.0 International
License` is considered to be matching in the ispan, and not intersecting with
any unknown tokens or stopwords (citation needed).
Signed-off-by: Mike Rombout <[email protected]>
License cc-by-sa-4.0 was detected as cc-by-4.0. Signed-off-by: Mike Rombout <[email protected]>
License mit was detected as gpl-2.0. Signed-off-by: Mike Rombout <[email protected]>
Non-license text is detected as bsd-simplified AND gpl-2.0. Signed-off-by: Mike Rombout <[email protected]>
Non-license text was detected as apache-2.0 AND cc-by-sa-4.0. Signed-off-by: Mike Rombout <[email protected]>
License mit was detected as ruby. Signed-off-by: Mike Rombout <[email protected]>
License mpl-2.0 was detected as apache-2.0. Signed-off-by: Mike Rombout <[email protected]>
License mit was detected as apache-2.0 OR epl-2.0. Signed-off-by: Mike Rombout <[email protected]>
License Bouncy Castyle License (mit) is detected as apache-2.0 OR gpl-2.0-plus WITH classpath-exception-2.0. Signed-off-by: Mike Rombout <[email protected]>
License erlangpl-1.1 was detected as mpl-1.1. Signed-off-by: Mike Rombout <[email protected]>
License mit was detected as cddl-1.0. Signed-off-by: Mike Rombout <[email protected]>
License apache-2.0 was detected as mit. Signed-off-by: Mike Rombout <[email protected]>
License apache-2.0 was detected as imagemagick. Signed-off-by: Mike Rombout <[email protected]>
Signed-off-by: Mike Rombout <[email protected]>
Co-authored-by: Philippe Ombredanne <[email protected]> Signed-off-by: Mike Rombout <[email protected]>
Signed-off-by: Mike Rombout <[email protected]>
Do not use defaultdict for Query.unknowns_by_pos and Query.stopwords_by_pos. Otherwise there are pernicious side effects to add new entries in these dctionaries when querying them after their creation. Reported-by: Mike Rombout @mrombout Signed-off-by: Philippe Ombredanne <[email protected]>
When a key phrase definition is started in a rule (with `{{`) but is
not closed (with `}}`) the rule is considered invalid and an error is
thrown.
Signed-off-by: Mike Rombout <[email protected]>
…_bootloader-exception Signed-off-by: Jiyeong Seok <[email protected]>
Signed-off-by: Mike Rombout <[email protected]>
Signed-off-by: Mike Rombout <[email protected]>
Signed-off-by: Mike Rombout <[email protected]>
Signed-off-by: Mike Rombout <[email protected]>
Signed-off-by: Mike Rombout <[email protected]>
These rules contained `{{` and `}}` characters, but
were never intended to be considered key phrases.
In order to avoid conflicts I've removed the syntax
from these rules.
Signed-off-by: Mike Rombout <[email protected]>
Signed-off-by: Mike Rombout <[email protected]>
This will prevent false-positives for the following scenario:
1. There are multiple matches, with a smalller match that id contained
2. It is filtered because of containment in the larger match
3. The larger is discarded because of not having keyphrases
4. There is no match left, we now have a false negative
As a practical example:
1. In query there is `licensed under the AGPL 3.0`
2. In rule1 I have `AGPL 3.0`
3. In rule2 I have `licensed under the {{AGPL 3.0 license}}`
3. The query will initially match both rules
4. The match to rule1 may be discarded as contained in the rule2 match
5. The match to rule2 may be discarded as missing a key phrase
By moving the key phrase filter up before the containment filter the
larger match may be filtered out, giving the smaller matches a chance
to stay.
Signed-off-by: Mike Rombout <[email protected]>
Signed-off-by: Mike Rombout <[email protected]>
pombredanne
left a comment
There was a problem hiding this comment.
LGTM and thank you ++ for this!
Fixes #2637
This PR depends on 97a1a57, contained in #2765. I have cherry-picked that commit into this PR as 2428723.
Tasks
Run tests locally to check for errors.