Provide duplicate meta keys as list instead of just the last seen value#74
Provide duplicate meta keys as list instead of just the last seen value#74metthal wants to merge 2 commits intoVirusTotal:masterfrom
Conversation
|
This is going to break some of our existing code at my employer. I can only guess we are not the only ones with this problem too. |
|
What would you suggest then? Make this behavior optional with some flag? We rely on this behavior. YARA language documentation does not specify whether meta keys need to be unique (at least I didn't find it) so we assumed that they don't have to be since YARA tool provides you with all the meta information, even duplicates. |
|
I didn’t realize the YARA tool provides duplicates. I’m probably still against this change, and just documenting it. This is really a python thing... |
|
Do your rulesets contain meta with duplicate keys? Because if you weren't even aware of it and weren't using duplicates, my change does not break anything in that case. You can notice that in the examples I provided. |
|
I have some code which assumes values are not a list. I can easily fix this, but others may be unaware of this change at first. |
|
Instead of introducing a backward-incompatible API change this could be implemented as a new function for the This way existing programs are not affected, but you have a way of getting all meta values with the same name. Thought? |
|
I like that idea. I'll implement it, I just need to get through all the stuff I am bogged down with right now. |
|
So I added This is code that uses import yara
r = yara.compile('/tmp/test.yar')
m = r.match(data='abc')
print(m[0].meta)
print(m[0].get_meta())
print(m[0].get_meta(allow_duplicates=True))And the output: |
62a5f26 to
55031f8
Compare
|
Bumping the PR since it somehow got completely forgotten. |
|
Bump. It would be great to have the bindings mirror the libyara API. |
| READONLY, | ||
| "Dictionary with offsets and strings that matched the file" | ||
| }, | ||
| { |
There was a problem hiding this comment.
There's no need for exporting this field to Python programs, right? If we already have the get_meta() function, we better prevent people from doing match._meta_with_dups.
| } | ||
| } | ||
| else | ||
| PyDict_SetItemString(meta_list_with_dups, meta->identifier, object); |
There was a problem hiding this comment.
I think that values in meta_list_with_dups should be always a list, even if the key has a single value. These way you don't need to take into account whether the value is a list or a single value.
In my previous description I said:
match.get_meta(allow_duplicates=True)could return a list in case of duplications.
But after a second thought I think I'm not sure about returning a list only in case of duplications. That shouldn't be a big issue given Python's dynamic nature, but it may be less error prone if allow_duplicates=True means that values are always a list even if it has a single item. Thoughts?
There was a problem hiding this comment.
Hey hey! I agree that this new function should always return a list, regardless of the number of items in the list. This makes the function's output type consistent so as user I don't need to add conditions to check the type of response.
I (as well as others) are interested in this PR, so I'd be happy to pick up where @metthal left off!
There was a problem hiding this comment.
Oh, sorry, I completely missed the fact that @plusvic reviewed my PR. I'm open to change the behavior in this PR.
There was a problem hiding this comment.
Out of curiosity, why would we need to maintain a flag to allow duplicate values? I suppose there's no harm to have it but just wondering.
If you just want a straight Dict[str, str] object, you could always just use the Match.meta attribute, which is legacy for a lot of potential users.
BUT if you want all the metadata regardless of duplication, a la Match.get_meta() (or Match.get_all_meta() to be concise 😉), then you can always expect to get Dict[str, List[str]] back even if there's one item per list.
Mind you I'm still reading through the changes to make sure I understand them properly😀
EDIT: I also agree that it might not be necessary to expose an attribute containing the full meta especially if it's given as function output which the user will call. See: #74 (comment)
|
Already implemented in #201 |
When I run regular
yaratool with-mflag to print meta information I can get all meta keys even if there are duplicates. Take this YARA file:Here is an example of how I can obtain all of them:
However, there is now way how to obtain all of them through Python API. I only get the value of the key which is mentioned as last in the file because ordinary
dictis used. My change alters this behavior and allows you to obtain all the values. Simply, if there are any duplicates, it starts putting the values in thelist.Before my change:
After my change:
Note: I am not that familiar with Python C API so I am not sure whether I put
Py_DECREFwherever it's necessary.