Skip to content

Provide duplicate meta keys as list instead of just the last seen value#74

Closed
metthal wants to merge 2 commits intoVirusTotal:masterfrom
metthal:duplicate-meta-keys
Closed

Provide duplicate meta keys as list instead of just the last seen value#74
metthal wants to merge 2 commits intoVirusTotal:masterfrom
metthal:duplicate-meta-keys

Conversation

@metthal
Copy link
Contributor

@metthal metthal commented Jan 23, 2018

When I run regular yara tool with -m flag to print meta information I can get all meta keys even if there are duplicates. Take this YARA file:

rule dummy {
	meta:
		key = "a"
		key = "b"
		key = "c"
		another_key = "d"
	condition:
		true
}

Here is an example of how I can obtain all of them:

$ yara -m /tmp/test.yar /tmp/test.yar 
dummy [key="a",key="b",key="c",another_key="d"] /tmp/test.yar

However, there is now way how to obtain all of them through Python API. I only get the value of the key which is mentioned as last in the file because ordinary dict is used. My change alters this behavior and allows you to obtain all the values. Simply, if there are any duplicates, it starts putting the values in the list.

Before my change:

$ python
>>> import yara
>>> r = yara.compile('/tmp/test.yar')
>>> m = r.match(data='abc')
>>> print(m[0].meta)
{'key': 'c', 'another_key': 'd'}

After my change:

$ python
>>> import yara
>>> r = yara.compile('/tmp/test.yar')
>>> m = r.match(data='abc')
>>> print(m[0].meta)
{'key': ['a', 'b', 'c'], 'another_key': 'd'}

Note: I am not that familiar with Python C API so I am not sure whether I put Py_DECREF wherever it's necessary.

@wxsBSD
Copy link
Contributor

wxsBSD commented Jan 23, 2018

This is going to break some of our existing code at my employer. I can only guess we are not the only ones with this problem too.

@metthal
Copy link
Contributor Author

metthal commented Jan 23, 2018

What would you suggest then? Make this behavior optional with some flag? We rely on this behavior. YARA language documentation does not specify whether meta keys need to be unique (at least I didn't find it) so we assumed that they don't have to be since YARA tool provides you with all the meta information, even duplicates.

@wxsBSD
Copy link
Contributor

wxsBSD commented Jan 23, 2018

I didn’t realize the YARA tool provides duplicates. I’m probably still against this change, and just documenting it. This is really a python thing...

@metthal
Copy link
Contributor Author

metthal commented Jan 23, 2018

Do your rulesets contain meta with duplicate keys? Because if you weren't even aware of it and weren't using duplicates, my change does not break anything in that case. You can notice that in the examples I provided. another_key is still just a string.

@wxsBSD
Copy link
Contributor

wxsBSD commented Jan 23, 2018

I have some code which assumes values are not a list. I can easily fix this, but others may be unaware of this change at first.

@plusvic
Copy link
Member

plusvic commented Jan 26, 2018

Instead of introducing a backward-incompatible API change this could be implemented as a new function for the Match object. The function could be named get_meta and it could receive a boolean argument allow_duplicates, that is false by default. So, match.meta would be equivalent to match.get_meta(), but match.get_meta(allow_duplicates=True) could return a list in case of duplications.

This way existing programs are not affected, but you have a way of getting all meta values with the same name. Thought?

@metthal
Copy link
Contributor Author

metthal commented Jan 30, 2018

I like that idea. I'll implement it, I just need to get through all the stuff I am bogged down with right now.

@metthal
Copy link
Contributor Author

metthal commented Jan 31, 2018

So I added get_meta() as proposed by @plusvic. This is how the input file looks like:

rule a {
	meta:
		key = "a"
		key = "b"
		key = "c"
	condition:
		true
}

This is code that uses get_meta():

import yara

r = yara.compile('/tmp/test.yar')
m = r.match(data='abc')
print(m[0].meta)
print(m[0].get_meta())
print(m[0].get_meta(allow_duplicates=True))

And the output:

{'key': 'c'}
{'key': 'c'}
{'key': ['a', 'b', 'c']}

@metthal metthal force-pushed the duplicate-meta-keys branch from 62a5f26 to 55031f8 Compare November 15, 2018 14:10
@metthal
Copy link
Contributor Author

metthal commented Nov 15, 2018

Bumping the PR since it somehow got completely forgotten.

@Northern-Lights
Copy link

Bump. It would be great to have the bindings mirror the libyara API.

READONLY,
"Dictionary with offsets and strings that matched the file"
},
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need for exporting this field to Python programs, right? If we already have the get_meta() function, we better prevent people from doing match._meta_with_dups.

}
}
else
PyDict_SetItemString(meta_list_with_dups, meta->identifier, object);
Copy link
Member

@plusvic plusvic Aug 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that values in meta_list_with_dups should be always a list, even if the key has a single value. These way you don't need to take into account whether the value is a list or a single value.

In my previous description I said:

match.get_meta(allow_duplicates=True) could return a list in case of duplications.

But after a second thought I think I'm not sure about returning a list only in case of duplications. That shouldn't be a big issue given Python's dynamic nature, but it may be less error prone if allow_duplicates=True means that values are always a list even if it has a single item. Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey hey! I agree that this new function should always return a list, regardless of the number of items in the list. This makes the function's output type consistent so as user I don't need to add conditions to check the type of response.

I (as well as others) are interested in this PR, so I'd be happy to pick up where @metthal left off!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry, I completely missed the fact that @plusvic reviewed my PR. I'm open to change the behavior in this PR.

Copy link
Contributor

@cccs-rs cccs-rs Mar 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why would we need to maintain a flag to allow duplicate values? I suppose there's no harm to have it but just wondering.

If you just want a straight Dict[str, str] object, you could always just use the Match.meta attribute, which is legacy for a lot of potential users.

BUT if you want all the metadata regardless of duplication, a la Match.get_meta() (or Match.get_all_meta() to be concise 😉), then you can always expect to get Dict[str, List[str]] back even if there's one item per list.

Mind you I'm still reading through the changes to make sure I understand them properly😀

EDIT: I also agree that it might not be necessary to expose an attribute containing the full meta especially if it's given as function output which the user will call. See: #74 (comment)

@plusvic
Copy link
Member

plusvic commented May 20, 2022

Already implemented in #201

@plusvic plusvic closed this May 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants