-
Notifications
You must be signed in to change notification settings - Fork 8.3k
regexp_tree dictionary type #21694
Description
Use case
User-Agent parsing: #157.
But the implementation should be flexible enough to solve more string parsing/extraction tasks.
Describe the solution you'd like
The dictionary data is represented by a tree of regular expressions.
(actually not a tree but multiple trees - there can be multiple root nodes)
Every node of the tree can have:
- Regular expression. If it is matched, then children (if any) will be processed and setting the result attribute (if any, see 2) will be done. Regexp can have subpatterns that can be referenced later.
- Setting an attribute(s) to some value. Can contain references to previously matched subpatterns. Attribute is processed as a string and then casted to its type (in dictionary spec).
- Arbitrary number of children nodes of the same type.
When attributes are requested from the dictionary, the tree will be traversed until every attribute is set. If there are multiple matching regexps that have branches setting the same attribute, the first (last?) wins. If the tree has been traversed but attribute was not set, the default value for attribute is assigned.
The data can be loaded from table (using any available dictionary source) with the following structure:
id UInt64,
parent_id UInt64,
regexp String,
attr_1 String,
...
attr_n String
id and parent_id only exists to make a tree structure.
attr_n is the column for the corresponding attribute - its value is how to set the attribute if regexp matched, examples: Internet Explorer (just a string) or Windows \1 (string with referenced subpattern) or \2 (just a referenced subpattern, for example, if subpattern contains digits - it can be used to set numeric attribute).
Alternatively, the tree can be specified in YAML file (config) in local filesystem - we should create YAMLDictionarySource for this purpose. In this case, id and parent_id are not specified (implicit).
The tree can have multiple children (alternative branches) with different regular expressions. We should combine these regular expressions to match in a single pass and select what branches are applicable. This is possible with Hyperscan. Maybe we should match all the regexps from all tree nodes at once and then traverse the tree just checking the bool flags.
Details
Maybe we can also allow to set a temporary (scratch) variables to reference it later in a tree.
The dictionary should support LowCardinality(String) types.