Skip to content

regexp_tree dictionary type #21694

@alexey-milovidov

Description

@alexey-milovidov

Use case

User-Agent parsing: #157.
But the implementation should be flexible enough to solve more string parsing/extraction tasks.

Describe the solution you'd like

The dictionary data is represented by a tree of regular expressions.
(actually not a tree but multiple trees - there can be multiple root nodes)

Every node of the tree can have:

  1. Regular expression. If it is matched, then children (if any) will be processed and setting the result attribute (if any, see 2) will be done. Regexp can have subpatterns that can be referenced later.
  2. Setting an attribute(s) to some value. Can contain references to previously matched subpatterns. Attribute is processed as a string and then casted to its type (in dictionary spec).
  3. Arbitrary number of children nodes of the same type.

When attributes are requested from the dictionary, the tree will be traversed until every attribute is set. If there are multiple matching regexps that have branches setting the same attribute, the first (last?) wins. If the tree has been traversed but attribute was not set, the default value for attribute is assigned.

The data can be loaded from table (using any available dictionary source) with the following structure:

id UInt64,
parent_id UInt64,
regexp String,
attr_1 String,
...
attr_n String

id and parent_id only exists to make a tree structure.
attr_n is the column for the corresponding attribute - its value is how to set the attribute if regexp matched, examples: Internet Explorer (just a string) or Windows \1 (string with referenced subpattern) or \2 (just a referenced subpattern, for example, if subpattern contains digits - it can be used to set numeric attribute).

Alternatively, the tree can be specified in YAML file (config) in local filesystem - we should create YAMLDictionarySource for this purpose. In this case, id and parent_id are not specified (implicit).

The tree can have multiple children (alternative branches) with different regular expressions. We should combine these regular expressions to match in a single pass and select what branches are applicable. This is possible with Hyperscan. Maybe we should match all the regexps from all tree nodes at once and then traverse the tree just checking the bool flags.

Details

Maybe we can also allow to set a temporary (scratch) variables to reference it later in a tree.

The dictionary should support LowCardinality(String) types.

Metadata

Metadata

Assignees

Labels

comp-dictionaryDictionaries (in-memory key-value, periodically refreshed from sources).feature

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions