Skip to content

Merge #40878: Add regexp tree dictionary#43858

Merged
hanfei1991 merged 34 commits intoClickHouse:masterfrom
hanfei1991:regexp-tree-dictionary
Jan 10, 2023
Merged

Merge #40878: Add regexp tree dictionary#43858
hanfei1991 merged 34 commits intoClickHouse:masterfrom
hanfei1991:regexp-tree-dictionary

Conversation

@hanfei1991
Copy link
Copy Markdown
Member

@hanfei1991 hanfei1991 commented Dec 1, 2022

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

to merge #40878 , supporting regexp dictionary.

The YAML source has a new format. For every yaml node, the format will be like:

  • key_name (defined by dictioanary structure) : regular expression
  • attribute_name (defined by dictionary structure) : value (might be a string containing back-reference)
  • ... (multiple attribute names)
  • name which is not key or attributes: another yaml node (for recursive match)

This format is to keep compatible with: https://github.com/ua-parser/uap-core and https://github.com/matomo-org/device-detector/tree/master/regexes

The match order is :

  1. son is higher than parents,
  2. smaller id is highter than larger id if they are in the same layer.

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-feature Pull request with new product feature label Dec 1, 2022
@hanfei1991 hanfei1991 changed the title Merge #40878 [WIP] Merge #40878 Dec 6, 2022
@hanfei1991 hanfei1991 changed the title [WIP] Merge #40878 Merge #40878 Dec 22, 2022
@hanfei1991
Copy link
Copy Markdown
Member Author

Next goal is:

  1. Support popular ua-parser lib his format is to keep compatible with: https://github.com/ua-parser/uap-core and https://github.com/matomo-org/device-detector/tree/master/regexes , the problem with them right now is:
    • the regular expressions in the uap-core include finite boundries like "{0,200}" which cause huge pressure on vectorscan. we should refine these re before we test this lib
    • the regular expressions in device-detector contain zero-length assertions which is not supported by vectorscan. I will see if it's possible to avoid this syntax.
  2. Support more types parsing in yaml file and handle exceptions for parse error.

The other parts are ready to review. PTAL @vdimir

@vdimir vdimir changed the title Merge #40878 Merge #40878: Add regexp tree dictionary Dec 28, 2022
Copy link
Copy Markdown
Member

@vdimir vdimir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay.
The code almost looks good, just some comments. Also, It's worth adding some docs (in this or followup PR).

for (const auto & item : pieces)
{
if (item.ref_num >= 0)
result += matches[item.ref_num].ToString();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already checked ref_num in createStringPieces, but let's also add check here to be sure that item.ref_num < 10 and we don't overflow matches array

const StorageID & id_, const DictionaryStructure & structure_, DictionarySourcePtr source_ptr_, Configuration configuration_)
: IDictionary(id_), structure(structure_), source_ptr(source_ptr_), configuration(configuration_), logger(&Poco::Logger::get("RegExpTreeDictionary"))
{
if (auto * ch_source = typeid_cast<ClickHouseDictionarySource *>(source_ptr.get()))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How sample_block will work if source is other than Clickhouse ? Is it safe to change ClickHouseDictionarySource::sample_block after ClickHouseDictionarySource was created?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if source is not ClickHouse, the schema will mismatch and raise a Exception.

This is only for test. We assume that the user only uses YAML source by default.

comment String default 'nothing'
)
PRIMARY KEY(regexp)
SOURCE(CLICKHOUSE(QUERY concat('select * from ', currentDatabase() , '.regexp_dictionary_source_table')))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why it's not just SOURCE(CLICKHOUSE(TABLE 'regexp_dictionary_source_table'))?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the schema will mismatch if we refer a TABLE. Though if we refer a QUERY, the schema will be decided by sample_block. That's trikcy but I think it's ok for test.

if (auto * ch_source = typeid_cast<ClickHouseDictionarySource *>(source_ptr.get()))
{
Block sample_block;
/// id, parent_id, regex, keys, values
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How exactly existing sample_block differ? In tests we have the same

CREATE TABLE regexp_dictionary_source_table
(
    id UInt64,
    parent_id UInt64,
    regexp String,
    keys   Array(String),
    values Array(String),

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah but the schema of Dictionary is (regexp, attribute_1, attribute_2 ...) . ClickHouse assumes dictionary and ClickHouse source have same schema

* - regexp: "MSIE (\\d+)"
* attr1: "Vecna"
* attr2: 22.8
* match: # nested match for subpattern
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in tests it's named versions, so what is the correct way to define subpath in this tree?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed, the subpath attrubute name can be arbitrary.

@hanfei1991
Copy link
Copy Markdown
Member Author

hanfei1991 commented Dec 30, 2022

jobs needed to be done in this pr:

  • write docs
  • catch exception for parsing error, for example, type int64 meets 'abc'
  • restrict source for regex dict (only allow yaml source, add a setting for clickhouse table source)
  • support all data types for regex dict
  • do not pollute original regexp cache

jobs after this pr:

  • support uap-core for ua parsing and add it to test
    • support default value containing back-reference
  • support low-cardinality columns
  • support cache
  • support re2 as fallback when vectorscan is disabled or cannot process the regexp tree.
  • benchmark

@hanfei1991
Copy link
Copy Markdown
Member Author

@vdimir PTAL. I tested more data types and added restrictions for source types. Now it's safer than before. I also added some simple docs with simple examples. Not thouroughly but enough for a good begin :)

Copy link
Copy Markdown
Member

@vdimir vdimir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hanfei1991 hanfei1991 merged commit 91226ab into ClickHouse:master Jan 10, 2023
@hanfei1991 hanfei1991 deleted the regexp-tree-dictionary branch January 10, 2023 15:41
Comment on lines +218 to +219
if (topology_order.size() != regex_nodes.size())
throw Exception(ErrorCodes::INCORRECT_DICTIONARY_DEFINITION, "Invalid Regex tree");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This exception message is not really informative for a user, could you please update it (or replace with LOGICAL ERROR if it's not for a user)? (what is invalid? how to fix it?)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually this belongs to LOGICAL_ERROR. fixing in #48632
Thank you!

Comment on lines +226 to +227
if (visited.contains(child_idx))
throw Exception(ErrorCodes::INCORRECT_DICTIONARY_DEFINITION, "Invalid Regex tree");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants