Merge #40878: Add regexp tree dictionary#43858
Merge #40878: Add regexp tree dictionary#43858hanfei1991 merged 34 commits intoClickHouse:masterfrom
Conversation
|
Next goal is:
The other parts are ready to review. PTAL @vdimir |
vdimir
left a comment
There was a problem hiding this comment.
Sorry for the delay.
The code almost looks good, just some comments. Also, It's worth adding some docs (in this or followup PR).
tests/queries/0_stateless/02504_regexp_dictionary_yaml_source.sh
Outdated
Show resolved
Hide resolved
| for (const auto & item : pieces) | ||
| { | ||
| if (item.ref_num >= 0) | ||
| result += matches[item.ref_num].ToString(); |
There was a problem hiding this comment.
We already checked ref_num in createStringPieces, but let's also add check here to be sure that item.ref_num < 10 and we don't overflow matches array
| const StorageID & id_, const DictionaryStructure & structure_, DictionarySourcePtr source_ptr_, Configuration configuration_) | ||
| : IDictionary(id_), structure(structure_), source_ptr(source_ptr_), configuration(configuration_), logger(&Poco::Logger::get("RegExpTreeDictionary")) | ||
| { | ||
| if (auto * ch_source = typeid_cast<ClickHouseDictionarySource *>(source_ptr.get())) |
There was a problem hiding this comment.
How sample_block will work if source is other than Clickhouse ? Is it safe to change ClickHouseDictionarySource::sample_block after ClickHouseDictionarySource was created?
There was a problem hiding this comment.
if source is not ClickHouse, the schema will mismatch and raise a Exception.
This is only for test. We assume that the user only uses YAML source by default.
| comment String default 'nothing' | ||
| ) | ||
| PRIMARY KEY(regexp) | ||
| SOURCE(CLICKHOUSE(QUERY concat('select * from ', currentDatabase() , '.regexp_dictionary_source_table'))) |
There was a problem hiding this comment.
Why it's not just SOURCE(CLICKHOUSE(TABLE 'regexp_dictionary_source_table'))?
There was a problem hiding this comment.
because the schema will mismatch if we refer a TABLE. Though if we refer a QUERY, the schema will be decided by sample_block. That's trikcy but I think it's ok for test.
| if (auto * ch_source = typeid_cast<ClickHouseDictionarySource *>(source_ptr.get())) | ||
| { | ||
| Block sample_block; | ||
| /// id, parent_id, regex, keys, values |
There was a problem hiding this comment.
How exactly existing sample_block differ? In tests we have the same
CREATE TABLE regexp_dictionary_source_table
(
id UInt64,
parent_id UInt64,
regexp String,
keys Array(String),
values Array(String),There was a problem hiding this comment.
Yeah but the schema of Dictionary is (regexp, attribute_1, attribute_2 ...) . ClickHouse assumes dictionary and ClickHouse source have same schema
| * - regexp: "MSIE (\\d+)" | ||
| * attr1: "Vecna" | ||
| * attr2: 22.8 | ||
| * match: # nested match for subpattern |
There was a problem hiding this comment.
in tests it's named versions, so what is the correct way to define subpath in this tree?
There was a problem hiding this comment.
indeed, the subpath attrubute name can be arbitrary.
Co-authored-by: Vladimir C <[email protected]>
Co-authored-by: Vladimir C <[email protected]>
jobs needed to be done in this pr:
jobs after this pr:
|
|
@vdimir PTAL. I tested more data types and added restrictions for source types. Now it's safer than before. I also added some simple docs with simple examples. Not thouroughly but enough for a good begin :) |
docs/en/sql-reference/dictionaries/external-dictionaries/regexp-tree.md
Outdated
Show resolved
Hide resolved
| if (topology_order.size() != regex_nodes.size()) | ||
| throw Exception(ErrorCodes::INCORRECT_DICTIONARY_DEFINITION, "Invalid Regex tree"); |
There was a problem hiding this comment.
This exception message is not really informative for a user, could you please update it (or replace with LOGICAL ERROR if it's not for a user)? (what is invalid? how to fix it?)
There was a problem hiding this comment.
actually this belongs to LOGICAL_ERROR. fixing in #48632
Thank you!
| if (visited.contains(child_idx)) | ||
| throw Exception(ErrorCodes::INCORRECT_DICTIONARY_DEFINITION, "Invalid Regex tree"); |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
to merge #40878 , supporting regexp dictionary.
The YAML source has a new format. For every yaml node, the format will be like:
This format is to keep compatible with: https://github.com/ua-parser/uap-core and https://github.com/matomo-org/device-detector/tree/master/regexes
The match order is :