Autocomplete #69641

Open
iyubondyrev wants to merge 19 commits into ClickHouse:master from iyubondyrev:autocomplete_final

@iyubondyrev iyubondyrev commented Sep 16, 2024

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Autocomplete for ClickHouse CLI based on a small transformer and several Markov models

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

I will add full documentation in the future in this PR. For now:

It is an autocomplete system that predicts the next token of user input in ClickHouse CLI. At a high level, it works like this:

A transformer predicts the type of the next token: Literal, Operator, Identifier, or a specific keyword such as SELECT. For each of the three types (Literal, Operator, Identifier) there is a dedicated Markov model that predicts the value of the token itself. There is also a Markov model over raw (not preprocessed) tokens, i.e. over bare queries. If this last model is highly confident about the next word (p > 0.8), its prediction is placed on top of the predictions from the rest of the machinery.
This approach lets us surface words the raw Markov model is very sure of, and when it is not sure, it keeps us from predicting nonsense, because the suggestions are bound to a specific token type.
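
The ranking rule described above could be sketched roughly as follows. This is only an illustration, not the code from this PR: `Candidate`, `rankSuggestions` and the parameter names are hypothetical, and the transformer and Markov models are assumed to have already produced their candidates. Only the 0.8 threshold comes from the description.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical type: one suggested token with the probability its model assigned.
struct Candidate
{
    std::string token;
    double probability;
};

// Hypothetical helper (not part of the PR). `typed_candidates` is the ranked output
// of the Markov model for the token type chosen by the transformer; `raw_top` is the
// best guess of the Markov model over raw, not preprocessed queries.
std::vector<Candidate> rankSuggestions(
    const std::vector<Candidate> & typed_candidates,
    const Candidate & raw_top,
    double confidence_threshold = 0.8)
{
    assert(confidence_threshold >= 0.0 && confidence_threshold <= 1.0);

    std::vector<Candidate> result;

    // The raw model only gets a say when it is highly confident (p > threshold).
    const bool raw_is_confident = raw_top.probability > confidence_threshold;
    if (raw_is_confident)
        result.push_back(raw_top);

    for (const auto & candidate : typed_candidates)
    {
        // Do not list the same token twice when both models agree.
        if (raw_is_confident && candidate.token == raw_top.token)
            continue;
        result.push_back(candidate);
    }
    return result;
}
```

With typed candidates `FROM` and `WHERE`, a raw-model guess with p = 0.9 would be listed first, while a guess with p = 0.5 would be dropped and only the type-bound candidates would remain.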

Here is a short video (this is an old one, it works a bit better now).

autocomplete_speed_x5.mov

I will be updating this PR description and the code because it is still ongoing. There will be a very verbose description of the system a bit later as this is part of my course work and I need to do it anyway :).

Please have a look and if you have any suggestions/comments I will be happy to update the code/explain my choices.

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

CI Settings (Only check the boxes if you know what you are doing):

  • Allow: All Required Checks
  • Allow: Stateless tests
  • Allow: Stateful tests
  • Allow: Integration Tests
  • Allow: Performance tests
  • Allow: All Builds
  • Allow: batch 1, 2 for multi-batch jobs
  • Allow: batch 3, 4, 5, 6 for multi-batch jobs

  • Exclude: Style check
  • Exclude: Fast test
  • Exclude: All with ASAN
  • Exclude: All with TSAN, MSAN, UBSAN, Coverage
  • Exclude: All with aarch64, release, debug

  • Run only fuzzers related jobs (libFuzzer fuzzers, AST fuzzers, etc.)
  • Exclude: AST fuzzers

  • Do not test
  • Woolen Wolfdog
  • Upload binaries for special builds
  • Disable merge-commit
  • Disable CI cache

@CLAassistant

CLAassistant commented Sep 16, 2024

CLA assistant check
All committers have signed the CLA.

@iyubondyrev iyubondyrev changed the title from "Autocomplete final" to "Autocomplete" Sep 16, 2024
@yariks5s yariks5s added the can be tested Allows running workflows for external contributors label Sep 16, 2024
@robot-clickhouse robot-clickhouse added pr-feature Pull request with new product feature submodule changed At least one submodule changed in this PR. labels Sep 16, 2024
@robot-clickhouse
Member

robot-clickhouse commented Sep 16, 2024

This is an automated comment for commit 3e44d9f with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

| Check name | Description | Status |
| --- | --- | --- |
| Fast test | Normally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ❌ failure |

Successful checks

| Check name | Description | Status |
| --- | --- | --- |
| Docs check | Builds and tests the documentation | ✅ success |
| Style check | Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report | ✅ success |

@rschu1ze rschu1ze self-assigned this Sep 17, 2024
@rschu1ze
Member

Note to myself: thesis.

@clickhouse-gh
Contributor

clickhouse-gh bot commented Oct 29, 2024

Dear @rschu1ze, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

KneserNey markov_identifiers = KneserNey(markov_order);
KneserNey markov_operators = KneserNey(markov_order);

GPTJModel transformer_model = GPTJModel("ggml-model-f32.bin");
Member


Why not fp16?

Make the file embedded into the binary.


size_t query_history_limit = 700;

const String history_query = fmt::format(
Member


The query shouldn't run if it is clickhouse-local.

@alexey-milovidov alexey-milovidov mentioned this pull request Dec 31, 2024
76 tasks
@nikitamikhaylov nikitamikhaylov self-assigned this Jan 12, 2025
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added pr-improvement Pull request with some product improvements and removed pr-feature Pull request with new product feature labels Jan 13, 2025
@clickhouse-gh
Contributor

clickhouse-gh bot commented Feb 18, 2025

Dear @nikitamikhaylov, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.
