
*wip* Add experimental AI functions aiClassify/Extract/GenerateSQL/Translate/GenerateContent/GenerateEmbedding#100735

Closed
rschu1ze wants to merge 30 commits into master from
feature/llm-functions

Conversation

@rschu1ze
Member

@rschu1ze rschu1ze commented Mar 25, 2026

This is a re-push of #99579
Courtesy of @melvynator

Introduces experimental AI functions that call external AI/embedding providers (OpenAI, Anthropic) directly from SQL queries without leaving ClickHouse.

Uses Named Collections for storing provider credentials and query-level settings for operational parameters (like temperature and error handling).

All functions are gated behind the allow_experimental_ai_functions setting.

Changelog category (leave one):

  • Experimental Feature

Changelog entry (a user-readable short description):

Introduces experimental AI functions aiClassify, aiExtract, aiGenerateSQL, aiTranslate, aiGenerateContent, aiGenerateEmbedding, aiGenerateEmbeddingOrNull that call external AI and embedding providers (OpenAI, Anthropic) directly from SQL queries without leaving ClickHouse.

Ubuntu and others added 19 commits March 25, 2026 16:20
…SQL, LLMTranslate

Introduces four new scalar functions that call external LLM providers
(OpenAI, Anthropic) directly from SQL queries, enabling AI-powered
text classification, structured extraction, natural language to SQL
generation, and translation without leaving ClickHouse.

Key components:
- Provider abstraction (ILLMProvider) with OpenAI and Anthropic implementations
- Shared execution engine (LLMFunctionBase) with concurrent dispatch,
  request deduplication, SipHash128-based result caching, token-bucket
  rate limiting, exponential backoff retries, and per-query quota enforcement
- 13 new llm_* query-level settings for fine-grained control
- 9 LLM-specific ProfileEvents for observability
- Input sanitization to handle control characters in raw text
- LLMGenerateSQL resolves live database schema via DatabaseCatalog
  and strips markdown fences from output for direct copy-paste execution

Configuration uses Named Collections for provider credentials and
query-level settings for operational parameters, following the
precedent set by the S3 integration.
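The retry machinery listed above is mostly policy around a simple schedule. As a rough illustration only (hypothetical helper name, not the PR's actual retry code), exponential backoff produces a doubling delay sequence capped at a maximum:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch of an exponential-backoff schedule: the delay doubles after each
// failed attempt and is capped at max_ms. Illustrative only.
std::vector<uint64_t> backoffDelaysMs(uint64_t initial_ms, uint64_t max_ms, int attempts)
{
    std::vector<uint64_t> delays;
    uint64_t d = initial_ms;
    for (int i = 0; i < attempts; ++i)
    {
        delays.push_back(d);
        d = std::min<uint64_t>(d * 2, max_ms);
    }
    return delays;
}
```

With an initial delay of 100 ms and a 1000 ms cap, five attempts would wait 100, 200, 400, 800, and 1000 ms respectively.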
…sting LLM code

New functions:
- LLMGenerateContent: general-purpose text generation via LLMFunctionBase
- LLMGenerateEmbedding: standalone IFunction returning Array(Float32), with
  OpenAI embedding endpoint auto-derivation and binary cache serialization

Embedding provider support:
- Add LLMEmbeddingRequest/Response structs and ILLMProvider::embed() virtual
- Implement OpenAIProvider::embed() with deriveEmbeddingURI()
- Default embed() throws NOT_IMPLEMENTED for unsupported providers

Bug fixes and hardening (from code review):
- Fix cache key hash collisions by length-prefixing SipHash fields
- Normalize Anthropic stop_reason to OpenAI finish_reason convention
  (max_tokens -> length, end_turn -> stop) so truncated responses skip cache
- Handle NULL input rows in LLMFunctionBase and LLMGenerateEmbedding
- Fix JSON injection in LLMClassify categories via Poco::JSON::Array
- Use proper Poco::JSON::Parser in LLMClassify::postProcessResponse
- Add null-safety checks in OpenAI/Anthropic response parsing (optValue)
- Fix memory leak: raw Poco::JSON::Array* -> Poco::JSON::Array::Ptr
- Align quota modes with PRD: on_quota_exceeded accepts "break"/"null",
  on_error accepts "null" (was checking non-existent "partial")

Tests:
- Add 03300_llm_functions stateless test covering argument validation,
  return type verification, and non-constant dimension rejection
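The length-prefixing fix above addresses a classic ambiguity: concatenating fields before hashing lets different field tuples produce identical input bytes. A minimal sketch of the problem and the fix (illustrative key builders, not the actual SipHash128 code):

```cpp
#include <string>
#include <vector>

// Naive concatenation: ("ab", "c") and ("a", "bc") yield the same bytes,
// so any hash of the result collides.
std::string naiveKey(const std::vector<std::string> & fields)
{
    std::string key;
    for (const auto & f : fields)
        key += f;
    return key;
}

// Length-prefixing each field makes the field boundaries unambiguous.
std::string lengthPrefixedKey(const std::vector<std::string> & fields)
{
    std::string key;
    for (const auto & f : fields)
    {
        key += std::to_string(f.size());
        key += ':';
        key += f;
    }
    return key;
}
```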
…che reset

- LLMGenerateEmbedding now batches multiple texts per HTTP request
  (configurable via max_batch_size in named collection, default 100)
- Empty string inputs return empty array without calling the API
- Null inputs return empty array gracefully
- First argument can be an inline URL (http/https) bypassing named collections
- HuggingFace TEI and huggingface recognized as provider aliases (OpenAI-compatible)
- LLMResultCache gains reset() method for cache invalidation
- OpenAI embed() parses response index field for correct ordering
- Authorization header omitted when api_key is empty (for local models)
Embedding models are not LLMs, so the LLM prefix is misleading.
The new name is more accurate and consistent with ClickHouse naming.
…perature handling, and Anthropic compatibility

- Add generateEmbeddingOrNull function (returns empty array on errors instead of throwing)
- Fix hasNamedCollectionArg to detect collection name as first argument via NamedCollectionFactory lookup
- Fix resolveTemperature to extract temperature from last Float argument instead of always using default
- Fix LLMGenerateContent to distinguish system_prompt (String) from temperature (Float) arguments
- Rename LLM* ProfileEvents to AI* (AIInputTokens, AIAPICalls, etc.)
- Add embedding_max_batch_size query-level setting
- Improve Settings.cpp descriptions for all AI function settings
- Add AI functions documentation page (ai-functions.md)
- Update stateless tests for generateEmbeddingOrNull
AI functions now require SET allow_experimental_ai_functions = 1 before use,
consistent with other experimental features (NLP, hash, funnel functions).
…xtPtr

- Move AI settings from 26.3 to 26.4 in SettingsChangesHistory.cpp
- Use WithContext instead of holding ContextPtr copy in LLMFunctionBase and generateEmbedding
- Remove unused ErrorCodes, ProfileEvents, CurrentMetrics declarations across LLM files
- Fix clang-tidy: const ref copies, range-based loop, empty catch comment, int64_t, static
- Add AI function names to aspell dictionary
- Add allow_experimental_ai_functions to enableAllExperimentalSettings.cpp
- Fix "informations" typo in ai-functions.md
…es/database params

- LLMGenerateSQL: add explicit arity check when first arg is a named
  collection so LLMGenerateSQL('existing_collection') fails with a
  deterministic error instead of OOB access
- LLMGenerateSQL: remove unused tables/database parameters from
  signature (schema is auto-discovered from catalog), reducing max args
  from 5 to 3
- LLMGenerateContent: add same named-collection arity guard
- Update docs and REGISTER_FUNCTION metadata to match
The various_checks.sh style check forbids using WithContext in
src/Functions/. Use ContextWeakPtr + getContext() helper directly,
which achieves the same safe weak-reference semantics without
triggering the check.
…g check, quota values

- Guard resp.embeddings.back() UB when provider returns empty response
- Sanitize provider error messages to avoid leaking sensitive data
  (extract JSON error message or truncate to 256 chars)
- Throw BAD_ARGUMENTS when first arg is non-constant in 3-arg
  generateEmbedding form instead of silently falling back to default
- Align llm_on_quota_exceeded: remove undocumented break/partial
  values, only accept throw/null as documented in Settings.cpp
Fixes test 02415_all_new_functions_must_be_documented which requires
all functions to have description, syntax, and introduced_in metadata.
- Add no-replicated-database tag to 03300_llm_functions test to prevent
  timeout in DatabaseReplicated mode
- Remove duplicate LLM settings entries from 26.3 block (already in 26.4)
LLM functions require external HTTP access and named collection
configuration that is not available in the unit test environment.
@rschu1ze rschu1ze changed the title Feature/llm functions Add experimental AI functions llmClassify/Extract/GenerateSQL/Translate/GenerateContent and generateEmbedding[OrNull] Mar 25, 2026
@clickhouse-gh
Contributor

clickhouse-gh bot commented Mar 25, 2026

Workflow [PR], commit [4a0b1ef]

Summary:

  • Style check: failure
    • aspell: failure (info: cidb)
  • Docs check: dropped
  • Build (amd_debug): dropped
  • Build (amd_asan_ubsan): dropped
  • Build (amd_tsan): dropped
  • Build (amd_msan): dropped
  • Build (amd_binary): dropped
  • Build (arm_asan_ubsan): dropped
  • Build (arm_binary): dropped
  • Build (amd_release): dropped

AI Review

Summary

This PR adds experimental AI SQL functions and provider integration with caching/quota/rate-limit controls plus docs/tests. I found one additional actionable issue not already covered by existing bot comments: embedding function metadata still advertises old non-ai function names. Verdict: request a small follow-up fix before merge to keep function metadata consistent for users.

Findings

⚠️ Majors

  • [src/Functions/aiGenerateEmbedding.cpp:481,499] FunctionDocumentation .syntax and examples still use generateEmbedding / generateEmbeddingOrNull, while the registered SQL names are aiGenerateEmbedding / aiGenerateEmbeddingOrNull.
    • Impact: user-facing metadata and docs/introspection can mislead users to call non-existent names.
    • Suggested fix: update those documentation strings/examples to use aiGenerateEmbedding(...) and aiGenerateEmbeddingOrNull(...).
ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout
Compilation time
User-Lens
  • Function metadata naming mismatch (generateEmbedding* vs aiGenerateEmbedding*) is confusing for users relying on docs/introspection output.
Final Verdict
  • Status: ⚠️ Request changes
  • Minimum required action: align embedding FunctionDocumentation syntax/examples with actual ai* SQL function names.

, retry_delay_ms(context->getSettingsRef()[Setting::llm_retry_initial_delay_ms])
, cache_ttl(context->getSettingsRef()[Setting::llm_cache_ttl_sec])
, default_llm_resource(context->getSettingsRef()[Setting::default_llm_resource])
, llm_max_rows_per_query(context->getSettingsRef()[Setting::llm_cache_ttl_sec])
Contributor

llm_max_rows_per_query is initialized from llm_cache_ttl_sec here, so row quota tracks TTL seconds instead of the configured row limit.

Impact: quota enforcement becomes incorrect (default 86400 instead of 100000, and user-set llm_max_rows_per_query is ignored).

Please initialize it from Setting::llm_max_rows_per_query.

, llm_max_api_calls_per_query(context->getSettingsRef()[Setting::llm_max_api_calls_per_query])
, llm_on_error(context->getSettingsRef()[Setting::llm_on_error])
, llm_on_quota_exceeded(context->getSettingsRef()[Setting::llm_on_quota_exceeded])
, max_batch_size(context->getSettingsRef()[Setting::generateembedding_max_batch_size] ? 0 : DEFAULT_EMBED_BATCH_SIZE)
Contributor

max_batch_size is inverted in this ternary. When generateembedding_max_batch_size is non-zero, this sets max_batch_size to 0.

That makes for (size_t i = 0; i < to_dispatch.size(); i += max_batch_size) non-progressing (i += 0) and the query can hang indefinitely.

Please use the configured setting value when it is non-zero, otherwise DEFAULT_EMBED_BATCH_SIZE.
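The corrected selection the reviewer asks for can be sketched like this (hypothetical helper names; the default of 100 comes from the PR description):

```cpp
#include <cstddef>

static constexpr size_t DEFAULT_EMBED_BATCH_SIZE = 100;

// Use the configured value when non-zero, otherwise the default, so the
// batching loop's stride can never be zero.
size_t resolveBatchSize(size_t configured)
{
    return configured ? configured : DEFAULT_EMBED_BATCH_SIZE;
}

// With a non-zero stride, the dispatch loop always terminates.
size_t countBatches(size_t total_items, size_t batch_size)
{
    size_t batches = 0;
    for (size_t i = 0; i < total_items; i += batch_size)
        ++batches;
    return batches;
}
```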

std::lock_guard lock(results_mutex);
results[key] = cached->result;
++cache_hits;
quota->rows_processed.fetch_add(rows.size(), std::memory_order_relaxed);
Contributor

⚠️ Cache-hit rows bypass quota validation: this path increments rows_processed directly, but only dispatch path calls checkBeforeDispatch.

Impact: a query can exceed llm_max_rows_per_query if many rows are served from cache.

Please apply the same quota check for cached rows before accepting the cache hit (or centralize quota checks so both cached and dispatched paths share the same gate).
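A centralized gate of the kind suggested could look roughly like this (hypothetical struct, shown only to illustrate routing both paths through one check):

```cpp
#include <atomic>
#include <cstdint>
#include <stdexcept>

// Single accounting point for rows, called for both cached and dispatched
// rows, so cache hits cannot bypass the per-query row limit.
struct RowQuota
{
    std::atomic<uint64_t> rows_processed{0};
    const uint64_t max_rows;

    explicit RowQuota(uint64_t max_rows_) : max_rows(max_rows_) {}

    void account(uint64_t rows)
    {
        // fetch_add returns the previous value; add `rows` to get the new total.
        const uint64_t total = rows_processed.fetch_add(rows, std::memory_order_relaxed) + rows;
        if (total > max_rows)
            throw std::runtime_error("row quota exceeded");
    }
};
```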


DataTypePtr getReturnTypeImpl(const ColumnsWithTypeAndName & arguments) const override
{
if (arguments.size() < 2 || arguments.size() > 4)
Contributor

❌ With a named collection, this arity check is too permissive and allows a 2-argument call ([collection,] text, categories...) to pass analysis. In buildSystemPrompt we then access arguments[idx + 1], which becomes out-of-bounds when only collection,text are provided.

Please require at least 3 arguments when hasNamedCollectionArg(arguments) is true (or add an explicit guard before dereferencing idx + 1).
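The requested guard is a small predicate. A sketch with hypothetical names (the real check would live in getReturnTypeImpl, next to the existing arity check):

```cpp
#include <cstddef>

// When the first argument is a named collection it only carries configuration,
// so at least three arguments are needed to leave room for text plus the
// remaining required arguments. Bounds chosen to match the 2..4 form above.
bool arityIsValid(size_t num_args, bool has_named_collection)
{
    if (has_named_collection)
        return num_args >= 3 && num_args <= 4;
    return num_args >= 2 && num_args <= 4;
}
```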


DataTypePtr getReturnTypeImpl(const ColumnsWithTypeAndName & arguments) const override
{
if (arguments.size() < 2 || arguments.size() > 4)
Contributor

❌ Same issue here: when the first argument is a named collection, this check still allows only 2 arguments total. Later buildSystemPrompt / buildResponseFormatJSON read arguments[idx + 1], so [collection, text] can reach an out-of-bounds access.

Please enforce >= 3 arguments when hasNamedCollectionArg(arguments) is true.


DataTypePtr getReturnTypeImpl(const ColumnsWithTypeAndName & arguments) const override
{
if (arguments.size() < 2 || arguments.size() > 5)
Contributor

❌ For llmTranslate, a named-collection call requires at least collection, text, target_language. This check accepts 2 arguments, but buildSystemPrompt unconditionally reads arguments[idx + 1] (target_language), which can go out of bounds for [collection, text].

Please add the same named-collection arity guard used in llmGenerateSQL/llmGenerateContent.

SELECT '-- llmExtract: missing named collection';
SELECT llmExtract('text', 'extract the name'); -- { serverError BAD_ARGUMENTS }

-- llmExtract expects 2-5 args
Contributor

⚠️ This section appears to duplicate llmExtract tests (llmExtract('hello world', 'French')) while llmTranslate argument validation is not tested at all.

Given the arity-sensitive logic around optional named collections, please add explicit negative tests for llmTranslate (too few / too many / missing named collection) and replace this duplicated block.

for (const auto * item : batch_items)
req.inputs.push_back(item->text);

auto resp = provider->embed(req, timeouts);
Contributor

⚠️ ai_max_rps is ignored by embedding functions.

FunctionBaseAI throttles chat/completion calls, but this embedding path sends requests directly without a Throttler. This means ai_max_rps does not cap request rate for generateEmbedding/generateEmbeddingOrNull, so embedding-heavy queries can burst above the configured global AI request rate.

Please apply the same throttling mechanism here before each provider request.
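A token-bucket throttler of the kind requested can be sketched as follows (illustrative only; ClickHouse has its own Throttler class, which is what the reviewer suggests reusing here):

```cpp
#include <algorithm>

// Token bucket: refills at `rps` tokens per second up to `capacity`; each
// provider request consumes one token. Time is passed in explicitly to keep
// the sketch deterministic and testable.
struct TokenBucket
{
    double rps;
    double tokens;
    double capacity;

    TokenBucket(double rps_, double capacity_) : rps(rps_), tokens(capacity_), capacity(capacity_) {}

    // Advance time by elapsed_sec, then try to take one token.
    // Returns true if the request may proceed immediately.
    bool tryAcquire(double elapsed_sec)
    {
        tokens = std::min(capacity, tokens + elapsed_sec * rps);
        if (tokens >= 1.0)
        {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
};
```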

@rschu1ze rschu1ze changed the title Add experimental AI functions llmClassify/Extract/GenerateSQL/Translate/GenerateContent and generateEmbedding[OrNull] Add experimental AI functions aiClassify/Extract/GenerateSQL/Translate/GenerateContent/GenerateEmbedding Mar 25, 2026
"First argument of {} must be a constant string (named collection or URL) when 3 arguments are provided", name);
String first_arg(first_const->getDataAt(0));

if (!first_arg.empty() && looksLikeURL(first_arg))
Contributor

generateEmbedding accepts an inline URL and assigns it directly to endpoint_val without any allowlist or access-control check. This lets any user who can run this experimental function make arbitrary outbound HTTP requests from the server process (SSRF), including to internal metadata services or private network endpoints.

Please enforce the same URL-safety policy used by other SQL-accessible HTTP paths (for example host allowlists / restricted schemes), or require named collections only.

{
auto it = results.find(key);
const std::vector<Float32> * vec = nullptr;
if (it != results.end() && !it->second.empty())
Contributor

For generateEmbedding (non-OrNull), malformed provider responses can silently produce wrong results instead of throwing.

If the provider omits embeddings for some inputs (or returns empty vectors), those rows are treated as skipped and emitted as empty arrays. This violates the documented "throws on API errors" behavior and can silently corrupt downstream similarity calculations.

Please validate that each requested input has a non-empty embedding (and ideally expected dimensionality), and throw RECEIVED_ERROR_FROM_REMOTE_IO_SERVER when this contract is violated (while keeping graceful behavior for generateEmbeddingOrNull).

if (!first_arg.empty() && looksLikeURL(first_arg))
{
provider_name = "openai";
endpoint_val = first_arg;
Contributor

generateEmbedding allows the 3-argument URL form ([collection_or_url,] text, dimensions), but this branch sets provider_name and endpoint_val only; model stays empty.

Later the request always sends ai_embedding_request.model = model, so URL-based calls submit an empty model and fail at runtime on OpenAI-compatible APIs.

Please either reject URL mode unless a model can be resolved, or resolve model from settings/named collection for this branch before dispatch.

/* ####################################################### */ \
/* AI function settings (LLMClassify, LLMExtract, LLMGenerateSQL, LLMTranslate, LLMGenerateContent, generateEmbedding, generateEmbeddingOrNull) */ \
DECLARE(Bool, allow_experimental_ai_functions, false, R"(
Enable experimental AI functions (LLMClassify, LLMExtract, LLMGenerateSQL, LLMTranslate, LLMGenerateContent, generateEmbedding, generateEmbeddingOrNull). These functions make external HTTP calls to LLM and embedding providers.
Contributor

The setting documentation strings still reference legacy function names (LLMClassify, LLMExtract, LLMGenerateSQL, LLMTranslate, LLMGenerateContent) that are not the exposed SQL names in this PR (aiClassify, aiExtract, aiGenerateSQL, aiTranslate, aiGenerateContent).

Please update these descriptions to current function names to avoid confusing users.


void AIQuotaTracker::recordResponse(UInt64 in_tokens, UInt64 out_tokens)
{
input_tokens.fetch_add(in_tokens, std::memory_order_relaxed);
Contributor

ai_max_input_tokens_per_query is not actually enforced for chat-style AI functions.

checkBeforeDispatch receives estimated_input_tokens = 0 from callers, and recordResponse only increments input_tokens without checking max_input_tokens.

Impact: queries can exceed the configured input-token budget silently, despite the setting/docs saying the quota is enforced per query.

Please enforce the limit when recording actual usage (e.g. in recordResponse) or pass a non-zero estimate to checkBeforeDispatch and reconcile with actual usage.
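Enforcing the budget at recording time, as suggested, could look roughly like this (hypothetical struct mirroring the tracker's fields; not the PR's actual AIQuotaTracker):

```cpp
#include <atomic>
#include <cstdint>
#include <stdexcept>

// Record actual input-token usage and reject the query once the configured
// per-query budget is exceeded, instead of counting silently.
struct InputTokenQuota
{
    std::atomic<uint64_t> input_tokens{0};
    const uint64_t max_input_tokens;

    explicit InputTokenQuota(uint64_t limit) : max_input_tokens(limit) {}

    void recordResponse(uint64_t in_tokens)
    {
        const uint64_t total = input_tokens.fetch_add(in_tokens, std::memory_order_relaxed) + in_tokens;
        if (total > max_input_tokens)
            throw std::runtime_error("ai_max_input_tokens_per_query exceeded");
    }
};
```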

DROP TABLE IF EXISTS _03300_ret_embedding_or_null;

-- =============================================================================
-- 9. aiGenerateEmbedding: dimensions argument must be constant
Member

This is weird. The number of dimensions can't be a function argument.

-- aiGenerateEmbeddingOrNull returns Nullable(Array(Float32))
DROP TABLE IF EXISTS _03300_ret_embedding_or_null;
CREATE TABLE _03300_ret_embedding_or_null ENGINE = Memory AS
SELECT aiGenerateEmbeddingOrNull(x, 256) AS result FROM (SELECT 'hello' AS x WHERE 0);
Member

What exactly does 'OrNull' mean?

{
String schema = resolveSchema();
return "You are a ClickHouse SQL expert. Generate a valid ClickHouse SQL query.\n"
"Rules:\n"
Member

It's weird to have the prompt hard-coded.
Please make it orthogonal or remove the function altogether.

Member

https://github.com/apache/doris/pull/51725/changes#diff-0b36d9b633dcf67a515eded9e04db7c6d1aa70d4803ae80275ff4a792e65a7e6R27-R33

Not sure weird is the appropriate terminology. DuckDB extension does the same thing too.

But making the prompt an optional parameter (defaulting to this one) is ok.


if (result.find("```sql") != String::npos)
strip_prefix(result, "```sql");
else if (result.find("```SQL") != String::npos)
Member

Markdown injection.
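A stricter variant only strips a fence that wraps the whole response, rather than searching for "```sql" anywhere in the text where user-controlled content could contain it. A hypothetical helper to illustrate the point (not the PR's fix):

```cpp
#include <string>

// Strip a markdown code fence only when it encloses the entire response.
std::string stripSQLFence(std::string s)
{
    auto startsWith = [](const std::string & str, const std::string & p)
    {
        return str.compare(0, p.size(), p) == 0;
    };
    for (const std::string & prefix : {std::string("```sql\n"), std::string("```SQL\n"), std::string("```\n")})
    {
        if (startsWith(s, prefix))
        {
            s.erase(0, prefix.size());
            break;
        }
    }
    if (s.size() >= 4 && s.compare(s.size() - 4, 4, "\n```") == 0)
        s.erase(s.size() - 4);
    return s;
}
```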

"First argument of {} must be a constant string (named collection or URL) when 3 arguments are provided", name);
String first_arg(first_const->getDataAt(0));

if (!first_arg.empty() && looksLikeURL(first_arg))
Member

And now we have an SSRF vulnerability.

std::unordered_map<UInt128, std::vector<Float32>, UInt128Hash> results;
std::mutex results_mutex;

std::atomic<UInt64> total_api_calls{0};
Member

There is no reason to make it atomic inside the loop.

throw;
}
}
catch (...)
Member

We don't allow this.

});
}

pool->wait();
Member

No. Functions shouldn't use thread pools.
The parallelism and thread pools should be encapsulated in a separate component.

Member

@alexey-milovidov alexey-milovidov left a comment

Make it at least ten times simpler.

@alexey-milovidov
Member

This is missing integration tests. Test at least with Gemma 3 270M.

.description = "Generates embedding vectors for the given text using an embedding model. "
"Supports batch API calls for efficient processing of multiple rows. "
"Throws on API errors. Returns an empty array for NULL or empty inputs.",
.syntax = "generateEmbedding([collection_or_url,] text, dimensions)",
Contributor

FunctionAiGenerateEmbedding is registered under SQL names aiGenerateEmbedding and aiGenerateEmbeddingOrNull, but FunctionDocumentation still uses generateEmbedding / generateEmbeddingOrNull in .syntax and examples.

This creates inconsistent user-facing metadata (for example in function introspection/docs) versus actual callable names.

Please update these documentation strings to the real SQL names used in this PR.

@rschu1ze rschu1ze changed the title Add experimental AI functions aiClassify/Extract/GenerateSQL/Translate/GenerateContent/GenerateEmbedding *wip* Add experimental AI functions aiClassify/Extract/GenerateSQL/Translate/GenerateContent/GenerateEmbedding Mar 25, 2026
@rschu1ze
Member Author

You are right, this is just a working draft based on a (likely) vibecoded PR. Yes, the PR will be simplified and better tested.

@george-larionov
Member

george-larionov commented Mar 26, 2026

I think we should split this PR up into a few parts. Below is how Claude proposed to split it up; I think it is a good start, and I've linked the draft PRs below as well:

  1. "infrastructure" which will contain the base function type and all the plumbing, along with a simple function to show it works
  2. Add the rest of the ai text functions
  3. Add the generateEmbedding funcs
    (still working on them; I'll un-draft them when they are ready for review)
