Add ngramClassify and ngramClassifyUTF8 functions #48984
0442A403 wants to merge 15 commits into ClickHouse:master
It does not build.

Either I didn't understand the code, or it was not the algorithm we wanted to implement.

Missing the analysis of the algorithm (for charset or language classification).
I've done a model quality analysis. Here are the results:

Historically, the naive Bayes classifier gained popularity in the 1990s with email spam filtering, so I decided to measure quality on spam filtering.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Column 0 holds the label ('ham' or 'spam'), column 1 the email text.
data = pd.read_csv('spam.csv').to_numpy()
data[:, 0] = data[:, 0] == 'ham'  # True for ham, False for spam
X_train, X_test, y_train, y_test = train_test_split(
    data[:, 1:], data[:, 0], test_size=0.25, random_state=128)
y_train, y_test = y_train.astype(bool), y_test.astype(bool)

# Training texts, one file per class, used to build the models.
with open('spam_model_emails.txt', 'w') as file:
    for email in X_train[~y_train]:
        file.write(email[0])
with open('ham_model_emails.txt', 'w') as file:
    for email in X_train[y_train]:
        file.write(email[0])

# Held-out emails, one CSV per true class, used for prediction.
with open('spam_pred_emails.csv', 'w') as file:
    file.write('email,\n')
    for email in X_test[~y_test]:
        file.write('"' + email[0].replace('"', '') + '",\n')
with open('ham_pred_emails.csv', 'w') as file:
    file.write('email,\n')
    for email in X_test[y_test]:
        file.write('"' + email[0].replace('"', '') + '",\n')
```

Then the following commands import the data:

```shell
clickhouse client -q "INSERT INTO ensure_ham FORMAT CSV" < ~/ham_pred_emails.csv
clickhouse client -q "INSERT INTO ensure_spam FORMAT CSV" < ~/spam_pred_emails.csv
```

And the result:

```sql
(
    WITH
        (SELECT count(*) FROM ensure_spam WHERE ngramClassify('spam', email) = 'spam') AS tp,
        (SELECT count(*) FROM ensure_ham WHERE ngramClassify('spam', email) = 'spam') AS fp,
        (SELECT count(*) FROM ensure_spam) AS spam_count,
        (SELECT count(*) FROM ensure_ham) AS ham_count
    SELECT 'ans=spam' AS "ans", tp / spam_count AS "y_i=spam", fp / ham_count AS "y_i=ham"
)
UNION ALL
(
    WITH
        (SELECT count(*) FROM ensure_spam WHERE ngramClassify('spam', email) = 'ham') AS fn,
        (SELECT count(*) FROM ensure_ham WHERE ngramClassify('spam', email) = 'ham') AS tn,
        (SELECT count(*) FROM ensure_spam) AS spam_count,
        (SELECT count(*) FROM ensure_ham) AS ham_count
    SELECT 'ans=ham' AS "ans", fn / spam_count AS "y_i=spam", tn / ham_count AS "y_i=ham"
);
```

```text
┌─ans──────┬───────────y_i=spam─┬─y_i=ham─┐
│ ans=spam │ 0.7058823529411765 │       0 │
└──────────┴────────────────────┴─────────┘
┌─ans─────┬────────────y_i=spam─┬─y_i=ham─┐
│ ans=ham │ 0.29411764705882354 │       1 │
└─────────┴─────────────────────┴─────────┘
```

I think the result is very good, especially for ham emails: no ham email was classified as spam, and about 71% of the spam emails were detected. With more time invested in finding a good model, the result could be much better!
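The per-class rates in the table above can also be cross-checked outside ClickHouse. The following is a minimal Python sketch, assuming the true and predicted labels are available as plain lists; the function name `confusion_rates` and the sample labels are hypothetical and are not the data from the analysis.

```python
from collections import Counter


def confusion_rates(y_true, y_pred, labels=("spam", "ham")):
    """Return {predicted: {actual: share of `actual` items predicted as `predicted`}}."""
    pairs = Counter(zip(y_pred, y_true))  # (predicted, actual) -> count
    totals = Counter(y_true)              # actual -> count
    return {
        pred: {
            actual: pairs[(pred, actual)] / totals[actual] if totals[actual] else float("nan")
            for actual in labels
        }
        for pred in labels
    }


# Hypothetical labels, for illustration only.
y_true = ["spam", "spam", "spam", "spam", "ham", "ham"]
y_pred = ["spam", "spam", "spam", "ham", "ham", "ham"]
rates = confusion_rates(y_true, y_pred)
print(rates["spam"])  # shares of spam and ham that were predicted as spam
print(rates["ham"])   # shares of spam and ham that were predicted as ham
```

Each inner dictionary corresponds to one row of the table: the `"spam"` row holds the true positive and false positive rates, the `"ham"` row the false negative and true negative rates.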
@@ -0,0 +1 @@
offensive angry war violance evil adversity bad break bummer calamity cataclysm catastrophe disaster downer drag evil hard knocks hard times hardship jam misery misfortune mishap storm cloud tragedy tribulation trouble unpleasantnes
\ No newline at end of file
This is not as discussed. The model file should contain (class_id, ngram, count) triples.
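To illustrate what such triples could look like, here is a minimal Python sketch that derives (class_id, ngram, count) triples from raw training texts. It is only an illustration: the helper names `char_ngrams` and `build_triples`, the n-gram length, and the class identifiers are assumptions, and the actual model file layout is whatever the PR ends up defining.

```python
from collections import Counter


def char_ngrams(text, n):
    """Yield all overlapping character n-grams of `text`."""
    for i in range(len(text) - n + 1):
        yield text[i:i + n]


def build_triples(texts_by_class, n=3):
    """Return a list of (class_id, ngram, count) triples, grouped by class."""
    triples = []
    for class_id, texts in texts_by_class.items():
        counts = Counter()
        for text in texts:
            counts.update(char_ngrams(text, n))
        # Sort n-grams within a class so the output is deterministic.
        triples.extend((class_id, ngram, count) for ngram, count in sorted(counts.items()))
    return triples


# Hypothetical toy corpus with bigrams (n=2), for illustration only.
triples = build_triples({"spam": ["aaaa"], "ham": ["abab"]}, n=2)
for triple in triples:
    print(triple)
```

Storing counts instead of raw word lists lets the classifier estimate per-class n-gram probabilities directly when the model is loaded.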
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add `ngramClassify` and `ngramClassifyUTF8` functions

Documentation entry for user-facing changes
The task is described in #42194 as "Text classification with ngram models".
This PR adds the new functions `ngramClassify` and `ngramClassifyUTF8`, which implement a naive Bayes classifier. These functions are an easy way to classify text very quickly (sacrificing quality, of course).

For example, if you need to find offensive comments about football and hockey, you can query:
So classification is based on models. In the example above there are two models: `offensive` and `sport`. Each model is built from texts that must be representative of its whole class.

Models are configured in the server config. So for the query above there could be the following config:
A template for the configuration is located at `programs/server/config.d/ngram_classifiers.xml`.

In a sequel to this PR, the functionality can be expanded with the following features:
ngramClassify('clickhouse.offensive', text)
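For readers unfamiliar with the approach, the naive Bayes idea mentioned in this description can be sketched in a few lines of Python. This is not the PR's C++ implementation: the class name `NgramNaiveBayes`, the add-one (Laplace) smoothing, the uniform class prior, the lowercasing, and the toy training texts are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return the list of overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Toy multinomial naive Bayes over character n-grams (uniform class prior)."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n          # n-gram length
        self.alpha = alpha  # additive smoothing constant (assumed, not from the PR)
        self.counts = {}    # class -> Counter of n-gram counts
        self.totals = {}    # class -> total number of n-grams seen
        self.vocab = set()  # all n-grams seen in any class

    def fit(self, texts_by_class):
        for label, texts in texts_by_class.items():
            counter = Counter()
            for text in texts:
                counter.update(char_ngrams(text.lower(), self.n))
            self.counts[label] = counter
            self.totals[label] = sum(counter.values())
            self.vocab.update(counter)
        return self

    def log_score(self, label, text):
        """Sum of smoothed log-probabilities of the text's n-grams under `label`."""
        vocab_size = max(len(self.vocab), 1)
        counter = self.counts[label]
        denom = self.totals[label] + self.alpha * vocab_size
        return sum(
            math.log((counter[gram] + self.alpha) / denom)
            for gram in char_ngrams(text.lower(), self.n)
        )

    def classify(self, text):
        """Return the class with the highest log score."""
        return max(self.counts, key=lambda label: self.log_score(label, text))


# Hypothetical toy training data, for illustration only.
model = NgramNaiveBayes(n=2).fit({
    "spam": ["win money now", "free money"],
    "ham": ["see you at lunch", "meeting at noon"],
})
print(model.classify("free money now"))
print(model.classify("see you at noon"))
```

Working in log space avoids numeric underflow on long texts, and the smoothing term keeps an n-gram that was never seen in a class from driving that class's probability to zero.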