Description
Hi!
I'm receiving the error below when attempting to pass a sparse matrix to HistGradientBoostingClassifier. The matrix is the result of using CountVectorizer and TfidfTransformer on input text.
In my case, the text is too large for the sparse matrix to be converted to a dense one (I run out of memory).
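To illustrate why densifying can be infeasible, here is a rough sketch that compares the memory a dense copy would need with what a CSR matrix actually stores. The matrix dimensions and the helper names `dense_nbytes` and `sparse_nbytes` are hypothetical, chosen only for illustration; they are not taken from my data.

```python
import numpy as np
from scipy import sparse


def dense_nbytes(matrix):
    """Bytes a dense array of the same shape and dtype would occupy."""
    n_rows, n_cols = matrix.shape
    return n_rows * n_cols * matrix.dtype.itemsize


def sparse_nbytes(matrix):
    """Bytes held by the three arrays backing a CSR matrix."""
    csr = matrix.tocsr()
    return csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes


# Hypothetical shape: 100,000 documents, 200,000 vocabulary terms,
# and only 1,000 stored entries in total.
n_rows, n_cols, n_entries = 100_000, 200_000, 1_000
rng = np.random.default_rng(0)
row_idx = rng.integers(0, n_rows, size=n_entries)
col_idx = rng.integers(0, n_cols, size=n_entries)
values = rng.random(n_entries)
X = sparse.csr_matrix((values, (row_idx, col_idx)), shape=(n_rows, n_cols))

print(f"dense:  {dense_nbytes(X) / 1e9:.1f} GB")
print(f"sparse: {sparse_nbytes(X) / 1e6:.2f} MB")
```

The dense figure depends only on the shape and dtype, so it grows with the vocabulary size even when almost every entry is zero.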
Steps/Code to Reproduce
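# HistGradientBoostingClassifier is experimental in scikit-learn 0.21 and
# must be enabled explicitly before it can be imported.
import pandas as pd
from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

df = pd.read_csv(...)
vectorizer = CountVectorizer()
tfidf = TfidfTransformer()
clf = HistGradientBoostingClassifier()

# Both transformers return a scipy.sparse matrix.
vecs = vectorizer.fit_transform(df.loc[:, "very_large_text"])
vecs = tfidf.fit_transform(vecs)

# This call raises the TypeError shown under "Actual Results".
clf.fit(vecs, df.loc[:, "label"])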
Expected Results
No error is thrown.
Actual Results
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
Versions
System:
    python: 3.7.3 (default, Oct 1 2019, 18:28:53) [GCC 5.4.0 20160609]
    executable: /local_disk0/pythonVirtualEnvDirs/virtualEnv-3631eab5-084b-4139-952e-5aff594ac1bb/bin/python
    machine: Linux-4.15.0-1050-azure-x86_64-with-debian-stretch-sid
Python deps:
    pip: 19.0.3
    setuptools: 40.8.0
    sklearn: 0.21.3
    numpy: 1.16.2
    scipy: 1.2.1
    Cython: 0.29.6
    pandas: 0.24.2