Closed
Description
I am wondering whether clustering 250000 samples into 6000 clusters with k-means is too hard a problem to compute: it kills even a server with 12 cores, 258 GB RAM and 60 GB swap.
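For a sense of scale, here is a back-of-the-envelope sketch of the memory one dense array of shape (n_samples, n_clusters) would need. It assumes the Elkan variant (the traceback further down shows `algorithm='elkan'`) keeps one float64 bound per sample/cluster pair, and that all `n_init` runs are resident at once when `n_jobs` is at least `n_init`. Both points are assumptions, not something verified against the 0.18.1 source. The helper name `dense_bound_bytes` is made up for illustration.

```python
def dense_bound_bytes(n_samples, n_clusters, itemsize=8):
    """Bytes needed by one dense (n_samples, n_clusters) float array.

    itemsize=8 assumes float64; this is an assumption about the dtype.
    """
    return n_samples * n_clusters * itemsize


# Sizes taken from the report above.
n_samples, n_clusters = 250000, 6000
n_init, n_jobs = 10, 20

per_run = dense_bound_bytes(n_samples, n_clusters)
# At most n_init runs exist, so no more than that many can overlap.
concurrent_runs = min(n_init, n_jobs)
total = per_run * concurrent_runs

print("per run: %.1f GiB" % (per_run / 2.0 ** 30))  # per run: 11.2 GiB
print("%d concurrent runs: %.1f GiB"
      % (concurrent_runs, total / 2.0 ** 30))       # 10 concurrent runs: 111.8 GiB
```

This counts a single array per run only. Temporaries and per-process copies would come on top, so the figure is a rough lower bound under the stated assumptions rather than a diagnosis of the crash.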
Similar "questions":
- python memory error for kmeans in scikit-learn
- Memory Error when fitting the data using sklearn package
Code to Reproduce
The use case is the following:
import numpy as np
from sklearn import cluster
locations = np.random.random((250000, 2)) * 5
kmean = cluster.KMeans(n_clusters=6000, n_init=10, max_iter=150,
                       verbose=True, n_jobs=20, copy_x=False,
                       precompute_distances=False)
kmean.fit(locations)
print (kmean.cluster_centers_)
Actual Results
Iteration 35, inertia 156.384475435
center shift 7.768886e-03 within tolerance 2.084699e-04
Traceback (most recent call last):
File "test_kmeans.py", line 8, in <module>
kmean.fit(locations)
File "/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 889, in fit
return_n_iter=True)
File "/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/k_means_.py", line 362, in k_means
for seed in seeds)
File "/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 768, in __call__
self.retrieve()
File "/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 719, in retrieve
raise exception
sklearn.externals.joblib.my_exceptions.JoblibMemoryError: JoblibMemoryError
___________________________________________________________________________
Multiprocessing exception:
...........................................................................
/mnt/datagrid/personal/borovec/Dropbox/Workspace/Uplus_fraud-monitoring/test_kmeans.py in <module>()
3
4 locations = np.random.random((250000, 2)) * 5
5 kmean = cluster.KMeans(n_clusters=6000, n_init=10, max_iter=150,
6 verbose=True, n_jobs=20, copy_x=False,
7 precompute_distances=False)
----> 8 kmean.fit(locations)
9 print (kmean.cluster_centers_)
10
11
12
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/k_means_.py in fit(self=KMeans(algorithm='auto', copy_x=False, init='k-m...
random_state=None, tol=0.0001, verbose=True), X=array([[-1.86344999, 0.05621132],
[ 0.88...-1.20243728],
[ 0.97877704, 1.24561138]]), y=None)
884 X, n_clusters=self.n_clusters, init=self.init,
885 n_init=self.n_init, max_iter=self.max_iter, verbose=self.verbose,
886 precompute_distances=self.precompute_distances,
887 tol=self.tol, random_state=random_state, copy_x=self.copy_x,
888 n_jobs=self.n_jobs, algorithm=self.algorithm,
--> 889 return_n_iter=True)
890 return self
891
892 def fit_predict(self, X, y=None):
893 """Compute cluster centers and predict cluster index for each sample.
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/k_means_.py in k_means(X=array([[-1.86344999, 0.05621132],
[ 0.88...-1.20243728],
[ 0.97877704, 1.24561138]]), n_clusters=6000, init='k-means++', precompute_distances=False, n_init=10, max_iter=150, verbose=True, tol=0.00020846993669604294, random_state=<mtrand.RandomState object>, copy_x=False, n_jobs=20, algorithm='elkan', return_n_iter=True)
357 verbose=verbose, tol=tol,
358 precompute_distances=precompute_distances,
359 x_squared_norms=x_squared_norms,
360 # Change seed to ensure variety
361 random_state=seed)
--> 362 for seed in seeds)
seeds = array([ 968587040, 226617041, 2063896048, 6552... 393005117, 134324550, 14152465, 2054736812])
363 # Get results with the lowest inertia
364 labels, inertia, centers, n_iters = zip(*results)
365 best = np.argmin(inertia)
366 best_labels = labels[best]
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=Parallel(n_jobs=20), iterable=<generator object <genexpr>>)
763 if pre_dispatch == "all" or n_jobs == 1:
764 # The iterable was consumed all at once by the above for loop.
765 # No need to wait for async callbacks to trigger to
766 # consumption.
767 self._iterating = False
--> 768 self.retrieve()
self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=20)>
769 # Make sure that we get a last message telling us we are done
770 elapsed_time = time.time() - self._start_time
771 self._print('Done %3i out of %3i | elapsed: %s finished',
772 (len(self._output), len(self._output),
---------------------------------------------------------------------------
Sub-process traceback:
---------------------------------------------------------------------------
MemoryError Tue Oct 17 16:11:14 2017
PID: 18062 Python 2.7.9: /mnt/home.dokt/borovji3/vEnv/bin/python
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=<sklearn.externals.joblib.parallel.BatchedCalls object>)
126 def __init__(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
func = <function _kmeans_single_elkan>
args = (memmap([[-1.86344999, 0.05621132],
[ 0....1.20243728],
[ 0.97877704, 1.24561138]]), 6000)
kwargs = {'init': 'k-means++', 'max_iter': 150, 'precompute_distances': False, 'random_state': 134324550, 'tol': 0.00020846993669604294, 'verbose': True, 'x_squared_norms': memmap([ 3.47560557, 1.66662896, 0.19072331, ..., 1.87283488,
3.21604332, 2.50955219])}
self.items = [(<function _kmeans_single_elkan>, (memmap([[-1.86344999, 0.05621132],
[ 0....1.20243728],
[ 0.97877704, 1.24561138]]), 6000), {'init': 'k-means++', 'max_iter': 150, 'precompute_distances': False, 'random_state': 134324550, 'tol': 0.00020846993669604294, 'verbose': True, 'x_squared_norms': memmap([ 3.47560557, 1.66662896, 0.19072331, ..., 1.87283488,
3.21604332, 2.50955219])})]
132
133 def __len__(self):
134 return self._size
135
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/k_means_.py in _kmeans_single_elkan(X=array([[-1.86344999, 0.05621132],
[ 0.88...-1.20243728],
[ 0.97877704, 1.24561138]]), n_clusters=6000, max_iter=150, init='k-means++', verbose=True, x_squared_norms=memmap([ 3.47560557, 1.66662896, 0.19072331, ..., 1.87283488,
3.21604332, 2.50955219]), random_state=<mtrand.RandomState object>, tol=0.00020846993669604294, precompute_distances=False)
394 x_squared_norms=x_squared_norms)
395 centers = np.ascontiguousarray(centers)
396 if verbose:
397 print('Initialization complete')
398 centers, labels, n_iter = k_means_elkan(X, n_clusters, centers, tol=tol,
--> 399 max_iter=max_iter, verbose=verbose)
max_iter = 150
verbose = True
400 inertia = np.sum((X - centers[labels]) ** 2, dtype=np.float64)
401 return labels, inertia, centers, n_iter
402
403
...........................................................................
/mnt/home.dokt/borovji3/vEnv/local/lib/python2.7/site-packages/sklearn/cluster/_k_means_elkan.so in sklearn.cluster._k_means_elkan.k_means_elkan (sklearn/cluster/_k_means_elkan.c:6961)()
225
226
227
228
229
--> 230
231
232
233
234
MemoryError:
___________________________________________________________________________
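The sub-process traceback above ends with a `MemoryError` inside `k_means_elkan`. One possible way to sidestep that code path is `sklearn.cluster.MiniBatchKMeans`, which updates the centers from small random batches. Note that this swaps in a different, approximate algorithm and is not a fix for `KMeans` itself. The sketch below runs on a scaled-down problem. The helper `fit_minibatch`, the batch size and the iteration count are illustrative assumptions, and whether this fits in memory at the full 250000 x 6000 size has not been checked here.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans


def fit_minibatch(points, n_clusters, batch_size=1000, seed=0):
    """Fit mini-batch k-means and return the cluster centers.

    Hypothetical helper; parameter values are placeholders, not tuned.
    """
    model = MiniBatchKMeans(n_clusters=n_clusters, batch_size=batch_size,
                            n_init=3, max_iter=50, random_state=seed)
    model.fit(points)
    return model.cluster_centers_


# Scaled-down stand-in for the data in the report (same 2-D layout).
rng = np.random.RandomState(0)
small = rng.random_sample((5000, 2)) * 5

centers = fit_minibatch(small, n_clusters=50)
print(centers.shape)  # (50, 2)
```

For the original data one would pass `locations` and `n_clusters=6000`. The resulting centers will generally differ from those of full-batch k-means, so the inertia is worth comparing on a subset first.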
Versions
Python 2.7.9 (default, Mar 1 2015, 12:57:24) [GCC 4.9.2] on linux2
numpy==1.13.1
scipy==0.19.1
scikit-learn==0.18.1