Closed
Description
Describe the bug
The runtime of adjusted_mutual_info_score increases sharply when the two input lists contain a large number of unique values. With around 6k unique values in total (i.e. 2 columns of 3k unique values each), the runtime stays at around a minute, but as the number of unique values grows beyond that, the runtime shoots up.
Steps/Code to Reproduce
import time

import pandas as pd
from sklearn.metrics import adjusted_mutual_info_score as ams

# Two columns of 1,000,000 rows with 8,000 and 7,000 unique values respectively
df = pd.DataFrame({'a': [x % 8000 for x in range(1000000)],
                   'b': [x % 7000 for x in range(1000000)]})

start = time.time()
mi = ams(df['a'], df['b'])
end = time.time()
print(end - start)  # elapsed seconds
Expected Results
A small or linear increase in runtime as the number of unique values increases.
Actual Results
Large increase in runtime:
2 columns of 6k unique values: 598s
2 columns of 8k unique values: 889s
2 columns of 10k unique values: 1118s
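One possible reason for the growth, offered as a rough sketch rather than a profile of scikit-learn itself: the adjustment term in AMI is the expected mutual information, which in its textbook form is a sum over every pair of clusters (one from each labeling) and, for each pair, over every feasible overlap count. The helper below, `emi_term_count`, is a hypothetical name and not part of scikit-learn. It only counts the terms in that textbook sum for the labels from the reproduction above, assuming the library evaluates something equivalent.

```python
from collections import Counter


def emi_term_count(labels_a, labels_b):
    """Count the terms in the textbook triple sum for expected mutual information.

    Hypothetical helper for illustration; it does not call scikit-learn.
    """
    n = len(labels_a)
    # Cluster sizes, i.e. the row and column marginals of the contingency table.
    sizes_a = Counter(labels_a).values()
    sizes_b = Counter(labels_b).values()
    # Group clusters that share a size so that the counting itself stays cheap.
    hist_a = Counter(sizes_a)
    hist_b = Counter(sizes_b)
    total = 0
    for a, count_a in hist_a.items():
        for b, count_b in hist_b.items():
            # Feasible overlap counts n_ij for a pair of clusters of sizes a and b.
            lo = max(1, a + b - n)
            hi = min(a, b)
            total += count_a * count_b * max(0, hi - lo + 1)
    return total


n_samples = 1_000_000
labels_a = [x % 8000 for x in range(n_samples)]
labels_b = [x % 7000 for x in range(n_samples)]
print(emi_term_count(labels_a, labels_b))  # 7000000000
```

For the reproduction data this comes to 8,000 x 7,000 cluster pairs with 125 feasible overlap counts each, so the term count grows with the product of the two unique-value counts rather than with their sum.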
Versions
System:
python: 3.8.0 (default, Aug 17 2020, 18:01:34) [Clang 11.0.3 (clang-1103.0.32.62)]
executable: /Users/bryan.chen/.pyenv/versions/woodwork/bin/python
machine: macOS-10.16-x86_64-i386-64bit
Python dependencies:
pip: 22.2.2
setuptools: 41.2.0
sklearn: 1.0.2
numpy: 1.21.2
scipy: 1.7.1
Cython: 0.29.28
pandas: 1.4.3
matplotlib: 3.5.1
joblib: 1.0.1
threadpoolctl: 2.2.0
Built with OpenMP: True