adjusted_mutual_info_score takes a long time with lists containing many unique values #24254

@bchen1116

Describe the bug

The runtime of adjusted_mutual_info_score jumps significantly when the two lists contain a large number of unique values. At around 6k total unique values (i.e. 2 columns of 3k unique values each), the runtime stays around a minute, but it shoots up as the number of unique values increases further.

Steps/Code to Reproduce

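import time

import pandas as pd
from sklearn.metrics import adjusted_mutual_info_score as ams

# Two columns of 1,000,000 rows with 8,000 and 7,000 unique values respectively
df = pd.DataFrame({'a': [x % 8000 for x in range(1000000)],
                   'b': [x % 7000 for x in range(1000000)]})

# Time a single call to adjusted_mutual_info_score
start = time.time()
mi = ams(df['a'], df['b'])
end = time.time()
print(end - start)  # elapsed wall-clock seconds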
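
To compare runtimes across different numbers of unique values, the reproduction above could be wrapped in a small helper. This is only a sketch: `make_labels` and `time_score` are hypothetical names introduced here, not part of scikit-learn, and the scoring function is passed in as an argument so the helper itself depends only on the standard library.

```python
import time
from typing import Callable, List, Sequence, Tuple


def make_labels(n_samples: int, n_unique: int) -> List[int]:
    """Cycle through n_unique integer labels, as in the reproduction above."""
    return [x % n_unique for x in range(n_samples)]


def time_score(
    score_fn: Callable[[Sequence[int], Sequence[int]], float],
    n_samples: int,
    n_unique_a: int,
    n_unique_b: int,
) -> Tuple[float, float]:
    """Return (score, elapsed_seconds) for a single call to score_fn."""
    a = make_labels(n_samples, n_unique_a)
    b = make_labels(n_samples, n_unique_b)
    # perf_counter is a monotonic clock intended for measuring intervals
    start = time.perf_counter()
    score = score_fn(a, b)
    elapsed = time.perf_counter() - start
    return score, elapsed
```

For example, after `from sklearn.metrics import adjusted_mutual_info_score`, calling `time_score(adjusted_mutual_info_score, 1_000_000, 8000, 7000)` should time the same case as the snippet above, and smaller values of `n_unique_a` and `n_unique_b` can be swept to see where the runtime starts to climb.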

Expected Results

A small or linear increase in runtime as the number of unique values increases.

Actual Results

A large increase in runtime:

2 columns of 6k unique values: 598s
2 columns of 8k unique values: 889s
2 columns of 10k unique values: 1118s
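
One way to quantify how the runtime scales is to estimate the exponent k in runtime ∝ n_unique^k between consecutive measurements. The sketch below does this for the timings listed above, assuming the 6k/8k/10k figures are per-column unique counts; `scaling_exponents` is a hypothetical helper introduced here. An exponent near 1 would indicate roughly linear growth, and an exponent near 2 roughly quadratic growth.

```python
import math

# Timings copied from the report above: (unique values, seconds).
# Assumption: the unique-value counts are per column.
timings = [(6_000, 598.0), (8_000, 889.0), (10_000, 1118.0)]


def scaling_exponents(points):
    """Estimate k in runtime ~ n_unique**k between consecutive measurements."""
    exponents = []
    for (n0, t0), (n1, t1) in zip(points, points[1:]):
        # log-log slope between two neighbouring data points
        exponents.append(math.log(t1 / t0) / math.log(n1 / n0))
    return exponents


exponents = scaling_exponents(timings)
for (n0, _), (n1, _), k in zip(timings, timings[1:], exponents):
    print(f"{n0} -> {n1} unique values: exponent ~ {k:.2f}")
```

With only three data points the estimate is coarse; more measurements across a wider range of unique-value counts would give a clearer picture.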

Versions

System:
    python: 3.8.0 (default, Aug 17 2020, 18:01:34)  [Clang 11.0.3 (clang-1103.0.32.62)]
executable: /Users/bryan.chen/.pyenv/versions/woodwork/bin/python
   machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
          pip: 22.2.2
   setuptools: 41.2.0
      sklearn: 1.0.2
        numpy: 1.21.2
        scipy: 1.7.1
       Cython: 0.29.28
       pandas: 1.4.3
   matplotlib: 3.5.1
       joblib: 1.0.1
threadpoolctl: 2.2.0

Built with OpenMP: True
