
NMI and AMI are not explained well in the user guide #8645

@amueller

Description


http://scikit-learn.org/dev/modules/clustering.html#mutual-information-based-scores

All three scores are between zero and one, but there is no concise explanation of the differences between them, except that NMI is used in the literature and AMI was proposed more recently.

There is a nice explanation in the IR book:

$I( \Omega ; \mathbb{C} )$ in Equation 184 measures the amount of information by which our knowledge about the classes increases when we are told what the clusters are. The minimum of $I( \Omega ; \mathbb{C} )$ is 0 if the clustering is random with respect to class membership. In that case, knowing that a document is in a particular cluster does not give us any new information about what its class might be. Maximum mutual information is reached for a clustering $\Omega_{exact}$ that perfectly recreates the classes - but also if clusters in $\Omega_{exact}$ are further subdivided into smaller clusters (Exercise 16.7 ). In particular, a clustering with $K=N$ one-document clusters has maximum MI. So MI has the same problem as purity: it does not penalize large cardinalities and thus does not formalize our bias that, other things being equal, fewer clusters are better.

The normalization by the denominator $[H(\Omega )+H(\mathbb{C} )]/2$ in Equation 183 fixes this problem since entropy tends to increase with the number of clusters. For example, $H(\Omega)$ reaches its maximum $\log N$ for $K=N$, which ensures that NMI is low for $K=N$. Because NMI is normalized, we can use it to compare clusterings with different numbers of clusters. The particular form of the denominator is chosen because $[H(\Omega )+H(\mathbb{C} )]/2$ is a tight upper bound on $I( \Omega ; \mathbb{C} )$ (Exercise 16.7). Thus, NMI is always a number between 0 and 1.
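
For reference, the two equations cited in that passage are, in the IR book's notation where $\Omega = \{\omega_1, \ldots, \omega_K\}$ is the clustering and $\mathbb{C} = \{c_1, \ldots, c_J\}$ the set of classes (quoting from memory, so worth double-checking against the book):

$$I(\Omega; \mathbb{C}) = \sum_k \sum_j P(\omega_k \cap c_j) \log \frac{P(\omega_k \cap c_j)}{P(\omega_k)\,P(c_j)} \qquad \text{(Equation 184)}$$

$$\mathrm{NMI}(\Omega, \mathbb{C}) = \frac{I(\Omega; \mathbb{C})}{[H(\Omega) + H(\mathbb{C})]/2} \qquad \text{(Equation 183)}$$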

Basically (if I understand correctly): MI is perfect if one partition is a sub-partition of the other. NMI "fixes" that and penalizes over-partitioning by putting entropy in the denominator. But NMI still depends on the number of clusters and samples.
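
To make that concrete, here is a minimal sketch with the existing `sklearn.metrics` functions (the exact NMI value depends on the averaging convention used in the denominator, but the qualitative picture should hold):

```python
import numpy as np
from sklearn.metrics import (
    mutual_info_score,
    normalized_mutual_info_score,
    adjusted_mutual_info_score,
)

# Ground truth: two classes of 50 points each.
labels_true = np.repeat([0, 1], 50)
# Degenerate sub-partition of the truth: every point in its own cluster.
labels_pred = np.arange(100)

# Raw MI is already at its maximum for this degenerate clustering...
print(mutual_info_score(labels_true, labels_true))  # H(truth) = log(2)
print(mutual_info_score(labels_true, labels_pred))  # also log(2)

# ...while NMI penalizes the over-partitioning, and AMI goes to 0.
print(normalized_mutual_info_score(labels_true, labels_pred))  # well below 1
print(adjusted_mutual_info_score(labels_true, labels_pred))    # 0
```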

There's this statement here in the user guide:

Random (uniform) label assignments have an AMI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for raw Mutual Information or the V-measure for instance).

I think this should rather say NMI instead of raw MI (which I think would still be correct, but would better explain the point).
Maybe adding some examples, either inline or as a separate example, would be interesting, where the effect of changing n_samples and n_clusters is demonstrated for the different clustering measures on partitions that should intuitively be equally good, e.g. by just duplicating each point; see the sketch below.
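
Something along these lines, as a rough sketch (exact numbers vary with the seed, but AMI should stay near 0 while NMI and V-measure drift upward as n_clusters grows):

```python
import numpy as np
from sklearn.metrics import (
    normalized_mutual_info_score,
    adjusted_mutual_info_score,
    v_measure_score,
)

rng = np.random.RandomState(0)
n_samples = 100
labels_true = rng.randint(10, size=n_samples)

# Score random (uniform) label assignments for a growing number of clusters.
for n_clusters in [2, 10, 50]:
    labels_pred = rng.randint(n_clusters, size=n_samples)
    nmi = normalized_mutual_info_score(labels_true, labels_pred)
    v = v_measure_score(labels_true, labels_pred)
    ami = adjusted_mutual_info_score(labels_true, labels_pred)
    print("n_clusters=%3d  NMI=%.3f  V=%.3f  AMI=%.3f"
          % (n_clusters, nmi, v, ami))
```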
