{"id":18632,"date":"2020-09-06T20:52:56","date_gmt":"2020-09-06T20:52:56","guid":{"rendered":"https:\/\/ittutorial.org\/?p=18632"},"modified":"2020-09-28T12:33:10","modified_gmt":"2020-09-28T12:33:10","slug":"python-unsupervised-learning","status":"publish","type":"post","link":"https:\/\/ittutorial.org\/python-unsupervised-learning\/","title":{"rendered":"Evaluating a Clustering | Python Unsupervised Learning -2"},"content":{"rendered":"<p>Hi, In this article, we continue where we left off from the previous topic. If you haven&#8217;t read the previous article, you can find it here.<\/p>\n<blockquote class=\"wp-embedded-content\" data-secret=\"9Q3LXHYi6m\"><p><a href=\"https:\/\/ittutorial.org\/unsupervised-learning\/\">k-means clustering | Python Unsupervised Learning -1<\/a><\/p><\/blockquote>\n<p><iframe loading=\"lazy\" class=\"wp-embedded-content\" sandbox=\"allow-scripts\" security=\"restricted\" style=\"position: absolute; clip: rect(1px, 1px, 1px, 1px);\" title=\"&#8220;k-means clustering | Python Unsupervised Learning -1&#8221; &#8212; IT Tutorial\" src=\"https:\/\/ittutorial.org\/unsupervised-learning\/embed\/#?secret=2LHK79Tvpg#?secret=9Q3LXHYi6m\" data-secret=\"9Q3LXHYi6m\" width=\"600\" height=\"338\" frameborder=\"0\" marginwidth=\"0\" marginheight=\"0\" scrolling=\"no\"><\/iframe><\/p>\n<p>&nbsp;<\/p>\n<h1>Evaluating a Clustering<\/h1>\n<p>In the previous article we used k-means to cluster the sample dataset into the three cluster. But how can we evulate the quality of this clustering?<\/p>\n<p>Let&#8217;s consider the <strong>iris<\/strong> data set as an example.<\/p>\n<p>A direct approach is to compare the clusters with the iris species\u00a0 You&#8217;ll learn about this first, before considering the problem of\u00a0 how to measure the quality of a clustering in a way that doesn&#8217;t require our samples to come pre-grouped into species<\/p>\n<p>This measure of quality can then be used to make an informed choice about the number of clusters look for.<\/p>\n<p>&nbsp;<\/p>\n<h2>Cross tabulation with pandas<\/h2>\n<ul>\n<li>Clusters vs species is a &#8220;cross-tabulation&#8221;<\/li>\n<li>Use the <strong>pandas<\/strong> library<\/li>\n<\/ul>\n<p>Cross tabulations like these provide great insights into which sort of samples are in which cluster.<\/p>\n<p>But in most dataset the samples are not labelled by species.<\/p>\n<p>&nbsp;<\/p>\n<h1>Measuring clustering quality<\/h1>\n<p>We need a way to measure the quality of a clustering that uses only the clusters and the samples themselves.<\/p>\n<ul>\n<li>Using only samples and their cluster labels<\/li>\n<li>A good clustering has tight cluster<\/li>\n<li>Samples in each cluster bunched together<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<h3>Inertia measures clustering quality<\/h3>\n<ul>\n<li>Measures how spreed out the clusters are (lower is better)<\/li>\n<li>Distance from each sample to centroid of its cluster<\/li>\n<li>Afret <span style=\"color: #808000\">fit()<\/span> , available as attribute <span style=\"color: #808000\">inertia_ <\/span><\/li>\n<li><span style=\"color: #000000\">k-means attempts to minimize the inertia when choosing clusters\u00a0<\/span><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<h3>How many cluster to choose?<\/h3>\n<ul>\n<li>A good clustering has tight clusters (so low inertia)<\/li>\n<li>&#8230;. 
<h3>How many clusters to choose?</h3>

<ul>
<li>A good clustering has tight clusters (so low inertia)...</li>
<li>... but not too many clusters</li>
<li>Choose an "elbow" in the inertia plot</li>
<li>That is, the point where the inertia begins to decrease more slowly</li>
</ul>

<p>Let's proceed with the example now.</p>

<pre>import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import pandas as pd
import numpy as np

# Load the seeds dataset and separate the 7 measurement columns
# from the variety label in the last column
data = np.loadtxt("https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt")
samples = data[:, :7]

ks = range(1, 6)
inertias = []

for k in ks:
    # Create a KMeans instance with k clusters
    model = KMeans(n_clusters=k)

    # Fit model to samples
    model.fit(samples)

    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)

# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()</pre>

<p><img src="https://ittutorial.org/wp-content/uploads/2020/09/Screenshot_14.png" alt="Inertia vs. number of clusters k" /></p>

<p>The inertia decreases very slowly from 3 clusters to 4, so 3 clusters looks like a good choice for this data.</p>

<p><img src="https://ittutorial.org/wp-content/uploads/2020/09/Screenshot_15.png" alt="The labels and varieties variables" /></p>

<p><strong>Note: the labels and varieties variables are as in the picture above (a sketch for reconstructing varieties follows this example).</strong></p>

<pre>model = KMeans(n_clusters=3)

# Use fit_predict to fit model and obtain cluster labels: labels
labels = model.fit_predict(samples)

# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})

# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['varieties'])

# Display ct
print(ct)</pre>

<p><img src="https://ittutorial.org/wp-content/uploads/2020/09/Screenshot_16.png" alt="Cross-tabulation of cluster labels against grain varieties" /></p>

<p>The cross-tabulation shows that the 3 varieties of grain separate really well into 3 clusters.</p>
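<p>If the screenshot above is not available, the varieties list can be reconstructed from the variety codes in the last column of the seeds file. This is a minimal sketch assuming, per the UCI description of the dataset, that the codes 1, 2 and 3 stand for the Kama, Rosa and Canadian wheat varieties (the exact display strings are illustrative):</p>

<pre># Map the numeric variety codes (last column of the seeds file)
# to readable names; the name strings are illustrative
variety_names = {1: 'Kama wheat', 2: 'Rosa wheat', 3: 'Canadian wheat'}
varieties = [variety_names[int(code)] for code in data[:, 7]]</pre>

<p>Run this before the cross-tabulation code above and the example becomes fully self-contained.</p>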
<p>Depending on the type of data you are working with, however, the clustering may not always separate this well. Is there anything you can do in such situations to improve your clustering? You'll find out in the next tutorial!</p>
e":true,"href":"https:\/\/ittutorial.org\/wp-json\/wp\/v2\/users\/67"}],"replies":[{"embeddable":true,"href":"https:\/\/ittutorial.org\/wp-json\/wp\/v2\/comments?post=18632"}],"version-history":[{"count":7,"href":"https:\/\/ittutorial.org\/wp-json\/wp\/v2\/posts\/18632\/revisions"}],"predecessor-version":[{"id":19441,"href":"https:\/\/ittutorial.org\/wp-json\/wp\/v2\/posts\/18632\/revisions\/19441"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ittutorial.org\/wp-json\/wp\/v2\/media\/18628"}],"wp:attachment":[{"href":"https:\/\/ittutorial.org\/wp-json\/wp\/v2\/media?parent=18632"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ittutorial.org\/wp-json\/wp\/v2\/categories?post=18632"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ittutorial.org\/wp-json\/wp\/v2\/tags?post=18632"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}