Image 1. Simple K-Means Clustering Method: Overview. This is the overview of the K-Means clustering method. This method will work for most data sets; this particular data set has four different labels and 52 columns of numerical data. The clustering algorithm will take this data and cross-compare it in order to group the data set into specific clusters of related items.

Image 2. K-Means Clustering Process with Sort. This is an alternate process overview, with one addition: the Sort. Sorting allows output to be arranged in descending or ascending order, and any variable can be sorted. While the most common sorting methods are alphabetical (A-Z) and numerical (0-9), there are occasions where data will be sorted in reverse order (Z-A or 9-0). Keep in mind that the sort command works character by character, starting from the first character: this can cause confusion when numbers start with a zero (0), or when the length of a number is important. For example, if 100, 1000, and 1001 have significantly different meanings, it may be best to find an alternate way to sort them, because a character-based sort will place them right next to each other.

Image 4. K-Means Clustering Process Overview, without Sort (Pareto). This is an expanded view of the Simple K-Means process, shown in order to display RapidMiner's GUI in all of its glory. You can see the connections running from Read Excel, to Replace Missing Values, to Work on Subset, and then two connections that lead to the output. What you cannot see is the sub-process within Work on Subset. Also, if you connect the operators incorrectly, the process typically won't work. This is actually a good thing, because you want to make sure that your data mining is actually valid. Of special note are the two output connections: one tells RapidMiner to display the Example Set output, and the other is an instruction to output the Clustering from the Work on Subset sub-process. Please connect them both, because you want the full result set.

Image 5. Read Excel, Parameters. Data sets come in all shapes and sorts. While there are many databases and API feeds that stream data in a constant flow, the most common data set in 2012 is probably the spreadsheet. I don't expect that spreadsheets will still dominate the data field ten years from now; realistically, databases and data warehouses are starting to replace things like spreadsheets and basic data tables. That said, when you are looking for a data set, it is easiest to start with a spreadsheet. The Excel sheet is the de facto standard for businesses and organizations, but there are other formats out there, such as CSV, text-delimited, and the OpenOffice format. When using CSV and text-delimited data, there are extra steps that are necessary before you can really use the data set.

Image 6. Read Excel, Data Set Metadata Information, Column Definitions. This shows the Read Excel metadata for the data set columns. The first four columns are known as the 'labels', and these are the key to understanding what you are looking at. There are two different variables (Country and Indicator), and each variable has two types (Name and Code). The Code is merely a shortened version of the Name. For instance, the Country Code for row 1 refers to the same country as that row's Country Name, and the Indicator Code for row 1000 stands for the same indicator as its Indicator Name. Where this becomes more important is when the output is visualized and examined; it is pretty critical to know which country and indicator you are looking at.
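If you want to prototype the Read Excel step outside of RapidMiner, the same idea can be sketched in a few lines of Python with pandas. This is only a minimal sketch under stated assumptions: the file name is hypothetical, and the exact header spellings for the four label columns are assumed rather than taken from the chapter's screenshots.

```python
import pandas as pd

# Hypothetical file name standing in for the Excel sheet fed to Read Excel.
df = pd.read_excel("world_bank_indicators.xlsx")

# The first four columns are the 'labels'; the remaining 52 columns should
# be the numerical data that the clustering will actually use.
label_columns = ["Country Name", "Country Code", "Indicator Name", "Indicator Code"]

print(df[label_columns].head())               # who and what each row describes
print(df.drop(columns=label_columns).dtypes)  # the rest should all be numeric types
```

Loading the sheet this way makes it easy to confirm which columns are labels and which are the numerical attributes before any clustering is attempted.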
As far as the Read Excel parameters go, you just want to make sure that the defining attributes are labeled as 'polynomial' and as 'attribute' or 'label'. RapidMiner will sometimes default to 'binomial' for the attribute definition, which can cause errors. 'Binomial' means 'two', and 'polynomial' means 'many'. Since this data set has multiple (more than two) entries for Country Code/Name and Indicator Code/Name, you want to select 'polynomial'. Other attribute definitions, such as nominal or ordinal, have definite uses, but not in this data mining process.

Image 7. Replace Missing Values, Parameters. Here is a tricky bit of work: deciding what to replace the missing data elements with. You see, data sets often won't work unless they meet certain criteria. For K-Means clustering (and probably for most clustering algorithms) this means that the data cannot contain NULL data. NULL data is where the element is blank. There are many raging debates about how and with what you replace the error-causing elements in a data set.

Image 8. Clustering, Process Overview. Drilled down from Work on Subset. This operator is where the clustering work is actually done. The rest of the data mining process is preparation for this operator, which breaks the data set down into the clusters.

Image 10. Work on Subset, Parameters. Using the Work on Subset operator, you can define the subset that the process will analyze. This isn't just fluff, because operators will often give an error message without a properly defined subset. For the clustering algorithm, a basic tenet is that the data must be purely numerical, and without any missing data elements. We took care of the latter by using the Replace Missing Values operator; now we are going to separate out the numerical data, process it, and then return it to the original data set. The result can be seen down below in the Output Images (Images 10-23). The reason for this is that the K-Means Clustering operator cannot handle polynomial, binomial, ordinal, or nominal data, only numerical. Perhaps other clustering algorithms can work with ordinal or nominal data, but for our purposes, everything must be numerical.

Image 11. Work on Subset, Select Attributes. Here is a demonstration of how the attributes are selected. On the left, we have the non-selected attributes. On the right, we have the selected attributes. In the middle are the arrows that move attributes back and forth between the columns. The goal when selecting this specific subset is to put all the numerical attributes on the right, and leave all the non-numerical attributes on the left. It is as simple as that. Noteworthy: the subset and clustering don't care whether the numerical attributes are percentages, huge numbers, decimals, or whole numbers. The data just needs to be in the form of numbers; beyond that, the whole process is fairly format-agnostic (it doesn't matter if one column is a percentage and another column is a 15-digit whole number). This is one of the major strengths of data mining applications: the overhead to prepare the data set and get results is far less than the overhead to manually calculate even a fraction of the output. Really, mathematics can no longer be separated from technology; the two are completely inter-dependent.
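The Replace Missing Values / Work on Subset / Clustering chain can be approximated outside RapidMiner with pandas and scikit-learn. This is a minimal sketch, not the chapter's actual process: the file and column names are assumptions carried over from the earlier sketch, missing values are filled with zero (the replacement used in this chapter's output), and ten clusters are requested to match the cluster_0 through cluster_9 results described below.

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_excel("world_bank_indicators.xlsx")   # hypothetical file name

# "Work on Subset": keep only the numerical attributes for clustering.
label_columns = ["Country Name", "Country Code", "Indicator Name", "Indicator Code"]
numeric = df.drop(columns=label_columns)

# "Replace Missing Values": K-Means cannot handle NULL (blank) elements,
# so fill them with zero, as the chapter's process does.
numeric = numeric.fillna(0)

# "Clustering": ten clusters, to mirror cluster_0 .. cluster_9 in the output.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
df["label"] = kmeans.fit_predict(numeric)          # one cluster id per row
```

Writing the cluster assignment back onto the original rows has the same effect as returning the processed numerical subset into the full example set.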
Image 12. Output, ExampleSet (Clustering), Data View. Here is an example of the example set yielded from the output. This is the raw data with the addition of two columns: 'id' and 'label'. The ID column is basically a replicated row number column generated by one of the operators. The key result in our output is stored under 'label', where you can see cluster_4 listed for the top results. This shows that the clustering algorithm has worked properly, and that every row has been classified into a cluster.

Image 13. Output, ExampleSet (Clustering), Meta Data View. Here is the ExampleSet metadata view, where you can see the two new attributes (id and label), as well as their types. You can also spot the statistics and range; when you take a peek at them, they will show a few things. For instance, you can see that the average and range for the year 2011 is a big, fat zero. This means that there was no data available at the beginning of the processing, so every value in that column was replaced with a zero when we used the Replace Missing Values operator. The final result is that there is nothing calculated for this year.

Looking at the graph, you can see that cluster_2 (at the bottom of the graph) holds the most data elements. This is not a mistake; data mining in this data set highlights the elements that are outside of the norm, rather than those within the normal range. You can see that cluster_8 holds the next largest array of points, and cluster_4 (at the very top) only holds two points. By hovering over any given data point, a tooltip appears that shows greater detail. By double-clicking on a data point, the complete sub-record of that point is shown. This is immensely useful when you want to drill down into the data mining results and explore the 'who, when, and why' in your findings.

Image 15. Simple K-Means Clustering, Result Overview. Here is the simple output of the centroid clustering model. On the left of the result overview you can see the clustering output (clusters 0-9, with a varying number of elements within each cluster). On the right of the result overview you can see the example set sorted by Pareto Rank. This contains statistics about the output, but doesn't directly impact the resulting graphs or sets.

Image 17. Output, Detailed View of Data Set Element. This is an expanded view of one of the data points within the Plot View graph (specifically, ID 5935). You can see how this would pretty quickly allow you to summarize and discover new trends and data findings. In this case, the data point corresponds to China's number of currently enrolled primary education pupils. This has been grouped into cluster_4, which actually means that it is an enormous outlier. Hence, China's school population lies within one of the clusters with only two elements, at the top of the graph.

Image 18. Output, Cluster Model, Text View. Shows ten clusters, numbered 0-9. This is the straightforward text output, showing the name of each cluster and the number of elements that it contains. You can see that the majority of the items are sorted into cluster 2, with the next largest being cluster 8. After that, the number of items in any given cluster drops sharply, with cluster 1 having 24 items. The smallest cluster is cluster 4, which we already examined somewhat earlier in the chapter. Cluster 4 has only two items, but there is nothing in the K-Means algorithm that states that a cluster needs to have any items at all. In other words, a cluster can have zero items in it. This is often seen with K = 2, a two-cluster set-up, where all of the items end up sorted into a single cluster. It is important to note that an empty cluster doesn't mean the output failed; it can actually be a perfectly valid answer.
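The same per-cluster item counts shown in the text view can be tallied directly from the label column. The short sketch below assumes the df["label"] column produced by the clustering sketch earlier; the row position 5935 is only a stand-in, borrowed from the Image 17 example, for the double-click drill-down.

```python
# Size of each cluster, mirroring the Cluster Model text view counts.
# Note that a cluster with zero items simply will not appear in this tally.
print(df["label"].value_counts().sort_index())

# Drill into a single record the way the tooltip / double-click view does.
print(df.iloc[5935])
```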
Image 19. Output, Cluster Model, Folder View. This view shows the condensed list of clusters, with clusters #3 and #5 expanded to show the data set elements within those clusters. This is the folder view for the Cluster Model, where you can see the actual contents of the different clusters. While this view is not nearly as useful as the Plot View graph, it can still shed some light on how the data is arranged. Typically, you would refer to this to see specific data points, and then refer back to your data set for further insight. Thankfully, with the Plot View function, RapidMiner enables much quicker data discovery than the 'hunt and seek' method.

Image 20. Output, Cluster Model, Centroid Table. This shows a data summary for each cluster. It gives the centroid data for each attribute and cluster, which can also help you determine what kind of criteria the clustering algorithm used in order to form a given cluster. For instance, you can see that cluster 2 has a large number of four-digit (plus decimal) numbers, which means that the cluster is centered around data points of that size. Cluster 8 deals with much larger, six-digit data points. Cluster 9 has even larger points, which range into the ten millions. Knowing this can help predict which cluster an unknown element would be assigned to, as well as help form a basic understanding of the data set you are working with.

Examining any given data set can take days or weeks. Don't expect an immediate response, but take comfort in the fact that you are doing what was once considered impossible. The fact that we can sift through such a huge array of data by just arranging operators and pushing a button is fantastic. Eventually the insight-finding portion of data analysis will also be automated, and then a whole new field for technologists will open up. For now, keep striving to be at the cutting edge of data science, and I will see you in the next chapter.

Image 1. This is the global overview from RapidMiner 5.1, as it is set up to extract data from an API feed and parse it into a spreadsheet. There are endless reasons that a collection of data points would be extracted and stored in this way; typically the data would be stored in a database such as Microsoft Access/SQL Server, MySQL, or Oracle. For simplicity, this example just extracts to a CSV, the most generic spreadsheet format available.

Image 2. Getting The Page. This displays the settings for web page retrieval. Most API feeds are constantly streaming data flows, with elements being added to the stream on a regular basis. Using a data set from the World Bank would be slightly different, as that data would be stored as an Excel file on your computer and then analyzed from your desktop using RapidMiner. Another option is to use RapidAnalytics, a server-deployed version of RapidMiner for large-scale organizational use.

Image 3. Cutting the Document (Sub-Process). You remember how I mentioned that there was a sub-process involved with cutting the document into pieces? It is displayed here as a series of three actions that RapidMiner takes in order to break the data feed apart. If you were using a text document, these steps would be almost identical, especially the core concept of tokenization.
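Tokenization itself is easy to sketch outside of RapidMiner. The snippet below is only an illustration of the concept, not the operator's exact behaviour: it splits a fetched document into lower-case word tokens using a regular expression, and the sample text is made up.

```python
import re

def tokenize(document: str) -> list[str]:
    """Split a document into lower-case word tokens (letters only)."""
    return re.findall(r"[a-z]+", document.lower())

# Made-up stand-in for text pulled from a web page or API feed.
sample = "The World Bank publishes thousands of development indicators each year."
print(tokenize(sample))
# ['the', 'world', 'bank', 'publishes', 'thousands', 'of', 'development',
#  'indicators', 'each', 'year']
```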
Image 5. Extract Information (Detailed View). Here you can see the method that is used to extract information from any given unorganized or partially organized data source. For instance, say your data has a field for a date, but it is mashed in with all other types of data in a giant blob. By knowing how to use a command similar to Extract Information, you can find the specific element that you are looking for and pull it out. This is useful for all types of result-finding processes, and can be applied to text analytics, web scraping, data mining, and numerous other fields.

Image 7. Filtering Tokens. Filtering the tokens, in this case, is performed by the Filter Tokens command. The parameters for the token filtering are set as follows: any token shorter than 4 characters (letters) or longer than 50 gets cut out of the result set. Thus, the token 'the' (3 characters) gets cut out, while the token 'there' (5 characters) remains within the data set.

Image 8. Document to Data. This is the step that stores the output from the document-processing steps as data. When dealing with data, it is important to keep a mental note of what format your data set is in at any given time. For instance, if you are working with a document, you cannot just save it to Excel; the data must be prepared first, via a routine like Documents to Data. Think of it like wrapping a gift for a friend: you need to make sure that everything will fit into the box before you deliver it.

Image 10. Read CSV. This is the first step of the second process, which cleans and organizes the semi-raw output, preparing it for greater analysis. There are a few approaches to data storage and manipulation, but my school of thought is that you don't want to eliminate your original data, ever. Keep in mind that the data project you are working on currently may be completely different from the data you are required to extract tomorrow, and that your data set or data mining process could be a vital feature of your future work. Save your work, and don't be afraid to refer back to it and reuse what you created before. Refinement and change are fundamental technical skills.

Image 12. Output from Process 1. You can see the larger size of this data set when compared to the output from Process 2. This would be considered the unrefined, base data set that you were working from.

Image 13. Sample output from Process 2. This shows the output from the second process, where you can see specific categories and data types that have been parsed and stored by RapidMiner. Also included is a section on statistics, as well as the data range. These are key factors, and are demonstrated more in Image 12.

Image 14. Data View of Output from Process 2. This shows a demonstration of the actual data pulled from an API feed. The purpose of this is to show how your data should, or could, look after processing. Note that there are still blanks and gaps in the data, which isn't unexpected. These could change the output or parsing, and imperfect data does create imperfect results. Knowing the limits of your data set and data process is critical.
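Because blanks and gaps like these are expected, a quick audit of the finished CSV is a sensible last step. The sketch below is only a suggestion, with a hypothetical file name for the output written by the first process; it counts the missing values per column and prints basic statistics so you know the limits of the data before drawing conclusions from it.

```python
import pandas as pd

# Hypothetical name for the CSV produced by the extraction process.
df = pd.read_csv("api_output.csv")

# How many blank (missing) entries does each column contain?
print(df.isna().sum())

# Basic statistics and ranges for the numeric columns.
print(df.describe())
```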