DMBI Practical
EXPERIMENT 1

AIM: Study of the ETL process and ETL tools.

ETL Process
ETL is a process that extracts data from different RDBMS source systems, then transforms the data (applying calculations, concatenations, etc.) and finally loads it into the Data Warehouse system. ETL stands for Extract, Transform and Load.

It is tempting to think that creating a data warehouse is simply a matter of extracting data from multiple sources and loading it into the warehouse database. This is far from the truth: it requires a complex ETL process that needs active input from various stakeholders, including developers, analysts, testers and top executives, and is technically challenging. To keep its value as a tool for decision-makers, a data warehouse system has to change with the business. ETL is therefore a recurring activity (daily, weekly or monthly) of a data warehouse system and needs to be agile, automated and well documented.

There are many reasons for adopting ETL in an organization:
• It helps companies analyse their business data for taking critical business decisions.
• Transactional databases cannot answer the complex business questions that can be answered with ETL.
• A data warehouse provides a common data repository.
• ETL provides a method of moving data from various sources into the data warehouse.
• As data sources change, the data warehouse is updated automatically.
• A well-designed and documented ETL system is almost essential to the success of a data warehouse project.
• It allows verification of data transformation, aggregation and calculation rules.
• The ETL process allows sample data comparison between the source and the target system.
• The ETL process can perform complex transformations and requires an extra (staging) area to store the data.
• ETL helps migrate data into a data warehouse and converts the various formats and types to one consistent system.
• ETL is a predefined process for accessing and manipulating source data into the target database.

ETL Process in Data Warehouses

Step 1) Extraction
In this step, data is extracted from the source systems into the staging area. Any transformations are done in the staging area so that the performance of the source systems is not degraded; likewise, if corrupted data were copied directly from a source into the data warehouse database, rollback would be a challenge. The staging area also gives an opportunity to validate extracted data before it moves into the data warehouse.

A data warehouse needs to integrate systems that have different DBMSs, hardware, operating systems and communication protocols. Sources can include legacy applications such as mainframes, customized applications, point-of-contact devices such as ATMs and call switches, text files, spreadsheets, ERP systems, and data from vendors and partners, among others. Hence one needs a logical data map before data is extracted and loaded physically; this data map describes the relationship between the sources and the target data.

Three data extraction methods:
1. Full extraction
2. Partial extraction, without update notification
3. Partial extraction, with update notification

Irrespective of the method used, extraction should not affect the performance and response time of the source systems, which are live production databases. Any slowdown or locking could affect the company's bottom line.
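To make the extraction step concrete, here is a minimal sketch of a full extraction into a staging table over JDBC. It is not taken from any particular ETL tool; the connection URLs, credentials and the table names sales_source and stg_sales are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Full extraction: copy every row from a source table into a staging table,
// so that later transformations never run against the live production system.
public class FullExtraction {
    public static void main(String[] args) throws Exception {
        try (Connection source = DriverManager.getConnection("jdbc:mysql://source-host/sales", "etl_user", "secret");
             Connection staging = DriverManager.getConnection("jdbc:mysql://staging-host/staging", "etl_user", "secret");
             Statement read = source.createStatement();
             ResultSet rows = read.executeQuery("SELECT order_id, customer_name, amount FROM sales_source");
             PreparedStatement write = staging.prepareStatement(
                     "INSERT INTO stg_sales (order_id, customer_name, amount) VALUES (?, ?, ?)")) {

            while (rows.next()) {
                write.setInt(1, rows.getInt("order_id"));
                write.setString(2, rows.getString("customer_name"));
                write.setDouble(3, rows.getDouble("amount"));
                write.addBatch();           // batch the inserts to keep the read on the source short
            }
            write.executeBatch();
        }
    }
}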
Some validations are done during extraction:
• Reconcile records with the source data.
• Make sure that no spam or unwanted data is loaded.
• Data type checks.
• Remove all types of duplicate or fragmented data.
• Check whether all the keys are in place.

Step 2) Transformation
Data extracted from the source servers is raw and not usable in its original form, so it needs to be cleansed, mapped and transformed. This is the key step where the ETL process adds value and changes the data so that insightful BI reports can be generated. In this step you apply a set of functions to the extracted data; data that does not require any transformation is called a direct move or pass-through data. In the transformation step you can perform customized operations on the data. For instance, if the user wants a sum-of-sales revenue figure that is not in the database, or if the first name and the last name are held in different columns, they can be concatenated before loading.

Typical data integrity problems:
1. Different spellings of the same person, like Jon and John.
2. Multiple ways of denoting a company name, like Google and Google Inc.
3. Use of different names, like Cleveland and Cleaveland.
4. Different account numbers generated by various applications for the same customer.
5. Required data fields left blank in some records.
6. Invalid products recorded at the POS, as manual entry can lead to mistakes.

Validations done during this stage:
• Filtering: select only certain columns to load.
• Using rules and lookup tables for data standardization.
• Character set conversion and encoding handling.
• Conversion of units of measurement, such as date/time conversions, currency conversions, numerical conversions, etc.
• Data threshold validation checks, for example age cannot be more than two digits.
• Data flow validation from the staging area to the intermediate tables.
• Required fields should not be left blank.
• Cleaning, for example mapping NULL to 0, or Gender "Male" to "M" and "Female" to "F".
• Splitting a column into multiple columns and merging multiple columns into a single column.
• Transposing rows and columns.
• Using lookups to merge data.
• Using any complex data validation, e.g. if the first two columns in a row are empty the row is automatically rejected from processing.

Step 3) Loading
Loading data into the target data warehouse database is the last step of the ETL process. In a typical data warehouse a huge volume of data needs to be loaded in a relatively short window (overnight), so the load process should be optimized for performance. In case of load failure, recovery mechanisms should be configured to restart from the point of failure without loss of data integrity. Data warehouse administrators need to monitor, resume and cancel loads according to prevailing server performance.

ETL tools
Many data warehousing tools are available in the market. Some of the most prominent are:
1. MarkLogic: MarkLogic is a data warehousing solution which makes data integration easier and faster using an array of enterprise features. It can query different types of data, such as documents, relationships and metadata.
2. Oracle: Oracle is the industry-leading database. It offers a wide range of data warehouse solutions, both on premises and in the cloud, and helps optimize customer experiences by increasing operational efficiency.
3. Amazon Redshift: Amazon Redshift is a data warehouse tool. It is a simple and cost-effective tool for analysing all types of data using standard SQL and existing BI tools, and it also allows running complex queries against petabytes of structured data.
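To make the transformation step concrete, here is a minimal sketch of a few of the row-level rules listed above: name concatenation, gender standardisation and mapping NULL to 0. The RawCustomer record and its fields are hypothetical and not tied to any ETL tool (Java 16+ for records).

// Row-level transformation rules applied to staged data before loading.
public class TransformStep {

    record RawCustomer(String firstName, String lastName, String gender, Double revenue) {}
    record CleanCustomer(String fullName, String gender, double revenue) {}

    static CleanCustomer transform(RawCustomer c) {
        // Concatenate first and last name into the single column the warehouse expects.
        String fullName = (c.firstName().trim() + " " + c.lastName().trim()).trim();

        // Standardise gender codes: Male -> "M", Female -> "F", anything else -> "U".
        String g = c.gender() == null ? "" : c.gender().trim().toLowerCase();
        String gender = switch (g) {
            case "male", "m" -> "M";
            case "female", "f" -> "F";
            default -> "U";
        };

        // Cleaning rule: map NULL revenue to 0.
        double revenue = c.revenue() == null ? 0.0 : c.revenue();
        return new CleanCustomer(fullName, gender, revenue);
    }

    public static void main(String[] args) {
        System.out.println(transform(new RawCustomer("Jon", "Smith", "Male", null)));
        // prints: CleanCustomer[fullName=Jon Smith, gender=M, revenue=0.0]
    }
}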
EXPERIMENT 5

AIM: Implementation of the Classification technique on ARFF files using WEKA.

Data Set
@relation diabetes
@attribute 'glucose_concentration' real
@attribute 'blood_pressure' real
@attribute 'skin_thickness' real
@attribute 'age' real
@attribute class {tested_negative,tested_positive}
@data
7,140,50,60,tested_positive
8,120,43,75,tested_positive
6,110,37,60,tested_positive
9,125,28,46,tested_positive
9,160,53,72,tested_positive
7,105,48,65,tested_positive
8,100,32,78,tested_positive
9,110,25,43,tested_positive
3,145,49,54,tested_positive
4,156,60,40,tested_positive
2,170,28,37,tested_positive
1,20,43,60,tested_positive
4,108,46,45,tested_positive
3,90,27,70,tested_positive
2,100,36,30,tested_negative
4,125,35,45,tested_negative
3,120,26,48,tested_negative
1,100,30,30,tested_negative
4,108,23,43,tested_negative
3,140,39,58,tested_positive
1,120,55,40,tested_positive
2,87,29,34,tested_negative
3,96,32,71,tested_positive
1,104,54,27,tested_negative
4,120,31,31,tested_negative
1,105,30,40,tested_negative
3,110,21,46,tested_negative
2,102,36,33,tested_negative
3,118,27,40,tested_negative

Result
Predictions on the user test set list, for each instance, the actual class, the predicted class, the error flag and the prediction confidence; most instances are predicted tested_positive and a few tested_negative.

Correctly Classified Instances      22    75.8621 %
Incorrectly Classified Instances     7    24.1379 %

Confusion Matrix
  a  b   <-- classified as
  0  2 |  a = tested_negative
  5 12 |  b = tested_positive

(J48 tree visualization: the root splits on age at 48; the left branch then splits on glucose_concentration at 6, blood_pressure at 140 and skin_thickness at 39, with leaves labelled tested_positive and tested_negative.)

Analysis: At the root node the dataset is split on the age attribute: if age is greater than 48 the person is classified as tested_positive, otherwise the next split is on glucose_concentration. If glucose_concentration is greater than 6 the person is classified as tested_positive, otherwise the next split is on blood_pressure. If blood_pressure is greater than 140 the person is classified as tested_positive, otherwise the next split is on skin_thickness. If skin_thickness is greater than 39 the person is classified as tested_positive, otherwise as tested_negative.
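WEKA's classifiers can also be driven from Java code instead of the Explorer GUI. A minimal sketch, assuming the data set above has been saved as diabetes.arff; for simplicity it evaluates J48 with 10-fold cross-validation rather than the separate user test set used above.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Build and evaluate a J48 decision tree on the diabetes ARFF file.
public class ClassifyDiabetes {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("diabetes.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // 'class' is the last attribute

        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);                       // textual form of the decision tree

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());     // accuracy and error counts
        System.out.println(eval.toMatrixString());      // confusion matrix
    }
}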
EXPERIMENT 9

AIM: Study of the DBMiner tool.

Theory: DBMiner, a data mining system for interactive mining of multiple-level knowledge in large relational databases, has been developed based on years of research. The system implements a wide spectrum of data mining functions, including generalization, characterization, discrimination, association, classification and prediction. By incorporating several interesting data mining techniques, including attribute-oriented induction, progressive deepening for mining multiple-level rules, and meta-rule guided knowledge mining, the system provides a user-friendly, interactive data mining environment with good performance.

The system has the following distinct features:
• It incorporates several interesting data mining techniques, including attribute-oriented induction, progressive deepening for mining multiple-level rules and meta-rule guided knowledge mining, and implements a wide spectrum of data mining functions including generalization, characterization, association, classification and prediction.
• It performs interactive data mining at multiple concept levels on any user-specified set of data in a database, using an SQL-like Data Mining Query Language (DMQL) or a graphical user interface. Users may interactively set and adjust various thresholds, control a data mining process, perform roll-up or drill-down at multiple concept levels, and generate different forms of output, including generalized relations, generalized feature tables, multiple forms of generalized rules, visual presentations of rules, charts, curves, etc.
• Efficient implementation techniques have been explored using different data structures, including generalized relations and multi-dimensional data cubes, integrated with relational database techniques. The data mining process may utilize user- or expert-defined set-grouping or schema-level concept hierarchies, which can be specified flexibly, adjusted dynamically based on data distribution, and generated automatically for numerical attributes.
• Both the UNIX and PC (Windows/NT) versions of the system adopt a client/server architecture. The latter communicates with various commercial database systems for data mining using ODBC technology.

DBMiner characterizer
The characterizer generalizes a set of task-relevant data into a generalized relation which can then be viewed at multiple concept levels and from different angles. In particular, it derives a set of characteristic rules which summarize the general characteristics of a set of user-specified data (called the target class). For example, the symptoms of a specific disease can be summarized by a characteristic rule.

DBMiner discriminator
A discriminator discovers a set of discriminant rules which summarize the features that distinguish the class being examined (the target class) from other classes (called contrasting classes). For example, to distinguish one disease from others, a discriminant rule summarizes the symptoms that discriminate this disease from the others.

DBMiner association rule finder
An association rule finder discovers a set of association rules at multiple concept levels from the relevant set(s) of data in a database. For example, one may discover a set of symptoms frequently occurring together with certain kinds of diseases and further study the reasons behind them.

DBMiner data classifier
A classifier analyzes a set of training data (i.e., a set of objects whose class label is known) and constructs a model for each class based on the features in the data. A set of classification rules is generated by such a classification process, which can be used to classify future data and develop a better understanding of each class in the database. For example, one may classify diseases and provide the symptoms which describe each class or subclass of diseases.

DBMiner predictor
A predictor predicts the possible values of some missing data or the value distribution of certain attributes in a set of objects. This involves finding the set of attributes relevant to the attribute of interest (by some statistical analysis) and predicting the value distribution based on the set of data similar to the selected object(s). For example, an employee's potential salary can be predicted based on the salary distribution of similar employees in the company.

DBMiner meta-rule guided miner
A meta-rule guided miner is a data mining mechanism which takes a user-specified meta-rule form as a pattern to confine the search for desired rules. For example, one may require the discovered rules to follow a template relating a student's major to his or her GPA, in order to find the relationships between major and GPA in a university database.

DBMiner evolution evaluator
A data evolution evaluator evaluates the data evolution regularities for certain objects whose behavior changes over time. This may include characterization, classification, association or clustering of time-related data. For example, one may find the general characteristics of the companies whose stock price has gone up by over 20% in the last year, or evaluate the trend or particular growth patterns of certain stocks.

DBMiner deviation evaluator
A deviation evaluator evaluates the deviation patterns for a set of task-relevant data in the database. For example, one may discover and evaluate a set of stocks whose behavior deviates from the trend of the majority of stocks during a certain period of time. The module contains the following three functions:
1. it recognizes or identifies the general trend and/or behavior of data in the database,
2. it detects the set of data which deviates from such a trend or behavior, and
3. it summarizes the general characteristics of the deviation data.

DBMiner user interfaces
Three user interfaces, UNIX-based, Windows/NT-based and WWW/Netscape-based GUIs, have been developed to allow users to interactively discover multiple-level knowledge in large relational databases. The system integrates well with existing commercial database systems, offers high performance, and is robust at handling noisy and exceptional data.

EXPERIMENT 10

AIM: Study of the ARMiner tool.

Theory: Data mining technology has attracted many researchers and organizations because of its brilliant application prospects. Owing to the large amount of research on it, many applications have emerged and many prototypes have been produced, such as KEFIR from GTE and IMACS from AT&T. Some systems, such as Intelligent Miner from IBM, DBMiner from Simon Fraser University and Knight from Nanjing University, have been used successfully in many domains like finance and commerce. Representing the achievements of current data mining technology, these systems involve research in databases, expert systems, machine learning and statistics, and a few of them have been put into practice in business fields.

After developing AMINER, a data mining tool which adopts various kinds of data mining technologies, we successfully constructed ARMiner, a data mining system based on association rules and a component of AMINER, by integrating commercial requirements with research on association rules. The goal of ARMiner is to develop data mining tools for intelligent POS systems and to support decision-making in data warehousing. ARMiner is not designed for one particular kind of application: by permitting outside modification of its domain knowledge during the data mining process, ARMiner acquires a degree of flexibility. Another advantage of ARMiner is that an interestingness measure is introduced in the system as a new evaluation to filter useless and uninteresting association rules; as a result, an improved algorithm is obtained which entirely preserves the semantics implied in the association rules. Moreover, ARMiner provides mining algorithms and preprocessing API functions for its re-development. These API functions can be seamlessly integrated with many development environments, thus facilitating the deployment of ARMiner.

ARMiner consists of five components: a basic technology module, a presentation module, an algorithm module, a data source module and an instruction center of data mining.
• Basic technology module: This refers to the environment of software and hardware in which the system is developed, such as the server, network, platform and tools. As the physical foundation of the implementation, it determines the efficiency of the final system.
• Presentation module: This refers to the operating interface for users. It can be direct, like the operating interface of a Client/Server architecture, or indirect (for example, electronic mail which users use to deliver data mining requests and receive the results of data mining).
• Algorithm module: This is the core of the system. Guided by the instruction center of data mining, it selects suitable algorithms to mine the clean data produced from the data sources, applies techniques such as indexing, parallel computing and pruning of ramifications to improve efficiency, and sends the mining result to the presentation module.
• Data source module: This prepares data for mining by transforming the raw data extracted in various ways from different sources, such as relational databases, multidimensional databases, data warehouses and even flat files. The raw data can be extracted through gateways such as ODBC, by connecting to databases in special ways, or by analyzing the data from the data warehouse and other data sources. Large data sets can be sampled to reduce the size of the data to be processed. The dirty data are then removed by cleaning (for example, the raw data are generalized according to the concept hierarchies) and the remainder is integrated for mining. In that way, every mining algorithm uses the same interface to access data without being aware of the existence of the data sources.
• Instruction center of data mining: As the headquarters of the system, the instruction center directs the three other modules, namely presentation, algorithm and data source, to run properly. The presentation form base stores the definitions of the forms in which the output is presented to end users, such as natural language, graphics and grids. The knowledge/algorithm base is used to control the management and execution of algorithms, for example to adjust the evaluation system and to choose a suitable mining algorithm to accelerate computation. The data preparation method base provides the methods used by the data source module during data transfer, such as data transformation and concept hierarchy analysis.

EXPERIMENT 6

AIM: Implementation of the Clustering technique on ARFF files using WEKA.

DATASET
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no

RESULT: kMeans

Final cluster centroids:
Attribute      Full Data   Cluster 0   Cluster 1   Cluster 2
               (9.0)       (3.0)       (3.0)       (3.0)
outlook        rainy       rainy       overcast    sunny
temperature    mild        hot         hot         mild
humidity       normal      high        high        normal
windy          TRUE        TRUE        FALSE       TRUE
play           yes         no          yes         yes

Clustered Instances
0   2 (40%)
1   2 (40%)
2   1 (20%)

ANALYSIS: Three clusters are created. Even when the outlook is rainy, if the temperature is mild, there is no wind and humidity is normal, play can still be yes. If it is overcast and cool with normal humidity and windy, then play is yes, and if it is sunny and hot with high humidity and windy, then play is also yes. In the test data, 2 instances go to cluster 0, 2 instances go to cluster 1, and 1 instance goes to cluster 2.
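The same clustering run can be reproduced from Java code. A minimal sketch, assuming the data set above has been saved as weather.nominal.arff; the random seed is an arbitrary choice, and the centroids printed depend on it.

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Cluster the nominal weather data into three clusters with SimpleKMeans.
public class ClusterWeather {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.nominal.arff").getDataSet();

        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters(3);          // three clusters, as in the Explorer run
        kMeans.setSeed(10);                // arbitrary seed; results vary with it
        kMeans.buildClusterer(data);
        System.out.println(kMeans);        // prints the final cluster centroids

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(kMeans);
        eval.evaluateClusterer(data);      // here the training data doubles as the test set
        System.out.println(eval.clusterResultsToString());
    }
}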
EXPERIMENT 7

AIM: Implementation of the Association technique on ARFF files using WEKA.

DATASET
Note: This data is binary: an item that is bought is represented by t, and an item that is not bought is left as a missing value (?).

@relation supermarket
@attribute 'department1' { t}
@attribute 'department2' { t}
@attribute 'grocery misc' { t}
@attribute 'baby needs' { t}
@attribute 'bread and cake' { t}
@attribute 'baking needs' { t}
@attribute 'biscuits' { t}
@attribute 'frozen foods' { t}
@attribute 'party snack foods' { t}
@attribute 'fruit' { t}
@attribute 'vegetables' { t}

(The header continues in the same single-valued { t} form, covering numbered departments up to 'department216' and named product groups such as 'juice-sat-cord-ms', 'tea', 'canned fish-meat', 'canned fruit', 'canned vegetables', 'coffee', 'confectionary', 'pet foods', 'laundry needs', 'soft drinks', 'milk-cream', 'cheese', 'margarine', 'beef', 'lamb', 'pork', 'poultry', and wine categories such as 'casks white wine', '750ml red nz' and 'port and sherry'. The last attribute is the class:)

@attribute 'total' { low, high}   % low < 100

@data
(Each instance is a sparse transaction row: purchased departments are marked t, all other attributes are left as ?, and the row ends with the total, low or high.)

RESULT:
(Apriori settings in the Explorer: class implementing an Apriori-type algorithm; delta 0.05, lowerBoundMinSupport 0.1, numRules 10, outputItemSets False, removeAllMissingCols False, treatZeroAsMissing False, upperBoundMinSupport 1.0, verbose False.)

Number of cycles performed: 17

Generated sets of large itemsets:
Size of set of large itemsets L(1): 44
Size of set of large itemsets L(2): 380
Size of set of large itemsets L(3): 910
Size of set of large itemsets L(4): 633
Size of set of large itemsets L(5): 105
Size of set of large itemsets L(6): 1

Best rules found:
1. biscuits=t frozen foods=t fruit=t total=high 788 ==> bread and cake=t 723   lift:(1.27) lev:(0.03) [155] conv:(3.35)
2. baking needs=t biscuits=t fruit=t total=high 760 ==> bread and cake=t 696   lift:(1.27) lev:(0.03) [149] conv:(3.28)
3. baking needs=t frozen foods=t fruit=t total=high 770 ==> bread and cake=t 705   lift:(1.27) lev:(0.03) [150] conv:(3.27)
4. biscuits=t fruit=t vegetables=t total=high 815 ==> bread and cake=t 746   lift:(1.27) lev:(0.03) [159] conv:(3.26)
5. party snack foods=t fruit=t total=high 854 ==> bread and cake=t 779   lift:(1.27) lev:(0.04) [164] conv:(3.15)
6. biscuits=t frozen foods=t vegetables=t total=high 797 ==> bread and cake=t 725   lift:(1.26) lev:(0.03) [151] conv:(3.06)
7. baking needs=t biscuits=t vegetables=t total=high 772 ==> bread and cake=t 701   lift:(1.26) lev:(0.03) [145] conv:(3.01)
8. biscuits=t fruit=t total=high 954 ==> bread and cake=t 866   lift:(1.26) lev:(0.04) [179] conv:(3)
9. frozen foods=t fruit=t vegetables=t total=high 834 ==> bread and cake=t 757   lift:(1.26) lev:(0.03) [156] conv:(3)
10. frozen foods=t fruit=t total=high 969 ==> bread and cake=t 877   lift:(1.26) lev:(0.04) [179] conv:(2.92)

ANALYSIS: We can see that all the presented rules have a consequent of "bread and cake", and all of them involve a high total transaction amount. Biscuits and frozen foods appear in many of the presented rules. We might conclude that convincing people who buy biscuits, frozen foods and fruit to also buy bread and cake results in a high total transaction amount (Rule 1). This may sound plausible, but it is flawed reasoning: the product combination does not cause a high total, it is only associated with a high total.
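The same association run can be scripted from Java code. A minimal sketch, assuming the data set is the supermarket.arff file that ships with WEKA, using the same bounds as the Explorer settings listed above.

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Mine the top 10 association rules from the supermarket data with Apriori.
public class SupermarketRules {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("supermarket.arff").getDataSet();

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);               // report the 10 best rules
        apriori.setLowerBoundMinSupport(0.1);  // same support bounds as above
        apriori.setUpperBoundMinSupport(1.0);
        apriori.setDelta(0.05);                // step by which support is lowered each cycle
        apriori.buildAssociations(data);

        System.out.println(apriori);           // prints itemset counts and the best rules found
    }
}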
EXPERIMENT 8

AIM: Implementation of the Visualize technique on ARFF files using WEKA.

THEORY: The WEKA Knowledge Explorer is an easy-to-use graphical user interface that harnesses the power of the WEKA software. Each of the major WEKA packages (Filters, Classifiers, Clusterers, Associations and Attribute Selection) is represented in the Explorer, along with a Visualization tool which allows datasets and the predictions of classifiers and clusterers to be visualized in two dimensions.

Visualize Panel: This panel displays a scatter plot matrix for the current dataset. The size of the individual cells and the size of the points they display can be adjusted using the slider controls at the bottom of the panel. The number of cells in the matrix can be changed by pressing the "Select Attributes" button and then choosing the attributes to be displayed. When a dataset is large, plotting performance can be improved by displaying only a subsample of the current dataset. Clicking on a cell in the matrix pops up a larger plot panel window that displays the view from that cell. This panel allows you to visualize the current dataset in one and two dimensions. When the colouring attribute is discrete, each value is displayed as a different colour; when the colouring attribute is continuous, a spectrum is used to indicate the value. Attribute "bars" down the right-hand side of the panel provide a convenient summary of the discriminating power of the individual attributes. This panel can also be popped up in a separate window from the classifier panel and the cluster panel to allow you to visualize predictions made by classifiers and clusterers. When the class is discrete, misclassified points are shown by a box in the colour corresponding to the class predicted by the classifier; when the class is continuous, the size of each plotted point varies in proportion to the magnitude of the error made by the classifier.

(Screenshot: scatter plot matrix of the iris dataset, plotting sepallength, sepalwidth, petallength, petalwidth and class against each other.)

Many algorithms in the WEKA tool allow visualization of their results. Examples include a J48 decision tree, the probability distribution table for sepalwidth, a petallength versus petalwidth plot, and CobWeb clustering.

WEKA also has a graphical user interface to a neural network ([Link]). This interface allows the user to specify the structure of a multilayer perceptron and the parameters that control its training; the controls include Learning Rate, Num of Epochs, Error per Epoch and Momentum.

EXPERIMENT 2

AIM: Data Warehouse Cleansing to input names from users (inconsistent) and format them.

THEORY: Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve its quality. Data quality problems are present in single data collections, such as files and databases, e.g. due to misspellings during data entry, missing information or other invalid data. When multiple data sources need to be integrated, e.g. in data warehouses, federated database systems or global web-based information systems, the need for data cleaning increases significantly.
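Since the aim asks for a program that takes inconsistently entered names and formats them, here is a minimal sketch of such a cleanser. The normalisation rules (trim, collapse repeated spaces, title-case each word) are illustrative choices, not prescribed by the experiment.

import java.util.Locale;

// Normalise inconsistently entered names: "  jOn   smith " -> "Jon Smith".
public class NameCleanser {

    static String formatName(String raw) {
        if (raw == null) {
            return "";
        }
        // Remove leading/trailing whitespace and collapse internal runs of spaces.
        String collapsed = raw.trim().replaceAll("\\s+", " ");
        StringBuilder out = new StringBuilder();
        for (String part : collapsed.split(" ")) {
            if (part.isEmpty()) {
                continue;
            }
            // Title-case each word: first letter upper case, the rest lower case.
            out.append(Character.toUpperCase(part.charAt(0)))
               .append(part.substring(1).toLowerCase(Locale.ROOT))
               .append(' ');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(formatName("  jOn   smith "));   // Jon Smith
        System.out.println(formatName("MARY  ANN jones"));  // Mary Ann Jones
    }
}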
PROBLEM 1: Showing the data plainly
1. The text is very small, perhaps to try to fit all the data on the screen so one doesn't have to scroll right to see it.
2. There is some colouring in there; it may signify something important about the data.
3. There are discontinuities in the column lettering and row numbering.

SOLUTION: Using the cell formatting features, we can remove the choices that the original publisher of the dataset made.
1. Select the complete dataset (Ctrl+A).
2. Right-click anywhere on the row number column and select Unhide Rows. Repeat the same steps for the columns.
3. Select all the cells from which you want to remove the formatting. Click the Format option in the menu, then click the 'Clear formatting' option to remove cell colouring and make the text size uniform.
4. Right-click again and go to Format. Select the 'Align' option and choose options that place the text sensibly in the cells.
5. Resize the rows and columns by selecting them and then right-clicking on any of them. In the 'Resize rows' option, set the row size.

PROBLEM 2: Whitespace and newlines - data that shouldn't be there
There can be duplicate entries in the dataset due to extra characters alongside the data that are not displayed, so we can't easily see them. These are likely to be either whitespace at the ends of lines (also called 'trailing spaces') or newlines that were added accidentally during data entry. These are very common errors, and their presence can affect the eventual analysis of the data, because the spreadsheet treats them as different entries. For example, if we are counting deals that are categorised as "Done", the spreadsheet will exclude those that are categorised as "Done " (note the extra space at the end). Similarly, cells can also have newlines in them, whose presence is obscured by the layout.

SOLUTION: There are two easy ways to remove whitespace and newlines from a worksheet; both are equally effective.

Use the Find and Replace feature
Both whitespace and newlines can be "seen" by the spreadsheet.
1. Open the find/replace tool (Ctrl-H).
2. Select "More Options" and check "Regular expressions". This feature enables the spreadsheet to search for patterns, and not just specific characters.
3. In the input area for Find, type [:space:]$ and click "Find". This is a regular expression that searches only for spaces that are at the end of the text in a cell (which is what the $ denotes).
4. Running this search will show you the cells in this worksheet that have one or more trailing whitespaces.
5. To remove the trailing whitespaces that have been found, click "Find All". Make sure the input area for 'Replace with' is empty, then click "Replace All". Repeat this operation until the spreadsheet tells you, "The search key was not found".
6. To remove the newlines, repeat steps 1 to 5 with \n in the Find box. Remember to keep regular expressions enabled, or this won't work.
7. Run the auto-filter again to see how the data has changed.
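The same trailing-whitespace and newline clean-up can also be expressed outside the spreadsheet. A minimal sketch that applies the two substitutions to one cell value at a time (the sample values are made up).

// Strip trailing whitespace and embedded newlines from cell values,
// mirroring the two Find and Replace passes described above.
public class CellCleaner {

    static String clean(String cell) {
        return cell
                .replaceAll("\\s+$", "")   // whitespace at the end of the value
                .replace("\n", " ")        // newlines typed accidentally during data entry
                .trim();
    }

    public static void main(String[] args) {
        System.out.println("[" + clean("Done   ") + "]");      // [Done]
        System.out.println("[" + clean("Done\nDeal") + "]");   // [Done Deal]
    }
}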
Use the TRIM and CLEAN functions
Trailing whitespace and newlines are common enough problems in spreadsheets that there are two specialised functions, CLEAN and TRIM, that can be used to remove them. This is a little more detailed, so follow the steps carefully:
1. In your spreadsheet, the GRAIN dataset should be in Sheet1. Create a new worksheet for your spreadsheet, called Sheet2.
2. In cell A1 of the new worksheet you have just created, enter the following formula and press Enter: =CLEAN(TRIM(Sheet1.A1)). This takes the content of cell A1 from Sheet1, which is your original data, and reproduces it in Sheet2 without any invisible characters, newlines or trailing whitespace.
3. Find out the full data range of Sheet1: it will be A1 to I417. In Sheet2, select cell A1 and copy it (Shortcut: Ctrl+C). In the same sheet, select the range A1 to I417 and paste the formula into it (Shortcut: Ctrl+V). The complete dataset from Sheet1 will be reproduced in Sheet2, without the problematic invisible characters.
4. To work further on this data, you will now have to remove the formulas you used to clean it. This can be done with the Paste Special operation. In Sheet2, select the complete dataset and copy it. Put the cursor in cell A1, and then go to Edit -> Paste Special. This enables you to choose which attributes of the cell you want to paste: we want to paste everything (text and numbers, date and time, formats, comments, objects) except the formulas, which is done by checking the appropriate boxes and then clicking OK.
5. Double-click on any cell and you will see that it just contains data, not a formula. If you like, run through the steps outlined in Problem 1 to make the text 'wrap' in cells, and adjust the column widths.

PROBLEM 3: Blank cells
In many spreadsheets we come across, there will be empty ("blank") cells. They may have been left blank intentionally, or in error. Either way, they are missing data, and it's useful to be able to find, quantify, display and correct them if needed.

SOLUTION:

Use the COUNTBLANK function
This will enable us to show the number of blank cells, which helps us figure out the size of the potential problem:
1. Scroll to the end of the dataset. In a row beneath the data, enter the following formula and press Enter: =COUNTBLANK(A1:A417). This counts the number of blank cells in column A down to row 417, the last entry in the GRAIN dataset.
2. In the same row, copy the formula just created across columns B to I. This shows a count of the blank entries in the other columns.
3. We can see that by far the most blank cells are in column G, 'Projected Investment'.

Use the ISBLANK function with the Conditional Formatting feature
Blank cells can also be highlighted using conditional formatting and the ISBLANK function, changing the background colour of blank cells so you can see where they are:
1. Select the dataset (cells A1 to I417) and open the Conditional Formatting window (Format -> Conditional Formatting -> Conditional Formatting). This spreadsheet feature allows us to automatically change the formatting (e.g. font size, cell style, background colour) depending on the criteria we specify.
2. In the conditional formatting window, set Condition 1 to "Cell value is" "equal to" ISBLANK(). To make the blank cells stand out more clearly, use an existing style or set up a new one by clicking the 'New Style' option, which opens a window where we can choose the font type, size, background colour and so on.
3. Check what has happened: when the conditional formatting is applied, the blank cells will be highlighted in red. Here's how it looks when zoomed out a bit (View -> Zoom, then enter 75% into the 'Variable' option).
4. To remove the conditional formatting, repeat steps 1 and 2 above but define a clear style (or use the default) instead. Otherwise, select the cell or range of cells and select "Clear Direct Formatting".

Fill in empty values with the Find and Replace feature
In the GRAIN dataset we do not know whether blank cells signify data that has been deliberately or accidentally left out. We may want to actively specify that the data is missing, rather than leaving a blank cell.
1. Select the complete data range (A1 to I417).
2. Open the find/replace dialogue (Shortcut: Ctrl-H). If we have already used this earlier, we will need to disable searching with regular expressions, because this causes the search to work differently. Do this by clicking 'More Options' and unchecking regular expressions.
3. Leave the "Search for" input box empty. Type "Missing" into the replace box (without quote marks). Click 'Replace All'.
4. Every blank cell will now have the value "Missing" recorded in it. We can verify this using the COUNTBLANK function outlined above.

Filling blank cells isn't always useful, and it's important to choose the right term to denote a missing value. For example, in the context of 'Proposed Investment', using the term 'none' to signify missing data is confusing: readers may think this means we know there is no investment, rather than that there is no data.

PROBLEM 4: Fixing numbers that aren't numbers
The GRAIN dataset has a column called Proposed Investment. This records the amount of cash paid for land in US dollars. However, the amounts are recorded as text, not as numbers. This means the spreadsheet can't use these values for the mathematical operations required to make totals or averages, or to sort the numbers from highest to lowest. Further, the data has not been entered in a consistent form.

SOLUTION: We can solve this with a combination of automation and old-fashioned hand correction of the data.

A partial solution using a combination of formulas
1. Choose a consistent unit: US$ millions. Create a new column H, called "Projected Investment (US$ millions)", to the immediate right of the current column G, Projected Investment. We will use column H as a working column to display the outcomes of our calculations.
2. Go to cell H2 and enter the following formula: =LEFT(SUBSTITUTE(G2,"US$",""),SEARCH(" ",SUBSTITUTE(G2,"US$",""))). Then copy it (Shortcut: Ctrl-C).
3. Select the range H2 to H417 with the mouse and paste the formula into that range. Wherever there is data in column G, a new value is displayed in column H: for example, "US$77 million" becomes 77 and "US$200 million" becomes 200. Where there is no data, a warning such as #VALUE! is shown.

This formula works using three functions joined together: LEFT, SUBSTITUTE and SEARCH.
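The same extraction of a numeric value from text such as "US$77 million" can be sketched in code. The parsing rule below (strip the currency marker, keep the first number) mirrors the LEFT/SUBSTITUTE/SEARCH formula and is only one reasonable interpretation of the messy source values.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Pull the numeric part out of free-text investment values such as "US$77 million".
public class InvestmentParser {

    // First group of digits (optionally with a decimal point) after removing "US$".
    private static final Pattern NUMBER = Pattern.compile("([0-9]+(?:\\.[0-9]+)?)");

    static Double parseMillions(String raw) {
        if (raw == null || raw.isBlank()) {
            return null;                      // keep missing values missing
        }
        Matcher m = NUMBER.matcher(raw.replace("US$", ""));
        return m.find() ? Double.valueOf(m.group(1)) : null;
    }

    public static void main(String[] args) {
        System.out.println(parseMillions("US$77 million"));    // 77.0
        System.out.println(parseMillions("US$200 million"));   // 200.0
        System.out.println(parseMillions(""));                 // null
    }
}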
PROBLEM 5: Structural problems - data in inconvenient places
Column F of the GRAIN dataset contains details about the intended use to which sold land will be put. Here are the first 10 entries (if the sheet is sorted alphabetically by 'Landgrabbed'):
• Milk, olive oil, potatoes
• Rice
• Oil palm
• Oil palm
• Sugar cane
• Oilseed
• Rice
• Soybeans
• Maize, soybeans, wheat
• Barley, maize, soybeans, sunflower, wheat

In some cells there are single values, like just Oil palm. In others, the picture of how land is used is more complicated and there are more values. At a simple level, this data means we can do some basic analysis.

So we can do some useful basic things, but there are problems that will affect the analysis:
• We can't see the complete range of options very easily.
• We have to rely on the people creating the data to have arranged things alphabetically too; what if someone had recorded Alfalfa at the end of a list? We couldn't see it.
• Further, we assume they've spelled things the same way each time, or used the same word each time, for example "Alfalfa crop" or perhaps another word for it.
• It's difficult to get a full list of the land uses that you can look for.

SOLUTION: In a spreadsheet this can be partially overcome using the standard filter, which is more flexible than an autofilter.

Use standard filters for a more flexible search
1. Go to Data -> Filter -> Standard Filter. This opens a window with many more options. Let's look for deals involving Alfalfa again: set the criterion to Column F, condition "Contains", value "alfalfa", and select OK.
2. This searches down column F, where our data about production is stored, for cells where the word "alfalfa" is written anywhere; it doesn't matter whether other words are there too. The spreadsheet will again be filtered to show four rows.
3. We can build up more complicated queries. For example, try a filter where Column F contains "Rice" AND Column F contains "Bananas", which filters the data for deals where the land use was thought to include both rice and bananas, amongst others.
4. This returns only two results.
5. To remove the filter, go to Data -> Filter -> Remove Filter.

PROBLEM 6: From "banabas" to "bananas" - dealing with inconsistencies in data
In the GRAIN dataset, column F, called 'Production', contains data about what buyers of land intend to grow on their new acquisitions, such as 'cereal', 'soyabeans' or 'sugarcane' and many other types of agricultural activity. As mentioned above, we can use this data to sort and filter the dataset, which helps us see the extent of different kinds of production. However, there are some uncertainties that make the data in the 'Production' column far less useful than it could be:
• We don't know everything we could be looking for in the dataset. This is because there isn't a single list of categories of production (called a 'controlled vocabulary'); there are in fact well over 100 distinct categories.
• We can't be confident that the people creating the data have been consistent in how they entered it. So if we look for "oil seed", we'll also find "oilseed" and "oil seeds". This difference is important because a spreadsheet treats these values as completely different things.

SOLUTION: The process of solving this problem has four steps:
1. Get all the categories that appear in the data.
2. Make one big list from the huge mess.
3. Remove duplicates and correct categories that are nearly the same.
4. Edit the data to fit the more accurate list of categories.

(i) Get all the categories that appear in the data
1. Create a new empty worksheet and name it "Text-to-columns" (or something memorable so you know what's in it). We will use this worksheet as a 'scratch pad' to work on this particular problem.
2. In the GRAIN dataset that you have been working on, select column F containing the data about production. Copy it (Shortcut: Ctrl-C), then go to the new "Text-to-columns" worksheet and paste it into column A. Rename the column "Production (original)".
3. We will now have to remove non-printing characters from this data, using the combination of CLEAN and TRIM and then Paste Special. In cell B1 enter =CLEAN(TRIM(A1)); this produces a version of the data in cell A1 that has no non-printing characters. Copy the formula down column B to create a completely clean version of column A. Then select column B, copy it, and use Edit -> Paste Special, unchecking "formulas". Now we have clean, usable data in column B. Rename it "Production (cleaned)".
4. Select column B, which has our cleaned data, and select Data -> Text to Columns. A new window will appear. In the "Separator options" area, uncheck "Tab" and check "Comma", as the items in our cells are separated by commas, not tabs. In the "Fields" area we can see that for each comma the spreadsheet finds, it will move the data after that comma into a new column. Select OK and the list of categories will appear.

(ii) Make one big list from the huge mess
1. Save the spreadsheet. Now delete column A and save it again with a new filename. Then save it again, this time as a Text CSV.
2. Start up a text editor and open the Text CSV we just created. Each production list will appear on its own line, padded with long runs of commas (e.g. "Milk, olive oil, potatoes,,,,,,," and "Barley, maize, soybeans, sunflower, wheat,,,,,").
3. Now copy and paste this text into a new document in LibreOffice Writer. We will use Writer to process this data as text before we return it to a spreadsheet for analysis. In Writer, use the Find and Replace tool (it is the same as in the spreadsheet) to create a single list: make sure "Regular expressions" are enabled (look in "More Options"), enter a single comma in Find and \n in Replace. This replaces every comma with a new line. There will now be lots of blank lines.
4. Select all the text (Ctrl-A) and copy it to the clipboard (Ctrl-C). Go back to the spreadsheet in LibreOffice Calc. In a new worksheet, select cell A1 and paste the text (Ctrl-V). The data will appear, but with lots of blank rows.
5. Select column A and go to Data -> Pivot Table, using column A as the row field with a count of occurrences. In a new worksheet, a list of the unique terms used to describe Production, arranged alphabetically, will appear along with a count of how many times each occurs.

(iii) Remove duplicates and correct categories that are nearly the same
In the GRAIN dataset:
• "cofee" is mis-spelt, and so is "forstry" (which should be 'forestry', and which could be the same as 'forest');
• there are entries for "fruit" and "fruits": which one is best?

This work is called 'reconciliation' and is a process designed to bring clarity to data. It involves looking through a list of terms and:
• identifying terms that mean the same thing and creating a new list;
• applying the new list to the dataset.

First, bring the data into a form in which it's easy to do this task:
• Copy the list of items from the pivot table we made above and add it into a new worksheet. Use CLEAN and TRIM on it, and then sort it alphabetically in ascending order.
• Insert a row at the top.
• Label column A "Category", column B "Issue" and column C "New Term".

Identifying terms that mean the same thing
1. Go down the list and look for terms that mean the same thing. Here are some things to keep in mind:
   • Spelling mistakes, e.g. "bananas" vs "banabas".
   • Differences in case: "fruit" vs "Fruit". Choose one case and stick to it.
   • Multiples: "fruits" vs "fruit". Choose one.
   • Adjectives: "sweet sorghum", "winter barley". If there's a similar category, like just "barley", it may make sense to remove the more specific category.
   • Additional terms: "and" in the text, e.g. "Dairy and Grain farms", "Citrus and Olives", "Crops (sorghum)". The rule is to have only one category in each cell, so delete one of the terms and add it to the list on its own if it doesn't already exist.
   • Qualifying terms: e.g. "beef cattle" vs "beef", "Crops" vs "food crops". Choose which one to keep.
2. In column B, record what we think the problem is, e.g. "Near match", "none", "Spelling error". This just helps us keep track of the changes we make.
3. When we've gone down the list and identified the problems, make the changes in column C. (At this stage the working list runs from Alfalfa, Almonds, Animal ranching and Asparagus through near-duplicates such as Banaba/Banana/Bananas, Beef/Beef cattle, Cattle/Cattle and crops, and Cereal/Cereal crops/Cereals/Cereals and other crops, with the suspected duplicates ringed as we went along.)
4. Once we've done this, run Data -> Pivot Table on the list of new terms (in column C). We'll see a huge difference.
5. By removing duplicates, spelling and grammar differences and so on, we have cut down the categories from 149 to 88, which is still quite an extensive list. Anyhow, we have a more useful controlled vocabulary.

(iv) Edit the data to fit the more accurate list of categories
We can do this using a combination of three useful features of the spreadsheet:
• Conditional formatting: this changes the formatting of a cell based on a rule that we give it, e.g. turn any cell in a given range red if it has the word "Sheep" in it. We can use this to highlight production categories that are not in our new, improved list of categories.
• Data validation: this enables us to restrict what data is entered into a cell. We specify a list of allowed values, and rather than typing anything we like into a cell, we choose from the list. We can use this to bring the data in line with the new categories while reducing the risk of introducing more errors.
• Concatenation: this merges the contents of cells together. We will use it to put the improved data about production back together.

EXPERIMENT 3

AIM: Program of Data Warehouse Cleansing to remove redundancy in data.

Theory: Remove duplicates and correct categories that are nearly the same. In the GRAIN dataset:
• "cofee" is mis-spelt, and so is "forstry" (which should be 'forestry', and which could be the same as 'forest');
• there are entries for "fruit" and "fruits": which one is best?

This work is called 'reconciliation' and is a process designed to bring clarity to data. It involves looking through a list of terms and:
• identifying terms that mean the same thing and creating a new list;
• applying the new list to the dataset.
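A small program can do a first pass of this reconciliation automatically. The sketch below normalises each category (trimming, lower-casing and stripping a trailing plural 's') and reports which raw values collapse onto the same cleaned term; the normalisation rules are illustrative, and the result would still be reviewed by hand exactly as the steps below describe.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Group near-duplicate category names ("Fruit", "fruits", " fruit ") under one cleaned term.
public class CategoryReconciler {

    static String normalise(String raw) {
        String s = raw.trim().toLowerCase().replaceAll("\\s+", " ");
        if (s.endsWith("s") && s.length() > 3) {
            s = s.substring(0, s.length() - 1);   // crude singularisation: "fruits" -> "fruit"
        }
        return s;
    }

    public static void main(String[] args) {
        List<String> raw = List.of("Fruit", "fruits", " fruit ", "Cereal", "Cereals", "Banana", "Bananas");

        Map<String, TreeSet<String>> groups = new LinkedHashMap<>();
        for (String value : raw) {
            groups.computeIfAbsent(normalise(value), k -> new TreeSet<>()).add(value.trim());
        }
        // Each key is a candidate cleaned category; each value lists the raw spellings behind it.
        groups.forEach((clean, variants) -> System.out.println(clean + " <- " + variants));
    }
}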
Bring the data into a form in which it's easy to do this task:
• Copy the list of items from the pivot table we made above and add it into a new worksheet. Use CLEAN and TRIM on it, and then sort it alphabetically in ascending order.
• Insert a row at the top. Label column A "Category", column B "Issue" and column C "New Term".

Identifying terms that mean the same thing
1. Go down the list and look for terms that mean the same thing. Here are some things to keep in mind:
   • Spelling mistakes, e.g. "bananas" vs "banabas".
   • Differences in case: "fruit" vs "Fruit". Choose one case and stick to it.
   • Multiples: "fruits" vs "fruit". Choose one.
   • Adjectives: "sweet sorghum", "winter barley". If there's a similar category, like just "barley", it may make sense to remove the more specific category.
   • Additional terms: "and" in the text, e.g. "Dairy and Grain farms", "Citrus and Olives", "Crops (sorghum)". The rule is to have only one category in each cell, so delete one of the terms and add it to the list on its own if it doesn't already exist.
   • Qualifying terms: e.g. "beef cattle" vs "beef", "Crops" vs "food crops". Choose which one to keep.
2. In column B, record what we think the problem is, e.g. "Near match", "none", "Spelling error". This just helps us keep track of the changes we make.
3. When we've gone down the list and identified the problems, make the changes in column C (we also ringed the suspected duplicates as we went along).
4. Once we've done this, run Data -> Pivot Table on the list of new terms (in column C). We'll see a huge difference. (The resulting pivot table lists each cleaned category with the number of times it occurs, from entries such as Pork, Potatoes, Poultry, Pulses, Rape, Rice, Rubber, Rye, Seeds, Sesame, Sheep, Sorghum, Soybean, Squash, Sugar, Sugar beet, Sugarcane, Sunflower, Sweetcorn, Tea, Teak, Teff and Tomato farm through to Vegetable oil, Vegetables, Wheat, Wine and Wool, ending with a total count.)

By removing duplicates, spelling and grammar differences and so on, we have cut down the categories from 149 to 88, which is quite an extensive list, but we now have a more useful controlled vocabulary. The next step is to apply this to the data, so it can help us with our analysis.

Edit the data to fit the more accurate list of categories
We can do this using a combination of three useful features of the spreadsheet:
• Conditional formatting: this changes the formatting of a cell based on a rule that we give it, e.g. turn any cell in a given range red if it has the word "Sheep" in it. We can use this to highlight production categories that are not in our new, improved list of categories.
• Data validation: this enables us to restrict what data is entered into a cell. We specify a list of allowed values, and rather than typing anything we like into a cell, we choose from the list. We can use this to bring the data in line with the new categories while reducing the risk of introducing more errors.
• Concatenation: this merges the contents of cells together. We will use it to put the improved data about production back together.

The two things you will need are:
• the spreadsheet in which you used Text to Columns to separate out the terms used to describe land use;
• your new, cleaned-up list of terms taken from the previous pivot table.

Step 1: Find the items that need correcting
We can use the spreadsheet to find the items that need correcting by comparing the data we have to the new list of categories. Here's how:
1. In the worksheet where you have the data about production, make sure you have cleaned the data (using CLEAN and TRIM and Paste Special to remove the formulas).
2. Have the split items in a worksheet called "Split".
3. Place your new list of categories in a worksheet called "Categories". Select your list of categories and go to Data → Define Range. A window will appear which enables us to make the list of categories into a "range" against which we can make comparisons.
4. You've already selected the list of categories, which you can see displayed in the Range area. Type ProductionCategories into the Name area and then select Add. We can now use this range.
5. Select all the data (shortcut: Ctrl+A) in the worksheet called "Split", where the data about production is split across different columns. Go to Format → Conditional Formatting. The formula to use is COUNTIF(ProductionCategories,A1)=0. Also select "New Style"; a window called "Cell Style" will pop up. In the "Font Effects" tab choose a colour and then select OK. We chose red. Select OK. This highlights in red the text in cells that are not found in the data range we have called "ProductionCategories". The effect is to highlight the entries that we now have to correct. [Screenshot: the "Split" worksheet with the unmatched terms shown in red.]

Step 2: Correct these entries

Now that we know where to look, we can make corrections to the data. The way to do this is to introduce data validation to the spreadsheet, which restricts the data that can be entered into a cell.
1. Select the complete dataset in the worksheet called "Split", where you've just highlighted the values that don't appear on the improved list of categories. Go to Data → Validity.
2. In the window that opens, set the fields as shown in the image. [Screenshot: the Validity dialog, with "Cell range" allowed and the source set to the category list on the "Categories" worksheet; a valid source can only consist of a contiguous selection of rows and columns, or a formula that results in an area or array.] What this does is tell the spreadsheet that the only values it should allow you to enter come from the list found in cells A2:A88 of the worksheet called "Categories": in other words, your list of categories. We also need to decide what to do when a value that isn't on the list is entered: select "Error Alert" and configure it to stop any non-list values from being entered. Click OK, then go to a cell with red text in it and click on it. You'll see a little drop-down selector on the right-hand side of the cell; click on it to display the list of 'approved' terms (e.g. soybeans, wheat, maize). You can now go through the data, correcting it to remove the errors and make it more useful for analysis. There are a few things to watch out for:
• As you go through, you can increasingly use keyboard shortcuts and auto-complete, and rely on the validation to tell you when you've typed a wrong entry.
• When you have changed a value, notice that the text changes colour to show that it is now a recognised term. When there's no more red, you're done.
• With values like "Soyabean and other crops", you should change it to "Soybean" and then add a new entry for "Crops". Don't forget!

Step 3: Put it all back together again

We will take the improved data about land use and re-incorporate it into the full dataset.
• In your first worksheet, where you have been progressively cleaning the data, insert 15 columns to the right of Column F. Take the data from the worksheet in which you validated the production category data and copy it across into the new columns. If you didn't include column headings, remember to paste starting in Row 2. You'll be able to see the original data in Column F, and the separated and cleaned data in Columns G to S.
• What we need to do now is put the clean data back into a single cell. We can do this with the CONCATENATE function, which allows you to take data from different cells and blend them into one. For example:

Row | F | G | Formula | Output
22 | Cereal | Palm oil | =CONCATENATE(F22,G22) | CerealPalm oil
23 | Cereal | Palm oil | =CONCATENATE(F22," ",G22) | Cereal Palm oil
24 | Cereal | Palm oil | =CONCATENATE(F22,", ",G22) | Cereal, Palm oil
25 | Cereal | Palm oil | =CONCATENATE(F22," and ",G22) | Cereal and Palm oil

• We can blend data from other cells (which we called 'referencing') with other text to make it more readable. The examples above show how this works.
• Now let's apply it to the data in our spreadsheet. In cell T2 insert the following formula: =CONCATENATE(G2,", ",H2,", ",I2,", ",J2,", ",K2,", ",L2,", ",M2,", ",N2,", ",O2,", ",P2,", ",Q2,", ",R2,", ",S2). The output looks something like "Milk, Olives, potatoes, , , , , , ".
• It looks messy, but follow the logic: it looks across row 2 in columns G to S for data and prints it with a comma in between. We don't know precisely which cells have data in them, so there is a run of trailing commas that print where a cell is empty. We can get rid of these easily with the LEFT and SEARCH formulas we used above. In cell U2, a formula along these lines does it: =LEFT(T2,SEARCH(", ,",T2)-1), which turns "Milk, Olives, potatoes, , , , , , " into "Milk, Olives, potatoes".
• With these two formulas in place, they can be copied down the spreadsheet to complete the operation. The only row this doesn't work for is row 94, Temasek's purchase of land in China. You can correct it by hand :)
• The final step is to make the new data usable. If you're confident the work is done, insert a column next to the original Column F, which is called Production, and call this new Column G "Production (cleaned)". Select all the data in Column T (rows 2 to 417), copy it and use Paste Special to insert it into Column G.
• You can now either delete or hide columns H to W, in which you've been working.

Step 4: Using the cleaned data about types of production land use

You can repeat the steps outlined in Problem 4 above, using the standard filters to build queries on the cleaned data in Column G. This time you will not have problems with inconsistencies, mis-spellings and so on.

EXPERIMENT 4

Aim: Introduction to WEKA tool

WEKA stands for Waikato Environment for Knowledge Analysis. It was developed by the University of Waikato, New Zealand. WEKA is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. WEKA supports many data mining tasks such as data preprocessing, classification, clustering, regression and feature selection, to name a few. [Diagram: an overview of what WEKA offers, starting from raw data.]

Installation of Weka

You can download Weka from the official website [Link]. Once the download is complete, run the exe file and choose the default set-up.

Weka data formats

Weka uses the Attribute-Relation File Format (ARFF) for data analysis by default, but data can also be imported from the following formats:
• CSV
• ARFF
• Database using ODBC

Attribute-Relation File Format (ARFF): This has two parts:
1) The header section defines the relation (data set) name, the attribute names and their types.
2) The data section lists the data instances.
An ARFF file requires the declaration of the relation, the attributes and the data.
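Purely as an illustration of the layout just described (the relation name, attribute names and values below are invented for this sketch and are not taken from any particular Weka sample file), a minimal ARFF file could look like this:

@relation weather_example

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute play {yes, no}

@data
sunny, 85, 85, no
overcast, 83, 86, yes
rainy, 70, 96, yes

Everything above @data is the header section (the relation name plus the attribute declarations); everything after it is the data section, one instance per line with attribute values in the declared order.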
• @relation: This is the first line in any ARFF file, written in the header section, followed by the relation/data set name. The relation name must be a string and, if it contains spaces, it should be enclosed in quotes.
• @attribute: Attributes are declared in the header section with their names and their type or range. Weka supports the following data types for attributes: numeric, nominal, string and date.
• @data: This is defined in the data section and is followed by the list of all data instances.

The workflow of WEKA is as follows: Data → Pre-processing → Data Mining → Knowledge.

Getting started with WEKA

Choose "WEKA 3.8.4" from Programs. The first interface that appears looks like the one given below. [Screenshot: the Weka GUI Chooser window, with Program, Visualization, Tools and Help menus and buttons for the applications listed here.]
• Explorer: An environment for exploring data. It supports data preprocessing, attribute selection, learning and visualization.
• Experimenter: An environment for performing experiments and conducting statistical tests between machine learning algorithms.
• Knowledge Flow: Similar to Explorer, but with a drag-and-drop interface. It gives a visual design of the KDD process.
• Simple CLI: Provides a simple command-line interface for executing WEKA commands.

WEKA Tools

• Preprocessing filters: The data file needs to be loaded first. [Screenshot: the Preprocess tab with a data file loaded.] The supported data formats are ARFF, CSV, C4.5 and binary; alternatively, data can be imported from a URL or an SQL database. After loading the data, preprocessing filters can be used for adding/removing attributes, discretization, sampling, randomizing, etc.
• Select attributes: WEKA has a very flexible combination of search and evaluation methods for a dataset's attributes. Search methods include Best-first, Ranker, Genetic search, etc. Evaluation measures include InformationGain, GainRatio, ReliefF, etc.
• Classification: The predicted target must be categorical. WEKA includes methods such as Decision Trees, Naive Bayes and Neural Networks, to name a few. Evaluation methods include a separate test data set and cross-validation.
• Clustering: The learning process finds groups (clusters) in the data. Methods include k-means, Cobweb and FarthestFirst.
• Regression: The predicted target is continuous. Methods such as linear regression, Neural Networks and Regression trees are included in the library.

Exercise (using a built-in dataset)

1. Click Explorer on the first interface screen and load a dataset from the library. [Screenshot: the Explorer Preprocess tab after loading the dataset.] Given here is an illustration for the dataset '[Link]'.
2. Click on each attribute to visualize the distribution of the samples for it. You can also visualize all of them at the same time by clicking 'Visualize All' on the right pane. [Screenshot: histograms of all attributes.]
3. Under the Classify tab, click 'Choose' and select a classifier from the drop-down menu, e.g. 'Decision Stump'.
4. Once a classifier is chosen, select Percentage split and leave it at its default values. The default ratio is 66% for training and 34% for testing.
5. Click 'Start' to train and test the classifier. [Screenshot: the Classify tab after training.]
6. You could also try using the 'Cross-validation' method to train and test the data.
7. The right pane shows the results for training and testing.
It also indicates the number of correctly classified and misclassified samples.
8. You can right-click on the model generated and perform various operations; you can also save the model if you want. [Screenshot: the result-list right-click menu, with options such as View in main window, View in separate window, Save result buffer, Load model, Save model, Re-evaluate model on current test set, Visualize classifier errors and Visualize margin curve.] Another performance measure is the ROC curve, which can be viewed from this menu; select "bad" in the class option to view the curve.
9. Click on the 'Select Attributes' tab to analyse the attributes. A number of 'Attribute Evaluator' and 'Search Method' combinations can be used to gain insight into the attributes. [Screenshot: the Select Attributes tab with an evaluator and search method chosen.]
10. Click on the Visualize tab to see the pairwise relationships of the attributes.

Once the model has been trained and tested, we need to measure its performance. For this purpose we used three measures, namely precision, recall and accuracy.
• True positives (tp): 231
• True negatives (tn): 37
• False positives (fp): 53
• False negatives (fn): 19
Precision (P) = tp/(tp+fp) = 231/(231+53) = 0.813
Recall (R) = tp/(tp+fn) = 231/(231+19) = 0.924
Accuracy (A) = (tp+tn)/total number of samples = (231+37)/340 = 0.788235, or 78.8235%
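The same figures can be checked with a few lines of Python; this is a minimal sketch that simply recomputes the three measures from the confusion-matrix counts listed above.

# Recompute precision, recall and accuracy from the counts above.
tp, tn, fp, fn = 231, 37, 53, 19

precision = tp / (tp + fp)                   # 231 / 284
recall = tp / (tp + fn)                      # 231 / 250
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 268 / 340

print(f"Precision = {precision:.3f}")        # 0.813
print(f"Recall    = {recall:.3f}")           # 0.924
print(f"Accuracy  = {accuracy:.4%}")         # 78.8235%

Running it confirms the values worked out by hand: precision ≈ 0.813, recall = 0.924 and accuracy ≈ 0.7882 (78.82%).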
