BY: MOHIT YADAV (096), JAYEETA CHATTERJEE (101), MONIKA KATARIA (112)
MEANING OF DATA MINING
Data mining, the analysis step of the knowledge discovery in databases (KDD) process, is a relatively young and interdisciplinary field of computer science. It is the process of discovering new patterns in large data sets using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.
ROLE OF DATA MINING
Extract, transform, and load transaction data onto the data warehouse system.
Store and manage the data in a multidimensional database system.
Provide data access to business analysts and information technology professionals.
Analyze the data with application software.
Present the data in a useful format, such as a graph or table.
EXAMPLE OF DATA MINING
ADVANTAGES AND DISADVANTAGES OF DATAMINING
Advantages (by application area):
Marketing / Retail
Finance / Banking
Manufacturing
Government
Disadvantages:
Privacy Issues
Security Issues
Misuse of Information / Inaccurate Information
MEMORY BASED REASONING TECHNIQUE - MEANING
Memory-Based Reasoning (MBR) tries to mimic human behavior in an automatic way. Memories of specific events are used directly to make decisions, rather than indirectly (as in systems that use experience to infer rules). MBR is a two-step procedure: first, identify similar cases from experience; second, apply the information from these cases to new cases. MBR is particularly well suited to non-numerical data. It needs a distance measure to quantify the dissimilarity of two observations and a combination function to combine the results from the neighboring points into an answer. Generating examples is much easier than generating rules, which is what makes MBR attractive. However, applying rules to new observations is much easier and faster than comparing new cases to a large store of memorized objects.
The human ability to reason from experience depends on the ability to recognize appropriate examples from the past. A doctor diagnosing diseases and a claims analyst identifying fraudulent insurance claims each first identify similar cases from experience and then apply knowledge of those examples to the problem at hand. This is the essence of memory-based reasoning. A database of known records is searched to find preclassified records similar to a new record. These neighbors are used for classification and estimation.
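Since MBR is described above as well suited to non-numerical data, a distance measure for such records might look like the sketch below. It uses a simple overlap distance (the fraction of fields on which two records disagree), which is one common choice and not necessarily the one intended here. The function name `overlap_distance` and the insurance-claim fields are hypothetical, invented only for illustration.

```python
def overlap_distance(a, b):
    """Fraction of fields on which two records disagree (0.0 means identical)."""
    if not a or a.keys() != b.keys():
        raise ValueError("records must be non-empty and share the same fields")
    mismatches = sum(1 for field in a if a[field] != b[field])
    return mismatches / len(a)


# Hypothetical insurance-claim records, invented for illustration only.
known_claim = {"claim_type": "theft", "region": "north", "reported_by": "phone"}
new_claim = {"claim_type": "theft", "region": "south", "reported_by": "phone"}

# The two records differ on one field out of three.
print(overlap_distance(known_claim, new_claim))
```

A smaller value means the known case is a closer match, so it would carry more weight when the new case is judged.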
ELEMENTS OF MBR
MBR uses known instances of a model to predict unknown instances. It maintains a dataset of known records. When a new record arrives for evaluation, the algorithm finds neighbors similar to the new record, which helps in:
Prediction
Classification
HOW IT WORKS?
When a new record arrives, the tool first calculates the distance between the new record and each record in the training dataset. The distance function does this calculation. The result determines which training records qualify to be considered neighbors.
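The neighbor search described above can be sketched as follows. This is a minimal illustration, assuming numeric features, Euclidean distance, and a fixed number of neighbors `k`; the names `euclidean_distance` and `find_neighbors` and the toy training set are hypothetical and not taken from any particular tool.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_neighbors(new_record, training_set, k=3):
    """Return the k (distance, label) pairs closest to new_record."""
    scored = [
        (euclidean_distance(new_record, features), label)
        for features, label in training_set
    ]
    scored.sort(key=lambda pair: pair[0])  # nearest first
    return scored[:k]


# Toy training set (features, label), invented for illustration only.
training_set = [
    ((1.0, 1.0), "A"),
    ((1.0, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 5.0), "B"),
]

neighbors = find_neighbors((1.5, 1.5), training_set, k=2)
print([label for _, label in neighbors])
```

Every training record is scanned for each new record, which is the source of the computational cost noted later in the deck.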
SOLVING A DATA MINING PROBLEM USING MBR
Selecting the most suitable historical records to form the training or base dataset.
Establishing the best way to compose the historical record.
Determining the two essential functions:
Distance Function
Combination Function
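Once the distance function has produced a set of neighbors, the combination function turns them into a single answer. The sketch below shows two common options, a plain majority vote and a distance-weighted vote; both are illustrative assumptions rather than a prescribed method. The function names, the `epsilon` parameter, and the neighbor distances are hypothetical.

```python
from collections import Counter


def majority_vote(neighbors):
    """neighbors: list of (distance, label). Return the most common label."""
    counts = Counter(label for _, label in neighbors)
    return counts.most_common(1)[0][0]


def weighted_vote(neighbors, epsilon=1e-9):
    """Each neighbor votes with weight 1 / distance, so closer ones count more."""
    weights = {}
    for distance, label in neighbors:
        # epsilon avoids division by zero for an exact match
        weights[label] = weights.get(label, 0.0) + 1.0 / (distance + epsilon)
    return max(weights, key=weights.get)


# Hypothetical neighbors: one very close record and two distant ones.
neighbors = [(0.5, "honest"), (2.0, "crooked"), (2.5, "crooked")]

print(majority_vote(neighbors))  # two of the three neighbors say "crooked"
print(weighted_vote(neighbors))  # the single close neighbor outweighs them
```

The two functions disagree on this input, which illustrates why results depend on the choice of combination function.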
MBR APPLICATIONS
Fraud detection
Customer response prediction
Medical treatments
Classifying responses: MBR can process free-text responses and assign codes.
PREDICTIVE DATA MINING USED IN MBR
Honest: Tridas, Vickie, Mike
Crooked: Wally, Waldo, Barney
PREDICTION
Tridas
Vickie
Mike
Honest = has round eyes and a smile
ADVANTAGES
Can use data as is.
Adapts easily to new data. Adding or deleting examples has no side effects.
Explanation of answers is based on real examples.
Can be applied to ordered data as well as nominal and ratio data.
High parallelism is possible.
DISADVANTAGES
Resource intensive
Cannot produce an answer that does not already exist in the example database.
Prediction accuracy strongly depends on the definition of similarity.
Choosing appropriate historical data for use in training is difficult.
Choosing the most efficient way to represent the training data is difficult.
Choosing the distance function, combination function, and number of neighbors is difficult.
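One way to approach the choice of the number of neighbors is to test several values against the known records themselves. The sketch below uses leave-one-out evaluation, which is an assumption on our part and only one of several possible validation schemes. The names `classify` and `leave_one_out_accuracy` and the toy dataset are hypothetical.

```python
import math
from collections import Counter


def classify(new_record, training_set, k):
    """Majority vote among the k training records nearest to new_record."""
    ranked = sorted(training_set, key=lambda item: math.dist(new_record, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(dataset, k):
    """Classify each record using all the others; return the fraction correct."""
    correct = 0
    for i, (features, label) in enumerate(dataset):
        rest = dataset[:i] + dataset[i + 1:]  # hold out record i
        if classify(features, rest, k) == label:
            correct += 1
    return correct / len(dataset)


# Toy dataset with two well-separated groups, invented for illustration only.
dataset = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]

for k in (1, 3, 5):
    print(k, leave_one_out_accuracy(dataset, k))
```

On this toy data, k = 1 and k = 3 classify every held-out record correctly, while k = 5 misclassifies all of them, because each record is then outvoted by the three records of the other group.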
CONCLUSION
It produces results that are readily understandable.
It is applicable to arbitrary data types, even nonrelational data.
It works efficiently on almost any number of fields.
Maintaining the training set requires a minimal amount of effort.
It is computationally expensive when doing classification and prediction.
It requires a large amount of storage for the training set.
Results can be dependent on the choice of distance function, combination function, and number of neighbors.