DM Lesson4

Data preprocessing is a crucial step in data mining that transforms raw data into a usable format by addressing issues like incompleteness and inconsistencies. It involves several processes including data cleaning, integration, transformation, reduction, and discretization, which prepare data for analysis in various applications such as finance, retail, telecommunications, and biological research. The document also discusses the importance of data mining systems, their scalability, visualization tools, and current trends in the field.

DATA PREPROCESSING

• Data preprocessing is a data mining technique that involves transforming raw
data into an understandable format.

• Real-world data is often incomplete, inconsistent, and/or lacking in
certain behaviors or trends, and is likely to contain many errors.

• Data preprocessing is a proven method of resolving such issues.

• Data preprocessing prepares raw data for further processing.

• Data preprocessing is used in database-driven applications such as customer
relationship management and in rule-based applications (like neural networks).
Data goes through a series of steps during preprocessing:

• Data Cleaning: Data is cleansed through processes such as filling in missing
values, smoothing noisy data, or resolving inconsistencies in the data.
• Data Integration: Data with different representations are put together and
conflicts within the data are resolved.
• Data Transformation: Data is normalized, aggregated and generalized.
• Data Reduction: This step aims to present a reduced representation of
the data in a data warehouse.
• Data Discretization: Involves reducing the number of values of a continuous
attribute by dividing the range of the attribute into intervals.
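
Three of the steps above can be illustrated with a minimal Python sketch. The slides do not prescribe specific methods, so the choices here are assumptions made for illustration: mean imputation for cleaning, min-max scaling for normalization, and equal-width binning for discretization. The function names and the `ages` sample are hypothetical.

```python
from statistics import mean

def fill_missing(values):
    """Data cleaning: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_normalize(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant attribute: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def equal_width_bins(values, n_bins):
    """Data discretization: map each value to an interval index 0..n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [20, None, 40, 60]             # hypothetical attribute with one missing value
cleaned = fill_missing(ages)          # [20, 40, 40, 60]
normalized = min_max_normalize(cleaned)
binned = equal_width_bins(cleaned, 2)
```

Other choices (median imputation, z-score normalization, equal-frequency binning) fit the same step descriptions; which one is appropriate depends on the data.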
Integration of a data mining system with a data warehouse:
DATA MINING APPLICATIONS

Here is the list of areas where data mining is widely used −


• Financial Data Analysis
• Retail Industry
• Telecommunication Industry
• Biological Data Analysis
• Other Scientific Applications
• Intrusion Detection
Financial Data Analysis
Financial data in the banking and financial industry is generally reliable and of
high quality, which facilitates systematic data analysis and data mining.
Some of the typical cases are as follows −
• Design and construction of data warehouses for multidimensional data
analysis and data mining.
• Loan payment prediction and customer credit policy analysis.
• Classification and clustering of customers for targeted marketing.
• Detection of money laundering and other financial crimes.
Retail Industry
Data mining is widely applied in the retail industry because the industry collects large
amounts of data on sales, customer purchasing history, goods transportation, consumption
and services. The quantity of data collected will naturally continue to expand rapidly
because of the increasing ease, availability and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns and trends
that lead to improved quality of customer service and better customer retention and
satisfaction.
Here is the list of examples of data mining in the retail industry −

• Design and Construction of data warehouses based on the benefits of data mining.
• Multidimensional analysis of sales, customers, products, time and region.
• Analysis of effectiveness of sales campaigns.
• Customer Retention.
• Product recommendation and cross-referencing of items.
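
As a rough illustration of the multidimensional analysis item in this list, the following Python sketch totals sales along any chosen combination of dimensions. The records, the `FIELDS` layout and the `roll_up` helper are all hypothetical and exist only for this example; a real system would typically run such queries against a data warehouse.

```python
from collections import defaultdict

# Hypothetical sales records: (product, region, quarter, amount)
FIELDS = ("product", "region", "quarter")
sales = [
    ("laptop", "north", "Q1", 1200),
    ("laptop", "south", "Q1", 800),
    ("phone", "north", "Q1", 500),
    ("phone", "north", "Q2", 700),
]

def roll_up(records, dims):
    """Sum the amount column, grouped by the requested dimensions."""
    idx = [FIELDS.index(d) for d in dims]
    totals = defaultdict(int)
    for rec in records:
        key = tuple(rec[i] for i in idx)
        totals[key] += rec[3]         # rec[3] is the sales amount
    return dict(totals)

by_product = roll_up(sales, ["product"])
by_region_quarter = roll_up(sales, ["region", "quarter"])
```

Grouping by fewer dimensions gives a coarser summary, and adding dimensions back drills down into more detail.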
Telecommunication Industry

Today the telecommunication industry is one of the fastest-emerging
industries, providing various services such as fax, pager, cellular phone,
internet messenger, images, e-mail, web data transmission, etc.

Due to the development of new computer and communication technologies,
the telecommunication industry is rapidly expanding.

This is the reason why data mining has become very important for
understanding the business.

Data mining in the telecommunication industry helps in identifying telecommunication
patterns, catching fraudulent activities, making better use of resources, and improving
quality of service. Here is the list of examples for which data mining improves
telecommunication services −

• Multidimensional Analysis of Telecommunication data.


• Fraudulent pattern analysis.
• Identification of unusual patterns.
• Multidimensional association and sequential patterns analysis.
• Mobile Telecommunication services.
• Use of visualization tools in telecommunication data analysis.
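
One simple way to approach the identification of unusual patterns is to flag values that lie far from the mean. The sketch below uses a z-score test, which is an assumption made here for illustration rather than a method named in the slides. The `unusual_indices` helper, the threshold of 2.0 and the `daily_calls` sample are hypothetical.

```python
from statistics import mean, pstdev

def unusual_indices(values, threshold=2.0):
    """Return the positions of values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)            # population standard deviation
    if sigma == 0:                    # all values identical: nothing is unusual
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical number of calls per day for one subscriber
daily_calls = [10, 12, 11, 9, 10, 11, 95]
flagged = unusual_indices(daily_calls)  # the last day stands out
```

A single extreme value inflates both the mean and the standard deviation, so on small samples a robust alternative such as a median-based score may be preferable.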
Biological Data Analysis

In recent times, we have seen tremendous growth in fields of biology such as genomics,
proteomics, functional genomics and biomedical research. Biological data mining is a very
important part of bioinformatics. Following are the aspects in which data mining
contributes to biological data analysis −
• Semantic integration of heterogeneous, distributed genomic
and proteomic databases.
• Alignment, indexing, similarity search and comparative analysis
of multiple nucleotide sequences.
• Discovery of structural patterns and analysis of genetic networks and protein pathways.
• Association and path analysis.
• Visualization tools in genetic data analysis.
Other Scientific Applications

The applications discussed above tend to handle relatively small and homogeneous data
sets for which statistical techniques are appropriate. Huge amounts of data have been
collected from scientific domains such as geosciences and astronomy. Large data sets
are also being generated by fast numerical simulations in various fields such as
climate and ecosystem modelling, chemical engineering, and fluid dynamics.
Following are the applications of data mining in scientific domains −

• Data warehouses and data preprocessing.
• Graph-based mining.
• Visualization and domain-specific knowledge.
Intrusion Detection

Intrusion refers to any kind of action that threatens the integrity, confidentiality,
or availability of network resources. In this world of connectivity, security
has become a major issue. The increased usage of the internet and the availability
of tools and tricks for intruding into and attacking networks have prompted intrusion
detection to become a critical component of network administration.

Here is the list of areas in which data mining technology may be applied for
intrusion detection −
• Development of data mining algorithms for intrusion detection.
• Association and correlation analysis, aggregation to help select and
build discriminating attributes.
• Analysis of Stream data.
• Distributed data mining.
• Visualization and query tools.
Data Mining System Products

There are many data mining system products and domain-specific data mining applications.
New data mining systems and applications are being added to the existing ones.
Also, efforts are being made to standardize data mining languages.
• Data Sources − Data sources refer to the data formats on which a data mining system
will operate. Some data mining systems may work only on ASCII text files, while others
work on multiple relational sources. A data mining system should also support ODBC
connections or OLE DB for ODBC connections.
• Data Mining functions and methodologies − Some data mining systems provide only
one data mining function, such as classification, while others provide multiple
data mining functions such as concept description, discovery-driven OLAP analysis,
association mining, linkage analysis, statistical analysis, classification,
prediction, clustering, outlier analysis, similarity search, etc.
• Scalability − There are two scalability issues in data mining −

o Row (Database size) Scalability − A data mining system is considered row
scalable if, when the number of rows is enlarged 10 times, it takes no more than
10 times as long to execute the same query.
o Column (Dimension) Scalability − A data mining system is considered column
scalable if the mining query execution time increases linearly with the number of columns.
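
The two scalability criteria can be checked against measured timings, as in this hedged Python sketch. The function names, the sample timings and the 20% tolerance are assumptions. The column check also simplifies "increases linearly" to "time per column stays roughly constant", which ignores any fixed start-up cost.

```python
def is_row_scalable(time_base, time_10x_rows):
    """Row scalability: 10 times the rows should cost at most 10 times the time."""
    return time_10x_rows <= 10 * time_base

def is_column_scalable(timings, tolerance=0.2):
    """Column scalability (simplified): time per column stays within
    `tolerance` of the first measurement.

    `timings` is a list of (number_of_columns, seconds) pairs.
    """
    per_column = [seconds / cols for cols, seconds in timings]
    base = per_column[0]
    return all(abs(p - base) <= tolerance * base for p in per_column)

# Hypothetical measurements, in seconds
row_ok = is_row_scalable(3.0, 27.0)        # 27 s <= 30 s
row_bad = is_row_scalable(3.0, 45.0)       # 45 s > 30 s
col_ok = is_column_scalable([(10, 2.0), (20, 4.1), (40, 8.3)])
col_bad = is_column_scalable([(10, 2.0), (20, 8.0)])
```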
• Visualization Tools − Visualization in data mining can be
categorized as follows −
o Data Visualization
o Mining Results Visualization
o Mining process visualization
o Visual data mining
• Data Mining query language and graphical user interface
− An easy-to-use graphical user interface is important to promote
user-guided, interactive data mining. Unlike relational database
systems, data mining systems do not share a common underlying
data mining query language.
Trends in Data Mining

Data mining concepts are still evolving, and here are the latest trends
seen in this field −
• Application exploration.
• Scalable and interactive data mining methods.
• Integration of data mining with database systems, data warehouse
systems and web database systems.
• Standardization of data mining query language.
• Visual data mining.
• New methods for mining complex types of data.
• Biological data mining.
• Data mining and software engineering.
• Web mining.
• Distributed data mining.
• Real-time data mining.
• Multi-database data mining.
• Privacy protection and information security in data mining.
