• Data mining, also known as knowledge discovery in databases (KDD), is the process of uncovering
patterns and other valuable information from large data sets.
• Data mining can be used by corporations for everything from learning about what customers
are interested in or want to buy to fraud detection and spam filtering.
• It can help them to develop more effective marketing strategies, increase sales, and decrease
costs. Data mining relies on effective data collection, warehousing, and computer processing.
• It is also a market research tool that helps reveal the sentiment or opinions of a given group
of people.
• Social media companies use data mining techniques to commodify their users in order to
generate profit.
• Data mining is the process of discovering patterns, trends, and insights from large datasets
using various techniques from statistics, machine learning, and database systems.
• It involves extracting useful information from data, often with the goal of making informed
decisions or predictions.
Key Properties of Data Mining
• Automatic discovery of patterns
• Prediction of likely outcomes
• Creation of actionable information
• Focus on large datasets and databases
Techniques of Data Mining
• Anomaly detection (Outlier/change/deviation detection) – The identification of unusual
data records that might be interesting, or of data errors that require further investigation.
• Association rule learning (Dependency modelling) – Searches for relationships between
variables. For example, a supermarket might gather data on customer purchasing habits.
Using association rule learning, the supermarket can determine which products are
frequently bought together and use this information for marketing purposes. This is
sometimes referred to as market basket analysis.
• Clustering – The task of discovering groups and structures in the data that are in some way
"similar", without using known structures in the data.
• Classification – The task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
• Regression – Attempts to find a function that models the data with the least error.
• Summarization – Providing a more compact representation of the data set, including
visualization and report generation.
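As a rough illustration of market basket analysis, the sketch below counts how often pairs of items appear together and estimates the confidence of a simple rule. The transactions, item names, and the helper names `pair_support` and `confidence` are all hypothetical; practical systems typically rely on dedicated algorithms such as Apriori rather than this brute-force count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, invented purely for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"bread", "milk"},
    {"bread", "butter", "milk"},
]

def pair_support(transactions):
    """Return the fraction of transactions that contain each unordered item pair."""
    counts = Counter()
    for basket in transactions:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(transactions)
    return {pair: count / total for pair, count in counts.items()}

def confidence(transactions, antecedent, consequent):
    """Estimate how often `consequent` is bought when `antecedent` is bought."""
    with_antecedent = [b for b in transactions if antecedent in b]
    if not with_antecedent:
        return 0.0
    both = sum(1 for b in with_antecedent if consequent in b)
    return both / len(with_antecedent)

support = pair_support(transactions)
most_common_pair = max(support, key=support.get)
print(most_common_pair, support[most_common_pair])
print(confidence(transactions, "butter", "milk"))
```

In this toy data set, every basket that contains butter also contains milk, so the rule "butter implies milk" has a confidence of 1.0, while bread and milk form the most frequently co-occurring pair.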
How Data Mining Works
1. Data is collected and loaded into data warehouses on site or on a cloud service.
2. Business analysts, management teams, and information technology professionals access the
data and determine how they want to organize it.
3. Custom application software sorts and organizes the data.
4. The end user presents the data in an easy-to-share format, such as a graph or table.
Knowledge Discovery in Databases (KDD)
• Some people treat data mining as synonymous with knowledge discovery, while others view
data mining as an essential step in the process of knowledge discovery. That process
consists of the following steps:
1. Data Cleaning - In this step noise and inconsistent data are removed.
2. Data Integration - In this step multiple data sources are combined.
3. Data Selection - In this step data relevant to the analysis task are retrieved from the database.
4. Data Transformation - In this step data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
5. Estimate the model - The selection and implementation of the appropriate data-mining
technique is the main task in this phase.
6. Interpret the model and draw conclusions.
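The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the two data sources, the field names, and the helpers `clean`, `integrate`, `select`, and `transform` are hypothetical, and the "model" in step 5 is deliberately trivial (a mean) so that the flow of the process stays visible.

```python
from statistics import mean

# Hypothetical records from two sources; all names and values are invented.
store_sales = [
    {"customer": "a", "amount": 20.0},
    {"customer": "b", "amount": None},  # missing value, treated as noise
    {"customer": "a", "amount": 30.0},
]
online_sales = [
    {"customer": "b", "amount": 10.0},
    {"customer": "c", "amount": 50.0},
]

def clean(records):
    """Step 1, data cleaning: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Step 2, data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, customers):
    """Step 3, data selection: keep only records relevant to the analysis."""
    return [r for r in records if r["customer"] in customers]

def transform(records):
    """Step 4, data transformation: aggregate amounts per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals

records = integrate(clean(store_sales), clean(online_sales))
selected = select(records, {"a", "b"})
totals = transform(selected)

# Step 5, a deliberately trivial "model": the mean total spend per customer.
average_spend = mean(totals.values())

# Step 6, interpretation: report which customers spend more than average.
above_average = sorted(c for c, t in totals.items() if t > average_spend)
print(totals, average_spend, above_average)
```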
Data mining process
• Setting objectives
• Data gathering
• Data preparation
    1. Outlier detection (and removal) – Outliers are unusual data values that are not
    consistent with most observations.
    2. Scaling, encoding, and selecting features – Data preprocessing includes several
    steps such as variable scaling and different types of encoding. For example, a
    feature with the range [0, 1] and another with the range [−100, 1000] will not carry
    the same weight in the applied technique, and they will influence the final
    data-mining results differently.
• Applying data mining algorithms
    • Supervised learning
        • Classification
        • Regression
    • Unsupervised learning
• Evaluating results
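The two data preparation steps in this process can be illustrated with a short sketch. It assumes a z-score rule with a threshold of 2.0 for outlier removal and min-max scaling to the range [0, 1]; both are common choices rather than the only ones, and the sample values and function names are hypothetical.

```python
from statistics import mean, pstdev

def remove_outliers(values, threshold=2.0):
    """Drop values whose z-score magnitude exceeds the (assumed) threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return list(values)  # all values identical, nothing to remove
    return [v for v in values if abs(v - mu) / sigma <= threshold]

def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature carries no information
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical sensor readings with one value far from the rest.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 95]
cleaned = remove_outliers(readings)

# A feature on the wide range mentioned above, brought onto [0, 1].
wide_feature = [-100, 450, 1000]
scaled = min_max_scale(wide_feature)

print(cleaned)
print(scaled)
```

After scaling, the wide feature lies on the same [0, 1] range as the narrow one, so a distance-based technique no longer lets it dominate simply because of its units.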
Supervised learning
• Supervised learning is an approach to machine learning that uses labeled data sets to train
algorithms in order to properly classify data and predict outcomes.
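To make the idea of learning from labeled data concrete, here is a minimal sketch of a one-nearest-neighbour classifier for the spam example mentioned earlier. The two features (number of links and number of exclamation marks), the training examples, and the function name `predict` are assumptions made for illustration; this is one simple supervised technique among many, not a recommended spam filter.

```python
import math

# Hypothetical labeled training data: (links, exclamation_marks) -> label.
training = [
    ((0, 0), "legitimate"),
    ((1, 0), "legitimate"),
    ((0, 1), "legitimate"),
    ((7, 5), "spam"),
    ((9, 4), "spam"),
    ((8, 6), "spam"),
]

def predict(features, training):
    """Return the label of the training example closest to `features`."""
    nearest = min(training, key=lambda example: math.dist(features, example[0]))
    return nearest[1]

print(predict((8, 5), training))
print(predict((1, 1), training))
```

The labels in `training` are what make this supervised: the classifier can only assign categories that a person has already attached to the examples.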
Challenges of supervised learning
• Supervised learning models can require certain levels of expertise to structure accurately.
• Training supervised learning models can be very time-intensive.
• Labeled datasets are prone to human error, which can result in algorithms learning
incorrectly.
• Unlike unsupervised learning models, supervised learning cannot cluster or classify data on
its own.