
UNIT 3

INTRODUCTION TO DATA MINING

Data Mining: Concepts and applications - Data mining process - Text & Web Analytics:
Text analytics and text mining overview- Text mining applications- Web mining
overview- Social media analytics- Sentiment analysis overview- Big Data Analytics:
Definition and characteristics of big data- Fundamentals of big data analytics

3.1 Data Mining: Concepts and applications


Definition of Data Mining:
Data mining is the process of discovering patterns, correlations, anomalies, and other
meaningful insights from large datasets using various computational methods and techniques.
It involves extracting useful information and knowledge from data, which can then be used for
decision-making, predictive modeling, trend analysis, and other applications. Data mining
typically involves techniques from statistics, machine learning, and database systems to
uncover hidden patterns and relationships that may not be immediately apparent. The goal is
to transform raw data into valuable and actionable information.
What is data mining?
Data mining is an important skill for IT professionals, and a degree in data analytics can help
qualify you for a career in data mining. But everyone in business also needs to
understand data mining—it is vital to how many business processes are done and how
information is gleaned, so current and aspiring business professionals need to understand how
this process works as well.
Simply put, data mining is the process that companies use to turn raw data into useful
information. They utilize software to look for patterns in large batches of data so they can learn
more about customers. It pulls out information from data sets and compares it to help the
business make decisions. This eventually helps them to develop strategies, increase sales,
market effectively, and more.
Data mining sometimes gets confused with machine learning and data analysis, but these terms
are distinct.
While both data mining and machine learning use patterns and analytics, data mining looks for
patterns that already exist in data, while machine learning goes beyond to predict future
outcomes based on the data. In data mining, the “rules” or patterns aren’t known from the start.
In many cases of machine learning, the machine is given a rule or variable to understand the
data. Additionally, data mining relies on human intervention and decisions, but machine
learning is meant to be started by a human and then learn on its own. There is quite a bit of
overlap between data mining and machine learning; machine learning processes are often
utilized in data mining in order to automate parts of the process.
Similarly, data analysis and data mining aren’t interchangeable terms. Data mining is used in
data analytics, but they aren’t the same. Data mining is the process of getting the information
from large data sets, and data analytics is when companies take this information and dive into
it to learn more. Data analysis involves inspecting, cleaning, transforming, and modeling data.
The ultimate goal of analysis is discovering useful information, informing conclusions, and
making decisions.
Data mining, data analysis, artificial intelligence, machine learning, and many other terms are
all combined in business intelligence processes that help a company or organization make
decisions and learn more about their customers and potential outcomes.

3.1.1 Overview of the data mining process:


Almost all businesses use data mining, and it’s important to understand the data mining process
and how it can help a business make decisions.
Business understanding. The first step to successful data mining is to understand the overall
objectives of the business, then be able to convert this into a data mining problem and a plan.
Without an understanding of the ultimate goal of the business, you won’t be able to design a
good data mining algorithm. For example, a supermarket may want to use data mining to learn
more about their customers. The business understanding is that a supermarket is looking to find
out what their customers are buying the most.
Data understanding. After you know what the business is looking for, it’s time to collect data.
There are many complex ways that data can be obtained from an organization, organized,
stored, and managed. Data mining involves getting familiar with the data, identifying any
issues, getting insights, or observing subsets. For example, the supermarket may use a rewards
program where customers can input their phone number when they purchase, giving the
supermarket access to their shopping data.
Data Preparation. Data preparation involves getting the information production ready. This
is often the most time-consuming part of data mining: raw, machine-formatted data must be
converted into a form that people can understand and quantify. Transforming and cleaning the
data for modeling is key for this step.
Modeling. In the modeling phase, mathematical models are used to search for patterns in the
data. There are usually several techniques that can be used for the same set of data. There is a
lot of trial and error involved in modeling.
Evaluation. When the model is complete, it needs to be carefully evaluated and the steps to
make the model need to be reviewed, to ensure it meets the business objectives. At the end of
this phase, a decision about the data mining results will be made. In the supermarket example,
the data mining results will provide a list of what the customer has purchased, which is what
the business was looking for.
Deployment. This can be a simple or complex part of data mining, depending on the output of
the process. It can be as simple as generating a report, or as complex as creating a repeatable
data mining process to happen regularly.
After the data mining process has been completed, a business will be able to make their
decisions and implement changes based on what they have learned.
3.2 Data mining Process/Techniques:
Businesses collect and store an unimaginable amount of data, but how do they turn all that data
into insights that help them build a better business? Data mining, the process of sifting through
massive amounts of data to identify hidden business trends or patterns, makes these
transformational business insights possible.
Data mining is not a new technology. The advent of modern computers and the application of
data mining techniques meant businesses could finally analyze enormous amounts of data and
extract non-intuitive, valuable insights: forecasting likely business outcomes, mitigating risks,
and taking advantage of newly identified opportunities.
3.2.1 Data mining techniques in business analytics:
Now that you understand why data mining is important, it’s beneficial to see how data mining
works specifically in business settings.
Here are 12 data mining techniques that we will explore in detail:

1. Clustering
2. Association
3. Data Cleaning
4. Data Visualization
5. Classification
6. Machine Learning
7. Neural Networks
8. Association Rules
9. Regression Analysis
10. Outlier Detection
11. Prediction
12. Data Warehousing

1. Clustering:
Clustering is a technique used to represent data visually — such as in graphs that show buying
trends or sales demographics for a particular product.

What Is Clustering in Data Mining?


Clustering refers to the process of grouping a series of different data points based on their
characteristics. By doing so, data miners can seamlessly divide the data into subsets, allowing
for more informed decisions in terms of broad demographics (such as consumers or users) and
their respective behaviors.
Methods for Data Clustering:
 Partitioning method: This involves dividing a data set into a group of specific
clusters for evaluation based on the criteria of each individual cluster. In this method,
data points belong to just one group or cluster.
 Hierarchical method: With the hierarchical method, each data point starts as its own
cluster; the most similar clusters are then merged step by step. The newly created
clusters can then be analyzed separately from each other.
 Density-based method: A machine learning method where data points plotted closely
together are grouped into clusters, while isolated data points are labeled “noise” and
discarded.
 Grid-based method: This involves dividing data into cells on a grid, which then can
be clustered by individual cells rather than by the entire database. As a result, grid-based
clustering has a fast processing time.
 Model-based method: In this method, models are created for each data cluster to locate
the best data to fit that particular model.
Examples of Clustering in Business
Clustering helps businesses manage their data more effectively. For example, retailers can use
clustering models to determine which customers buy particular products, on which days, and
with what frequency. This can help retailers target products and services to customers in a
specific demographic or region.
Clustering can help grocery stores group products by a variety of characteristics (brand, size,
cost, flavor, etc.) and better understand their sales tendencies. It can also help car insurance
companies that want to identify a set of customers who typically have high annual claims in
order to price policies more effectively. In addition, banks and financial institutions might use
clustering to better understand how customers use in-person versus virtual services to better
plan branch hours and staffing.
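As a simple illustration of the partitioning method, the following Python sketch uses k-means
from scikit-learn to split customer records into two clusters. The feature values (annual spend,
visits per month) are invented for illustration.

# A minimal sketch of the partitioning method: k-means with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

# Each row is one customer: [annual spend in dollars, store visits per month]
customers = np.array([
    [500, 2], [520, 3], [480, 2],       # low-spend, infrequent shoppers
    [5200, 12], [4900, 10], [5100, 11], # high-spend, frequent shoppers
])

# Partition into two clusters; each point belongs to exactly one cluster
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(customers)
print(kmeans.labels_)           # cluster assignment for each customer
print(kmeans.cluster_centers_)  # the "average" customer of each cluster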
2. Association
Association rules are used to find correlations, or associations, between points in a data set.

What Is Association in Data Mining?


Data miners use association to discover unique or interesting relationships between variables
in databases. Association is often employed to help companies determine marketing research
and strategy.

Methods for Data Mining Association


Two primary approaches to association in data mining are the single-dimensional and multi-
dimensional methods.

 Single-dimensional association: This involves looking for one repeating instance of a
data point or attribute. For instance, a retailer might search its database for the instances
in which a particular product was purchased.
 Multi-dimensional association: This involves looking for more than one data point in
a data set. That same retailer might want to know more than what a customer
purchased — such as the customer’s age or method of purchase (cash or credit card).
Examples of Association in Business:
The analysis of impromptu shopping behavior is an example of association — that is, retailers
notice in data studies that parents shopping for childcare supplies are more likely to purchase
specialty food or beverage items for themselves during the same trip. These purchases can be
analyzed through statistical association.
Association analysis carries many other uses in business. For retailers, it’s particularly helpful
in making purchasing suggestions. For example, if a customer buys a smartphone, tablet, or
video game device, association analysis can recommend related items like cables, applicable
software, and protective cases.
Additionally, association is used by the government to employ census data and plan for public
services; it is also used by doctors to diagnose various illnesses and conditions more effectively.
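To make the idea concrete, here is a minimal, library-free Python sketch of association-rule
counting over invented shopping baskets. It computes the support and confidence of item pairs,
which is the core of what dedicated tools automate at scale.

# A toy association-rule computation over invented transactions.
from itertools import combinations
from collections import Counter

baskets = [
    {"diapers", "baby food", "chocolate"},
    {"diapers", "chocolate"},
    {"bread", "milk"},
    {"diapers", "baby food", "snacks"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

n = len(baskets)
for (a, b), joint in pair_counts.items():
    support = joint / n                  # how often a and b co-occur
    confidence = joint / item_counts[a]  # P(b in basket | a in basket)
    if support >= 0.5:
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")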

3. Data Cleaning
Data cleaning is the process of preparing data to be mined.

What Is Data Cleaning in Data Mining?


Data cleaning involves organizing data, eliminating duplicate or corrupted data, and filling in
any null values. When this process is complete, the most useful information can be harvested
for analysis.
Methods for Data Cleaning
 Verifying the data: This involves checking that each data point in the data set is in the
proper format (e.g., telephone numbers, social security numbers).
 Converting data types: This ensures data is uniform across the data set. For instance,
numeric variables only contain numbers, while string variables can contain letters,
numbers, and characters.
 Removing irrelevant data: This clears useless or inapplicable data so full emphasis
can be placed on necessary data points.
 Eliminating duplicate data points: This helps speed up the mining process by
boosting efficiency and reducing errors.
 Removing errors: This eliminates typing mistakes, spelling errors, and input errors
that could negatively affect analysis outcomes.
 Completing missing values: This provides an estimated value for all data and reduces
missing values, which can lead to skewed or incorrect results.

Examples of Data Cleaning in Business


According to Experian, 95 percent of businesses say they have been impacted by poor data
quality. Working with incorrect data wastes time and resources, increases analysis costs
(because models need to be repeated), and often leads to faulty analytics.
Ultimately, no matter how great their models or algorithms are, businesses suffer when their
data is incorrect, incomplete, or corrupted.
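The pandas library is one common way to script these cleaning steps. The sketch below, using
invented customer records, removes duplicates, converts a column to a numeric type, fills a
missing value, and normalizes a phone-number format.

# A small data-cleaning sketch with pandas; the records are invented.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "phone": ["555-0101", "555-0102", "555-0102", None, "5550104"],
    "monthly_spend": ["120", "85", "85", "200", None],
})

df = df.drop_duplicates()                                 # eliminate duplicate rows
df["monthly_spend"] = pd.to_numeric(df["monthly_spend"])  # convert data types
df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())  # complete missing values
df["phone"] = df["phone"].str.replace(r"^(\d{3})(\d{4})$", r"\1-\2", regex=True)  # normalize format
print(df)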

4. Data Visualization
Data visualization is the translation of data into graphic form to illustrate its meaning to
business stakeholders.

What Is Data Visualization in Data Mining?


Data can be presented in visual ways through charts, graphs, maps, diagrams, and more. This
is a primary way in which data scientists display their findings.
Methods for Data Visualization
Many methods exist for representing data visually. Here are a few:

 Comparison charts: Charts and tables express relationships in the data, such as
monthly product sales over a one-year period.
 Maps: Data maps are used to visualize data pertaining to specific geographic locations.
Through maps, data can be used to show population density and changes; compare
populations of neighboring states, counties, and countries; detect how populations are
spread over geographic regions; and compare characteristics in one region to those in
other regions.
 Heat maps: This is a popular visualization technique that represents data through
different colors and shading to indicate patterns and ranges in the data. It can be used
to track everything from a region’s temperature changes to its food and pop culture
trends.
 Density plots: These visualizations track data over a period of time, creating what can
look like a mountain range. Density plots make it easy to represent occurrences of single
events over time (e.g., month, year, decade).
 Histograms: These are similar to density plots but are represented by bars on a graph
instead of a linear form.
 Network diagrams: These diagrams show how data points relate to each other by using
a series of lines (or links) to connect objects together.
 Scatter plots: These graphs represent data point relationships on a two-variable axis.
Scatter plots can be used to compare paired variables, such as a country’s life
expectancy plotted against the amount of money it spends on healthcare annually.
 Word clouds: These graphics are used to highlight specific word or phrase instances
appearing in a body of text; the larger the word’s size in the cloud, the more frequent
its use.
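The following matplotlib sketch shows two of the methods above, a histogram and a scatter plot,
on randomly generated numbers that stand in for customer data.

# Histogram and scatter plot with matplotlib; the data is synthetic.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
spend = rng.normal(loc=100, scale=25, size=500)    # e.g., customer spend
visits = spend / 20 + rng.normal(0, 1, size=500)   # a loosely related variable

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(spend, bins=30)         # histogram: distribution of one variable
ax1.set(title="Spend distribution", xlabel="Spend ($)", ylabel="Customers")
ax2.scatter(spend, visits, s=8)  # scatter plot: relationship of two variables
ax2.set(title="Spend vs. visits", xlabel="Spend ($)", ylabel="Visits/month")
plt.tight_layout()
plt.show()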

Examples of Data Visualization in Business:


Representing data visually is an important skill because it makes data readily understandable
to executives, clients, and customers. According to Markets and Markets, the market size for
global data visualization tools is expected to nearly double (to $10.2 billion) by 2026.
Companies can make faster, more informed decisions when presented with data that is easy to
understand and interpret. Today, this is typically accomplished through effective, visually
accessible mediums such as graphs, 3D models, and even augmented reality. As a result, it’s a
good idea for aspiring data professionals to consider learning such skills through a data science
and visualization bootcamp.

5. Classification
Classification is a fundamental technique in data mining and can be applied to nearly every
industry. It is a process in which data points from large data sets are assigned to categories
based on how they’re being used.

What Is Classification in Data Mining?


In data mining, classification is considered to be a form of clustering — that is, it is useful for
extracting comparable points of data for comparative analysis. Classification is also used to
designate broad groups within a demographic, target audience, or user base through which
businesses can gain stronger insights.
Methods for Data Mining Classification

 Logistic regression: This algorithm attempts to show the probability of a specific
outcome within two possible results. For example, an email service can use logistic
regression to predict whether or not an email is spam (a minimal sketch appears after
this list).
 Decision trees: Once data is classified, follow-up questions can be asked, and the
results diagrammed into a chart called a decision tree. For example, if a computer
company wants to predict the likelihood of laptop purchases, it may ask, Is the potential
buyer a student? The data is classified into “Yes” and “No” decision trees, with other
questions to be asked afterward in a similar fashion.
 K-nearest neighbors (KNN): This is an algorithm that tries to identify an unknown
object by comparing it to others. For instance, grocery chains might use the K-nearest
neighbors algorithm to decide whether to include a sushi or hot meals station in their
new store layout based on consumer habits in the local marketplace.
 Naive Bayes: Based on the Bayes Theorem of Probability, this algorithm uses
historical data to predict whether similar events will occur based on a different set of
data.
 Support Vector Machine (SVM): This machine learning algorithm is often used to
define the line that best divides a data set into two classes. An SVM can help classify
images and is used in facial and handwriting recognition software.
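As a concrete example of the first method above, the scikit-learn sketch below trains a logistic
regression on a handful of invented email features (link count, exclamation marks) to estimate
the probability that a new email is spam.

# Logistic regression for two-class classification; the data is invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features per email: [number of links, number of exclamation marks]; 1 = spam
X = np.array([[0, 0], [1, 0], [0, 1], [7, 5], [9, 3], [6, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
# Probability that an email with 5 links and 4 exclamation marks is spam
print(model.predict_proba([[5, 4]])[0, 1])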
Examples of Classification in Business
Financial institutions classify consumers based on many variables to market new loans or
project credit card risks. Meanwhile, weather apps classify data to project snowfall totals and
other similar figures. Grocery stores also use classification to group products by the consumers
who buy them, helping forecast buying patterns.
6. Machine Learning
Machine learning is the process by which computers use algorithms to learn on their own. An
increasingly relevant part of modern technology, machine learning makes computers “smarter”
by teaching them how to perform tasks based on the data they have gathered.

What Is Machine Learning in Data Mining?


In data mining, machine learning’s applications are vast. Machine learning and data mining
fall under the umbrella of data science but aren’t interchangeable terms. For instance,
computers perform data mining as part of their machine learning functions.
Methods for Machine Learning
 Supervised learning: In this method, algorithms train machines to learn using pre-
labeled data with correct values, which the machines then classify on their own. It’s
called supervised because the process trains (or “supervises”) computers to classify data
and predict outcomes. Supervised machine learning is used in data mining
classification.
 Unsupervised learning: When computers handle unlabeled data, they engage in
unsupervised learning. In this case, the computer classifies the data itself and then looks
for patterns on its own. Unsupervised models are used to perform clustering and
association.
 Semi-supervised learning: Semi-supervised learning uses a combination of labeled
and unlabeled data, making it a hybrid of the above models.
 Reinforcement learning: This is a more layered process in which computers learn to
make decisions based on examining data in a specific environment. For example, a
computer might learn to play chess by examining data from thousands of games played
online.
Examples of Machine Learning in Business
With machine learning, companies can use computers to quickly identify all sorts of data
patterns (in sales, product usage, buying habits, etc.) and develop business plans using those
insights. This is a growing need in many industries.

7. Neural Networks
Computers process large amounts of data much faster than human brains but don’t yet have the
capacity to apply common sense and imagination in working with the data. Neural networks
are one way to help computers reason more like humans.

What Are Neural Networks in Data Mining?


Artificial neural networks attempt to digitally mimic the way the human brain operates. Neural
networks combine many computer processors (similar to the way the brain uses neurons) to
process data, make decisions, and learn as a human would — or at least as closely as possible.

Neural Network Methods


Neural networks consist of three main layers: input, “hidden,” and output. Data enters through
the input layer, is processed in the hidden layer, and is resolved in the output layer where any
relevant action based on the data is then taken. The hidden layer can consist of many processing
layers, depending on the amount of data being used and learning taking place.
Supervised and unsupervised learning also apply to neural networks; neural networks use these
types of algorithms to “train” themselves to function in ways similar to the human brain.

Examples of Neural Networks in Business


Neural networks have a wide range of applications. They can help businesses predict consumer
buying patterns and focus marketing campaigns on specific demographics. They can also help
retailers make accurate sales forecasts and understand how to use dynamic pricing.
Furthermore, they help to improve diagnostic and treatment methods in healthcare, improving
care and performance.
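The input/hidden/output structure described above can be seen in miniature with scikit-learn's
MLPClassifier. The sketch below learns the classic XOR function, which a model without a
hidden layer cannot represent; the layer size and solver are illustrative choices.

# A tiny neural network: 2 inputs -> 8 hidden units -> 1 output.
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])  # XOR: not linearly separable, needs a hidden layer

net = MLPClassifier(hidden_layer_sizes=(8,), activation="tanh",
                    solver="lbfgs", max_iter=5000, random_state=0).fit(X, y)
print(net.predict(X))  # should recover [0, 1, 1, 0] once trained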

8. Association Rules
Association in data mining is all about tracking patterns, specifically based on linked variables.
In the supermarket example, this may mean that many customers who buy a specific item may
also buy a second, related item. This is how stores know how to group certain food items
together, and why online shops may show a “people also bought this” section.

9. Regression Analysis
Regression is used to plan and model, estimating the likely value of a specific variable. The
supermarket may be able to project price points based on availability, consumer demand, and
its competition. Regression helps data mining by identifying the relationship between variables
in a set.
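A minimal version of this with scikit-learn: the sketch below fits a straight line to invented
price/demand figures for the supermarket example and projects demand at a new price point.

# Simple linear regression: price vs. weekly demand (invented figures).
import numpy as np
from sklearn.linear_model import LinearRegression

price = np.array([[1.0], [1.2], [1.5], [1.8], [2.0]])  # unit price in dollars
units_sold = np.array([520, 470, 400, 330, 290])       # weekly demand

reg = LinearRegression().fit(price, units_sold)
print(reg.coef_[0], reg.intercept_)  # slope: units lost per $1 price increase
print(reg.predict([[1.6]]))          # projected demand at a $1.60 price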

10. Outlier Detection


Outlier detection is a key component of maintaining safe databases. Companies use it to test
for fraudulent transactions, such as abnormal credit card usage that might suggest theft.

What Is Outlier Detection in Data Mining?


While other data mining methods seek to identify patterns and trends, outlier detection looks
for the unique: the data point or points that differ from the rest or diverge from the overall
sample. Outlier detection finds errors, such as data that was input incorrectly or extracted from
the wrong sample. Natural data deviations can be instructive as well.
Methods for Outlier Detection

 Numeric outlier: Outliers are detected based on the Interquartile Range (IQR), the
middle 50 percent of values. Data points falling far outside that range (commonly more
than 1.5 times the IQR below the first quartile or above the third quartile) are
considered outliers (see the sketch after this list).
 Z-score: The Z-Score denotes how many standard deviations a data point is from the
sample’s mean. This is also known as extreme value analysis.
 DBSCAN: This stands for “density-based spatial clustering of applications with noise”
and is a method that defines data as core points, border points, and noise points, which
are the outliers.
 Isolation forest: This method isolates anomalies in large sets of data (the forest) with
an algorithm that searches for those anomalies instead of profiling normal data points.
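Two of these methods in miniature: the sketch below flags an invented fraudulent-looking
charge first with a z-score test and then with scikit-learn's isolation forest; the threshold
and contamination values are illustrative.

# Z-score and isolation-forest outlier checks on invented transactions.
import numpy as np
from sklearn.ensemble import IsolationForest

amounts = np.array([23.0, 19.5, 30.0, 25.0, 22.0, 950.0, 21.0])  # one odd charge

# Z-score method: flag points more than two standard deviations from the mean
z = (amounts - amounts.mean()) / amounts.std()
print(amounts[np.abs(z) > 2])

# Isolation forest: -1 marks points the algorithm isolates as anomalies
iso = IsolationForest(contamination=0.15, random_state=0)
print(iso.fit_predict(amounts.reshape(-1, 1)))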

Examples of Outlier Detection in Business


Almost every business can benefit from understanding anomalies in their production or
distribution lines and how to fix them. Retailers can use outlier detection to learn why their
stores witness an odd increase in purchases, such as snow shovels being bought in the summer,
and how to respond to such findings.
Generally, outlier detection is employed to enhance logistics, instill a culture of preemptive
damage control, and create a smoother environment for customers, users, and other key
groups.

11. Prediction
Predictive modeling seeks to turn data into a projection of future action or behavior. These
models examine data sets to find patterns and trends, then calculate the probabilities of a future
outcome.

What Is Prediction in Data Mining?


Predictive modeling is among the most common uses of data mining and works best with large
data sets that represent a broad sample size.

Methods for Prediction


Predictive modeling uses some of the same techniques and terminology as other data mining
processes. Here are four examples:
Forecast modeling: This is a common technique in which the computer answers a question
(for instance, How much milk should a store have in stock on Monday?) by analyzing historical
data.
Classification modeling: Classification places data into groups where it can be used to answer
direct questions.
Cluster modeling: By clustering data into groups with shared characteristics, a predictive
model can be used to study those data sets and make decisions.
Time series modeling: This model analyzes data based on when the data was input. A study of
sales trends over a year is an example of time series modeling.
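For instance, a forecast model for the milk question above can be as simple as fitting a trend
to past sales. The NumPy sketch below (with an invented history) projects the next Monday's
demand; real forecast models also account for seasonality and much more.

# Toy forecast: fit a linear trend to past Monday milk sales and extrapolate.
import numpy as np

week = np.arange(1, 9)  # the past eight Mondays
milk_sold = np.array([120, 124, 130, 133, 138, 141, 149, 152])

slope, intercept = np.polyfit(week, milk_sold, deg=1)  # fit a straight line
print(round(slope * 9 + intercept))  # projected units for the ninth Monday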

12. Data Warehousing
Definition: Data warehousing involves the process of collecting and storing data from various
sources into a central repository, known as a data warehouse.
Purpose: The primary goal of a data warehouse is to provide a unified and structured view of
data for analytical purposes. It integrates data from different operational sources (such as
databases, CRM systems, ERP systems, etc.) into a single coherent data set.
Structure: Data warehouses are typically structured to support analytical queries and reporting.
They often use techniques like ETL (Extract, Transform, Load) to integrate and clean data
before storing it.
Characteristics: Data warehouses are designed for query and analysis rather than transaction
processing. They store historical data and are optimized for complex queries across large
datasets.
So why is data mining important for businesses? Businesses that utilize data mining gain a
competitive advantage, a better understanding of their customers, better oversight of
business operations, improved customer acquisition, and new business opportunities. Different
industries will have different benefits from their data analytics. Some industries are looking for
the best ways to get new customers, others are looking for new marketing techniques, and
others are working to improve their systems. The data mining process is what gives businesses
the opportunities and understanding for how to make their decisions, analyze their information,
and move forward.
Free data mining tools for businesses:
DataMelt : DataMelt performs mathematics, statistics, calculations, data analysis, and
visualization. Many scripting languages and Java packages are available in this system.
ELKI Data Mining Framework: ELKI focuses on algorithms with a specific emphasis on
unsupervised cluster and outlier systems. ELKI is designed to be easy for researchers, students,
and business organizations to use.
Orange Data Mining: Orange data mining helps organizations do simple data analysis and use
top visualization and graphics. Heatmaps, hierarchical clustering, decision trees, and more are
used in this process.
The R Project for Statistical Computing: The R Project is used in statistical modeling and
graphics and is utilized on many operating systems and programs.
Rattle GUI: Rattle GUI presents statistical and visual summaries of data, helps prepare it to be
modeled, and utilizes supervised and unsupervised machine learning to present the information.
3.2.2 Data mining process
Data mining is the process of discovering patterns, correlations, and anomalies in large datasets
to predict outcomes. The process involves several key steps, each essential for turning raw data
into actionable insights. Here's a detailed look at the data mining process:
1. Problem Definition
 Goal Identification: Clearly define the business objective or problem you want to
solve.
 Requirements Gathering: Determine the data requirements, scope, and desired
outcomes.
2. Data Understanding
 Data Collection: Gather relevant data from various sources such as databases, data
warehouses, or external sources.
 Data Description: Analyze the data to understand its structure, format, and content.
 Data Exploration: Perform initial data analysis using statistical summaries and
visualization tools to identify patterns and anomalies.
3. Data Preparation
 Data Cleaning: Handle missing values, remove duplicates, and correct errors to
improve data quality.
 Data Integration: Combine data from different sources to create a unified dataset.
 Data Transformation: Normalize, aggregate, and format data to make it suitable for
analysis.
 Data Reduction: Reduce the volume of data by selecting relevant features, sampling,
or aggregating data.
4. Data Modeling
 Selection of Techniques: Choose appropriate data mining techniques (e.g.,
classification, regression, clustering, association) based on the problem.
 Model Building: Develop data mining models using selected techniques and
algorithms.
 Model Training: Train the models on the prepared dataset to identify patterns and
relationships.
5. Model Evaluation
 Model Testing: Evaluate the performance of the models using a separate test dataset.
 Validation: Use metrics such as accuracy, precision, recall, and F1 score to assess
model performance.
 Model Tuning: Fine-tune model parameters to improve performance and avoid
overfitting or underfitting.
6. Deployment
 Implementation: Integrate the data mining model into the business process or
decision-making system.
 Monitoring: Continuously monitor model performance and update it as necessary to
adapt to new data and changing conditions.
7. Knowledge Representation
 Result Interpretation: Interpret the results of the data mining process to derive
actionable insights.
 Visualization: Use visualization techniques to present findings in an understandable
and compelling way.
 Reporting: Generate reports summarizing the findings and their implications for the
business.
8. Feedback and Refinement
 Feedback Loop: Gather feedback from stakeholders and end-users to refine the data
mining process and models.
 Iteration: Iterate through the data mining process to continuously improve models and
insights based on new data and feedback.
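Steps 3 through 5 can be seen end to end in a few lines of scikit-learn. The sketch below uses
the library's bundled breast-cancer dataset as stand-in data, holds out a test set, trains a
model, and reports the metrics named in step 5.

# Prepare, model, and evaluate with scikit-learn (bundled example data).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X, y = load_breast_cancer(return_X_y=True)

# Evaluation needs a held-out test set the model never saw during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)  # build and train
pred = model.predict(X_test)

print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall   :", recall_score(y_test, pred))
print("F1       :", f1_score(y_test, pred))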

Fig 1: Data Mining Process


3.3 Text and Web Analytics:
3.3.1 Text Analytics:
Text analytics (also known as text mining) is the process of transforming unstructured text
data into meaningful and actionable information. This involves various techniques and
methodologies to analyze, interpret, and extract useful insights from text. The goal is to uncover
patterns, trends, and relationships within the data.
Key Components and Techniques:

1. Text Preprocessing: Preparing raw text data for analysis by cleaning, tokenizing,
removing stop words, and normalizing the text.
2. Text Representation: Converting text into a format suitable for analysis, such as:
o Bag of Words (BoW): Representing text as a collection of word frequencies.
o TF-IDF: Measuring the importance of a word in a document relative to a
collection of documents.
o Word Embeddings: Using vector representations of words to capture semantic
relationships (e.g., Word2Vec, GloVe).
3. Text Analysis Techniques:
o Sentiment Analysis: Determining the emotional tone of text (positive,
negative, neutral).
o Topic Modeling: Identifying themes or topics within a set of documents (e.g.,
Latent Dirichlet Allocation).
o Named Entity Recognition (NER): Identifying and classifying entities such as
names, dates, and organizations within text.
o Text Classification: Categorizing text into predefined categories (e.g., spam
detection, sentiment classification).
o Natural Language Processing (NLP): Applying advanced linguistic analysis
to text, including parts of speech tagging, dependency parsing, and machine
translation.
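The two text representations above (Bag of Words and TF-IDF) are each a one-liner in
scikit-learn; the sketch below applies them to three invented review snippets.

# Bag of Words and TF-IDF representations of short texts.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "great battery life, great screen",
    "battery died after a week",
    "screen is bright and sharp",
]

bow = CountVectorizer().fit(docs)  # Bag of Words: raw word counts
print(bow.get_feature_names_out())
print(bow.transform(docs).toarray())

tfidf = TfidfVectorizer().fit(docs)  # TF-IDF: counts reweighted by rarity
print(tfidf.transform(docs).toarray().round(2))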
Applications:
 Customer Feedback Analysis: Understanding customer sentiments and opinions from
reviews and surveys.
 Social Media Monitoring: Tracking and analyzing public sentiment and trends on
social media platforms.
 Content Categorization: Organizing and classifying large volumes of text data into
meaningful categories.
 Healthcare: Extracting valuable information from medical records and literature.

3.3.2 Text Mining Overview:


Text mining, also known as text analytics, involves extracting valuable insights and
information from unstructured text data using various techniques from natural language
processing (NLP), machine learning, and statistics. It encompasses processes like text
preprocessing (tokenization, normalization, stop words removal, stemming, and
lemmatization), feature extraction (Bag of Words, TF-IDF, word embeddings), and text
classification (supervised and unsupervised learning). Additionally, it includes tasks like
information extraction (named entity recognition, relation extraction), sentiment analysis, text
summarization, and topic modeling. Text mining has diverse applications in business
intelligence, healthcare, legal analysis, finance, marketing, and academic research, allowing
organizations to analyze customer feedback, medical records, legal documents, financial
reports, and scholarly articles to gain actionable insights. Popular tools and libraries for text
mining include Python, R, NLTK, spaCy, scikit-learn, Gensim, TensorFlow, and PyTorch.
Key Components of Text Mining:
1. Text Preprocessing:
o Tokenization: Splitting text into words, phrases, symbols, or other
meaningful elements called tokens.
o Normalization: Converting text to a standard format, such as lowercasing,
removing punctuation, and correcting spelling errors.
o Stop Words Removal: Eliminating common words (e.g., "the," "is," "and")
that may not contribute significant meaning.
o Stemming and Lemmatization: Reducing words to their root forms (e.g.,
"running" to "run").
2. Feature Extraction:
o Bag of Words (BoW): Representing text as a collection of word frequencies.
o TF-IDF (Term Frequency-Inverse Document Frequency): Evaluating the
importance of a word in a document relative to a collection of documents.
o Word Embeddings: Capturing semantic meanings of words using techniques
like Word2Vec, GloVe, or BERT.
3. Text Classification:
o Supervised Learning: Training models on labeled data to categorize text into
predefined classes (e.g., spam detection, sentiment analysis).
o Unsupervised Learning: Discovering natural groupings of text (e.g., topic
modeling using algorithms like Latent Dirichlet Allocation).
4. Information Extraction:
o Named Entity Recognition (NER): Identifying entities like names, dates,
locations, and organizations in text.
o Relation Extraction: Determining relationships between entities in text.
5. Sentiment Analysis:
o Analyzing the sentiment expressed in text, such as positive, negative, or
neutral opinions.
6. Text Summarization:
o Extractive Summarization: Selecting key sentences from the text.
o Abstractive Summarization: Generating new sentences that capture the main
ideas.
7. Clustering and Topic Modeling:
o Grouping similar documents together and identifying common topics within a
collection of texts.
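Several of the preprocessing steps listed in item 1 can be scripted with NLTK, one of the
libraries mentioned below. This sketch tokenizes a sentence, removes stop words, and stems the
remainder; the download calls fetch the required corpora on first run.

# Tokenization, stop-word removal, and stemming with NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)  # needed by newer NLTK releases
nltk.download("stopwords", quiet=True)

text = "The runners were running quickly through the crowded streets."
tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]     # tokenize + normalize
tokens = [t for t in tokens if t not in stopwords.words("english")]  # remove stop words
print([PorterStemmer().stem(t) for t in tokens])                     # e.g., "running" -> "run"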
Tools and Technologies:

1. NLP Libraries and Frameworks:


o NLTK (Natural Language Toolkit): A comprehensive library for text
processing and linguistic analysis in Python.
o SpaCy: An industrial-strength NLP library that is fast and efficient for
processing large volumes of text.
o Stanford NLP: A suite of NLP tools developed by Stanford University,
providing various linguistic analysis tools.
o Gensim: A library for topic modeling and document similarity analysis using
word embeddings.
2. Machine Learning Frameworks:
o Scikit-learn: A Python library for machine learning that includes tools for text
classification, clustering, and more.
o TensorFlow and PyTorch: Deep learning frameworks that can be used to build
advanced text analysis models.
3. Text Mining Tools:
o RapidMiner: An advanced data mining tool that supports text mining and
machine learning.
o KNIME: An open-source platform for data analytics, reporting, and integration
with capabilities for text mining.
o IBM SPSS Modeler: A data mining and text analytics software that helps build
predictive models.

3.4 Applications of Text Mining


1. Business Intelligence: Analyzing customer feedback, reviews, and social media to
gain insights into customer preferences and sentiment.
2. Healthcare: Extracting information from medical records and research papers to
support clinical decision-making and research.
3. Legal: Analyzing legal documents, contracts, and case law to identify relevant
information and trends.
4. Finance: Monitoring news, reports, and social media to make informed investment
decisions.
5. Marketing: Understanding consumer behavior and trends through analysis of
surveys, reviews, and social media interactions.
6. Academic Research: Analyzing scholarly articles, patents, and publications to
uncover trends and relationships in research.
Challenges:
1. Data Quality: Ensuring the text data is clean, accurate, and representative of the
domain.
2. Volume and Variety: Handling large volumes of text data from diverse sources.
3. Language and Ambiguity: Dealing with different languages, dialects, and the
inherent ambiguity of human language.
4. Context Understanding: Capturing the context and nuances of the text to make
accurate interpretations.
5. Scalability: Processing and analyzing text data efficiently as the volume grows.

3.5 Web mining overview


Web mining can be used as a process to develop advanced decision support systems that
support the management activities of a web site. The web mining process can be described as
a sequence of steps for the development of advanced decision support systems. By following
such a sequence, one can develop advanced decision support systems which integrate data
mining with business intelligence. This process has been applied in the development of
intelligent monitoring/management systems to guarantee the quality of web sites. Examples of
monitoring activities include:
• Usage: Keep track of the paths users take during their accesses, the efficiency of
pages/hyperlinks in guiding the users to accomplish their goals;
• Users: How users are grouped taking into account their browsing behavior, how groups
change with time, how groups of users relate with the success of the site;
• Data quality: How adequate the content and meta-data of a web site are;
• Automation: The effect of personalization actions. For instance, if users are following the
recommendations of products and pages or not.
In this section, we briefly describe other uses of web mining introduced by the web mining
research community:
• Web Personalization/Recommendation: The user navigation behavior can be used to
personalize web pages by making dynamic recommendations (e.g., pages, services, etc) for
each web user (Anand, & Mobasher, 2003);
• Categorization/Clustering of Content: Content data can be used to categorize/ cluster web
pages into topic directories (Chakrabarti, 2000);
• Automatic Summarization of Content: The goal is to construct automatically summaries from
web page text content (Zhang, Zincir-Heywood, & Milios, 2004). An example of such
application is the presentation of summaries by search engines;
• Extraction of Keywords from Web Pages: A keyword is a word or a set of words which
characterizes the content of a web page or site, and is used by users in their search process.
Using content and usage information from a web page/site, we can extract/identify keywords
which attract and retain users (Velasquez, & Palade, 2008);
• Web Page Ranking: Hyperlinks can be used to rank web pages, in accordance with the
interest of the user, such as in search engines (Page, Brin, Motwani, & Winograd, 1999);
• Web Caching Improvement: Access patterns extracted from web logs can be used to extend
caching policies in order to improve the performance of web accesses;
• Clickstream and Web Log Analysis: The access logs can also be used to perform other types
of analyses, from simple access statistics to user behavioral patterns, that help to improve the
quality of web sites;
• Analysis of Web Site Topology: Web content and hyperlinks are used to analyse the topology
of a web site and improve its organization, possibly reducing the number of alternative
pages/hyperlinks that must be considered when we browse a web site;
• Identifying Hubs and Authorities: Hyperlinks can also be used to identify hubs (directory
pages) and authorities (popular pages). A hub is a page that points to many other pages. An
authority is a page that is pointed to by many different hubs;
• Identifying Web Communities: Hubs and authorities can be combined to identify web
communities, which are groups of pages sharing the same subject;
• OLAP Analysis: The historical evolution of web data (e.g., usage, content and structure data)
is analyzed on several perspectives/dimensions, each usually defined by a specific meta-data
descriptor (e.g., category).
3.5.1 Using Web Mining as a Process to Develop an Advanced Decision Support System for
an E-News Web Portal
The goal of many web portals is to select, organize and distribute content (information, or other
services and products) in order to satisfy their users/customers. The methods to support this
process are to a large extent based on meta-data (such as keywords, categories, authors and
other descriptors) that describe content and its properties. For instance, search engines often
take into account keywords that are associated with the content to compute their relevance to a
query. Likewise, the accessibility of content by navigation depends on its position in the
structure of the portal. Meta-data is usually filled in by the authors who publish content in the
portal. The publishing process, which goes from the insertion of content to its actual
publication on the portal, is regulated by a workflow. The complexity of this workflow varies:
the author may be authorized to publish content directly; alternatively, content may have to be
analyzed by one or more editors, who authorize its publication or not. Editors may also suggest
changes to the content and to the meta-data that describe it, or make those changes themselves.
In cases where there are many different authors or the publishing process is less strict, the
meta-data may describe content in a way which is not suitable for the purpose of the portal,
thus decreasing the quality of the services provided.
3.5.2 Web Analytics:
Web analytics is the measurement, collection, analysis, and reporting of web data to
understand and optimize web usage. It focuses on analyzing user interactions with websites
and other digital platforms to gain insights into user behavior and improve website
performance.
Key Components:
1. Data Collection: Gathering data from various sources, including web server logs,
JavaScript tags, and cookies.
2. Metrics and Key Performance Indicators (KPIs): Tracking specific measurements
to evaluate website performance, such as:
o Page Views: The total number of times pages are viewed.
o Sessions: A group of interactions by a user within a given time frame.
o Bounce Rate: The percentage of visitors who leave after viewing only one page.
o Conversion Rate: The percentage of visitors who complete a desired action,
such as making a purchase.
3. Traffic Analysis: Understanding where visitors come from (e.g., search engines, direct
traffic, social media) and their geographical locations.
4. Behavior Analysis: Analyzing user interactions, such as click paths, time spent on
pages, and heatmaps.
5. Outcome Analysis: Measuring the success of specific goals, such as sales, sign-ups,
and other conversions.
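Two of the KPIs above can be computed directly from session records. The pandas sketch below
uses an invented table where a bounce is a one-page session and a conversion is a completed
goal.

# Bounce rate and conversion rate from an invented sessions table.
import pandas as pd

sessions = pd.DataFrame({
    "session_id": [1, 2, 3, 4, 5],
    "page_views": [1, 5, 2, 1, 8],
    "converted":  [False, True, False, False, True],
})

bounce_rate = (sessions["page_views"] == 1).mean() * 100  # one-page visits
conversion_rate = sessions["converted"].mean() * 100      # completed the goal
print(f"Bounce rate: {bounce_rate:.0f}%  Conversion rate: {conversion_rate:.0f}%")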

Tools and Technologies:


 Web Analytics Platforms: Google Analytics, Adobe Analytics, Piwik.
 Tag Management Systems: Google Tag Manager, Adobe Dynamic Tag Management.
 Heatmap Tools: Hotjar, Crazy Egg.

Applications:
 Marketing Optimization: Improving digital marketing strategies by understanding
user behavior and preferences.
 User Experience (UX) Enhancement: Identifying areas for improving website
usability and engagement.
 E-commerce: Optimizing online sales and customer journeys.
 Content Management: Tailoring content strategies based on user interactions and
preferences.
3.5.3 Integration of Text and Web Analytics
Combining text and web analytics provides a comprehensive understanding of both the
quantitative and qualitative aspects of user interactions. For instance:

 Enhanced Customer Insights: Analyzing user comments and reviews (text analytics)
alongside user behavior on the website (web analytics) to gain deeper insights into
customer needs and preferences.
 Improved Marketing Strategies: Using sentiment analysis from social media (text
analytics) together with website traffic data (web analytics) to tailor marketing
campaigns more effectively.
 Content Optimization: Evaluating which types of content resonate most with users by
analyzing both their engagement patterns (web analytics) and their feedback (text
analytics).

3.6 Social media analytics:


Social media analytics is the ability to gather data from social channels and find meaning in it
to support business decisions — and to measure the performance of actions based on those
decisions through social media.
Social media analytics is broader than metrics such as likes, follows, retweets, previews, clicks,
and impressions gathered from individual channels. It also differs from reporting offered by
services that support marketing campaigns such as LinkedIn or Google Analytics.
Social media analytics uses specifically designed software platforms that work similarly to web
search tools. Data about keywords or topics is retrieved through search queries or web
‘crawlers’ that span channels. Fragments of text are returned, loaded into a database,
categorized and analyzed to derive meaningful insights.
Social media analytics includes the concept of social listening. Listening is monitoring social
channels for problems and opportunities. Social media analytics tools typically incorporate
listening into more comprehensive reporting that involves listening and performance analysis.
Why is social media analytics important?
Social media analytics is important for several reasons, particularly in today's digital age where
social media platforms play a significant role in communication, marketing, and decision-
making. Here are some key reasons why social media analytics is crucial:

1. Understanding Audience Behavior: Social media analytics helps businesses and
organizations understand how their target audience interacts with their brand. It
provides insights into consumer preferences, interests, sentiment towards products or
services, and behavior patterns such as browsing habits and purchasing decisions.
2. Monitoring Brand Reputation: By monitoring social media conversations, businesses
can track mentions of their brand, products, or services in real-time. This enables them
to promptly address customer feedback, respond to complaints, and manage crises
effectively to protect their brand reputation.
3. Competitive Analysis: Social media analytics allows businesses to monitor
competitors' activities and performance on social platforms. By analyzing competitors'
strategies, content engagement, and customer feedback, organizations can identify
strengths, weaknesses, and opportunities for differentiation.
4. Improving Marketing Strategies: Social media analytics provides actionable insights
into the effectiveness of marketing campaigns. By measuring metrics such as
engagement rates, click-through rates, conversion rates, and ROI (Return on
Investment), businesses can optimize their marketing efforts, allocate resources more
effectively, and improve campaign targeting.
5. Identifying Influencers and Advocates: Social media analytics helps identify
influential individuals (influencers) and brand advocates who have a significant impact
on their audience's purchasing decisions. Collaborating with influencers can amplify
brand reach and credibility, while leveraging brand advocates can foster brand loyalty
and advocacy.
6. Tracking Trends and Sentiment: Social media analytics allows businesses to monitor
trends and sentiment related to their industry, products, or services. By identifying
emerging trends and understanding consumer sentiment (positive, negative, or neutral),
organizations can adapt their strategies, launch timely promotions, and mitigate
potential risks.
7. Enhancing Customer Service: Social media analytics enables organizations to track
and analyze customer feedback and inquiries in real-time. By identifying common
issues or concerns, businesses can improve their products or services, enhance customer
support processes, and strengthen customer relationships.
8. Data-Driven Decision Making: Ultimately, social media analytics provides data-
driven insights that support strategic decision-making across various departments
within an organization. Whether in marketing, sales, customer service, product
development, or corporate communications, data-driven decisions based on social
media analytics can lead to improved efficiency, innovation, and overall business
performance.
In summary, social media analytics is important because it empowers businesses to gain deep
insights into consumer behavior, enhance brand reputation, optimize marketing strategies, stay
competitive, and make informed decisions that drive business growth and success in the digital
era.
3.7 Sentiment analysis:
Sentiment analysis is a branch of natural language processing (NLP) that involves identifying
and extracting subjective information from text to determine the sentiment or opinion
expressed by the writer. Here's an overview of sentiment analysis:

1. Definition and Purpose:


o Sentiment analysis, also known as opinion mining, aims to analyze attitudes,
emotions, and opinions expressed in text data.
o It helps businesses and organizations understand public opinion, customer
feedback, and social media sentiment towards products, services, or events.
2. Techniques and Approaches:
o Lexicon-based: Uses dictionaries of words annotated with their sentiment
scores.
o Machine Learning: Supervised learning models classify text based on labeled
training data.
o Deep Learning: Utilizes neural networks to learn representations of text data
for sentiment classification.
o Aspect-based: Analyzes sentiment about specific aspects or features of a
product or service.
3. Applications:
o Business Intelligence: Helps companies gauge customer satisfaction, identify
trends, and monitor brand perception.
o Social Media Monitoring: Analyzes public opinion on platforms like Twitter,
Facebook, etc.
o Customer Feedback Analysis: Automates the process of categorizing and
analyzing customer reviews.
o Market Research: Provides insights into consumer preferences and behavior.
4. Challenges:
o Context and Sarcasm: Understanding sarcasm, irony, and nuanced
expressions.
o Multilingualism: Handling sentiment in multiple languages.
o Data Noise: Dealing with noisy or unstructured data.
o Domain-specific Sentiment: Adapting models to specific industries or topics.
5. Tools and Libraries:
o Popular NLP libraries like NLTK (Natural Language Toolkit), SpaCy, and
TensorFlow provide tools for sentiment analysis.
o Cloud services like Google Cloud Natural Language API and Azure Text
Analytics offer sentiment analysis as a service.
6. Ethical Considerations:
o Bias: Ensuring models are trained on diverse datasets to avoid biased results.
o Privacy: Handling sensitive user data appropriately.
o Transparency: Being clear about how sentiment analysis influences decision-
making.
7. Future Trends:
o Emotion Recognition: Going beyond positive/negative sentiment to recognize
complex emotions.
o Real-time Analysis: Improving speed and accuracy for live data streams.
o Personalization: Tailoring sentiment analysis to individual user preferences.
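As a quick illustration of the lexicon-based approach mentioned above, the sketch below scores
two invented sentences with NLTK's VADER analyzer; the compound score runs from -1 (most
negative) to +1 (most positive).

# Lexicon-based sentiment scoring with NLTK's VADER.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

for text in ["I absolutely love this product!",
             "The service was slow and disappointing."]:
    print(text, "->", sia.polarity_scores(text)["compound"])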

Overall, sentiment analysis plays a crucial role in understanding and interpreting large volumes
of textual data, providing valuable insights for businesses, researchers, and organizations
across various domains. A practical use case of sentiment analysis can be found in social media
monitoring for brand reputation management. Here’s how sentiment analysis can be applied in
this scenario:

Use Case: Social Media Monitoring for Brand Reputation Management:

1. Problem Statement: A company wants to monitor how people perceive their brand on
social media platforms like Twitter, Facebook, and Instagram. They receive a large
volume of mentions and comments daily, making it challenging to manually analyze
each one.
2. Solution with Sentiment Analysis:
o Data Collection: The company collects social media posts, comments, and
mentions related to their brand using APIs or scraping tools.
o Preprocessing: Text preprocessing techniques are applied to clean and
normalize the text data, removing noise such as special characters and
stopwords.
o Sentiment Analysis: Sentiment analysis models are applied to each piece of
text to determine whether the sentiment expressed is positive, negative, or
neutral.
o Classification: Based on sentiment scores generated by the model, posts are
classified into categories like positive, negative, or neutral sentiments.
o Visualization and Reporting: Results are visualized through dashboards or
reports that provide insights into overall sentiment trends, sentiment distribution
over time, and sentiment breakdown by platform or geography.
o Actionable Insights: The company uses these insights to:
 Crisis Management: Quickly identify and address negative sentiment
spikes that could indicate a potential PR crisis.
 Campaign Effectiveness: Evaluate the success of marketing campaigns
by analyzing sentiment changes before, during, and after campaign
launches.
 Customer Feedback: Understand customer perceptions of new product
launches or service changes based on sentiment analysis of user reviews
and comments.
 Competitive Analysis: Compare sentiment towards their brand versus
competitors to identify competitive strengths and weaknesses.
3. Benefits:
o Proactive Response: Enables the company to respond swiftly to negative
sentiment, minimizing reputational damage.
o Customer Insights: Provides valuable insights into customer preferences,
concerns, and expectations.
o Data-driven Decisions: Guides strategic decisions in marketing, customer
service, and product development based on real-time customer feedback.
o Efficiency: Automates the process of sentiment analysis, saving time and
resources compared to manual review of each social media mention.
4. Challenges and Considerations:
o Accuracy: Ensuring sentiment analysis models are accurate, especially in
handling sarcasm, slang, and context-specific expressions.
o Scalability: Managing and analyzing large volumes of social media data
efficiently.
o Ethical Use: Respecting user privacy and ensuring ethical use of data collected
from social media platforms.
In conclusion, sentiment analysis in social media monitoring for brand reputation management
enhances a company's ability to understand public perception, manage crises effectively, and
make data-driven decisions to improve customer satisfaction and brand loyalty.
3.8 Big Data Analytics: Definition and characteristics of big data
Big data analytics describes the process of uncovering trends, patterns, and correlations in large
amounts of raw data to help make data-informed decisions. These processes use familiar
statistical analysis techniques—like clustering and regression—and apply them to more
extensive datasets with the help of newer tools. Big data refers to datasets that are extremely
large and complex, typically exceeding the capacity of traditional data processing software.
The concept of big data is characterized by several key attributes:

1. Volume: Big data involves vast amounts of data that cannot be processed efficiently
with traditional database systems. This volume can range from terabytes to petabytes
and beyond.
2. Velocity: Data in big data environments is generated at high speed and needs to be
processed quickly. Examples include real-time data streams from sensors, social media
feeds, or financial transactions.
3. Variety: Big data encompasses data in various formats and types, including structured
data (like databases), semi-structured data (like XML, JSON), and unstructured data
(like text, images, videos). Managing and analyzing this diverse data requires
specialized tools and techniques.
4. Veracity: Refers to the quality and reliability of data. Big data sources often include
data from uncertain or unreliable sources, requiring careful validation and cleaning to
ensure accuracy.
5. Value: The ultimate goal of big data is to extract valuable insights and knowledge from
large datasets. This can involve discovering patterns, trends, correlations, and other
information that can lead to better decision-making, improved processes, and
innovation.
6. Variability: Data can be inconsistent in its format, structure, or meaning over time.
Variability in big data refers to the challenges of managing and integrating diverse data
sources that may change unpredictably.
7. Complexity: Big data environments are inherently complex due to the volume, variety,
and velocity of data. Analyzing big data often requires advanced analytics techniques,
machine learning algorithms, and distributed computing frameworks.
8. Context: Big data analysis often considers the context in which data is generated or
used. Understanding the context helps interpret data correctly and derive meaningful
insights.
In summary, big data is characterized by its massive volume, high velocity, and diverse variety
of data types. Effectively harnessing big data requires advanced technologies, tools, and
methodologies to extract actionable insights and create business value. Big data is a term
applied to data sets whose size or type is beyond the ability of traditional relational databases
to capture, manage and process the data with low latency. Big data has one or more of the
following characteristics: high volume, high velocity or high variety. Artificial intelligence
(AI), mobile, social and the Internet of Things (IoT) are driving data complexity through new
forms and sources of data. For example, big data comes from sensors, devices, video/audio, networks, log files, transactional applications, the web, and social media, much of it generated in real time and at very large scale.
Analysis of big data allows analysts, researchers and business users to make better and faster
decisions using data that was previously inaccessible or unusable. Businesses can use advanced
analytics techniques such as text analytics, machine learning, predictive analytics, data mining,
statistics and natural language processing to gain new insights from previously untapped data
sources independently or together with existing enterprise data.
The Lifecycle Phases of Big Data Analytics:
Stage 1 - Business case evaluation - The Big Data analytics lifecycle begins with a business
case, which defines the reason and goal behind the analysis.
Stage 2 - Identification of data - Here, a broad variety of data sources are identified.
Stage 3 - Data filtering - All of the identified data from the previous stage is filtered here to
remove corrupt data.
Stage 4 - Data extraction - Data that is not compatible with the tool is extracted and then
transformed into a compatible form.
Stage 5 - Data aggregation - In this stage, data with the same fields across different datasets
are integrated.
Stage 6 - Data analysis - Data is evaluated using analytical and statistical tools to discover
useful information.
Stage 7 - Visualization of data - With tools like Tableau, Power BI, and QlikView, Big Data
analysts can produce graphic visualizations of the analysis.
Stage 8 - Final analysis result - This is the last step of the Big Data analytics lifecycle, where
the final results of the analysis are made available to business stakeholders who will take action.
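A minimal sketch of stages 3 through 6 in Python with pandas follows; the file name sales.csv and its region and amount columns are illustrative assumptions, not part of the lifecycle itself:

    import pandas as pd

    df = pd.read_csv("sales.csv")                     # Stages 2/4: identified data, extracted
    df = df.dropna(subset=["amount"])                 # Stage 3: filter out corrupt/incomplete rows
    df["amount"] = df["amount"].astype(float)         # Stage 4: transform into a compatible form
    by_region = df.groupby("region")["amount"].sum()  # Stage 5: aggregate on a shared field
    print(by_region.describe())                       # Stage 6: basic statistical analysis

Stages 7 and 8 would then hand these results to a visualization tool such as Tableau or Power BI and deliver them to stakeholders.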
3.8.1 How does big data analytics work?
Big data analytics is a powerful approach for unlocking the potential of large and complex datasets. To see how it works, let's break the process down into key steps:
 Data Collection: Data is the core of big data analytics. This step gathers data from different sources such as customer comments, surveys, sensors, social media, and so on. The primary aim is to compile as much accurate data as possible, since more data generally yields more insights.
 Data Cleaning (Data Pre-processing): The collected data usually needs cleaning before use. This entails replacing missing values, correcting inaccuracies, and removing duplicates, like sifting through a treasure trove, separating out the rocks and debris and keeping only the valuable gems (a minimal cleaning sketch appears after this list).
 Data Processing: Next comes data processing, which structures and formats the data so that it is usable for analysis, much as a chef prepares ingredients before cooking. Data processing turns the data into a format that analytics tools can work with.
 Data Analysis: Statistical, mathematical, and machine learning methods are applied to extract the most important findings from the processed data. For example, analysis can uncover customer preferences, market trends, or patterns in healthcare data.
 Data Visualization: Analysis results are usually presented in visual form, such as charts, graphs, and interactive dashboards. Visualizations simplify large amounts of data and allow decision makers to quickly detect patterns and trends.
 Data Storage and Management: Storing and managing the analyzed data properly is just as important, because insights are often revisited long after they are produced. Data protection and adherence to regulations are key issues to address at this stage.
 Continuous Learning and Improvement: Big data analytics is a continuous cycle of collecting, cleaning, and analyzing data to uncover hidden insights. Repeating the cycle helps businesses keep making better decisions and maintain a competitive edge.

Fig 2: Big Data Process
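To illustrate the cleaning step sketched in the list above, here is a small pandas example; the toy DataFrame and its columns are made up for the purpose:

    import pandas as pd

    raw = pd.DataFrame({                  # toy data with typical defects
        "customer": ["a", "a", "b", None],
        "rating":   [5, 5, None, 3],
    })
    clean = raw.drop_duplicates()                    # remove duplicate records
    clean = clean.dropna(subset=["customer"])        # drop rows missing a key field
    clean["rating"] = clean["rating"].fillna(clean["rating"].median())  # impute missing ratings
    print(clean)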


3.8.2 Benefits and Advantages of Big Data Analytics:
1. Enhanced Decision Making
 Data-Driven Insights: Big data analytics provides comprehensive insights based on
large volumes of data, allowing organizations to make informed decisions.
 Predictive Analytics: Helps in forecasting future trends and behaviors, enabling
proactive decision-making.
2. Improved Operational Efficiency
 Process Optimization: Identifies bottlenecks and inefficiencies in operations, leading
to streamlined processes and reduced costs.
 Automation: Facilitates automation of repetitive tasks, improving overall productivity.

3. Customer Insights and Personalization


 Customer Behavior Analysis: Helps in understanding customer preferences and
behaviors, enabling personalized marketing and improved customer service.
 Targeted Marketing: Allows businesses to create targeted marketing campaigns based
on customer data, increasing conversion rates.
4. Innovation and Product Development
 Market Trends Analysis: Identifies emerging trends and market demands, guiding the
development of new products and services.
 Feedback Integration: Uses customer feedback and data to improve existing products
and develop new offerings.
5. Competitive Advantage
 Benchmarking: Analyzes industry trends and competitor performance to identify areas
for improvement and growth.
 Agility: Provides the ability to respond quickly to market changes and customer needs,
maintaining a competitive edge.
6. Cost Reduction
 Resource Optimization: Identifies areas where resources can be optimized, reducing
waste and cutting costs.
 Predictive Maintenance: In industries like manufacturing, predictive analytics can
forecast equipment failures, reducing downtime and maintenance costs.
7. Risk Management
 Fraud Detection: Identifies patterns indicative of fraudulent activity, enhancing
security and reducing losses.
 Compliance: Ensures regulatory compliance by monitoring and analyzing relevant
data, avoiding legal issues and fines.
8. Enhanced Customer Experience
 Real-Time Feedback: Analyzes real-time data to provide immediate responses and
solutions to customer issues.
 Personalized Interactions: Uses data to tailor interactions and communications,
enhancing customer satisfaction.
9. Supply Chain Management
 Inventory Management: Optimizes inventory levels by predicting demand and
managing supply chains more efficiently.
 Logistics: Improves logistics and distribution processes by analyzing data on routes,
weather, and other factors.
10. Healthcare Advancements
 Patient Care: Analyzes patient data to provide better diagnostics, personalized
treatment plans, and improved patient outcomes.
 Research and Development: Accelerates medical research by analyzing large datasets
to identify patterns and correlations.
11. Enhanced Financial Performance
 Revenue Growth: Identifies new revenue streams and opportunities for growth by
analyzing market data and consumer trends.
 Cost Management: Helps in identifying cost-saving opportunities through efficient
resource allocation and process optimization.
12. Scalability
 Handling Large Volumes: Efficiently processes and analyzes vast amounts of data
from various sources, providing insights that smaller datasets might miss.
 Adaptability: Scales with the growth of data, ensuring continuous improvement and
relevance of insights.

3.9 Fundamentals of Big Data Analytics:


Big data analytics involves examining large and varied data sets to uncover hidden patterns,
correlations, market trends, customer preferences, and other useful business information. The
fundamentals of big data analytics can be broken down into several key components and
concepts:
1. Data Sources
 Structured Data: Data that is organized into rows and columns, such as databases and
spreadsheets.
 Unstructured Data: Data that does not have a predefined format, such as text, video,
audio, and social media posts.
 Semi-structured Data: Data that does not fit into a strict structure but has some
organizational properties, such as XML and JSON files.
2. Data Collection
 Data Ingestion: The process of gathering and importing data for immediate use or
storage in a database.
 ETL (Extract, Transform, Load): A process that extracts data from various sources,
transforms it into a suitable format, and loads it into a storage system.
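A hedged ETL sketch in Python: it extracts from a hypothetical orders.csv, transforms the data (normalizing the date type and dropping invalid rows), and loads it into a local SQLite database standing in for a warehouse; the file and column names are assumptions:

    import sqlite3
    import pandas as pd

    orders = pd.read_csv("orders.csv")                           # Extract from a source system
    orders["order_date"] = pd.to_datetime(orders["order_date"])  # Transform: normalize types
    orders = orders[orders["total"] > 0]                         # Transform: basic validity filter

    with sqlite3.connect("warehouse.db") as conn:                # Load into the target store
        orders.to_sql("orders", conn, if_exists="replace", index=False)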
3. Data Storage
 Data Warehouses: Central repositories of integrated data from one or more disparate
sources used for reporting and data analysis.
 Data Lakes: Large storage repositories that hold vast amounts of raw data in its native
format until it is needed.
 Distributed Storage Systems: Systems like Hadoop Distributed File System (HDFS)
that store data across multiple machines to handle large volumes.
4. Data Processing
 Batch Processing: Processing large volumes of data at once, typically used for
historical data analysis.
 Real-Time Processing: Processing data as it arrives, enabling immediate analysis and
action.
 Stream Processing: Continuously processing data streams, often used for monitoring
and real-time analytics.
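As a toy contrast with batch processing, the generator below processes a stream one record at a time, emitting an updated running mean as each value arrives; the sensor readings are invented:

    def running_mean(stream):
        count, total = 0, 0.0
        for value in stream:          # each record is handled as it arrives
            count += 1
            total += value
            yield total / count       # emit an updated result per record

    sensor_stream = iter([21.5, 22.0, 21.8, 35.0, 22.1])  # stand-in for a live feed
    for mean in running_mean(sensor_stream):
        print(f"running mean: {mean:.2f}")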
5. Data Analysis
 Descriptive Analytics: Analyzing historical data to understand what happened in the
past.
 Diagnostic Analytics: Examining data to determine why something happened.
 Predictive Analytics: Using statistical models and machine learning techniques to
predict future outcomes.
 Prescriptive Analytics: Recommending actions based on predictive analytics and
other data insights.
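A minimal predictive-analytics sketch with scikit-learn: a linear regression is fitted to twelve months of invented sales figures and used to forecast month 13. Real models would use richer features and proper validation:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    months = np.arange(1, 13).reshape(-1, 1)  # feature: month index (historical data)
    sales = np.array([10, 12, 13, 15, 16, 18, 19, 21, 22, 24, 25, 27])  # illustrative target

    model = LinearRegression().fit(months, sales)  # learn the historical trend
    print(model.predict([[13]]))                   # predict next month's sales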
6. Data Visualization
 Dashboards: Visual displays of key metrics and trends for quick, at-a-glance
understanding.
 Reports: Detailed documents that present data analysis results, often including charts,
graphs, and tables.
 Interactive Visualizations: Tools that allow users to explore data through interactive
charts and graphs.
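A small matplotlib sketch of the idea: invented monthly sentiment averages are plotted against a neutral baseline so a trend can be read at a glance:

    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr"]
    sentiment = [0.20, 0.35, -0.10, 0.25]        # illustrative monthly averages

    plt.plot(months, sentiment, marker="o")      # trend line for at-a-glance reading
    plt.axhline(0, linestyle="--", linewidth=1)  # neutral baseline
    plt.title("Average sentiment by month")
    plt.ylabel("Mean sentiment score")
    plt.show()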
7. Big Data Technologies and Tools
 Hadoop: An open-source framework for storing and processing large data sets in a
distributed computing environment.
 Spark: An open-source data processing engine for large-scale data processing, known
for its speed and ease of use.
 NoSQL Databases: Non-relational databases like MongoDB, Cassandra, and
Couchbase that handle unstructured data.
 Data Integration Tools: Tools like Apache NiFi and Talend that facilitate the flow of
data between systems.
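A minimal PySpark sketch, assuming pyspark is installed and a hypothetical events.csv with an event_type column exists; Spark distributes the group-by aggregation across local cores or a cluster:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("demo").getOrCreate()
    logs = spark.read.csv("events.csv", header=True, inferSchema=True)  # distributed read
    counts = logs.groupBy("event_type").agg(F.count("*").alias("n"))    # distributed aggregation
    counts.orderBy(F.desc("n")).show()                                  # collect a small result
    spark.stop()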
8. Machine Learning and AI
 Supervised Learning: Training models on labeled data to make predictions.
 Unsupervised Learning: Finding hidden patterns in unlabeled data.
 Reinforcement Learning: Training models through trial and error, receiving rewards
for correct actions.
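As a small unsupervised-learning sketch, k-means below discovers two groups in unlabeled points; the data is invented, and real use would involve feature scaling and choosing the number of clusters:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1, 2], [1, 4], [1, 0],     # unlabeled observations
                  [10, 2], [10, 4], [10, 0]])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)   # cluster assignments found without any labels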
9. Data Governance and Security
 Data Quality Management: Ensuring the accuracy, completeness, and reliability of
data.
 Data Privacy: Protecting sensitive information and ensuring compliance with
regulations like GDPR and CCPA.
 Data Security: Safeguarding data against unauthorized access and cyber threats.
10. Scalability and Performance
 Horizontal Scaling: Adding more machines to handle increased data volume and
processing power.
 Vertical Scaling: Adding more power (CPU, RAM) to existing machines.
 Load Balancing: Distributing workloads evenly across multiple systems to optimize
resource use and avoid overload.
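A toy round-robin sketch of load balancing: jobs are handed to a hypothetical pool of workers in turn; production balancers also weigh node health and current load:

    from itertools import cycle

    workers = cycle(["node-1", "node-2", "node-3"])  # hypothetical machine pool
    for job_id in range(7):
        print(f"job {job_id} -> {next(workers)}")    # each node receives every third job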
11. Cloud Computing
 Infrastructure as a Service (IaaS): Providing virtualized computing resources over
the internet.
 Platform as a Service (PaaS): Offering hardware and software tools over the internet.
 Software as a Service (SaaS): Delivering software applications over the internet.
12. Ethical Considerations
 Bias and Fairness: Ensuring algorithms and data analysis do not perpetuate biases.
 Transparency: Providing clear explanations of how data is used and how decisions are
made.
 Responsibility: Using data ethically and responsibly, considering the impact on
individuals and society.
Understanding these fundamentals equips organizations with the knowledge to effectively
leverage big data analytics for strategic advantage, operational efficiency, and enhanced
decision-making.
PART A:
1. How does data mining inform business analytics?
2. What are the benefits of text mining?
3. Specify the use cases of sentiment analysis.
4. Why is social media analytics important?
5. What are the applications of data mining analytics?
6. List examples of outlier detection in business.
7. Compare Business Understanding and Data Understanding.
8. List the lifecycle phases of Big Data analytics.
9. What are the advantages of big data analytics?
PART B:
1. Explain data mining techniques in business analytics.
2. Discuss the benefits and advantages of big data analytics in detail.
3. Justify the use of free data mining tools for businesses in detail.
4. Develop an advanced decision support system for an e-news web portal using web mining.
5. Describe text mining challenges and issues.
6. Explain the working of big data analytics with a neat diagram.
