
Unsupervised / Supervised Learning
Data Visualization
Data visualization is the practice of translating
information into a visual context, such as a
map or graph, to make data easier for the
human brain to understand and pull insights
from.
Techniques include line charts, graphs, diagrams, tree maps, and scatter plots (a minimal sketch of two of these follows).
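A minimal sketch of a line chart and a scatter plot with matplotlib (the sales and ad-spend numbers are made up purely for illustration):

```python
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures, for illustration only
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 150, 162, 158]
ads = [10, 12, 11, 15, 18, 17]  # hypothetical ad spend per month

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Line chart: shows the trend of one variable over time
ax1.plot(months, sales, marker="o")
ax1.set_title("Monthly sales (line chart)")

# Scatter plot: shows the relationship between two variables
ax2.scatter(ads, sales)
ax2.set_xlabel("Ad spend")
ax2.set_ylabel("Sales")
ax2.set_title("Ad spend vs sales (scatter plot)")

plt.tight_layout()
plt.show()
```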
Data classification
The process of organizing data into relevant categories so that it can be used and protected more efficiently.
Types of Data Classification
• Data classification involves the use of tags and labels to define
the data type, its confidentiality, and its integrity. There are
three main types of data classification that are considered the
industry standard:
• Content-based classification – inspects and interprets files, looking for sensitive information (a rough sketch follows this list)
• Context-based classification – looks to the application,
location, metadata, or creator (among other variables) as
indirect indicators of sensitive information
• User-based classification – requires a manual, end-user
selection for each document. User-based classification takes
advantage of the user knowledge of the sensitivity of the
document, and can be applied or updated upon creation, edit,
review, or dissemination.
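As a rough illustration of content-based classification, the sketch below inspects text for patterns that often indicate sensitive information. The patterns and risk labels are simplified assumptions for teaching purposes, not a production rule set:

```python
import re

# Simplified illustrative patterns; real classifiers use much richer rules
PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){15,16}\b"),
    "email":       re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def classify_content(text: str) -> str:
    """Return a coarse sensitivity label by inspecting the content."""
    if PATTERNS["credit_card"].search(text):
        return "High Risk"       # e.g. payment card data, PII
    if PATTERNS["email"].search(text):
        return "Moderate Risk"   # e.g. contact information
    return "Low Risk"            # e.g. press releases, marketing material

print(classify_content("See our new brochure!"))             # Low Risk
print(classify_content("Reach me at jane.doe@example.com"))  # Moderate Risk
print(classify_content("Card: 4111 1111 1111 1111"))         # High Risk
```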
Data Classification Levels
• Low risk / low sensitivity (press releases, job advertisements, marketing material)
• Moderate risk / moderate sensitivity (operational documents, unpublished results, supplier contact information)
• High risk / highly sensitive data (financial data, medical reports, intellectual property, credit card numbers, PII)
Data mining
• Data mining is the process of analyzing large data sets (Big Data) from different perspectives and uncovering correlations and patterns to summarize them into useful information.

• Data mining is a discipline with a long history. It starts with early data mining methods such as Bayes' theorem (1700s) and regression analysis (1800s), which were mostly about identifying patterns in data.

• By the early 1990s, data mining was recognized as a sub-process, or step, within a larger process called Knowledge Discovery in Databases (KDD). The most commonly used definition of KDD is "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" (Fayyad, 1996).
Descriptive/Predictive
• Descriptive data mining: Descriptive data mining offers a detailed description of the data; it gives insight into what is going on inside the data without any prior idea, and demonstrates the common characteristics in the results. For example, it can summarize the purchasing behaviour shared by a store's customers.
• Predictive data mining: This allows users to infer features that are not directly available, for example projecting the market analysis for the next quarters from the output of the previous quarters. In general, predictive analysis forecasts or infers features from the data already available, for instance predicting a diagnosis from the medical records of a patient.
Data Mining Tasks
• Data characterization: The characterization of data is a description of the general characteristics of objects in a target class, producing what are called characteristic rules. A database query usually computes the data belonging to a user-specified class and runs it through a description component to summarize the data at various abstraction levels, e.g. bar charts, curves, and pie charts.
• Data discrimination: Data discrimination creates a series of rules, called discriminant rules, that distinguish the target class from a contrasting class by comparing the general characteristics of their objects.
• Data prediction: Predicting the class label of new objects using a previously developed class model, and predicting missing or unavailable data values using prediction analysis, are the two forms of data prediction.
• Classification: It is used to build models of predefined classes; the model is then used to classify new instances whose class is not yet known.
• Association analysis: The links between data items, and the rules that bind them, are discovered, associating two or more data attributes. It associates attributes that are transacted together regularly (a minimal counting sketch follows this list).
• Outlier analysis: Data elements that cannot be grouped into a given class or cluster are outliers. They are often referred to as anomalies or surprises, and they are also very important to keep in mind.
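A minimal sketch of association analysis, counting which item pairs occur together across a few made-up market-basket transactions (full algorithms such as Apriori extend this idea with support and confidence thresholds):

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions, for illustration only
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

# Count how often each pair of items appears in the same transaction
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Report pairs that co-occur in at least half of the transactions
for pair, count in pair_counts.most_common():
    support = count / len(transactions)
    if support >= 0.5:
        print(pair, f"support = {support:.2f}")
```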
Machine Learning

• Machine learning (ML) is the study of computer algorithms that can improve automatically through experience and by the use of data. It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as medicine, email filtering, speech recognition, and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks.
Evolution of ML

• The term machine learning was coined in 1959 by Arthur Samuel, an American IBMer and pioneer in the field of computer gaming and artificial intelligence. The synonym "self-teaching computers" was also used in this time period.
How machine learning works
• 1. A decision process: In general, machine learning algorithms are used to make a prediction or classification. Based on some input data, which can be labelled or unlabelled, the algorithm will produce an estimate about a pattern in the data.
• 2. An error function: An error function serves to evaluate the prediction of the model. If there are known examples, an error function can make a comparison to assess the accuracy of the model.
• 3. A model optimization process: If the model can fit better to the data points in the training set, then weights are adjusted to reduce the discrepancy between the known example and the model estimate. The algorithm repeats this evaluate-and-optimize process, updating weights autonomously, until a threshold of accuracy has been met. (A toy sketch of all three components follows.)
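The toy sketch below ties the three components together for a one-parameter linear model. Everything in it is an illustrative assumption rather than any particular library's API: the decision process is the prediction w * x, the error function is mean squared error, and the optimization step adjusts w by gradient descent.

```python
# Toy training loop illustrating the three components above
xs = [1.0, 2.0, 3.0, 4.0]   # input data
ys = [2.0, 4.0, 6.0, 8.0]   # known labels (true relation: y = 2x)

w = 0.0     # model weight, initially a bad guess
lr = 0.01   # learning rate

for step in range(200):
    # 1. Decision process: produce an estimate from the input data
    preds = [w * x for x in xs]
    # 2. Error function: compare predictions against the known examples
    error = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # 3. Optimization: adjust the weight to reduce the discrepancy
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    w -= lr * grad

print(f"learned w = {w:.3f}, final error = {error:.6f}")  # w approaches 2.0
```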
Supervised Learning/Unsupervised
Learning
• Supervised learning, also known as supervised machine learning, is defined by its use of labelled datasets to train algorithms to classify data or predict outcomes accurately. As input data is fed into the model, the model adjusts its weights until it has been fitted appropriately, avoiding overfitting or underfitting. Examples: linear regression, logistic regression, random forest, support vector machines (SVM), and more.
• Example: COINS.
• Unsupervised machine learning
• Unsupervised learning, also known as unsupervised machine learning, uses machine learning algorithms to analyze and cluster unlabelled datasets. These algorithms discover hidden patterns or data groupings without the need for human intervention. Applications include exploratory data analysis, cross-selling strategies, customer segmentation, and image and pattern recognition.
• Example: PLAYERS RANKING.
• Reinforcement learning: a feedback-based learning system (described in more detail below).
• Semi-supervised learning: Semi-supervised learning offers a happy medium between supervised and unsupervised learning. During training, it uses a smaller labelled data set to guide classification and feature extraction from a larger, unlabelled data set. Semi-supervised learning can solve the problem of not having enough labelled data (or not being able to afford to label enough data) to train a supervised learning algorithm.
• Reinforcement machine learning Reinforcement machine
learning is a behavioural machine learning model that is similar to
supervised learning, but the algorithm isn’t trained using sample
data. This model learns as it goes by using trial and error. A
sequence of successful outcomes will be reinforced to develop the
best recommendation or policy for a given problem.
• Facebook recognizing the face of your friend?
• Amazon suggesting products based on your previous choices?
• Netflix suggesting movies based on the movies you watch?
• All of these are machine learning in action. A minimal sketch contrasting supervised and unsupervised learning follows.
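A minimal scikit-learn sketch contrasting the two settings on toy 2-D data (the points and labels are made up; assumes scikit-learn and NumPy are installed):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy 2-D points forming two loose groups, for illustration only
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])

# Supervised: labels are given, and the model learns to predict them
y = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y)
print("supervised predictions:", clf.predict([[2, 2], [9, 9]]))

# Unsupervised: no labels; the algorithm discovers the groupings itself
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("unsupervised cluster ids:", km.labels_)
```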
Machine Learning Frameworks
• A Machine Learning Framework is an interface, library or tool which allows
developers to build machine learning models easily, without getting into the depth
of the underlying algorithms.
• TensorFlow: Google's TensorFlow is one of the most popular frameworks today. It is an open-source software library for numerical computation using data flow graphs (a minimal sketch follows this list).
• Scikit-Learn
• Scikit-learn is one of the most well-known ML libraries. It supports both supervised and unsupervised learning algorithms, with implementations of linear and logistic regression, decision trees, clustering, k-means, and more.
• Amazon Machine Learning provides visualization tools that help you go through
the process of creating machine learning (ML) models without having to learn
complex ML algorithms and technology.
• Google Cloud ML Engine
• Cloud Machine Learning Engine is a managed service that helps developers and
data scientists in building and running superior machine learning models in
production.
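For a feel of the first framework listed above, a minimal TensorFlow sketch (TensorFlow 2.x eager style; the tiny model and data are illustrative assumptions):

```python
import tensorflow as tf

# Numerical computation on tensors
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0, 1.0], [1.0, 1.0]])
print(tf.matmul(a, b))

# A one-neuron Keras model fitted to a trivial linear relation (y = 2x)
model = tf.keras.Sequential([tf.keras.Input(shape=(1,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")
xs = tf.constant([[1.0], [2.0], [3.0], [4.0]])
ys = tf.constant([[2.0], [4.0], [6.0], [8.0]])
model.fit(xs, ys, epochs=100, verbose=0)
print(model.predict(tf.constant([[5.0]])))  # approaches 10.0
```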
Cluster Analysis
• Cluster analysis is the process of finding similar groups of objects in order to form clusters. It is an unsupervised machine-learning technique that acts on unlabelled data. A group of data points comes together to form a cluster, in which all the objects belong to the same group.
• A cluster is nothing but a collection of similar data grouped together.
• For example, grouping similar customers together on the basis of purchase history.
Properties of Clustering:
• Clustering scalability: In order to handle extensive databases, the clustering algorithm should be scalable; if it is not, we cannot get appropriate results and will be led to wrong conclusions.
• High dimensionality: The algorithm should be able to handle high-dimensional spaces as well as data of small size.
• Usability with multiple data kinds: Different kinds of data can be used with clustering algorithms. An algorithm should be capable of dealing with different types of data: discrete, categorical and interval-based data, binary data, etc.
• Dealing with unstructured data: Some databases contain missing values and noisy or erroneous data. If an algorithm is sensitive to such data, it may produce poor-quality clusters, so it should be able to handle such data gracefully.
Clustering Methods
• Partitioning Method
• Hierarchical Method
• Density-Based Method
• Grid-Based Method
• Model-Based Method
• Constraint-Based Method
• Partitioning method: It is used to make partitions on the data in order to form clusters. If "n" partitions are made on "p" objects of the database, then each partition is represented by a cluster, and n < p.

• Density-based method: The density-based method mainly focuses on density. In this method, the given cluster keeps growing as long as the density in its neighbourhood exceeds some threshold, i.e., for each data point within a given cluster, a neighbourhood of a given radius has to contain at least a minimum number of points.
• Model-based method: In the model-based method, all the clusters are hypothesized in order to find the data that best fits the model. The clustering of the density function is used to locate the clusters for a given model. It reflects the spatial distribution of the data points and also provides a way to automatically determine the number of clusters based on standard statistics, taking outliers or noise into account; it therefore yields robust clustering methods.
• Constraint-based method: The constraint-based clustering method is performed by incorporating application- or user-oriented constraints. A constraint refers to the user's expectations or the properties of the desired clustering results. Constraints provide an interactive way of communicating with the clustering process, and can be specified by the user or by the application requirement.
Unsupervised Learning (K-means Clustering)
• K-means starts with k observations as initial centres, and clusters are formed around them. The means are then readjusted and new clusters are formed; this process is repeated until the clusters stabilize.
• Used for large data sets.
• Use the elbow method to decide the number of clusters.
• Load data: The first step is to load the data into the programming language of your choice. The data should be pre-processed and features extracted.
• Choose the number of clusters: The next step is to choose the number of clusters for K-means. The number of clusters should be chosen based on the problem and the data. There are various methods to determine the number of clusters, such as the elbow method.
• Initialize centroids: The next step is to randomly initialize the centroids for K-means.
Centroids are the points that represent the centers of the clusters. You can use the
kmeans++ initialization method to choose the initial centroids.
• Assign data points to clusters: The next step is to assign each data point to the nearest
centroid. This is done by calculating the distance between each data point and each
centroid.
• Recalculate centroids: The next step is to recalculate the centroids for each cluster. This
is done by taking the mean of all the data points in the cluster.
• Repeat steps 4 and 5: Steps 4 and 5 are repeated until the centroids converge. The
convergence criteria can be based on the number of iterations or the change in
centroids.
• Evaluate the results: The final step is to evaluate the results. This can be done by calculating the within-cluster sum of squares (WCSS) or by visualizing the clusters; you can use a plot function to visualize the clusters in two or three dimensions (a minimal sketch of the whole procedure follows this list).
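A minimal sketch of these steps with scikit-learn, including the elbow plot (the blob data is synthetic, generated only for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Step 1: load (here, generate) pre-processed data with 3 true clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Step 2: elbow method - plot WCSS (inertia) for several k, look for the bend
wcss = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                random_state=42).fit(X)
    wcss.append(km.inertia_)
plt.plot(range(1, 8), wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("WCSS (inertia)")
plt.show()

# Steps 3-6: fit() initializes centroids (k-means++ here), then assigns
# points and recalculates centroids repeatedly until convergence
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42).fit(X)

# Step 7: evaluate/visualize the resulting clusters and their centroids
plt.scatter(X[:, 0], X[:, 1], c=km.labels_)
plt.scatter(*km.cluster_centers_.T, marker="x", s=100, c="red")
plt.show()
```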
Hierarchical Clustering
• Hierarchical clustering groups similar data points together based on their distances from each other. The basic idea is to build a hierarchy of clusters, starting with each data point in its own cluster and then merging the most similar clusters together until all of the data points are in a single cluster (a minimal sketch follows).
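A minimal sketch of agglomerative hierarchical clustering with SciPy, on small synthetic data:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

# Small synthetic data set, for illustration only
X, _ = make_blobs(n_samples=20, centers=3, random_state=7)

# Agglomerative clustering: start with each point in its own cluster and
# repeatedly merge the closest pair (Ward linkage minimizes variance)
Z = linkage(X, method="ward")

# The dendrogram visualizes the full merge hierarchy
dendrogram(Z)
plt.ylabel("merge distance")
plt.show()

# Cut the hierarchy into 3 flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```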
Applications of Cluster Analysis:

• It is widely used in image processing, data analysis, and pattern recognition.
• It helps marketers to find the distinct groups in their customer base, and they can characterize their customer groups by using purchasing patterns.
• It can be used in the field of biology, by deriving animal and plant taxonomies and identifying genes with the same capabilities.
• It also helps in information discovery by classifying documents on the web.
Text Analytics
• A powerful tool for analyzing large volumes of text data and extracting valuable insights that can be used to make data-driven decisions.
• or
• It involves applying statistical and computational techniques to text data to identify patterns and relationships between words and phrases, and to uncover insights that can help organizations make data-driven decisions.
• or
• Text analytics combines a set of machine learning, statistical and linguistic techniques to process large volumes of unstructured text, or text that does not have a predefined format, to derive insights and patterns.

• Text mining can be used to identify whether customers are satisfied with a product by analyzing their reviews and surveys. Text analytics is used for deeper insights, like identifying a pattern or trend in the unstructured text. For example, text analytics can be used to understand a negative spike in the customer experience or the popularity of a product.
• As of 2020, around 4.57 billion people have access to the internet, roughly 59 percent of the world's population, and about 49 percent of people are active on social media. An enormous amount of text data is generated every day in the form of blogs, tweets, reviews, forum discussions, and surveys. Besides, most customer interactions are now digital, which creates another huge text database. Most of this text data is unstructured and scattered around the web. If this text data is gathered, collated, structured, and analyzed correctly, valuable knowledge can be derived from it.
Advantages of Text Analytics
• Helps businesses understand customer trends, product performance, and service quality. This results in quick decision making, enhanced business intelligence, increased productivity, and cost savings.
• Helps researchers explore a great deal of pre-existing literature in a short time, extracting what is relevant to their study. This enables quicker scientific breakthroughs.
• Assists in understanding general trends and opinions in society, enabling governments and political bodies to make decisions.
• Text analytic techniques help search engines and information retrieval systems improve their performance, thereby providing faster user experiences.
• Refines user content recommendation systems by categorizing related content.
Text Analytics Techniques
• Sentiment analysis is used to identify the emotions conveyed by unstructured text, for example angry, happy, or neutral. The input text includes product reviews, customer interactions, social media posts, forum discussions, or blogs.
• Customer feedback analysis: Sentiment analysis can be used to analyze customer
feedback, such as product reviews or survey responses, to understand customer
sentiment and identify areas for improvement. For example, a company could use
sentiment analysis to identify common themes in negative reviews and use that
information to improve product features or customer service.
• Brand reputation monitoring:help analyzing mentions of a company or brand on
social media, news articles, or other online sources.
• Market research: help companies identify opportunities for innovation or new
product development.
• Financial analysis: helps investors make more informed decisions by identifying potential risks and opportunities (a minimal sentiment sketch follows this list).
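A minimal sentiment-analysis sketch using NLTK's VADER analyzer (assumes nltk is installed; the reviews are made up, and the score thresholds are a common convention, not a fixed rule):

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

reviews = [
    "I love this product, it works perfectly!",    # made-up review
    "Terrible experience, it broke after a day.",  # made-up review
    "The package arrived on Tuesday.",             # made-up review
]
for text in reviews:
    compound = sia.polarity_scores(text)["compound"]  # -1 (neg) .. +1 (pos)
    label = ("happy" if compound > 0.05
             else "angry" if compound < -0.05
             else "neutral")
    print(f"{label:8} {compound:+.2f}  {text}")
```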
• Named Entity Recognition (NER) is used for identifying named entities like people, places, organizations, and events in unstructured text. NER extracts nouns from the text and determines the values of these nouns.
• Customer relationship management: NER can be used to identify the names and
contact information of customers mentioned in customer feedback, such as
product reviews or survey responses. This can help companies to personalize their
communication with customers and improve customer satisfaction.
• Fraud detection: NER can be used to identify names, addresses, and other
personal information associated with fraudulent activities,
• Media monitoring: NER can be used to monitor mentions of specific companies,
individuals, or topics in news articles or social media posts. This can help
companies to stay up-to-date on the latest trends and monitor their brand
reputation.
• Market research: NER can be used to identify the names and affiliations of experts
or key influencers in a particular industry or field.
• Document categorization: NER can be used to automatically categorize documents based on the named entities mentioned in the text. This can help companies to quickly identify relevant documents and extract useful information (a minimal NER sketch follows this list).
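A minimal NER sketch with spaCy (assumes spaCy and its small English model are installed, e.g. via python -m spacy download en_core_web_sm; the sentence is made up):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # small pretrained English pipeline

text = ("Apple acquired a London startup in 2023, "
        "according to a statement by Tim Cook.")
doc = nlp(text)

# Each entity has a text span and a type label (ORG, GPE, DATE, PERSON, ...)
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
```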
• Event extraction: Event extraction recognizes events mentioned in text content, for example mergers, acquisitions, political moves, or important meetings. Advanced algorithms strive to recognize not only the events but also the venue, participants, date, and time wherever applicable.
• Link analysis: This is a technique to understand "who met whom and when" through event extraction from communication over social media, for example to spot possible threats.
• Social media monitoring: Event extraction can be used to monitor social media posts and identify events or activities related to a particular topic or brand.
• Business risk monitoring: Event extraction techniques allow businesses to monitor the web to find out if any of their partners, like suppliers or vendors, are dealing with adverse events like lawsuits or bankruptcy.
Artificial Intelligence
• AI can be traced to classical philosophers'
attempts to describe human thinking as a
symbolic system.
• Artificial Intelligence can gather and organize
vast volumes of data in order to draw
inferences and estimates that are outside of
the human ability to comprehend manually.
• Example: self-driving vehicles.
