Data Mining Graphs and Networks

The document discusses two primary methods for frequent substructure mining: the Apriori-based approach and the Pattern-growth approach, detailing their algorithms and processes. It also covers constraint-based substructure mining, network analysis, link mining, and community mining, emphasizing the challenges and techniques involved in analyzing relationships within large datasets. The text highlights the importance of links in networks for effective data mining and the complexities of handling multi-relational data.


There are two methods for frequent substructure mining.


The Apriori-based approach: This approach finds frequent graphs starting from graphs of small size and advances in a bottom-up way, creating candidates with one extra vertex or edge at each step. The algorithm is called AprioriGraph. Let Qk denote the frequent substructure set of size k. The approach uses a level-wise mining technique: at each level, candidate generation must be done first, by joining two frequent subgraphs that are the same except for a slight variation. After the new substructures are formed, their frequency in the graph data set is checked, and the graphs found frequent are used to create the candidates of the next level. This candidate-generation step is complex for substructures, whereas generating candidates from itemsets is easy and effortless.
Let’s consider an example of two itemsets of size three, pqr and pqs. The itemset derived by joining them would be pqrs. But when it comes to substructures, there is more than one way to join two substructures.
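The itemset join above can be sketched in a few lines of Python. This is a minimal illustration of the Apriori join rule, not code from the text; the function name is made up.

```python
def join_itemsets(a, b):
    """Apriori join: two frequent k-itemsets that share their first
    k-1 items (in sorted order) combine into one (k+1)-candidate."""
    a, b = sorted(a), sorted(b)
    if a[:-1] == b[:-1] and a[-1] != b[-1]:
        return tuple(sorted(set(a) | set(b)))
    return None  # the pair does not join

print(join_itemsets(("p", "q", "r"), ("p", "q", "s")))  # ('p', 'q', 'r', 's')
```

For graphs no such single canonical join exists, which is exactly why substructure candidate generation is the hard step.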
Algorithm:
This approach performs Apriori-based frequent substructure mining.
Input:
F = a graph data set
min_support = minimum support threshold
Output:
Q1, Q2, Q3, ..., Qk, the frequent substructure sets with sizes ranging from 1 to k.

Q1 <- all the frequent 1-subgraphs in F;
k <- 2;
while Qk-1 ≠ ∅ do
  Qk <- ∅;
  Gk <- candidate_generation(Qk-1);
  foreach candidate l ∈ Gk do
    l.count <- 0;
    foreach Fi ∈ F do
      if subgraph_isomorphism(l, Fi) then
        l.count <- l.count + 1;
      end
    end
    if l.count ≥ min_support(F) ∧ l ∉ Qk then
      Qk <- Qk ∪ {l};
    end
  end
  k <- k + 1;
end
It is an iterative method in which candidate generation takes place first, followed by support computation. Support is counted by testing subgraph isomorphism against each graph in the data set, and the subgraphs found frequent drive the next iteration, which makes the approach effective for FSM (frequent substructure mining). The Apriori approach uses BFS (breadth-first search) because of its iterative, level-wise generation of candidates: to mine the subgraphs of size k+1, all subgraphs up to size k must already have been mined.
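The level-wise loop can be sketched in Python. This toy version represents each graph as a frozenset of edges and grows candidates one edge at a time, using edge-set containment as a stand-in for the full subgraph-isomorphism test; all names and the sample data are illustrative, not from the text.

```python
def apriori_graph(dataset, min_support):
    """Toy level-wise frequent substructure mining.
    dataset: list of graphs, each a frozenset of edges (2-tuples).
    Edge-set containment stands in for subgraph isomorphism."""
    all_edges = set().union(*dataset)
    # Q1: frequent single edges
    level = {frozenset([e]) for e in all_edges
             if sum(e in g for g in dataset) >= min_support}
    frequent = set(level)
    while level:
        # candidate generation: extend each frequent pattern by one edge
        candidates = {pat | {e} for pat in level for e in all_edges
                      if e not in pat}
        # support counting over the whole data set
        level = {c for c in candidates
                 if sum(c <= g for g in dataset) >= min_support}
        frequent |= level
    return frequent

graphs = [frozenset({(1, 2), (2, 3)}),
          frozenset({(1, 2), (2, 3), (3, 4)}),
          frozenset({(1, 2), (3, 4)})]
print(len(apriori_graph(graphs, min_support=2)))  # 5 frequent patterns
```

Note how the `while` loop cannot reach size k+1 before level k is complete, which is the BFS character mentioned above.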
The Pattern-growth approach: The pattern-growth approach can use either BFS or DFS (depth-first search); DFS is preferred because it consumes less memory. Consider a graph h. A new graph can be formed by adding an edge e. The edge may introduce a new vertex, but it does not have to: the extension is called forward if e introduces a new vertex and backward if it connects two existing vertices. The pattern-growth method is simple but not very efficient, because it may generate a graph that has already been created, which leads to computational inefficiency. The duplicate graphs can be removed afterwards, but that increases time and work. To avoid creating duplicates in the first place, frequent graphs must be extended carefully and conservatively, which calls for more refined algorithms.
Algorithm:
The algorithm below is a simplistic pattern-growth-based frequent substructure mining. To search without generating duplicates, a more refined algorithm such as gSpan must be used.
Input:
q = a frequent graph
F = a graph data set
min_support = minimum support threshold
Output:
P = the frequent graph set

P <- ∅;
Call patterngrow_graph(q, F, min_support, P);

procedure patterngrow_graph(q, F, min_support, P)
  if q ∈ P then return;
  else insert q into P;
  scan F once, and find all the edges e such that q can be extended to q -> e;
  for each frequent q -> e do
    patterngrow_graph(q -> e, F, min_support, P);
  return;

An edge e is used to extend the old graph q into a new graph, denoted q -> e. The extension can be either backward or forward.
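The recursive growth step can be sketched in Python, again using edge-set containment as a stand-in for subgraph isomorphism. The seen-set `P` is what discards the duplicate patterns the text mentions; all names and the sample graphs are illustrative.

```python
def pattern_grow(q, dataset, min_support, P):
    """Recursively extend pattern q (a frozenset of edges) by one
    frequent edge at a time, depth-first."""
    if q in P:          # duplicate pattern: already explored
        return
    P.add(q)
    all_edges = set().union(*dataset)
    for e in all_edges - q:
        extended = q | {e}
        support = sum(extended <= g for g in dataset)
        if support >= min_support:
            pattern_grow(extended, dataset, min_support, P)

graphs = [frozenset({(1, 2), (2, 3)}),
          frozenset({(1, 2), (2, 3), (3, 4)}),
          frozenset({(1, 2), (3, 4)})]
P = set()
for e in set().union(*graphs):            # seed with frequent 1-edge patterns
    if sum(e in g for g in graphs) >= 2:
        pattern_grow(frozenset([e]), graphs, 2, P)
print(len(P))  # 5, the same patterns the level-wise version finds
```

The depth-first recursion needs only the current branch in memory, which is why DFS is preferred here over the level-wise BFS of the Apriori approach.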
Constraint-based substructure mining
The constraints a user specifies change the mining process. If we generalize and categorize them into specific classes, mining can be handled easily by pushing the constraints into the given frameworks; this constraint-pushing strategy is used in pattern-growth mining tasks. Let’s see some important constraint categories.
 Subgraph containment constraint: This constraint is used when a user requests patterns that contain specified subgraphs; it is also called a set-containment constraint. The given set of subgraphs is taken as a query, and mining proceeds on the chosen data by extending patterns from the subgraph set. This technique can be used when the user requests patterns with specific sets of edges or vertices.
 Value-sum constraint: Here, the constraint is on the sum of the weights on the edges, with a high and a low bound. Assuming positive edge weights, the two conditions can be designated as sum(h) ≥ low and sum(h) ≤ high. The first condition is called a monotonic constraint, because once it is satisfied it remains satisfied no matter how many further edges are added. The latter condition is called an anti-monotonic constraint, because once it is violated, no further extension can ever satisfy it. With these properties, the constraint-pushing technique works out well.
 Geometric constraint: Here the constraint requires the angle between every pair of connected edges to lie within a given range. Consider a graph h; the constraint can be written as

A_h = { min_angle ≤ angle(E1, V, E2) ≤ max_angle },

where E1 and E2 are edges connected at the vertex V and connected at their other ends to two other vertices V1 and V2. A_h is an anti-monotonic constraint, because if any angle formed by two connected edges fails to satisfy it, the graph will never satisfy A_h at any later level. It can therefore be pushed into the edge-extension process to eliminate any extension that does not satisfy A_h.
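Constraint pushing can be sketched for the value-sum case: the anti-monotone bound (sum ≤ high) is checked at every extension and prunes the whole subtree immediately, while the monotone bound (sum ≥ low) only decides whether a pattern is reported. The edge weights and bounds below are invented for illustration.

```python
def grow_with_constraints(pattern, weights, edges, low, high, out):
    """Grow edge patterns while pushing a value-sum constraint:
    prune on the anti-monotone bound (sum > high can never recover),
    report on the monotone bound (sum >= low stays satisfied)."""
    total = sum(weights[e] for e in pattern)
    if total > high:        # anti-monotone: cut off this whole subtree
        return
    if total >= low:        # monotone: this pattern qualifies
        out.add(frozenset(pattern))
    for e in edges - pattern:
        grow_with_constraints(pattern | {e}, weights, edges, low, high, out)

weights = {"a": 2, "b": 3, "c": 5}
out = set()
grow_with_constraints(frozenset(), weights, set(weights), low=4, high=8, out=out)
print(sorted("".join(sorted(p)) for p in out))  # ['ab', 'ac', 'bc', 'c']
```

The pattern abc (sum 10) is never even enumerated past its first check, which is the saving that constraint pushing buys.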
Network Analysis
In network analysis, the relationships between units are called links, and the units and their relationships form a graph. From the data mining outlook, this is called link mining or link analysis. A network is a multi-relational data set represented as a graph, usually very large, with nodes as objects and edges as links denoting the relationships between the nodes or objects. Telephone networks and the WWW (World Wide Web) are very good examples. Network analysis also helps in filtering data sets and providing customer-preferred services. Every network consists of numerous nodes and the data sets are enormous, so studying and mining useful information from such a wide body of data helps in solving problems and in the effective transmission of data.

Link Mining

Conventional machine-learning methods take homogeneous objects from a single relation. In networks this is not applicable, because of the large number of nodes and their multi-relational, heterogeneous nature. Link mining has thus emerged as a new field after much research: it is the convergence of work in graph mining, networks, hypertext, logic programming, link analysis, and predictive analysis and modeling. Links are simply the relationships between nodes in a network, and with their help the mining process can be carried out efficiently. This calls for several kinds of tasks.
 Link-based object classification: In link mining, attributes alone are not enough; the links and the traits of the linked nodes are also necessary. One good example is web-based classification, in which the system predicts the category of a web page based on the words that occur on the page and on the anchor text, the clickable text of the hyperlinks that point to the page. These two things act as attributes in web-based classification, and the attributes can be anything related to the links and the linked pages.
 Link type prediction: Based on the properties of the objects involved, the system predicts the type or purpose of a link. In organizations, it helps suggest interactive communication sessions between employees where needed. In the online retail market, it helps predict what a customer prefers to buy, which can increase sales and improve recommendations.
 Object type prediction: Here the prediction of an object’s type is based on its attributes and properties and on the links and traits of the objects linked to it. For example, in the restaurant domain this method can predict whether a customer prefers ordering food or visiting the restaurant directly. It also helps predict the method of communication a customer prefers, whether by phone or by mail.
 Link cardinality estimation: This task has two forms. The first predicts the number of links attached to an object. For example, the authority of a web page can be estimated from the number of links pointing to it, called in-links; web pages that act as hubs, meaning pages that point to a set of other pages on the same topic, can be identified using out-links. Likewise, when a pandemic strikes, tracing the links of an affected patient can lead us to other patients, which helps control transmission. The second form predicts the number of objects reachable along a path from a given object; this is crucial for estimating the number of objects a query will return.
 Predicting link existence: While link type prediction predicts the type of a link, here the system predicts whether a link exists between two objects at all. For instance, this task can be used to predict whether a link exists between two web pages.
 Object reconciliation: The task here is to predict whether two objects are in fact the same, on the basis of their attributes, traits, or links. This problem is also called identity uncertainty or record linkage. The same procedure appears in citation matching, information extraction, duplicate elimination, and object consolidation. For instance, this task can tell whether one website mirrors another.
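The in-link and out-link counts from the link-cardinality task can be computed directly from a link table. The pages and links below are invented for illustration.

```python
from collections import Counter

# each pair is (source page, destination page)
links = [("hub.example", "a.example"),
         ("hub.example", "b.example"),
         ("hub.example", "c.example"),
         ("a.example", "c.example")]

out_links = Counter(src for src, _ in links)   # hub-ness: pages pointed to
in_links = Counter(dst for _, dst in links)    # authority: pages pointing in

print(out_links["hub.example"], in_links["c.example"])  # 3 2
```

Here hub.example is a hub (three out-links) and c.example carries the most authority (two in-links), matching the in-link/out-link distinction in the text.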

Challenges in Link Mining

 Statistical compared to logical dependencies: Graph link structures denote the logical relationships between objects, while probabilistic dependencies denote the statistical relationships. Handling these two kinds of dependency coherently is difficult in multi-relational data mining: one must be careful to find the logical dependencies between objects along with the probabilistic relationships between attributes. These dependencies take a large amount of space, which complicates the mathematical models deployed.
 Collective classification and consolidation: Consider a model trained on class-labeled objects. In conventional classification, each object is classified based only on its own attributes. When the trained model is applied to unlabeled objects, the correlations among linked objects complicate classification. This calls for a supplementary iterative step that consolidates the label of each object based on the labels of the objects linked to it; this is where collective classification takes place.
 Constructive use of labeled and unlabeled data: One emerging technique is to merge labeled and unlabeled data. Unlabeled data assist in identifying the distribution of attributes; the links present among unlabeled data help in extracting the linked objects’ attributes; and the links between unlabeled and labeled data help in establishing dependencies, which increases the efficiency of inference.
 Open compared to closed-world assumptions: Conventional methods assume that we know all the potential objects and entities in the domain; this is the closed-world assumption. But the closed-world assumption is impractical in real applications. This calls for specific languages that can describe probability distributions over relational structures containing a varying set of objects.

Community Mining

Network analysis includes finding groups of objects that share similar attributes; this process is known as community mining. In web-page linkage, a community is a group of web pages that follow a common theme. Many community mining algorithms assume that there is only one network and try to establish homogeneous relationships, but real-world web pages form multiple networks with heterogeneous relationships. This proves the need for multi-relational community mining.
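As a minimal single-network illustration of the idea, the crudest notion of a community is a connected component of the link graph; real community mining uses far richer criteria and, as noted above, multiple relations. The graph below is invented.

```python
def communities(adj):
    """Connected components of an undirected link graph,
    the simplest single-relation notion of a community."""
    seen, groups = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:                 # iterative DFS over the component
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adj[node])
        seen |= group
        groups.append(group)
    return groups

adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "x": ["y"], "y": ["x"]}
print(sorted(sorted(g) for g in communities(adj)))  # [['a', 'b', 'c'], ['x', 'y']]
```

A multi-relational miner would instead combine several such link graphs (one per relationship type) before deciding which groups form genuine communities.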
