0% found this document useful (0 votes)
15 views18 pages

Genetic Algorithm-Based Community Detection in Lar

Uploaded by

Iyade Salah
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views18 pages

Genetic Algorithm-Based Community Detection in Lar

Uploaded by

Iyade Salah
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Neural Computing and Applications (2020) 32:9649–9665

https://doi.org/10.1007/s00521-019-04487-0
(0123456789().,-volV)(0123456789().
,- volV)

ORIGINAL ARTICLE

Genetic algorithm-based community detection in large-scale social


networks
Ranjan Kumar Behera1 • Debadatta Naik2 • Santanu Kumar Rath1 • Ramesh Dharavath1

Received: 14 February 2019 / Accepted: 29 August 2019 / Published online: 6 September 2019
Ó Springer-Verlag London Ltd., part of Springer Nature 2019

Abstract
Communities in social networks are the essential feature which may be considered as a potential parameter in modeling the
behavior of the social entities. Detection of communities has attracted a lot of attention in research in social network
analysis. It is one of the major challenging problems as it involves high complexity in processing complex web structure. In
fact, this problem can be considered as a NP-complete problem in large-scale networks, as this problem is somewhat
reducible to the clique problem in graph theory. A number of meta-heuristic algorithms have been proposed to explore the
hidden communities. Most of these algorithms have considered the modularity of the network as their objective function.
But, the aspect of optimizing the value of modularity is associated with a problem known as resolution limit, where the size
of the detected communities depends on the number of edges existing in the network. In this paper, a genetic algorithm-
based community detection has been proposed where an efficient single objective function based on similarity matrix has
been devised. The similarity index between each pair of nodes has been calculated in a distributed manner over multiple
computing nodes. Similarity index proposed in this paper is based on the topological structure of the network. The
effectiveness of the proposed approach is examined by comparing the performance with other state-of-the-art community
detection algorithms applied over some real-world network datasets.

Keywords Genetic algorithm  Community detection  Similarity index  Fitness function

1 Introduction about the functionality of large-scale system. However,


detecting communities in the large-scale network is a
In the real world, the entity of a complex structure can be challenging task due to its heterogeneous nature and high
represented as a network structure. Communities are the complexity. It is also due to the existence of several defi-
basic building blocks in network evolution. They usually nitions for community based on different features and
resemble with functional units in the complex systems. By characteristics. As per the terminology, communities are
analyzing the communities, one can get the detailed insight the group of nodes that are densely connected within
the group [1]. Connections among the nodes indicate the
similarity between the nodes. Similarities between the
& Ramesh Dharavath
[email protected] nodes are based on either topological features or content
features. Similarities between the nodes within the com-
Ranjan Kumar Behera
[email protected] munity tend to be more similar as compared to nodes
outside the community.
Debadatta Naik
[email protected] Traditional community detection algorithms fail to
identify all the hidden communities in large-scale networks
Santanu Kumar Rath
[email protected] in reasonable amount of time. In fact, the problem is found
to be a NP-complete [2]. Research on community detection
1
Department of Computer Science and Engineering, National has received a lot of attention in various application
Institute of Technology, Rourkela 769008, India domains. For example, communities in biological network
2
Department of Computer Science and Engineering, Indian may correspond to protein complex unit. In social network,
Institute of Technology (ISM), Dhanbad 769008, India

123

Content courtesy of Springer Nature, terms of use apply. Rights reserved.


9650 Neural Computing and Applications (2020) 32:9649–9665

it may be a group formed by the users based on common comparison are presented in Sect. 7. Section 8 presents the
interest. Communities in the network are the higher-order limitations of the proposed approach. Finally, conclusion
substructure within the network. Identifying and analyzing and future work are discussed in Sect. 9.
these substructures may help in understanding the inherent
features and predicting the behavior of the complex sys-
tems. The approach to solve the community detection 2 Related literature
problem can either be heuristic based or optimization
based. Heuristic-based community detection algorithms are Several algorithms have been proposed to identify the
often based on few assumptions such as in Girvan Newman communities in large-scale networks as reported in the lit-
algorithm, where the assumption is that the edges in inter- erature [5–8]. The most commonly accepted one was pro-
communities have higher edge-betweenness value as posed by Newman and Girvan, which is similar to the
compared to edges in intra-communities [3]. Communities approach followed on the hierarchal clustering in data
can be identified sequentially by removing the edges in mining [9]. In this algorithm, communities were identified
non-increasing order of the edge-betweenness value. after the removal of edges in the non-increasing order of
Optimization-based community detection algorithms deal their edge-betweenness value. The edge-betweenness
with maximizing the objective function that captures the measure is defined as the ratio of number of shortest paths
structural information in the network [4]. However, the (between any two nodes), that are passed through it, to the
algorithms based on the maximizing modularity expose total number of shortest paths in the network. The
with two major problems. First, computing modularity assumption taken in Newman and Girvan algorithm is that
value for all the possible combination of partition is a NP- the edges, present between the communities, are having
hard problem, and second, modularity-based algorithm higher edge-betweenness value [10]. On the other hand, in
often is prone to resolution limits, which proves that the case of weighted graph, the edge betweenness value
detected community size depends on the size of the net- depends on the content associated with the links [11]. By
work, and few small-sized communities may not be iden- identifying and removing these edges, one can identify the
tified in the detection process. hidden communities. However, in their algorithm edge-
In order to handle the NP-completeness of the problems, betweenness value is recalculated after the removal of each
genetic algorithm has been considered as one of the suit- edge which is a very time-consuming process, especially for
able approaches to obtain an optimal result. The method- large-scale networks. In this paper, we have tried to reduce
ology of GA is inspired by the biological evolution of the complexity of the problem by applying meta-heuristic
genes to obtain better offspring. Standard genetic algo- approaches such as genetic algorithm. Clauset et al. [12]
rithms randomly initialize set of candidate solutions for the proposed a community detection algorithm which is based
problem, where an objective function is defined to measure on the greedy paradigm. It is based on the agglomerative
the fitness value for each solution. Set of solutions having approach, where initially each node in the network is con-
better fitness value are considered for processing the next sidered as a community. Communities are then merged on
generation sequences. Also, a set of new offsprings are the basis of a measure known as modularity (Q). It is
generated based on the genetic operator. To solve the defined as the difference between inter-community edge-
related issues, in this paper, genetic algorithm has been density and intra-community edge-density. It depends on
applied to identify and analyze the community structure in two structural parameters eij and ai which are defined as
large networks. Depending on the structure of the network, follows:
it is often partitioned into different number of communi- 1 X
ties. In this study, a new objective function has been pro- eij ¼ Avw ð1Þ
2m vw
posed to measure the fitness value for each chromosome.
The subsequent section of the paper is presented as and
follows: Sect. 2 presents the related literature about the
1 X
community detection problem. Section 3 briefly explains ai ¼ kvw ð2Þ
2m vw
the problem statement of the work. Similarity index has
been considered as the major parameter in evaluating fit- Here, eij measures the fraction of inter-community edges
ness quality. Different similarity indices are presented in and ai measures the intra-community edges in the network.
Sect. 4. The proposed community detection approach The modularity proposed by the author Clauset et al. [12]
based on genetic algorithm is discussed in Sect. 5. can be expressed as:
Section 6 presents the quality score to evaluate the effec-  
1 X kv kw
tiveness of the algorithm. Experimental setup, detail of the Q¼ A vw  dðCv ; Cw Þ ð3Þ
dataset used for implementation and algorithms used for 2m vw 2m

123

Content courtesy of Springer Nature, terms of use apply. Rights reserved.


Neural Computing and Applications (2020) 32:9649–9665 9651

where m is the total number of edges in the network Avw is gene concept was combined with an efficient mutation
1, if an edge exist between node v and node w. Otherwise, method to analyze the modularity function. Another col-
its value is 0. k2mv kw
is the probability of having an edge laborative evolutionary model was proposed by Chira and
between node v and node w with degree kv and kw , Gog to detect the communities in a network [17]. In this
respectively, if edges are randomly assigned in the net- work, search space was guided by best–worst recombina-
work. dðCv ; Cw Þ is one, if both v and w belong to the same tion and model was dependent on collaborative selection.
community and it is zero, if they belong to the different The advantage of this algorithm is that it does not require
communities. The performance of the algorithm depends prior knowledge of number of communities present in the
on the efficient computation of the modularity. It also network. On the other hand, this algorithm does not com-
suffers from a problem known as resolution limit, where pute the overlapping community structures of the network.
small-sized communities remain unidentified in the pro- Gong et al. [18] proposed a single objective memetic
cess. In this paper, we address this issue by proposing a algorithm for community detection, which optimizes the
new quality metric to quantify the community structure in modularity density function to obtain the optimal solutions.
the network. We name it as group similarity density. Piz- In this approach, hill climbing strategy was used for local
zuti et al. [13] proposed a genetic algorithm for community search procedure. In the subsequent year, Gong et al. [19]
detection in social networks. The algorithm optimizes fit- proposed an improved version of memetic algorithm by
ness function known as community score to effectively presenting population generation by means of label prop-
identify the densely connected nodes. In their algorithm, agation strategy, an elitism strategy and an ISACLS local
each chromosome consists of vector of n number of genes, search strategy. A differential evolution-based optimization
where n is the number of nodes in the dataset. The value algorithm was proposed by Jia et al. [20] to detect the
assigned at each index is in the range of 1 to n. Here, the communities in a given network, considering modularity as
maximum number of solutions could be nn which is fitness function. Binary crossover was used for the trans-
exponential in nature. Unlike the encoding scheme adopted mission of information in the evolution. No prior knowl-
by Pizzuti et al. [13], we have encoded our solution as edge regarding the number of communities was required,
vector of n elements, in which the value assigned to it is in and quality was increased by implementing biased initial-
the range of 1 to k, where k is the predefined number of ization process and clean-up operations. Shang et al. [21]
communities. In our encoding scheme, the maximum proposed the improved genetic algorithm for community
number of solutions could be kn . This reduces the solution detection, in which modularity function is used as the fit-
space drastically, which leads to faster convergence of the ness function and simulated annealing procedure as local
algorithm. In the algorithm proposed by Clara Pizzuti et al., search strategy. This algorithm needs prior knowledge of
the number of communities detected is same as the number number of community structures. A local search-based
of connected components existing in the network. How- genetic algorithm was proposed by Liu et al. [22]. In this,
ever, this could be unrealistic in real-world network as in a concepts of marginal gene and monotonicity were used to
connected component there could be more than one com- overcome the problems associated with state-of-the-art
munity. Another algorithm proposed by Radicchi is based mutation methods. Along with these concepts, local search
on two quantitative definitions [14]. A subgraph in a net- strategy was combined to form a new mutation method.
work is said to be a community, if it has more inter-con- Similarly, Shi et al. [23] developed a genetic algorithm-
nections as compared to the number of connections outside based strategy that uses genetic operations to cluster the
the subgraph in the stronger sense. On the other hand, if the links associated among nodes of the network. Since a node
sum of all in-degrees in vertex set V is greater than the sum can be a member of more than one cluster, the proposed
of the out-degrees, then the subgraph is said to be a com- algorithm was able to identify the overlapping community
munity in a weaker sense. This algorithm works in a structures. A knowledge-based evolutionary algorithm was
manner similar to the one proposed by Newman [9]. proposed by Zadeh et al. [24] to detect the communities in
However, the edge with smallest value of edge-clustering a network. Extracted knowledge from network was used to
coefficient is first removed. The edge-clustering coefficient obtain the optimal solutions and search strategy. Knowl-
of an edge euv is defined as the ratio between the number of edge has been updated in each progression depending on
common neighbors of node u and v to the minimum degree the present status of the network. A glance of comparative
of node u and v [15]. analysis between the single objective methods and our
To overcome the existing mutation techniques used to algorithm is present in Table 1.
detect the community structures in the networks, a genetic Gupta et al. [25] proposed a parallel execution of
algorithm based on local search was proposed by Jin et al. community detection algorithm, which uses variations of
[16]. In this work, local search strategy based on marginal the outstanding quantum-inspired evolutionary algorithm
(QIEA). QIEA algorithm is likewise described by the

123

Content courtesy of Springer Nature, terms of use apply. Rights reserved.


9652 Neural Computing and Applications (2020) 32:9649–9665

Table 1 Comparison of single objective function


Authors Year Fitness function Representation Mutation Crossover OC Distributed
computing

Jin et al. [16] 2010 Modularity (Q) Locus based Neighbor Uniform N N
Chira and Gog [17] 2011 Community score (CS) Locus based Random Collaborative N N
Gong et al. [18] 2011 Modularity density (D) Label based Neighbor Two-way N N
Gong et al. [19] 2012 Modularity density (D) Label based Neighbor label Two-way N N
Jia et al. [20] 2012 Modularity (Q) Label based Rand/1 Binary N N
Shang et al. [21] 2013 Modularity (Q) Label based Random Two-way N N
Liu et al. [22] 2013 Modularity (Q) Locus based Neighbor Uniform N N
Shi et al. [23] 2013 Modularity (Q) Locus based Neighbor Uniform Y N
Zadeh et al. [24] 2015 Community score (CS) Locus based Neighbor Uniform N N
Proposed method (GA-BCD) 2019 Group similarity density (Z) Label based Random Multi-point N Y
OC, overlapping community; Q, modularity; CS, community score; D, modularity density; Z, group similarity density
A set of communities are said to be overlapping communities (OC) if they are sharing at least one common node. The communities Ci and Cj are
said to be overlapping with each other if Ci \ Cj 6¼ ;; where i 6¼ j
Community score (CS) is one of the quality measures used to quantify the community partition adopted by many authors in the literature. It is
defined as the sum of scores corresponding to each of the communities explored in the community partition. They have measured the score by
maximizing the in-degree of individual community. Modularity density (D) is the modified version of the standard modularity, which was
defined by Clauset et al. [12]. The prime objective of modularity density was to overcome the resolution limit problem existing in standard
modularity. We have proposed a different quality measure to quantify the community partition obtained through genetic algorithm. We name it
as group similarity density (Z). The details of this measure are discussed in Sect. 3

individual’s representation, population dynamics and the mapped to the actual community. As a result, quality of the
evaluating function. The algorithms are parallelized at community partition may be degraded. Guendouz et al.
thread as well as block level. It is able to achieve the [29] attempted to detect the communities with best solu-
maximum modularity value for the identified community tions without considering the structure of the system. Here,
structures, and computation is faster than serial version of author used the fireworks algorithm (FWA) that consists of
algorithms. Zhang et al. [26] proposed a faster and mixed new initialization and mutation methodologies. This
representation evolutionary algorithm which is able to property increases the algorithm’s speed of convergence.
extract the overlapping communities in the complex net- Chen et al. [30] adopted a novel similarity measure to
works. Mixed representation algorithms consist of two detect the communities for signed network. The primary
categories of nodes: one the nodes which are common in procedure of the algorithm depends on the reconstructed
more than one community and others are the nodes which neighbor sets, which are found by the similarity measure.
are found in a single community. Updating individual The continuous change in states of nodes is imitated by
methodologies are applied in two categories of nodes. The differential equation. The performance of the proposed
performance of the proposed algorithm is effective and algorithm is high. Ju et al. [31] proposed a membrane-
compared with six other overlapping community detection based multi-objective evolutionary algorithm to detect
algorithms. Guerrero et al. [27] proposed a new general communities in networks, where the whole population is
genetic algorithm (GGA?), which adopts a flexible and separated into different membranes. Here, the value of two
adaptive analysis of the network structure. Under the evaluating parameters named Ratio Cut and Kernel
guidance of modularity, efficient strategies for initializa- J-means are minimized. The algorithm is time efficient and
tion and search operators are included in GGA? algorithm diverse in nature. Some of the other meta-heuristic algo-
to evaluate the network structure. This approach is efficient rithms such as ant colony, bee colony and bat algorithm are
and scalable in nature. Wen et al. [28] proposed a maximal also applied in several papers for community detection
clique-based evolutionary algorithm for community algorithms. Rani et al. [32] have proposed a hybrid version
detection. They have considered a multi-objective function of bat algorithm which employs Tabu search strategy for
for the detection of overlapping communities. Maximal obtaining better solution for discrete optimization problems
clique graph provides the intrinsic property called over- such as community detection. In this work, the authors
lapping of nodes. Although maximal clique represents have shown that Tabu search strategy is the promising
community in a strong sense, but the nodes may not be approach to enhance the ability of local search for bat

123

Content courtesy of Springer Nature, terms of use apply. Rights reserved.


Neural Computing and Applications (2020) 32:9649–9665 9653

algorithm. Behera et al. [33] have presented a parallel maximum group similarity density of the network. It can be
version of community detection algorithm for small world modeled as:
network. In this work, the authors have considered the   k
1 Y
concept of six degrees of separation to explore the com- arg maxZ ¼ 1 þ pffiffiffi SDi ð6Þ
munities in a real-world complex network. Moreover, the k i¼1
 
algorithm has been implemented in Hadoop distributed Here, the factor 1 þ p1ffiffik is considered as the adjust-
platform with MapReduce framework. Ji et al. [34] have ment factor in order to maintain the divergence of nodes
considered the social communities based on user rating and into more than the expected community. Higher the num-
product reviews for improving recommendation accuracy. ber of communities, more will be the Z value. SDi is the
They have proposed the hybrid recommendation system by similarity density for the ith community. Similarity, density
including social communities identified from the social for each community can be obtained by adding the simi-
media reviews. In this paper, non-overlapping and over- larity value for all pairs of nodes within the community. It
lapping social communities have been detected using CNM can be obtained by the following equation:
and CoDA community detection algorithms, respectively. X
An extensive survey of different community detection SDi ¼ a;b2C
Sða; bÞ ð7Þ
i
algorithms has been presented by authors Azaouzi et al.
[35]. In their paper, the authors classified all the commu- where a and b are the entities in the ith community and S(a,
nity detection algorithms based on either distributed or b) is the similarity value between a and b. Problem of
centralized computation. They have also presented the community detection can be reduced to an optimization
algorithms that are suitable for both static and dynamic problem that seeks to maximize the value of Z in Eq. 6.
networks.

4 Similarity index
3 Problem statement
Users in each social community tend to have certain
In this section, we represent the network structure as fol- interest in common. Similarity between two entities in a
lows. Complex network can be represented in the form of a network can be measured based on various parameters.
graph G = (V, E), where V represents set of entities in the These parameters depend on either topological structure of
network and E represents set of relationships between the the network or attributes associated with the nodes. In this
entities and n = |V| and m = |E|. Nodes within a commu- study, similarity between each pair of nodes has been
nity tend to be more similar as compared to nodes outside evaluated based on network structure around the nodes. All
the community. the similarity measures considered in this paper are based
In community detection problem, the objective is to find on the acquaintance model in sociology [36]. The intuition
out k subgraphs or a cover of k communities which are behind this model is that more the number of common
represented as follows: friends, higher the chances for them to be familiar with
each other. Five different similarity measures have been
C ¼ fC1 ; C2 ; C3 . . .; Ck g and jC j ¼ k ð4Þ
considered for evaluating the degree of similarity between
and the entities. These are defined as follows:
[
k [
k
V¼ Vi and E ¼ Ei þ e ð5Þ 4.1 Jaccard index
i¼1 i¼1
Paul Jaccard proposed the statistic instance for measuring
where Vi and Ei are the set of vertices and edges in the ith
the similarity between the set of sample datasets [37].
community and e is the set of inter-community edges. The
Jaccard similarity between nodes a and b is defined as the
detected cover of C is said to have disjoint communities, if
ratio of common neighbors of a and b to the total neighbors
Ci \ Cj ¼ ;; 8ði; jÞ 2 ½1; k, otherwise it is said to have of a and b. It can be defined as:
overlapping communities.
In this study, we represent community partition in the jNbðaÞ \ NbðbÞj
Jaccardða; bÞ ¼ ð8Þ
form of membership vector which is a numeric vector. jNbðaÞ [ NbðbÞj
Each index in the vector represents an entity in the net- where Nb(a) and Nb(b) represent the set of neighboring
work, and each element represents its community mem- nodes of a and b, respectively. The value of Jaccard sim-
bership in the partitioned network. The objective is to find ilarity is normalized between 0 and 1.
out the membership vector of the network that represents

123

Content courtesy of Springer Nature, terms of use apply. Rights reserved.


9654 Neural Computing and Applications (2020) 32:9649–9665

4.2 Simpson index degree. In this paper, vector of neighboring nodes is con-
sidered as the sample vector. It is presented as:
Simpson similarity index is defined as the ratio of common jNbðaÞ \ NbðbÞj
neighbor to the minimum degree. Unlike Jaccard similarity Cosineða; bÞ ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ð11Þ
jNbðaÞj  jNbðbÞj
index, Simpson index is normalized to the minimum degree
of nodes. It is similar to the topological overlap coefficient
as pointed out by [38]. It is defined as: 4.5 Sorenson index
jNbðaÞ \ NbðbÞj
Simpsonða; bÞ ¼ ð9Þ Sorenson proposed a similarity measure similar to Jaccard
minfjNbðaÞj; jNbðbÞjg
index to measure the similarity between the species [37]. It
has been widely used in social network analysis. It is
4.3 Geometric index defined as the ratio of twice the number of common
neighbor to the sum of degree of the nodes. It is expressed
Geometric similarity index can be calculated by dividing as follows:
the square of number of common neighbors to the product
2  jNbðaÞ \ NbðbÞj
of degree of both the nodes [37]. It is expressed as: Sorensonða; bÞ ¼ ð12Þ
jNbðaÞj þ jNbðbÞj
2
jNbðaÞ \ NbðbÞj
Geometricða; bÞ ¼ ð10Þ
jNbðaÞj  jNbðbÞj
5 Proposed methodology: GA-based
4.4 Cosine index community detection

Salton proposed the cosine similarity index which has been The performance of genetic algorithm mainly depends on
widely used in a number of applications [39]. Mostly, they the operators and the fitness function associated with the
are helpful in finding the similarity between citation net- problem. The proposed genetic algorithm-based commu-
works. It is defined as the cosine angle between two sample nity detection (GA-BCD) has been presented in
vectors. It can be expressed as the ratio of number of Algorithm1.
common neighbors to the square root of the product of

123

Content courtesy of Springer Nature, terms of use apply. Rights reserved.


Neural Computing and Applications (2020) 32:9649–9665 9655

5.1 Genetic representation in each dataset. For each of the k value, initially 100
number of candidate solutions have been generated
Encoding scheme for genetic algorithm plays a vital role in randomly.
finding the efficiency of the algorithm. In this paper,
numerical encoding scheme is used for each individual 5.3 Proposed fitness function
chromosome. Each chromosome in the population is rep-
resented as a vector of n elements, where n is the number of To validate the proposed algorithm, population size has
users given in the network. Each position i in the vector been considered as 100, i.e., one hundred number of
corresponds to a user in the network. The value at the ith chromosomes have been randomly generated at the first
index of the vector represents the identity of the commu- generation. As they are randomly generated, most of them
nity, where the ith user belongs to. Each gene can assume a may not be able to provide feasible solution for community
value in the range 1 to k, where k is the number of com- detection problem. Therefore, suitable candidate solutions
munities in the network. In this paper, community analysis are to be chosen based on the fitness value which has been
for the network has been performed by predefining the determined by the fitness function. The proposed fitness
number of communities in the network. Figure 1 presents function is defined as:
the representation for a candidate solution. In this figure,   k
1 Y
C1, C2, C3 are the three detected communities. So the range arg maxZ ¼ 1 þ pffiffiffi SDi ð13Þ
of the allele for candidate solution is 1–3. k i¼1

where SDi is the similarity density for the ith community,


5.2 Initial population generation i.e., Ci . It is described in Eq. 7.
Fitness value of individual solution seems to be
The convergence of genetic algorithm is often sensitive to increased with the number of generations. Figure 2 repre-
the quality of initial population. More diversity and high sents the change in fitness value with the generation
average fitness value of the population converge to better number for five different datasets. Fitness value changes
partition of communities. In most of the real-world net- rapidly after 40–50 generations for most of the dataset. In
works, the number of communities to be formed is known. this study, experiment has been performed until the graph
In this experiment, we have predefined the number of is partitioned into ten numbers of communities. It is
communities in order to reduce the search space. Chro- observed from Fig. 2 that change in fitness value is more
mosomes are generated randomly at each generation. At prominent if the graph is to be partitioned into more
the same time, in-feasible solutions are prevented by safe number of communities. On the other hand, change in fit-
initialization where two nodes are kept in one community ness value is minimal, when the graph is partitioned into
only if they are connected. The disconnected nodes are less number of communities. The parameters used in GA
checked at the initialization process of individual chro- are shown in Table 2.
mosomes. The value for each gene is in the range of 1 to k,
where k is the number of predefined communities. The 5.4 Selection
performance of the algorithm is evaluated for different
values of k for each of the datasets. The k value has been Selection is one of the crucial phases in genetic algorithm
considered in the range of 2–10. For example, as shown in which helps in global search for the best solution of the
Fig. 1, a candidate solution for a network having ten problem. Before the selection process starts, each individ-
number of nodes is presented. At the same time, number of ual is assigned a rank according to their fitness value. The
communities to be partitioned, i.e., k, is considered to be 3. fitness value for each individual is calculated based on the
The value present in the candidate solution is in the range objective function as defined in this problem. In this work,
of 1–3. Similar to the example shown in Fig. 1, the chro- l ? k selection strategy has been adopted for selecting
mosome sets are generated for each value of k. The length chromosomes for further generation. l ? k selection
of the candidate solution depends on the number of nodes strategy is one of the preferred methods for solving evo-
lutionary algorithms. As per this strategy, l number of best
chromosomes from the current population is to be kept for
further processing. k is the number of offsprings which are
generated from the fittest parents through genetic operators
such as crossover and mutation. At the same time, k ? l
number of chromosomes are considered for solution in the
next generation.
Fig. 1 Genetic representation for candidate solution

123

Content courtesy of Springer Nature, terms of use apply. Rights reserved.


9656 Neural Computing and Applications (2020) 32:9649–9665

Fig. 2 Change in fitness value in different number of generations

Table 2 Parameters used in genetic algorithm In the proposed work, each community has been iden-
tified by a numerical identifier. Each chromosome is an
Parameter Value Description
array of numerical values which represent the community
Clength No. of nodes in the dataset Chromosome length membership or identifier for individual entity. The index of
Psize 100 Population size the array represents the node identity in the network. The
Pm 0.3 Mutation probability same community identifier in different chromosomes may
Pc 0.6 Crossover probability correspond to different communities, or different commu-
gen 100 Number of generation nity identifiers in different chromosomes may correspond
to the same community. For example, (5,5,5,1,1,1) and
(1,1,1,5,5,5) represent the same network partition, but the
community whose identity is 5 in the first chromosome
5.5 Crossover corresponds to the same community with identifier 1 in the
second chromosome. If the crossover point happens to be at
Crossover is the genetic operator used to perform operation 1, the resulting chromosome will be (5,5,5,5,5,5) and
on two chromosomes at a time. It combines the features of (1,1,1,1,1,1). Still both the chromosomes will have the
both the chromosomes to create a new chromosome for same community structure. To avoid the above occur-
obtaining an optimal solution. The number of chromo- rences, multi-point crossover has been adopted for gener-
somes that undergoes with crossover operation is deter- ating chromosomes. In multi-point crossover, community
mined by a parameter known as crossover probability. This identities will be exchanged between the chromosomes at
operation is associated with a parameter known as cross- multiple points. Position of the crossover is chosen ran-
over probability. In our experiment, crossover probability domly at each iteration.
is chosen as 0.6 which means 60% of each population is to
be participated for crossover at each generation.

123

Content courtesy of Springer Nature, terms of use apply. Rights reserved.


Neural Computing and Applications (2020) 32:9649–9665 9657

5.6 Mutation 0
X 1
Sintra ¼ ; Ci 2 C ð16Þ
S
a;b2C ab
i
Genetic algorithm contributes a powerful search capability.
Mutation operator attains its local search function by Here, a and b are the nodes in the network and C is the
shifting a node from one community to another commu- set of communities identified through the proposed genetic
nity, and crossover operator attains its global search algorithm. The objective is to maximize the quality score
function by combining and splitting the communities. In for better partition of the network. Higher the quality score,
this work, multi-point crossover strategy achieves its better the quality of detected community. Network parti-
objective. The traditional mutation operators fail to achieve tion with highest quality score is considered to be an
the target function by shifting a node between the com- optimal solution for the community detection problem
munities, because these operators may combine or spilt the (Table 3).
communities, which is strongly undesirable.
In the proposed approach, arg maxZ ¼
 Q 7 Experimental analysis
k
1 þ p1ffiffik i¼1 SDi is the objective function and SDi ¼
P Heuristic-based algorithms are often suitable for the com-
a;b2C Sða; bÞ is the sum of similarity measure of each
i
plex problems, where computation is highly expensive than
pair of nodes of a community. This similarity density
expectation. The proposed GA-BCD algorithm has been
function plays the role of local function from the local
performed in the distributed platform on five real-world
point of perspective of the node. The value of Z increases
network datasets. The details of dataset description,
with the local function SDi without changing the label of
experimental setup and analysis are provided in following
the other nodes. By this way, the local search strategy
subsections.
works and shifts a node between the communities which
resulted in better solutions.
7.1 Dataset description

In this paper, the following datasets are used for measuring


6 Proposed quality score the performance metrics. Details of these datasets are listed
in Table 4. Network dataset can be depicted as graph
Community has been detected based on each similarity
structure, where each node represents an entity and links
matrix for five real-world datasets. At each generation, the
between the nodes represent the relationships between the
proposed genetic algorithm optimizes the set of 100 com-
entities. Each of the following datasets is collected in the
munity partitions based on fitness value. Communities
form of edge list.
obtained at the 100th generation are considered to be
possible list of optimal candidate solutions. Individual with 1. Blogs [40]: This dataset is collected from the context of
the highest fitness value at the 100th generation is selected 2004 US-election, where nodes represent the blogs and
as the final partition of the network. In this study, an effi- edge represents the hyperlinks between them. One blog
cient measure known as quality score has been proposed to may have page with hyperlink that points to other
evaluate the final communities obtained through different blogs, while reverse link may not exist in the dataset.
similarity indices. The intuition behind quality score is to 2. Chicago transport [41]: In this dataset, road transporta-
find out the ratio of inter-community dissimilarity to the tion of Chicago is represented as network structure,
intra-community dissimilarity. It is expressed as: where node represents the place in Chicago and edge
0 represents the connections between the places.
Sinter
Quality Score ðQSÞ ¼ 0 ð14Þ 3. Dolphin [42]: This dataset is created by observing few
Sintra groups of dolphin communities during 1994 to 2001
0 0
where Sinter and Sintra are dissimilarity indices between the living in New Zealand, where dolphins and frequent
nodes in different communities and nodes in the same interactions between dolphins are represented as nodes
0 0 and edges, respectively. It is a kind of social network
communities, respectively. Sinter and Sintra are mathemati-
having directed edges.
cally expressed as follows:
4. Facebook [43]: Facebook is the most popular social
X 1
0
Sinter ¼ ; Ci ; Cj 2 C and i 6¼ j ð15Þ network platform for social interaction. A small part of
S
a2C ;b2C ab this network has been collected from website https://
i j
snap.stanford.edu/. The nodes and edges represent the
and social user and their relationship, respectively.

123

Content courtesy of Springer Nature, terms of use apply. Rights reserved.


9658 Neural Computing and Applications (2020) 32:9649–9665

Table 3 Quality score of


SI No. of comm.
community partition of different
datasets for different similarity 2 3 4 5 6 7 8 9 10
values
Blogs
Jaccard 0.316 0.525 0.665 0.796 0.860 0.914 0.989 0.998 1.051
Simpson 0.516 0.860 1.205 1.443 1.642 1.777 1.873 2.005 2.125
Geometric 0.273 0.454 0.602 0.674 0.766 0.811 0.863 0.879 0.908
Cosine 0.420 0.690 0.940 1.085 1.203 1.304 1.364 1.442 1.502
Sorenson 0.393 0.638 0.836 0.991 1.091 1.200 1.259 1.300 1.344
Chicago transport
Jaccard 0.636 1.147 1.567 1.884 2.184 2.488 2.675 2.856 3.061
Simpson 0.626 1.145 1.544 1.903 2.172 2.465 2.688 2.845 3.058
Geometric 0.637 1.147 1.564 1.892 2.157 2.483 2.643 2.875 3.067
Cosine 0.631 1.163 1.548 1.918 2.201 2.441 2.660 2.856 3.018
Sorenson 0.635 1.152 1.548 1.894 2.240 2.472 2.663 2.912 3.043
Dolphin
Jaccard 0.097 0.218 0.394 0.543 0.485 0.665 0.788 0.854 0.845
Simpson 0.175 0.467 0.762 0.923 1.064 1.253 1.401 1.521 1.479
Geometric 0.076 0.217 0.366 0.385 0.435 0.513 0.616 0.705 0.706
Cosine 0.127 0.386 0.608 0.732 1.003 1.067 1.080 1.227 1.440
Sorenson 0.132 0.288 0.560 0.656 0.897 0.954 1.110 1.177 1.116
Facebook
Jaccard 0.002 0.002 0.010 0.010 0.011 0.016 0.019 0.020 0.018
Simpson 0.007 0.010 0.014 0.019 0.023 0.027 0.035 0.032 0.037
Geometric 0.002 0.002 0.010 0.010 0.013 0.015 0.020 0.020 0.020
Cosine 0.003 0.003 0.011 0.011 0.015 0.016 0.023 0.023 0.020
Sorenson 0.002 0.002 0.011 0.011 0.012 0.017 0.021 0.021 0.020
Netscience
Jaccard 0.396 0.697 0.853 1.036 1.132 1.248 1.318 1.377 1.430
Simpson 0.537 0.970 1.292 1.602 1.809 2.008 2.161 2.286 2.400
Geometric 0.384 0.646 0.815 0.943 1.055 1.147 1.217 1.245 1.325
Cosine 0478 0.845 1.130 1.337 1.514 1.620 1.744 1.826 1.934
Sorenson 0.463 0.834 1.085 1.281 1.444 1.574 1.652 1.768 1.792

Table 4 Datasets used


S. no Datasets Node Edges Clustering coefficient

1 Blog [40] 1224 2615 0.226


2 Chicago transport [41] 1467 1298 0.169
3 Dolphin [42] 62 159 0.309
4 Facebook [43] 4039 88,234 0.6055
5 Netscience [44] 1589 2742 0.427

5. Net science [44]: This dataset reveals the co-authorship 7.2 Experimental setup for distributed
network between the scientists who are working on computing
network theory. The nodes correspond to scientist and
the edges exist between the nodes if the corresponding Experiments are performed on a cluster of ten number of
scientists have jointly published a paper. computational nodes, configured as master–slave archi-
tecture. One of them acts as master, and the rest of the
computing nodes act as slaves. Each computing node is
configured with i7 processor and 3.4 GHz clock speed.

123

Content courtesy of Springer Nature, terms of use apply. Rights reserved.


Neural Computing and Applications (2020) 32:9649–9665 9659

They all have symmetric configuration with 1 TB hard disk when no further increase in modularity is possible. The
and 20 GB of RAM. The job is submitted at the master whole process is similar to agglomerative clustering used
node, and data are then distributed across the cluster. Each in data mining.
of the computing nodes performs operation on different
parts of the data independently. The results from each of 7.4 Parameter evaluation
the computational nodes are accumulated at the master
node for further analysis of the community structure. All Evaluating community detection algorithms is a challeng-
the computational nodes process the data in a parallel ing task as the size and number of communities are often
manner. Complex topological structure is analyzed, and unknown in a real-world network. However, it is possible
neighbors of each of the entities are revealed. Experiments to evaluate the performance, if ground truth statistics about
are performed for partitioning the network into 2 to 10 the community is available beforehand. The following
number of communities. Similarity between all pairs of performance parameters have been considered to evaluate
nodes has been evaluated as per the measures given in the performance of GA-BCD algorithm.
Sect. 4 for each of the datasets. Using the similarity values,
Normalized mutual information (NMI) It is the measure
fitness values of the chromosomes are calculated.
used to evaluate the similarity between two different
community partitions of the same dataset. This is similar to
7.3 Comparative analysis
the clustering evaluation parameter used in information
theory. Danon et al. [45] used it for the first time to mea-
The proposed algorithm GA-BCD is compared with the
sure the performance of community detection algorithms. It
following community detection algorithms on the five real-
is measured with the help of confusion matrix C, where the
world datasets listed in Table 4.
row i corresponds to the true communities and the column
Girvan Newman [5] This is the most popular community j corresponds to the found communities. Each element Cij
detection algorithm, where the community is detected by corresponds to the number of nodes of community i that are
optimizing the modularity value of the partition. Here, found in community j. It can be mathematically expressed
communities are identified by eliminating the edges with as follows:
the highest edge-betweenness value in an iterative manner. PjT j PjF j  
C C
This is similar to divisive hierarchical clustering algorithm 2 i¼1 j¼1 Cij log CiijCj
NMIðT; FÞ ¼ P PjF j   ð17Þ
used in data mining. jT j Ci Cj
i¼1 Ci log C þ j¼1 Cj log C
Walk-Trap [6] The intuition behind the Walk-Trap com-
munity detection algorithm is that when random walker where T and F are the actual and found community parti-
start visiting nodes in the network, it is more likely to get tions, respectively. jT j and jF j are the number of commu-
trapped in a dense region than the sparse one. Communities nities present in the actual and found partitions,
can be explored by analyzing the patterns of node-traversal respectively. Ci and Cj are the sum of all elements in row
by a random walker in the network. It is found to have i and column j of confusion matrix C, respectively.
better time complexity as compared to others. Modularity Modularity is the most widely used perfor-
Label propagation [7] Label propagation algorithm is mance parameter for community detection algorithm,
found to have near-linear time complexity in detecting which quantifies the community partition obtained in a
community for large-scale network. In this algorithm, ini- large-scale network. It is the difference between intra-
tially, each node is assigned with a unique label and the community links to the inter-community links in the net-
label of the node is getting updated over the processing work partition. The mathematical definition of this
time. The updating rule for a node is based the label of its parameter is described in Sect. 2.
neighboring nodes. Although it is faster, unique community Adjust Rand Index (ARI) Rand Index is one of the impor-
partition is hard to obtain over different runs. tant performance parameters used to measure, how much
Louvain [8] The Louvain community detection algorithm each pair of nodes is arranged same by two different
is similar to the Girvan Newman algorithm, where the clustering solutions. In other words, it is the ratio of the
modularity value is to be optimized over the different number of pairs of nodes belonging to same cluster or
partitions. However, the process of the Louvain algorithm belonging to different cluster in both the partitions with to
is in the reverse order of Girvan Newman algorithm. In this total number of pairs of nodes [46]. ARI can be mathe-
algorithm, initially, every node is treated as an individual matically represented as:
community. Nodes are merged with the neighboring nodes
to maximize the modularity value. This process is stopped

123

Content courtesy of Springer Nature, terms of use apply. Rights reserved.


9660 Neural Computing and Applications (2020) 32:9649–9665

N ðP þ SÞ  ½ðP þ QÞðP þ RÞ þ ðR þ SÞðQ þ SÞ between zero and one. Figure 4 represents the comparative
ARI ¼ analysis of algorithms based on the modularity value of the
N 2  ½ðP þ QÞðP þ RÞ þ ðR þ SÞðQ þ SÞ
ð18Þ identified community structure. It is observed that the
Walk-Trap algorithm and label propagation algorithm
where P represents the number of pairs of nodes in the found to have better modularity value in Dolphin and
same cluster according to both the solutions. Q represents NetScience network, respectively. ARI is another perfor-
the number of pairs of nodes in the same cluster according mance measure, which quantifies the similarity of node
to solution1 and these many number of pairs of nodes in orientation in two different solutions. Comparative analysis
different clusters according to solution2. R represents the of algorithms based on ARI value is presented in Fig. 5. It
number of pairs of nodes in different clusters according to is observed that ARI value was found to be less for large-
solution1 and these many number of pairs of nodes in same sized datasets such as Facebook. The GA-BCD algorithm
cluster according to solution2. S represents the number of has a better ARI value for Facebook and NetScience net-
pairs of nodes in different clusters according to both the work. The Girvan Newman algorithm has better ARI value
solutions. N represents the total number of pairs of nodes. in Blogs and Dolphin network. The label propagation
algorithm has shown better ARI value in the Chicago
Execution time Execution time required to detect the
transport network. It is desirable to identify communities in
communities in a large and heterogeneous network is a
a reasonable amount of time in large-scale and complex
challenging task. It is considered to be an essential per-
network. In this paper, a comparison based on execution
formance parameter to compare the algorithms. In this
time is considered to evaluate the performance of algo-
work, a comparison graph has been given for five different
rithms and is presented in Fig. 6. Genetic algorithm is
algorithms based on this crucial parameter.
always expected to have better execution time as compared
to other traditional algorithms. From the graph, it is con-
7.5 Results and discussion
cluded that GA-BCD has shown better performance result
in terms of execution time, especially for large datasets
The performance of GA-BCD is compared with other
such as Facebook and NetScience.
standard community detection algorithms that are descri-
The different structures of partition have been found for
bed in Sect. 7.3. Figure 3 represents the comparative
all the datasets and are further analyzed based on quality
analysis of different community detection algorithms based
score which is presented in Sect. 6. Table 3 presents the
on the NMI value for the community partition. It is
quality score for community partition for different datasets
observed that the proposed GA-BCD algorithm performs
after completion of the 100th epoch in genetic algorithm.
well for large-sized datasets such as Facebook, NetScience
Multiple set of experiments have been performed by par-
and Blogs. The Girvan Newman algorithm was found to
titioning the network into 2 to 10 number of communities.
have better NMI in Dolphin and Chicago transport datasets.
The number of communities to be partitioned is repre-
Modularity value is one of the suitable measures to quan-
sented by each column in Table 3. Quality score quantifies
tify the community structure. Its value lies in the range
the quality of community partition in the network. It is the

Fig. 3 Comparative analysis 1


Girvan Newman Louvain GA-BCD
based on NMI Walk-Trap Label Propagation
Normalized Mutual Information (NMI)

0.8

0.6

0.4

0.2

0
Blogs Chicago Tranport Dolphin Facebook NetScience
Real-world Network Datasets

123

Content courtesy of Springer Nature, terms of use apply. Rights reserved.


Neural Computing and Applications (2020) 32:9649–9665 9661

Fig. 4 Comparative analysis 1


Girvan Newman Louvain GA-BCD
based on modularity Walk-Trap Label Propagation

0.8

Modularity
0.6

0.4

0.2

0
Blogs Chicago Transport Dolphin Facebook NetScience
Real-world Network Dataset

Fig. 5 Comparative analysis Girvan Newman


based on ARI Walk-Trap
0.8
Label Propagation
Louvain
0.7
GA-BCD
Adjusted Rand Index (ARI)

0.6

0.5

0.4

0.3

0.2

0.1

0
Blogs Chicago Transport Dolphin Facebook NetScience
Real-world Network Datasets

measure of difference between intra-communities link- partitions which is presented in the form of boxplot in
density and the inter-community link-density in the net- Fig. 7. From box plot in Fig. 7, it is observed that mean of
work. It depends on the similarity between all pairs of quality score for network partition increases with the
nodes which is based on similarity index discussed in number communities. Performance based on different
Sect. 4. More is the quality score, better is the community similarity indices is presented in the form of boxplot in
partition. Fig. 8. Although the minimum quality score is approxi-
It is observed from Table 3 that better partition for mately equal for all similarity indices, mean quality score
Blogs dataset has been obtained for Simpson similarity is better in the case of Simpson similarity index. Each
index and the quality score increases when the network is similarity index has its own significance in the network
partitioned into more number of communities. For Chicago structure. Their values between nodes depend on the
transport dataset, Sorenson index provides better commu- complexity of the network. Any particular similarity index
nity partition using genetic algorithm. Simpson similarity may not provide better partition in all the datasets.
index is found to be suitable for the identification of Identifying community structure using different simi-
communities in Dolphin, Facebook and Netscience data- larity values may help in deeper analysis of network
sets. Performance in terms of quality score for proposed structure. In this study, P value and mean difference of
approach is based on different number of community performance have been evaluated based on the similarity

123

Content courtesy of Springer Nature, terms of use apply. Rights reserved.


9662 Neural Computing and Applications (2020) 32:9649–9665

Fig. 6 Comparative analysis 250


Girvan Newman
based on execution time
Walk-Trap

Label Propagation

200 Louvain

GA-BCD

Execution Time (in milisec)


150

100

50

0
Blogs Chicago Transport Dolphin Facebook NetScience
Real-World Network Datasets

Fig. 7 Performance of GA-


BCD with different number of
community partitions

Fig. 8 Performance of GA-


BCD with different similarity
indices

index. P value for similarity index is provided in Table 5. P value and mean deviation for the result obtained in dif-
P value less than 0.001 indicates that every similarity index ferent number of communities is shown in Tables 7 and 8,
is significantly different from others (Table 6). respectively. From Table 7, it is observed that the model
It can be concluded that application of different simi- built for each of the community partitions is significantly
larity measures for community detection is justified. different from each other.

123

Content courtesy of Springer Nature, terms of use apply. Rights reserved.


Neural Computing and Applications (2020) 32:9649–9665 9663

Table 5 P value of performance based on similarity index 8 Threats to validity


Jaccard Simpson Geometric Cosine Sorenson
In this work, a genetic algorithm-based community detec-
Jaccard – 0.00 0.00 0.00 0.00 tion has been presented, where the nodes are mapped into
Simpson 0.00 – 0.00 0.00 0.00 fixed number of communities. This could be one of the
Geometric 0.00 0.00 – 0.00 0.00 limitations, since prior knowledge of number of commu-
Cosine 0.00 0.00 0.00 – 0.00 nities is required at the encoding phase of the genetic
Sorenson 0.00 0.00 0.00 0.00 – algorithm. The proposed algorithm can be suitable for
P value B 0.001 indicates the similarity indices are significantly processing directed graph but may fail to process weighted
different from each other network, where similarity between nodes are dependent on
the strength of the relationship between the nodes. Another
limitation is that it may fail to detect the overlapping
communities in the network.
Table 6 Mean Difference for similarity Index
Jaccard Simpson Geometric Cosine Sorenson
9 Conclusion and future work
Jaccard 0.000 - 0.358 0.055 - 0.189 - 0.143
Simpson 0.358 0.000 0.413 0.169 0.215
Community detection is observed to be NP-complete
Geometric - 0.055 - 0.413 0.000 - 0.245 - 0.198
problem, especially when processing the large-scale net-
Cosine 0.189 - 0.169 0.245 0.000 0.047
work. Heuristic-based algorithms are proven to be efficient
Sorenson 0.143 - 0.215 0.198 - 0.047 0.000
means to analyze the subgroup in large-scale complex
network. To accommodate this constraint, in this

Table 7 P value of performance


Com2 Com3 Com4 Com5 Com6 Com7 Com8 Com9 Com10
based on the number of
communities Com2 – 0 0 0 0 0 0 0 0
Com3 0 – 0 0 0 0 0 0 0
Com4 0 0 – 0 0 0 0 0 0
Com5 0 0 0 – 0 0 0 0 0
Com6 0 0 0 0 – 0 0 0 0
Com7 0 0 0 0 0 – 0 0 0
Com8 0 0 0 0 0 0 – 0 0
Com9 0 0 0 0 0 0 0 – 0
Com10 0 0 0 0 0 0 0 0 –
P value B 0.001 indicates that the experiments for different number of communities are significantly
different from each other

Table 8 Mean deviation for


Com2 Com3 Com4 Com5 Com6 Com7 Com8 Com9 Com10
different number of
communities Com2 0.00 - 0.26 - 0.48 - 0.64 - 0.78 - 0.90 - 1.00 - 1.08 - 1.15
Com3 0.26 0.00 - 0.22 - 0.38 - 0.52 - 0.64 - 0.73 - 0.82 - 0.89
Com4 0.48 0.22 0.00 - 0.16 - 0.30 - 0.42 - 0.52 - 0.61 - 0.67
Com5 0.64 0.38 0.16 0.00 - 0.14 - 0.26 - 0.36 - 0.44 - 0.51
Com6 0.78 0.52 0.30 0.14 0.00 - 0.12 - 0.22 - 0.31 - 0.37
Com7 0.90 0.64 0.42 0.26 0.12 0.00 - 0.10 - 0.18 - 0.25
Com8 1.00 0.73 0.52 0.36 0.22 0.10 0.00 - 0.09 - 0.16
Com9 1.08 0.82 0.61 0.44 0.31 0.18 0.09 0.00 - 0.07
Com10 1.15 0.89 0.67 0.51 0.37 0.25 0.16 0.07 0.00

123

Content courtesy of Springer Nature, terms of use apply. Rights reserved.


9664 Neural Computing and Applications (2020) 32:9649–9665

manuscript, we have adopted genetic algorithm for com- 3. Newman ME (2004) Analysis of weighted networks. Phys Rev E
munity detection in real-world network. In support of this, 70(5):056131. https://doi.org/10.1103/PhysRevE.70.056131
4. Newman ME (2013) Spectral methods for community detection
we have proposed an objective function based on the and graph partitioning. Phys Rev E 88(4):042822. https://doi.org/
similarity density for the community partition. The group 10.1103/PhysRevE.88.042822
similarity density has been measured by considering dif- 5. Girvan M, Newman ME (2002) Community structure in social
ferent similarity indices, which is calculated from topo- and biological networks. Proc Natl Acad Sci 99(12):7821–7826.
https://doi.org/10.1073/pnas.122653799
logical information in the network. Unlike modularity, it 6. Pons P, Latapy M (2005) Computing communities in large net-
does not suffer from resolution limit problem. We have works using random walks. In: International symposium on
adopted a distributed framework, where data have been computer and information sciences. Springer, Berlin,
processed in multiple computational nodes for faster exe- pp 284–293. https://doi.org/10.1007/11569596_31
7. Raghavan UN, Albert R, Kumara S (2007) Near linear time
cution. Experiments have been performed on ten different algorithm to detect community structures in large-scale networks.
computational nodes. Networks have been partitioned into Phys Rev E 76(3):036106. https://doi.org/10.1103/PhysRevE.76.
2 to 10 number of communities. By keeping the number of 036106
communities fixed, search space of the population reduced 8. De Meo P, Ferrara E, Fiumara G, Provetti A (2011) Generalized
louvain method for community detection in large networks. In:
drastically which leads to faster convergence of optimal 2011 11th international conference on intelligent systems design
solution. Many evolutionary algorithms for community and applications (ISDA). IEEE, pp 88–93. https://doi.org/10.
detection available in the literature have been considered as 1109/isda.2011.6121636
locus-based representations, whereas we have considered 9. Newman ME (2006) Modularity and community structure in
networks. Proc Natl Acad Sci 103(23):8577–8582. https://doi.
label-based representation. Label-based representation org/10.1073/pnas.0601602103
provides better exploration of solution space after applying 10. Newman ME (2004) Fast algorithm for detecting community
the genetic operators as compared to locus-based repre- structure in networks. Phys Rev E 69(6):066133. https://doi.org/
sentation. Experimental analysis shows that GA-BCD 10.1103/PhysRevE.69.066133
11. Qi GJ, Aggarwal CC, Huang T (2012) Community detection with
outperforms as compared to other standard community edge content in social media networks. In: 2012 IEEE 28th
detection algorithms in identifying the hidden communities international conference on data engineering (ICDE). IEEE,
in the network. pp 534–545. https://doi.org/10.1109/icde.2012.77
Community detection on real-time streaming data is 12. Clauset A, Newman ME, Moore C (2004) Finding community
structure in very large networks. Phys Rev E 70(6):066111.
found to be a challenging research problem. The work can https://doi.org/10.1103/PhysRevE.70.066111
be extended to process dynamic network, where the topo- 13. Pizzuti C (2008) Ga-net: a genetic algorithm for community
logical structure changes over time. Some other heuristic detection in social networks. In: International conference on
approaches such as particle swarm optimization and ant parallel problem solving from nature. Springer, Berlin,
pp 1081–1090. https://doi.org/10.1007/978-3-540-87700-4_107
colony optimization can also be tested for exploring dis- 14. Radicchi F, Castellano C, Cecconi F, Loreto V, Parisi D (2004)
joint and overlapping communities. Defining and identifying communities in networks. Proc Natl
Acad Sci 101(9):2658–2663. https://doi.org/10.1073/pnas.
Acknowledgements This research work was supported by Fund for 0400054101
Improvement of S&T Infrastructure in Universities and Higher 15. Zhang P, Wang J, Li X, Li M, Di Z, Fan Y (2008) Clustering
Educational Institutions (FIST) scheme with a Grant No. SR/FST/ coefficient and community structure of bipartite networks. Phys
ETI-359/2014 under Department of Science and Technology, Govt. of A 387(27):6869–6875. https://doi.org/10.1016/j.physa.2008.09.
India. The authors wish to express their gratitude and heartiest thanks 006
to the Department of Computer Science and Engineering, Indian 16. Jin D, He D, Liu D, Baquero C (2010) Genetic algorithm with
Institute of Technology (ISM), Dhanbad, India, for providing their local search for community mining in complex networks. In:
research support. 2010 22nd IEEE international conference on tools with artificial
intelligence (ICTAI), vol 1, pp 105–112. IEEE. https://doi.org/10.
1109/ictai.2010.23
Compliance with ethical standards 17. Chira C, Gog A (2011) Collaborative community detection in
complex networks. In: International conference on hybrid artifi-
Conflict of interest The authors do not have any conflict of interest. cial intelligence systems. Springer, Berlin, pp 380–387. https://
doi.org/10.1007/978-3-642-21219-2_48
18. Gong M, Fu B, Jiao L, Du H (2011) Memetic algorithm for
References community detection in networks. Phys Rev E 84(5):056101.
https://doi.org/10.1103/PhysRevE.84.056101
19. Gong M, Cai Q, Li Y, Ma J (2012) An improved memetic
1. Nussbaum R, Esfahanian AH, Tan PN (2013) Clustering social
algorithm for community detection in complex networks. In:
networks using distance-preserving subgraphs. In: The influence
2012 IEEE Congress on evolutionary computation (CEC). IEEE,
of technology on social network analysis and mining. Springer,
pp 1–8. https://doi.org/10.1109/cec.2012.6252971
Vienna, pp 331–349. https://doi.org/10.1007/978-3-7091-1346-
20. Jia G, Cai Z, Musolesi M, Wang Y, Tennant DA, Weber RJ,
2_14
Heath JK, He S (2012) Community detection in social and bio-
2. Fortunato S (2010) Community detection in graphs. Phys Rep
logical networks using differential evolution. In: Learning and
486(3–5):75–174. https://doi.org/10.1016/j.physrep.2009.11.002

123

Content courtesy of Springer Nature, terms of use apply. Rights reserved.


Neural Computing and Applications (2020) 32:9649–9665 9665

intelligent optimization. Springer, Berlin, pp 71–85. https://doi. 33. Behera R, Rath S, Misra S, Damaševičius R, Maskeliūnas R
org/10.1007/978-3-642-34413-8_6 (2017) Large scale community detection using a small world
21. Shang R, Bai J, Jiao L, Jin C (2013) Community detection based model. Appl Sci 7(11):1173. https://doi.org/10.3390/app7111173
on modularity and an improved genetic algorithm. Phys A 34. Ji Z, Pi H, Wei W, Xiong B, Woźniak M, Damasevicius R (2019)
392(5):1215–1231. https://doi.org/10.1016/j.physa.2012.11.003 Recommendation based on review texts and social communities:
22. Liu D, Jin D, Baquero C, He D, Yang B, Yu Q (2013) Genetic a hybrid model. IEEE Access 7:40416–40427. https://doi.org/10.
algorithm with a local search strategy for discovering commu- 1109/ACCESS.2019.2897586
nities in complex networks. Int J Comput Intell Syst 35. Azaouzi M, Rhouma D, Romdhane LB (2019) Community
6(2):354–369. https://doi.org/10.1080/18756891.2013.773175 detection in large-scale social networks: state-of-the-art and
23. Shi C, Cai Y, Fu D, Dong Y, Wu B (2013) A link clustering based future directions. Soc Netw Anal Min 9(1):23. https://doi.org/10.
overlapping community detection algorithm. Data Knowl Eng 1007/s13278-019-0566-x
87:394–404. https://doi.org/10.1016/j.datak.2013.05.004 36. Moscovici S (1988) Notes towards a description of social rep-
24. Zadeh PM, Kobti Z (2015) A multi-population cultural algorithm resentations. Eur J Soc Psychol 18(3):211–250. https://doi.org/10.
for community detection in social networks. Procedia Comput Sci 1002/ejsp.2420180303
52:342–349. https://doi.org/10.1016/j.procs.2015.05.105 37. Pan Y, Li DH, Liu JG, Liang JZ (2010) Detecting community
25. Gupta S, Mittal S, Gupta T, Singhal I, Khatri B, Gupta AK, structure in complex networks via node similarity. Phys A
Kumar N (2017) Parallel quantum-inspired evolutionary algo- 389(14):2849–2857. https://doi.org/10.1016/j.physa.2010.03.006
rithms for community detection in social networks. Appl Soft 38. Hunter PR, Gaston MA (1988) Numerical index of the discrim-
Comput 61:331–353. https://doi.org/10.1016/j.asoc.2017.07.035 inatory ability of typing systems: an application of Simpson’s
26. Zhang L, Pan H, Su Y, Zhang X, Niu Y (2017) A mixed repre- index of diversity. J Clin Microbiol 26(11):2465–2466
sentation-based multiobjective evolutionary algorithm for over- 39. Hamers L (1989) Similarity measures in scientometric research:
lapping community detection. IEEE Trans Cybern 47(9): the Jaccard index versus Salton’s cosine formula. Inf Process
2703–2716. https://doi.org/10.1109/TCYB.2017.2711038 Manag 25(3):315–318
27. Guerrero M, Montoya FG, Baños R, Alcayde A, Gil C (2017) 40. Blogs network dataset-KONECT, October 2016. https://doi.org/
Adaptive community detection in complex networks using 10.1145/1134271.1134277
genetic algorithms. Neurocomputing 266:101–113. https://doi. 41. Chicago network dataset-KONECT, October 2016. http://konect.
org/10.1016/j.neucom.2017.05.029 uni-koblenz.de/networks/tntp-ChicagoRegiona
28. Wen X, Chen WN, Lin Y, Gu T, Zhang H, Li Y, Yin Y, Zhang J 42. Lusseau D, Schneider K, Boisseau OJ, Haase P, Slooten E,
(2017) A maximal clique based multiobjective evolutionary Dawson SM (2003) The bottlenose dolphin community of
algorithm for overlapping community detection. IEEE Trans Evol Doubtful Sound features a large proportion of long-lasting asso-
Comput 21(3):363–377. https://doi.org/10.1109/TEVC.2016. ciations. Behav Ecol Sociobiol 54(4):396–405. https://doi.org/10.
2605501 1007/s00265-003-0651-y
29. Guendouz M, Amine A, Hamou RM (2017) A discrete modified 43. Leskovec J, Mcauley JJ (2012) Learning to discover social circles
fireworks algorithm for community detection in complex net- in ego networks. In: Advances in neural information processing
works. Appl Intell 46(2):373–385. https://doi.org/10.1007/ systems, pp 539–547. http://dx.doi.org/10.1145/2556612
s10489-016-0840-9 44. Newman ME (2006) Finding community structure in networks
30. Chen J, Wang H, Wang L, Liu W (2016) A dynamic evolutionary using the eigenvectors of matrices. Phys Rev E 74(3):036104.
clustering perspective: community detection in signed networks https://doi.org/10.1103/PhysRevE.74.036
by reconstructing neighbor sets. Phys A 447:482–492. https://doi. 45. Danon L, Diaz-Guilera A, Duch J, Arenas A (2005) Comparing
org/10.1016/j.physa.2015.12.006 community structure identification. J Stat Mech: Theory Exp
31. Ju Y, Zhang S, Ding N, Zeng X, Zhang X (2016) Complex 2005(09):P09008
network clustering by a multi-objective evolutionary algorithm 46. Krieger AM, Green PE (1999) A generalized Rand-index method
based on decomposition and membrane structure. Sci Rep for consensus clustering of separate partitions of the same data
6:33870. https://doi.org/10.1038/srep33870 base. J Classif 16(1):63–89. https://doi.org/10.1007/s0035799
32. Rani S, Mehrotra M (2018) A hybrid bat algorithm for commu- 00043
nity detection in social networks. In: International conference on
intelligent systems design and applications. Springer, Cham, Publisher’s Note Springer Nature remains neutral with regard to
pp 943–954. https://doi.org/10.1007/978-3-030-16660-1_92 jurisdictional claims in published maps and institutional affiliations.

123

Content courtesy of Springer Nature, terms of use apply. Rights reserved.


Terms and Conditions
Springer Nature journal content, brought to you courtesy of Springer Nature Customer Service Center GmbH (“Springer Nature”).
Springer Nature supports a reasonable amount of sharing of research papers by authors, subscribers and authorised users (“Users”), for small-
scale personal, non-commercial use provided that all copyright, trade and service marks and other proprietary notices are maintained. By
accessing, sharing, receiving or otherwise using the Springer Nature journal content you agree to these terms of use (“Terms”). For these
purposes, Springer Nature considers academic use (by researchers and students) to be non-commercial.
These Terms are supplementary and will apply in addition to any applicable website terms and conditions, a relevant site licence or a personal
subscription. These Terms will prevail over any conflict or ambiguity with regards to the relevant terms, a site licence or a personal subscription
(to the extent of the conflict or ambiguity only). For Creative Commons-licensed articles, the terms of the Creative Commons license used will
apply.
We collect and use personal data to provide access to the Springer Nature journal content. We may also use these personal data internally within
ResearchGate and Springer Nature and as agreed share it, in an anonymised way, for purposes of tracking, analysis and reporting. We will not
otherwise disclose your personal data outside the ResearchGate or the Springer Nature group of companies unless we have your permission as
detailed in the Privacy Policy.
While Users may use the Springer Nature journal content for small scale, personal non-commercial use, it is important to note that Users may
not:

1. use such content for the purpose of providing other users with access on a regular or large scale basis or as a means to circumvent access
control;
2. use such content where to do so would be considered a criminal or statutory offence in any jurisdiction, or gives rise to civil liability, or is
otherwise unlawful;
3. falsely or misleadingly imply or suggest endorsement, approval , sponsorship, or association unless explicitly agreed to by Springer Nature in
writing;
4. use bots or other automated methods to access the content or redirect messages
5. override any security feature or exclusionary protocol; or
6. share the content in order to create substitute for Springer Nature products or services or a systematic database of Springer Nature journal
content.
In line with the restriction against commercial use, Springer Nature does not permit the creation of a product or service that creates revenue,
royalties, rent or income from our content or its inclusion as part of a paid for service or for other commercial gain. Springer Nature journal
content cannot be used for inter-library loans and librarians may not upload Springer Nature journal content on a large scale into their, or any
other, institutional repository.
These terms of use are reviewed regularly and may be amended at any time. Springer Nature is not obligated to publish any information or
content on this website and may remove it or features or functionality at our sole discretion, at any time with or without notice. Springer Nature
may revoke this licence to you at any time and remove access to any copies of the Springer Nature journal content which have been saved.
To the fullest extent permitted by law, Springer Nature makes no warranties, representations or guarantees to Users, either express or implied
with respect to the Springer nature journal content and all parties disclaim and waive any implied warranties or warranties imposed by law,
including merchantability or fitness for any particular purpose.
Please note that these rights do not automatically extend to content, data or other material published by Springer Nature that may be licensed
from third parties.
If you would like to use or distribute our Springer Nature journal content to a wider audience or on a regular basis or in any other manner not
expressly permitted by these Terms, please contact Springer Nature at

[email protected]

You might also like