This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS 1

Comparative Analysis of Feature Selection Algorithms for Computational Personality Prediction From Social Media

Ahmed Al Marouf, Md. Kamrul Hasan, and Hasan Mahmud

Abstract— With the rapid growth of social media, users are getting involved in virtual socialism, generating a huge volume of textual and image content. Content such as status updates/tweets, shared posts/retweets, and likes on other posts reflects the online behavior of the users. Predicting the personality of a user from these digital footprints has become a computationally challenging problem. In a profile-based approach, the user-generated textual content could be useful to reflect personality in social media. A large number of features of different categories, such as traditional linguistic features (character-level, word-level, structural, and so on), psycholinguistic features (emotional affects, perceptions, social relationships, and so on), or social network features (network size, betweenness, and so on), could be used to predict personality traits from social media. According to the widely popular big-five-factor model (BFFM), the five factors are openness-to-experience, conscientiousness, extraversion, agreeableness, and neuroticism. Predicting personality is redefined as predicting each of these traits separately from the extracted features. Traditionally, it takes a huge number of features to get better accuracy on any prediction task, although applying feature selection algorithms may improve the performance of the model. In this article, we have compared the performance of five feature selection algorithms, namely, the Pearson correlation coefficient (PCC), correlation-based feature subset (CFS), information gain (IG), symmetric uncertainty (SU) evaluator, and chi-squared (CHI) method. The performance is evaluated using the classic metrics, namely, precision, recall, f-measure, and accuracy.

Index Terms— Chi-squared (CHI) method, computational personality prediction, feature selection algorithms, information gain (IG), Pearson correlation coefficient (PCC), social media.

Manuscript received April 13, 2019; revised August 31, 2019 and October 31, 2019; accepted December 15, 2019. (Corresponding author: Ahmed Al Marouf.)
Ahmed Al Marouf is with the Department of Computer Science and Engineering, Daffodil International University (DIU), Dhaka 1207, Bangladesh (e-mail: ahmedalmarouf@[Link]).
Md. Kamrul Hasan and Hasan Mahmud are with the Department of Computer Science and Engineering, Islamic University of Technology (IUT), Gazipur 1704, Bangladesh (e-mail: hasank@[Link]; hasan@[Link]).
Digital Object Identifier 10.1109/TCSS.2020.2966910
2329-924X © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See [Link] for more information.

I. INTRODUCTION

SOCIAL media platforms such as Facebook, Twitter, Google+, and Instagram have gained popularity due to their easy access throughout the world and user-friendly interfaces for communicating with others within a short period of time. Each user in these social networking sites (SNSs) is considered as an entity, and each entity is connected with other entities as friends, connections, or followers. While using these SNSs, users can perform many activities, such as posting statuses/tweets, sharing others' posts/retweets, liking others' posts, commenting on others' posts, chatting directly with friends, and playing online games with friends. It is evident that the online behavior of users could be depicted from these activities [1]. Understanding users' behavior may help to identify personality traits.

Predicting users' personalities from the digital footprints of social media is a challenging task, as the context of identifying personality traits in social media is not trivial. Users behave differently in social media and real life. Nevertheless, user-generated content, such as status updates in social media, may provide enough evidential reflection of personality, as an SNS user posts statuses based on his/her current situation, a recent political or popular event, hyped topics, and so on. For example, during an election in his/her country, a user may post positive or negative reviews/opinions about a political party. These types of statuses may have contextual trends, as other friends of the user may also be involved in posting similar statuses. Considering the trend, a user may post his/her political views. Users are creating trends as well as following different trends to become popular or socially accepted by their friends in social media. Moreover, each user has different perceptions and different interest categories that trigger status updates. For defining personality, we have followed the widely used big-five-factor model (BFFM). According to the BFFM, there are four positive personality traits, namely, openness-to-experience (O), conscientiousness (C), extraversion (E), and agreeableness (A), and the only negative trait, neuroticism (N). This personality model is also known as the OCEAN model.

It is evident that user-generated content could be an effective data source to build a predictive model [2]. The status updates posted by SNS users are influenced by culture and personal issues. The structures of various languages actually influence the identity, culture, and diversity of persons [3]. Therefore, the Facebook status has become a research tool for identifying personality [4], and a model could be built based on supervised learning to predict personality traits from Facebook statuses.

Feature extraction and feature selection are applied afterward to identify the most relevant features. Those features are used to train a classification model, and testing is performed afterward. Therefore, finding the most relevant features is one of the challenging tasks to be performed to get better accuracy
Authorized licensed use limited to: University of Canberra. Downloaded on April 29,2020 at [Link] UTC from IEEE Xplore. Restrictions apply.

over the prediction. One of the main contributions of this article is to identify the best feature selection algorithm for extracting the most prominent features among the traditional linguistic, psycholinguistic, and social network (SN) features. As the number of features is relevantly high and the accuracy is low when applying all the features, we have investigated several cases to find the important features and categories of features for predicting personality precisely.

In this article, we compare the existing feature selection algorithms to predict personality from Facebook status updates. In summary, we have the following contributions.

1) Applying different feature selection algorithms, namely, the chi-squared (CHI) method, Pearson correlation coefficient, information gain (IG), correlation-based feature subset (CFS) evaluation, and symmetrical uncertainty attribute evaluation, to predict the big-five-personality traits.

2) We have extracted over 150 features to analyze the predictive system over different types of features, such as traditional linguistic, psycholinguistic, and SN features. In the literature, many researchers have used few features to predict personality, but the overall outcome of those approaches is not quite satisfactory. Combining a huge volume of features has given an evidentially better understanding of personality traits.

3) We have considered several scenarios/cases of feature combinations based on psycholinguistic features to find the best subset of features to predict each personality trait differently. Hence, we have determined the accuracy with and without SN features, which is reported in the experiments.

4) Five different classifiers, namely, the naïve Bayes (NB), decision tree (DT), random forest (RF), simple logistic regression (SLR), and support vector machine (SVM), were used to determine the evaluation metrics to find the best feature selection algorithm. Utilizing these classifiers, we derived several conclusions.

II. RELATED WORKS

In this section, we have discussed state-of-the-art works regarding predicting personality traits and applying feature selection algorithms. This section is divided into several parts: literature on computational personality prediction, literature on the psycholinguistic tools, literature on existing methods and applied features devised by researchers, and feature selection methods applied to similar research problems.

For predicting personality traits computationally, researchers have utilized machine learning techniques, such as supervised/unsupervised learning models and classification algorithms, to classify the traits. International personality item pools (IPIPs) [5] are the items or questions to answer to devise a scoring mechanism for trait identification. These items are presented depending on the behavior of the test-taker on different issues of practical life. Using the IPIP questionnaire, the quantitative method has been adopted for the problem, and many variations of the question sets were used for developing a better ground-truth data set. This manual procedure of taking answers to a set of questionnaires could be easily adopted. However, the main limitation of this process is that the test-takers need to answer the questions honestly. The different IPIP sets are discussed elaborately in Section III.

The correlation between the usage of Facebook, and thus social media, and personality has been studied in [6] and [7]. In [6], the study shows that the correlation is higher for the neuroticism and extraversion traits but average for the other traits. Different literature works establish the relationships between personality and social media use, such as the personality of popular social media users [8], the influence of personality on Facebook usage and wall posting [9], mining social interactions in Facebook [10], capturing personality from photographs or photograph-related posts in social media [11], and so on.

Howlader et al. [12] proposed a topic modeling-based approach applied to Facebook status updates. For this work, they used latent Dirichlet allocation (LDA) and term frequency-inverse document frequency (TF-IDF) as features and applied flexible regression models for prediction. Deep learning-based methods were introduced by Tandera et al. [13]. They applied traditional deep learning algorithms, such as the multilayer perceptron (MLP), long short-term memory (LSTM), gated recurrent unit (GRU), and 1-D convolutional neural network (CNN-1-D). A huge feature set (725 features) has been analyzed in [14] considering basic linguistic features, POS-tagger parameters, AFINN (lexicon list) parameters, and H4Lvd parameters. A review of emerging trends in personality prediction from online social media is performed by Kaushal and Patwardhan [15]. They listed different categories of features, such as linguistic features (LIWC features, POS tags, speech acts, and sentiment features), nonlinguistic features (structural, behavioral, and temporal features), and SN features. Based on the features used for identifying personality traits, the methodologies have been modified. Farnandi et al. [46] have proposed methods for predicting personality from social media considering cross-platform and cross-domain situations. Considering personality prediction as a multilabel prediction task, they have extracted the LIWC, MRC, NRC, and SPLICE features to run several types of regression models.
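The per-trait evaluation metrics named in contribution 4 (precision, recall, f-measure, and accuracy) can be computed from the confusion-matrix counts of a single trait's yes/no predictions. The sketch below is illustrative only (the function names are hypothetical, not the authors' code):

```python
def confusion(y_true, y_pred):
    # Count true/false positives and negatives for one binary trait.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == "yes" and p == "yes")
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == "no" and p == "yes")
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == "yes" and p == "no")
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == "no" and p == "no")
    return tp, fp, fn, tn

def metrics(y_true, y_pred):
    # Precision, recall, f-measure, and accuracy for one personality trait.
    tp, fp, fn, tn = confusion(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, f_measure, accuracy
```

In the big-five setting, this evaluation is simply repeated once per trait (O, C, E, A, N), each treated as an independent binary classification.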


For extracting relevant psychological features from texts, psycholinguistic tools are utilized. These software tools are developed for easier experimentation. LIWC [16], MRC [17], and SPLICE [18] are widely used psycholinguistic tools. Developed by Pennebaker and Francis, LIWC is a word list-based text analysis tool that extracts 93 features consisting of standard counts (word counts, words longer than six letters, and so on), personal concerns (occupation, financial issues, health, and so on), psychological processes (cognitive, emotional, perceptional, and social processes), and other features (punctuation counts, swear words, and so on) [16]. On the other hand, MRC [17] features are computed using the Medical Research Council's psycholinguistic database, which consists of over 150 000 words with linguistic and psycholinguistic features for each word. MRC includes very interesting latent features of text, such as the Kucera–Francis written frequency [19] and the Brown verbal frequency [20]. Structured Programming for Linguistic Cue Extraction (SPLICE) extracts 74 linguistic features. Upon input of textual data, SPLICE [18] outputs various features including quantities (number of characters, sentences, words, and so on), parts-of-speech features (number of nouns, noun ratio, verb ratio, adjective ratio, and so on), immediacy (number of passive verbs and passive verb ratio), pronouns, positive self-evaluation, negative self-evaluation, influence, deference, Whissell dictionary scores (imagery, pleasantness, and activation), text complexity, spoken word features, tense, SentiWordNet features, and readability scores. Among these three widely used closed-vocabulary psycholinguistic tools, we have used LIWC for our work. LIWC consists of a psycholinguistic dictionary in its back end, which contains a huge number of words, synonyms, and antonyms in different psychological categories. LIWC is proven to be useful in the context of personality trait prediction.

Though features play a vital role in a data-driven system, the feature selection methods also significantly find the most prominent features from huge feature vectors. Not only in data mining specifically, feature selection has also become an important tool in bioinformatics and computational biology. Xu et al. [51] proposed an autoencoder-based feature selection method for classification of anticancer drug response. Similarly, Mallik and Zhao [52] presented a graph- and rule-based learning algorithm for cancer-type classification using feature selection. Mallik and Zhao [53] have applied a statistically significant feature extraction-based study on cancer expression using integrated marker recognition, which is mutual-information based.

Apart from the filter-based feature selection algorithms, there are wrapper-based and hybrid feature selection algorithms. Masoudi-Sobhanzadeh et al. [57] have presented "FeatureSelect," which is a software for selecting features based on machine learning approaches, and the software is tested on gene selection methods. Several nature-inspired evolutionary algorithm-based feature selection algorithms have been presented recently. Mafarja et al. [58] presented a binary grasshopper optimization algorithm-based feature selection. Similarly, chaotic hybrid artificial bee colony-based feature selection [59] and binary butterfly optimization-based feature selection [60] were recently introduced in the literature.

In the process of supervised learning, one of the most significant roles is played by the feature selection criteria. Selecting the most relevant features from a huge feature vector has a vital impact on the accuracy of the system. For comparison, in this article, we have utilized the five most conventional feature selection algorithms, namely, IG, the CFS-based subset evaluator (CFS), the CHI method, symmetrical uncertainty attribute evaluation (SU), and the Pearson correlation coefficient (PCC). These feature selection algorithms are discussed in Section IV, including their definitions and formulas.

Therefore, in this article, we have presented an experimental comparison between the feature selection algorithms, and for the experiments, we have extracted more than 150 features. The rest of this article is organized as follows. Section III discusses the computational personality-prediction problem, and Section IV includes the state-of-the-art application areas of the feature selection algorithms. Section V illustrates the proposed experimental method, and the experimental results are enlisted in Section VI. The detailed comparative analysis is depicted in Section VII, and finally, Section VIII concludes with the contributions highlighted.

III. COMPUTATIONAL PERSONALITY-PREDICTION PROBLEM

The computational personality-prediction problem in the context of social media could be defined as "predicting the personality traits from user profile information using computational features rather than asking a set of questionnaires." Usually, for understanding their own personality, people try to take online or off-line personality tests. The traditional personality-prediction systems depend on a set of questionnaires to be answered honestly by the test-taker. Questionnaire-based personality-prediction systems are also popular among the test-takers. The widely used personality tests are the big-five-personality test [20], the Myers–Briggs type indicator (MBTI) [21], and the dominance influence steadiness conscientiousness (DISC) [22]. Among these tests, the big-five-personality test has been widely accepted among test-takers because of the similarity they find between themselves and the result of the test.

Many online personality testing sites, such as 16Personality1, 123test2, Personality Perfect3, PsychCentral Personality Test4, Open Source Psychometrics Project5, See My Personality6, and Discover My Profile7 by the University of Cambridge, are very popular for identifying precise personality, as reviewed by the test-takers. The reviews from each of the websites were analyzed, and positive comments delivered by the reviewers were found. The literature provides evidential proof that computational personality prediction provides better results than manual paper-based methods. Therefore, the acceptability of these online personality tools is much higher than manual questionnaire-based personality testing. Hence, this encourages applying automated personality prediction from social media. It is evident that computational personality judgments are more accurate than those made by humans [32].

1 [Link]
2 [Link]
3 [Link]
4 [Link]
5 [Link]
6 [Link]
7 [Link]

The history of personality prediction goes a long way, as researchers have tried to optimize the number of questions being asked to the test-taker. Usually, a high volume of questions is asked, and the answers are analyzed to predict personality precisely. However, answering these questions could be time-consuming as well as tiring for the test-takers. Therefore, asking a minimum number of questions to get a better prediction could be a challenging task. Researchers have come up with various numbers of questions or items. The NEO five-factor inventory (NEO-FFI) [24] is a 60-item personality measure model. Similar models were proposed by researchers in the psychology area for the personality-prediction task. Depending on scores determined by the IPIP, the computation of


personality traits is performed. Depending on the number of IPIP items considered for prediction, there are several models proposed by many researchers. The 50-item IPIP five-factor model (FFM) proposed by Goldberg [25], the 44-item big-five inventory (BFI) proposed by John and Srivastava [26], the 40-item Big-Five Mini-Markers proposed by Saucier [27], the 20-item Mini-IPIP proposed by Donnellan et al. [28], and the ten-item personality inventory (TIPI) proposed by Gosling et al. [29] are the existing models in the literature. Short forms of item sets are also proven effective in some cases [30]. Although there are many scoring systems adopted for this particular problem, each of them has its own advantages. The myPersonality data set [4], [31] is collected from Facebook users and used the 100-item IPIP questionnaire set.

Although the above-mentioned works use psycholinguistic tools to extract the psycholinguistic features or different types of methods for devising a model, in the literature, there is no article highlighting the best features for predicting personality from social media data. In this article, we have focused on this problem and designed experiments to find a solution.

IV. FEATURE SELECTION ALGORITHMS OVERVIEW

In this section, we have outlined the feature selection algorithms that we have applied for computational personality prediction. The five different algorithms are applied to output the features that are most relevant to the prediction task. All these feature selection methods provide a ranking generated based on the relevance between the feature and the class.

The CFS subset evaluator [33], [34] is a feature selection algorithm that finds the subset of features via the individual predictive ability of each feature along with the degree of redundancy between them. CFS ranks the feature subsets according to a correlation-based heuristic evaluation method. The subset evaluation function is given in (1), where M_s is the heuristic merit of the feature subset S containing k features, r̄_cf is the mean feature-class correlation (f ∈ S), and r̄_ff is the average feature-feature intercorrelation:

    M_s = k · r̄_cf / sqrt(k + k(k − 1) · r̄_ff).    (1)

The CFS subset evaluator is used in different contexts of research, such as predicting students' performance [35] and selecting features for sentiment classification [36].

IG is one of the widely used feature selection methods in different research problems, including text categorization [37], [38]. Various research fields have utilized the inner mechanism of IG, such as computer vision and text classification [38], [39]. IG outputs a value calculated by (2), where vals(a) denotes the set of all possible values of feature a ∈ Attr, Attr is the set of all features, H is the entropy, and T_v = {x ∈ T | x_a = v} is the set of examples in T whose feature a takes value v. The largest IG corresponds to the smallest conditional entropy:

    IG(T, a) = H(T) − Σ_{v ∈ vals(a)} (|T_v| / |T|) · H(T_v).    (2)

In the context of statistics, the uncertainty coefficient or entropy coefficient is a measure of nominal association. The symmetrical uncertainty (SU) [39] attribute evaluator is one kind of correlation finder that evaluates the importance of a feature by measuring the SU with respect to the class. This feature selection process is not only used for imagery data, such as hyperspectral images [40], but is also used with nature-inspired optimization algorithms, such as ant colony optimization [41]. The SU is determined using (3), where H(C|F) is the conditional entropy of the class C given the feature F, and H(C) is the entropy of the class C. The algorithm outputs a ranking of the most relevant features:

    SU(C, F) = 2 · (H(C) − H(C|F)) / (H(C) + H(F))    (3)

    H(C) = − Σ_x P_C(x) · log P_C(x)    (4)

    H(C|F) = − Σ_{x,y} P_{C,F}(x, y) · log P_{C|F}(x | y).    (5)

The CHI test (χ²) [42], [43] is used in statistics for determining the association between variables or features. Depending on the difference between the expected frequencies (e) and the observed frequencies (n) of one or more features in the feature set, the CHI value is determined. Depending on the value of this parameter, we can decide the number of features to be selected for a system. The equation for calculating the CHI value is given as

    χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (n_ij − e_ij)² / e_ij    (6)

where r and c are the numbers of rows and columns of the feature table.

PCC [44] is considered one of the most efficient and widely accepted feature selection algorithms. In PCC, the covariance between the class and the feature is determined. The standard deviations (SDs) of the class and the feature are calculated to find the coefficient value (ρ), which could be used as an efficient parameter to determine the feature sets. The calculation of ρ is performed using (7), where cov(X, Y) is the covariance between X and Y, X or Y is the class value, and σ is the SD:

    ρ_{X,Y} = cov(X, Y) / (σ_X · σ_Y).    (7)

The correlation value is distributed between −1 and +1, where +1 is total positive correlation, 0 is no linear correlation, and −1 is total negative correlation. The coefficient is invariant under separate changes of scale and location in the two variables, which could be considered a key mathematical property of PCC. Depending on the above-mentioned parameters, the number of features to be selected for the problem could be determined. For comparing the feature selection criteria, we have experimented with various scenarios for computational personality trait prediction.
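For discrete toy data, the entropy-based measures in (2)–(5) and the PCC in (7) can be sketched in a few lines of Python. This is an illustrative implementation of the standard formulas, not the authors' experimental code:

```python
import math
from collections import Counter

def entropy(values):
    # H(X) = -sum p(x) * log2 p(x), as in (4)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def info_gain(classes, feature):
    # IG(C, F) = H(C) - H(C | F), as in (2)
    n = len(classes)
    h_cond = 0.0
    for v in set(feature):
        subset = [c for c, f in zip(classes, feature) if f == v]
        h_cond += len(subset) / n * entropy(subset)
    return entropy(classes) - h_cond

def symmetric_uncertainty(classes, feature):
    # SU(C, F) = 2 * (H(C) - H(C|F)) / (H(C) + H(F)), as in (3)
    return 2 * info_gain(classes, feature) / (entropy(classes) + entropy(feature))

def pearson(xs, ys):
    # rho = cov(X, Y) / (sigma_X * sigma_Y), as in (7)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)
```

A feature that perfectly separates the two classes attains IG = 1 bit and SU = 1, while an uninformative feature scores 0, which is the ranking behavior these filter methods exploit.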


Fig. 1. Steps of the experimental method.

TABLE I
CLASS DISTRIBUTION OF MYPERSONALITY DATA SET

V. EXPERIMENTAL METHOD

For the experimental analysis, we have designed a common method for testing the performance of each of the feature selection algorithms. The proposed experimental method consists of data acquisition, data preprocessing, feature extraction, feature selection, and classification, as depicted in Fig. 1. For each of the five personality traits, we apply the proposed method. The rest of the section elaborately discusses the steps of the experimental method.

A. Data Acquisition

For our experiment, we have used the myPersonality data set [4], [31], which consists of the status updates, SN features, and ground-truth personality trait scores, as well as classes. The traits used in the data set are formalized in the BFFM. For each of the five personality traits, openness to experience (OPN), conscientiousness (CON), extraversion (EXT), agreeableness (AGR), and neuroticism (NEU), the personality score and the class value (yes or no) are given in the data set.

The data set contains 250 users and around 10 000 status updates, and it is considered a ground-truth data set for personality prediction. The class distribution of the myPersonality data set is demonstrated in Table I.

B. Data Preprocessing

All the statuses of the data set are in English and follow every step of preprocessing. The preprocessing step consists of the removal of URLs, names, symbols, and unnecessary spaces, and stemming. These operations are performed using the NLTK [45] library.

TABLE II
TRADITIONAL LINGUISTIC FEATURES

C. Feature Extraction

In this step, the extracted features are in two categories: linguistic features and SN features. We have extracted the traditional linguistic features and psycholinguistic features as well.

1) Traditional Linguistic Features: The traditional linguistic features are textual features that could be divided into four types: character-based, word-based, structural, and function words. The list of traditional features considered for our study is shown in Table II. For extracting the linguistic features, we have applied LIWC [16] on the preprocessed textual data. LIWC gives a total of 93 features having psycholinguistic and traditional linguistic categorical features. All the features are integer or fractional values, meaning the percentages of words in specific categories.

2) Psycholinguistic Features: Among the 93 features, only 28 could be considered psycholinguistic features, divided into five categories, namely, emotional affect, cognitive process, self-focus, social relationships, and perceptions.
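The cleaning pass described in the preprocessing step (removal of URLs, symbols, and unnecessary spaces) can be sketched with simple regular expressions. This is an illustrative stand-in only: the authors use the NLTK library, and stemming is omitted here for brevity:

```python
import re

def preprocess(status):
    # Illustrative sketch of the cleaning pass; stemming (done with NLTK
    # in the original pipeline) is omitted here.
    text = re.sub(r"https?://\S+|www\.\S+", " ", status)  # remove URLs
    text = re.sub(r"[^A-Za-z\s]", " ", text)              # remove symbols/digits
    return re.sub(r"\s+", " ", text).strip().lower()      # collapse spaces
```

Each cleaned status is then ready for LIWC, whose category features are percentages of words falling into its dictionary categories.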


TABLE III
PSYCHOLINGUISTIC FEATURES EXTRACTED USING LIWC

The features associated with each of the categories are demonstrated in Table III.
Apart from the psycholinguistic features, another 65 different linguistic features are extracted using LIWC. The linguistic features are word count, analytical words, tone, words per sentence, number of six-letter words, number of articles, different punctuation symbols (period, comma, colon, semicolon, question mark, exclamation mark, dash, quote, apostrophe, parenthesis, and so on), and so on. The percentages of function words or parts of speech, such as the percentage of nouns, pronouns (personal and impersonal), prepositions, adverbs, conjunctions, verbs, adjectives, comparatives, and interrogative words, are considered as extracted linguistic features.
3) Social Network Features: The second type of feature category is SN features. In SNSs, the architecture is built upon a graph. Each of the users is considered as one of the nodes of this huge graph. The edge between these nodes could be considered as the friend or connection between users. Therefore, the SN works as a huge graph. Similar gene network analysis could be utilized through ranking of biomolecules for biomedical data sets [56]. Moreover, in the myPersonality data set, SN features are extracted from this huge graph. The SN features are network size (F87), betweenness (F88), n-betweenness (F89), density (F90), brokerage (F91), n-brokerage (F92), and transitivity (F93). These features are closely related to the behavior and personality of a user.
Network size defines the number of friends, connections, or followers in SNSs. Using this feature, we may predict whether the user has a decent number of friends or not; having a smaller number of friends may indicate the characteristics of an introverted user, and vice versa
NS(v) = Total no. of edges of v. (8)
The betweenness centrality g(v) of a node v in a given graph could be determined using (9). Centrality is the measure to determine the central nodes within a graph, whereas the betweenness centrality demonstrates how many times a node behaved as a connector along the shortest path between two other nodes [54]. This measure is useful in assessing which nodes are central with respect to spreading information and influencing others in their immediate neighborhood
g(v) = Σ_{s ≠ v ≠ t} σ_st(v) / σ_st. (9)
Normalized betweenness centrality is the normalized value of g(v) with respect to the minimum and maximum values of g [54], [55]. The formula to determine the n-betweenness is (10)
normal(g(v)) = (g(v) − min(g)) / (max(g) − min(g)). (10)
Density is the measure of network connections. Network density could be measured using formula (11). Density demonstrates how many of the potential connections in a network are actual connections
Density = Actual Connections / Maximum Possible Connections. (11)
Brokerage refers to nodes embedded in their neighborhood, which is very useful in understanding power, influence, and dependence effects on graphs. A broker could be considered as the communicator between two different nodes [55]. Five types of brokers are available in the literature, namely, coordinator, consultant, gatekeeper, representative, and liaison. It is possible that different types of brokers are present in a single SN graph. The general concept of brokerage could be depicted as in Fig. 2.

Fig. 2. Broker B between A and C.

In a graph, if A is connected to B, and B is connected to C, but A and C are not connected to each other, then A needs B to communicate with C. Thus, B is the broker node here.
The description of the five different types of broker nodes is illustrated in Table IV. The equations used in Table IV consider node B as a broker, and G(x) denotes the group that node x belongs to. It is presumed that A → B → C.
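Before turning to the broker roles of Table IV, the SN measures in (8)-(11) and (13) can be sketched in plain Python on a toy graph shaped like Figs. 2 and 3 (a triangle A, B, C plus a pendant node D). The graph and the resulting values are illustrative only and do not come from the myPersonality data:

```python
from collections import deque
from itertools import combinations

# Toy undirected graph: triangle A-B-C plus pendant node D.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("B", "D")]
nodes = sorted({n for e in edges for n in e})
adj = {n: set() for n in nodes}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

def shortest_path_counts(s):
    """BFS from s: distance to every node and the number of shortest paths."""
    dist, sigma = {s: 0}, {s: 1}
    q = deque([s])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w], sigma[w] = dist[u] + 1, 0
                q.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma

def betweenness(v):
    """Eq. (9): sum over pairs s != v != t of sigma_st(v) / sigma_st."""
    g = 0.0
    for s, t in combinations([n for n in nodes if n != v], 2):
        dist_s, sigma_s = shortest_path_counts(s)
        dist_v, sigma_v = shortest_path_counts(v)
        on_path = dist_s[v] + dist_v[t] == dist_s[t]
        g += (sigma_s[v] * sigma_v[t] / sigma_s[t]) if on_path else 0.0
    return g

network_size = len(adj["B"])                      # eq. (8): degree of B
n = len(nodes)
density = 2 * len(edges) / (n * (n - 1))          # eq. (11)
triangles = sum(1 for a, b, c in combinations(nodes, 3)
                if b in adj[a] and c in adj[b] and c in adj[a])
triples = sum(len(adj[v]) * (len(adj[v]) - 1) // 2 for v in nodes)
transitivity = 3 * triangles / triples            # eq. (13)

print(network_size, betweenness("B"), round(density, 3), transitivity)
# 3 2.0 0.667 0.6
```

B sits on every shortest path to the pendant node D, which is what drives its betweenness of 2; a graph library such as networkx would give the same quantities, but the pure-stdlib version keeps the equations visible.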


TABLE IV
DIFFERENT CATEGORIES OF BROKER NODE

Thus, A, the source node, gives information to B, the broker node, who gives the information to C (the destination node).
N-brokerage is the normalized parameter of brokerage, that is, the number of broker nodes divided by the number of pairs [55]. The equation could be derived as follows:
n-brokerage = number of broker nodes / number of pairs. (12)
Transitivity is a measurement that could be defined through the friend-of-friend (FOF) concept of social media, such as Facebook. The idea of FOF is "when a friend of my friend is my friend."
In the context of network or graph theory, transitivity is measured based on the relative number of triangles or triads present in the graph compared with the total number of connected triples of nodes. The idea of transitivity is depicted in Fig. 3, and the equation to calculate the transitivity T(G) is (13), as follows.

Fig. 3. Idea of transitivity.

As shown in Fig. 3, A is a friend of B, B is a friend of C, and A is also a friend of C. The relationships among them build a triad
T(G) = 3 × no. of closed triples in G / no. of connected triples of vertices in G. (13)

TABLE V
PERFORMANCE METRICS APPLYING CLASSIFIERS FOR PREDICTING EXTRAVERSION USING LIWC FEATURES

D. Feature Selection
Feature selection algorithms are used to find the essential or important features from a feature vector. The experiment starts with inputting the labeled data; data preprocessing is then performed, leaving the features to be identified. In the feature extraction step, we have collected the prominent features, each feature vector containing 93 features
F = {F1, F2, F3, . . . , F93}. (14)
These feature vectors are used to find the optimal number of essential features using the feature selection methods. The mentioned five different types of feature selection methods are adopted, and the selected features are fed to the classifiers. The performance metrics are determined for each classifier to evaluate the experimental method. Finally, the most accurate feature selection algorithm is identified.

E. Classification Methods
In this article, we have applied classic classification methods to evaluate the performance of the proposed experimental method. NB, RF, DT, SLR, and SVM are implemented in the experiment process. The state-of-the-art classifiers are considered for the experiment, not adaptive versions.

VI. EXPERIMENTAL RESULTS
In this section, we have evaluated the system using the evaluation metrics, such as precision, recall, F-measure, and accuracy. We have divided the research contributions into three different experiments.
1) Experiment 1 (Using All LIWC Features for Predicting Five Personality Traits): In this experiment, we have focused on the 93 LIWC features fed into the prediction model. Table V shows the metric values along with the SD applying different classifiers to the LIWC features for the extraversion trait only. The highest accuracy (61.07%) is shown by the SVM classifier for the extraversion trait. The same metrics are reported for the other four personality traits in Table VI.
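The per-trait precision, recall, F-measure, and accuracy reported in these tables follow the standard binary-classification definitions. A minimal sketch, with made-up confusion counts rather than any figures from the paper, is:

```python
def metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics, computed per trait
    from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_measure, accuracy

# Illustrative counts only, not taken from Tables V or VI.
p, r, f1, acc = metrics(tp=40, fp=10, fn=20, tn=30)
print(round(p, 2), round(r, 2), round(f1, 2), round(acc, 2))
# 0.8 0.67 0.73 0.7
```

In practice these would be averaged over cross-validation folds, which is where the reported standard deviations (SD) come from.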


TABLE VI
ACCURACY MEASUREMENTS OF EXPERIMENT-1 ON OCEAN TRAITS

TABLE VII
ACCURACY MEASUREMENTS APPLYING CLASSIFIERS FOR PREDICTING EXTRAVERSION IN DIFFERENT SCENARIOS

For each of the personality traits, the accuracy along with the average and SD is tabulated in Table VI. The average and SD of the classifiers for each trait are given in the last two columns, and the average and SD of each classifier are given in the last two rows. The highest accuracy reported for each trait is kept in bold.
2) Experiment 2 (Using Psycholinguistic Cues and Feature Selection Algorithms Applied for Predicting Five Personality Traits): For this experiment, we have focused mostly on the psycholinguistic features and their combination with the SN features. The performance metrics are determined for all the combinations and personality traits. We have compared the cases with and without SN features. The SN features proved to be closely related to the class, as in each case the accuracy is higher than without using these features. Table VII demonstrates the accuracy along with the SD in all the scenarios for the extraversion trait only. In this article, for experimental analysis, we have considered more than 100 extracted features and used combinations of these features to find the best feature set using the feature selection algorithms.
It is evident in the literature review that various types of features generate quite different results for the prediction system. Therefore, we have tried 15 different combinations with and without applying feature selection algorithms. In Table VII, we have listed the scenarios that we have considered. The first ten feature scenarios are without applying the feature selection algorithms, and the last five feature combinations are extracted by applying five different feature selection algorithms. The experiments show that the traditional classifiers act differently according to the diversified psycholinguistic features.
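Scenario construction of this kind can be sketched as follows; the group names and feature lists are placeholders, since the actual 15 combinations are those listed in Table VII:

```python
# Hypothetical feature groups standing in for the paper's scenarios; the
# real combinations come from Table VII, which is not reproduced here.
groups = {
    "affect":    ["posemo", "negemo"],
    "cognitive": ["insight", "cause"],
    "social":    ["family", "friend"],
}
sn = ["network-size", "betweenness", "density"]   # SN subset (illustrative)

scenarios = []
for name, feats in groups.items():
    scenarios.append((name, feats))               # psycholinguistic cues alone
    scenarios.append((name + "+SN", feats + sn))  # same cues plus SN features

for name, feats in scenarios:
    print(name, feats)
```

Each named scenario would then be fed to the five classifiers, producing one row of a table like Table VII.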
These feature combination scenarios are considered separately for each of the five personality traits, and the performance metrics are determined. As stated in different state-of-the-art articles, and for ease of understanding, we compare the scenarios based on the accuracy (ACC).
As stated in Table VII, the highest accuracy is obtained from the Pearson correlation-based feature selection algorithm. We have determined the feature-class correlation index for each of the features and selected only the features having ρ > 0.10, that is, the more highly correlated features. The 17 features selected using this method are all seven SN features (network size, betweenness, n-betweenness, density, brokerage, n-brokerage, and transitivity) and ten LIWC features (pronoun, they, I, filler, drives, authentic, dash, interrog, reward, and body).
The comparative scenarios of using and not using the SN features with the individual psycholinguistic cues are given in Fig. 4. Fig. 4 illustrates that the SN features are deeply insightful and influential factors, as in each of the cases (except for NB and cognitive process), input features including the SN features give better accuracy. Therefore, for each of


TABLE VIII
ACCURACY MEASUREMENTS APPLYING CLASSIFIERS FOR PREDICTING BIG-FIVE-PERSONALITY TRAITS

Fig. 4. With and without SN features' accuracy measures of classifiers.

the personality traits, it is evident that SN features are selected while using feature selection algorithms.
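The correlation-based selection described above (keeping features whose feature-class correlation exceeds ρ = 0.10) can be sketched with a plain-Python PCC; the feature names and values below are synthetic and illustrative only:

```python
def pearson(x, y):
    """Sample Pearson correlation coefficient between two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Tiny synthetic data: two feature columns and a binary trait label.
# Real inputs would be the 93-dimensional feature vectors of eq. (14).
labels = [0, 0, 1, 1]
features = {
    "network-size": [1, 2, 3, 4],    # strongly correlated with the label
    "comma":        [1, -1, 1, -1],  # uncorrelated with the label
}

selected = [name for name, col in features.items()
            if pearson(col, labels) > 0.10]
print(selected)
# ['network-size']
```

The same threshold rule, applied to all 93 features per trait, yields the 17-feature set reported for extraversion.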

VII. COMPARATIVE ANALYSIS

A. Comparisons on Feature Combination Scenarios
As demonstrated in Table VII, for all 15 feature combination scenarios, five classifiers have been applied to predict the big-five personality traits. We have compiled the best/highest accuracy (kept in bold) for predicting each of the traits and illustrated it in Table VIII. In Table VIII, the number in brackets gives the classification method number, where NB is (1), DT is (2), and RF, SLR, and SVM are (3), (4), and (5), respectively. For example, LIWC features give 69.67% accuracy using NB for predicting openness-to-experience. For this article, we have concentrated on the last five feature combination scenarios, focusing on the feature selection algorithms. The data in Table VIII show that the Pearson correlation-based selected features always give the highest accuracy for each of the personality traits. The classification algorithm giving the highest accuracy is NB, except for the openness-to-experience trait (O), for which the RF algorithm gives the highest accuracy.
For comparing the feature selection methods in Table IX, the pairwise t-test is reported for showing the statistical significance of the methods, as shown in [47]. For simplicity, we have reported the t-test values and degree-of-freedom values of the PCC method against the rest of the methods for


TABLE IX
T-TEST RESULTS FOR FEATURE SELECTION METHODS

TABLE X
WIN-DRAW-LOSS TABLE OF FEATURE SELECTION METHODS

the accuracy metric. The two-tailed t-test is performed assuming alpha = 0.05, and the hypothesized mean difference is equal to zero.
From Table VIII, we can claim that among the feature selection methods, PCC provides better accuracy than the others. For a better understanding of the comparison, we have computed the win-draw-loss, as constructed in [48]-[50], into Table X, describing how many times each method has won against the other methods. From the data, we can see that PCC has won against all the other methods every time and has not even drawn with any method. Therefore, we have analyzed the features selected using the PCC method in Section VII-B.

B. Insights of the Pearson Correlation-Based Selected Features
The PCC, as given in (7), is invariant under separate changes of scale and location in the two variables, which could be considered a key mathematical property of PCC. Therefore, the PCC has been used in diversified research problems for the same purpose of feature selection. Kim et al. [62] presented a correlation analysis for DNA microarray data sets, such as leukemia, colon, and lymphoma. They utilized ensemble classifiers to get the highest accuracy on each of the data sets. PCC has also been utilized in image processing, such as tissue classification from CT images [63].
The implication of PCC for noise removal in the context of signal processing is presented in [44], which provides experimental justification for using PCC on signal data. The statistical perspective of using PCC has been presented in [64], which focuses on the medical research domain. A practical application of PCC has been demonstrated in [64] using the sample data of 780 women attending their first antenatal clinic visits. In the context of natural language processing (NLP), PCC has proven to work well for many applications, such as a neurolinguistic approach to NLP using medical text analysis [65], automated classification of radiology reports for acute lung injury using machine learning and NLP [66], finding a strong correlation between text quality and complex network features [67], and detection using NLP [68].
The selected features that are determined by applying the Pearson correlation-based feature selection method give very promising insights about personality traits. Here, we have considered the features in set representation and found interesting combinations for different traits. For each of the traits, the sets are named using their initial, such as E for extraversion and N for neuroticism.
E = {network-size, betweenness, n-betweenness, density, brokerage, n-brokerage, transitivity, ppronoun, they, I, filler, drives, Authentic, Dash, interrog, reward, body}
N = {network-size, betweenness, density, brokerage, transitivity, relig, number, comma, differ, work}
A = {parenth, transitivity, clout, we, social, nonflu, they, adverb, n-betweenness, swear, quote, informal, she/he, word-per-sentence (WPS), differ, male}
C = {network-size, betweenness, n-betweenness, density, brokerage, sad, Dash, friend, social, feel, clout, you, colon, power, authentic, Dic, percept, male, family, anx, affiliation, differ, discrep}
O = {informal, feel, affect, conj, filler, focuspast, swear, allpunc, period}
U = E ∪ N ∪ A ∪ C ∪ O = {network-size, betweenness, n-betweenness, density, brokerage, n-brokerage, transitivity, ppronoun, they, I, filler, drives, Authentic, Dash, interrog, reward, body, relig, number, comma, differ, work, parenth, clout, we, social, nonflu, adverb, swear, quote, informal, she/he, word-per-sentence (WPS), male, sad, friend, feel, you, colon, power, Dic, percept, family, anx, affiliation, discrep, affect, conj, focuspast, allpunc, period}
E ∩ N = {network-size, betweenness, density, brokerage, transitivity}
E ∩ A = {n-brokerage, transitivity}
E ∩ C = {network-size, betweenness, n-betweenness, density, brokerage, Dash}
E ∩ O = ∅.
From the above-mentioned sets, we can see that the SN features play an influential role in high-accuracy predictions. The seven SN features could be found in each of the trait sets, showing their influence, except for set O. Therefore, the openness-to-experience (O) trait has a lesser correlation with the SN features. The universal set U, the union of all the sets, contains 51 distinct features. From the features selected by the Pearson correlation, we have obtained the highest accuracy of 72.13% applying the NB classifier for the extraversion trait. From the above-mentioned sets, we can declare the following findings.
1) SN features are the most prominent features as they are highly correlated with personality traits.
2) Among the psycholinguistic features, all the "social relationship" features are found in the universal set except the number of female-related words.


3) The influence of punctuation is at a decent level: the dash, comma, parenth, quote, colon, period, and allpunc features are present in the universal set.
4) The important function words (personal pronouns, interrogatives, adverbs, and conjunctions) are present in the U set and have a good correlation with the relative classes.
5) The openness-to-experience trait has shown divergent results, and its selected feature set does not contain any SN features.
6) PCC outperforms the other existing feature selection algorithms for predicting personality from social media using linguistic and SN features.

C. Comparison With Literature Methods

TABLE XI
COMPARISON WITH THE LITERATURE METHODS

However, there are very few works found in the literature utilizing the traditional linguistic, psycholinguistic, and SN features altogether for predicting personality from social media. Table XI shows comparisons of the literature methods and the features they used for prediction with our approach. The proposed feature selection approach in this article has shown better accuracy using the features selected through the PCC algorithm.

VIII. CONCLUSION
We have presented a comparative analysis among the feature or attribute selection algorithms for predicting personality using positive and negative traits. This article works with user-generated social media contents, such as Facebook status updates, and extracts the most relevant features, including the textual features, such as traditional and psycholinguistic features.
As we know, the base of social media is basically a graph. The connections between the nodes and the impact on them due to social media interactions could be reflected through the SN features. We have designed and performed experiments utilizing the linguistic features as well as the SN features. To the best of our knowledge, we have used the largest number of features to compare the performance of the feature selection algorithms for predicting personality traits. Feature combination or subset-based scenarios are used to identify the best possible features. According to the experimental findings, among the tested algorithms, the PCC-based selected features have outperformed the literature methods, giving 72.13% accuracy for the extraversion trait. The overall accuracy for each of the personality traits of the BFFM has increased after using the PCC-based feature selection algorithm.

REFERENCES
[1] M. M. Hasan, N. H. Shaon, A. A. Marouf, M. K. Hasan, H. Mahmud, and M. M. Khan, "Friend recommendation framework for social networking sites using user's online behavior," in Proc. 18th Int. Conf. Comput. Inf. Technol. (ICCIT), Dec. 2015, pp. 539-543.
[2] M. S. H. Mukta, M. E. Ali, and J. Mahmud, "User generated vs. supported contents: Which one can better predict basic human values?" in Proc. Int. Conf. Social Inform. Cham, Switzerland: Springer, 2016, pp. 454-470.
[3] C. P. Williams. (Feb. 23, 2013). Language, Identity, Culture, and Diversity. [Online]. Available: [Link] policy/edcentral/multilingualismmatters/
[4] M. Kosinski, S. C. Matz, S. D. Gosling, V. Popov, and D. Stillwell, "Facebook as a research tool for the social sciences: Opportunities, challenges, ethical considerations, and practical guidelines," Amer. Psycholog., vol. 70, no. 6, pp. 543-556, Sep. 2015.
[5] International Personality Item Pool. Accessed: Jan. 7, 2020. [Online]. Available: [Link]
[6] Y. Bachrach, M. Kosinski, T. Graepel, P. Kohli, and D. Stillwell, "Personality and patterns of Facebook usage," in Proc. 4th Annu. ACM Web Sci. Conf. (WebSci), Evanston, IL, USA, Jun. 2012, pp. 24-32.
[7] J. Golbeck, C. Robles, and K. Turner, "Predicting personality with social media," in Proc. Extended Abstracts Hum. Factors Comput. Syst., Vancouver, BC, Canada, May 2011, pp. 253-262.
[8] D. Quercia, R. Lambiotte, D. Stillwell, M. Kosinski, and J. Crowcroft, "The personality of popular Facebook users," in Proc. CSCW, Seattle, WA, USA, Feb. 2012, pp. 955-964.
[9] K. Moore and J. C. McElroy, "The influence of personality on Facebook usage, wall postings, and regret," Comput. Hum. Behav., vol. 28, no. 1, pp. 267-274, Jan. 2012.
[10] A. Ortigosa, R. M. Carro, and J. I. Quiroga, "Predicting user personality by mining social interactions in Facebook," J. Comput. Syst. Sci., vol. 80, no. 1, pp. 57-71, Feb. 2014.
[11] A. Eftekhar, C. Fullwood, and N. Morris, "Capturing personality from Facebook photos and photo-related activities: How much exposure do you need?" Comput. Hum. Behav., vol. 37, pp. 162-170, Aug. 2014.
[12] P. Howlader, K. K. Pal, A. Cuzzocrea, and S. D. M. Kumar, "Predicting Facebook-users' personality based on status and linguistic features via flexible regression analysis techniques," in Proc. 33rd Annu. ACM Symp. Appl. Comput. (SAC), Pau, France, Apr. 2018, pp. 339-345.
[13] T. Tandera, D. Suhartono, R. Wongso, and Y. L. Prasetio, "Personality prediction system from Facebook users," in Proc. 2nd Int. Conf. Comput. Sci. Comput. Intell. (ICCSCI), Bali, Indonesia, Oct. 2017, pp. 604-611.


[14] D. Markovikj, S. Gievska, M. Kosinski, and D. Stillwell, "Mining Facebook data for predictive personality modeling," Comput. Pers. Recognit., AAAI, Tech. Rep., Menlo Park, CA, USA, 2013, pp. 23-26.
[15] V. Kaushal and M. Patwardhan, "Emerging trends in personality identification using online social networks—A literature survey," ACM Trans. Knowl. Discov. Data, vol. 12, no. 2, pp. 1-30, Jan. 2018.
[16] J. W. Pennebaker, M. E. Francis, and R. J. Booth. (2001). Linguistic Inquiry and Word Count: LIWC2001. Erlbaum, Mahwah, NJ, USA. [Online]. Available: [Link]
[17] M. Coltheart, "The MRC psycholinguistic database," Quart. J. Exp. Psychol. A, vol. 33, no. 4, pp. 497-505, Nov. 1981.
[18] K. Moffitt, J. Giboney, E. Ehrhardt, J. Burgoon, and J. Nunamaker. (2010). Structured Programming for Linguistic CUE Extraction. [Online]. Available: [Link]
[19] H. Kucera and W. N. Francis, Computational Analysis of Present-Day American English. Providence, RI, USA: Brown Univ. Press, 1967.
[20] G. D. A. Brown, "A frequency count of 190,000 words in the London-Lund Corpus of English conversation," Behav. Res. Methods Instrum. Comput., vol. 16, no. 6, pp. 502-532, 1984.
[21] L. R. Goldberg, "The development of markers for the big-five factor structure," Psychol. Assessment, vol. 4, no. 1, pp. 26-42, 1992, doi: 10.1037/1040-3590.4.1.26.
[22] I. B. Myers, M. H. McCaulley, N. L. Quenk, and A. L. Hammer, MBTI Manual: A Guide to the Development and Use of the Myers-Briggs Type Indicator, vol. 3, 3rd ed. Palo Alto, CA, USA: Consulting Psychologists Press, 1998.
[23] W. Marston, Emotions of Normal People. New York, NY, USA: Taylor & Francis, 1999.
[24] P. T. Costa, Jr., and R. R. McCrae, NEO-PI-R Professional Manual. Odessa, FL, USA: Psychological Assessment Resources, 1992.
[25] L. R. Goldberg, "A broad-bandwidth, public-domain, personality inventory measuring the lower-level facets of several five-factor models," in Personality Psychology in Europe, vol. 7, I. Mervielde, I. Deary, F. De Fruyt, and F. Ostendorf, Eds. Tilburg, The Netherlands: Tilburg Univ. Press, 1999, pp. 7-28. [Online]. Available: [Link]
[26] O. P. John and S. Srivastava, "The big five trait taxonomy: History, measurement, and theoretical perspectives," in Handbook of Personality: Theory and Research, L. A. Pervin and O. P. John, Eds., 2nd ed. New York, NY, USA: Guilford Press, 1999, pp. 102-138.
[27] G. Saucier, "Mini-markers: A brief version of Goldberg's unipolar big-five markers," J. Pers. Assessment, vol. 63, no. 3, pp. 506-516, Dec. 1994.
[28] M. B. Donnellan, F. L. Oswald, B. M. Baird, and R. E. Lucas, "The mini-IPIP scales: Tiny-yet-effective measures of the big five factors of personality," Psychol. Assessment, vol. 18, no. 2, pp. 192-203, Jun. 2006.
[29] S. D. Gosling, P. J. Rentfrow, and W. B. Swann, "A very brief measure of the Big-Five personality domains," J. Res. Pers., vol. 37, no. 6, pp. 504-528, 2003.
[30] J. A. Johnson, "Developing a short form of the IPIP-NEO: A report to HGW Consulting," Dept. Psychol., Univ. Pennsylvania, DuBois, PA, USA, Tech. Rep., 2000.
[31] M. Kosinski, D. Stillwell, and T. Graepel, "Private traits and attributes are predictable from digital records of human behavior," Proc. Nat. Acad. Sci. USA, vol. 110, no. 15, pp. 5802-5805, Apr. 2013.
[32] W. Youyou, M. Kosinski, and D. Stillwell, "Computer-based personality judgments are more accurate than those made by humans," Proc. Nat. Acad. Sci. USA, vol. 112, no. 4, pp. 1036-1040, Jan. 2015.
[33] M. A. Hall, "Correlation-based feature subset selection for machine learning," Ph.D. dissertation, Dept. Comput. Sci., Univ. Waikato, Hamilton, New Zealand, Apr. 1999.
[34] M. Hall and L. A. Smith, "Feature subset selection: A correlation-based filter approach," in Proc. 4th Int. Conf. Neural Inf. Process. Intell. Inf. Syst., 1997, pp. 855-858.
[35] M. Doshi and R. K. Chaturvedi, "Correlation based feature selection (CFS) technique to predict student performance," Int. J. Comput. Netw. Commun., vol. 6, no. 3, pp. 197-206, May 2014.
[36] A. Abbasi, S. France, Z. Zhang, and H. Chen, "Selecting attributes for sentiment classification using feature relation networks," IEEE Trans. Knowl. Data Eng., vol. 23, no. 3, pp. 447-462, Mar. 2011.
[37] G. Forman, "An extensive empirical study of feature selection metrics for text classification," J. Mach. Learn. Res., vol. 3, pp. 1289-1305, Mar. 2003.
[38] Z. Gao, Y. Xu, F. Meng, F. Qi, and Z. Lin, "Improved information gain-based feature selection for text categorization," in Proc. 4th Int. Conf. Wireless Commun., Veh. Technol., Inf. Theory Aerosp. Electron. Syst. (VITAE), May 2014, pp. 11-14.
[39] L. Yu and H. Liu, "Feature selection for high-dimensional data: A fast correlation-based filter solution," in Proc. 20th Int. Conf. Mach. Learn., 2003, pp. 856-863.
[40] E. Sarhrouni, A. Hammouch, and D. Aboutajdine, "Application of symmetric uncertainty and mutual information to dimensionality reduction and classification of hyperspectral images," Int. J. Eng. Technology, vol. 4, no. 5, pp. 268-276, 2012.
[41] S. Imranali and W. Shahzad, "A feature subset selection method based on symmetric uncertainty and ant colony optimization," Int. J. Control Automat., vol. 60, no. 11, pp. 5-10, Jul. 2017.
[42] K. Pearson, "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling," Philosophical Magazine, vol. 5, no. 50, pp. 157-175, 1900, doi: 10.1080/14786440009463897.
[43] M. S. Nikulin, "Chi-squared test for normality," in Proc. Int. Vilnius Conf. Probab. Theory Math. Statist., vol. 2, 1973, pp. 119-122.
[44] B. Jacob, J. Chen, Y. Huang, and I. Cohen, "Pearson correlation coefficient," in Noise Reduction in Speech Processing. Berlin, Germany: Springer-Verlag, 2009, pp. 1-4.
[45] E. Loper and S. Bird, "NLTK: The natural language toolkit," in Proc. ACL Workshop Effective Tools Methodologies Teach. Natural Lang. Process. Comput. Linguistics, 2002, pp. 1-8.
[46] G. Farnadi, G. Sitaraman, S. Sushmita, F. Celli, M. Kosinski, D. Stillwell, S. Davalos, M.-F. Moens, and M. De Cock, "Computational personality recognition in social media," User Model. User-Adapted Interact., vol. 26, nos. 2-3, pp. 109-142, 2016.
[47] S. Bandyopadhyay, S. Mallik, and A. Mukhopadhyay, "A survey and comparative study of statistical tests for identifying differential expression from microarray data," IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 11, no. 1, pp. 95-115, Jan. 2014, doi: 10.1109/tcbb.2013.147.
[48] S. Mallik and Z. Zhao, "ConGEMs: Condensed gene co-expression module discovery through rule-based clustering and its application to carcinogenesis," Genes, vol. 9, no. 1, p. 7, Dec. 2017, doi: 10.3390/genes9010007.
[49] S. Mallik, T. Bhadra, and U. Maulik, "Identifying epigenetic biomarkers using maximal relevance and minimal redundancy based feature selection for multi-omics data," IEEE Trans. Nanobiosci., vol. 16, no. 1, pp. 3-10, Jan. 2017, doi: 10.1109/tnb.2017.2650217.
[50] T. Bhadra, S. Mallik, and S. Bandyopadhyay, "Identification of multiview gene modules using mutual information-based hypograph mining," IEEE Trans. Syst., Man, Cybern., Syst., vol. 49, no. 6, pp. 1119-1130, Jun. 2019, doi: 10.1109/tsmc.2017.2726553.
[51] X. Xu, H. Gu, Y. Wang, J. Wang, and P. Qin, "Autoencoder based feature selection method for classification of anticancer drug response," Frontiers Genet., vol. 10, p. 233, Jan. 2019, doi: 10.3389/fgene.2019.00233.
[52] S. Mallik and Z. Zhao, "Graph- and rule-based learning algorithms: A comprehensive review of their applications for cancer type classification and prognosis using genomic data," Briefings Bioinf., to be published, doi: 10.1093/bib/bby120.
[53] S. Mallik and Z. Zhao, "Towards integrated oncogenic marker recognition through mutual information-based statistically significant feature extraction: An association rule mining based study on cancer expression and methylation profiles," Quant. Biol., vol. 5, no. 4, pp. 302-327, Dec. 2017, doi: 10.1007/s40484-017-0119-0.
[54] (2019). [Link]. Accessed: Aug. 31, 2019. [Online]. Available: [Link]
[55] (2019). [Link]. Accessed: Aug. 31, 2019. [Online]. Available: [Link] 10/Brokerage-2C-Boundary-Spanning-2C-and-Leadership-in-Open-[Link].
[56] S. Mallik and U. Maulik, "MiRNA-TF-gene network analysis through ranking of biomolecules for multi-informative uterine leiomyoma dataset," J. Biomed. Informat., vol. 57, pp. 308-319, Oct. 2015, doi: 10.1016/[Link].2015.08.014.
[57] Y. Masoudi-Sobhanzadeh, H. Motieghader, and A. Masoudi-Nejad, "FeatureSelect: A software for feature selection based on machine learning approaches," BMC Bioinf., vol. 20, no. 1, p. 170, 2019, doi: 10.1186/s12859-019-2754-0.


[58] M. Mafarja, I. Aljarah, H. Faris, A. I. Hammouri, A. M. Al-Zoubi, and S. Mirjalili, “Binary grasshopper optimisation algorithm approaches for feature selection problems,” Expert Syst. Appl., vol. 117, pp. 267–286, Mar. 2019, doi: 10.1016/[Link].2018.09.015.
[59] V. Chahkandi, M. Yaghoobi, and G. Veisi, “Feature selection with chaotic hybrid artificial bee colony algorithm based on fuzzy (CHABCF),” J. Soft Comput. Appl., vol. 2013, pp. 1–8, Jun. 2013, doi: 10.5899/2013/jsca-00014.
[60] S. Arora and P. Anand, “Binary butterfly optimization approaches for feature selection,” Expert Syst. Appl., vol. 116, pp. 147–160, Feb. 2019, doi: 10.1016/[Link].2018.08.051.
[61] G. Farnadi, S. Zoghbi, M. Moens, and M. De Cock, “Recognising personality traits using Facebook status updates,” in Proc. WCPR, 2013, pp. 14–18.
[62] K.-J. Kim and S.-B. Cho, “Ensemble classifiers based on correlation analysis for DNA microarray classification,” Neurocomputing, vol. 70, nos. 1–3, pp. 187–199, Dec. 2006.
[63] B. Auffarth, M. Lopez-Sanchez, and J. Cerquides, “Comparison of redundancy and relevance measures for feature selection in tissue classification of CT images,” in Advances in Data Mining: Applications and Theoretical Aspects, P. Perner, Ed. Berlin, Germany: Springer, 2010, pp. 248–262.
[64] M. M. Mukaka, “Statistics corner: A guide to appropriate use of correlation coefficient in medical research,” Malawi Med. J., vol. 24, no. 3, pp. 69–71, Sep. 2012.
[65] W. Duch, P. Matykiewicz, and J. Pestian, “Neurolinguistic approach to natural language processing with applications to medical text analysis,” Neural Netw., vol. 21, no. 10, pp. 1500–1510, Dec. 2008.
[66] I. Solti, C. R. Cooke, F. Xia, and M. M. Wurfel, “Automated classification of radiology reports for acute lung injury: Comparison of keyword and machine learning based natural language processing approaches,” in Proc. IEEE Int. Conf. Bioinformatics Biomed. Workshop, Washington, DC, USA, Nov. 2009, pp. 1–4.
[67] L. Antiqueira, M. Nunes, O. Oliveira, Jr., and L. D. F. Costa, “Strong correlations between text quality and complex networks features,” Phys. A, Stat. Mech. Appl., vol. 373, pp. 811–820, Jan. 2007.
[68] M. Chong, L. Specia, and R. Mitkov, “Using natural language processing for automatic detection of plagiarism,” in Proc. 4th Int. Plagiarism Conf., Tyne, U.K.: Northumbria Univ., 2010, pp. 1–12.

Ahmed Al Marouf received the bachelor’s degree from the Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Gazipur, Bangladesh, in 2014, and the [Link]. degree in CSE from IUT in 2019. He was a Graduate Researcher with the Systems and Software Lab (SSL), CSE Department, IUT. He is currently a Lecturer with the Department of Computer Science and Engineering (CSE), Daffodil International University (DIU), Dhaka, Bangladesh, where he is also the Technical Lead of the Human Computer Interaction (HCI) Research Lab. His research interests lie within computational social science, data science, and machine learning.

Md. Kamrul Hasan received the [Link]. degree in computer science and information technology (CIT) from the Islamic University of Technology (IUT), Gazipur, Bangladesh, and the Ph.D. degree from Kyung Hee University, Seoul, South Korea. He has long experience in software as a developer and a consultant. He is currently a Professor with the Department of Computer Science and Engineering (CSE), IUT, where he has been serving for ten years and is also the Founding Director of the Systems and Software Lab (SSL). His current research interests are in intelligent systems and AI, software engineering, cloud computing, data mining applications, and social networking.

Hasan Mahmud received the bachelor’s degree in computer science and information technology (CIT) from the Islamic University of Technology (IUT), Gazipur, Bangladesh, in 2004, and the [Link]. degree in computer science from the University of Trento (UniTN), Trento, Italy, in 2009. He is currently pursuing the Ph.D. degree in computer science and engineering (CSE) with IUT, under the guidance of Dr. M. A. Mottalib and Dr. K. Hasan. He joined the CSE Department, Stamford University Bangladesh, Dhaka, Bangladesh, as a Faculty Member. Since 2009, he has been an Assistant Professor with the Department of CSE, IUT, where he is also the Co-Founder of the Systems and Software Lab (SSL). He has published research articles in several international journals and conferences. His research interest focuses on human–computer interaction, gesture-based interaction, and machine learning.
Mr. Mahmud received the University Guild Grant Scholarship for two years (2007–2009) for his master’s study and the Early Degree Scholarship.

