Personality Prediction Using Social Media

Abstract— With the rapid growth of social media, users are getting involved in virtual socializing, generating a huge volume of textual and image content. Content such as status updates/tweets, shared posts/retweets, and likes of others' posts reflects the online behavior of the users. Predicting the personality of a user from these digital footprints has become a computationally challenging problem. In a profile-based approach, the user-generated textual content could be useful for reflecting personality in social media. A large number of features of different categories, such as traditional linguistic features (character-level, word-level, structural, and so on), psycholinguistic features (emotional affects, perceptions, social relationships, and so on), or social network features (network size, betweenness, and so on), could be useful to predict personality traits from social media. According to a widely popular personality model, namely, the big-five-factor model (BFFM), the five factors are openness-to-experience, conscientiousness, extraversion, agreeableness, and neuroticism. Predicting personality is redefined as predicting each of these traits separately from the extracted features. Traditionally, it takes a huge number of features to get better accuracy on any prediction task, although applying feature selection algorithms may improve the performance of the model. In this article, we have compared the performance of five feature selection algorithms, namely, the Pearson correlation coefficient (PCC), correlation-based feature subset (CFS), information gain (IG), symmetric uncertainty (SU) evaluator, and chi-squared (CHI) method. The performance is evaluated using the classic metrics, namely, precision, recall, f-measure, and accuracy.

Index Terms— Chi-squared (CHI) method, computational personality prediction, feature selection algorithms, information gain (IG), Pearson correlation coefficient (PCC), social media.

I. INTRODUCTION

… other entities as friends, connections, or followers. While using these SNSs, users are facilitated by many activities, such as posting statuses/tweets, sharing others' posts/retweets, liking others' posts, commenting on others' posts, chatting directly with friends, and playing online games with friends. It is evident that the online behavior of users could be depicted from the activities they perform [1]. Understanding users' behavior may help to identify personality traits.

Predicting users' personalities from the digital footprints of social media is a challenging task, as the context of identifying personality traits in social media is not trivial. Users behave differently in social media and in real life. Nevertheless, user-generated content, such as status updates in social media, may provide enough evidential reflection of personality, as an SNS user posts statuses based on his/her current situation, a recent political or popular event, hyped topics, and so on. For example, during an election in his/her country, he/she may post positive or negative reviews/opinions about a political party. These types of statuses may have contextual trends, as other friends of the user may also be involved in posting similar statuses. Considering the trend, a user may post his/her political views. Users create trends as well as follow different trends to become popular or socially accepted by their friends in social media. Moreover, each user has different perceptions and different interest categories that trigger status updates. For defining personality, we have followed the widely used big-five-factor model (BFFM). According to BFFM, there are four positive personality traits, namely, openness-to-experience (O), conscientiousness (C), extraversion (E), and agreeableness (A), and the only negative trait, neuroticism (N).
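The per-trait formulation above (one independent yes/no decision per BFFM trait over a shared feature vector) can be sketched as follows. This is only an illustration, not the authors' code; `ThresholdClassifier` is a toy stand-in for whatever per-trait model is actually trained.

```python
# Illustrative sketch: BFFM prediction decomposed into five independent
# binary (yes/no) decisions over one shared feature vector per user.
TRAITS = ["OPN", "CON", "EXT", "AGR", "NEU"]

class ThresholdClassifier:
    """Toy stand-in for a trained per-trait classifier."""
    def __init__(self, index, threshold):
        self.index = index          # which feature this toy model inspects
        self.threshold = threshold  # decision boundary on that feature

    def predict(self, features):
        return "yes" if features[self.index] > self.threshold else "no"

def predict_traits(features, classifiers):
    """Return one independent yes/no label per big-five trait."""
    return {trait: classifiers[trait].predict(features) for trait in TRAITS}
```

Any classifier with a `predict` interface could be substituted per trait, which is exactly why the problem decomposes into five separate prediction tasks.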
Authorized licensed use limited to: University of Canberra. Downloaded on April 29,2020 at [Link] UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
over the prediction. One of the main contributions of this article is to identify the best feature selection algorithm for extracting the most prominent features among the traditional linguistic, psycholinguistic, and social network (SN) features. As the number of features is relatively high and the accuracy is low when applying all the features, we have investigated several cases to find the important features, and categories of features, for predicting personality precisely.

In this article, we compare the existing feature selection algorithms for predicting personality from Facebook status updates. In summary, we have the following contributions.

1) Applying different feature selection algorithms, namely, the chi-squared (CHI) method, the Pearson correlation coefficient, information gain (IG), correlation-based feature subset (CFS) evaluation, and symmetrical uncertainty attribute evaluation, to predict the big-five-personality traits.

2) We have extracted over 150 features to analyze the predictive system over different types of features, such as traditional linguistic, psycholinguistic, and SN features. In the literature, many researchers have used few features to predict personality, but the overall outcome of those approaches is not quite satisfactory. Pooling a huge volume of features has given an evidentially better understanding of personality traits.

3) We have considered several scenarios/cases of feature combinations based on psycholinguistic features to find the best subset of features for predicting each personality trait separately. Hence, we have determined the accuracy with and without SN features, which is reported in the experiments.

4) Five different classifiers, namely, the naïve Bayes (NB), decision tree (DT), random forest (RF), simple logistic regression (SLR), and support vector machine (SVM), were used to determine the evaluation metrics and find the best feature selection algorithm. Utilizing these classifiers, we derived several conclusions.

II. RELATED WORKS

In this section, we discuss state-of-the-art works regarding predicting personality traits and applying feature selection algorithms. The section is divided into several parts: literature on computational personality prediction, literature on psycholinguistic tools, literature on existing methods and features devised by researchers, and feature selection methods applied in similar research problems.

For predicting personality traits computationally, researchers have utilized machine learning techniques, such as supervised/unsupervised learning models and classification algorithms, to classify the traits. International personality item pools (IPIPs) [5] are items or questions answered to devise a scoring mechanism for trait identification. These items are presented depending on the behavior of the test-taker on different issues of practical life. Using the IPIP questionnaire, the quantitative method has been adopted for the problem, and many variations of the question sets were used for developing a better ground-truth data set. This manual procedure of taking answers to a set of questionnaires could be easily adopted. However, the main limitation of this process is that the test-takers need to answer the questions honestly. The different IPIP sets are discussed elaborately in Section III.

The correlation between the usage of Facebook, thus social media, and personality has been studied in [6] and [7]. In [6], the study shows that the correlation is higher for the neuroticism and extraversion traits but average for the other traits. Different literature works establish relationships between personality and social media use, such as the personality of popular social media users [8], the influence of personality on Facebook usage and wall posting [9], mining social interactions in Facebook [10], capturing personality from photographs or photograph-related posts in social media [11], and so on.

Howlader et al. [12] proposed a topic modeling-based approach applied to Facebook status updates. For this work, they used latent Dirichlet allocation (LDA) and term frequency-inverse document frequency (TF-IDF) as features and applied flexible regression models for prediction. Deep learning-based methods were introduced by Tandera et al. [13]. They applied traditional deep learning algorithms, such as the multilayer perceptron (MLP), long short-term memory (LSTM), gated recurrent unit (GRU), and 1-D convolutional neural network (CNN-1-D). A huge feature set (725 features) has been analyzed in [14] considering basic linguistic features, POS-tagger parameters, AFINN (lexicon list) parameters, and H4Lvd parameters. A review of emerging trends in personality prediction from online social media is performed by Kaushal and Patwardhan [15]. They listed different categories of features, such as linguistic features (LIWC features, POS tags, speech acts, and sentiment features), nonlinguistic features (structural, behavior, and temporal features), and SN features. The methodologies have been modified based on the features used for identifying personality traits. Farnandi et al. [46] have proposed methods for predicting personality from social media considering cross-platform and cross-domain situations. Considering personality prediction as a multilabel prediction task, they have extracted the LIWC, MRC, NRC, and SPLICE features to run several types of regression models.

For extracting relevant psychological features from texts, psycholinguistic tools are utilized. These software tools are developed for easier experimentation. LIWC [16], MRC [17], and SPLICE [18] are widely used psycholinguistic tools. Developed by Pennebaker and Francis, LIWC is a word list-based text analysis tool that extracts 93 features consisting of standard counts (word counts, words longer than six letters, and so on), personal concerns (occupation, financial issues, health, and so on), psychological processes (cognitive, emotional, perceptional, and social processes), and other features (punctuation counts, swear words, and so on) [16]. On the other hand, MRC [17] features are computed using the Medical Research Council's psycholinguistic database, which consists of over 150 000 words with the linguistic and psycholinguistic features of each word. MRC includes very interesting latent features of text, such as the Kucera–Francis written frequency [19] and the Brown verbal frequency [20]. Structured Programming for Linguistic Cue Extraction (SPLICE) extracts 74 linguistic features. Upon the
input of textual data, SPLICE [18] outputs various features, including quantities (number of characters, sentences, words, and so on), parts-of-speech features (number of nouns, noun ratio, verb ratio, adjective ratio, and so on), immediacy (number of passive verbs and passive verb ratio), pronouns, positive self-evaluation, negative self-evaluation, influence, deference, Whissell features (imagery, pleasantness, and activation), text complexity, spoken word features, tense, SentiWordNet features, and readability scores. Among these three widely used closed-vocabulary psycholinguistic tools, we have used LIWC for our work. LIWC is backed by a psycholinguistic dictionary that contains a huge number of words, synonyms, and antonyms in different psychological categories. LIWC has proven useful in the context of personality trait prediction.

Though features play a vital role in a data-driven system, feature selection methods are also significant in finding the most prominent features from huge feature vectors. Beyond data mining specifically, feature selection has become an important tool in bioinformatics and computational biology. Xu et al. [51] proposed an autoencoder-based feature selection method for the classification of anticancer drug response. Similarly, Mallik and Zhao [52] presented a graph- and rule-based learning algorithm for cancer-type classification using feature selection. Mallik and Zhao [53] have applied a statistically significant, mutual-information-based feature extraction study on cancer expression using integrated marker recognition.

Apart from filter-based feature selection algorithms, there are wrapper-based and hybrid feature selection algorithms. Masoudi-Sobhanzadeh et al. [57] have presented "FeatureSelect," a software package for selecting features based on machine learning approaches, tested on gene selection methods. Several nature-inspired evolutionary algorithm-based feature selection algorithms have been presented recently. Mafarja et al. [58] presented a binary grasshopper optimization algorithm-based feature selection. Similarly, chaotic hybrid artificial bee colony-based feature selection [59] and binary butterfly optimization-based feature selection [60] were recently introduced in the literature.

In the process of supervised learning, one of the most significant roles is played by the feature selection criteria. Selecting the most relevant features from a huge feature vector has a vital impact on the accuracy of the system. For comparison, in this article, we have utilized the five most conventional feature selection algorithms, namely, IG, the CFS-based subset evaluator (CFS), the CHI method, symmetrical uncertainty attribute evaluation (SU), and the Pearson correlation coefficient (PCC). These feature selection algorithms are discussed in Section IV, including their definitions and formulas.

Therefore, in this article, we have presented an experimental comparison between the feature selection algorithms, and for the experiments, we have extracted more than 150 features. The rest of this article is organized as follows. Section III discusses the computational personality-prediction problem, and Section IV includes the state-of-the-art application areas of the feature selection algorithms. Section V illustrates the proposed experimental method, and the experimental results are enlisted in Section VI. A detailed comparative analysis is depicted in Section VII, and finally, Section VIII concludes with the contributions highlighted.

III. COMPUTATIONAL PERSONALITY-PREDICTION PROBLEM

The computational personality-prediction problem in the context of social media could be defined as "predicting the personality traits from user profile information using computational features rather than asking a set of questionnaires." Usually, for understanding their own personality, people try to take online or off-line personality tests. Traditional personality-prediction systems depend on a set of questionnaires to be answered honestly by the test-taker. Questionnaire-based personality-prediction systems are also popular among test-takers. The widely used personality tests are the big-five-personality test [20], the Myers–Briggs type indicator (MBTI) [21], and the dominance, influence, steadiness, and conscientiousness (DISC) assessment [22]. Among these tests, the big-five-personality test has been widely accepted among test-takers because of the similarity they find between themselves and the result of the test.

Many online personality testing sites, such as 16Personality¹, 123test², Personality Perfect³, PsychCentral Personality Test⁴, Open Source Psychometrics Project⁵, See My Personality⁶, and Discover My Profile⁷ by the University of Cambridge, are very popular for identifying precise personality, as reviewed by the test-takers. The reviews from each of the websites were analyzed, and positive comments were found to be delivered by the reviewers. The literature provides evidential proof that computational personality prediction provides better results than manual paper-based methods. Therefore, the acceptability of these online personality tools is much higher than that of manual questionnaire-based personality testing. Hence, this encourages applying automated personality prediction to social media. It is evident that computational personality judgments are more accurate than those made by humans [32].

¹ [Link]
² [Link]
³ [Link]
⁴ [Link]
⁵ [Link]
⁶ [Link]
⁷ [Link]

The history of personality prediction goes a long way, as researchers have tried to optimize the number of questions being asked of the test-taker. Usually, a high volume of questions is asked, and the answers are analyzed to predict personality precisely. However, answering these questions could be time-consuming as well as tiring for the test-takers. Therefore, asking a minimum number of questions to get a better prediction could be a challenging task. Researchers have come up with various numbers of questions or items. The NEO five-factor inventory (NEO-FFI) [24] is a 60-item personality measure. Similar models were proposed by researchers in the psychology area for the personality-prediction task. Depending on scores determined by the IPIP, the computation of
personality traits is performed. Depending on the number of IPIP items considered for prediction, there are several models proposed by many researchers. The 50-item IPIP five-factor model (FFM) proposed by Goldberg [25], the 44-item big-five inventory (BFI) proposed by John and Srivastava [26], the 40-item Big-Five Mini-Markers proposed by Saucier [27], the 20-item Mini-IPIP proposed by Donnellan et al. [28], and the ten-item personality inventory (TIPI) proposed by Gosling et al. [29] are the existing models in the literature. Short forms of item sets have also proven effective in some cases [30]. Although there are many scoring systems adopted for this particular problem, each of them has its own advantages. The myPersonality data set [4], [31] was collected from Facebook users and used the 100-item IPIP questionnaire set.

Although the above-mentioned works use psycholinguistic tools to extract psycholinguistic features, or different types of methods for devising a model, there is no article in the literature highlighting the best features for predicting personality from social media data. In this article, we have focused on this problem and designed experiments to find a solution.

IV. FEATURE SELECTION ALGORITHMS OVERVIEW

In this section, we have outlined the feature selection algorithms that we have applied for computational personality prediction. The five different algorithms are applied to output the features that are most relevant to the prediction task. All these feature selection methods provide a ranking generated based on the relevance between the feature and the class.

The CFS subset evaluator [33], [34] is a feature selection algorithm that finds a subset of features via the individual predictive ability of each feature along with the degree of redundancy between features. CFS ranks feature subsets according to a correlation-based heuristic evaluation function, given in (1), where M_S is the heuristic merit of the feature subset S containing k features, \bar{r}_{cf} is the mean feature–class correlation (f ∈ S), and \bar{r}_{ff} is the average feature–feature intercorrelation:

    M_S = \frac{k\,\bar{r}_{cf}}{\sqrt{k + k(k-1)\,\bar{r}_{ff}}}.    (1)

The CFS subset evaluator has been used in different research contexts, such as predicting students' performance [35] and selecting features for sentiment classification [36].

IG is one of the widely used feature selection methods in different research problems, including text categorization [37], [38]. Various research fields, such as computer vision and text classification, have utilized the inner mechanism of IG [38], [39]. IG outputs a value calculated by (2), where vals(a) denotes the set of all possible values of feature a ∈ Attr, Attr is the set of all features, H is the entropy, and x_a denotes the value of feature a for example x ∈ T. The feature with the largest IG yields the smallest conditional entropy:

    IG(T, a) = H(T) - \sum_{v \in vals(a)} \frac{|\{x \in T \mid x_a = v\}|}{|T|} \, H(\{x \in T \mid x_a = v\}).    (2)

In the context of statistics, the uncertainty coefficient or entropy coefficient is a measure of nominal association. The symmetrical uncertainty (SU) [39] attribute evaluator is one kind of correlation finder that evaluates the importance of a feature by measuring the SU with respect to the class. This feature selection process is used not only for imagery data, such as hyperspectral images [40], but also with nature-inspired optimization algorithms, such as ant colony optimization [41]. The SU is determined using (3), where H(C|F) is the conditional entropy of the class C given the feature F, and H(C) is the entropy of the class C. The algorithm outputs a ranking of the most relevant features:

    SU(C, F) = \frac{2\,(H(C) - H(C \mid F))}{H(C) + H(F)}    (3)

    H(C) = -\sum_{x} P_C(x) \log P_C(x)    (4)

    H(C \mid F) = -\sum_{x,y} P_{C,F}(x, y) \log P_{C \mid F}(x \mid y).    (5)

The CHI test (χ²) [42], [43] is used in statistics for determining the association between variables or features. The CHI value is determined from the difference between the expected frequencies (e) and the observed frequencies (n) of one or more features in the feature set. Depending on the value of this parameter, we can decide the number of features to be selected for a system. The CHI value is calculated as

    \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}    (6)

where r and c are the numbers of rows and columns of the feature table.

PCC [44] is considered one of the most efficient and widely accepted feature selection algorithms. In PCC, the covariance between the class and the feature is determined, and the standard deviations (SDs) of the class and the feature are calculated to find the coefficient value ρ. The coefficient can be used as an efficient parameter to determine the feature sets. ρ is computed using (7), where cov(X, Y) is the covariance between X and Y, X or Y is the class value, and σ is the SD:

    \rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}.    (7)

The correlation value lies between -1 and +1, where +1 is total positive correlation, 0 is no linear correlation, and -1 is total negative correlation. The coefficient is invariant under separate changes of scale and location in the two variables, which could be considered a key mathematical property of PCC. Depending on the above-mentioned parameters, the number of features to be selected for the problem can be determined. For comparing the feature selection criteria, we have experimented with various scenarios for computational personality trait prediction.
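As a concrete reading of the symmetrical uncertainty definition in (3)–(5), the following is a minimal sketch for a discrete feature against a class label (not the authors' implementation; toolkits such as WEKA compute this internally). Base-2 logarithms are assumed here.

```python
import math
from collections import Counter

def entropy(values):
    """Empirical entropy H(X) = -sum_x p(x) * log2 p(x)."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(cls, feat):
    """H(C|F) computed by grouping: sum_f p(f) * H(C | F = f)."""
    n = len(cls)
    groups = {}
    for c, f in zip(cls, feat):
        groups.setdefault(f, []).append(c)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def symmetrical_uncertainty(cls, feat):
    """SU(C, F) = 2 * (H(C) - H(C|F)) / (H(C) + H(F))."""
    hc, hf = entropy(cls), entropy(feat)
    if hc + hf == 0.0:
        return 0.0
    return 2.0 * (hc - conditional_entropy(cls, feat)) / (hc + hf)
```

A perfectly informative feature scores 1, and an irrelevant one scores 0, which is why SU can be used directly to rank features against each trait class.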
TABLE I
CLASS DISTRIBUTION OF MYPERSONALITY DATA SET

TABLE II
TRADITIONAL LINGUISTIC FEATURES

V. EXPERIMENTAL METHOD

For the experimental analysis, we have designed a common method for testing the performance of each of the feature selection algorithms. The proposed experimental method consists of data acquisition, data preprocessing, feature extraction, feature selection, and classification, as depicted in Fig. 1. The proposed method is applied for each of the five personality traits. The rest of this section elaborately discusses the steps of the experimental method.

A. Data Acquisition

For our experiment, we have used the myPersonality data set [4], [31], which consists of status updates, SN features, and ground-truth personality trait scores as well as classes. The traits used in the data set are formalized in BFFM. For each of the five personality traits, openness to experience (OPN), conscientiousness (CON), extraversion (EXT), agreeableness (AGR), and neuroticism (NEU), the personality score and the class value (yes or no) are given in the data set.

The data set contains 250 users and around 10 000 status updates, and it is considered a ground-truth data set for personality prediction. The class distribution of the myPersonality data set is demonstrated in Table I.

B. Data Preprocessing

All the statuses of the data set are in English and go through every step of preprocessing. The preprocessing step consists of the removal of URLs, names, symbols, and unnecessary spaces, and stemming. These operations are performed using the NLTK [45] library.

C. Feature Extraction

In this step, the extracted features fall into two categories: linguistic features and SN features. We have extracted both traditional linguistic features and psycholinguistic features.

1) Traditional Linguistic Features: The traditional linguistic features are textual features that could be divided into four types: character-based, word-based, structural, and function words. The list of traditional features considered for our study is shown in Table II. For extracting the linguistic features, we have applied LIWC [16] to the preprocessed textual data. LIWC gives a total of 93 features covering psycholinguistic and traditional linguistic categories. All the features are integer or fractional values, meaning the percentages of words in specific categories.

2) Psycholinguistic Features: Among the 93 features, only 28 could be considered psycholinguistic features, divided into five categories, namely, emotional affect, cognitive process, self-focus, social relationships, and perceptions.
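The preprocessing step described above can be sketched roughly as follows. This is a minimal stdlib approximation, not the authors' pipeline: the paper uses NLTK, so a real implementation would call, e.g., `nltk.stem.PorterStemmer`, and the naive `-ing` rule below is only a placeholder for stemming.

```python
import re

def preprocess_status(text):
    """Toy cleanup: strip URLs, @-style names, symbols, and extra spaces,
    then apply a naive stemming stand-in (real pipeline: NLTK's PorterStemmer)."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)                   # remove @-style names
    text = re.sub(r"[^A-Za-z\s]", " ", text)            # remove symbols/digits
    text = re.sub(r"\s+", " ", text).strip().lower()    # collapse spaces
    return " ".join(w[:-3] if w.endswith("ing") and len(w) > 5 else w
                    for w in text.split())
```

For example, `preprocess_status("Loving this!! http://t.co/abc @john")` reduces the status to a clean token stream before LIWC is applied.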
TABLE III
PSYCHOLINGUISTIC FEATURES EXTRACTED USING LIWC
TABLE IV
DIFFERENT CATEGORIES OF BROKER NODE

TABLE V
PERFORMANCE METRICS APPLYING CLASSIFIERS FOR PREDICTING EXTRAVERSION USING LIWC FEATURES

Let A be the source node that gives information to B, the broker node, who gives the information to C (the destination node).

N-brokerage is the normalized parameter of brokerage, that is, the number of broker nodes divided by the number of pairs [55]:

    n-brokerage = \frac{\text{number of broker nodes}}{\text{number of pairs}}.    (12)

Transitivity is a measurement that could be defined by the friend-of-friend (FOF) concept of social media, such as Facebook. The idea of FOF is "when a friend of my friend is my friend."

In the context of network or graph theory, transitivity is measured based on the relative number of triangles, or triads, present in the graph compared to the total number of connected triples of nodes. The idea of transitivity is depicted in Fig. 3, and the transitivity T(G) is calculated by (13). As shown in Fig. 3, A is a friend of B, B is a friend of C, and A is also a friend of C; the relationships between them build a triad:

Fig. 3. Idea of transitivity.

    T(G) = \frac{3 \times \text{no. of triangles in } G}{\text{no. of connected triples of vertices in } G}.    (13)

D. Feature Selection

Feature selection algorithms are used to find the essential or important features from a feature vector. The experiment starts from inputting the labeled data; data preprocessing is performed, followed by identifying the features. In the feature extraction step, we have collected the prominent features, each feature vector containing 93 features:

    F = \{F_1, F_2, F_3, \ldots, F_{93}\}.    (14)

These feature vectors are used to find the optimal number of essential features using the feature selection methods. The five feature selection methods mentioned above are adopted, and the selected features are fed to the classifiers. The performance metrics are determined for each classifier to evaluate the experimental method. Finally, the most accurate feature selection algorithm is identified.

E. Classification Methods

In this article, we have applied classic classification methods to evaluate the performance of the proposed experimental method. NB, RF, DT, SLR, and SVM are implemented in the experiment process. The state-of-the-art classifiers are considered for the experiment, not adaptive versions.

VI. EXPERIMENTAL RESULTS

In this section, we have evaluated the system using the evaluation metrics precision, recall, f-measure, and accuracy. We have divided the research contributions into three different experiments.

1) Experiment 1 (Using All LIWC Features for Predicting Five Personality Traits): In this experiment, we have focused on feeding the 93 LIWC features into the prediction model. Table V shows the metric values along with the SD applying different classifiers to the LIWC features for the extraversion trait only. The highest accuracy (61.07%) is shown by the SVM classifier for the extraversion trait. The same metrics are reported for the other four personality traits in Table VI.
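The transitivity measure in (13) can be checked with a small sketch (an assumed adjacency-set graph representation, not the authors' code). Counting the closed triples centered at each node counts every triangle three times, which is exactly the factor of 3 in (13), so the ratio of closed to connected triples can be returned directly.

```python
from itertools import combinations

def transitivity(adj):
    """Graph transitivity: closed triples over connected triples.
    adj maps each node to the set of its neighbors (undirected graph)."""
    closed, connected = 0, 0
    for node, nbrs in adj.items():
        for a, b in combinations(sorted(nbrs), 2):
            connected += 1              # a-node-b is a connected triple
            if b in adj.get(a, set()):
                closed += 1             # the triple closes into a triangle
    return closed / connected if connected else 0.0
```

A triangle (as in Fig. 3) yields transitivity 1.0, while a simple path A–B–C, where A and C are not friends, yields 0.0.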
TABLE VIII
ACCURACY MEASUREMENTS APPLYING CLASSIFIERS FOR PREDICTING BIG-FIVE-PERSONALITY TRAITS
3) The influence of the punctuation features is at a decent level: dash, comma, parenth, quotes, colon, period, and allpunc are present in the universal (U) set.
4) The important function words (personal pronoun, interrogative, adverb, and conjunction) are present in the U set and have a good correlation with the relative classes.
5) The openness-to-experience trait has shown divergent results, and its selected features' set does not contain any SN features.
6) PCC outperforms the other existing feature selection algorithms for predicting personality from social media using linguistic and SN features.

C. Comparison With Literature Methods

However, very few works in the literature utilize the traditional linguistic, psycholinguistic, and SN features altogether for predicting personality from social media. Table XI compares the literature methods and their features with our approach. The proposed feature selection approach in this article has shown better accuracy using the features selected through the PCC algorithm.

REFERENCES

[1] M. M. Hasan, N. H. Shaon, A. A. Marouf, M. K. Hasan, H. Mahmud, and M. M. Khan, "Friend recommendation framework for social networking sites using user's online behavior," in Proc. 18th Int. Conf. Comput. Inf. Technol. (ICCIT), Dec. 2015, pp. 539–543.
[2] M. S. H. Mukta, M. E. Ali, and J. Mahmud, "User generated vs. supported contents: Which one can better predict basic human values?" in Proc. Int. Conf. Social Inform. Cham, Switzerland: Springer, 2016, pp. 454–470.
[3] C. P. Williams. (Feb. 23, 2013). Language, Identity, Culture, and Diversity. [Online]. Available: [Link] policy/edcentral/multilingualismmatters/
[4] M. Kosinski, S. C. Matz, S. D. Gosling, V. Popov, and D. Stillwell, "Facebook as a research tool for the social sciences: Opportunities, challenges, ethical considerations, and practical guidelines," Amer. Psycholog., vol. 70, no. 6, pp. 543–556, Sep. 2015.
[5] International Personality Item Pool. Accessed: Jan. 7, 2020. [Online]. Available: [Link]
[6] Y. Bachrach, M. Kosinski, T. Graepel, P. Kohli, and D. Stillwell, "Personality and patterns of Facebook usage," in Proc. 4th Annu. ACM Web Sci. Conf. (WebSci), Evanston, IL, USA, Jun. 2012, pp. 24–32.
[7] J. Golbeck, C. Robles, and K. Turner, "Predicting personality with social media," in Proc. Extended Abstracts Hum. Factors Comput. Syst., Vancouver, BC, Canada, May 2011, pp. 253–262.
[8] D. Quercia, R. Lambiotte, D. Stillwell, M. Kosinski, and J. Crowcroft, "The personality of popular Facebook users," in Proc. CSCW, Seattle, WA, USA, Feb. 2012, pp. 955–964.
[9] K. Moore and J. C. McElroy, "The influence of personality on Facebook usage, wall postings, and regret," Comput. Hum. Behav., vol. 28, no. 1, pp. 267–274, Jan. 2012.
[10] A. Ortigosa, R. M. Carro, and J. I. Quiroga, "Predicting user personality by mining social interactions in Facebook," J. Comput. Syst. Sci., vol. 80, no. 1, pp. 57–71, Feb. 2014.
[11] A. Eftekhar, C. Fullwood, and N. Morris, "Capturing personality from Facebook photos and photo-related activities: How much exposure do you need?" Comput. Hum. Behav., vol. 37, pp. 162–170, Aug. 2014.
[12] P. Howlader, K. K. Pal, A. Cuzzocrea, and S. D. M. Kumar, "Predicting Facebook-users' personality based on status and linguistic features via flexible regression analysis techniques," in Proc. 33rd Annu. ACM Symp. Appl. Comput. (SAC), Pau, France, Apr. 2018, pp. 339–345.
[13] T. Tandera, D. Suhartono, R. Wongso, and Y. L. Prasetio, "Personality prediction system from Facebook users," in Proc. 2nd Int. Conf. Comput. Sci. Comput. Intell. (ICCSCI), Bali, Indonesia, Oct. 2017, pp. 604–611.
[14] D. Markovikj, S. Gievska, M. Kosinski, and D. Stillwell, "Mining Facebook data for predictive personality modeling," Comput. Pers. Recognit., AAAI Tech. Rep., Menlo Park, CA, USA, 2013, pp. 23–26.
[15] V. Kaushal and M. Patwardhan, "Emerging trends in personality identification using online social networks—A literature survey," ACM Trans. Knowl. Discov. Data, vol. 12, no. 2, pp. 1–30, Jan. 2018.
[16] J. W. Pennebaker, M. E. Francis, and R. J. Booth. (2001). Linguistic Inquiry and Word Count: LIWC2001. Erlbaum, Mahwah, NJ, USA. [Online]. Available: [Link]
[17] M. Coltheart, "The MRC psycholinguistic database," Quart. J. Exp. Psychol. A, vol. 33, no. 4, pp. 497–505, Nov. 1981.
[18] K. Moffitt, J. Giboney, E. Ehrhardt, J. Burgoon, and J. Nunamaker. (2010). Structured Programming for Linguistic Cue Extraction. [Online]. Available: [Link]
[19] H. Kucera and W. N. Francis, Computational Analysis of Present-Day American English. Providence, RI, USA: Brown Univ. Press, 1967.
[20] G. D. A. Brown, "A frequency count of 190,000 words in the London-Lund Corpus of English conversation," Behav. Res. Methods Instrum. Comput., vol. 16, no. 6, pp. 502–532, 1984.
[21] L. R. Goldberg, "The development of markers for the big-five factor structure," Psychol. Assessment, vol. 4, no. 1, pp. 26–42, 1992, doi: 10.1037/1040-3590.4.1.26.
[22] I. B. Myers, M. H. McCaulley, N. L. Quenk, and A. L. Hammer, MBTI Manual: A Guide to the Development and Use of the Myers–Briggs Type Indicator, vol. 3, 3rd ed. Palo Alto, CA, USA: Consulting Psychologists Press, 1998.
[23] W. Marston, Emotions of Normal People. New York, NY, USA: Taylor & Francis, 1999.
[24] P. T. Costa, Jr., and R. R. McCrae, NEO-PI-R Professional Manual. Odessa, FL, USA: Psychological Assessment Resources, 1992.
[25] L. R. Goldberg, "A broad-bandwidth, public-domain, personality inventory measuring the lower-level facets of several five-factor models," in Personality Psychology in Europe, vol. 7, I. Mervielde, I. Deary, F. De Fruyt, and F. Ostendorf, Eds. Tilburg, The Netherlands: Tilburg Univ. Press, 1999, pp. 7–28. [Online]. Available: [Link]
[26] O. P. John and S. Srivastava, "The big five trait taxonomy: History, measurement, and theoretical perspectives," in Handbook of Personality: Theory and Research, L. A. Pervin and O. P. John, Eds., 2nd ed. New York, NY, USA: Guilford Press, 1999, pp. 102–138.
[27] G. Saucier, "Mini-markers: A brief version of Goldberg's unipolar big-five markers," J. Pers. Assessment, vol. 63, no. 3, pp. 506–516, Dec. 1994.
[28] M. B. Donnellan, F. L. Oswald, B. M. Baird, and R. E. Lucas, "The mini-IPIP scales: Tiny-yet-effective measures of the big five factors of personality," Psychol. Assessment, vol. 18, no. 2, pp. 192–203, Jun. 2006.
[29] S. D. Gosling, P. J. Rentfrow, and W. B. Swann, "A very brief measure of the Big-Five personality domains," J. Res. Pers., vol. 37, no. 6, pp. 504–528, 2003.
[30] J. A. Johnson, "Developing a short form of the IPIP-NEO: A report to HGW Consulting," Dept. Psychol., Univ. Pennsylvania, DuBois, PA, USA, Tech. Rep., 2000.
[31] M. Kosinski, D. Stillwell, and T. Graepel, "Private traits and attributes are predictable from digital records of human behavior," Proc. Nat. Acad. Sci. USA, vol. 110, no. 15, pp. 5802–5805, Apr. 2013.
[32] W. Youyou, M. Kosinski, and D. Stillwell, "Computer-based personality judgments are more accurate than those made by humans," Proc. Nat. Acad. Sci. USA, vol. 112, no. 4, pp. 1036–1040, Jan. 2015.
[33] M. A. Hall, "Correlation-based feature subset selection for machine learning," Ph.D. dissertation, Dept. Comput. Sci., Univ. Waikato, Hamilton, New Zealand, Apr. 1999.
[34] M. Hall and L. A. Smith, "Feature subset selection: A correlation-based filter approach," in Proc. 4th Int. Conf. Neural Inf. Process. Intell. Inf. Syst., 1997, pp. 855–858.
[35] M. Doshi and R. K. Chaturvedi, "Correlation based feature selection (CFS) technique to predict student performance," Int. J. Comput. Netw. Commun., vol. 6, no. 3, pp. 197–206, May 2014.
[36] A. Abbasi, S. France, Z. Zhang, and H. Chen, "Selecting attributes for sentiment classification using feature relation networks," IEEE Trans. Knowl. Data Eng., vol. 23, no. 3, pp. 447–462, Mar. 2011.
[37] G. Forman, "An extensive empirical study of feature selection metrics for text classification," J. Mach. Learn. Res., vol. 3, pp. 1289–1305, Mar. 2003.
[38] Z. Gao, Y. Xu, F. Meng, F. Qi, and Z. Lin, "Improved information gain-based feature selection for text categorization," in Proc. 4th Int. Conf. Wireless Commun., Veh. Technol., Inf. Theory Aerosp. Electron. Syst. (VITAE), May 2014, pp. 11–14.
[39] L. Yu and H. Liu, "Feature selection for high-dimensional data: A fast correlation-based filter solution," in Proc. 20th Int. Conf. Mach. Learn., 2003, pp. 856–863.
[40] E. Sarhrouni, A. Hammouch, and D. Aboutajdine, "Application of symmetric uncertainty and mutual information to dimensionality reduction and classification of hyperspectral images," Int. J. Eng. Technol., vol. 4, no. 5, pp. 268–276, 2012.
[41] S. Imranali and W. Shahzad, "A feature subset selection method based on symmetric uncertainty and ant colony optimization," Int. J. Control Automat., vol. 60, no. 11, pp. 5–10, Jul. 2017.
[42] K. Pearson, "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling," Philosophical Magazine, vol. 5, no. 50, pp. 157–175, 1900, doi: 10.1080/14786440009463897.
[43] M. S. Nikulin, "Chi-squared test for normality," in Proc. Int. Vilnius Conf. Probab. Theory Math. Statist., vol. 2, 1973, pp. 119–122.
[44] J. Benesty, J. Chen, Y. Huang, and I. Cohen, "Pearson correlation coefficient," in Noise Reduction in Speech Processing. Berlin, Germany: Springer-Verlag, 2009, pp. 1–4.
[45] E. Loper and S. Bird, "NLTK: The natural language toolkit," in Proc. ACL Workshop Effective Tools Methodologies Teach. Natural Lang. Process. Comput. Linguistics, 2002, pp. 1–8.
[46] G. Farnadi, G. Sitaraman, S. Sushmita, F. Celli, M. Kosinski, D. Stillwell, S. Davalos, M.-F. Moens, and M. De Cock, "Computational personality recognition in social media," User Model. User-Adapted Interact., vol. 26, nos. 2–3, pp. 109–142, 2016.
[47] S. Bandyopadhyay, S. Mallik, and A. Mukhopadhyay, "A survey and comparative study of statistical tests for identifying differential expression from microarray data," IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 11, no. 1, pp. 95–115, Jan. 2014, doi: 10.1109/tcbb.2013.147.
[48] S. Mallik and Z. Zhao, "ConGEMs: Condensed gene co-expression module discovery through rule-based clustering and its application to carcinogenesis," Genes, vol. 9, no. 1, p. 7, Dec. 2017, doi: 10.3390/genes9010007.
[49] S. Mallik, T. Bhadra, and U. Maulik, "Identifying epigenetic biomarkers using maximal relevance and minimal redundancy based feature selection for multi-omics data," IEEE Trans. Nanobiosci., vol. 16, no. 1, pp. 3–10, Jan. 2017, doi: 10.1109/tnb.2017.2650217.
[50] T. Bhadra, S. Mallik, and S. Bandyopadhyay, "Identification of multi-view gene modules using mutual information-based hypograph mining," IEEE Trans. Syst., Man, Cybern., Syst., vol. 49, no. 6, pp. 1119–1130, Jun. 2019, doi: 10.1109/tsmc.2017.2726553.
[51] X. Xu, H. Gu, Y. Wang, J. Wang, and P. Qin, "Autoencoder based feature selection method for classification of anticancer drug response," Frontiers Genet., vol. 10, p. 233, Jan. 2019, doi: 10.3389/fgene.2019.00233.
[52] S. Mallik and Z. Zhao, "Graph- and rule-based learning algorithms: A comprehensive review of their applications for cancer type classification and prognosis using genomic data," Briefings Bioinf., to be published, doi: 10.1093/bib/bby120.
[53] S. Mallik and Z. Zhao, "Towards integrated oncogenic marker recognition through mutual information-based statistically significant feature extraction: An association rule mining based study on cancer expression and methylation profiles," Quant. Biol., vol. 5, no. 4, pp. 302–327, Dec. 2017, doi: 10.1007/s40484-017-0119-0.
[54] (2019). [Link]. Accessed: Aug. 31, 2019. [Online]. Available: [Link]
[55] (2019). [Link]. Accessed: Aug. 31, 2019. [Online]. Available: [Link] 10/Brokerage-2C-Boundary-Spanning-2C-and-Leadership-in-Open-[Link].
[56] S. Mallik and U. Maulik, "MiRNA-TF-gene network analysis through ranking of biomolecules for multi-informative uterine leiomyoma dataset," J. Biomed. Informat., vol. 57, pp. 308–319, Oct. 2015, doi: 10.1016/[Link].2015.08.014.
[57] Y. Masoudi-Sobhanzadeh, H. Motieghader, and A. Masoudi-Nejad, "FeatureSelect: A software for feature selection based on machine learning approaches," BMC Bioinf., vol. 20, no. 1, p. 170, 2019, doi: 10.1186/s12859-019-2754-0.
[58] M. Mafarja, I. Aljarah, H. Faris, A. I. Hammouri, A. M. Al-Zoubi, and S. Mirjalili, "Binary grasshopper optimisation algorithm approaches for feature selection problems," Expert Syst. Appl., vol. 117, pp. 267–286, Mar. 2019, doi: 10.1016/[Link].2018.09.015.
[59] V. Chahkandi, M. Yaghoobi, and G. Veisi, "Feature selection with chaotic hybrid artificial bee colony algorithm based on fuzzy (CHABCF)," J. Soft Comput. Appl., vol. 2013, pp. 1–8, Jun. 2013, doi: 10.5899/2013/jsca-00014.
[60] S. Arora and P. Anand, "Binary butterfly optimization approaches for feature selection," Expert Syst. Appl., vol. 116, pp. 147–160, Feb. 2019, doi: 10.1016/[Link].2018.08.051.
[61] G. Farnadi, S. Zoghbi, M. Moens, and M. De Cock, "Recognising personality traits using Facebook status updates," in Proc. WCPR, 2013, pp. 14–18.
[62] K.-J. Kim and S.-B. Cho, "Ensemble classifiers based on correlation analysis for DNA microarray classification," Neurocomputing, vol. 70, nos. 1–3, pp. 187–199, Dec. 2006.
[63] B. Auffarth, M. Lopez-Sanchez, and J. Cerquides, "Comparison of redundancy and relevance measures for feature selection in tissue classification of CT images," in Advances in Data Mining: Applications and Theoretical Aspects, P. Perner, Ed. Berlin, Germany: Springer, 2010, pp. 248–262.
[64] M. M. Mukaka, "Statistics corner: A guide to appropriate use of correlation coefficient in medical research," Malawi Med. J., vol. 24, no. 3, pp. 69–71, Sep. 2012.
[65] W. Duch, P. Matykiewicz, and J. Pestian, "Neurolinguistic approach to natural language processing with applications to medical text analysis," Neural Netw., vol. 21, no. 10, pp. 1500–1510, Dec. 2008.
[66] I. Solti, C. R. Cooke, F. Xia, and M. M. Wurfel, "Automated classification of radiology reports for acute lung injury: Comparison of keyword and machine learning based natural language processing approaches," in Proc. IEEE Int. Conf. Bioinformatics Biomed. Workshop, Washington, DC, USA, Nov. 2009, pp. 1–4.
[67] L. Antiqueira, M. Nunes, O. Oliveira, Jr., and L. D. F. Costa, "Strong correlations between text quality and complex networks features," Phys. A, Stat. Mech. Appl., vol. 373, pp. 811–820, Jan. 2007.
[68] M. Chong, L. Specia, and R. Mitkov, "Using natural language processing for automatic detection of plagiarism," in Proc. 4th Int. Plagiarism Conf., Newcastle upon Tyne, U.K.: Northumbria Univ., 2010, pp. 1–12.

Ahmed Al Marouf received the bachelor's degree from the Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Gazipur, Bangladesh, in 2014, and the [Link]. degree in CSE from IUT in 2019. He was a Graduate Researcher with the Systems and Software Lab (SSL), CSE Department, IUT. He is currently a Lecturer with the Department of Computer Science and Engineering (CSE), Daffodil International University (DIU), Dhaka, Bangladesh, where he is also the Technical Lead of the Human Computer Interaction (HCI) Research Lab. His research interest lies within computational social science, data science, and machine learning.

Md. Kamrul Hasan received the [Link]. degree in computer science and information technology (CIT) from the Islamic University of Technology (IUT), Gazipur, Bangladesh, and the Ph.D. degree from Kyung Hee University, Seoul, South Korea. He has long experience in the software industry as a developer and a consultant. He is currently a Professor with the Department of Computer Science and Engineering (CSE), IUT, where he has been serving for ten years and is also the Founding Director of the Systems and Software Lab (SSL). His current research interests are in intelligent systems and AI, software engineering, cloud computing, data mining applications, and social networking.

Hasan Mahmud received the bachelor's degree in computer science and information technology (CIT) from the Islamic University of Technology (IUT), Gazipur, Bangladesh, in 2004, and the [Link]. degree in computer science from the University of Trento (UniTN), Trento, Italy, in 2009. He is currently pursuing the Ph.D. degree in computer science and engineering (CSE) with IUT, under the guidance of Dr. M. A. Mottalib and Dr. K. Hasan. He joined the CSE Department, Stamford University Bangladesh, Dhaka, Bangladesh, as a Faculty Member. Since 2009, he has been an Assistant Professor with the Department of CSE, IUT, where he is also the Co-Founder of the Systems and Software Lab (SSL). He has published research articles in several international journals and conferences. His research interest focuses on human–computer interaction, gesture-based interaction, and machine learning. Mr. Mahmud received the University Guild Grant Scholarship for two years (2007–2009) for his master's study and the Early Degree Scholarship.