Personality Classification From Online Text
Abstract—Personality refers to the distinctive set of characteristics of a person that affect their habits, behaviours, attitude and pattern of thoughts. Text available on social networking sites provides an opportunity to recognize an individual's personality traits automatically. In this work, a machine learning technique, the XGBoost classifier, is used to predict four personality traits based on the Myers-Briggs Type Indicator (MBTI) model, namely Introversion-Extroversion (I-E), iNtuition-Sensing (N-S), Feeling-Thinking (F-T) and Judging-Perceiving (J-P), from input text. A publicly available benchmark dataset from Kaggle is used in the experiments. The skewness of the dataset is the main issue associated with the prior work; it is minimized by applying a re-sampling technique, namely random over-sampling, resulting in better performance. To explore personality from text further, pre-processing techniques including tokenization, word stemming, stop word elimination and feature selection using TF-IDF are also applied. This work provides the basis for developing a personality identification system which could assist organizations in recruiting and selecting appropriate personnel and in improving their business by knowing the personality and preferences of their customers. The results obtained by all classifiers across all personality traits are good; however, the XGBoost classifier performs best, achieving more than 99% precision and accuracy for different traits.
Keywords—Personality recognition; re-sampling; machine learning; XGBoost; class imbalance; MBTI; social networks
I. INTRODUCTION
Personality encompasses every aspect of a person's life. It describes the pattern of thinking, feeling and characteristics that predict and describe an individual's behaviour, and it also influences daily life activities including emotions, preferences, motives and health [1].
The increasing use of social networking sites, such as Twitter and Facebook, has propelled the online community to share ideas, sentiments, opinions, and emotions with each other, reflecting their attitude, behaviour and personality. A strong connection exists between an individual's temperament and the behaviour they show on social networks in the form of comments or tweets [2].
Personality recognition from social networking sites has recently attracted the attention of researchers developing automatic personality recognition systems. The core philosophy of such applications is based on different personality models, like the Big Five Factor Personality Model [3], the Myers-Briggs Type Indicator (MBTI) [4], and the DiSC Assessment [5].
The existing works on personality recognition from social media text are based on supervised machine learning techniques applied to benchmark datasets [6], [7], [8]. However, the major issue associated with these studies is the skewness of the datasets, i.e. the presence of imbalanced classes with respect to different personality traits. This issue is a main contributor to the performance degradation of personality recognition systems.
To address this issue, different techniques are available for minimizing the skewness of a dataset, such as over-sampling, under-sampling and hybrid sampling [9]. Such techniques, when applied to imbalanced datasets in different domains, have shown promising performance in terms of improved accuracy, recall, precision, and F1-score [10].
In this work, a machine learning technique, namely XGBoost, is applied to the benchmark personality recognition dataset to classify text into different personality traits: Introversion-Extroversion (I-E), iNtuition-Sensing (N-S), Feeling-Thinking (F-T) and Judging-Perceiving (J-P). Furthermore, to improve the performance of the system, a re-sampling technique [11] is used to minimize the skewness of the dataset.
A. Problem Statement
Predicting personality from online text is a growing trend among researchers. Considerable work has already been carried out on predicting personality from input text [6], [7], [8]. However, more work is required to improve the performance of existing personality recognition systems, whose shortcomings in most cases arise from imbalanced classes of personality traits. In the proposed work, a dataset balancing technique called re-sampling is used to balance the personality recognition dataset, which may result in improved performance.
B. Research Questions
RQ.1: How to apply a supervised machine learning technique, namely the XGBoost classifier, for classifying personality traits from the input text?
RQ.2: How to apply a class balancing technique on the imbalanced classes of personality traits for performance improvement, and what is the efficiency of the proposed technique w.r.t. other machine learning techniques?
*Corresponding Author
460 | P a g e
www.ijacsa.thesai.org
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 11, No. 3, 2020
RQ.3: What is the efficiency of the proposed technique with respect to other baseline methods?
C. Aims and Objective
1) Aim: The aim of this work is to classify the personality traits of a user from the input text by applying a supervised machine learning technique, namely the XGBoost classifier, on the benchmark MBTI personality dataset. This work is an enhancement of the prior work performed by [6].
2) Objectives
a) Applying a machine learning technique, namely the XGBoost classifier, for personality trait recognition from the input text.
b) Applying a re-sampling technique on the imbalanced classes of personality traits to improve the performance of the proposed system.
c) Evaluating the performance of the proposed model with respect to other machine learning techniques and baseline methods.
D. Significance of Study
Personality is a distinctive way of thinking, behaving and feeling. Personality plays a key role in someone's orientation towards various things like books, social media sites, music and movies [12].
The proposed work on personality recognition is an enhancement of the work performed by [6]. It is significant for the following reasons: (i) the performance of the existing study suffers from skewness, which is addressed in this work by applying a re-sampling technique to the imbalanced dataset; (ii) the proposed work also provides a basis for developing state-of-the-art applications for personality recognition, which could assist organizations in recruiting and selecting appropriate personnel and in improving their business by taking into account the personality and preferences of their customers.
II. RELATED WORK
A review of the literature pertaining to personality recognition from text is presented in this section. The studies are categorized into four subgroups, namely i) supervised learning techniques, ii) unsupervised machine learning techniques, iii) semi-supervised machine learning techniques, and iv) deep learning techniques. Fig. 1 depicts the classification sketch of the literature review on personality recognition from text.
Fig. 1. Categorization Sketch of Literature Review.
A. Supervised Learning Technique
Supervised learning algorithms learn from labelled data: the input (independent) variables are used to predict an output (dependent) variable whose values are known for the training examples. The studies given below are based on supervised learning methodologies.
A system is proposed by [6] for analysing the social media posts/tweets of a person and producing a personality profile accordingly. The work mainly emphasizes data collection, pre-processing methods and the machine learning algorithm for prediction. The feature vectors are constructed using different feature selection techniques such as Emolex, LIWC and TF-IDF. The obtained feature vectors are used during training and testing of different machine learning algorithms, like Neural Net, Naïve Bayes and SVM. SVM with all feature vectors achieved the best accuracy across all dimensions of the Myers-Briggs Type Indicator (MBTI). Further enhancement can be made by incorporating more state-of-the-art techniques.
The MBTI dataset introduced in [7] for personality prediction is derived from the Reddit social media network. A rich set of features is extracted, and benchmark models are evaluated for personality prediction. The classification is performed using SVM, Logistic Regression, and Multilayer Perceptron (MLP). The classifier using all linguistic features together performed best across all MBTI dimensions. However, further experimentation is required on more models to achieve more robust results. The major limitation is that the number of words in the posts is very large, which sometimes prevents accurate personality prediction.
To predict personality from tweets, [8] proposed a model using 1.2 million tweets annotated with MBTI type for personality and gender prediction. A logistic regression model is used to predict the four dimensions of MBTI, with binary word n-grams as features. This work showed improvement in the I-E and T-F dimensions, no improvement in S-N, and a slight drop for P-J. In terms of personality prediction, linguistic features produce far better results. Incorporating an enhanced dataset may improve performance.
A system was developed to recognize user personality using the Big Five Factor personality model from tweets posted in English and Indonesian [13]. Different classifiers are applied to the MyPersonality dataset. The accuracy achieved by Naïve Bayes (NB) is 60%, which is better than the accuracy of KNN (58%) and SVM (59%). Although this work did not improve on the accuracy of previous research (61%), it achieved the goal of predicting personality from Twitter-based messages. Using an extended dataset and implementing a semantic approach may improve the results.
A personality assessment/classification system based on the Big5 model was proposed for Bahasa Indonesian tweets [14]. Assessment is made on the user's word choice. The machine learning classifiers, namely SVM and XGBoost, are evaluated under different settings, such as n-gram minimum and n-gram weighted features, removal of stop words, and use of LDA. XGBoost performed better than the SVM under
the same data and same parameter setting. The limited dataset of only 359 instances for training and testing is the main drawback of their work.
Automatic identification of the Big Five Factor Personality Model was proposed by [15] using individual status text from Facebook. Various techniques, like Multinomial NB (MNB), Logistic Regression (LR) and SMO for SVM, are used for personality classification. MNB outperformed the other methods. Incorporating feature selection and more classifiers may enhance the performance.
Personality profiling based on different social networks, such as Twitter, Instagram and Foursquare, was performed by [16]. A large multisource dataset, namely NUS-MSS, is utilized for three different geographical regions. The data is evaluated for average accuracy using different machine learning classifiers. When the different data sources are concatenated in one feature vector, the classification performance improves by more than 17%. The available dataset may be enriched from multiple social networking sites (SNS) through users' cross-posting for better performance.
The performance of different ML classifiers is analysed in [17] to assess students' personality based on their Twitter profiles, considering only the Extraversion trait of the Big 5. Different machine learning algorithms, like Naïve Bayes, Simple Logistic, SMO, JRip, OneR, ZeroR, J48, Random Forest, Random Tree, and AdaBoostM1, are applied in the WEKA platform. The efficiency of the classifiers is evaluated in terms of correctly classified instances, time taken, F-measure, etc. The OneR rule-based classifier shows the best performance among all, producing 84% classification accuracy. In future, all dimensions of the Big5 can be considered for evaluation to gain more insight.
The performance of different classifiers is evaluated by [18] using the MBTI model to predict a user's personality from online text. Various ML classifiers, namely Naïve Bayes, SVM, LR and Random Forest, are used for estimation. Logistic Regression achieved 66.5% accuracy for all MBTI types, which is further improved by parameter tuning. Results may be improved further by using the XGBoost algorithm, which has won many Kaggle and other data science competitions.
Over-sampling and under-sampling techniques are compared by [11] for imbalanced datasets. Classifiers perform poorly when applied to imbalanced classes. Three approaches (data level, algorithmic level and hybrid) are widely used for solving the class imbalance problem. The data-level method is evaluated in that study, and the result of the over-sampling method (SMOTE) is better than that of the under-sampling technique (RUS). More re-sampling techniques need to be evaluated in future.
The authors in [19] briefly discussed and explained the early research on the classification of personality from text, carried out on various social networking sites, such as Twitter, Blogger, Facebook and YouTube, on the available datasets. The methods, features, tools and results are also evaluated. Unavailability of datasets, lack of identification of features in certain languages, and difficulty in identifying the requisite pre-processing methods are the issues to be tackled. These issues can be addressed by developing methods for non-English languages, introducing more accurate machine learning algorithms, implementing other personality models, and including more feature selection for pre-processing of data.
Twitter users' profiles are used for classification of their personality traits using the Big5 model [20]. In total, 50 subjects with 2000 tweets per user are assessed for prediction. User content is analysed using two psycholinguistic tools, namely LIWC and MRC. The performance evaluation is carried out using two regression models, namely ZeroR and GP. Results for the "openness" and "agreeableness" traits are similar to those of previous work, but less efficient results are shown for the other traits. An extended dataset may improve the results.
A connection has been established between the users of Twitter and their personality traits based on the Big5 model [21]. Due to the inaccessibility of original tweets, a user's personality is predicted from three parameters that are publicly available in their profiles, namely (i) followers, (ii) following, and (iii) listed count. Regression analysis is performed using M5 rules with 10-fold cross validation. The RMSE of predicted values against observed values is also measured. Results show that, based on these three counts, a user's personality can be predicted accurately.
TwiSTy, a novel corpus of tweets for gender and personality prediction, has been presented by [22] using the MBTI type indicator. It covers six languages, namely Dutch, German, French, Italian, Portuguese and Spanish. A linear SVM is used as the classifier and results are also tested with Logistic Regression. Binary features for character and word n-grams are utilized. The model performed best for gender prediction. For personality prediction, it outperformed other techniques for two dimensions, I-E and T-F, but for S-N and J-P it did not show improvement. In future, the model can be trained further to predict all four dimensions of MBTI efficiently.
Table I summarizes the above cited studies on classification and prediction of a user's personality using supervised machine learning strategies.
B. Unsupervised Learning Approach
Unsupervised learning methods use only unlabeled training data (input variables), without any corresponding output variables to be predicted or estimated.
The Twitter data was annotated by [23] for 12 different linguistic features, and a correlation was established between a user's personality and writing style across users from different regions and different devices. Users with more than one tweet are considered for evaluation. It was observed that Twitter users are secure, unbiased and introvert as compared to the users posting from iPhone, BlackBerry, UberSocial and Facebook platforms. More Twitter data for classification may enhance the efficiency of the personality identification model.
TABLE I. PERSONALITY RECOGNITION BASED WORK USING SUPERVISED MACHINE LEARNING APPROACH
SNo | Research | Goals and objectives | Strategy/Approach | Performance | Limitation and Future Work
1 | Bharadwaj et al. (2018) [6] | Personality prediction from online text | SVM, Neural Net and Naïve Bayes; TF-IDF, Emolex, LIWC and ConceptNet | SVM with all feature vectors achieved best accuracy across all dimensions of MBTI | Less weightage is given to the word's gravity. Incorporating more state-of-the-art techniques in future will yield better results.
13 | Quercia et al. (2011) [21] | To establish a connection between the users of Twitter and their personality traits based on Big5 model. | Regression using M5 rules with 10-fold cross validation. | RMSE: O = 0.69, C = 0.76, E = 0.88, A = 0.79, N = 0.85 | In future user personality classification may be utilized in marketing and recommender systems.
14 | Verhoeven et al. (2016) [22] | To predict gender and personality from a novel corpus of tweets, namely TwiSTy. | SVM and Logistic Regression along with word n-gram features. | F-score (for the Italian language): I/E = 77.78, S/N = 79.21, T/F = 52.13, J/P = 47.01 | In future, the model can be trained enough to predict all four dimensions of MBTI efficiently.
The purpose of the study carried out by [24] is to examine group-based personality identification using an unsupervised trait learning methodology. The AdaWalk technique is utilized in this study. The outcomes show that, in terms of Micro-F1 score, AdaWalk performs notably well, with gains of roughly 7% for Wiki, 3% for Cora, and 8% for BlogCatalog. Using the SoCE personality corpus, a Macro-F1 score of 97.74% was achieved by this approach. The drawback of this work is that it depends entirely on the TF-IDF strategy; additionally, the constructed content networks do not imitate genuine social and interpersonal networks such as retweet networks. A larger dataset may enhance the performance of the proposed work in future.
An unsupervised personality classification strategy was used by [25] to examine to what extent different personalities interact and behave on the social media site Twitter. Linguistic and statistical characteristics are utilized and then tested on a data corpus annotated with a personality model using human judgment. The analysis indicates that neurotic users comment more than secure ones and tend to develop longer chains of interaction.
An unsupervised machine learning methodology, namely K-Means, was applied by [26] to recognize network visitors' traits and personality. This work is based on the quantifiable contents of the website. The obtained results indicate that this strategy can be used to predict the personality traits of website and network visitors more accurately. The proposed system may be enhanced in future by adding more elements associated with websites and a greater number of websites for better performance.
The author in [27] proposed a personality identification system using an unsupervised approach based on the Big-5 personality model. Different social media network sites are used for extraction and classification of users' traits. Linguistic features are exploited to build the personality model. The system predicts personality for an input text and achieved reasonable results. However, an extended annotated corpus could boost the system's performance.
TABLE II. PERSONALITY RECOGNITION BASED WORK USING UN-SUPERVISED MACHINE LEARNING APPROACH
SNo | Research | Goals and objectives | Strategy/Approach | Outcome | Limitation and Future Work
1 | Celli (2011) [23] | Personality classification from individual's writing pattern | Unsupervised, score-based | Mean accuracy = 0.6651 and mean validity = 0.6994 | Additional tweets for personality recognition may improve the accuracy of this proposed model.
2 | Sun et al. (2019) [24] | Group-based personality identification | Unsupervised, AdaWalk | 97.74% (Macro-F1) | Large and increased dataset will definitely enhance the performance of the proposed work in future.
3 | Celli and Rossi (2012) [25] | Impact of linguistic characteristics on personality traits. | Unsupervised, statistics-based | 78.29% (Accuracy) | More tweets are needed for efficient investigation.
4 | Chishti and Sarrafzadeh (2015) [26] | To recognize the network visitors' trait and personality | Unsupervised, K-Means | K = 10 is accurate score | System may be enhanced in future by adding more elements associated with websites and a greater number of websites for better performance.
5 | Celli (2012) [27] | Impact of linguistic characteristics on personality traits using Big Five Model | Unsupervised, score-based | 81.43% (Accuracy) | Extended annotated corpus can boost the system's performance.
6 | Arnoux et al. (2017) [28] | Developing personality model to predict individual's Big Five personality traits on much fewer data using Twitter. | Word-Embedding | 68.5% (Accuracy) | Findings of this method are based on English Twitter data, which may be extended to other languages.
A model was proposed by [28] that requires eight times less data to predict an individual's Big Five personality traits. The GloVe model is used as the word embedding to represent the words from user tweets. The model is first trained and then tested on given tweets. Further, the data is tested on three other combinations: (i) GloVe with RR, (ii) LIWC with GP, and (iii) 3-gram with GP. The proposed model performed better, with an average correlation of 0.33 over the Big-5 traits, which is far better than the baseline method. Findings of this method are based on English Twitter data, which may be extended to other languages. Similarly, the performance of the model can be examined with a small number of tweets.
Table II gives concise details of the above cited studies on identification of users' personality and traits from textual data using the unsupervised machine learning approach.
C. Semi-Supervised Learning Approach
Studies that combine linguistic and lexicon features, supervised machine learning methodologies and different feature selection algorithms are known as semi-supervised ML approaches. The following studies have utilized the semi-supervised and hybrid strategy.
A multilingual predictive model was proposed by [29], which identifies a user's personality traits, age and gender based on their tweets. An SGD classifier with n-gram features is used for age and gender classification, while LIWC with a regressor model (ERCC) is used for personality prediction. An average accuracy of 68.5% has been achieved for recognition of user attributes in four different languages. However, author profiling can be enhanced by performing experiments in multiple languages.
A technique was devised to detect MBTI personality traits from social media (Twitter) in the Bahasa Indonesian language [30]. Among 142 respondents, 97 users are selected, with an average of 2500 tweets per user. WEKA is used for building the classification and training set. Three approaches are used for prediction from the training set: i) machine learning, ii) lexicon-based, and iii) linguistic rules driven. Among all, Naïve Bayes outperformed the compared methods in terms of accuracy and time. Its accuracy for the I/E trait is 80%, while for S/N, T/F and J/P its accuracy is 60%. The lower accuracy of the linguistic rule-driven and lexicon-based approaches is due to the limited corpus in Bahasa Indonesia. It is observed that increasing the training data set may improve accuracy.
A technique was proposed for personality prediction from social media-based text using word count [31]. It works for both the MBTI and Big5 personality models using 8 different languages. Four kinds of labelled corpus, for both Big5 and MBTI, are used for conducting the experiments. In each corpus, the 1000 most frequently used words are selected. Prediction accuracy for the "openness" trait of Big5 is higher across all corpora, while for MBTI, prediction accuracy for the S/N dimension is greater than for the other dichotomies. Using only word count for prediction is the main drawback of the proposed system, which may be overcome by introducing different feature selection and ML algorithms.
Details of the above quoted studies on personality classification using the semi-supervised machine learning approach are presented in Table III.
TABLE III. PERSONALITY RECOGNITION BASED WORK USING SEMI-SUPERVISED MACHINE LEARNING APPROACH
SNo | Research | Goals and objectives | Strategy/Approach | Performance | Limitation and Future Work
1 | Arroju et al. (2015) [29] | Multilingual predictive model is used to identify user's personality traits, age and gender, based on their tweets. | SGD classifier with n-gram features; LIWC with regressor model (ERCC) | Accuracy = 68.5% | Accuracy may be improved by using different personality model. Similarly, author profiling can be further enhanced by performing experiments in multiple languages.
2 | Lukito et al. (2016) [30] | To recognize MBTI type personality traits from social media (Twitter) in Bahasa Indonesian language. | Machine learning, lexicon-based, and linguistic rules driven | I/E trait = 80%; S/N, T/F and J/P accuracy is 60% | Lower accuracy is due to limited corpus in Bahasa Indonesia. By increasing the training data set, accuracy may get improved.
3 | Alsadhan and Skillicorn (2017) [31] | Personality prediction from social media-based text using word count | Based on word count | Accuracy for "openness" trait of Big5 is higher, while for MBTI, accuracy for S/N dimension is greater than all other dichotomies. | Using only word count for prediction is the main drawback of the proposed system, which may be covered by introducing different features selection and ML algorithms.
D. Deep Learning Strategy
Deep learning is a subcategory of machine learning (ML) in artificial intelligence (AI), where machines acquire knowledge and experience through training, without user interaction, in order to make decisions. Learning from unlabeled and unstructured corpora, deep learning performs tasks repeatedly and improves and refines its results after each iteration. The studies given below summarize the prior work performed in deep learning.
A deep learning classifier was developed which takes a text/tweet as input and predicts the MBTI type of the author using the MBTI dataset [32]. After applying different pre-processing techniques, an embedding layer is used, where all lemmatized words are mapped to form a dictionary. Different RNN layers are investigated, and LSTM performed better than GRU and simple RNN. When classifying a user on all four dimensions at once, the accuracy is 0.208 (0.676 × 0.62 × 0.778 × 0.637), which is not good. The predictive efficiency of this work may be improved by increasing the number of posts per user. The model was also tested on a real-life example, 30,000 tweets by Donald Trump, and correctly predicted his actual MBTI personality type.
A model was proposed by [33] that takes a snippet of a post or text as input and classifies it into different personality types, such as INFP, ENTP, and ISFJ. Different classification methods, like Softmax as baseline, SVM, Naïve Bayes, and deep learning, are implemented for performance evaluation. SVM outperformed NB and Softmax with 34% training and 33% test accuracy, while the deep learning model shows further improvement with 40% training and 38% test accuracy. However, the accuracy is still low, as it does not reach 50 percent.
A personality classification system is proposed by [34] to recognize traits from online text using a deep learning methodology. An AttRCNN model with a hierarchical approach was suggested for this study, which is capable of learning complex and hidden semantic characteristics of a user's textual content. The results are very effective, showing that deep and complex semantic features perform far better than the baseline features.
A deep learning model was suggested by [1] to classify personality traits using the Big Five personality model based on an essay dataset. A Convolutional Neural Network (CNN) is used in this work to detect personality traits from an input essay. Different pre-processing techniques, like word n-grams, sentence, word and document level filtration, and extraction of different features, are performed for personality trait classification. The "OPN" trait achieved the highest accuracy among the five traits, 62.68%, using different feature configurations. In future, more features need to be incorporated and an LSTM recurrent network may be applied for better results.
Table IV outlines the works on automatic personality recognition systems using the deep learning methodology.
TABLE IV. PERSONALITY RECOGNITION BASED WORK USING DEEP LEARNING APPROACH
SNo | Research | Goals and objectives | Strategy/Approach | Outcome | Limitation and Future Work
1 | Hernandez and Scott (2017) [32] | To predict and classify people into their MBTI types using their online textual contents. | Deep learning: RNN, LSTM, GRU, BiLSTM | Accuracy: I/E = 67.6%, S/N = 62.0%, T/F = 77.8%, J/P = 63.7% | The predictive efficiency of this work may be improved by increasing the number of posts per user.
2 | Cui and Qi (2018) [33] | A model that takes snippet of post or text as input and classify it into different personality traits. | Deep learning: multi-layer LSTM | Overall accuracy = 38%; I/E = 89.51%, S/N = 89.84%, T/F = 69.09%, J/P = 69.37% | In future more deep learning techniques with more word embedding features may be exploited. Using of unsupervised technique will also give better results.
3 | Xue et al. (2018) [34] | To recognize the personality traits from online text using deep learning methodology. | Deep learning using AttRCNN approach | MAE: OPN = 0.3577, CON = 0.4251, EXT = 0.4776, AGR = 0.3864, NEU = 0.4273 | In future these deep and complex semantic features will be used as input of regression classifiers for more improvement in the performance.
4 | Majumder et al. (2017) [1] | To classify personality traits using Big Five personality model based on essay dataset. | Deep learning: CNN | Accuracy: OPN = 62.68%, CON = 56.73%, EXT = 58.09%, AGR = 56.71%, NEU = 59.38% | In future more features need to be incorporated and LSTM recurrent network may be applied for better results.
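As a quick arithmetic check of the all-four-dimensions accuracy quoted for [32], the per-dimension accuracies listed in Table IV can be multiplied together. This is only a sketch: it assumes the four binary classifiers make independent errors, and the variable names are illustrative rather than taken from [32].

```python
from math import prod

# Per-dimension accuracies reported for [32] in Table IV.
per_dimension_accuracy = {"I/E": 0.676, "S/N": 0.620, "T/F": 0.778, "J/P": 0.637}

# Assumption: errors on the four dimensions are independent, so the chance of
# getting the complete four-letter type right is the product of the four values.
joint_accuracy = prod(per_dimension_accuracy.values())

print(f"Joint four-dimension accuracy: {joint_accuracy:.3f}")
```

The product is roughly one in five, which illustrates why moderate per-dimension accuracies still give a weak full-type classifier.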
TABLE VII. SAMPLE TWEETS FROM TEST DATASET

| Personality Type | Tweets |
|---|---|
| ENFP | Patience is a virtue. So proud that you guys are still together. |
| ISFJ | We are always willing to help those in need |
| ENTJ | I'm scared of failure, but also throwing up...take that for what you will. |
| INFP | That would be the best description for what I usually am. |
| ENFJ | You're right. Not sure why I didn't think of that before hahah |
| ESTP | I have 0 friends. I don't trust anybody. |

When the dataset is divided into training, validation and testing data, the model is trained on only a portion of the dataset. A model trained on fewer data instances is likely to perform worse, so the measured test error tends to overestimate the error rate of the algorithm fitted on the whole dataset.

Algorithm 1. Dividing the Data set in Train and Test sets.
# Division of data into training and testing sets
Assign [] to X_train
Assign [] to Y_train
Assign [] to X_test
Assign [] to Y_test
Allocate Test_Size to 20% of n
Assign RNDM (0, n-1, Test_Size) to TINDICES
For i = 0 to n-1
    Assign [] to TEMP
    For each WORD in Tf-Idf[i]
        Append (Tf-Idf[i][WORD]) to TEMP
    END FOR
    If i in TINDICES then
        Append (TEMP) to X_test
        Append (tweet[i][I]) to Y_test      # personality label of tweet i
    Else
        Append (TEMP) to X_train
        Append (tweet[i][I]) to Y_train     # label, not the feature vector
    END IF
END FOR
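The random 80/20 division of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the fixed seed and the toy feature vectors are hypothetical and are not taken from the original implementation.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Pick test indices at random (RNDM in Algorithm 1); the rest are training."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    test_size = int(round(n * test_fraction))
    test_indices = set(rng.sample(range(n), test_size))
    train_indices = [i for i in range(n) if i not in test_indices]
    return train_indices, sorted(test_indices)


def split_dataset(features, labels, test_fraction=0.2, seed=0):
    """Return X_train, Y_train, X_test, Y_test as plain lists."""
    train_idx, test_idx = train_test_split_indices(len(features), test_fraction, seed)
    x_train = [features[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, y_train, x_test, y_test


# Toy data standing in for TF-IDF vectors and MBTI labels (hypothetical values).
features = [[i, i * i] for i in range(10)]
labels = ["INFP" if i % 2 == 0 else "ENTJ" for i in range(10)]
x_train, y_train, x_test, y_test = split_dataset(features, labels)
```

Each feature vector keeps its own label in both branches, which is the property the pseudocode relies on.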
To address the problem that a single split trains the model on only a portion of the data, a cross-validation technique is used.

4) Cross-validation: It is a statistical methodology that splits the data into subgroups, trains on one subset of the data and uses the other subset to validate the model.

Cross validation comprises the following steps:
- Split the dataset into two subsets.
- Reserve one subset of the data.
- Train the model on the other subset of data.
- Use the reserved subset for validation (testing); if the model performs well on the validation set, this indicates the effectiveness of the proposed model.

Cross validation is utilized to estimate the algorithm's predictive performance.

a) K fold cross validation: This strategy randomly partitions the data into k subsets of almost equal size. The first fold is reserved for testing and the remaining k-1 subsets are used for training the model. This process is continued until each of the k folds has been used as the testing set.

This procedure is repeated k times; therefore, the Mean Square Error is also obtained k times (from Mean Square Error-1 to the kth Mean Square Error). The k-fold Cross Validation error is calculated by taking the mean of the Mean Square Error over the k folds. Fig. 3 explains the working procedure of K-Fold cross validation.

Fig. 3. K-Fold Cross Validation Working Procedure.

B. Preprocessing and Feature Selection
Different pre-processing techniques and feature selection methods are exploited for further exploration of personality from text. These techniques include tokenization, removal of URLs, user mentions and hashtags, word stemming, stop word elimination and feature selection using TF-IDF [28], [32].

1) Preprocessing: The following preprocessing steps, acquired from the work of [37], are applied to the mbti_kaggle dataset before classification.

a) Tokenization: Tokenization is the procedure where text is divided into small fractions (tokens), such as words. For this purpose, the Python-based NLTK tokenizer is utilized.

b) Dropping Stop Words: Stop words don't reveal any idea or information. Python code is executed to remove these words using a pre-defined word inventory. For instance, "the", "is", "an" and so on are called stop words.

c) Word stemming: It is a text normalization technique. Word stemming reduces inflected words to their root form. Stem words are produced by eliminating the prefix or suffix attached to root words.

2) Feature Selection: The following feature selection steps are accomplished using different machine learning classifiers.

a) CountVectorizer: Machine learning algorithms cannot process text or documents directly; the text must first be converted into a matrix of numbers. This conversion of a text document into a vector of numbers is called vectorization.

The count vector is a well-known encoding technique to build a word vector for a given document. CountVectorizer takes what is known as the Bag of Words approach. Each message or document is divided into tokens and the number of times every token occurs in a message is counted.

CountVectorizer performs the following tasks:
- It tokenizes the whole text document.
- It builds a dictionary of defined words.
- It encodes the new document using known word vocabulary.

b) Term Frequency: It represents the weight of a word, i.e., how often a word or term occurs in a document.

c) Inverse Document Frequency: It is also a weighting scheme; it describes how common a word is across the whole collection of documents.

d) Term Frequency Inverse Document Frequency: The TF-IDF score is useful in adjusting the weight between the most frequent or general words and less commonly used words. Term frequency gives the frequency of every token in the tweet; however, this frequency is balanced by the frequency of that token in the entire dataset. The TF-IDF value shows the significance of a token in a tweet relative to the whole dataset [38].

This measure is significant because it describes the importance of a term, rather than the customary frequency count [39].

The feature engineering module pseudocode is illustrated in the following Algorithm 2.

Algorithm 2. Stepwise procedure for Feature Engineering
# CountVectorizer
Assign [] to CVectorizer
For Each tweet in Post Do
    For Each word in tweet Do
        Assign Dict[word] + 1 to Dict[word]
    EndFor
    CVectorizer.Append (Dict)
    Assign empty to Dict
EndFor
# Term Frequency
Assign CVectorizer to TF
Assign 0 to ROW
While (ROW <= N-1) Do
    Assign SUM (CVectorizer[ROW].values) to Nwords
    For Each W in CVectorizer[ROW]
        Assign CVectorizer[ROW][W] / Nwords to TF[ROW][W]
    EndFor
    Assign ROW + 1 to ROW
WhileEnd
# TF-IDF
# IDF calculation
Assign [] to IDF
While (Till the existence of ROW in TF) Do
    Assign [] to TEMP
    While (Till the existence of Word in ROW) Do
        Assign 0 to Count
        For i from 0 to N-1 Do
            If TF[i][Word] > 0 Then
                Assign Count + 1 to Count
            End If
        EndFor
        Assign LOG (N / Count) to TEMP[Word]
    WhileEnd
    IDF.Append (TEMP)
WhileEnd
# TF-IDF
Assign [] to TF-IDF
For i from 0 to N-1 Do
    Assign [] to TEMP
    For Each W in TF[i], IDF[i]
        TEMP[W] = TF[i][W] * IDF[i][W]
    EndFor
    Append (TEMP) to TF-IDF
EndFor

C. Text-based Personality Classification Using MBTI Model
In this proposed work, a supervised learning approach is used for personality prediction. The model takes a snippet of a post or text as input and predicts the personality traits (I-E, N-S, T-F, J-P) according to the scanned text. The Myers-Briggs Type Indicator is used for classification and prediction [4]. This model categorizes an individual into 16 different personality types based on four dimensions, namely:

(i) Attitude → Extroversion vs Introversion (E/I): this dimension defines how individuals focus their energy and attention, whether they are motivated externally by other people's judgement and perception, or by their inner thoughts.

(ii) Information → Sensing vs iNtuition (S/N): this aspect illustrates how people perceive information. Sensing individuals are observant (S), relying on their five senses and solid observation, while intuitive individuals prefer creativity over constancy and believe in their guts.

(iii) Decision → Thinking vs Feeling (T/F): a person with the Thinking aspect always exhibits logical behaviour in their decisions, while feeling individuals are empathic and give priority to emotions over logic.

(iv) Tactics → Judging vs Perceiving (J/P): this dichotomy describes an individual's approach towards work, decision-making and planning. Judging individuals are highly organized in their thoughts and prefer planning over spontaneity. Perceiving individuals have a spontaneous and instinctive nature; they keep all their options open and are good at improvising opportunities [40].

D. Working Procedure of the System for Personality Traits Prediction
As depicted in Fig. 4, the proposed model is first trained by giving both labelled data (MBTI type) and text (in the form of tweets). After training, the model is evaluated for efficiency. For better prediction, the dataset is split into three parts (training, validation and testing). The validation step reduces overfitting.

The mbti_kaggle dataset has two columns, namely, (i) type and (ii) posts. Type refers to the 16 MBTI personality types, such as INTP, ENTJ and INFJ. As we are interested in MBTI traits rather than types, we added four new columns to the original dataset through Python coding for the purpose of trait determination. As a result, the modified dataset looks as given below in Table VIII.

Fig. 4. Working Procedure of the System.
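The derivation of the four trait columns from the type column might look like the following sketch. The 0/1 coding is inferred from the sample rows of Table VIII (E, F and P map to 1 there); the coding of S as 1 is an assumption, since every sample row is an N type, and the function and constant names are hypothetical.

```python
# Column name -> (letter position in the MBTI type, letter coded as 1).
# E, F and P coded as 1 match the sample rows of Table VIII; S coded as 1
# is an assumption because no sample row contains an S type.
TRAIT_COLUMNS = {
    "I/E": (0, "E"),
    "S/N": (1, "S"),
    "F/T": (2, "F"),
    "J/P": (3, "P"),
}


def type_to_traits(mbti_type):
    """Map a four-letter MBTI type such as 'INFJ' to four binary trait columns."""
    mbti_type = mbti_type.strip().upper()
    if len(mbti_type) != 4:
        raise ValueError("expected a four-letter MBTI type, got %r" % mbti_type)
    return {
        column: int(mbti_type[position] == letter)
        for column, (position, letter) in TRAIT_COLUMNS.items()
    }


rows = [("ENTP", "..."), ("INFJ", "..."), ("INFP", "...")]  # (type, posts)
table = [dict(type=t, posts=p, **type_to_traits(t)) for t, p in rows]
```

Training one binary classifier per column then turns the 16-class problem into four two-class problems, which is the trait-level setting used in the rest of the paper.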
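The feature engineering steps of Algorithm 2 (token counts, term frequency, inverse document frequency and their product) can be illustrated with a small pure-Python sketch. It assumes whitespace tokenization and the plain log(N / document count) form of IDF used in the pseudocode; library implementations such as scikit-learn apply smoothing and normalization, so their numbers will differ. All function names are hypothetical.

```python
import math
from collections import Counter


def count_vectorize(tweets):
    """One token-count dictionary per tweet (bag of words)."""
    return [Counter(tweet.lower().split()) for tweet in tweets]


def term_frequency(counts):
    """Divide each count by the number of tokens in its tweet."""
    tf = []
    for row in counts:
        n_words = sum(row.values())
        tf.append({word: c / n_words for word, c in row.items()})
    return tf


def inverse_document_frequency(counts):
    """log(N / number of tweets containing the word), as in Algorithm 2."""
    n_docs = len(counts)
    doc_freq = Counter()
    for row in counts:
        doc_freq.update(row.keys())
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def tf_idf(tweets):
    counts = count_vectorize(tweets)
    tf = term_frequency(counts)
    idf = inverse_document_frequency(counts)
    return [{word: value * idf[word] for word, value in row.items()} for row in tf]


tweets = ["patience is a virtue", "patience is hard", "i trust nobody"]
scores = tf_idf(tweets)
```

In this toy corpus "virtue" occurs in one tweet and "patience" in two, so "virtue" receives the larger weight in the first tweet, which is the balancing effect described for TF-IDF.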
TABLE VIII. SAMPLE OF DATASET USED FOR EXPERIMENT

| Type | Posts | I/E | S/N | F/T | J/P |
|---|---|---|---|---|---|
| ENTP | I'm scared of failure, but also throwing up...take that for what you will. | 1 | 0 | 0 | 1 |
| INFJ | Just a funny comment from my side. A bit serious maybe. If you don't care about the functions | 0 | 0 | 1 | 0 |
| INFP | I need a date with an INTJ! God dammit. Opps, wrong thread. lol | 0 | 0 | 1 | 1 |

Algorithm 3: Pseudo code of the entire System
Input: Set of tweets from mbti_kaggle dataset saved in CSV format
Output: Classification of input text into personality traits
Personality Traits: ["I-E", "S-N", "F-T", "J-P"]
ML-Classifier: ["XGBoost"]
Stop-word List: [There, it, on, into, under…….]
Start
# Inputting snippet of text
Assign Dataset text of post to Text
# Pre-processing steps
# Tokenization/segmentation
Assign Tokenize(Text) to Tokens
# Dropping of stop words
Set Post_text to Drop_stopwords(Tokens)
# Punctuation
# Data set splitting into train/test
Set X_train, Y_train, X_test, Y_test to Split (Post_text, test_size = 20%)
# CountVectorizer(Post_text)
# Application of TF-IDF
# Classifier implementation
Set Model to MLClassifier
Set Classification to Model.fit(X_train, Y_train)
# Traits prediction
Set Trait_Prediction to Classification.Prediction(X_test)
# Accuracy
Set Accuracy to Accuracy (Trait_Prediction, Y_test)
# Recall score
Set Recall to Recall (Trait_Prediction, Y_test)
# Precision score
Set Precision to Precision (Trait_Prediction, Y_test)
# F1-score
Set F1_score to F1_score (Trait_Prediction, Y_test)
Assign (Accuracy, Recall, Precision, F1_score) to Personality Traits
Return (Personality Traits)

Algorithm 4: XGBoost Working Procedure
Data: Dataset and Hyperparameters
Initialize f_0(x);
for k = 1, 2, ..., M do
    Calculate the gradient g_k;
    Calculate the Hessian h_k;
    Determine the structure by choosing splits with maximized gain A;
    Determine the leaf weights;
    Determine the base learner b(x);
    Add trees: f_k(x) = f_{k-1}(x) + b(x);
end
Result: f(x) = f_M(x)

F. Comparing the Efficiency of XGBoost with other Classifiers
The overall prediction performance and efficiency of the proposed system is examined by applying other supervised machine learning classifiers. This comparison gives a true picture of the performance of the proposed classifier, namely XGBoost, compared with other machine learning algorithms and baseline methods regarding personality prediction capability from the input text [13].

G. Evaluation Metrics
Evaluation metrics, such as accuracy, precision, recall and f-measure, describe the performance of a model. Therefore, different evaluation metrics have been used to check the overall efficiency of the predictive model.

Algorithm 5: Pseudo code of the Performance Evaluation
# Performance
TC ← COUNT (Prediction = Y_test)
Accuracy ← TC / N2
TP ← COUNT (Prediction = Positive AND Y_test = Positive)
TN ← COUNT (Prediction = Negative AND Y_test = Negative)
FP ← COUNT (Prediction = Positive AND Y_test = Negative)
FN ← COUNT (Prediction = Negative AND Y_test = Positive)
Precision ← TP / (TP + FP)
Recall ← TP / (TP + FN)
CFM ← []
CFM['TP'] ← TP
CFM['FN'] ← FN
CFM['FP'] ← FP
CFM['TN'] ← TN
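The counts and ratios of Algorithm 5 can be written out as a short Python sketch. The F1 score is added here because the text lists f-measure among the metrics, although Algorithm 5 itself stops at recall; the function names and the toy predictions are hypothetical.

```python
def confusion_counts(predictions, targets, positive=1):
    """Count TP, TN, FP and FN for one binary trait."""
    tp = tn = fp = fn = 0
    for predicted, actual in zip(predictions, targets):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted != positive and actual != positive:
            tn += 1
        elif predicted == positive:
            fp += 1
        else:
            fn += 1
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


def evaluate(predictions, targets, positive=1):
    """Accuracy, precision, recall and F1 from the confusion counts."""
    c = confusion_counts(predictions, targets, positive)
    total = len(targets)
    accuracy = (c["TP"] + c["TN"]) / total if total else 0.0
    precision = c["TP"] / (c["TP"] + c["FP"]) if c["TP"] + c["FP"] else 0.0
    recall = c["TP"] / (c["TP"] + c["FN"]) if c["TP"] + c["FN"] else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_test = [1, 0, 1, 1, 0, 0, 1, 0]     # toy labels for one trait
predicted = [1, 0, 0, 1, 0, 1, 1, 0]  # toy predictions
counts = confusion_counts(predicted, y_test)
metrics = evaluate(predicted, y_test)
```

Because F1 is the harmonic mean of precision and recall, a reported F1 value can be checked directly against the precision and recall reported for the same trait.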
Table X shows the results of the XGBoost classifier with default parameter settings.

It is clear from Table XI that increasing or decreasing the values of different parameters of the XGBoost classifier has a large effect on the text classification results.

TABLE IX. PARAMETER SETTING FOR XGBOOST

| Parameters | Description |
|---|---|
| learning_rate = 0.03 | It scales the weight (contribution) of each tree added to the boosting model. |
| colsample_bytree = 0.4 | It corresponds to the fraction of features (columns) that will be used to train each tree. |
| scale_pos_weight = 1 | It controls the balance between negative and positive classes. |
| subsample = 0.8 | Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly collects half of the data instances to grow trees. This prevents overfitting. |
| objective = 'binary:logistic' | It returns the predicted probability for binary classification. |
| n_estimators = 1000 | It represents the number of decision trees in the XGBoost classifier. |
| reg_alpha = 0.3 | L1 regularization encourages sparsity (meaning pulling weights to 0). |
| max_depth = 10 | It represents the size (depth) of each decision tree in the model. Overfitting can be controlled using this parameter. |
| gamma = 10 | Its purpose is to control complexity. It represents the minimum loss reduction required to make a further split. It prevents overfitting. |

TABLE X. RESULTS OF XGBOOST WITHOUT PARAMETER SETTINGS

| Setting | Metrics | I-E | S-N | F-T | J-P |
|---|---|---|---|---|---|
| No parameter setting | Accuracy | 87.04 | 92.32 | 89.00 | 85.85 |
| | Recall | 81.44 | 81.75 | 87.70 | 89.16 |
| | Precision | 91.59 | 68.98 | 91.65 | 87.80 |
| | F1_Score | 86.22 | 74.82 | 89.92 | 88.47 |

TABLE XI. RESULTS OF XGBOOST WITH DIFFERENT PARAMETER SETTINGS

| Parameter settings | Metrics | I-E | S-N | F-T | J-P |
|---|---|---|---|---|---|
| learning_rate: 0.01, n_estimators: 1000, max_depth: 5, subsample: 0.8, colsample_bytree: 1, gamma: 1, objective = 'binary:logistic', reg_alpha = 0.3, scale_pos_weight = 1 | Accuracy | 93.10 | 96.70 | 92.32 | 90.88 |
| | Recall | 89.56 | 96.24 | 92.07 | 94.24 |
| | Precision | 96.32 | 97.14 | 93.64 | 90.91 |
| | F1_Score | 92.82 | 96.68 | 92.85 | 92.55 |
| learning_rate: 0.01, n_estimators: 1000, max_depth: 6, subsample: 0.8, colsample_bytree: 1, gamma: 1, objective = 'binary:logistic', reg_alpha = 0.3, scale_pos_weight = 1 | Accuracy | 95.51 | 97.61 | 93.15 | 91.79 |
| | Recall | 93.39 | 97.21 | 92.91 | 94.77 |
| | Precision | 97.47 | 98.00 | 94.37 | 91.81 |
| | F1_Score | 95.39 | 97.60 | 93.64 | 93.27 |
| learning_rate: 0.01, n_estimators: 500, max_depth: 6, subsample: 0.8, colsample_bytree: 1, gamma: 1, objective = 'binary:logistic', reg_alpha = 0.3, scale_pos_weight = 1 | Accuracy | 90.95 | 94.51 | 91.20 | 89.84 |
| | Recall | 85.78 | 91.98 | 90.28 | 95.23 |
| | Precision | 95.48 | 96.88 | 93.28 | 88.69 |
| | F1_Score | 90.37 | 94.37 | 91.75 | 91.84 |
| learning_rate: 0.01, n_estimators: 1000, max_depth: 10, subsample: 0.8, colsample_bytree: 1, gamma: 1, objective = 'binary:logistic', reg_alpha = 0.3, scale_pos_weight = 1 | Accuracy | 99.37 | 99.92 | 94.55 | 95.53 |
| | Recall | 97.16 | 100 | 89.96 | 92.66 |
| | Precision | 100 | 99.50 | 100 | 100 |
| | F1_Score | 98.56 | 99.75 | 94.72 | 96.19 |

B. Answer to RQ.2
While addressing RQ2: "How to apply a class balancing technique on the imbalanced classes of personality traits for performance improvement and what is the efficiency of the proposed technique w.r.t. other machine learning techniques?", an imbalanced dataset is considered first. An imbalanced dataset can be defined as a distribution problem arising in classification where the number of instances in each class is not equally divided.

Whenever an algorithm is applied to a skewed, imbalanced dataset, the outcome tends to diverge toward the majority class and the smaller classes are bypassed for prediction. This drawback of classification is known as the class imbalance problem [11].

Therefore, a re-sampling technique is used to balance the classes [11]. As two traits are highly imbalanced, a data-level re-sampling approach is used for class balancing [9].

In this section the overall comparison of predicting personality traits is presented using all evaluation metrics to determine the performance of different classifiers. Results are reported in Table XII.

Different classifiers are applied to the same mbti_kaggle dataset with and without the re-sampling technique. The results reported in Table XII show that, on the imbalanced dataset, XGBoost obtained the highest score on all four evaluation metrics and across all the MBTI personality dimensions. However, Naïve Bayes and Random Forest performed poorly on the imbalanced dataset. So, it is concluded from this experiment that applying classifiers to skewed data does not produce good results.

On the other hand, when the classifiers are tested on the resampled dataset, improved results are obtained for all dimensions over all classifiers.
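Random over-sampling, the data-level re-sampling technique referred to above, duplicates randomly chosen minority-class instances until every class is as large as the majority class. The following is a minimal sketch under that definition; the function name, the seed and the toy data are hypothetical and do not reproduce the authors' code.

```python
import random
from collections import Counter


def random_over_sample(samples, labels, seed=0):
    """Duplicate random minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


posts = ["post %d" % i for i in range(10)]  # toy stand-ins for tweets
labels = [0] * 8 + [1] * 2                  # skewed 8:2 trait labels
balanced_posts, balanced_labels = random_over_sample(posts, labels)
class_sizes = Counter(balanced_labels)
```

In practice the duplication is usually applied to the training portion only, so that copies of a test instance do not also appear in the training data.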
The most accurate and precise algorithm for this proposed work is XGBoost. It achieved excellent results for all traits using all metrics. XGBoost obtained its maximum accuracy (99.92%) for the S/N trait. Its results are the highest for all four dimensions and across all metrics.

1) Why our Class balancing technique is better: By applying the class balancing technique, the results for all evaluation metrics and for all four personality traits are high and better than the baseline work. In this dataset two dimensions, I/E and S/N, are highly imbalanced; therefore, a class balancing technique is used for better prediction performance.

TABLE XII. COMPARISON OF DIFFERENT CLASSIFIERS PERFORMANCE USING RE-SAMPLE DATASET AND IMBALANCE DATASET
The KNN classifier gives low performance overall; however, its Recall for I/E and F/T is slightly higher.

The outcome of the Decision Tree algorithm for the I/E and S/N traits is better than for the F/T and J/P traits.

Random Forest gives high results for all traits. However, its lowest Recall is obtained for J/P.

The Logistic Regression classifier produced very good results for all traits, but again for the J/P trait its accuracy and Precision are not up to the mark.

The results obtained by applying the Naïve Bayes classifier are comparatively better for the I/E and S/N traits.

Support Vector Machine, when tested on the given dataset, gives better and balanced results with respect to all traits. The SGD Classifier shows remarkable performance for all four personality traits.

The MLP classifier achieved outstanding results for all four traits using the four metrics.

The XGBoost classifier has proven to be very good for all classification problems. The results obtained using XGBoost are very balanced with respect to all personality traits.

C. Answer to RQ.3
To answer RQ3: "What is the efficiency of the proposed technique with respect to other baseline methods?", the proposed model is compared with two baseline methods [6, 7].

Classification was performed by [6] for personality prediction using the same mbti_kaggle dataset by applying three classifiers, namely, (i) SVM, (ii) MLP and (iii) Naïve Bayes, and obtained accuracy up to 88.4%. Due to imbalanced data the result of [6] is not up to the mark. The results show that SVM in combination with LIWC and TF-IDF feature vectors gave accurate prediction scores for all four traits, while MLP with all feature vectors got the maximum accuracy score for the S/N trait (90.45%); however, its result for the J/P trait is lower. Naïve Bayes also performs well for the I/E and S/N traits but its performance for T/F and J/P is very poor. The better accuracy for the I/E and S/N dimensions and the lower performance for T/F and J/P are due to the class imbalance problem.

A very large dataset, MBTI9k, acquired from Reddit is used for personality prediction [7]. The emphasis of this work is to extract features and linguistic properties of different words; these features are then used to train various machine learning models such as Logistic Regression, SVM and MLP. Classifiers using the integration of all features together (LR_all and MLP_all) obtained better results for all traits. The overall worst results across all classifiers were obtained for the T/F dichotomy. The major limitation of this work is that the number of words in each post is very large, which leads to slightly lower performance for all classifiers.

1) Proposed Work: In this proposed system, the same dataset is used as experimented by [6]. However, a re-sampling technique is applied to it, and hence the results obtained for all personality traits are very good; in particular, XGBoost achieved the best score across all dimensions and all traits as compared to previous work. It is observed that the mbti_kaggle dataset is very skewed; therefore, when the oversampling technique is applied the output is far better than previous works. Up to 99% accuracy for the I/E and S/N traits is achieved using the XGBoost classifier, while Bharadwaj [6] got 88% maximum accuracy for the S/N trait. Similarly, for T/F and J/P the results of the proposed work are promising, with 94.55% accuracy for T/F and 95.53% accuracy for the J/P dimension using XGBoost, while in previous work the MLP classifier achieved an accuracy of 54.1% for T/F and 61.8% for the J/P dimension. Therefore, it is clear that by using the resampling technique improved results are obtained for all four dimensions. The results reported in Table XIII describe the comparison of the proposed work with the baseline methods.

2) XGBoost with Outstanding Performance: XGBoost belongs to the family of Gradient Boosting, a machine learning technique used for classification and regression problems that produces a prediction from an ensemble of weak decision trees.

The main reasons for using this algorithm are its accuracy, speed, efficiency, and feasibility. It supports both a linear model and a tree learning algorithm and performs parallel computations on a single machine. It also has extra features for doing cross validation and computing feature importance.
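For orientation, the quantities named in Algorithm 4 (gradient, Hessian, gain, leaf weights, base learner) have standard forms in the commonly published XGBoost formulation. They are restated below as a sketch and are not taken from this paper: the symbols $\ell$ (loss), $\hat{y}_i$ (current raw prediction), $R_j$ (leaf region), $T$ (number of leaves) and $\lambda$ (L2 regularization weight) are assumptions of this note, $\gamma$ corresponds to the Gamma parameter of Table IX, and the L1 term (Reg_alpha) is omitted for brevity.

```latex
% Per-instance gradient and Hessian of the loss at boosting round k
g_i = \frac{\partial\, \ell\bigl(y_i, \hat{y}_i^{(k-1)}\bigr)}{\partial\, \hat{y}_i^{(k-1)}},
\qquad
h_i = \frac{\partial^2 \ell\bigl(y_i, \hat{y}_i^{(k-1)}\bigr)}{\partial\, \bigl(\hat{y}_i^{(k-1)}\bigr)^2}

% For the 'binary:logistic' objective, with p_i = \sigma(\hat{y}_i^{(k-1)})
g_i = p_i - y_i, \qquad h_i = p_i\,(1 - p_i)

% Sums over the instances that fall in leaf j
G_j = \sum_{i \in R_j} g_i, \qquad H_j = \sum_{i \in R_j} h_i

% Optimal leaf weight, and gain A of splitting a node into left (L) and right (R)
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
A = \frac{1}{2}\left[
      \frac{G_L^2}{H_L + \lambda}
    + \frac{G_R^2}{H_R + \lambda}
    - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
    \right] - \gamma

% Base learner and additive update used in Algorithm 4
b(x) = \sum_{j=1}^{T} w_j^{*}\, \mathbf{1}\,[x \in R_j],
\qquad
f_k(x) = f_{k-1}(x) + b(x)
```

A split is kept only when its gain $A$ is positive, which is how a larger $\gamma$ restrains tree growth and, as Table IX notes, helps to prevent overfitting.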
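| Study | Technique | Dataset | Classifier | Obtained Results: Metrics | I/E | S/N | F/T | J/P |
|---|---|---|---|---|---|---|---|---|
| Bharadwaj, et al. (2018) | SVM, MLP and Naïve Bayes | MBTI_Kaggle | NB | Accuracy | 77% | 86.2% | 77.9% | 62.3% |
| | | | | Recall | | | | |
| | | | | Precision | | | | |
| | | | SVM | Accuracy | 84.9% | 88.4% | 87.0% | 78.8% |
| | | | | Recall | | | | |
| | | | | Precision | | | | |
| | | | MLP | Accuracy | 77.0% | 86.3% | 54.1% | 61.8% |
| | | | | Recall | | | | |
| | | | | Precision | | | | |
| Gjurković et al. (2018) | SVM, MLP and Logistic Regression | MBTI9k | SVM | Accuracy | | | | |
| | | | | F1-Score | 79.6% | 75.6% | 64.8% | 72.6% |
| | | | | Precision | | | | |
| | | | LR | Accuracy | | | | |
| | | | | F1-Score | 81.6% | 77.0% | 67.2% | 74.8% |
| | | | | Precision | | | | |
| | | | MLP | Accuracy | | | | |
| | | | | F1-Score | 82.8% | 79.2% | 64.8% | 72.6% |
| | | | | Precision | | | | |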
[31] N. Alsadhan and D. Skillicorn, "Estimating Personality from Social Media Posts," 2017 IEEE International Conference on Data Mining Workshops (ICDMW), New Orleans, LA, 2017, pp. 350-356.
[32] R. K. Hernandez and L. Scott, "Predicting Myers-Briggs type indicator with text," In 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017.
[33] B. Cui and C. Qi, "Survey Analysis of Machine Learning Methods for Natural Language Processing for MBTI Personality Type Prediction".
[34] D. Xue, L. Wu, Z. Hong, S. Guo, L. Gao et al, "Deep learning-based personality recognition from text posts of online social networks," Applied Intelligence, vol. 48, no. 11, pp. 4232-4246, 2018.
[35] Y. Yan, Y. Liu, M. Shyu and M. Chen, "Utilizing concept correlations for effective imbalanced data classification," Proceedings of the 2014 IEEE 15th International Conference on Information Reuse and Integration (IEEE IRI 2014), Redwood City, CA, 2014, pp. 561-568.
[36] S. Rezaei and X. Liu, "Deep Learning for Encrypted Traffic Classification: An Overview," in IEEE Communications Magazine, vol. 57, no. 5, pp. 76-81, May 2019.
[37] M. Z. Asghar, A. Khan, F. Khan and F. M. Kundi, "RIFT: A Rule Induction Framework for Twitter Sentiment Analysis," Arabian Journal for Science and Engineering, vol. 43, no. 2, pp. 857-877, 2018.
[38] A. Tripathy, A. Agrawal and S. K. Rath, "Classification of sentiment reviews using n-gram machine learning approach," Expert Systems with Applications, 57, pp. 117-126, 2016.
[39] L. H. Patil and M. Atique, "A novel approach for feature selection method TF-IDF in document clustering," 2013 3rd IEEE International Advance Computing Conference (IACC), Ghaziabad, 2013, pp. 858-862.
[40] M. C. Komisin and C. I. Guinn, "Identifying personality types using document classification methods," In Twenty-Fifth International FLAIRS Conference, 2012.
[41] D. Nielsen, "Tree Boosting With XGBoost-Why Does XGBoost Win Every Machine Learning Competition?" Master's thesis, NTNU, 2016.
[42] M. M. Tadesse, H. Lin, B. Xu and L. Yang, "Personality Predictions Based on User Behavior on the Facebook Social Media Platform," in IEEE Access, vol. 6, pp. 61959-61969, 2018.