Personality Classification From Online Text

This document summarizes a research paper that used machine learning techniques to classify personality traits from online text. Specifically: 1. An XGBoost classifier was used to predict four personality traits (introversion-extroversion, intuition-sensing, feeling-thinking, judging-perceiving) from input text, using a benchmark dataset from Kaggle. 2. The dataset exhibited class imbalance between personality traits, which can degrade classifier performance. Random oversampling was applied to minimize this skew and improve results. 3. Pre-processing including tokenization, stemming, stop word removal and TF-IDF feature selection was also used. 4. The XGBoost classifier achieved over 99% precision and


(IJACSA) International Journal of Advanced Computer Science and Applications,

Vol. 11, No. 3, 2020

Personality Classification from Online Text using Machine Learning Approach
Alam Sher Khan1, Hussain Ahmad2, Muhammad Zubair Asghar3*
Furqan Khan Saddozai4, Areeba Arif5, Hassan Ali Khalid6
Institute of Computing and Information Technology
Gomal University, D.I. Khan, Pakistan

*Corresponding Author

Abstract—Personality refers to the distinctive set of characteristics of a person that affect their habits, behaviours, attitude and pattern of thoughts. Text available on social networking sites provides an opportunity to recognize an individual’s personality traits automatically. In this proposed work, a machine learning technique, the XGBoost classifier, is used to predict four personality traits based on the Myers-Briggs Type Indicator (MBTI) model, namely Introversion-Extroversion (I-E), iNtuition-Sensing (N-S), Feeling-Thinking (F-T) and Judging-Perceiving (J-P), from input text. A publicly available benchmark dataset from Kaggle is used in the experiments. The skewness of the dataset is the main issue associated with the prior work, which is minimized by applying a re-sampling technique, namely random over-sampling, resulting in better performance. For more exploration of the personality from text, pre-processing techniques including tokenization, word stemming, stop words elimination and feature selection using TF-IDF are also exploited. This work provides the basis for developing a personality identification system which could assist organizations in recruiting and selecting appropriate personnel and in improving their business by knowing the personality and preferences of their customers. The results obtained by all classifiers across all personality traits are good enough; however, the performance of the XGBoost classifier is outstanding, achieving more than 99% precision and accuracy for different traits.

Keywords—Personality recognition; re-sampling; machine learning; XGBoost; class imbalanced; MBTI; social networks

I. INTRODUCTION
Personality of a person encircles every aspect of life. It describes the pattern of thinking, feeling and characteristics that predict and describe an individual’s behaviour, and it also influences daily life activities including emotions, preferences, motives and health [1].

The increasing use of social networking sites, such as Twitter and Facebook, has propelled the online community to share ideas, sentiments, opinions and emotions with each other, reflecting their attitude, behaviour and personality. A solid connection exists between an individual’s temperament and the behaviour they show on social networks in the form of comments or tweets [2].

Nowadays personality recognition from social networking sites has attracted the attention of researchers for developing automatic personality recognition systems. The core philosophy of such applications is based on different personality models, like the Big Five Factor Personality Model [3], the Myers-Briggs Type Indicator (MBTI) [4], and the DiSC Assessment [5].

The existing works on personality recognition from social media text are based on supervised machine learning techniques applied on benchmark datasets [6], [7], [8]. However, the major issue associated with the aforementioned studies is the skewness of the datasets, i.e. the presence of imbalanced classes with respect to different personality traits. This issue mainly contributes to the performance degradation of personality recognition systems.

To address the aforementioned issue, different techniques are available for minimizing the skewness of the dataset, like over-sampling, under-sampling and hybrid-sampling [9]. Such techniques, when applied on imbalanced datasets in different domains, have shown promising performance in terms of improved accuracy, recall, precision and F1-score [10].

In this work, a machine learning technique, namely XGBoost, is applied on the benchmark personality recognition dataset to classify the text into different personality traits such as Introversion-Extroversion (I-E), iNtuition-Sensing (N-S), Feeling-Thinking (F-T) and Judging-Perceiving (J-P). Furthermore, to improve the performance of the system, a re-sampling technique [11] is also utilized for minimizing the skewness of the dataset.

A. Problem Statement
Predicting personality from online text is a growing trend for researchers. Sufficient work has already been carried out on predicting personality from the input text [6, 7, 8]. However, more work is required to improve the performance of the existing personality recognition systems, whose degradation in most cases arises from the presence of imbalanced classes of personality traits. In the proposed work, a dataset balancing technique, called re-sampling, is used for balancing the personality recognition dataset, which may result in improved performance.

B. Research Questions
RQ.1: How to apply a supervised machine learning technique, namely the XGBoost classifier, for classifying personality traits from the input text?

RQ.2: How to apply a class balancing technique on the imbalanced classes of personality traits for performance improvement, and what is the efficiency of the proposed technique w.r.t. other machine learning techniques?


RQ.3: What is the efficiency of the proposed technique with respect to other baseline methods?

C. Aims and Objective
1) Aim: The aim of this work is to classify the personality traits of a user from the input text by applying a supervised machine learning technique, namely the XGBoost classifier, on the benchmark dataset of MBTI personality. This work is an enhancement of the prior work performed by [6].
2) Objectives
a) Applying a machine learning technique, namely the XGBoost classifier, for personality traits recognition from the input text.
b) Applying a re-sampling technique on the imbalanced classes of personality traits for improving the performance of the proposed system.
c) Evaluating the performance of the proposed model with respect to other machine learning techniques and baseline methods.

D. Significance of Study
Personality is a distinctive way of thinking, behaving and feeling. Personality plays a key role in someone’s orientation in various things like books, social media sites, music and movies [12].

The proposed work on personality recognition is an enhancement of the work performed by [6]. The proposed work is significant due to the following reasons: (i) the performance of the existing study is not efficient due to skewness, which is addressed in this work by applying a re-sampling technique on the imbalanced dataset; (ii) the proposed work also provides a basis for developing state-of-the-art applications for personality recognition, which could assist organizations in recruiting and selecting appropriate personnel and in improving their business by taking into account the personality and preferences of their customers.

II. RELATED WORK
A review of literature pertaining to personality recognition from text is presented in this section. The literature studies of this work are categorized into four sub-groups, namely, i) supervised learning techniques, ii) un-supervised machine learning techniques, iii) semi-supervised machine learning techniques, and iv) deep learning techniques. Fig. 1 depicts the classification sketch of the literature review on personality recognition from text.

Fig. 1. Categorization Sketch of Literature Review.

A. Supervised Learning Technique
Supervised learning algorithms learn from labelled data, in which the input features (independent variables) are paired with known outputs, in order to determine the output for unlabelled data. The studies given below are based on supervised learning methodologies.

A system is proposed by [6] for analysing the social media posts/tweets of a person and producing a personality profile accordingly. The work mainly emphasizes data collection, pre-processing methods and the machine learning algorithm for prediction. The feature vectors are constructed using different feature selection techniques such as Emolex, LIWC and TF-IDF. The obtained feature vectors are used during training and testing of different kinds of machine learning algorithms, like Neural Net, Naïve Bayes and SVM. SVM with all feature vectors achieved the best accuracy across all dimensions of the Myers-Briggs Type Indicator (MBTI) types. Further enhancement can be made by incorporating more state-of-the-art techniques.

An MBTI dataset derived from the Reddit social media network is introduced in [7] for personality prediction. A rich set of features is extracted, and benchmark models are evaluated for personality prediction. The classification is performed using SVM, Logistic Regression, and a multilayer perceptron (MLP). The classifier using all linguistic features together outperformed the others across all MBTI dimensions. However, further experimentation is required on more models for achieving more robust results. The major limitation is that the number of words in the posts is very large, which sometimes prevents the personality from being predicted accurately.

To predict personality from tweets, [8] proposed a model using 1.2 million tweets, which are annotated with MBTI type for personality and gender prediction. A logistic regression model is used to predict the four dimensions of MBTI. Binary word n-grams are used as features. This work showed improvement in the I-E and T-F dimensions, but no improvement in S-N and even a slight drop for P-J. In terms of personality prediction, linguistic features produce far better results. Incorporating an enhanced dataset may improve performance.

A system was developed to recognize user personality using the Big Five Factor personality model from tweets posted in English and Indonesian [13]. Different classifiers are applied on the MyPersonality dataset. The accuracy achieved by Naive Bayes (NB) is 60%, which is better than the accuracy of KNN (58%) and SVM (59%). Although this work did not improve on the accuracy of previous research (61%), it achieved the goal of predicting personality from Twitter-based messages. Using an extended dataset and implementing a semantic approach may improve the results.

A personality assessment/classification system based on the Big5 model was proposed for Bahasa Indonesian tweets [14]. Assessment is made on the user’s word choice. The machine learning classifiers, namely SVM and XGBoost, are implemented with different parameter settings, such as the use of n-gram minimum and n-gram weighted features, removal of stop words, and use of LDA. XGBoost performed better than the SVM under


the same data and same parameter setting. The limited dataset of only 359 instances for training and testing is the main drawback of their work.

Automatic identification of the Big Five Factor Personality Model was proposed by [15] using individual status text from Facebook. Various techniques like Multinomial NB (MNB), Logistic Regression (LR) and SMO for SVM are used for personality classification. MNB outperformed the other methods. Incorporating feature selection and more classifiers may enhance the performance.

Personality profiling based on different social networks, such as Twitter, Instagram and Foursquare, was performed by [16]. A large multisource dataset, namely NUS-MSS, is utilized for three different geographical regions. The data is evaluated for average accuracy using different machine learning classifiers. When the different data sources are concatenated in one feature vector, the classification performance is improved by more than 17%. The available dataset may be enriched from multiple social networking sites (SNS) through users’ cross-posting for better performance.

The performance of different ML classifiers is analysed to assess students’ personality based on their Twitter profiles by considering only the Extraversion trait of Big 5 [17]. Different machine learning algorithms, like Naïve Bayes, Simple Logistic, SMO, JRip, OneR, ZeroR, J48, Random Forest, Random Tree and AdaBoostM1, are applied in the WEKA platform. The efficiency of the classifiers is evaluated in terms of correctly classified instances, time taken, F-measure, etc. The OneR rule-based classifier shows the best performance among all, producing 84% classification accuracy. In future, all dimensions of Big5 can be considered for evaluation to get more insight.

The performance of different classifiers is evaluated by [18] using the MBTI model to predict a user’s personality from online text. Various ML classifiers, namely Naïve Bayes, SVM, LR and Random Forest, are used for estimation. Logistic Regression achieved 66.5% accuracy for all MBTI types, which is further improved by parameter tuning. Results may be further improved by using the XGBoost algorithm, which has won most Kaggle and other data science competitions.

The over-sampling and under-sampling techniques are compared by [11] for imbalanced datasets. Classifiers perform poorly when applied on imbalanced classes of a dataset. There are three approaches (data level, algorithmic level and hybrid) that are widely used for solving the class imbalance problem. The data-level method is experimented with in this study, and the result of the over-sampling method (SMOTE) is better than that of the under-sampling technique (RUS). More re-sampling techniques need to be evaluated in future.

Authors in [19] briefly discussed and explained the early research on the classification of personality from text, carried out on various social networking sites, such as Twitter, Blogger, Facebook and YouTube, on the available datasets. The methods, features, tools and results are also evaluated. Unavailability of datasets, lack of identification of features in certain languages, and difficulty in identifying the requisite pre-processing methods are the issues to be tackled. These issues can be addressed by developing methods for non-English languages, introducing more accurate machine learning algorithms, implementing other personality models, and including more feature selection for pre-processing of data.

Twitter users’ profiles are used for accurate classification of their personality traits using the Big5 model [20]. In total, 50 subjects with 2000 tweets per user are assessed for prediction. Users’ content is analysed using two psycholinguistic tools, namely LIWC and MRC. The performance evaluation is carried out using two regression models, namely ZeroR and GP. Results for the “openness” and “agreeableness” traits are similar to those of previous work, but less efficient results are shown for the other traits. An extended dataset may improve the results.

A connection has been established between the users of Twitter and their personality traits based on the Big5 model [21]. Due to the inaccessibility of the original tweets, a user’s personality is predicted from three parameters that are publicly available in their profiles, namely (i) followers, (ii) following, and (iii) listed count. Regression analysis is performed using M5 rules with 10-fold cross validation. The RMSE of predicted values against observed values is also measured. Results show that, based on the three counts, a user’s personality can be predicted accurately.

TwiSTy, a novel corpus of tweets for gender and personality prediction, has been presented by [22] using the MBTI type indicator. It covers six languages, namely Dutch, German, French, Italian, Portuguese and Spanish. Linear SVM is used as the classifier, and results are also tested on Logistic Regression. Binary features for character and word n-grams are utilized. The model outperformed other techniques for gender prediction. For personality prediction, it outperformed other techniques for two dimensions, I-E and T-F, but for S-N and J-P it did not show improvement. In future, the model can be trained enough to predict all four dimensions of MBTI efficiently.

Table I presents the summaries of the above cited studies for classification and prediction of users’ personality using supervised machine learning strategies.

B. Unsupervised Learning Approach
Unsupervised learning methods use only unlabelled training data (input variables), with no corresponding output variables to be predicted or estimated.

The Twitter data was annotated by [23] for 12 different linguistic features, and a correlation was established between users’ personality and writing style across different regions and different devices. Users with more than one tweet are considered for evaluation. It was observed that Twitter users are secure, unbiased and introvert as compared to the users posting from iPhone, BlackBerry, UberSocial and Facebook platforms. More Twitter data for classification may enhance the efficiency of the personality identification model.
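The data-level re-sampling discussed for [11], and the random over-sampling that this paper applies, can be illustrated with a minimal sketch. The function names, the seed parameter and the toy posts below are hypothetical and are not taken from the paper; SMOTE, which [11] evaluated, synthesizes new points by interpolation and is not shown here.

```python
import random
from collections import Counter, defaultdict


def _group_by_label(samples, labels):
    """Collect samples into one list per class label."""
    groups = defaultdict(list)
    for sample, label in zip(samples, labels):
        groups[label].append(sample)
    return groups


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen items of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = max(len(items) for items in groups.values())
    out_samples, out_labels = [], []
    for label, items in groups.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


def random_undersample(samples, labels, seed=0):
    """Randomly discard items of larger classes down to the smallest class size."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = min(len(items) for items in groups.values())
    out_samples, out_labels = [], []
    for label, items in groups.items():
        out_samples.extend(rng.sample(items, target))
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Toy, made-up data: 8 posts labelled "I" and 2 labelled "E".
posts = [f"post {i}" for i in range(10)]
traits = ["I"] * 8 + ["E"] * 2

over_x, over_y = random_oversample(posts, traits)
under_x, under_y = random_undersample(posts, traits)
print(Counter(over_y))   # class counts after over-sampling
print(Counter(under_y))  # class counts after under-sampling
```

Re-sampling is normally applied to the training split only; duplicating posts before the train/test split would place copies of the same post on both sides of the split.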


TABLE I. PERSONALITY RECOGNITION BASED WORK USING SUPERVISED MACHINE LEARNING APPROACH

SNo | Research | Goals and objectives | Strategy/Approach | Performance | Limitation and Future Work
1 | Bharadwaj et al. (2018) [6] | Personality prediction from online text | SVM, Neural Net and Naïve Bayes; TF-IDF, Emolex, LIWC and ConceptNet | SVM with all feature vectors achieved best accuracy across all dimensions of MBTI | Less weightage is given to the word’s gravity. Incorporating more state-of-the-art techniques in future will yield better result.
2 | Gjurković and Šnajder (2018) [7] | Personality classification of Reddit user’s posts. | SVM, Logistic Regression and MLP with linguistic features | MLP using all linguistic features together outperform across all MBTI dimensions | Demographic data like age and gender is not considered. Accuracy of T/F dichotomy may be improved in future.
3 | Plank and Hovy (2015) [8] | Personality and gender prediction from tweets. | Logistic regression model; binary word n-gram is used as a feature selection. | Accuracy for personality prediction: I/E = 72.5%, S/N = 77.5%, T/F = 61.2%, J/P = 55.4% | A lot of gap between general population personality types and this corpus personality types. Incorporating of enhanced dataset will improve the performance.
4 | Pratama and Sarno (2015) [13] | To recognize user personality using Big-5 personality model from tweets posted in English and Indonesian language | Supervised: KNN, NB, SVM | Accuracy: KNN = 58%, NB = 60%, SVM = 59% | Using extended dataset and implementing semantic approach, may improve the results.
5 | Ong et al. (2017b) [14] | A personality assessment based on Big5 Model for Bahasa Indonesian tweets using user’s words choice. | Supervised: XGBoost, SVM | Accuracy: XGBoost = 97.99%, SVM = 76.23% | Limited dataset of only 359 instances for training and testing is the main drawback of this work.
6 | Alam et al. (2013) [15] | Automatic identification of Big Five Factor Personality Model using individual status text from Facebook | Multinomial NB, Logistic Regression (LR) and SMO for SVM are used for personality classification | MNB = 61.79%, BLR = 58.34%, SMO = 59.98%; MNB outperformed other methods | Incorporating feature selection and more classifiers, may enhance the performance.
7 | Buraya et al. (2017) [16] | Multisource large dataset, namely NUS-MSS, is utilized for personality profiling. | Supervised | By concatenating different data sources in one feature vector, the classification performance is improved by more than 17%. | In future the available dataset may be enriched from multi (SNS) by user’s cross posting for better performance.
8 | Ngatirin et al. (2016) [17] | Using different ML classifiers to assess the student’s personality based on their Twitter profiles. | Naïve Bayes, Simple logistic, SMO, JRip, OneR, ZeroR, J48, Random Forest, Random Tree, and AdaBoostM1 | OneR with F1_Score = 0.837 outperform among all. | In future, all dimensions of Big5 can be considered for evaluation to get more insight.
9 | Chaudhary et al. (2018) [18] | To predict user’s personality from the online text using MBTI model. | Supervised learning methodology namely Naïve Bayes, SVM, LR and Random Forest, are used for estimation. | Accuracy: NB = 55.89%, LR = 66.59%, SVM = 65.44% | Lower accuracy is due using traditional classifiers. Deep learning approach will definitely improve the performance.
10 | Kaur and Gosain (2018) [11] | Comparing of oversampling and undersampling techniques for imbalance dataset. | Decision tree algorithm C4.5 is used. | Result of Over-sampling method (SMOTE) is better than under-sampling technique (RUS). | More re-sampling techniques need to be evaluated in future.
11 | Ong et al. (2017a) [19] | Classification of personality from text, carried out using various social networking sites. | Survey paper using supervised learning approach | Best result among all was attained by twitter with 91.9% accuracy using words frequency. | Unavailability of datasets, and lack of identification of features in certain languages, are the issues to be tackled. In future methods for non-English language may need to be developed.
12 | Golbeck et al. (2011) [20] | User’s Twitter profiles for accurate classification of their personality traits using Big5 model. | Two regression models, namely ZeroR and GP are used. | Accuracy: higher for Open = 75.5%, lower for Neuro = 42.8% | Extended dataset may improve the results.


TABLE I. (continued)

SNo | Research | Goals and objectives | Strategy/Approach | Performance | Limitation and Future Work
13 | Quercia et al. (2011) [21] | To establish a connection between the users of Twitter and their personality traits based on Big5 model. | Regression using M5 rules with 10-fold cross validation. | RMSE: O = 0.69, C = 0.76, E = 0.88, A = 0.79, N = 0.85 | In future user personality classification may be utilized in marketing and recommender system.
14 | Verhoeven et al. (2016) [22] | To predict gender and personality from a novel corpus of tweets, namely TwiSTy. | SVM and Logistic Regression along with word n-gram features. | F-score for Italian language: I/E = 77.78, S/N = 79.21, T/F = 52.13, J/P = 47.01 | In future, the model can be trained enough to predict all four dimensions of MBTI efficiently.

The purpose of the study carried out by [24] is to scrutinize group-based personality identification by utilizing an unsupervised trait learning methodology. The AdaWalk technique is utilized in this study. The outcomes show that, in terms of Micro-F1 score, the performance of AdaWalk is exceptional, with gains of about 7% for Wiki, 3% for Cora, and 8% for BlogCatalog. On the SoCE personality corpus, a 97.74% Macro-F1 score was achieved by this approach. The drawback of this work is that it depends entirely on the TF-IDF strategy; additionally, the created content networks are not an imitation of genuine social and interpersonal networks such as retweeting networks. A larger dataset may enhance the performance of the proposed work in future.

An unsupervised personality classification strategy was carried out by [25] to examine to what extent different personalities interact and behave on the social media site Twitter. Linguistic and statistical characteristics are utilized by this work and then tested on a data corpus annotated with a personality model using human judgment. The analysis suggests that psychoneurotic users comment more than secure ones and tend to develop longer chains of interaction.

An unsupervised machine learning methodology, namely K-Means, was applied by [26] to recognize network visitors’ traits and personality. This work is based on the quantifiable contents of the website. The obtained results show that this strategy can be utilized to predict the personality traits of website and network visitors more accurately. The proposed system may be enhanced in future by adding more elements associated with websites and a greater number of websites for better performance.

The author in [27] proposed a personality identification system using an unsupervised approach based on the Big-5 personality model. Different social media network sites are used for extraction and classification of users’ traits. Linguistic features are exploited to build the personality model. The system predicts personality for an input text and achieved reasonable results. However, an extended annotated corpus can boost the system’s performance.

TABLE II. PERSONALITY RECOGNITION BASED WORK USING UN-SUPERVISED MACHINE LEARNING APPROACH

SNo | Research | Goals and objectives | Strategy/Approach | Outcome | Limitation and Future Work
1 | Celli (2011) [23] | Personality classification from individual’s writing pattern | Un-supervised, score-based | Mean accuracy = 0.6651 and mean validity = 0.6994 | Additional tweets for personality recognition may improve the accuracy of this proposed model.
2 | Sun et al. (2019) [24] | Group-based personality identification | Un-supervised, AdaWalk | 97.74% (Macro-F1) | Large and increased dataset will definitely enhance the performance of the proposed work in future
3 | Celli and Rossi (2012) [25] | Impact of linguistic characteristics on personality traits. | Un-supervised, statistics-based | 78.29% (Accuracy) | More tweets are needed for efficient investigation
4 | Chishti and Sarrafzadeh (2015) [26] | To recognize the network visitors’ trait and personality | Un-supervised, K-Means | K = 10 is accurate score | System may be enhanced in future by adding more elements associated with websites and a greater number of websites for better performance
5 | Celli (2012) [27] | Impact of linguistic characteristics on personality traits using Big Five Model | Un-supervised, score-based | 81.43% (Accuracy) | Extended annotated corpus can boost the system’s performance
6 | Arnoux et al. (2017) [28] | Developing personality model to predict individual’s Big Five personality traits on much fewer data using Twitter. | Word-Embedding | 68.5% (Accuracy) | Findings of this method are based on English Twitter data, which may be extended to other languages
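The Micro-F1 and Macro-F1 scores quoted for [24] combine per-class results in different ways, which matters when classes are imbalanced. The sketch below uses the standard definitions (micro pools the counts over all classes, macro averages the per-class F1 scores); the function name and the toy labels are made up for illustration and do not come from any cited study.

```python
from collections import Counter


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(t, p, n):
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy, made-up example: a classifier that always predicts the majority class.
y_true = ["I"] * 8 + ["E"] * 2
y_pred = ["I"] * 10
micro, macro = micro_macro_f1(y_true, y_pred)
print(round(micro, 3), round(macro, 3))
```

In this toy case the micro average stays high because the majority class dominates the pooled counts, while the macro average is pulled down by the minority class that is never predicted.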


A model was proposed by [28] that requires eight times less data to predict an individual’s Big Five personality traits. The GloVe model is used as the word embedding to represent the words from user tweets. Firstly, the model is trained and then tested on given tweets. Further, the data is tested on three other combinations: (i) GloVe with RR, (ii) LIWC with GP, and (iii) 3-gram with GP. The proposed model performed better, with an average correlation of 0.33 over the Big-5 traits, which is far better than the baseline method. The findings of this method are based on English Twitter data, which may be extended to other languages. Similarly, the performance of the model can be examined with a small number of tweets.

Table II gives concise details of the above cited studies regarding identification of users’ personality and traits from textual data using the un-supervised machine learning approach.

C. Semi-Supervised Learning Approach
The studies carried out by using a combination of linguistic and lexicon features, supervised machine learning methodologies and different feature selection algorithms are referred to here as semi-supervised ML approaches. The following studies have utilized the semi-supervised and hybrid strategy.

A multilingual predictive model was proposed by [29], which identified users’ personality traits, age and gender based on their tweets. An SGD classifier with n-gram features is used for age and gender classification, while LIWC with a regressor model (ERCC) is used for personality prediction. An average accuracy of 68.5% has been achieved for recognition of users’ attributes in four different languages. However, author profiling can be enhanced by performing experiments in multiple languages.

A technique was devised to detect MBTI type personality traits from social media (Twitter) in the Bahasa Indonesian language [30]. Among 142 respondents, 97 users are selected, with an average of 2500 tweets per user. WEKA is used for building the classification and training set. Three approaches are used for prediction from the training set: i) machine learning, ii) lexicon-based, and iii) linguistic rules driven. Among all, Naïve Bayes outperformed the compared methods in terms of accuracy and time. Its accuracy for the I/E trait is 80%, while for S/N, T/F and J/P its accuracy is 60%. The lower accuracy of the linguistic rule-driven and lexicon-based approaches is due to the limited corpus in Bahasa Indonesia. It is observed that by increasing the training data set, accuracy may be improved.

A technique was proposed for personality prediction from social media-based text using word count [31]. It works for both the MBTI and Big5 personality models using 8 different languages. Four kinds of labelled corpora, for both Big5 and MBTI, are used for conducting the experiments. In each corpus, the 1000 most frequently used words are selected. Prediction accuracy for the “openness” trait of Big5 is higher across all corpora, while for MBTI, prediction accuracy for the S/N dimension is greater than for the other dichotomies. Using only word count for prediction is the main drawback of the proposed system, which may be overcome by introducing different feature selection and ML algorithms.

Details of the above quoted studies regarding personality classification using the semi-supervised machine learning approach are presented in Table III.

TABLE III. PERSONALITY RECOGNITION BASED WORK USING SEMI-SUPERVISED MACHINE LEARNING APPROACH

SNo | Research | Goals and objectives | Strategy/Approach | Performance | Limitation and Future Work
1 | Arroju et al. (2015) [29] | Multilingual predictive model is used to identify user’s personality traits, age and gender, based on their tweets. | SGD classifier with n-gram features; LIWC with regressor model (ERCC) | Accuracy = 68.5% | Accuracy may be improved by using different personality model. Similarly, author profiling can be further enhanced by performing experiments in multiple languages.
2 | Lukito et al. (2016) [30] | To recognize MBTI type personality traits from social media (Twitter) in Bahasa Indonesian language. | Machine learning, lexicon-based, and linguistic rules driven | I/E trait = 80%; S/N, T/F and J/P accuracy is 60% | Lower accuracy is due to limited corpus in Bahasa Indonesia. By increasing the training data set, accuracy may get improved.
3 | Alsadhan and Skillicorn (2017) [31] | Personality prediction from social media-based text using word count | Based on word count | Accuracy for “openness” trait of Big5 is higher, while for MBTI, accuracy for S/N dimension is greater than all other dichotomies. | Using only word count for prediction is the main drawback of the proposed system, which may be covered by introducing different features selection and ML algorithms.

465 | P a g e
www.ijacsa.thesai.org
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 11, No. 3, 2020

D. Deep Learning Strategy
Deep learning is a subcategory of machine learning (ML) in artificial intelligence (AI), where machines acquire knowledge and experience through training, without user interaction, in order to make decisions. Learning from experience on unlabeled and unstructured corpora, deep learning performs tasks repeatedly and refines its results after each iteration. The studies below summarize the prior work performed with deep learning.

A deep learning classifier was developed which takes text/tweets as input and predicts the MBTI type of the author using an MBTI dataset [32]. After different pre-processing techniques are applied, an embedding layer is used, where all lemmatized words are mapped to form a dictionary. Different RNN layers are investigated, but LSTM performed better than GRU and simple RNN. When classifying a user's full type, the accuracy is 0.208 (0.676 × 0.62 × 0.778 × 0.637), which is not good. The predictive efficiency of this work may be improved by increasing the number of posts per user. The model was also tested on a real-life example, 30,000 tweets by Donald Trump, for which it correctly predicted his actual MBTI type.

A model proposed by [33] takes a snippet of a post or text as input and classifies it into different personality types (such as INFP, ENTP and ISFJ). Different classification methods, namely Softmax as baseline, SVM, Naïve Bayes and deep learning, are implemented for performance evaluation. SVM outperformed NB and Softmax with 34% train and 33% test accuracy, while the deep learning model shows more improvement with 40% train and 38% test accuracy. However, the accuracy is still low, as it does not even reach 50 percent.

A personality classification system is proposed by [34] to recognize traits from online text using a deep learning methodology. An AttRCNN model with a hierarchical approach was suggested for this study, which is capable of learning complex and hidden semantic characteristics of a user's textual content. The results are very effective, showing that deep and complex semantic features are far better than the baseline features.

A deep learning model was suggested by [1] to classify personality traits using the Big Five personality model based on an essay dataset. A Convolutional Neural Network (CNN) is used in this work to detect personality traits from an input essay. Different pre-processing techniques, such as word n-grams, sentence, word and document level filtration, and extraction of different features, are performed for personality traits classification. Among all five traits, "OPN" achieved the highest accuracy, 62.68%, using different configurations of features. In future, more features need to be incorporated, and an LSTM recurrent network may be applied for better results.

Table IV outlines the works on automatic personality recognition using the deep learning methodology.

TABLE IV. PERSONALITY RECOGNITION BASED WORK USING DEEP LEARNING APPROACH

1. Hernandez and Scott (2017) [32]
   Goals and objectives: To predict and classify people into their MBTI types using their online textual contents.
   Strategy/Approach: Deep Learning (RNN, LSTM, GRU, BiLSTM).
   Outcome: Accuracy I/E = 67.6%, S/N = 62.0%, T/F = 77.8%, J/P = 63.7%.
   Limitation and Future Work: The predictive efficiency of this work may be improved by increasing the number of posts per user.

2. Cui and Qi (2018) [33]
   Goals and objectives: A model that takes snippet of post or text as input and classify it into different personality traits.
   Strategy/Approach: Deep Learning, Multi-layer LSTM.
   Outcome: Overall accuracy = 38%; I/E = 89.51%, S/N = 89.84%, T/F = 69.09%, J/P = 69.37%.
   Limitation and Future Work: In future more deep learning techniques with more word embedding features may be exploited. Using of unsupervised technique will also give better results.

3. Xue et al. (2018) [34]
   Goals and objectives: To recognize the personality traits from online text using deep learning methodology.
   Strategy/Approach: Deep Learning using AttRCNN Approach.
   Outcome: MAE OPN = 0.3577, CON = 0.4251, EXT = 0.4776, AGR = 0.3864, NEU = 0.4273.
   Limitation and Future Work: In future these deep and complex semantic features will be used as input of regression classifiers for more improvement in the performance.

4. Majumder et al. (2017) [1]
   Goals and objectives: To classify personality traits using Big Five personality model based on essay dataset.
   Strategy/Approach: Deep Learning (CNN).
   Outcome: Accuracy OPN = 62.68%, CON = 56.73%, EXT = 58.09%, AGR = 56.71%, NEU = 59.38%.
   Limitation and Future Work: In future more features need to be incorporated and LSTM recurrent network may be applied for better results.
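The whole-type accuracy quoted for [32] is the product of the four per-dichotomy accuracies. This holds only if the four binary classifiers err independently, which is an assumption of the sketch below; the variable and function names are illustrative.

```python
from math import prod

# Per-dichotomy accuracies reported for the classifier in [32].
per_trait_accuracy = {"I/E": 0.676, "S/N": 0.620, "T/F": 0.778, "J/P": 0.637}


def joint_accuracy(accuracies):
    """Chance that all four dichotomies are predicted correctly,
    assuming the four classifiers make independent errors."""
    return prod(accuracies.values())


print(round(joint_accuracy(per_trait_accuracy), 3))
```

The product comes to roughly 0.21, so only about one user in five would receive a fully correct four-letter type even though each individual dichotomy is predicted well above chance.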


III. METHODOLOGY
The working procedure of the proposed system is as follows: (i) data acquisition and re-sampling, (ii) pre-processing and feature selection, (iii) text-based personality classification using the MBTI model, (iv) applying XGBoost for personality classification, (v) comparing the efficiency of XGBoost with other classifiers, and (vi) applying different evaluation metrics.

A. Dataset Collection and Re-sampling
The publicly available benchmark dataset is acquired from Kaggle [6]. The data set comprises 8675 rows, where every row represents a unique user. Each user's last 50 social media posts are included along with that user's MBTI personality type (e.g. ENTP, ISFJ). As a result, a labelled data set comprising a total of 422845 records is obtained in the form of excerpts of text along with the user's MBTI type. Table V describes the details of the acquired dataset.

1) Re-Sampling: As pointed out by [6], the original dataset is skewed and unevenly distributed among all four dichotomies, as follows: I/E trait: I = 6664 and E = 1996; S/N trait: S = 7466 and N = 1194; T/F trait: T = 4685 and F = 3975; J/P trait: J = 5231 and P = 3429. Whenever an algorithm is applied to a skewed, unbalanced dataset, the outcome tends to diverge toward the larger class and the smaller classes are bypassed for prediction. This drawback of classification is known as the class imbalance problem (CIP) [11].

Therefore, this skew is balanced by a re-sampling technique [11]. As mentioned earlier, two traits are highly imbalanced, so a data level re-sampling approach is used for class balancing [9]. This bridged the gap between the classes of each dichotomy and resulted in efficient and predictable performance of the proposed system.

2) Data Level Re-Sampling Approach: Data manipulation sampling approaches focus on rescaling the training datasets to balance all class instances. At the data level, the two most popular techniques of class resizing are over-sampling and under-sampling.

Oversampling increases the number of instances in the minority class. The simplest form is random oversampling, which duplicates minority instances to improve the imbalance ratio. This duplication of minority class instances improved the performance of the machine learning classifier for personality traits prediction [11].

The under-sampling approach levels the class distribution by randomly removing majority class instances. This process is continued until the majority and minority class occurrences are balanced.

As illustrated in Fig. 2, the data level sampling-based methodologies, including over-sampling and under-sampling, have received considerable attention as a way to counter the impact of imbalanced datasets [35].

3) Training and Testing Data: In the proposed system, the data is divided into training, validation and testing datasets. Usually two datasets are required, one for building the model and the other for measuring the performance of the model. Here the training and validation sets are used for building the model, while the testing set is used to measure the performance of the proposed model [36]. Table VI shows sample tweets from the training dataset, while Table VII shows sample tweets from the test data.
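Random oversampling, as described in this section, can be sketched in a few lines of standard-library Python. The function name, the seed and the toy posts below are hypothetical and serve only to illustrate the idea; this is not the authors' implementation.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy I/E labels with a skew similar in kind to the dataset (made-up posts).
posts = ["post a", "post b", "post c", "post d", "post e"]
ie = ["I", "I", "I", "I", "E"]
bal_posts, bal_ie = random_oversample(posts, ie)
print(Counter(bal_ie))
```

Because oversampling creates exact copies, duplicating rows before the train/test split can place copies of the same post on both sides of the split. Applying the function to the training portion only is a common precaution.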

TABLE V. DETAIL OF DATASET

Dataset Name: MBTI_kaggle
Description: This dataset was acquired from Kaggle and was collected through the PersonalityCafe (PerC) platform. The members of PerC have a known MBTI personality type along with their tweets or text. The dataset comprises the personality types of 8675 PerC members.
Instances: 8675
Format: Text
Default Task: Personality Prediction
Updated: 2018
Origin: Kaggle
Size: 25 MB
Creator: Mitchell J

Fig. 2. Class Balancing using Undersampling and Oversampling.

TABLE VI. SAMPLE TWEETS FROM TRAINING DATASET

Personality Type | Tweets
ISTP | 'I'm only a mystery to myself.
INTP | Of course, to which I say I know; that's my blessing and my curse.
INFJ | Hello ENFJ7. Sorry to hear of your distress.
ENTP | 'I'm finding the lack of me in these posts very alarming.
ENTJ | Lol. Its not like our views were unsolicited. What a victim.
INFP | That more or less finds myself in agreement, honey cookie.
ESTP | Most things hands on. For me, music. I'm very tactile. I like to write too.


TABLE VII. SAMPLE TWEETS FROM TEST DATASET

Personality Type | Tweets
ENFP | Patience is a virtue. So proud that you guys are still together.
ISFJ | We are always willing to help those in need
ENTJ | I'm scared of failure, but also throwing up...take that for what you will.
INFP | That would be the best description for what I usually am.
ENFJ | You're right. Not sure why I didn't think of that before hahah
ESTP | I have 0 friends. I don't trust anybody.

When the dataset is divided into training, validation and testing data, only a portion of the dataset is used for training. A model trained on fewer instances will not perform as well, so the measured test error tends to overestimate the error of the algorithm fitted on the whole dataset.

To address this problem, a cross-validation technique is used.

4) Cross-validation: It is a statistical methodology that splits the data into subsets, trains the model on one subset and uses the other subset to assess the model's validity.

Cross validation comprises the following steps:

• Split the dataset into two subsets.
• Reserve one subset of the data.
• Train the model on the other subset of the data.
• Use the reserved subset for validation (test) purposes; if the model performs well on the validation set, this shows the effectiveness of the proposed model.

Cross validation is utilized to estimate the algorithm's predictive performance.

a) K fold cross validation: This strategy randomly partitions the data into k subsets of almost even size. The first fold is reserved for testing and all the remaining k-1 subsets are used for training the model. This process is continued until each cross-validation fold (of the k iterations) has been used as the testing set.

This procedure is repeated k times; therefore, the Mean Square Error is also obtained k times (from Mean Square Error 1 to Mean Square Error k). The k-fold cross-validation error is calculated by taking the mean of the Mean Square Error over the k folds. Fig. 3 explains the working procedure of k-fold cross validation.

Fig. 3. K-Fold Cross Validation Working Procedure.

Algorithm 1. Dividing the Data set in Train and Test sets.
# Division of data into training and testing sets:
Assign [] to X_train
Assign [] to Y_train
Assign [] to X_test
Assign [] to Y_test
Allocate Test_Size to 20% of n
Assign RNDM (0, n-1, Test_Size) to T_INDICES
For i = 0 to n-1
    Assign [] to temp
    For each WORD in Tf-Idf[i]
        Append (Tf-Idf[i][WORD]) to temp
    END FOR
    If i in T_INDICES then
        Append (temp) to X_test
        Append (type of tweet[i]) to Y_test
    Else
        Append (temp) to X_train
        Append (type of tweet[i]) to Y_train
    END IF
END FOR

B. Preprocessing and Feature Selection
Different pre-processing techniques and feature selection methods are applied to explore personality from text. These techniques include tokenization, removal of URLs, user mentions and hashtags, word stemming, stop word elimination, and feature selection using TF-IDF [28], [32].

1) Preprocessing: The following preprocessing steps, adopted from [37], are applied to the mbti_kaggle dataset before classification.

a) Tokenization: Tokenization is the procedure in which text is divided into small units (tokens), such as words. For this purpose, the Python-based NLTK tokenizer is used.

b) Dropping Stop Words: Stop words do not carry meaningful information. Python code is executed to remove these words using a pre-defined word list. For instance, "the", "is", "an" and so on are stop words.

c) Word stemming: It is a text normalization technique. Word stemming reduces inflected words to their root form. Stems are produced by removing the prefix or suffix attached to the root word.

2) Feature Selection: The following feature selection steps are carried out for use with the different machine learning classifiers.

a) CountVectorizer: Machine learning algorithms cannot process text or documents directly; the text must first be converted into a matrix of numbers. This conversion of a text document into a vector of numbers is called vectorization.

The count vector is a well-known encoding technique for making a word vector for a given document. CountVectorizer takes what is known as the Bag of Words approach. Each message or document is divided into tokens, and the number of times every token occurs in a message is counted.

CountVectorizer performs the following tasks:
• It tokenizes the whole text document.
• It builds a dictionary of known words.


• It encodes new documents using the known word vocabulary.

b) Term Frequency: It is a weight that represents how often a word or term occurs in a document.

c) Inverse Document Frequency: It is also a weighting scheme; it describes how common a word is across the whole collection of documents.

d) Term Frequency Inverse Document Frequency: The TF-IDF score balances the weight between very common words and less commonly used words. Term frequency measures the frequency of every token in a tweet; this frequency is then offset by the frequency of that token in the entire dataset. The TF-IDF value shows the significance of a token to a tweet within the whole dataset [38].

This measure is significant because it describes the importance of a term, rather than its raw frequency count [39].

The feature engineering module pseudocode is given in Algorithm 2.

Algorithm 2. Stepwise procedure for Feature Engineering
# CountVectorizer
Assign [] to CVectorizer
For Each tweet in Post Do
    Assign {} to Dict
    For Each word in tweet Do
        Assign Dict[word] + 1 to Dict[word]
    EndFor
    CVectorizer.Append (Dict)
EndFor
# Term Frequency
Assign CVectorizer to TF
Assign 0 to ROW
While (ROW <= N-1) Do
    Assign SUM (CVectorizer[ROW].values) to Nwords
    For Each W in CVectorizer[ROW]
        Assign CVectorizer[ROW][W] / Nwords to TF[ROW][W]
    EndFor
    Assign ROW + 1 to ROW
WhileEnd
# IDF calculation
Assign [] to IDF
While (Till the existence of ROW in TF) Do
    Assign [] to temp
    While (Till the existence of word in ROW) Do
        Assign 0 to Count
        For i from 0 to N-1 Do
            If TF[i][word] > 0 Then
                Count = Count + 1
            End If
        EndFor
        Assign LOG (N / Count) to temp[word]
    WhileEnd
    IDF.Append (temp)
WhileEnd
# TF-IDF
Assign [] to TF-IDF
For i from 0 to N-1 Do
    Assign [] to TEMP
    For Each W in TF[i], IDF[i]
        TEMP[W] = TF[i][W] * IDF[i][W]
    EndFor
    Append (TEMP) to TF-IDF
EndFor

C. Text-based Personality Classification Using MBTI Model
In the proposed work, a supervised learning approach is used for personality prediction. The model takes a snippet of a post or text as input and predicts the personality traits (I-E, N-S, T-F, J-P) according to the scanned text. The Myers-Briggs Type Indicator is used for classification and prediction [4]. This model categorizes an individual into 16 different personality types based on four dimensions, namely:
(i) Attitude → Extroversion vs Introversion: this dimension defines how an individual focuses their energy and attention, whether they are motivated externally by other people's judgement and perception, or motivated by their inner thoughts.
(ii) Information → Sensing vs iNtuition (S/N): this aspect illustrates how people perceive information. Observant (S) individuals rely on their five senses and solid observation, while intuitive individuals prefer creativity over constancy and trust their instincts.
(iii) Decision → Thinking vs Feeling (T/F): a person with the Thinking aspect exhibits logical behaviour in their decisions, while Feeling individuals are empathic and give priority to emotions over logic.
(iv) Tactics → Judging vs Perceiving (J/P): this dichotomy describes an individual's approach towards work, decision-making and planning. Judging individuals are highly organized in their thoughts and prefer planning over spontaneity. Perceiving individuals have a spontaneous and instinctive nature; they keep all their options open and are good at improvising [40].

D. Working Procedure of the System for Personality Traits Prediction
As depicted in Fig. 4, the proposed model is first trained by giving it both the labels (MBTI type) and the text (in the form of tweets). After training, the model is evaluated for efficiency. For better prediction, the dataset is split into three phases (training, validation and testing). The validation step helps to reduce overfitting.

The mbti_kaggle dataset has two columns, namely (i) type and (ii) posts. The type column holds one of the 16 MBTI personality types, such as INTP, ENTJ and INFJ. As we are interested in MBTI traits rather than types, four new columns were added to the original dataset through Python code for the purpose of traits determination. The modified dataset is shown in Table VIII.

Fig. 4. Working Procedure of the System.
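The count, TF, IDF and TF-IDF steps of Algorithm 2 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: posts are already tokenized, TF is the token count divided by the post length, IDF is log(N / document frequency), and the function name and toy posts are hypothetical.

```python
import math
from collections import Counter


def tf_idf(tokenized_posts):
    """Return one {token: weight} dict per post, with
    tf = count / post length and idf = log(N / document frequency)."""
    n_posts = len(tokenized_posts)
    counts = [Counter(tokens) for tokens in tokenized_posts]  # count vectors
    doc_freq = Counter(tok for c in counts for tok in c)      # posts containing each token
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({
            tok: (k / total) * math.log(n_posts / doc_freq[tok])
            for tok, k in c.items()
        })
    return weights


posts = [
    ["i", "like", "music"],
    ["i", "like", "to", "write"],
    ["i", "write", "music"],
]
scores = tf_idf(posts)
print(scores[1])
```

A token that occurs in every post, such as "i" here, gets an IDF of log(1) = 0 and therefore a weight of zero, which is the balancing effect described in the text. Library implementations such as scikit-learn's TfidfVectorizer apply a smoothed IDF and vector normalisation by default, so their values differ from this unsmoothed form.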


TABLE VIII. SAMPLE OF DATASET USED FOR EXPERIMENT

Type | Posts | I/E | S/N | F/T | J/P
ENTP | I'm scared of failure, but also throwing up...take that for what you will. | 1 | 0 | 0 | 1
INFJ | Just a funny comment from my side. A bit serious maybe. If you don't care about the functions | 0 | 0 | 1 | 0
INFP | I need a date with an INTJ! God dammit. Opps, wrong thread. lol | 0 | 0 | 1 | 1

Algorithm 3: Pseudo code of the entire System
Input: Set of tweets from mbti_kaggle dataset saved in CSV format
Output: Classification of input text into personality traits
Personality Traits: ["I-E", "S-N", "F-T", "J-P"]
ML-Classifier: ["XGBoost"]
Stop-word List: [There, it, on, into, under…….]
Start
// Inputting snippet of text
Assign Dataset text of post to Text
# Pre-processing steps
# Tokenization/segmentation
Assign Tokenize(Text) to Tokens
# Dropping of stop words
Set Post_text to Drop_stopwords(Tokens)
# Punctuation
# Data set splitting into train/test
Set X_train, Y_train, X_test, Y_test to Split (Post_text, test_size=20%)
# CountVectorizer(Post_text)
# Application of TF-IDF
# Classifier implementation
Set Model to MLClassifier
Set Classification to Model.fit(X_train, Y_train)
# Traits prediction
Set Trait_Prediction to Classification.Prediction(X_test)
# Accuracy
Set Accuracy to Accuracy (Trait_Prediction, Y_test)
# Recall score
Set Recall to Recall (Trait_Prediction, Y_test)
# Precision score
Set Precision to Precision (Trait_Prediction, Y_test)
# F1-score
Set F1-score to F1-score (Trait_Prediction, Y_test)
Assign (Accuracy, Recall, Precision, F1-score) to Personality Traits
Return (Personality Traits)

E. Applying XGBoost for Personality Classification
XGBoost belongs to the family of gradient boosting algorithms. It is used for classification and regression problems and makes its prediction by combining a set of weak decision trees.

Work has already been performed on personality assessment using supervised machine learning approaches [13, 17]. Here, the state-of-the-art XGBoost algorithm with optimized parameters is used for MBTI personality assessment [41]. The XGBoost classifier tends to produce better accuracy than other machine learning algorithms [41, 42]. The proposed work is the first attempt to predict personality from text using XGBoost as the classifier and MBTI as the personality model.

Algorithm 4: XGBoost Working Procedure
Data: Dataset and Hyperparameters
Initialize
for k = 1, 2, ..., M do
    Calculate g_k = [formula missing in source];
    Calculate h_k = [formula missing in source];
    Determine the structure by choosing splits with maximized gain A = [formula missing in source];
    Determine the leaf weights = [formula missing in source];
    Determine the base learner b(x) = [formula missing in source];
    Add trees f_k(x) = f_{k-1}(x) + b(x);
end
Result: f(x) = [formula missing in source]

F. Comparing the Efficiency of XGBoost with other Classifiers
The overall prediction performance and efficiency of the proposed system is examined by also applying other supervised machine learning classifiers. This comparison gives a true picture of the performance of the proposed classifier, XGBoost, relative to other machine learning algorithms and baseline methods with regard to personality prediction from the input text [13].

G. Evaluation Metrics
Evaluation metrics such as accuracy, precision, recall and F-measure describe the performance of a model. Therefore, different evaluation metrics are used to check the overall efficiency of the predictive model.

Algorithm 5: Pseudo code of the Performance Evaluation
# Performance
TC
Accuracy ← TC / N2
TP ← COUNT (Prediction = Positive AND Y_test = Positive)
TN ← COUNT (Prediction = Negative AND Y_test = Negative)
FP ← COUNT (Prediction = Positive AND Y_test = Negative)
FN ← COUNT (Prediction = Negative AND Y_test = Positive)
Precision ← TP / (TP + FP)
Recall ← TP / (TP + FN)
CFM ← []
CFM['TP'] ← TP
CFM['FN'] ← FN
CFM['FP'] ← FP
CFM['TN'] ← TN

IV. RESULTS AND DISCUSSIONS
This section presents the results produced by the proposed system by systematically answering the research questions raised.

A. Answer to RQ.1
To answer RQ1: “How to apply supervised machine learning technique, namely XGBoost classifier for classifying personality traits from the input text?”, the XGBoost classifier is applied to predict MBTI personality traits from excerpts of text. The fine-tuned parameter setting for XGBoost is presented in Table IX.
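As an illustration of how the parameter settings listed in Table IX might be passed to the XGBoost Python package, the sketch below collects them in a dictionary. The values are copied from Table IX. The names `XGB_PARAMS`, `TRAITS` and `build_trait_models` are hypothetical, and training one binary classifier per dichotomy is an assumption suggested by the `binary:logistic` objective rather than something the paper states.

```python
# Hyperparameters copied from Table IX (names of the constants are hypothetical).
XGB_PARAMS = {
    "learning_rate": 0.03,
    "colsample_bytree": 0.4,
    "scale_pos_weight": 1,
    "subsample": 0.8,
    "objective": "binary:logistic",
    "n_estimators": 1000,
    "reg_alpha": 0.3,
    "max_depth": 10,
    "gamma": 10,
}

TRAITS = ["I-E", "S-N", "F-T", "J-P"]


def build_trait_models(params=XGB_PARAMS):
    """Create one binary XGBoost classifier per MBTI dichotomy.

    Assumption: each dichotomy is modelled independently.
    Requires the xgboost package, imported lazily so the
    configuration can be inspected without it.
    """
    from xgboost import XGBClassifier

    return {trait: XGBClassifier(**params) for trait in TRAITS}
```

Each returned model would then be fitted on the TF-IDF features with the corresponding 0/1 trait column as the target.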


Table X shows the results of the XGBoost classifier with default parameter settings.

It is clear from Table XI that increasing or decreasing the values of different XGBoost parameters has a large effect on the text classification results.

B. Answer to RQ.2
While addressing RQ2: “How to apply a class balancing technique on the imbalanced classes of personality traits for performance improvement and What is the efficiency of the proposed technique w.r.t other machine learning techniques?”, an imbalanced dataset is considered first. An imbalanced dataset is a distribution problem that arises in classification when the instances are not equally divided among the classes.

Whenever an algorithm is applied to a skewed, unbalanced dataset, the outcome tends to diverge toward the larger class and the smaller classes are bypassed for prediction. This drawback of classification is known as the class imbalance problem [11].

Therefore, an attempt is made to balance this skew by a re-sampling technique [11]. As two traits are highly imbalanced, a data level re-sampling approach is used for class balancing [9].

TABLE IX. PARAMETER SETTING FOR XGBOOST

Parameters | Description
Learning_rate = 0.03 | It weights the contribution of each additional tree added to the boosting model.
Colsample_bytree = 0.4 | It corresponds to the fraction of features (columns) that will be used to train each tree.
Scale-pos_weight = 1 | It controls the balance between negative and positive classes.
Subsample = 0.8 | Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly collects half of the data instances to grow trees. This prevents overfitting.
Objective = 'binary:logistic' | It returns the predicted probability for binary classification.
n_estimators = 1000 | It represents the number of decision trees in the XGBoost classifier.
Reg_alpha = 0.3 | L1 regularization encourages sparsity (meaning pulling weights to 0).
Max-depth = 10 | It represents the size (depth) of each decision tree in the model. Overfitting can be controlled using this parameter.
Gamma = 10 | Its purpose is to control complexity. It represents how much the loss has to be reduced. It prevents overfitting.

TABLE X. RESULTS OF XGBOOST WITHOUT PARAMETER SETTINGS

Setting | Metrics | I-E | S-N | F-T | J-P
No parameter setting | Accuracy | 87.04 | 92.32 | 89.00 | 85.85
No parameter setting | Recall | 81.44 | 81.75 | 87.70 | 89.16
No parameter setting | Precision | 91.59 | 68.98 | 91.65 | 87.80
No parameter setting | F1_Score | 86.22 | 74.82 | 89.92 | 88.47

TABLE XI. RESULTS OF XGBOOST WITH DIFFERENT PARAMETER SETTINGS

Parameters: learning_rate = 0.01, n_estimators = 1000, max_depth = 5, subsample = 0.8, colsample_bytree = 1, gamma = 1, objective = 'binary:logistic', reg_alpha = 0.3, scale_pos_weight = 1
Metrics | I-E | S-N | F-T | J-P
Accuracy | 93.10 | 96.70 | 92.32 | 90.88
Recall | 89.56 | 96.24 | 92.07 | 94.24
Precision | 96.32 | 97.14 | 93.64 | 90.91
F1_Score | 92.82 | 96.68 | 92.85 | 92.55

Parameters: learning_rate = 0.01, n_estimators = 1000, max_depth = 6, subsample = 0.8, colsample_bytree = 1, gamma = 1, objective = 'binary:logistic', reg_alpha = 0.3, scale_pos_weight = 1
Metrics | I-E | S-N | F-T | J-P
Accuracy | 95.51 | 97.61 | 93.15 | 91.79
Recall | 93.39 | 97.21 | 92.91 | 94.77
Precision | 97.47 | 98.00 | 94.37 | 91.81
F1_Score | 95.39 | 97.60 | 93.64 | 93.27

Parameters: learning_rate = 0.01, n_estimators = 500, max_depth = 6, subsample = 0.8, colsample_bytree = 1, gamma = 1, objective = 'binary:logistic', reg_alpha = 0.3, scale_pos_weight = 1
Metrics | I-E | S-N | F-T | J-P
Accuracy | 90.95 | 94.51 | 91.20 | 89.84
Recall | 85.78 | 91.98 | 90.28 | 95.23
Precision | 95.48 | 96.88 | 93.28 | 88.69
F1_Score | 90.37 | 94.37 | 91.75 | 91.84

Parameters: learning_rate = 0.01, n_estimators = 1000, max_depth = 10, subsample = 0.8, colsample_bytree = 1, gamma = 1, objective = 'binary:logistic', reg_alpha = 0.3, scale_pos_weight = 1
Metrics | I-E | S-N | F-T | J-P
Accuracy | 99.37 | 99.92 | 94.55 | 95.53
Recall | 97.16 | 100 | 89.96 | 92.66
Precision | 100 | 99.50 | 100 | 100
F1_Score | 98.56 | 99.75 | 94.72 | 96.19

In this section the overall comparison of personality trait prediction is presented, using all evaluation metrics to determine the performance of the different classifiers. The results are reported in Table XII.

Different classifiers are applied to the same mbti_kaggle dataset, both with and without the re-sampling technique. The results reported in Table XII show that XGBoost obtained the highest score on all four evaluation metrics and across all the MBTI personality dimensions when the imbalanced dataset is used. However, Naïve Bayes and Random Forest performed poorly on the imbalanced dataset. It is therefore concluded from this experiment that applying classifiers to skewed data does not produce good results.

On the other hand, when the different classifiers are tested on the resampled dataset, improved results are obtained for all dimensions with all classifiers.
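The four evaluation metrics reported in these tables follow directly from the confusion counts. The sketch below computes them for made-up binary labels and then recomputes one F1 value from the precision and recall printed in Table XI. The function names and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def scores(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, F1) from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall, f1_from(precision, recall)


# Made-up labels: 1 = positive class of a dichotomy, 0 = negative class.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
print(scores(*confusion_counts(y_true, y_pred)))

# Consistency check against Table XI, first parameter setting, I-E column:
# precision 96.32 and recall 89.56 give the reported F1 of 92.82.
print(round(f1_from(96.32, 89.56), 2))
```

The same check applied to Table X (recall 81.44 and a third-row value of 91.59 for I-E) reproduces the reported F1 of 86.22, which indicates that the third row of that table is precision.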


The most accurate and precise algorithm for this proposed work is XGBoost. It achieved excellent results for all traits on all metrics, with a maximum accuracy of 99.92% for the S/N trait. Its results are highest for all four dimensions and across all metrics.

1) Why our Class balancing technique is better: With the class balancing technique applied, the results for all evaluation metrics and for all four personality traits are higher than those of the baseline work. In this dataset two dimensions, I/E and S/N, are highly imbalanced; therefore a class balancing technique is used for better prediction performance.

TABLE XII. COMPARISON OF DIFFERENT CLASSIFIERS PERFORMANCE USING RE-SAMPLE DATASET AND IMBALANCE DATASET

Columns 3 to 6: Without Re-sampling. Columns 7 to 10: With Re-Sampling.

Classifier | Metrics | I-E | S-N | F-T | J-P | I-E | S-N | F-T | J-P
KNN | Accuracy | 77.02 | 86.65 | 60.11 | 59.31 | 86.90 | 81.44 | 73.45 | 81.52
KNN | Recall | 20.34 | 15.5 | 89.89 | 77.17 | 86.44 | 98.00 | 93.82 | 89.19
KNN | Precision | 45.74 | 32.29 | 58.64 | 63.70 | 65.74 | 42.79 | 68.69 | 81.74
KNN | F1_Score | 28.16 | 20.94 | 70.98 | 69.79 | 74.51 | 59.57 | 79.31 | 85.30
Decision Tree | Accuracy | 78.69 | 82.01 | 70.42 | 69.33 | 99.34 | 99.93 | 90.85 | 91.30
Decision Tree | Recall | 53.31 | 38.25 | 71.94 | 75.11 | 97.00 | 99.50 | 83.14 | 85.72
Decision Tree | Precision | 51.84 | 36.34 | 73.11 | 74.68 | 100 | 100 | 100 | 100
Decision Tree | F1_Score | 52.56 | 37.27 | 72.52 | 74.89 | 98.48 | 99.75 | 90.79 | 92.31
Random Forest | Accuracy | 77.93 | 86.03 | 74.89 | 64.90 | 98.36 | 99.45 | 82.15 | 91.62
Random Forest | Recall | 00 | 0 | 84.49 | 97.7 | 92.59 | 98.94 | 74.07 | 86.24
Random Forest | Precision | 1 | 0 | 73.31 | 63.84 | 100 | 100 | 100 | 100
Random Forest | F1_Score | 00 | 0 | 78.50 | 77.22 | 96.15 | 99.44 | 85.10 | 92.61
MLP | Accuracy | 83.83 | 88.40 | 83.41 | 75.86 | 99.27 | 99.93 | 94.52 | 92.18
MLP | Recall | 40.37 | 22.0 | 84.68 | 86.46 | 96.69 | 99.59 | 89.90 | 87.91
MLP | Precision | 83.83 | 88.40 | 83.41 | 75.86 | 100 | 100 | 100 | 81.906
MLP | F1_Score | 40.37 | 22.0 | 84.68 | 86.46 | 98.32 | 99.75 | 88.89 | 93.14
SVM | Accuracy | 85.54 | 88.68 | 85.02 | 78.62 | 95.94 | 98.08 | 92.63 | 91.37
SVM | Recall | 43.69 | 22.75 | 85.64 | 90.36 | 91.32 | 97.00 | 89.45 | 91.11
SVM | Precision | 82.93 | 85.84 | 86.59 | 78.01 | 90.28 | 90.02 | 96.73 | 94.53
SVM | F1_Score | 57.23 | 35.96 | 86.12 | 83.74 | 90.69 | 93.38 | 92.95 | 92.79
MNB | Accuracy | 77.86 | 86.03 | 54.63 | 60.92 | 79.32 | 88.82 | 84.04 | 60.11
MNB | Recall | 0 | 0 | 99.93 | 100 | 6.78 | 20.25 | 73.18 | 100
MNB | Precision | 0 | 0 | 54.47 | 60.91 | 97.73 | 98.78 | 96.68 | 60.11
MNB | F1_Score | 0 | 0 | 70.51 | 75.71 | 12.66 | 33.61 | 83.25 | 75.09
XGBoost | Accuracy | 86.52 | 89.21 | 83.16 | 80.82 | 99.37 | 99.92 | 94.55 | 95.53
XGBoost | Recall | 52.68 | 31.5 | 84.04 | 89.90 | 97.16 | 100 | 89.96 | 92.66
XGBoost | Precision | 79.52 | 78.26 | 84.80 | 80.78 | 100 | 99.50 | 100 | 100
XGBoost | F1_Score | 63.38 | 44.92 | 84.42 | 85.10 | 98.56 | 99.75 | 94.72 | 96.19
Logistic Reg | Accuracy | 82.47 | 86.48 | 84.32 | 76.63 | 92.80 | 96.09 | 88.96 | 88.44
Logistic Reg | Recall | 25.86 | 4.5 | 86.35 | 93.52 | 85.33 | 90.25 | 85.28 | 92.14
Logistic Reg | Precision | 83.67 | 78.26 | 84.99 | 74.57 | 82.72 | 83.18 | 93.90 | 89.23
Logistic Reg | F1_Score | 39.51 | 8.5 | 85.66 | 82.98 | 84.01 | 86.57 | 89.34 | 90.66
SGD | Accuracy | 85.26 | 90.29 | 85.19 | 79.36 | 94.31 | 97.42 | 91.86 | 90.99
SGD | Recall | 41.64 | 40.5 | 85.71 | 90.82 | 91.64 | 95.50 | 87.52 | 89.39
SGD | Precision | 83.54 | 80.19 | 86.83 | 78.61 | 84.08 | 87.21 | 97.21 | 95.53
SGD | F1_Score | 55.58 | 53.82 | 86.27 | 84.28 | 87.70 | 91.17 | 92.11 | 92.36


The KNN classifier gives overall low performance; however, its Recall for I/E and F/T is somewhat higher.

The outcome of the Decision Tree algorithm for the I/E and S/N traits is better than for the F/T and J/P traits.

Random Forest gives highest for all traits. However, for J/P the lowest Recall is obtained.

The Logistic Regression classifier produced strong results for all traits, but again for the J/P trait its accuracy and Precision are not up to the mark.

The results obtained by applying the Naïve Bayes classifier are comparatively better for the I/E and S/N traits.

When tested on the given dataset, the Support Vector Machine gives better and balanced results with respect to all traits. The SGD classifier shows remarkable performance for all four personality traits.

The MLP classifier achieved outstanding results for all four traits using the four metrics.

The XGBoost classifier has proven to be very good for classification problems. The results obtained using XGBoost are very balanced with respect to all personality traits.

A very large dataset, MBTI9k, acquired from Reddit is used for personality prediction [7]. The emphasis of this work is on extracting features and linguistic properties of different words; these features are then used to train various machine learning models such as Logistic Regression, SVM and MLP. Classifiers using the integration of all features together (LR_all and MLP_all) obtained better results for all traits. The overall worst results across all classifiers were obtained for the T/F dichotomy. The major limitation of this work is that the number of words in each post is very large, which leads to somewhat lower performance for all classifiers.

1) Proposed Work: In the proposed system, the same dataset is used as in [6]; however, a re-sampling technique is applied to it, and hence the results obtained for all personality traits are very good. In particular, XGBoost achieved the best score across all dimensions and all traits as compared to previous work. It is observed that the mbti_kaggle dataset is very skewed; therefore, when the oversampling technique is applied, the output is far better than in all previous works. Up to 99% accuracy for the I/E and S/N traits is achieved using the XGBoost classifier, while Bharadwaj [6] got 88% maximum accuracy for the S/N trait. Similarly, for T/F
and J/P proposed work results are promising and obtained
C. Answer to RQ.3 94.55% accuracy for T/F and 95.53% accuracy for J/P
To answer RQ3: “What is the efficiency of the proposed dimension using XGBoost. While in previous work MLP
technique with respect to other baseline methods.” This classifier achieved accuracy of 54.1% for T/F and 61.8% for
proposed model is compared with two baseline methods [6, 7]. J/P dimension. Therefore, it is clear that by using resampling
Classification performed by [6] for personality prediction technique excellent and improved results are obtained for all
using same mbti_kaggle dataset by applying three classifiers four dimensions. The results reported in Table XIII, describe
namely, (i) SVM, (ii) MLP and (iii) Naïve Bayes and got the comparison of proposed work with the baseline method.
accuracy upto 88.4%. Due to imbalance data the result of [6] 2) XGBoost with Outstanding Performance: XGBoost
is not up to the mark. The results show that SVM in belongs to the family of Gradient Boosting is a machine
collaboration with LIWC and TF-IDF feature vectors gave learning technique used for classification and regression
accurate prediction score for all four traits, while MLP with all
problems that produces a prediction from an ensemble of
features Vectors got maximum accuracy score for S/N trait
(90.45%) however its result for J/P trait is lower. Naïve bays weak decision trees.
also perform well for I/E and S/N traits but its performance for The main reason of using this algorithm is its accuracy,
T/F and J/P is very poor. The reason behind better accuracy for speed, efficiency, and feasibility. It’s a linear model and a tree
I/E and S/N dimensions and least performance for T/F and J/P learning algorithm that does parallel computations on a single
is due to class imbalance problem. machine. It also has extra features for doing cross validation
and computing feature importance.
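
The re-sampling step that the proposed work applies to the skewed mbti_kaggle dataset can be illustrated with a minimal sketch of random oversampling: minority-class samples are duplicated at random until every class is as large as the largest one. The function name `random_oversample`, the toy posts and the 8-to-2 class split are assumptions made for illustration; the paper does not give its implementation or its exact sampling settings.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until all classes are equally large."""
    rng = random.Random(seed)

    # Group the samples by class label.
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)

    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_class.values())

    out_texts, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Hypothetical toy data: 8 posts labelled "I" and 2 posts labelled "E".
texts = [f"post {i}" for i in range(10)]
labels = ["I"] * 8 + ["E"] * 2

bal_texts, bal_labels = random_oversample(texts, labels)
print(Counter(labels), "->", Counter(bal_labels))
```

Because oversampling only repeats existing posts, a common design choice is to apply it to the training portion alone so that copies of a post do not appear in both the training and the test data; whether the paper does so is not stated here.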


TABLE XIII. COMPARISON OF XGBOOST WITH BASELINE TECHNIQUE

                                                                                                   Obtained Results (%)
Study                     Technique                         Dataset      Classifier  Metric     I/E    S/N    F/T    J/P
Bharadwaj et al. (2018)   SVM, MLP and Naïve Bayes          MBTI_Kaggle  NB          Accuracy   77     86.2   77.9   62.3
                                                                                     Recall
                                                                                     Precision
                                                                         SVM         Accuracy   84.9   88.4   87.0   78.8
                                                                                     Recall
                                                                                     Precision
                                                                         MLP         Accuracy   77.0   86.3   54.1   61.8
                                                                                     Recall
                                                                                     Precision
Gjurković et al. (2018)   SVM, MLP and Logistic Regression  MBTI9k       SVM         Accuracy
                                                                                     F1-Score   79.6   75.6   64.8   72.6
                                                                                     Precision
                                                                         LR          Accuracy
                                                                                     F1-Score   81.6   77.0   67.2   74.8
                                                                                     Precision
                                                                         MLP         Accuracy
                                                                                     F1-Score   82.8   79.2   64.8   72.6
                                                                                     Precision
Proposed (our work)       XGBoost                           MBTI_Kaggle  XGBoost     Accuracy   99.37  99.92  94.55  95.53
                                                                                     Recall     97.16  100    89.96  92.66
                                                                                     Precision  100    99.50  100    100
                                                                                     F1-Score   98.56  99.75  94.72  96.19
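
The F1 score is the harmonic mean of precision and recall, so the F1-Score row reported for the proposed XGBoost model can be cross-checked against its Precision and Recall rows. The short sketch below does that arithmetic for the four traits; the helper name `f1_from` is hypothetical, and the numbers are copied from the table above.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for the proposed XGBoost model, per trait.
reported = {
    "I/E": (100.0, 97.16, 98.56),
    "S/N": (99.50, 100.0, 99.75),
    "F/T": (100.0, 89.96, 94.72),
    "J/P": (100.0, 92.66, 96.19),
}

for trait, (precision, recall, f1_reported) in reported.items():
    f1_computed = f1_from(precision, recall)
    print(f"{trait}: computed {f1_computed:.3f}, reported {f1_reported}")
```

The recomputed values agree with the reported ones to within about 0.01 percentage points, which is consistent with rounding of the published precision and recall figures.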
V. CONCLUSION AND FUTURE WORK
The central theme of this study is the application of different machine learning techniques to the benchmark MBTI personality dataset, mbti_kaggle, to classify text into personality traits: Introversion-Extroversion (I-E), iNtuition-Sensing (N-S), Feeling-Thinking (F-T) and Judging-Perceiving (J-P).

The Myers-Briggs Type Indicator (MBTI) model is used for text classification and personality trait recognition [4]. After applying class balancing techniques to the imbalanced classes, different machine learning classifiers, namely KNN, Decision Tree, Random Forest, MLP, Logistic Regression (LR), SVM, XGBoost, MNB and Stochastic Gradient Descent (SGD), were evaluated for identifying the personality traits. Evaluation metrics such as accuracy, precision, recall and F-score are used to analyze the overall efficiency of the predictive model. The results show that the scores achieved by all classifiers across all personality traits are good; however, the performance of the XGBoost classifier is outstanding. We obtained more than 99% precision and accuracy for the I/E and S/N traits and about 95% accuracy for the T/F and J/P dimensions. The KNN classifier, however, resulted in lower performance overall.

A. Constraints or Limitations
1) The MBTI model is examined for personality trait classification; other personality models, such as the Big Five Factor (BFF) and DiSC personality assessment models, are not investigated.
2) The textual data used in the proposed work for personality assessment is in English only; content in other languages is not examined.
3) Simple over-sampling and under-sampling techniques are used to balance the skewness of the dataset.
4) The dataset comes from only one platform, namely the personalitycafe forum, which may lead to biased results.
5) All the experiments conducted in this work are based on classical or traditional machine learning algorithms.
6) The textual content classified for personality trait identification belongs to only one site, Twitter; other social networking sites are ignored.
7) Only textual data is analysed for recognition of the user's personality traits in this proposed work.
8) Less weight is given to feature extraction in the classification of text; only the TF-IDF technique is utilized.
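
Since TF-IDF is the only feature technique the work relies on, a minimal sketch of one common TF-IDF variant may help: each term is weighted by its relative frequency in a document multiplied by the logarithm of the inverse document frequency. The function name `tf_idf`, the whitespace tokenizer and the three toy posts are assumptions for illustration only; library implementations such as scikit-learn's `TfidfVectorizer` use smoothed and normalized variants by default, so their numbers differ from this sketch.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy posts.
docs = ["i love quiet evenings", "i love big parties", "i plan every day"]
w = tf_idf(docs)
print(w[0])
```

A term that occurs in every post (here "i") receives weight zero, while a term unique to one post (here "quiet") receives the largest weight in that post, which is the property that makes TF-IDF useful for separating classes by vocabulary.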


B. Future Proposal
1) The predictive performance of the MBTI personality model needs to be compared with the Big Five Factor (BFF) model for better assessment of the traits.
2) Multilingual textual content, especially Urdu and Pashto textual data, can be examined for personality classification.
3) SMOTE (Synthetic Minority Over-sampling Technique) can be utilized as a class balancing method for more robust and reliable performance.
4) Labelled data may need to be collected from other platforms such as Reddit, using multiple benchmark datasets.
5) More experiments on personality recognition may be conducted using deep learning algorithms.
6) Posts and comments from other social networking sites, such as Facebook, need to be examined for automated personality trait inference.
7) Data available as images and videos on social networking sites can be explored for the task of personality trait identification.
8) More advanced feature selection approaches need to be exploited to enhance the proposed work.

REFERENCES
[1] N. Majumder, S. Poria, A. Gelbukh and E. Cambria, "Deep Learning-Based Document Modeling for Personality Detection from Text," in IEEE Intelligent Systems, vol. 32, no. 2, pp. 74-79, Mar.-Apr. 2017.
[2] D. Xue et al., "Personality Recognition on Social Media With Label Distribution Learning," in IEEE Access, vol. 5, pp. 13478-13488, 2017.
[3] L. R. Goldberg, "An alternative 'description of personality': the big-five factor structure," Journal of Personality and Social Psychology, vol. 59, no. 6, p. 1216, 1990.
[4] I. B. Myers, "The Myers-Briggs Type Indicator: Manual," 1962.
[5] D. Shaffer, M. Schwab-Stone and P. Fisher, "Preparation, field testing, interrater reliability and acceptability of the DIS-C," J Am Acad Child Adolesc Psychiatry, vol. 32, pp. 643-648, 1993.
[6] S. Bharadwaj, S. Sridhar, R. Choudhary and R. Srinath, "Persona Traits Identification based on Myers-Briggs Type Indicator (MBTI) - A Text Classification Approach," 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Bangalore, 2018, pp. 1076-1082.
[7] M. Gjurković and J. Šnajder, "Reddit: A Gold Mine for Personality Prediction," in Proceedings of the Second Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media, pp. 87-97, 2018.
[8] B. Plank and D. Hovy, "Personality traits on twitter—or—how to get 1,500 personality tests in a week," in Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 92-98, 2015.
[9] O. Loyola-González, J. F. Martínez-Trinidad, J. A. Carrasco-Ochoa and M. García-Borroto, "Study of the impact of resampling methods for contrast pattern-based classifiers in imbalanced databases," Neurocomputing, 175, pp. 935-947, 2016.
[10] A. More, "Survey of resampling techniques for improving classification performance in unbalanced datasets," 2016, arXiv preprint arXiv:1608.06048.
[11] P. Kaur and A. Gosain, "Comparing the Behavior of Oversampling and Undersampling Approach of Class Imbalance Learning by Combining Class Imbalance Problem with Noise," in ICT Based Innovations, pp. 23-30, Springer, Singapore, 2018.
[12] I. Cantador, I. Fernández-Tobías and A. Bellogín, "Relating personality types with user preferences in multiple entertainment domains," in CEUR workshop proceedings, Shlomo Berkovsky, 2013.
[13] B. Y. Pratama and R. Sarno, "Personality classification based on Twitter text using Naive Bayes, KNN and SVM," 2015 International Conference on Data and Software Engineering (ICoDSE), Yogyakarta, 2015, pp. 170-174.
[14] V. Ong et al., "Personality prediction based on Twitter information in Bahasa Indonesia," 2017 Federated Conference on Computer Science and Information Systems (FedCSIS), Prague, 2017, pp. 367-372.
[15] F. Alam, E. A. Stepanov and G. Riccardi, "Personality traits recognition on social network-facebook," WCPR (ICWSM-13), Cambridge, MA, USA, 2013.
[16] K. Buraya, A. Farseev, A. Filchenkov and T. S. Chua, "Towards User Personality Profiling from Multiple Social Networks," in AAAI, pp. 4909-4910, 2017.
[17] N. R. Ngatirin, Z. Zainol and T. L. C. Yoong, "A comparative study of different classifiers for automatic personality prediction," 2016 6th IEEE International Conference on Control System, Computing and Engineering (ICCSCE), Batu Ferringhi, 2016, pp. 435-440.
[18] S. Chaudhary, R. Sing, S. T. Hasan and I. Kaur, "A comparative Study of Different Classifiers for Myers-Brigg Personality Prediction Model," IRJET, vol. 05, pp. 1410-1413, 2018.
[19] V. Ong, A. D. Rahmanto, Williem and D. Suhartono, "Exploring Personality Prediction from Text on Social Media: A Literature Review," INTERNETWORKING INDONESIA, vol. 9, no. 1, pp. 65-70, 2017a.
[20] J. Golbeck, C. Robles, M. Edmondson and K. Turner, "Predicting Personality from Twitter," 2011 IEEE Third International Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third International Conference on Social Computing, Boston, MA, 2011, pp. 149-156.
[21] D. Quercia, M. Kosinski, D. Stillwell and J. Crowcroft, "Our Twitter Profiles, Our Selves: Predicting Personality with Twitter," 2011 IEEE Third International Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third International Conference on Social Computing, Boston, MA, 2011, pp. 180-185.
[22] B. Verhoeven, W. Daelemans and B. Plank, "Twisty: a multilingual twitter stylometry corpus for gender and personality profiling," in Proceedings of the 10th Annual Conference on Language Resources and Evaluation (LREC 2016), N. Calzolari et al., Eds., pp. 1-6, 2016.
[23] F. Celli, "Mining user personality in twitter," Language, Interaction and Computation CLIC, 2011.
[24] X. Sun, B. Liu, Q. Meng, J. Cao, J. Luo and H. Yin, "Group-level personality detection based on text generated networks," World Wide Web, pp. 1-20, 2019.
[25] F. Celli and L. Rossi, "The role of emotional stability in Twitter conversations," in Proceedings of the workshop on semantic analysis in social media, Association for Computational Linguistics, pp. 10-17, 2012.
[26] S. Chishti, X. Li and A. Sarrafzadeh, "Identify Website Personality by Using Unsupervised Learning Based on Quantitative Website Elements," in International Conference on Neural Information Processing, Springer, Cham, pp. 522-530, 2015.
[27] F. Celli, "Unsupervised personality recognition for social network sites," in Proc. of Sixth International Conference on Digital Society, 2012.
[28] P. H. Arnoux, A. Xu, N. Boyette, J. Mahmud, R. Akkiraju and V. Sinha, "25 Tweets to Know You: A New Model to Predict Personality with Social Media," 2017, arXiv preprint arXiv:1704.05513.
[29] M. Arroju, A. Hassan and G. Farnadi, "Age, gender and personality recognition using tweets in a multilingual setting," in 6th Conference and Labs of the Evaluation Forum (CLEF 2015): Experimental IR meets multilinguality, multimodality, and interaction, pp. 23-31, 2015.
[30] L. C. Lukito, A. Erwin, J. Purnama and W. Danoekoesoemo, "Social media user personality classification using computational linguistic," 2016 8th International Conference on Information Technology and Electrical Engineering (ICITEE), Yogyakarta, 2016, pp. 1-6.


[31] N. Alsadhan and D. Skillicorn, "Estimating Personality from Social Media Posts," 2017 IEEE International Conference on Data Mining Workshops (ICDMW), New Orleans, LA, 2017, pp. 350-356.
[32] R. K. Hernandez and L. Scott, "Predicting Myers-Briggs type indicator with text," in 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017.
[33] B. Cui and C. Qi, "Survey Analysis of Machine Learning Methods for Natural Language Processing for MBTI Personality Type Prediction."
[34] D. Xue, L. Wu, Z. Hong, S. Guo, L. Gao et al., "Deep learning-based personality recognition from text posts of online social networks," Applied Intelligence, vol. 48, no. 11, pp. 4232-4246, 2018.
[35] Y. Yan, Y. Liu, M. Shyu and M. Chen, "Utilizing concept correlations for effective imbalanced data classification," Proceedings of the 2014 IEEE 15th International Conference on Information Reuse and Integration (IEEE IRI 2014), Redwood City, CA, 2014, pp. 561-568.
[36] S. Rezaei and X. Liu, "Deep Learning for Encrypted Traffic Classification: An Overview," in IEEE Communications Magazine, vol. 57, no. 5, pp. 76-81, May 2019.
[37] M. Z. Asghar, A. Khan, F. Khan and F. M. Kundi, "RIFT: A Rule Induction Framework for Twitter Sentiment Analysis," Arabian Journal for Science and Engineering, vol. 43, no. 2, pp. 857-877, 2018.
[38] A. Tripathy, A. Agrawal and S. K. Rath, "Classification of sentiment reviews using n-gram machine learning approach," Expert Systems with Applications, 57, pp. 117-126, 2016.
[39] L. H. Patil and M. Atique, "A novel approach for feature selection method TF-IDF in document clustering," 2013 3rd IEEE International Advance Computing Conference (IACC), Ghaziabad, 2013, pp. 858-862.
[40] M. C. Komisin and C. I. Guinn, "Identifying personality types using document classification methods," in Twenty-Fifth International FLAIRS Conference, 2012.
[41] D. Nielsen, "Tree Boosting With XGBoost-Why Does XGBoost Win Every Machine Learning Competition? (Master's thesis, NTNU)," 2016.
[42] M. M. Tadesse, H. Lin, B. Xu and L. Yang, "Personality Predictions Based on User Behavior on the Facebook Social Media Platform," in IEEE Access, vol. 6, pp. 61959-61969, 2018.

