Article
MBTI Personality Prediction Using Machine Learning and
SMOTE for Balancing Data Based on Statement Sentences
Gregorius Ryan 1, Pricillia Katarina 1 and Derwin Suhartono 2,*
Abstract: The rise of social media as a platform for self-expression and self-understanding has led to
increased interest in using the Myers–Briggs Type Indicator (MBTI) to explore human personalities.
Despite this, more research is needed on how different word-embedding techniques, machine
learning algorithms, and imbalanced-data-handling techniques can improve the results of MBTI
personality-type predictions. Our research aimed to investigate the efficacy of these techniques by
utilizing the Word2Vec model to obtain a vector representation of words in the corpus data. We
implemented several machine learning approaches, including logistic regression, linear support
vector classification, stochastic gradient descent, random forest, the extreme gradient boosting
classifier, and the cat boosting classifier. In addition, we used the synthetic minority oversampling
technique (SMOTE) to address the issue of imbalanced data. The results showed that our approach
could achieve a relatively high F1 score (between 0.7383 and 0.8282), depending on the chosen model
for predicting and classifying MBTI personality. Furthermore, we found that using SMOTE could
improve the selected models’ performance (F1 score between 0.7553 and 0.8337), proving that the
machine learning approach integrated with Word2Vec and SMOTE could predict and classify MBTI
personality well, thus enhancing the understanding of MBTI.
Keywords: personality; Myers–Briggs Type Indicator (MBTI); natural language processing; machine learning; Word2Vec; SMOTE

1. Introduction

The COVID-19 pandemic has altered how people connect and react to one another. Over the past few years, this pandemic has triggered a significant surge in internet and social media usage. According to data from Statista.com, shown in Figure 1, the number of internet users worldwide in 2022 was estimated to reach 5.03 billion people, equivalent to 63.1% of the global population. Meanwhile, the number of social media users worldwide in 2022 was estimated to be around 4.7 billion, or 59% of the global population [1], with the average duration of social media usage in 2022 estimated to be 2 h and 45 min per day. These numbers will likely rise over time, with social media users anticipated to reach 5.85 billion by 2027 [2].

Social media platforms such as Facebook, YouTube, WhatsApp, Instagram, WeChat, and TikTok have become the most popular choices for activities in the virtual world [3]. The activities commonly performed on social media vary depending on the user's interests and personality type; they commonly include sharing information, communicating with friends, watching videos, creating content, and commenting. With the abundance of activities that can be carried out on social media, understanding someone's personality is necessary to ensure that the information or content spread on social media (whether created or received) can be tailored to users' interests and reach the right people.
Figure 1. Number of social media users worldwide from 2017 to 2027 (in billions) [2]. The asterisk sign "*" indicates the predicted number of people using social media in the corresponding year.
A personality is a set of traits or characteristics that determine how an individual thinks, feels, and acts. One of the most utilized psychological instruments for understanding and predicting human behavior is the Myers–Briggs Type Indicator (MBTI), a popular instrument for over 50 years that is now widely discussed on social media. Based on Jung's theory of psychological types (1971) [4], MBTI is a personality measurement model that outlines a person's preferences along four dimensions, where each distinct dimension describes the propensities of the individual [5]:
• Introvert (I)–Extrovert (E): This dimension measures how individuals react to their environment, whether they are oriented towards the outside (extrovert) or the inside (introvert).
• Intuition (N)–Sensing (S): This dimension measures how individuals process information, whether they rely more on information received through direct experience (sensing) or trust their instincts and imagination (intuition).
• Thinking (T)–Feeling (F): This dimension measures how individuals make decisions, whether they rely more on logic and analysis (thinking) or on emotions and feelings (feeling).
• Judgment (J)–Perception (P): This dimension measures how individuals manage their environment, whether they are more inclined to make plans and stick to their tasks (judging) or are more flexible and accepting of change (perceiving).
These four fundamental dimensions can be combined to create one of 16 possible personality types that describe individual personality traits [6]. MBTI has several applications in various fields, including career development, counseling, and relationship improvement [7]. However, like other personality measurement models, MBTI must be used cautiously, not as a diagnostic tool or for making vague generalizations about an individual's personality. Other personality measurement models include the Big Five Personality Traits, which categorize the human personality into five main domains (openness, conscientiousness, extraversion, agreeableness, and neuroticism) [8], and DISC, which classifies the human personality into four main domains in terms of work and social interactions (dominance, influence, steadiness, and conscientiousness) [9].
Some researchers have argued that the Big Five Personality Traits provide a more
comprehensive view of the human personality than MBTI and DISC [10,11]. However,
research on MBTI is still relevant and important, as the MBTI model offers a more specific
interpretation of an individual’s personality type and can help individuals understand
their preferences and how they interact with others [7]. It is also important to note that
each model has its strengths and weaknesses, and no single model is fully accurate or covers all
aspects of an individual's personality. This is because each person is unique and different
from everyone else. Therefore, it is important to use these models wisely and not view one
model as a universal solution to all personality problems.
Research on natural language processing (NLP) for predicting an individual’s MBTI
has also been a growing topic in recent years. Using word-embedding technologies and
machine learning approaches, NLP techniques can provide computation and extract infor-
mation from digital communication to identify, predict, and classify individuals into MBTI
personality types [12]. However, despite the growing interest in using these techniques
for MBTI predictions, some challenges still need to be addressed. Specifically, there is a
need for more research on how other word-embedding techniques, machine learning algo-
rithms, and imbalanced data-handling techniques can improve the results and reliability of
these predictions.
Word embedding is a computational technique that allows one to convert words or
phrases in textual form into numerical vectors to measure how strongly related the given
words are [13]. It is used to reduce the vector dimensionality of human communication and to
identify features associated with MBTI. Most existing MBTI research used TF-IDF as the
weighting technique in information retrieval to assess the relevance of words in a document
or corpus [14]. However, in this research, we used Word2Vec as a word-embedding
technique to represent words as vectors in a high-dimensional space and capture their
relationships with other words in the corpus [15].
In addition to the exploratory use of Word2Vec, this research provides several contri-
butions to the field of MBTI prediction. Firstly, we implemented various machine learning
models, including logistic regression (LR), linear support vector classification (LSVC),
stochastic gradient descent (SGD), random forest (RF), the extreme gradient boosting classi-
fier (XGBoost), and the cat boosting classifier (CatBoost), which are explained in Section 3.2,
to evaluate their effectiveness in predicting MBTI types based on the features identified
from the word-embedding method. Secondly, we addressed the imbalanced data issue
using SMOTE, which improved the performance of selected models. Finally, we conducted
a comprehensive comparison of the performance of each method used, offering insights
into the most suitable approach for MBTI prediction based on text data.
2. Related Works
This research was based on previous works classifying MBTI types. Researchers
in [7] performed MBTI personality prediction based on data obtained from social media
using XGBoost. Before the classification task, the processing started by cleaning and pre-
processing the raw data, i.e., through word removal (URLs and stop words) using NLTK,
and continued with lemmatization. The following step was vectorizing the processed text
by weighting each relevant piece of text using TF-IDF, finishing with the classification task
to make a prediction. The results showed that XGBoost achieved accuracies of 78.17% for I–E, 86.06% for N–S, 71.78% for F–T, and 65.70% for J–P.
In [16], researchers conducted MBTI personality prediction using K-means clustering and gradient boosting. The steps before classification consisted of data cleaning and preprocessing (removing URLs and MBTI profile strings, converting all text into lowercase, and lemmatization) and creating vector representations using TF-IDF. The results showed that by using K-means to form the clusters and XGBoost for hyperparameter tuning, the overall accuracy fell in the range of 85–90% for each dimension. Nevertheless, this research had some space for improvement: applying more sophisticated parameters, for example, raising the tree depth or increasing the number of iterations on a more balanced dataset, could have considerably enhanced the results.
In [17], the researchers performed MBTI personality prediction by comparing different machine learning techniques, namely support vector machine (SVM), the naïve Bayes classifier, and recurrent neural networks, implemented according to the cross-industry standard process for data mining (CRISP-DM), combined with the agile methodology. The results showed that recurrent neural networks (RNNs) with additional bidirectional long short-term memory (BI-LSTM) produced a higher score compared to naïve Bayes and SVM, with an overall accuracy of 49.75%.
The approach proposed in this research was to perform MBTI personality prediction using word embedding and several machine learning approaches, namely logistic regression (LR), linear support vector classification (LSVC), stochastic gradient descent (SGD), random forest (RF), the extreme gradient boosting classifier (XGBoost), and the cat boosting classifier (CatBoost).
3. Methodology

As shown in Figure 2, several steps had to be carried out to develop the model smoothly, thus achieving the goal of this research. These methods included understanding the dataset with various raw data analysis techniques; preparing the dataset (feature grouping, data cleaning, and data normalization); processing the dataset (tokenization and vectorization); creating and training the model with training data; improving the data (using SMOTE); and evaluating the model through comparisons based on a measurement metric (F1 score).
3.1. Dataset
This section provides an understanding of how the data used in this research were
managed and prepared before being used for model training and evaluation.
Figure 3. Distribution of the 16 types of MBTI personalities in the dataset used in this research.
3.1.2. Data Preparation

Four Dimensions

The MBTI type data could be divided into four different classes, namely Introvert (I)–Extrovert (E), Intuition (N)–Sensing (S), Thinking (T)–Feeling (F), and Judgment (J)–Perception (P). Below, we present the distribution of the data for each class.
The distribution of classes presented in Table 1 refers to the main characteristics of each class associated with the indicated MBTI type. This was useful for determining the size of the dataset that was used to classify the MBTI type data.
Table 1. MBTI type class distribution.
Data Cleaning
Data cleaning is a crucial step to eliminate unwanted information, improve data
quality, and remove noise. It is a process of detecting and correcting or eliminating errors
contained in data. Besides improving the data quality, in this research, the implementation
of data cleaning also reduced the noise that SMOTE generated. SMOTE can amplify data
noise if the original data contain mistakes or inconsistencies, since it creates synthetic data
by interpolating between existing datapoints, and any inaccuracies in the original data are
transferred to the synthetic data.
Many approaches can be adopted to minimize the noise in imbalanced data; for
example, the authors of [19] employed a hybrid fault detection and diagnosis (FDD)
framework with a signal processing method. This research used data preprocessing
and cleaning, one of the three leading solutions proposed in [19], to fix the problem during
FDD, which was executed before employing SMOTE to prevent data noise problems. The
data-cleaning actions that were implemented for our dataset were as follows:
• Converting letters to lowercase.
• Removing links.
• Removing punctuation.
• Removing stopwords.
By performing data cleaning, the appropriate data were easier to process. Lemmatiza-
tion was also performed to transform words in the data into primary forms. The lemmatizer
helped us to identify words that were related to each other.
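As a minimal sketch of these cleaning steps (our own illustration, not the authors' code; the function name clean_post and the sample post are hypothetical), the pipeline can be expressed with NLTK as follows:

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def clean_post(text: str) -> str:
    """Apply the cleaning steps listed above to one raw post."""
    text = text.lower()                                   # convert letters to lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # remove links
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]          # remove stopwords
    return " ".join(LEMMATIZER.lemmatize(t) for t in tokens)          # lemmatize to base forms

print(clean_post("She was reading https://example.com about the 16 MBTI types!"))
```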
where P(w) represents the probability of the word w; ∑_{c∈C} represents the sum over all context words c in the target word's context window; and P(w|c) represents the likelihood of the word w in context c [13].
Skip-gram is also a word-embedding method that involves encoding words into vector form. This method is the opposite of CBOW, as it uses a given word to guess the words around it [15]. The equation for Skip-gram is as follows:

P(w) = ∑_{c∈C} P(c|w) P(w) (2)

where P(w) represents the probability of the word w; ∑_{c∈C} represents the sum over all context words c in the target word's context window; and P(c|w) represents the likelihood of the word c that is close to the word w [13].

Figure 4. The difference in architecture between the CBOW and Skip-gram models for word embedding. The CBOW model takes several words and calculates the probability of the target word's occurrence, while the Skip-gram model takes the target word and tries to predict the occurrence of related words [15].

The process of word embedding using Word2Vec in this research was carried out by initializing the Word2Vec model using the gensim Python library with sentence, size, window, and min_count parameters. The sentence parameter was a set of sentences to be used to train the model, the size parameter set the vector size for each word, the window parameter specified the number of words to the left and right of the word to be examined, and the min_count parameter specified the minimum number of words required in the phrase.
We chose the CBOW model over the Skip-gram model, since CBOW could better represent frequent words and be trained quicker than Skip-gram [15]. After initialization was completed, the Word2Vec model was trained with 50 epochs and total_examples parameters. The epoch parameter determined how many times the model iterated through the training data, while the total_examples parameter set the total number of sentences to be processed. Afterwards, the model was used to generate a vector of a sentence with values from the pre-defined Word2Vec model, and a high-dimensional matrix could be created.
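A minimal sketch of this setup is shown below (our own illustration: the toy corpus and the values vector_size=100 and window=5 are placeholders, and averaging word vectors into a sentence vector is one common choice that the text does not pin down; note that gensim ≥ 4 renames the size parameter described above to vector_size):

```python
import numpy as np
from gensim.models import Word2Vec

# Toy tokenized corpus: one list of tokens per cleaned post.
sentences = [
    ["enjoy", "quiet", "evening", "reading", "book"],
    ["love", "party", "meeting", "new", "people"],
]

model = Word2Vec(
    sentences=sentences,
    vector_size=100,  # dimensionality of each word vector ("size" in gensim 3.x)
    window=5,         # context words considered to the left and right
    min_count=1,      # ignore words occurring fewer times than this
    sg=0,             # 0 = CBOW (used here), 1 = Skip-gram
)
model.train(sentences, total_examples=len(sentences), epochs=50)

def sentence_vector(tokens):
    """Average the word vectors of a post into one fixed-size feature vector."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([sentence_vector(s) for s in sentences])  # feature matrix
print(X.shape)  # (2, 100)
```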
3.2. Modeling
This section provides a general overview of the six machine learning models that were
used in the research. For each model, we briefly explain the basic concepts and how it
works, as well as provide some additional information.
y = wᵀx + b (4)

where y is the predicted class, wᵀ is the transposed weight vector, x is the feature vector, and b is the bias [26]. The prediction result is based on the sign produced by the equation, where positive values correspond to one class and negative values to the other class.
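As a brief, hypothetical sketch (not the authors' code), the sign-based rule in Equation (4) can be reproduced with scikit-learn's LinearSVC, whose decision_function returns wᵀx + b:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy feature vectors (standing in for averaged Word2Vec embeddings) and labels.
X = np.array([[0.2, 1.1], [0.4, 0.9], [1.5, -0.3], [1.7, -0.5]])
y = np.array([0, 0, 1, 1])

clf = LinearSVC().fit(X, y)

scores = clf.decision_function(X)  # w^T x + b for each sample
preds = (scores > 0).astype(int)   # the sign decides the class, as in Equation (4)
print(scores, preds)
```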
w_{t+1} = w_t − γ_t ∇_w Q(z_t, w_t) (5)

where w_t is the weight vector; γ_t is the learning rate; and ∇_w Q(z_t, w_t) is the gradient of the loss function with respect to the weights [32].
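To make the update rule concrete, below is a small NumPy sketch (ours; the squared-error loss, learning rate, and true weights are illustrative choices) of repeatedly applying Equation (5) to single examples z_t = (x_t, y_t):

```python
import numpy as np

def sgd_step(w, x_t, y_t, lr):
    """One update w_{t+1} = w_t - lr * grad_w Q(z_t, w_t) for squared error."""
    y_hat = w @ x_t                 # current prediction
    grad = 2 * (y_hat - y_t) * x_t  # gradient of (w.x - y)^2 with respect to w
    return w - lr * grad

rng = np.random.default_rng(0)
w = np.zeros(3)
for _ in range(100):                              # stream of random examples z_t
    x_t = rng.normal(size=3)
    y_t = x_t @ np.array([1.0, -2.0, 0.5])        # hidden true weights
    w = sgd_step(w, x_t, y_t, lr=0.05)
print(w)  # approximately [1.0, -2.0, 0.5]
```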
where P_t(y|x) represents the probability distribution of a specific tree, and x is a collection of
test samples [36]. Using random forest for prediction modeling has the advantage of being
able to handle large datasets with numerous predictor variables. However, in practical
applications, it is often necessary to reduce the number of predictors used for making
outcome predictions to improve the efficiency of the process [37].
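As an illustrative sketch (ours, not the authors' code), scikit-learn's RandomForestClassifier exposes this behavior directly: its predicted probabilities are the average of the individual trees' distributions P_t(y|x):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Forest probability = mean of the individual trees' distributions P_t(y|x).
per_tree = np.stack([tree.predict_proba(X[:5]) for tree in rf.estimators_])
print(np.allclose(per_tree.mean(axis=0), rf.predict_proba(X[:5])))  # True
```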
where L is the loss function that determines how big the error is between the actual target
yi and the prediction ŷi , and Ω is the regularization term that restricts the model from over-
fitting. Because XGBoost is created using multiple cores [39], and several hyperparameters
can be optimized, XGBoost can improve the model’s performance and speed by minimizing
overfitting, enhancing generalization performance, and shortening the computation time,
making it a popular algorithm in machine learning [40].
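A brief sketch of this in the xgboost library follows (our illustration; the hyperparameter values are placeholders, not those used in this research):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=200,   # number of boosted trees
    max_depth=4,        # tree depth limits model complexity
    learning_rate=0.1,  # shrinks each tree's contribution
    reg_lambda=1.0,     # L2 penalty, part of the regularization term Omega
    n_jobs=-1,          # train using multiple cores
)
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))
```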
3.2.6. CatBoost
CatBoost is a gradient boosting decision tree (GBDT) model developed by Yandex. It includes two significant algorithmic advancements compared to traditional GBDT:
• It utilizes a permutation-driven ordered boosting method instead of the conventional approach.
• It employs a unique categorical feature-processing algorithm.
These improvements were designed to address a specific type of target leakage in previous GBDT implementations, which could lead to inaccurate predictions [41,42].
The CatBoost model cannot be expressed with a single formula, as it is a complex machine learning algorithm that combines several techniques, such as gradient boosting, decision trees, and categorical feature handling. The algorithm builds small trees iteratively using gradient boosting to improve the model's accuracy by minimizing the expected loss [42], as shown in Equation (8) below:

h^t = argmin_{h∈H} E[(δL(y, F^{t−1})/δF^{t−1} − h)^2] ≈ argmin_{h∈H} (1/n) ∑_i (δL(y_i, F^{t−1}(x_i))/δF^{t−1} − h(x_i))^2 (8)

CatBoost is also designed to handle categorical features in a better way compared to other gradient boosting algorithms by utilizing modified target-based statistics that help to reduce the computational burden of processing categorical features [43]. CatBoost uses categorical encoding techniques such as one-hot encoding, target statistics encoding, and binning for categorical feature handling. This allows the algorithm to process categorical features and improve prediction accuracy efficiently [44]. Below is the equation to estimate the i-th categorical variable with the k-th element:

x̂_k^i = (∑_{x_j∈D_k} 1{x_j^i = x_k^i} · y_j + a·p) / (∑_{x_j∈D_k} 1{x_j^i = x_k^i} + a) (9)

where the parameter a must be greater than zero, and a frequently used value for p (the prior) is the average target value in the training dataset D. A comprehensive explanation of the CatBoost algorithm can be obtained from [42].

3.3. Data Balancing Using SMOTE and F1 Score Metric

This section provides a general explanation of using SMOTE to address data imbalance problems and using the F1 score as the evaluation metric in this research.

3.3.1. SMOTE

The synthetic minority oversampling technique (SMOTE) is an approach that uses "synthetic" instances to oversample the minority class to resolve unbalanced data. Using synthetic examples in "feature space" rather than "data space" means that SMOTE is conducted based on the value and characteristics of the data relationships instead of focusing on all datapoints. SMOTE works by injecting synthetic cases along the lines connecting any or all of the k-nearest neighbors of each minority-class sample; the neighbors are picked randomly from the k nearest neighbors based on the amount of oversampling needed [45].

3.3.2. F1 Score

The F1 score is a metric used to evaluate a classifier's performance by combining its precision and recall. It combines these two measures into a single statistic by taking the harmonic mean of the precision and recall values [46]. The F1 score is commonly used to compare the effectiveness of different classifiers:

F1 = 2 ∗ (P ∗ R)/(P + R) (10)

where P is precision, and R is recall.
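To make Sections 3.3.1 and 3.3.2 concrete, below is a minimal end-to-end sketch (our illustration; the toy dataset, class weights, and hyperparameters are placeholders, not this research's data) using imbalanced-learn and scikit-learn, where SMOTE is applied only to the training split so the test set keeps its original distribution:

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data standing in for the sentence-vector features (90% / 10%).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Oversample only the training split; synthetic points are interpolated
# between each minority sample and its k nearest minority neighbors.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X_tr, y_tr)

clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print(f1_score(y_te, clf.predict(X_te)))  # F1 = 2PR / (P + R), Equation (10)
```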
Furthermore, we improved the results for each model using SMOTE, a technique
to handle the imbalance of MBTI data in this research. SMOTE increased the number of
datapoints by generating new samples from existing ones. This technique helped to make
the dataset more balanced, which improved the model’s performance, as seen clearly from
the results in Table 3.
Table 4 shows that the LR model experienced an improvement from the previous score
of 0.8282 to a score of 0.8337, with dimension 3 (N/S) again obtaining the highest score at
0.8821. Furthermore, the results showed an increase in the scores for some dimensions and
a decrease in the scores for others with specific models. Overall, the results showed that
the LR model was better-suited for MBTI personality prediction using word embedding
and machine learning than the other models. The use of SMOTE also improved the results
significantly, further validating this technique’s effectiveness.
Based on the results of this research, we realized that many different methods and
dimensions could be used to assess the efficacy of a machine learning model for predicting
MBTI personality type. Previous research used either 4 or 16 dimensions and either combined machine learning with deep learning to obtain the optimum results or used machine learning alone, as in this research.
Research conducted by Amirhosseini and Kazemian [7] used the XGBoost method, and
then divided the data into four dimensions and yielded an average accuracy of 0.7543.
Mushtaq et al. [16] used the K-means clustering and XGBoost methods and divided the
data into four dimensions, yielding an average accuracy of 0.8630. Moreover, Ontoum
and Chan [17] used recurrent neural networks with BI-LSTM and divided the data
into 16 dimensions, yielding an average accuracy of 0.4975. According to these varied
results, the research conducted by Mushtaq et al. [16] yielded the highest values, though
the process and performance metrics differed. Our research process for predicting MBTI
used Word2Vec as a word-embedding technique and SMOTE as a technique to handle the
imbalanced data. Moreover, the metric we used was the F1 score, whereas the previous
research used accuracy as the primary metric. We chose the F1 score as the primary metric
rather than accuracy since, in this case, we were dealing with an imbalanced dataset, and
the F1 score considers both precision and recall, offering a more accurate estimate of a
model’s ability to accurately identify both positive and negative classes [46].
In sum, the LR model, with an F1 score of 0.8337 after the implementation of SMOTE,
along with the various data-handling techniques proposed in this research, could help other
researchers identify problems that might have been overlooked in previous or subsequent
research regarding personality prediction.
5. Conclusions
In this research, the prediction of MBTI personality types based on sentences was
performed using the Python programming language. The proposed method used in this
research involved Word2Vec embedding, SMOTE, and six machine learning classifiers
that we trained and tested individually to predict MBTI personality type. The results
showed that the best machine learning model for predicting MBTI type dimensions in this
research was logistic regression (LR), with an average F1 score of 0.8282. The employed
SMOTE technique also showed a better result, with the F1 score increasing to 0.8337, and
dimension 3 (N/S) had the highest score of 0.8821. The acceptable threshold for the F1
score varies depending on the application, but an F1 score close to 1 is generally considered
high for data classification. Therefore, this result was more favorable when compared to the
other models considered, showing that the proposed approach could be used to enhance
our understanding of MBTI and could be employed in various applications that require
personality classification.
In future works, we plan to enhance our research by incorporating other data sources
using more advanced machine learning algorithms and deep learning architectures, such
as convolutional neural networks (CNNs) [48] and recurrent neural networks (RNNs) [49],
to predict MBTI personality types more accurately. Furthermore, we plan to experiment
with different word-embedding techniques, such as global vectors for word representation
(GloVe) [50] and bidirectional encoder representations from transformers (BERT) [51], to
more accurately represent the semantic relationships between words. On top of this, we aim
to include information from other sources, such as social media data, to enrich our under-
standing of personality types. Finally, we believe that we can achieve even more accurate
results by incorporating recent advancements in natural language processing techniques
such as transformers. With these future research directions, we aim to achieve an even
better F1 score and provide a more comprehensive analysis of the MBTI personality types.
Author Contributions: Conceptualization, G.R. and P.K.; methodology, G.R. and P.K.; software,
G.R. and P.K.; validation, G.R. and P.K.; formal analysis, G.R. and P.K.; investigation, G.R. and P.K.;
resources, G.R. and P.K.; data curation, G.R. and P.K.; writing—original draft preparation, G.R. and
P.K.; writing—review and editing, G.R., P.K. and D.S.; visualization, G.R. and P.K.; supervision,
D.S.; project administration, D.S.; funding acquisition, D.S. All authors have read and agreed to the
published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The (MBTI) Myers–Briggs Personality Type Dataset is available from
Kaggle at https://www.kaggle.com/datasets/datasnaek/mbti-type (accessed on 20 November 2022).
Information 2023, 14, 217 13 of 15
Acknowledgments: The work was supported by Bina Nusantara University. The authors are also
profoundly grateful for the reviewers’ helpful comments and suggestions, which helped improve
the presentation.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
References
1. Petrosyan, A. Worldwide Digital Population July 2022. Statista. Available online: https://www.statista.com/statistics/617136
/digital-population-worldwide/ (accessed on 6 January 2023).
2. Dixon, S. Number of Social Media Users Worldwide 2017–2027. Statista. 2022. Available online: https://www.statista.com/
statistics/278414/number-of-worldwide-social-network-users/ (accessed on 6 January 2023).
3. Dixon, S. Global Social Networks Ranked by Number of Users 2022. Statista. 2022. Available online: https://www.statista.com/
statistics/272014/global-social-networks-ranked-by-number-of-users/ (accessed on 6 January 2023).
4. Myers, I.B.; Mccaulley, M.H. Manual, a Guide to the Development and Use of the Myers-Briggs Type Indicator; Consulting Psychologists
Press: Palo Alto, CA, USA, 1992.
5. The Myers & Briggs Foundation—MBTI® Basics. Available online: https://www.myersbriggs.org/my-mbti-personality-type/
mbti-basics/home.htm (accessed on 8 January 2023).
6. Varvel, T.; Adams, S.G. A Study of the Effect of the Myers Briggs Type Indicator. In Proceedings of the 2003 Annual Conference
Proceedings, Nashville, TN, USA, 22–25 June 2003. [CrossRef]
7. Amirhosseini, M.H.; Kazemian, H. Machine Learning Approach to Personality Type Prediction Based on the Myers–Briggs Type
Indicator®. Multimodal Technol. Interact. 2020, 4, 9. [CrossRef]
8. Ong, V.; Rahmanto, A.D.; Suhartono, D.; Nugroho, A.E.; Andangsari, E.W.; Suprayogi, M.N. Personality Prediction Based on
Twitter Information in Bahasa Indonesia. In Proceedings of the 2017 Federated Conference on Computer Science and Information
Systems, Prague, Czech Republic, 3–6 September 2017. [CrossRef]
9. DISC Profile. What Is DiSC® . Discprofile.com. 2021. Available online: https://www.discprofile.com/what-is-dis (accessed on 9
January 2023).
10. John, O.P.; Srivastava, S. The Big-Five Trait Taxonomy: History, Measurement, and Theoretical Perspectives; University of California:
Berkeley, CA, USA, 1999; pp. 102–138.
11. Tandera, T.; Suhartono, D.; Wongso, R.; Prasetio, Y.L. Personality Prediction System from Facebook Users. Procedia Comput. Sci.
2017, 116, 604–611. [CrossRef]
12. Santos, V.G.D.; Paraboni, I. Myers-Briggs Personality Classification from Social Media Text Using Pre-Trained Language Models.
JUCS—J. Univers. Comput. Sci. 2022, 28, 378–395. [CrossRef]
13. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed Representations of Words and Phrases and Their
Compositionality. arXiv 2013, arXiv:1310.4546. [CrossRef]
14. Aizawa, A. An Information-Theoretic Perspective of Tf–Idf Measures. Inf. Process. Manag. 2003, 39, 45–65. [CrossRef]
15. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013,
arXiv:1301.3781. [CrossRef]
16. Mushtaq, Z.; Ashraf, S.; Sabahat, N. Predicting MBTI Personality Type with K-Means Clustering and Gradient Boosting. In
Proceedings of the 2020 IEEE 23rd International Multitopic Conference (INMIC), Bahawalpur, Pakistan, 5–7 November 2020.
[CrossRef]
17. Ontoum, S.; Chan, J.H. Personality Type Based on Myers-Briggs Type Indicator with Text Posting Style by Using Traditional and
Deep Learning. arXiv 2022, arXiv:2201.08717. [CrossRef]
18. (MBTI) Myers-Briggs Personality Type Dataset. Available online: https://www.kaggle.com/datasets/datasnaek/mbti-type
(accessed on 20 November 2022).
19. Jalayer, M.; Kaboli, A.; Orsenigo, C.; Vercellis, C. Fault Detection and Diagnosis with Imbalanced and Noisy Data: A Hybrid
Framework for Rotating Machinery. Machines 2022, 10, 237. [CrossRef]
20. Loper, E.; Steven, B. NLTK: The Natural Language Toolkit. arXiv 2019, arXiv:cs/0205028. [CrossRef]
21. Sklearn.model_selection.train_test_split–Scikit-Learn 0.20.3 Documentation. 2018. Available online: https://scikit-learn.org/
stable/modules/generated/sklearn.model_selection.train_test_split.html (accessed on 10 January 2023).
22. Nick, T.G.; Campbell, K.M. Logistic Regression. In Topics in Biostatistics; Springer: Berlin/Heidelberg, Germany, 2007; pp. 273–301.
[CrossRef]
23. Hastie, T.; Tibshirani, R.; Friedman, J.H.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction;
Springer: New York, NY, USA, 2001.
24. Binary Logistic Regression—A Tutorial. 2021. Available online: https://digitaschools.com/binary-logistic-regression-
introduction/ (accessed on 10 January 2023).
25. Wong, G.Y.; Mason, W.M. The Hierarchical Logistic Regression Model for Multilevel Analysis. J. Am. Stat. Assoc. 1985, 80,
513–524. [CrossRef]
26. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. [CrossRef]
27. Zhang, W.; Yoshida, T.; Tang, X. Text Classification Based on Multi-Word with Support Vector Machine. Knowl. Based Syst. 2008,
21, 879–886. [CrossRef]
28. Suthaharan, S. Support Vector Machine. Mach. Learn. Model. Algorithms Big Data Classif. 2016, 36, 207–235. [CrossRef]
29. Platt, J. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines; Microsoft: Washington, DC,
USA, 1998.
30. Stochastic Gradient Descent—Scikit-Learn 0.23.2 Documentation. Available online: https://scikit-learn.org/stable/modules/
sgd.html (accessed on 11 January 2023).
31. Gaye, B.; Zhang, D.; Wulamu, A. Sentiment Classification for Employees Reviews Using Regression Vector-Stochastic Gradient
Descent Classifier (RV-SGDC). PeerJ Comput. Sci. 2021, 7, e712. [CrossRef]
32. Bottou, L. Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade, 2nd ed.; Springer: Berlin/Heidelberg,
Germany, 2012; pp. 421–436. [CrossRef]
33. IBM. What Is Random Forest?|IBM. Available online: https://www.ibm.com/topics/random-forest (accessed on 11 Jan-
uary 2023).
34. Biau, G.; Erwan, S. A Random Forest Guided Tour. TEST 2016, 25, 197–227. [CrossRef]
35. Liaw, A.; Matthew, W. Classification and regression by randomForest. R New 2022, 2, 18–22.
36. Jabeur, S.B.; Gharib, C.; Mefteh-Wali, S.; Arfi, W.B. CatBoost model and artificial intelligence techniques for corporate failure
prediction. Technol. Forecast. Soc. Chang. 2021, 166, 120658. [CrossRef]
37. Speiser, J.L.; Miller, M.E.; Tooze, J.; Ip, E. A Comparison of Random Forest Variable Selection Methods for Classification Prediction
Modeling. Expert Syst. Appl. 2019, 134, 93–101. [CrossRef]
38. Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232. [CrossRef]
39. Ramraj, S.; Uzir, N.; Sunil, R.; Banerjee, S. Experimenting XGBoost algorithm for prediction and classification of different datasets.
Int. J. Control. Theory Appl. 2016, 9, 651–662.
40. Chen, T.; Carlos, G. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining—KDD ’16, San Francisco, CA, USA, 13–17 August 2016. [CrossRef]
41. CatBoost—Amazon SageMaker. Available online: https://docs.aws.amazon.com/id_id/sagemaker/latest/dg/catboost.html
(accessed on 2 February 2023).
42. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased Boosting with Categorical Features.
arXiv 2019, arXiv:1706.09516. [CrossRef]
43. Hussain, S.; Mustafa, M.W.; Jumani, T.A.; Baloch, S.K.; Alotaibi, H.; Khan, I.; Khan, A. A Novel Feature Engineered-CatBoost-
Based Supervised Machine Learning Framework for Electricity Theft Detection. Energy Rep. 2021, 7, 4425–4436. [CrossRef]
44. Dorogush, A.V.; Ershov, V.; Gulin, A. CatBoost: Gradient Boosting with Categorical Features Support. arXiv 2018, arXiv:1810.11363.
[CrossRef]
45. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-Sampling Technique. J. Artif. Intell.
Res. 2002, 16, 321–357. [CrossRef]
46. Dalianis, H. Evaluation Metrics and Evaluation. In Clinical Text Mining; Springer: Berlin/Heidelberg, Germany, 2018; pp. 45–53.
[CrossRef]
47. Sklearn.metrics.f1_score—Scikit-Learn 0.21.2 Documentation. 2019. Available online: https://scikit-learn.org/stable/modules/
generated/sklearn.metrics.f1_score.html (accessed on 11 January 2023).
48. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86,
2278–2324. [CrossRef]
49. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning Representations by Back-Propagating Errors. Nature 1986, 323, 533–536.
[CrossRef]
50. Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference
on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; Available online: https://aclanthology.org/D14-1162.pdf (accessed on 11 January 2023).
51. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understand-
ing. arXiv 2018, arXiv:1810.04805. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.