Journal of Statistical Software: Plssem
Journal of Statistical Software: Plssem
Abstract
We provide a package called plssem that fits partial least squares structural equation
models, which is often considered an alternative to the commonly known covariance-based
structural equation modeling. plssem is developed in line with the algorithm provided by
Wold (1975) and Lohmöller (1989). To demonstrate its features, we present an empirical
application on the relationship between perception of self-attractiveness and two specific
types of motivations for working out using a real-life data set. In the paper we also show
that, in line with other software performing structural equation modeling, plssem can
be used for putting in relation single-item observed variables too and not only for latent
variable modeling.
Keywords: factor analysis, latent variables, partial least squares, PLS, PLS-PM, PLS-SEM,
path models, Stata, structural equation modeling, SEM.
1. Introduction
The traditional statistical techniques (e.g., linear regression, logistic regression, multilevel
regression, etc.) are used to estimate models representing the relationship between one or
more than one independent variable and a single dependent variable. The independent and
dependent variables in these models are all measured using single items such as income,
height, weight, length of education and so on. Following this reasoning, we can refer to
these traditional statistical approaches as single-equation techniques containing single-item
variables both on the left-hand side (dependent) and right-hand side (independent) of the
equation. Typically, these methods are employed in the social sciences to explain and predict
quantities of interest.
Structural equation modeling (SEM) too can be used for explanation and prediction purposes
in the social sciences. The difference, and accordingly the advantage of SEM over single-
2 plssem: Structural Equation Modeling with PLS in Stata
equation techniques, is that SEM allows for estimating the relationship between a number
of independent variables and more than one dependent variable at the same time. Further-
more, while the traditional techniques such as regression analysis lets one only use single-item
variables, SEM allows for use of multi-item independent and dependent variables.
As such, in a broader sense, we can refer to SEM as a simultaneous multiple-equation tech-
nique estimating models including single or/and multi-item variables on both sides of the
equations. This broader definition reflects also the reason why in the course of the past four
decades SEM has become probably the most popular statistical estimation technique in the
social sciences. The approach to incorporating the multi-item variables in SEM has basi-
cally led to the development of two different methods: covariance-based structural equation
modeling (COV-SEM) introduced by Jöreskog (1969), and variance-based structural equa-
tion modeling (VAR-SEM) proposed by Wold (1975). While in COV-SEM the paths between
common factors are examined, in VAR-SEM the paths between weighted composites (replac-
ing the common factors) are estimated. This implies that in COV-SEM, multi-item variables
are incorporated into the model using the factor analytic technique whereas in VAR-SEM
weighted composites are generated from multi-item variables.
In a nutshell, we can view COV-SEM as the factor-based and VAR-SEM as the component-
based structural equation modeling methods (Chin 1995). COV-SEM and VAR-SEM are
commonly referred to in the literature respectively as maximum likelihood SEM (ML-SEM;
see for example Bollen 1989; Kline 2016), which is typically associated with software packages
such as LISREL (Jöreskog and Sörbom 2015), EQS (Bentler 2008), AMOS (Arbuckle 2014)
or Mplus (Muthén and Muthén 2017), and partial least squares (PLS-SEM or PLS-PM; see
for example Esposito Vinzi, Trinchera, and Amato 2010), which are instead associated with
the software packages SmartPLS (Ringle, Wende, and Becker 2015) or XLSTAT (Addinsoft
2007).
Although there is an ongoing debate as to the strengths and weaknesses of COV-SEM and
PLS-SEM in the literature (see for example Rönkkö and Evermann 2013; Henseler, Dijkstra,
Sarstedt, Ringle, Diamantopoulos, Straub, Ketchen, Hair, Hult, and Calantone 2014), there
still appears to be a general consensus that these two approaches should be considered com-
plementary rather than alternatives to each other. In line with this observation, Hair, Hult,
Ringle, and Sarstedt (2017, p. 23) suggest PLS-SEM be used when:
For more details on the pros and cons of the PLS-SEM approach versus COV-SEM we refer
the reader to Hair et al. (2017).
In this paper we present the plssem package for Stata (StataCorp 2017). The aim of the
package is to provide an open-source implementation of the PLS-SEM methodology for Stata.
To the best of our knowledge, the only package currently available for fitting PLS-SEM
models in Stata is the user-contributed pls package developed by Rönkkö (2016). However, as
Journal of Statistical Software 3
indicated in the package’s documentation, pls is provided for educational purposes since it only
calculates composite variables and it does not produce any other output as well as it does
not allow for any further postestimation analysis of a PLS-SEM fitted model. Essentially,
we started from the pls code as a basis for the development of our plssem package, but
then we fully redesigned and enhanced it with numerous additional output and tools for
postestimation.
At the time of writing, PLS-SEM is not supported by any of the most popular commercial
statistical software packages like SAS (SAS Institute Inc. 2013) or SPSS (IBM Corporation
2017), which only support PLS regression1 . However, many open-source and commercial soft-
ware packages have been developed independently over the years for fitting PLS-SEM models.
Currently, the most widespread open-source implementations of the PLS-SEM methodology
are all for the R software (R Core Team 2019), in particular matrixpls (Rönkkö 2017), plspm
(Sanchez, Trinchera, and Russolillo 2017) and semPLS (Monecke and Leisch 2012). For
what regards the commercial packages, the most popular ones are SmartPLS and XLSTAT-
PLSPM2 . We now provide a brief overview for each of them. A further now dated comparison
of PLS path modeling software is also available in Temme, Kreis, and Hildebrandt (2010).
detection of latent classes by using the REBUS-PLS approach for uncovering unobserved
heterogeneity in PLS-SEM models (Trinchera 2007; Esposito Vinzi, Trinchera, Squil-
lacciotti, and Tenenhaus 2008). As for matrixpls, model specification occurs through
user-defined adjacency matrices. A book-length description of the package is provided
in Sanchez (2013).
semPLS in R: This is a further package developed by Monecke and Leisch (2012) for struc-
tural equation modeling with partial least squares in R. The plsm() function is used
to create a valid model specification (a so called ‘plsm’ object), while sempls() fits
the model. Models can be specified by providing the user-defined adjacency matrices.
Bootstrapping is available too by leveraging the boot package (Canty and Ripley 2017)
and the calculation of quality indices (R2 , Q2 , Dillon-Goldstein’s ρ, etc.) is performed
via specific methods. However, no method is provided in the package for dealing with
observed (e.g., MGA) or unobserved heterogeneity (e.g., REBUS-PLS). A distinctive
feature of semPLS is that it is possible to export ‘plsm’ objects for use with the popular
sem package (Fox, Nie, and Byrnes 2017). Similarly, it is possible to import model
specification created with SmartPLS with the function [Link]().
The plssem package for Stata presented in this manuscript includes the following features:
• Mediation analysis through estimation and inference (including bootstrap) for up to five
indirect effects.
Journal of Statistical Software 5
• Moderation analysis through the inclusion of interactions among latent variables in the
structural model specification; this provides an implementation of the so called product
indicator approach (Sanchez 2013, Section 7.3).
• Possibility to fit models that include equations with binary dependent variables. To
the best of our knowledge, none of the existing PLS-SEM software facilitates binary
dependent variable estimation using maximum likelihood.
• Multigroup analysis of outer loadings and path coefficients for dealing with observed
heterogeneity; in particular, it allows the comparison of an arbitrary number of groups
using either normal-based, bootstrap or permutation tests.
• A range of graphical and postestimation commands for representing and inspecting the
results of a fitted PLS-SEM model.
The plssem package is available through the Statistical Software Components (SSC) archive3 ,
often called the Boston College Archive, and it allows to fit various PLS-SEM models. To
install the package one needs to execute the command
which will download and copy all the ado, help and data files for the commands discussed
here4 .
The rest of the paper is organized as follows: In Section 2 we present the main technical aspects
of the PLS-SEM approach as well as the most common indicators discussed in the literature
for assessing the quality of a fitted model. Section 3 provides an introduction to the plssem
package. In particular, after discussing the general syntax, we provide a full description of
the available options. Moreover, we present the different postestimation commands one can
run after fitting a model and the objects that are saved during the estimation. These objects
can clearly be used in subsequent analyses. In Section 4 we show some empirical applications
of the PLS-SEM approach with the plssem package using two different data sets. Finally,
Section 5 provides some closing thoughts and our plans for the future releases of the package.
In the rest of the paper we adopt the same mathematical notation provided by Monecke and
Leisch (2012, pp. 9–13).
x1 λ11
λ12
x2 y1 β13
x7
x3 λ13 λ37
λ38
y3 x8
λ39
x4 w24 x9
w25
β23
x5 y2
x6 w26
The measurement model provides the relationships between latent variables (or constructs5 )
and the indicators they are defined by. The measurement part is represented in Figure 1 by
all arrows apart from those included in the dashed box. The example includes two reflective
(i.e., y 1 and y 3 ) and one formative (i.e., y 2 ) construct. The association between the reflective
constructs and the corresponding indicators (that is the arrows pointing from the constructs
to the indicators) is indicated in the picture by λ11 , λ12 , λ13 , λ37 , λ38 and λ39 , which are also
called outer loadings. The relationship between the formative construct and the corresponding
indicators (i.e., the arrows pointing from indicators to constructs) are denoted with w24 , w25
and w26 and are also referred to as outer weights. All indicators are congeneric in that none
of them loads on more than one construct decided a priori (Brown 2015). The measurement
model can be described by an adjacency matrix M whose entries mkj take the value one if
indicator xk belongs to the block that defines latent variable y j , and zero otherwise, with
k = 1, . . . , K and j = 1, . . . , J. The adjacency matrix of the measurement model for the
example shown in Figure 1 is provided in Table 1. Note that the matrix M does not convey
any information about whether a construct is measured in a reflective or formative way.
The structural model shows the relationships between latent variables themselves. For the
example shown in Figure 1 the structural model is represented by the arrows included in
the dashed-line box. Latent variables in the structural model that are used as predictors are
called exogenous, while those denoted as outcome variables are called endogenous. In our
5
In the SEM literature latent variables or constructs are often related to multi-item variables used in factor-
based SEM. However, as explained by Henseler et al. (2014, p. 3), one can also use these terms to refer to
multi-item variables used in component-based SEM.
Journal of Statistical Software 7
y1 y2 y3
x1 1 0 0
x2 1 0 0
x3 1 0 0
x4 0 1 0
x5 0 1 0
x6 0 1 0
x7 0 0 1
x8 0 0 1
x9 0 0 1
Table 1: Measurement model adjacency matrix M for the example shown in Figure 1. The
elements mkj of the matrix are set to one if indicator xk belongs to the block that defines
latent variable y j , and zero otherwise.
y1 y2 y3
y1 0 0 1
y2 0 0 1
y3 0 0 0
Table 2: Structural model adjacency matrix S for the example shown in Figure 1. The
elements sij of the matrix are set to one if the latent variable y i is a predecessor of the latent
variable y j in the model, and zero otherwise.
example there are two exogenous (y 1 and y 2 ) and one endogenous (y 3 ) latent variable. The
relationships among the latent variables are labeled using the corresponding path coefficients
(β13 and β23 ). The structural model can also be summarized by an adjacency matrix S
whose entries sij take the value one if the latent variable y i is a predecessor of the latent
variable y j in the model, and zero otherwise, with i, j = 1, . . . , J. The adjacency matrix of
the structural model for the example shown in Figure 1 is reported in Table 2. Note that
matrix S allows to recover the information about whether a latent variable is exogenous or
endogenous. More specifically, if the column corresponding to the latent variable y j contains
only zeros, that indicates that y j is exogenous. In other words, contrary to the matrix M
for the measurement model, S accounts for the directionality of the relationships among the
latent variables.
To sum up the description of a PLS-SEM model, the structural part is similar to a regression
model, while the measurement part resembles a factor or a principal component analysis. As
such, PLS-SEM can be viewed as an advanced multivariate technique facilitating these two
analyses at one go.
case. Using these scores, in the second stage measurement model parameters (weights/load-
ings) are estimated. In the same manner, in the third stage structural model parameters (path
coefficients) are finally estimated. The first stage is what makes PLS-SEM a novel method
in that the second and third stages, as it will be shown below, are about conducting a series
of regression analysis using the ordinary least squares method. To help the reader grasp the
whole process, we summarize the procedure for PLS-SEM estimation in Algorithm 1. We
provide now more details on each stage.
We now give a brief description of these steps. We will denote the data matrix including all
the indicators as X, and the block of indicators measuring the jth latent variable y j as X j .
Similarly, we indicate with Y the whole set of latent variables. As it is common in the SEM
and PLS-SEM literature, we assume that prior to starting the entire process all indicators are
standardized to have a mean zero and unit variance. Additionally, after each step the latent
variables are scaled likewise.
Y
c = XM ,
where M is the measurement model adjacency matrix presented in Section 2, such as that
reported in Table 1.
Step 1: Estimation of the inner weights. Inner weights are calculated for each latent
variable to reflect how strongly the other latent variables are connected to it. The three most
common schemes used for computing the inner model weights are the centroid scheme origi-
nally proposed by Wold (1982), and the factorial and path schemes introduced by Lohmöller
(1989). We provide below a brief description of these schemes assuming that we collect all
the inner weights in a matrix denoted as E.
Journal of Statistical Software 9
4: Scale the latent variables scores to have zero mean and unit variance.
5: Set the iteration counter to zero (i ← 0) and the maximum relative difference of the outer
weights δ to 1 (δ ← 1).
6: while δ ≥ tol and i < imax do
7: Estimate the inner weights using either Equations 1, 2 or 3 forming matrix E.
8: Compute the inner approximation of the latent variable scores as
Y
f=Y
cE.
9: Scale the latent variables scores to have zero mean and unit variance.
10: for j ← 1, J do
11: if y j is in the set of mode A latent variables then
12: Compute the outer weights as
−1
b>
w e>
j = y j y
ej e>
y j Xj.
15: end if
16: end for
17: Compute the outer approximation of the latent variable scores as
Y
c = XW
c,
22: for j ← 1, J do
23: if y j is in the set of mode A latent variables then
24: Compute the cross loadings as
cross
λ
b
j = COR(X, y
b j ).
28: end if
29: Compute the structural model parameters (i.e., the path coefficients) as
−1
β j b pred>
b = y
j b pred
y j b pred>
y j y
bj .
Centroid scheme: This scheme produces weights eij based on the sign of rij = COR(y i , y j ),
the empirical linear correlation coefficient between the latent variables y i and y j re-
sulting from the outer approximation (Step 4 below7 ), assuming they are neighbors. In
particular, if y i and y j are adjacent, the weight eij is set to +1 if the correlation is
positive and to −1 if the correlation is negative. If y i and y j are nonadjacent, eij is set
to 0. More formally, for i, j = 1, . . . , J,
(
sign (rij ) if cij = 1
eij = , (1)
0 otherwise
where cij is the (i, j)th element of the matrix C = S + S > , with S the adjacency
matrix of the structural model introduced in Section 2. Thus, C is a symmetric matrix
whose element cij takes value one if the latent variables y i and y j are neighbors in the
structural model, and zero otherwise.
Note that, as implied by Equation 1, correlations very close to zero may cause the
weights to take a non-zero value during the iterative process, which may lead to insta-
bility. Thus, the centroid scheme should be used when the indicators of a block (latent
variable) are strongly correlated to each other, otherwise the factorial scheme is usually
recommended (Esposito Vinzi et al. 2010).
Factorial scheme: In this scheme the correlation value between each pair of latent variables
7
At the first iteration of the algorithm the outer proxies of the latent variable scores correspond to the
initial values computed in Step 0.
Journal of Statistical Software 11
Path scheme: In this scheme two types of weight values are produced depending on the re-
lationship between the latent variables. When a latent variable, say y i , is “causing”
another latent variable y j (the so called successor), the weight value corresponds to
the linear correlation coefficient rij = COR(y i , y j ). If instead the latent variable y i is
“caused” by another latent variable y j (so called predecessor), the weight is determined
using a multiple regression model. In particular, the estimated linear regression coeffi-
cient on the predecessor will then be used as the weight. More formally, according to
the path scheme the weights are computed as follows
γ̂j
for j ∈ y pred
i
eij = rij for j ∈ y succ
i , (3)
0 otherwise
where y pred
i indicates the set of predecessors of y i and y succ
i represents the corresponding
set of successors. The coefficient γ̂j corresponds to the estimate of the y j coefficient in
the linear regression model
y i = y pred
i γ + εi ,
assuming y j belongs to the predecessor set of y i .
Step 2: Inner approximation of the latent variable scores. Here, we update the
latent variable scores yb 1, . . . , y
b J obtained in the previous iteration with new ones, y
e 1, . . . , y
eJ ,
which are computed as a weighted sum of their respective adjacent latent variables. More
specifically, the inner approximation of the latent variable scores are computed as
Y
f=Y
cE, (4)
where the matrix E contains the inner weights as obtained from Step 1.
Step 3: Estimation of the outer loadings/weights. So far we did not make any dis-
tinction between reflective and formative measures. On the contrary, we now need to take this
difference into account to properly estimate the weights/loadings of the measurement model.
That is, we need to recalculate the latent variables scores obtained from Step 2 using yet
another weighting update. The new weights are called loadings when the latent variables are
modeled as reflective and just weights when the latent variables are modeled as formative. In
the classical algorithm, there are two possible choices for updating the outer weights, which
are usually referred to as mode A and mode B. In the marketing literature, mode A refers to
a reflective, while mode B refers to a formative measure.
In mode A, we regress each of the indicators onto the corresponding latent variable scores (i.e.,
using the latent variables included in the indicator block as the predictors of the regression
12 plssem: Structural Equation Modeling with PLS in Stata
model). Since the latent variables from Step 2 are standardized, the regression coefficients do
correspond to linear correlation coefficients, that is
−1
b>
w e>
j = y j y
ej e>
y j X j = COR(y
e j , X j ). (5)
In mode B, we regress each latent variable against the indicators in its block. The weights
will then correspond to the partial coefficients, that is
−1
b j = X>
w j Xj X> e j = VAR(X j )−1 COR(X j , y
j y e j ). (6)
Step 4: Outer approximation of the latent variable scores. In this step, we estimate
the latent variable scores using the weights w
b j obtained from Step 3 above by computing
Y
c = XW
c,
where W
c is the matrix that collects all the weights w
b j , that is
b1 0 ···
w 0
0 w b2 ··· 0
W =
c .. .. .. .. .
. . . .
0 0 ··· w
bJ
Step 5: Convergence checking. The process from Step 1 through Step 4 is then repeated
until the maximum relative difference between the outer weights from one iteration to the
next falls below a given tolerance value chosen by the analyst (e.g., 10−5 ). More formally, the
algorithm stops when
old − ŵ new
ŵkj kj
max new < tol.
k=1,...,K ŵkj
j=1,...,J
where pj denotes the number of indicators in the jth block and xhj is the hth indicator in
the jth block8 .
The average communality is the average of all COR(xhj , y j )2 , that is
J J pj
1X 1X X
communality = pj × communalityj = pj COR(xhj , y j )2 .
p j=1 p j=1 h=1
For each endogenous latent variable, redundancy measures the amount of variance in the
indicators measuring the variable that is explained by the exogenous latent variables that
8
A common measure to establish convergent validity on the construct level, that is the extent to which a
measure correlates positively with alternative measures of the same construct, is the average variance extracted
(AVE) measure. The AVE is equivalent to the communality of a construct.
14 plssem: Structural Equation Modeling with PLS in Stata
If more than one endogenous variable is available in the model, then one can also calculate
the average redundancy indices for all of the endogenous variables.
Finally, the goodness-of-fit (GoF) index, which takes into account both the measurement and
structural model performance, is used for judging the quality of a PLS-SEM model as a whole.
GoF is calculated as the geometric mean of the average communality and the average R2 :
q
2
GoF = communality × R ,
where the average R2 is computed using only the endogenous latent variables in the model.
A well known theoretical deficiency of PLS-SEM is that it lacks an overall optimization
criterion (such as for example the sum of squared residuals in linear regression or the likelihood
function in COV-SEM). Therefore, no index for the assessment of the global validation of the
model is available. The GoF index represents an operational solution to this problem which
is very often used in the practical application of PLS-SEM9 .
In PLS-SEM it is assumed that the block of indicators for a reflective measure is unidimen-
sional in the same sense of factor analysis. To check the unidimensionality of a reflective block,
some reliability indexes are typically computed, the most popular ones being the Cronbach’s
alpha (αj ) and the Dillon-Goldstein’s coefficient (ρj ). When standardized indicators and
latent variables are used, these indices are defined respectively10 as
P
h6=kCOR(xhj , xkj ) pj
αj = × ,
pj − 1
P
pj + h6=k COR(xhj , xkj )
and
pj b outer 2
P
h=1 λhj
ρj = 2 2 ,
Ppj b outer Ppj b outer
h=1 λhj + h=1 1− λ hj
where λhj is the outer loading for indicator h in block j. Since Cronbach’s alpha tends to
underestimate the internal consistency reliability, the Dillon-Goldstein’s coefficient is often
preferred in practice (Chin 1998, p. 320).
For more details on the assessment of PLS-SEM results and rules of thumb for evaluating
the quality of a fitted model, we refer the reader to the literature (e.g., an updated and
comprehensive survey is available in Hair et al. 2017).
9
The lack of an explicit optimization criterion is a critical drawback of the PLS-SEM approach, which has
some unpleasant consequences. The most serious is the impossibility to statistically test the relative superiority
of a PLS-SEM model over any other. However, we also notice that in recent years successful attempts to derive
the criteria optimized by PLS-SEM have been made (for a review see Esposito Vinzi and Russolillo 2013).
10
Without loss of generality, in the calculation of the Cronbach’s alpha it is usually assumed that all the
indicators in the block are positively correlated. This is not really a big issue, since the indicators can always
be built in this way.
Journal of Statistical Software 15
Clearly, one can specify as many LVs as it is needed in the model. The specification of
reflective measures in the measurement model require to use the greater-than sign between
a latent variable and its associated indicators (e.g., LV1 > indblock1), while the less-than
sign needs to be provided when one needs to include latent variables measured in a formative
way (e.g., LV1 < indblock1).
To specify the structural part11 , one needs to provide the endogenous/dependent latent vari-
able (say, LV2) first followed by the exogenous latent variables (say, LV1) by typing
One can specify further structural relationships following the same approach. For example,
suppose one has two further latent variables in the model, LV3 and LV4, still measured in a
reflective way, with LV4 endogenous and LV3 exogenous. Then, the syntax for the structural
part should be
plssem (LV1 > indblock1) (LV2 > indblock2) (LV3 > indblock3) ///
(LV4 > indblock4), structural(LV2 LV1, LV4 LV3).
In addition, in line with most of the Stata commands, we can fit a full PLS-PM model by
sub-setting the data directly in the syntax using the if and in qualifiers.
More generally, the syntax for the plssem command is provided by12
[by groupvar :] plssem (LV1 > indblock1) (LV2 > indblock2) (... ...) ///
[if exp ] [in range ] [, structural(LV2 LV1, ... ...) options ],
where square brackets distinguish optional qualifiers and options from required ones, groupvar
denotes a variable name in the data set, exp denotes an algebraic expression, range denotes
an observation range, and options denotes a list of available options. The optional by
prefix causes Stata to repeat a command for each subset of the data for which the values
of the groupvar variable are equal. In other words, when prefixed with by, the result of
the command will be the same as if one had formed separate data sets for each group of
11
While the measurement part is mandatory, the plssem package allows to fit models that do not include
the structural part.
12
The plssem package is compatible with Stata version 10 and above.
16 plssem: Structural Equation Modeling with PLS in Stata
observations, saved them, and then gave the command on each data set separately. The list
of available options for the plssem command are illustrated in the next section.
3.2. Options
The options allowed by the plssem command are detailed below:
binary(LV) indicates the latent variables that are defined by a single binary variable. This
allows essentially for estimating a model with a binary dependent variable using a logistic
regression model. The latent variable LV needs to be specified in the measurement part
of the syntax at the same time (e.g., LV > binaryvar)13 .
seed(#) sets the seed number for the bootstrap calculations. This option may be useful if
reproducibility is one of the analyst’s concerns.
tol(#) sets the tolerance value used for checking convergence attainment (see Step 5 in
Stage I described in Section 2.1). The default tolerance value is 1e-7.
maxiter(#) indicates the maximum number iterations the algorithm runs. The default is 100
iterations. Note that usually the algorithm requires a very limited number of iterations
to reach convergence, typically less than 10.
missing(imputation_method) provides the choice of the imputation method for the indica-
tor missing values. Possible choices are mean (i.e., the mean of the available indicators)
or knn (i.e., the kth nearest neighbor method).
k(#) sets the number of nearest neighbors to use with missing(knn). The default number
of nearest neighbors is 5.
init(init_method) lets the user choose between two options for initialization. These are
indsum14 (default) and eigen15 . The eigen option is required if the user wants to
estimate only the measurement part of the model16 .
digits(#) sets the number of decimals to display the model estimates. The default is 3.
13
This is in fact showing how we can work with single indicators using the plssem command. We can
include both continuous and dichotomous single indicators in the model by linking them to latent variables
in the measurement part of the syntax. Unless any of these latent variables is specified as binary using the
binary() option, the structural part will apply linear (regress), otherwise logistic (logit) regression will be
used. However, we stress that the same algorithm is used for the measurement part regardless of the nature
of the indicators.
14
The initial values in this option are 1s for all of the indicators.
15
The initial values (i.e., the weights) in this option are the values associated with the first eigenvector in
factor extraction’s iterative process.
16
What this initialization does is essentially running separate factor analyses with principal component
extraction method (factor, pcf) for each latent variable in Stata. Thus, plssem command can conveniently
be used as an alternative to the factor, pcf command as plssem would provide the user with some further
estimations (i.e., reliability coefficients and discriminant validity assessment).
Journal of Statistical Software 17
stats displays some summary statistics (mean, standard deviations, etc.) for each indicator.
group(grouping variable, [suboptions]) provides both the structural and the measure-
ment part of the estimation results for each category of the grouping variable as well
as the comparison between the categories based on normal theory (default). As an al-
ternative to normal-based theory estimations, the user can choose between two resam-
pling techniques. More specifically, by adding the suboptions method(permutation)
or method(bootstrap) one can get the results based on permutation or bootstrap re-
sampling. The default number of replications for both permutation and bootstrap is
100. However, this can be changed by adding the suboption reps(#). Further, with the
suboption groupseed(#) one can also set a certain seed number to be able reproduce
the bootstrap or permutation results. Finally, by using the suboption plot one can
get a graphical output showing the estimates differences between the groups based on
alpha level of 0.05 (default). The significance level can also be changed by adding the
suboption alpha(#).
correlate(mv lv cross, cutoff()) lets the user ask for correlations among the indicators
or manifest variables (mv), latent variables (lv) as well as cross-loadings (cross) between
the indicators and latent variables17 . When doing so, the user can also set a certain
cut-off value for the correlations to be displayed by using the suboption cutoff(). For
instance, cutoff(0.3) will display the correlations above 0.3 in absolute terms.
rawsum uses the sum of the raw indicators and the resulting aggregated scores (also called
summated scales) are used directly for estimating the structural part. In this sense,
rawsum is an alternative procedure to the PLS-algorithm for estimating the latent vari-
able scores.
noscale if chosen, the manifest variables are not standardized before running the algorithm.
estat indirect, effects(dep med ind, ...) estimates the specified (standardized) in-
direct effects and tests the significance of these effects using either the Sobel’s z statis-
tic (default) as well as the bootstrap approach18 (Sobel 1982; Baron and Kenny 1986;
VanderWeele 2015). The command can estimate up to five different indirect effects at
a time. Each of these should be specified by sequentially typing the dependent (dep),
mediator (med) and independent (ind) variable from any PLS-SEM model. By adding
the suboption boot(#), you can obtain the results based on the bootstrap. To facili-
tate the reproducibility of results, the suboption seed(#) can further be added to set
the seed for the bootstrap calculations. Confidence intervals for the estimated indirect
effects are also provided. The default confidence level is 95%, but one can change it
by adding the suboption level(#). To change the number of decimals used to display
the estimates, one can change the default (3 digits) to another value by adding the
suboption digits(#).
estat total produces the decomposition of the total effects in (standardized) direct and
indirect effects19 . Adding the suboption plot will generate a bar plot of the effect
decomposition. You can here too change the decimals by making use of the suboption
digits(#).
estat vif computes the variance inflation factors (VIFs) for the independent variables spec-
ified in the structural part of a PLS-SEM model. With the digit(#) suboption, one
can change the decimal display.
estat unobshet assesses the presence of unobserved heterogeneity. Currently, the command
implements only the REBUS-PLS approach proposed by Trinchera (2007) and Esposito
Vinzi et al. (2008).
plssemplot, loadings provides a bar plot of the loadings of indicators for their respective
latent variables.
plssemplot, crossloadings provides bar plots of the loadings of indicators for not only
their respective but all the other latent variables (i.e., the cross loadings; see line 24 of
Algorithm 1).
plssemplot, scores provides the scatterplot matrix of the scores for the latent variables
defined in the PLS-SEM model.
plssemplot, stats(LV) provides the scatterplot matrix for the indicators in the block defin-
ing the latent variable LV.
18
estat indirect provides the indirect effects mediated by only one latent variable.
19
In particular, the overall indirect effects via more than one mediator variable are provided.
Journal of Statistical Software 19
predict, xb residuals creates new variables containing linear predictions (option xb, the
default) and residuals (option residuals). These quantities are provided only for reflec-
tive blocks of manifest variables in the measurement/outer model and for endogenous
latent variables in the structural/inner model.
• Finally, plssem saves a function returning an indicator that marks the observations
used for fitting the model; this function is accessible through:
Together with the above objects, the plssem command also saves the latent variable scores
as new columns in the active data set. These columns are labeled as the latent variables
specified in the model syntax.
4. Empirical application
In this section we illustrate the use of the plssem package through an example taken from
our research agenda. More specifically, we use a real-life data set collected from members of
a training/fitness center in 2014 in a medium-sized city in Norway. The members were asked
to indicate how well having an attractive face and being sexy described them as a person
using an ordinal scale (1 = very badly to 6 = very well). Using a similar scale (1 = not at all
important to 6 = very important), the members were also asked to indicate how important
each of the following measures was for working out:
• to improve my appearance;
• to develop my muscles;
• to get stronger;
22 plssem: Structural Equation Modeling with PLS in Stata
Table 3: List of indicators collected and latent variables they measure for the empirical
application described in Section 4.
• to increase my endurance;
• to lose weight;
• to burn calories;
• to control my weight.
Table 3 reports the list of indicators, the variable name in the data set and the corresponding
latent construct they measure.
H1: The more attractive one perceives herself/himself, the more the person wants to work
out to improve her/his physical appearance (i.e., Attractive → Appearance).
H2: The more the person wants to work out to improve her/his physical appearance, the
more s/he wants to work out to build up muscles (i.e., Appearance → Muscle).
H3: The more the person wants to work out to improve her/his physical appearance, the
more s/he wants to work out to lose weight (i.e., Appearance → Weight).
H4: The more attractive one perceives herself/himself will indirectly influence the person to
work out more to build up muscles (i.e., Attractive → Appearance → Muscle).
H5: The more attractive one perceives herself/himself will indirectly influence the person to
work out more to lose weight (i.e., Attractive → Appearance → Weight).
Muscle
Attractive Appearance
Weight
sexy face
Figure 2: The hypothesized PLS-SEM model according to the hypotheses described in Sec-
tion 4. Attractive, Appearance, Muscle and Weight are the latent variables defined in
the model, with Attractive being the only exogenous variable. All the latent variables are
measured in a reflective way.
Model estimation
Following the syntax and options described in Section 3, we specify and estimate our research
model represented in Figure 2 with the following code:
. use [Link],clear
. plssem (Attractive > face sexy) ///
(Appearance > body appear attract) ///
(Muscle > muscle strength endur) ///
(Weight > lweight calories cweight), ///
structural(Appearance Attractive, ///
Muscle Appearance, ///
Weight Appearance) ///
boot(200) seed(123) stats correlate(lv)
muscle | 0.886
strength | 0.873
endur | 0.623
lweight | 0.916
calories | 0.937
cweight | 0.911
--------------+-----------------------------------------------------------
Cronbach | 0.801 0.914 0.734 0.912
DG | 0.909 0.946 0.842 0.944
--------------------------------------------------------------------------
As one can see, the output commences with some summary statistics followed by the measure-
ment part of the estimation results including the bootstrap standardized loadings. We then
see a table showing the discriminant validity assessment22 before displaying the structural
part of the estimation results including bootstrap standardized path coefficients. Finally, we
get a table showing the correlations among the latent variables of our model.
The output provided gives us the necessary information to test the first three hypotheses,
namely H1, H2 and H3. To be able to test mediational hypotheses (H4 and H5), we make
further use of the following code to estimate the indirect effects and test their statistical
significance using the bootstrap method.
We can further ask for a graphical output showing the size of the outer loadings for each
latent variable using the following code, which yields the graph shown in Figure 3:
. plssemplot, loadings
Stored results
plssem stores the following objects in e() after estimating our proposed model.
. ereturn list
22
To be able to demonstrate discriminant validity, the average variance extracted (AVE) values should be
larger than the squared correlations among the latent variables.
Journal of Statistical Software 27
Loadings
Attractive Appearance Muscle Weight
1
.8
.6
.4
.2
0
face
sexy
body
appear
attract
muscle
strength
endur
lweight
calories
cweight
Figure 3: Bar chart reporting the outer loadings by blocks. Colors denote different indicator
blocks. The dashed line provides a value (i.e., 0.7) frequently used in the literature to assess
the quality of the fit.
. ereturn list
scalars:
e(converged) = 1
e(tolerance) = 1.00000000000e-07
e(iterations) = 7
e(maxiter) = 100
e(reps) = 200
e(N) = 187
macros:
e(cmd) : "plssem"
e(cmdline) : "plssem (Attractive > face sexy) (Appearance > body
appear ..."
e(estat_cmd) : "plssem_estat"
e(predict) : "plssem_p"
e(title) : "Partial least squares structural equation modeling"
e(datasignaturevars) : "face sexy body appear attract muscle strength endur
lweight ..."
e(datasignature) : "[Link]37:1706307640"
28 plssem: Structural Equation Modeling with PLS in Stata
matrices:
e(indstats) : 11 x 7
e(loadings) : 11 x 4
e(loadings_bs) : 11 x 4
e(loadings_se) : 11 x 4
e(cross_loadings) : 11 x 4
e(cross_loadings_bs) : 11 x 4
e(cross_loadings_se) : 11 x 4
e(adj_meas) : 11 x 4
e(relcoef) : 2 x 4
e(sqcorr) : 4 x 4
e(ave) : 1 x 4
e(struct_b) : 2 x 3
e(struct_se) : 2 x 3
e(struct_table) : 5 x 3
e(pathcoef) : 4 x 4
e(pathcoef_bs) : 4 x 4
e(adj_struct) : 4 x 4
e(total_effects) : 4 x 4
e(rsquared) : 1 x 4
e(redundancy) : 1 x 4
e(assessment) : 1 x 4
e(ow_history) : 8 x 11
e(outerweights) : 11 x 4
e(reldiff) : 1 x 7
functions:
e(sample)
Multigroup analysis
To demonstrate a further feature of the plssem package, in this section we perform a multi-
group analysis based on the model depicted in Figure 2. More specifically, we now check
whether the model estimates (path coefficients and loadings) differ between male and female
respondents in our sample. As described earlier in the paper, plssem offers two approaches
for comparing model estimates across groups: permutation and bootstrap (as well as the
standard one based on normal theory). Here, we show the results using the bootstrap option
with 200 replications. For reproducibility purposes, we set an arbitrary seed. We also set
a significance level (alpha) of 0.1 to display significant path coefficients or loadings in the
Journal of Statistical Software 29
resulting plot. The grouping variable, women, is a dummy-coded variable in which men are
coded as 0.
Figure 4: Comparison of path coefficients using multigroup analysis. The statistically signif-
icant differences at the given alpha level are highlighted with (*).
Group 1: 71
Group 2: 116
The results show the path coefficients for the whole sample (Global) as well as those for
the samples containing the men (Group 1) and women (Group 2). More importantly, a
Journal of Statistical Software 31
Figure 5: Comparison of outer loadings using multigroup analysis. In this case, none of the
differences is significant at the given alpha level.
bootstrapped t test is run based on these estimates. We can conclude that the effect of
Appearance on Weight is larger among women than men. This difference is significant at the
0.1 significance level though. The remaining path coefficients are not significantly different
between the two groups. Figure 4 reports the plot produced by the above code showing
the magnitudes of the differences among the path coefficients. The plot also shows the path
coefficients (if any) that are significantly different between groups according to the chosen
alpha level by marking them with a star.
The same command also provides the comparison of the model’s loadings between men and
women represented in Figure 5. None of the loadings is significantly different between men
and women. Equality of loadings is indeed an important condition that must be met for
establishing measurement invariance before comparing path coefficients of different models.
Thus, ideally and as done in real-life research practice, the comparison of the measurement
model parameters should precede the comparison of the structural model parameters.
5. Conclusion
In this article, we introduced the plssem package for estimation of partial least squares struc-
tural equation modeling (PLS-SEM). We demonstrated the capabilities of the package using
a common and multi-featured empirical application. plssem can as easily be used to estimate
more complex PLS-SEM models such as higher-order latent variable models. Future releases
32 plssem: Structural Equation Modeling with PLS in Stata
of the command will include further more advanced features, in particular we plan to add
capabilities for multilevel modeling, more options for missing values imputation and more
elaborate approaches for dealing with observed and unobserved heterogeneity.
References
Lohmöller JB (1989). Latent Variable Path Modeling with Partial Least Squares. Physica.
doi:10.1007/978-3-642-52512-4.
Mevik BH, Wehrens R, Liland KH (2018). pls: Partial Least Squares and Principal Component
Regression. R package version 2.7-0, URL [Link]
Monecke A, Leisch F (2012). “semPLS: Structural Equation Modeling Using Partial Least
Squares.” Journal of Statistical Software, 48(3), 1–32. doi:10.18637/jss.v048.i03.
Muthén LK, Muthén BO (2017). Mplus User’s Guide. 8th edition. Muthén & Muthén, Los
Angeles.
R Core Team (2019). R: A Language and Environment for Statistical Computing. R Founda-
tion for Statistical Computing, Vienna, Austria. URL [Link]
Sanchez G (2013). PLS Path Modeling with R. Trowchez Editions. URL [Link]
[Link]/[Link].
Sanchez G, Trinchera L, Russolillo G (2017). plspm: Tools for Partial Least Squares Path Mod-
eling (PLS-PM). R package version 0.4.9, URL [Link]
plspm.
Sarstedt M, Becker JM, Ringle CM, Schwaiger M (2011). “Uncovering and Treating Un-
observed Heterogeneity with FIMIX-PLS: Which Model Selection Criterion Provides an
Appropriate Number of Segments?” Schmalenbach Business Review, 63(1), 34–62. doi:
10.1007/bf03396886.
SAS Institute Inc (2013). The SAS System, Version 9.4. SAS Institute Inc., Cary. URL
[Link]
Sobel MN (1982). “Asymptotic Confidence Intervals for Indirect Effects in Structural Equa-
tions Models.” In S Leinhart (ed.), Sociological Methodology, pp. 290–312. Jossey-Bass.
StataCorp (2017). Stata Statistical Software: Release 15. StataCorp LLC, College Station.
URL [Link]
Journal of Statistical Software 35
Tenenhaus M, Esposito Vinzi V, Chatelin YM, Lauro C (2005). “PLS Path Modeling.”
Computational Statistics & Data Analysis, 48(1), 159–205. doi:10.1016/[Link].2004.
03.005.
Van Buuren S (2012). Flexible Imputation of Missing Data. CRC Press. doi:10.1201/b11826.
Wold HOA (1975). “Path Models with Latent Variables: The NIPALS Approach.” In
HM Blalock, A Aganbegian, FM Borodkin, R Boudon, V Cappecchi (eds.), Quantitative
Sociology, pp. 307–359. Academic Press.
Wold HOA (1982). “Soft Modeling: The Basic Design and Some Extensions.” In KG Jöreskog,
HOA Wold (eds.), Systems under Indirect Observations, Part II, pp. 1–54. North-Holland.
Affiliation:
Sergio Venturini
Department of Decision Sciences
Università Commerciale L. Bocconi
20136 Milan, Italy
E-mail: [Link]@[Link]
Mehmet Mehmetoglu
Department of Psychology
Norwegian University of Science and Technology
7491 Trondheim, Norway
E-mail: mehmetm@[Link]