CAS Forum Winter 2003: Data & Ratemaking
To CAS Members:
This is the Winter 2003 Edition of the Casualty Actuarial Society Forum. It
contains six Data Management, Quality, and Technology Call Papers, nine Ratemaking
Discussion Papers, and seven additional papers.
The Casualty Actuarial Society Forum is a nonrefereed journal printed by the
Casualty Actuarial Society. The viewpoints published herein do not necessarily re-
flect those of the Casualty Actuarial Society.
The CAS Forum is edited by the CAS Committee for the Casualty Actuarial
Society Forum. Members of the committee invite all interested persons to submit
papers on topics of interest to the actuarial community. Articles need not be written
by a member of the CAS, but the paper's content must be relevant to the interests of
the CAS membership. Members of the Committee for the Casualty Actuarial Society
Forum request that the following procedures be followed when submitting an article
for publication in the Forum:
The CAS Forum is printed periodically based on the number of call paper
programs and articles submitted. The committee publishes two to four editions dur-
ing each calendar year.
All comments or questions may be directed to the Committee for the Casualty
Actuarial Society Forum.
Sincerely,
The Winter 2003 Edition of the CAS Forum is a cooperative effort between
the CAS Forum Committee, the CAS Committee on Management Data and Informa-
tion, and the CAS Committee on Ratemaking.
The CAS Committee on Management Data and Information presents for dis-
cussion six papers prepared in response to its Call for 2003 Data Management, Qual-
ity, and Technology Papers.
The CAS Committee on Ratemaking presents for discussion nine papers pre-
pared in response to its Call for 2003 Ratemaking Discussion Papers.
This Forum includes papers that will be discussed by the authors at the 2003
CAS Ratemaking Seminar, March 27-28, 2003, in San Antonio, Texas.
Abstract
This paper addresses issues and techniques for Property/Casualty actuaries applying data
mining. Data mining means the efficient discovery of previously unknown
patterns in large databases. It is an interactive information discovery process that includes
data acquisition, data integration, data exploration, model building, and model validation. The
paper provides an overview of the information discovery techniques and introduces some
important data mining techniques for application to insurance including cluster discovery
methods and decision tree analysis.
1. Introduction
Because of the rapid progress of information technology, the amount of information stored in
insurance databases is rapidly increasing. These huge databases contain a wealth of data and
constitute a potential goldmine of valuable business information. As new and evolving loss
exposures emerge in the ever-changing insurance environment, the form and structure of
insurance databases change. In addition, new applications such as dynamic financial analysis
and catastrophe modeling require the storage, retrieval, and analysis of complex multimedia
objects, which are often represented by high-dimensional feature vectors. Finding the valuable
information hidden in those databases and identifying appropriate models is a difficult task.
Data mining (DM) is the process of exploration and analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover meaningful patterns and rules (Berry and
Linoff, 2000). A typical data mining process includes data acquisition, data integration, data
exploration, model building, and model validation. Both expert opinion and data mining
techniques play an important role at each step of this information discovery process.
This paper introduces two important data mining techniques for application to insurance:
cluster discovery methods and decision tree analysis.
Cluster analysis is one of the basic techniques often applied in analyzing large data
sets. Originating in statistics, most cluster analysis algorithms were originally
developed for relatively small data sets. In recent years, clustering algorithms have
been extended to work efficiently on large data sets, and some of them even allow the
clustering of high-dimensional feature vectors (see Ester, Kriegel, Sander, and Xu, 1998,
and Hinneburg and Keim, 1998, for example).
Decision tree analysis is another popular data mining technique that can be used in many areas
of actuarial practice. We discuss how to use decision trees to make important design decisions
and explain the interdependencies among the properties of insurance data. We will also
provide examples of how data mining techniques can be used to improve the effectiveness and
efficiency of the modeling process.
The paper is organized as follows. Section 2 provides an overview of data mining and a list of
potential DM applications to insurance. Section 3 demonstrates the cluster analysis data
mining techniques. Section 4 presents an application of the predictive data mining process. This
section identifies factors that influence auto insurance claims using decision tree techniques
and quantifies the effects and interactions of these risk factors using logistic regression.
Model assessment is also discussed in this section. Section 5 concludes the paper.
2. Data Mining
In this section, we will provide an overview of the data mining process (2.1), data mining
operations (2.2), data mining techniques and algorithms (2.3), and their potential applications
in the insurance industry (2.4).
Data mining combines techniques from machine learning, pattern recognition, statistics,
database theory, and visualization to extract concepts, concept interrelations, and interesting
patterns automatically from large corporate databases. Its primary goal is to extract knowledge
from data to support the decision-making process. Two primary functions of data mining are:
prediction, which involves finding unknown values/relationships/patterns from known values;
and description, which provides interpretation of a large database.
2.1 The Data Mining Process
STEP 1: Data acquisition. The first step is to select the types of data to be used. Although a
target data set has been created for discovery in some applications, DM can be performed on a
subset of variables or data samples in a larger database.
STEP 2: Preprocessing data. Once the target data is selected, the data is then preprocessed for
cleaning, scrubbing, and transforming to improve the effectiveness of discovery. During this
preprocessing step, developers remove the noise or outliers if necessary and decide on
strategies for dealing with missing data fields and accounting for time sequence information or
known changes. In addition, the data is often transformed to reduce the effective number of
variables under consideration by either converting one type of data to another (e.g., categorical
values into numeric ones) or deriving new attributes (by applying mathematical or logical
operators).
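To make this preprocessing step concrete, the following is a minimal sketch in Python using the pandas and scikit-learn libraries; the file name and column names are hypothetical placeholders rather than the study's data.

```python
# Minimal preprocessing sketch; file and column names are hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("policies.csv")

# Deal with missing data fields: drop very sparse columns, impute the rest.
df = df.dropna(axis=1, thresh=int(0.5 * len(df)))
df["credit_score"] = df["credit_score"].fillna(df["credit_score"].median())

# Convert one type of data to another: categorical values into numeric ones.
df = pd.get_dummies(df, columns=["car_type", "coverage"], drop_first=True)

# Derive a new attribute by applying a logical operator.
df["new_car"] = (df["car_age"] < 3).astype(int)

# Put numeric inputs on a common scale.
num_cols = ["age", "car_age", "credit_score"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```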
STEP 3: Data exploration and model building. The third step of DM refers to a series of
activities such as deciding on the type of DM operation; selecting the DM technique; choosing
the DM algorithm; and mining the data. First, the type of DM operation must be chosen. The
DM operations can be classified as classification, regression, segmentation, link analysis, and
deviation detection (see Section 2.2 for details). Based on the operation chosen for the
application, an appropriate data-mining technique is then selected. Once a data-mining
technique is chosen, the next step is to select a particular algorithm within the DM technique
chosen. Choosing a data-mining algorithm includes a method to search for patterns in the
data, such as deciding which models and parameters may be appropriate and matching a
particular data-mining technique with the overall objective of data mining. After an
appropriate algorithm is selected, the data is finally mined using the algorithm to extract novel
patterns hidden in databases.
STEP 4: Interpretation and evaluation. The fourth step of the DM process is the interpretation
and evaluation of discovered patterns. This task includes filtering the information to be
presented by removing redundant or irrelevant patterns, visualizing the useful ones graphically
or logically, and translating them into terms understandable by users. In the interpretation of
results, we determine and resolve potential conflicts with previously found knowledge or
decide to redo any of the previous steps. The extracted knowledge is also evaluated in terms of
its usefulness to a decision maker and to a business goal. Then extracted knowledge is
subsequently used to support human decision making such as prediction and to explain
observed phenomena.
The four-step process of knowledge discovery should not be interpreted as linear, but as an
interactive, iterative process through which discovery evolves.
2.2 Data Mining Operations
Assuming you have prepared a data set for mining, you then need to define the scope of your
study and choose the subject of your study. This is referred to as choosing a DM operation.
There are five types of DM operations: classification, regression, link analysis, segmentation,
and deviation detection. Classification and regression are useful for prediction, whereas link
analysis, segmentation, and deviation detection are for description of patterns in the data. A
DM application typically requires the combination of two or more DM operations.
Classification
The goal of classification is to develop a model that maps a data item into one of several
predefined classes. Once developed, the model is used to classify a new instance into one of
the classes. Examples include the classification of bankruptcy patterns based on the financial
ratios of a firm and of customer buying patterns based on demographic information to target
the advertising and sales of a firm effectively toward the appropriate customer base.
Regression
This operation builds a model that maps data items into a real-valued prediction variable.
Models have traditionally been developed using statistical methods such as linear and logistic
regression. Both classification and regression are used for prediction. The distinction between
the two models is that the output variable of classification is categorical, whereas that of
regression is numeric and continuous. Examples of regression are the prediction of changes
between the yen and the Government Bond Market and the prediction of a city's crime rate
based on input variables such as population, average income level, and education.
Link Analysis
Link analysis is used to establish relevant connections between database records. Its typical
application is market-basket analysis, where the technique is applied to analyze point-of-sales
transaction data to identify product affinities. A retail store is usually interested in what items
sell together -- such as baby's diapers and formula -- so it can determine what items to display
together for effective marketing. Another application could find relationships among medical
procedures by analyzing claim forms submitted to an insurance firm. Link analysis is often
applied in conjunction with database segmentation.
Segmentation
The goal is to identify clusters of records that exhibit similar behaviors or characteristics
hidden in the data. The clusters may be mutually exclusive and exhaustive or may consist of a
richer representation such as hierarchical or overlapping categories. Examples include
discovering homogenous groups of consumers in marketing databases and segmenting the
records that describe sales during "Mother's Day" and "Father's Day." Once the database is
segmented, link analysis is often performed on each segment to identify the association among
the records in each cluster.
Deviation Detection
This operation focuses on discovering interesting deviations. There are four types of deviation:
• Unusual patterns that do not fit into previously measured or normative classes,
• Significant changes in the data from one time period to the next,
• Outlying points in a dataset -- records that do not belong to any particular cluster, and
• Discrepancies between an observation and a reference.
Deviation detection is usually performed after a database is segmented to determine whether
the deviations represent noisy data or genuinely unusual cases. Deviation detection is often the source
of true discovery, since deviations represent anomalies from some known expectation or norm.
2.3 Data Mining Techniques and Algorithms
At the heart of DM is the process of building a model to represent the data set and to carry out
the DM operation. A variety of DM techniques (tools) are available to support the five types
of DM operations presented in the previous section. The most popular data mining techniques
include Bayesian analysis (Cheeseman et al., 1988), neural networks (Bishop, 1995; Ripley,
1996), genetic algorithms (Goldberg, 1989), decision trees (Breiman et al., 1984), and logistic
regression (Hosmer and Lemeshow, 1989), among others.
Table 1 summarizes the DM techniques used for DM operations. For each of the DM
techniques listed in Table 1, there are many algorithms (approaches) to choose from. In the
following, some of the most popular technologies are discussed.

Table 1. DM Techniques for DM Operations

DM Operation          Induction   Neural     Genetic      Clustering   Logistic     Association   Sequence
                                  Networks   Algorithms                Regression   Discovery     Discovery
Classification            x          x                                     x
Regression                           x                                     x
Link analysis                        x                                                   x            x
Segmentation                         x                         x
Deviation detection                                             x

Induction Techniques
Induction techniques develop a classification model from a set of records -- the training set of
examples. The training set may be a sample database, a data mart, or an entire data warehouse.
Each record in the training set belongs to one of many predefined classes, and an induction
technique induces a general concept description that best represents the examples to develop a
classification model. The induced model consists of patterns that distinguish each class. Once
trained, a developed model can be used to predict the class of unclassified records
automatically. Induction techniques represent a model in the form of either decision trees or
decision rules. These representations are easier to understand, and their implementation is
more efficient than those of neural networks or genetic algorithms. A more detailed discussion
of decision tree techniques and their applications is presented in Section 4.
Neural Networks
Neural networks constitute the most widely used technique in data mining. They imitate the
way the human brain learns and use rules inferred from data patterns to construct hidden
layers of logic for analysis. Neural network methods can be used to develop classification,
regression, link analysis, and segmentation models. A neural net technique represents its
model in the form of nodes arranged in layers with weighted links between the nodes. There
are two general categories of neural net algorithms: supervised and unsupervised.
Supervised neural net algorithms such as Back propagation (Rumelhart, Hinton, and
Williams, 1986) and the Perceptron require predefined output values to develop a
classification model. Among the many algorithms, Back propagation is the most
popular supervised neural net algorithm. Back propagation can be used to develop not
only a classification model but also a regression model.
Unsupervised neural net algorithms such as ART (Carpenter and Grossberg, 1988) do
not require predefined output values for input data in the training set and employ self-
organizing learning schemes to segment the target data set. Such self-organizing
networks divide input examples into clusters depending on similarity, each cluster
representing an unlabeled category. Kohonen's Feature Map is a well-known method
in self-organizing neural networks.
For organizations with a great depth of statistical information, neural networks are ideal
because they can identify and analyze changes in patterns, situations, or tactics far more
quickly than any human mind. Although the neural net technique has strong representational
power, interpreting the information encapsulated in the weighted links can be very difficult.
One important characteristic of neural networks is that they are opaque, which means there is
not much explanation of how the results come about and what rules are used. Therefore, some
doubt is cast on the results of the data mining. Francis (2001) gives a discussion on Neural
Network applications to insurance problems.
Genetic Algorithms
Genetic algorithms are a method of combinatorial optimization based on processes in
biological evolution. The basic idea is that over time, evolution has selected the "fittest
species." For a genetic algorithm, one can start with a random group of data. Afitness
function can be defined to optimizing a model of the data to obtain "fittest" models. For
example, in clustering analysis, a fitness function could be a function to determine the level of
similarity between data sets within a group.
Genetic algorithms have often been used in conjunction with neural networks to model data.
They have bee/n used to solve complex problems that other technologies have a difficult time
with. Micha61ewicz (1994) introduced the concept of genetic algorithms and applying them
with data mining.
Logistic Regression
Logistic regression is a special case of generalized linear modeling. It has long been used to study
odds ratios ($e^{\beta_j}$, $j = 1, 2, \ldots, k$, as defined below), which compare the odds of an
event in one category to the odds of the event in another category, and its
properties have been well studied by the statistical community. Ease of interpretation is one
advantage of modeling with logistic regression. Assume that the data set consists of $i = 1, 2,
\ldots, n$ records. Let $p_i$, $i = 1, 2, \ldots, n$, be the corresponding mortality rate for each record and
$x_i = (x_{1i}, x_{2i}, \ldots, x_{ki})$ be a set of $k$ variables associated with each record. A linear-additive
logistic regression model can be expressed as
\[
\mathrm{logit}(p_i) = \ln\!\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki}.
\]
If the model is correctly specified, each input variable affects the logit linearly.
Exponentiating the parameter estimate of each slope, $e^{\beta_j}$, $j = 1, 2, \ldots, k$, gives
the odds ratio associated with the input variable $x_{ji}$ (Kleinbaum,
Kupper, and Muller, 1988). However, logistic regression poses several drawbacks, especially with
large data sets. The curse of dimensionality makes the detection of nonlinearities and
interactions difficult. If the model is not correctly specified, the interpretation of the model
parameter estimates becomes meaningless. In addition, the data might not be evenly
distributed over the whole data space. It is very likely that some segments of the data space
have more records than other segments. One model that fits the whole data space might not be
the best choice, depending on the intended application. Although there are many existing
methods, such as backward elimination and forward selection, that can help the data analyst
build a logistic regression model, judgment should be exercised regardless of the method
selected.
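As a simple illustration of the technique, the following sketch fits such a model and reads off the odds ratios using the statsmodels library; the file and column names are hypothetical.

```python
# Sketch: fit a logistic regression for claim occurrence and report odds ratios.
# The file and column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("drivers.csv")
X = sm.add_constant(df[["age", "credit_score", "car_age", "location"]])
y = df["claim"]  # 1 = claim submitted, 0 = no claim

fit = sm.Logit(y, X).fit()

# exp(beta_j) is the odds ratio associated with a one-unit change in x_j.
print(np.exp(fit.params))
```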
Clustering
Clustering techniques are employed to segment a database into clusters, each of which shares
common and interesting properties. The purpose of segmenting a database is often to
summarize the contents of the target database by considering the common characteristics
shared in a cluster. Clusters are also created to support the other types of DM operations, e.g.
link analysis within a cluster. Section 3 will introduce more details of clustering and its
application to insurance.
Association Discovery
Given a collection of items and a set of records containing some of these items, association
discovery techniques discover the rules to identify affinities among the collection of items as
reflected in the examined records. For example, 65 percent of records that contain item A also
contain item B. An association rule uses measures called "support" and "confidence" to
represent the strength of association. The percentage of occurrences, 65 percent in this case, is
the confidence factor of the association. The algorithms find the affinity rules by sorting the
data while counting occurrences to calculate confidence. The efficiency with which
association discovery algorithms can organize the events that make up an association or
transaction is one of the differentiators among the association discovery algorithms. There are
a variety of algorithms to identify association rules, such as the Apriori algorithm and
approaches based on random sampling. Bayesian networks can also be used to identify distinctions and
relationships between variables (Fayyad et al., 1996).
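As a small illustration (not a full Apriori implementation), the following sketch computes support and confidence for a single rule A -> B over a toy list of transaction records.

```python
# Support and confidence for the rule A -> B over a toy set of transactions.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "D"},
]

n = len(transactions)
n_a = sum(1 for t in transactions if "A" in t)
n_ab = sum(1 for t in transactions if {"A", "B"} <= t)

support = n_ab / n       # share of all records containing both A and B
confidence = n_ab / n_a  # share of records with A that also contain B
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```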
Sequence Discovery
Sequence discovery is very similar to association discovery except that the collection of items
occurs over a period of time. A sequence is treated as an association in which the items are
linked by time. When customer names are available, their purchase patterns over time can be
analyzed. For example, it could be found that, if a customer buys a tie, he will buy men's shoes
within one month 25 percent of the time. A dynamic programming approach based on the
dynamic time warping technique used in the speech recognition area is available to identify
the patterns in temporal databases (Fayyad et al., 1996).
Visualization
A picture is worth thousands of numbers! Visual DM techniques have proven their value
in exploratory data analysis, and they also have good potential for mining large databases.
Visualizations are particularly useful for detecting phenomena hidden in a relatively small
subset of the data. This technique is often used in conjunction with other DM techniques:
features that are difficult to detect by scanning numbers may become obvious when the
summary of data is graphically presented. Visualization techniques can also guide users when
they do not know what to look for to discover the feature. Also, this technique helps end users
comprehend information extracted by other DM techniques. Specific visualization techniques
include projection pursuit and parallel coordinates. Tufte (1983, 1990) provided many
examples of visualization techniques that have been extended to work on large data sets and
produce interactive displays.
2.4 Using Data Mining in the Insurance Industry
Data mining methodology can often improve existing actuarial models by finding additional
important variables, by identifying interactions, and by detecting nonlinear relationships. DM
can help insurance firms make crucial business decisions and turn the new found knowledge
into actionable results in business practices such as product development, marketing, claim
distribution analysis, asset liability management and solvency analysis. An example of how
data mining has been used in health insurance can be found in Borok, 1997. To be more
specific, data mining can perform the following tasks.
Retention Analysis
Database segmentation and more advanced modeling techniques enable analysts to more
accurately choose whom to target for retention campaigns. Current policyholders that are
likely to switch can be identified through predictive modeling. A logistic regression model is a
traditional approach to predict those policyholders who have larger probabilities of switching.
Identifying the target group for retention campaigns may be improved by modeling the
behavior of policyholders.
Reinsurance
DM can be used to structure reinsurance more effectively than traditional methods.
Data mining technology is commonly used for segmentation clarity. In the case of reinsurance,
a group of paid claims would be used to model the expected claims experience of another
group of policies. With more granular segmentation, analysts can expect higher levels of
confidence in the model's outcome. The selection of policies for reinsurance can be based
upon the model of experienced risk and not just the generalization that it is a long tailed book
of business.
Claims Estimation
DM operations such as Link Analysis and Deviation Detection can be used to improve
claims estimation.
The estimate of the claims provision generated from a predictive model is based on the
assumption that the future will be much like the past. If the model is not updated, then over
time, the assumption becomes that the future will be much like the distant past. However, as
more data become available, the predictive DM model can be updated, and the assumption
becomes that the future will be much like the recent past. Data mining technology enables
insurance analysts to compare old and new models and to assess them based on their
performance. When the newly updated model outperforms the old model, it is time to switch
to the new model. Given the new technologies, analysts can now monitor predictive models
and update as needed.
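A minimal sketch of this kind of model monitoring is shown below; it uses synthetic data and scikit-learn's logistic regression purely for illustration, comparing the existing model with a refitted one on a hold-out sample of recent experience.

```python
# Sketch: compare the current model with one refitted on recent experience.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_period(n, shift):
    # Synthetic claim data whose relationship drifts over time via `shift`.
    X = rng.normal(size=(n, 3))
    p = 1 / (1 + np.exp(-(X[:, 0] + shift * X[:, 1])))
    return X, rng.binomial(1, p)

X_old, y_old = make_period(2000, shift=0.2)    # distant past
X_new, y_new = make_period(2000, shift=1.0)    # recent experience
X_hold, y_hold = make_period(1000, shift=1.0)  # hold-out from the recent period

old_model = LogisticRegression(max_iter=1000).fit(X_old, y_old)
new_model = LogisticRegression(max_iter=1000).fit(X_new, y_new)

auc_old = roc_auc_score(y_hold, old_model.predict_proba(X_hold)[:, 1])
auc_new = roc_auc_score(y_hold, new_model.predict_proba(X_hold)[:, 1])
print(f"old AUC = {auc_old:.3f}, new AUC = {auc_new:.3f}")  # switch when the new model wins
```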
An important general difference in the focus between existing actuarial techniques and DM is
that DM is more oriented towards applications than towards describing the basic nature of the
underlying phenomena. For example, uncovering the nature of the underlying individual claim
distribution or the specific relation between drivers' age and auto type is not the main goal of
Data Mining. Instead, the focus is on producing a solution that can improve the predictions
for future premiums. DM is very effective in determining how premiums relate to
multidimensional risk factors such as drivers' age and type of automobile. Two examples of
applying data mining techniques in insurance actuarial practice will be presented in the next
two sections.
3. Cluster Analysis
Clustering is one of the most useful tasks in the data mining process for discovering groups and
identifying interesting distributions and patterns in the underlying data. The clustering problem
is about partitioning a given data set into groups (clusters) such that the data points in a
cluster are more similar to each other than to points in different clusters (Guha et al., 1998). For
example, segmenting existing policyholders into groups and associating a distinct profile with
each group can help future rate making strategies.
Clustering methods perform disjoint cluster analysis on the basis of Euclidean distances
computed from one or more quantitative variables and seeds that are generated and updated by
the algorithm. You can specify the clustering criterion that is used to measure the distance
between data observations and seeds. The observations are divided into clusters such that
every observation belongs to at most one cluster.
Conceptual clustering algorithms consider all the attributes that characterize each record and
identify the subset of the attributes that will describe each created cluster to form concepts.
The concepts in a conceptual clustering algorithm can be represented as conjunctions of
attributes and their values. Bayesian clustering algorithms automatically discover a clustering
that is maximally probable with respect to the data using a Bayesian approach. The various
clustering algorithms can be characterized by the type of acceptable attribute values such as
continuous, discrete or qualitative; by the presentation methods of each cluster; and by the
methods of organizing the set of clusters, either hierarchically or into flat files. K-means
clustering, a basic clustering algorithm, is introduced in the following.
Given a data set with $N$ $n$-dimensional data points $x^n$, the goal is to determine a natural
partitioning of the data set into a number of clusters ($k$) and noise. Suppose there are $k$
disjoint clusters $S_j$, $j = 1, \ldots, k$, containing $N_j$ data points with representative vector $\mu_j$. The
K-means algorithm attempts to minimize the sum-of-squares clustering function
\[
E = \sum_{j=1}^{k} \sum_{x^n \in S_j} \left\lVert x^n - \mu_j \right\rVert^2 .
\]
The training is carried out by assigning the points at random to $k$ clusters and then computing
the mean vector $\mu_j$ of the $N_j$ points in each cluster. Each point is then re-assigned to the
cluster with the nearest mean vector, the mean vectors are recomputed, and the process repeats
until the assignments no longer change.
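A minimal NumPy sketch of the algorithm just described (random initial assignment, recompute means, re-assign to the nearest mean) follows; the data are synthetic.

```python
# Minimal K-means sketch: random assignment, recompute means, re-assign.
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))          # random initial assignment
    for _ in range(n_iter):
        # Mean vector of the points currently in each cluster
        # (re-seed a cluster from a random point if it has emptied out).
        means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else X[rng.integers(len(X))] for j in range(k)])
        # Re-assign each point to the cluster with the nearest mean vector.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, means

X = np.random.default_rng(1).normal(size=(200, 4))   # synthetic feature vectors
labels, means = k_means(X, k=3)
```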
The synthetic data that was obtained from the vendor is given in Table 2.
After preprocessing the data, which might include selecting a random sample of the data for
initial analysis, filtering the outlying observations, and standardizing the variables in some
way, we use the K-means clustering to form the clusters.
The following pie chart provides a graphical representation of key characteristics of the
clusters.
[Pie chart: clusters for EMDATA.DRIVERS]
In the pie chart, slice width is the root-mean-square distance (root-mean-square standard
deviation) between cases in the cluster; the height represents the frequency; and the color
represents the distance of the farthest cluster member from the cluster. Cluster 5 contains the
most cases while cluster 9 has the fewest.
Figure 2 below displays the input means for the entire data set over all of the clusters. The
input means are normalized using the scale transformation
\[
y = \frac{x - \min(x)}{\max(x) - \min(x)} .
\]
[Figure 2. Normalized input means over all clusters. Inputs shown: Education, Coverage, Car Type, Climate, Location, Gender, ID, Credit Score, Car Age, Age; horizontal axis: normalized mean.]
The Normalized Mean Plot can be used to compare the overall normalized means with the
normalized means in each cluster. Figure 3 compares the input means from cluster 1 (red
blocks) to the overall input means (blue blocks). You want to identify the input means for
clusters that differ substantially from the overall input means. The plot ranks the input based
on how spread out the input means are for the selected cluster relative to the overall input
means. The input that has the biggest spread is listed at the top and the input with the smallest
spread is listed at the bottom. The input with the biggest spread typically best characterizes
the selected cluster (Cluster 1 in Figure 3). Figure 3 shows that the variables "Car Type" and
"Location" are key inputs that help differentiate drivers in Cluster 1 from all of the drivers in
the data set. Drivers in Cluster 1 also tend to have higher than average education levels.
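The computation behind this kind of comparison can be sketched as follows; the inputs, cluster labels, and column names are synthetic placeholders rather than the study's data.

```python
# Sketch: normalized input means per cluster versus the overall means.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(200, 3)),
                  columns=["age", "car_age", "credit_score"])   # hypothetical inputs
labels = rng.integers(0, 5, size=len(df))                       # cluster assignments

# y = (x - min(x)) / (max(x) - min(x))
normalized = (df - df.min()) / (df.max() - df.min())

overall_means = normalized.mean()
cluster_means = normalized.groupby(labels).mean()

# Inputs with the largest spread from the overall mean best characterize a cluster.
spread = (cluster_means.loc[1] - overall_means).abs().sort_values(ascending=False)
print(spread)
```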
Figure 3. Comparing the Input Means for Cluster 1 to the Overall Means
[Bar chart comparing Cluster 1 input means (red) with the overall input means (blue). Inputs ranked by spread: Car Type, Location, Education, Climate, Age, Gender, Coverage, Car Age, Credit Score, ID.]
Cluster 5, as shown in Figure 4, has higher than average education and better than average
credit scores. Most drivers in Cluster 5 live in location zone 4, and they drive newer cars than
average drivers. These characteristics can also be observed in Table 3.
Figure 4. Comparing the Input Means for Cluster 5 to the Overall Means
[Bar chart comparing Cluster 5 input means with the overall input means. Inputs ranked by spread: Location, Education, Credit Score, Climate, Gender, Coverage, Car Age, ID, Age, Car Type.]
Table 3 displays information about each cluster. The statistic Root-Mean-Square Standard
Deviation is the root mean square, across variables, of the cluster standard
deviations, which is equal to the root-mean-square distance between cases in the cluster.
Table 3. Cluster Statistics and Input Means

Cluster  Frequency   Max Distance   Nearest   Distance to       Credit   Car
         of Cluster  from Seed      Cluster   Nearest Cluster   Score    Age    Age     Gender   Location   Climate   Car Type   Coverage   Education
9        7           2.87           5         2.82              0.86     3.29   35.57   1.00     3.43       1.29      3.57       2.43       1.86
8        20          3.22           7         2.40              0.62     2.15   46.65   0.65     2.80       2.55      2.25       2.85       1.85
7        22          3.25           2         2.25              0.65     2.73   24.59   0.27     1.95       2.09      1.45       2.36       2.27
6        21          3.38           4         2.41              0.81     6.52   35.19   0.43     2.00       1.48      1.67       1.19       1.76
5        33          3.41           4         2.37              0.82     3.00   32.79   0.58     3.82       2.33      2.03       2.39       3.03
4        18          3.83           5         2.37              0.59     5.17   34.44   0.39     3.50       1.83      2.72       1.44       2.56
3        7           3.21           7         3.14              0.46     8.00   20.57   0.43     3.57       2.43      1.14       1.43       2.00
2        18          3.38           7         2.25              0.56     3.56   26.00   0.28     2.89       2.67      1.28       2.28       1.39
1        27          3.40           5         2.55              0.75     2.37   44.15   0.07     2.04       1.52      3.30       2.70       3.00
During the clustering process, an importance value is computed as a value between 0 and 1 for
each variable. Importance is a measure of worth of the given variable to the formation of the
clusters. As shown in Table 4, variable "Gender" has an importance of 0, which means that
the variable was not used as a splitting variable in developing the clusters. The measure of
"importance" indicates how well the variable divides the data into classes. Variables with
zero importance should not necessary be dropped.
Table 4. Variable Importance

Name            Importance
GENDER          0
ID              0
LOCATION        0
CLIMATE         0
CAR_TYPE        0.529939
COVERAGE        0.363972
CREDIT_SCORE    0.343488
CAR_AGE         0.941952
AGE             1
EDUCATION       0.751203
be interpreted more easily. As a result, actuaries can more accurately predict the likelihood of
a claim and the amount of the claim. For example, one insurance company found that a
segment of the 18- to 20-year old male drivers had a noticeably lower accident rate than the
entire group of 18- to 20-year old males. What variable did this subgroup share that could
explain the difference? Investigation of the data revealed that the members of the lower risk
subgroup drove cars that were significantly older than the average and that the drivers of the
older cars spent time customizing their "vintage autos." As a result, members of the subgroup
were likely to be more cautious driving their customized automobiles than others in their age
group.
Lastly, the cluster identifier for each observation can be passed to other nodes for use as an
input, id, group, or target variable. For example, you could form clusters based on different
age groups you want to target. Then you could build predictive models for each age group by
passing the cluster variable as a group variable to a modeling node.
4. Decision Tree Analysis
Decision trees are part of the Induction class of DM techniques. An empirical tree represents
a segmentation of the data that is created by applying a series of simple rules. Each rule
assigns an observation to a segment based on the value of one input. One rule is applied after
another, resulting in a hierarchy of segments within segments. The hierarchy is called a tree,
and each segment is called a node. The original segment contains the entire data set and is
called the root node of the tree. A node with all its successors forms a branch of the node that
created it. The final nodes are called leaves. For each leaf, a decision is made and applied to
all observations in the leaf. The type of decision depends on the context. In predictive
modeling, the decision is simply the predicted value.
The decision tree DM technique enables you to create decision trees that:
• Classify observations based on the values of nominal, binary, or ordinal targets,
• Predict outcomes for interval targets, or
• Predict the appropriate decision when you specify decision alternatives.
Specific decision tree methods include Classification and Regression Trees (CART; Breiman
et al., 1984) and Chi-squared Automatic Interaction Detection (CHAID; Kass,
1980). CART and CHAID are decision tree techniques used to classify a data set.
The following discussion provides a brief description of the CHAID algorithm for building
decision trees. For CHAID, the inputs are either nominal or ordinal. Many software packages
accept interval inputs and automatically group the values into ranges before growing the tree.
For nodes with many observations, the algorithm uses a sample for the split search, for
computing the worth (the measure of worth indicates how well a variable divides the data into
classes), and for observing the limit on the minimum size of a branch. The samples in
different nodes are taken independently. For binary splits on binary or interval targets, the
optimal split is always found. For other situations, the data is first consolidated, and then either
all possible splits are evaluated or else a heuristic search is used.
The consolidation phase searches for groups of values of the input that seem likely to be
assigned the same branch in the best split. The split search regards observations in the same
consolidation group as having the same input value. The split search is faster because fewer
candidate splits need evaluating. A primary consideration when developing a tree for
prediction is deciding how large to grow the tree or, what comes to the same end, what nodes
to prune off the tree. The CHAID method of tree construction specifies a significance level of
a Chi-square test to stop tree growth. The splitting criteria are based on p-values from the F-
distribution (interval targets) or Chi-square distribution (nominal targets). For these criteria,
the best split is the one with the smallest p-value. By default, the p-values are adjusted to take
into account multiple testing.
A missing value may be treated as a separate value. For nominal inputs, a missing value
constitutes a new category. For ordinal inputs, a missing value is free of any order restrictions.
The search for a split on an input proceeds stepwise. Initially, a branch is allocated for each
value of the input. Branches are alternately merged and re-split as seems warranted by the p-
values. The original CHAID algorithm by Kass stops when no merge or re-splitting operation
creates an adequate p-value. The final split is adopted. A common alternative, sometimes
called the exhaustive method, continues merging to a binary split and then adopts the split
with the most favorable p-value among all splits the algorithm considered.
After a split is adopted for an input, its p-value is adjusted, and the input with the best-adjusted
p-value is selected as the splitting variable. If the adjusted p-value is smaller than a threshold
you specified, then the node is split. Tree construction ends when all the adjusted p-values of
the splitting variables in the unsplit nodes are above the user-specified threshold.
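The following sketch illustrates the core calculation only: it scores a single candidate categorical split with a Chi-square test of the branch-by-target contingency table, and is not a full CHAID implementation; the data are synthetic.

```python
# Sketch: score one candidate categorical split with a Chi-square test.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
location = rng.integers(1, 5, size=500)                      # candidate input (4 zones)
claim = rng.binomial(1, np.where(location <= 2, 0.2, 0.4))   # nominal target

table = pd.crosstab(location, claim)        # branch-by-target contingency table
chi2, p_value, dof, _ = chi2_contingency(table)

# CHAID-style rule: the candidate with the smallest (adjusted) p-value wins,
# and the node is split only if that p-value is below the chosen threshold.
print(f"chi-square = {chi2:.1f}, p-value = {p_value:.4g}")
```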
Tree techniques provide insights into the decision-making process, which explains how the
results come about. The decision tree is efficient and is thus suitable for large data sets.
Decision trees are perhaps the most successful exploratory method for uncovering deviant data
structure. Trees recursively partition the input data space in order to identify segments where
the records are homogeneous. Although decision trees can split the data into several
homogeneous segments and the rules produced by the tree can be used to detect interaction
among variables, the method is relatively unstable, and it is difficult to detect linear or quadratic
relationships between the response variable and the predictor variables.
First, we use the decision tree algorithm to identify the factors that influence claim frequency.
After the factors are identified, the logistic regression technique is used to quantify the claim
frequency and the effect of each risk factor.
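A minimal sketch of this two-stage workflow, using scikit-learn's decision tree and statsmodels' logistic regression as stand-ins for the software used in the study, is given below; the file and column names are hypothetical.

```python
# Sketch of the two-stage workflow: a tree to screen risk factors, then a
# logistic regression on the factors the tree found useful.
# The file and column names are hypothetical.
import pandas as pd
import statsmodels.api as sm
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("drivers.csv")
inputs = ["age", "credit_score", "car_age", "location", "climate",
          "car_type", "coverage", "education", "gender"]
X = pd.get_dummies(df[inputs], drop_first=True).astype(float)
y = df["claim"]

tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=50).fit(X, y)
importance = pd.Series(tree.feature_importances_, index=X.columns)
selected = importance[importance > 0].index.tolist()   # drop irrelevant factors

logit = sm.Logit(y, sm.add_constant(X[selected])).fit()
print(logit.summary())
```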
The data for the study has the following variables as shown in Table 5:
We now use the decision tree algorithm to analyze the influences and the importance of the
claim frequency risk factors. The tree algorithm used in this research is SAS/Enterprise Miner
Version 4.2 (2002). We built 100 binary regression trees and 100 CHAID-like trees to select an
optimal decision tree. Our decision tree analysis reveals that the credit score has the greatest
impact on the claim frequency. The claim frequency, and the interaction among different
factors that affect the claim frequency, vary as the credit score status changes. Furthermore,
there is a significant climate influence within the "higher credit score" status.
The tree diagram displays node (segment) statistics, the names of variables used to split the
data into nodes, and the variable values for several levels of nodes in the tree. Figure 5 shows
a partial profile of the tree diagram for our analysis:
Figure 5. Tree Diagram
[Tree diagram: the root node splits on Credit Score; each node shows the target-level percentages and counts for the training and validation data.]
In Figure 5, each leaf node displays the percentage and n-count of the values that were used to
determine the branching. The second column contains the learning from the training data
including the percentage for each target level, the count for each target level, and the total
count. The third column contains the learning from the validation data including the
percentage for each target level, the count for each target level, and the total count. For
example, among the drivers with credit scores below 75.5%, 53% submitted a claim in the
training data.
The assessment values are used to recursively partition the data into homogenous subgroups.
The method is recursive because each subgroup results from splitting a subgroup from a
previous split. The numeric labels directly above each node indicate at which point the tree
algorithm found significant splits in interval level variable distributions or in categorical splits
for nominal or ordinal level distributions. The character labels positioned central to each split
are the variable names. You can trace the paths from the root to each leaf and express the
results as a rule.
As shown in Figure 5, the claim frequency varies with the most important risk factor (the
credit score status, in this study) among all the other variables. Based on the tree analysis, the car
age, coverage, and car type are irrelevant factors and should not be included in the claim
frequency model.
Based on the tree analysis, we now use logistic regression to estimate the probability of claim
occurrence for each driver based on the factors under consideration. As discussed in Section 2,
logistic regression attempts to predict the probability of a claim as a function of one or more
independent inputs. Figure 6 shows a bar chart of the effect T-scores from the logistic
regression analysis. An effect T-score is equal to the parameter estimate divided by its
standard error.
The scores are ordered by decreasing absolute value in the chart. The color density legend
indicates the size of the score for a bar. The legend also displays the minimum and maximum
score to the left and right of the legend, respectively. The vertical axis represents the absolute
value for the effect. In this example, the first variable, Age has the largest absolute value,
Credit Score has the second largest absolute value, and so on. The estimates for Location and
Education are positive, so their bars are colored a shade of orange. The estimates for Age
and Credit Score have negative values, so their bars are displayed in yellow.
Assessment is the final part of the data mining process. The Assessment criterion is a
comparison of the expected to actual profits or losses obtained from model results. This
criterion enables you to make cross-model comparisons and assessments, independent of all
other factors (such as sample size, modeling node, and so on).
Figure 7 is a cumulative % claim-occurrence lift chart for the logistic regression model. Lift
charts show the percent of captured claim-occurrence (a.k.a. the lift value) on the vertical axis.
In this chart the target drivers are sorted from left to right by individuals most likely to have an
accident, as predicted by each model. The sorted group is lumped into ten percentiles along
the X-axis; the left-most percentile is the 10% of the target predicted most likely to have an
accident. The vertical axis represents the predicted cumulative % claim-occurrence if the
driver from that percentile on down submitted a claim.
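The lift values plotted in such a chart can be computed as in the following sketch, which uses synthetic predicted probabilities and outcomes.

```python
# Sketch: cumulative percentage of claims captured by predicted-probability decile.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
p_hat = rng.uniform(size=1000)        # model-predicted claim probabilities
claim = rng.binomial(1, p_hat)        # synthetic outcomes

scored = pd.DataFrame({"p_hat": p_hat, "claim": claim}).sort_values("p_hat", ascending=False)
scored["decile"] = np.repeat(np.arange(1, 11), len(scored) // 10)

captured = scored.groupby("decile")["claim"].sum().cumsum() / scored["claim"].sum()
baseline = np.linspace(0.1, 1.0, 10)  # what random selection of drivers would capture
print(pd.DataFrame({"captured": captured, "baseline": baseline}))
```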
Figure 7. Lift Chart for Logistic Regression
The lift chart displays the cumulative % claim-occurrence values for the logistic regression model
and for a random baseline model, which represents the claim rate if you chose drivers at
random.
The performance quality of a model is demonstrated by the degree the lift chart curve pushes
upward and to the left. For this example, the logistic regression model captured about 30% of
the drivers in the 10th percentile. The logistic regression model does have better predictive
power from about the 20th to the 80th percentiles. At about the 90th percentile, the cumulative
% claim-occurrence values for the predictive model are about the same as the random baseline
model.
5. Conclusions
This paper introduced the data mining approach to modeling insurance risk and some
implementation of the approach. In this paper, we provide an overview of data mining
operations and techniques and demonstrate two potential applications to property/casualty
actuarial practice. In section 3.2, we used k-means clustering to better describe a group of
drivers by segmentation. In section 4.2, we examined several risk factors for automobile
drivers with the goal of predicting their claim frequency. The influences and the correlations
of these factors on the auto claim distribution were identified with exploratory data analysis and
the decision tree algorithm. Logistic regression was then applied to model claim frequency.
Due to our use of synthetic data, the examples show limited advantages of DM over
traditional actuarial analysis; the true significance of data mining can only be
shown with huge, messy databases. Issues of how to improve data quality through data
acquisition, data integration, and data exploration will be discussed in a future study.
The key to gaining a competitive advantage in the insurance industry is found in recognizing
that customer databases, if properly managed, analyzed, and exploited, are unique, valuable
corporate assets. Insurance firms can unlock the intelligence contained in their customer
databases through modern data mining technology. Data mining uses predictive modeling,
database segmentation, market basket analysis, and combinations thereof to more quickly
answer crucial business questions with greater accuracy. New products can be developed and
marketing strategies can be implemented enabling the insurance firm to transform a wealth of
information into a wealth of predictability, stability, and profits.
Acknowledgement
The authors would like to thank the CAS Committee on Management Data and Information
for reviewing this paper and providing many constructive suggestions.
References
Adya, M. 1998. "How Effective Are Neural Networks at Forecasting and Prediction? A
Review and Evaluation." Journal of Forecasting 17(5-6, Sep-Nov): 481-495.
Berry, M. A., AND G. S. Linoff. 2000. Mastering Data Mining. New York, N.Y.: Wiley.
Bishop, C. M. 1995. Neural Networks for Pattern Recognition. New York: Oxford University
Press.
Borok, L.S. 1997. "Data mining: Sophisticated forms of managed care modeling through
artificial intelligence." Journal of Health Care Finance. 23(3), 20-36.
Carpenter, G., AND S. Grossberg. 1988. "The ART of Adaptive Pattern Recognition by a
Self-Organizing Neural Network." IEEE Computer, 21(3): 77-88.
Cheeseman, P., J. Kelly, M. Self, J. Stutz, W. Taylor, AND D. Freeman. 1988. "AutoClass: A
Bayesian Classification System." 5th Int'l Conf. on Machine Learning, Morgan Kaufmann.
Goldberg, D.E. 1989. Genetic Algorithms in Search, Optimization and Machine Learning.
Morgan Kaufmann.
Guha, S., R. Rastogi, AND K. Shim. 1998. "CURE: An Efficient Clustering Algorithm for
Large Databases." Proceedings of the ACM SIGMOD Conference.
Ester, M., H. Kriegel, J. Sander, AND X. Xu. 1998. "Clustering for Mining in Large Spatial
Databases." Special Issue on Data Mining, KI-Journal, 1. ScienTec Publishing.
Fisher, D., M. Pazzani, AND P. Langley. 1991. Concept Formation: Knowledge and
Experience in Unsupervised Learning. San Mateo, CA: Kaufmann.
Hand, D., H. Mannila, AND P. Smyth. 2001. Principles of Data Mining. Cambridge,
Massachusetts: MIT Press.
Hinneburg, A., AND D.A. Keim. 1998. "An Efficient Approach to Clustering in Large
Multimedia Databases with Noise." Proceedings 4th Int. Conf. on Knowledge Discovery
and Data Mining.
Hosmer, D. W., AND S. Lemeshow. 1989. Applied Logistic Regression. New York, N. Y.:
John Wiley & Sons.
Kleinbaum, D. G., L. Kupper, AND K. Muller. 1988. Applied Regression Analysis and other
Multivariable Methods, 2nd edition. PWS-KENT Publishing Company, Boston.
Rumelhart, D.E., G.E. Hinton, AND R.J. Williams. 1986. "Learning Internal Representation by
Error Propagation." Parallel Distributed Processing, ed. by Rumelhart, D.E., J.L.
McClelland, AND the PDP Research Group. Cambridge, MA: The MIT Press: 318-362.
Tufte, E.R. 1983. The Visual Display of Quantitative Information. Graphics Press, Cheshire,
CT.
Martian Chronicles: Is MARS better than Neural Networks?
Abstract:
A recently developed data mining technique, Multivariate Adaptive Regression Splines
(MARS) has been hailed by some as a viable competitor to neural networks that does not
suffer from some of the limitations of neural networks. Like neural networks, it is
effective when analyzing complex structures which are commonly found in data, such as
nonlinearities and interactions. However, unlike neural networks, MARS is not a "black
box", but produces models that are explainable to management.
This paper will introduce MARS by showing its similarity to an already well-understood
statistical technique: linear regression. It will illustrate MARS by applying it to insurance
fraud data and will compare its performance to that of neural networks.
Acknowledgements:
The author thanks the Automobile Insurers Bureau of Massachusetts for supplying the
data used in this paper and wishes to acknowledge the efforts of Richard Derrig in
providing insight into the data and approaches to analyzing it. The author also
acknowledges the following people who reviewed this paper and provided many
constructive suggestions: Jane Taylor, Francois Morrisette, Christopher Yaure, Patricia
Francis-Lyon, and Virginia Lambert.
Martian Chronicles: Is MARS better than Neural Networks?
The casualty actuarial literature contains only a few papers about data mining techniques.
Speights et al. (Speights et al., 1999) and Francis (Francis, 2001) introduced the neural
network procedure for modeling complex insurance data. Hayward (Hayward, 2002)
described the use of data mining techniques in safety promotion and better matching of
premium rates to risk. The methods discussed by Hayward included exploratory data
analysis using pivot tables and stepwise regression.
In this paper, a new technique, MARS, which has been proposed as an alternative to
neural networks (Steinberg, 2001), will be introduced. The name MARS, coined for this
technique by its developer, Friedman (Hastie et al., 2001), is an acronym for
Multivariate Adaptive Regression Splines. The technique is a regression-based technique
which allows the analyst to use automated procedures to fit models to large complex
databases. Because the technique is regression based, its output is a linear function that is
readily understood by analysts and can be used to explain the model to management.
Thus, the technique does not suffer from the "black box" limitation of neural networks.
However, the technique addresses many of the same data complexities addressed by
neural networks.
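To illustrate the flavor of a MARS model (without reproducing its forward/backward basis-function search), the following sketch fits an ordinary least-squares regression on a pair of hinge basis functions, max(0, x - t) and max(0, t - x), with the knot t fixed by hand; the data are synthetic.

```python
# Sketch: a MARS-style model is a linear regression on hinge basis functions
# max(0, x - t) and max(0, t - x); here the knot t is fixed by hand instead of
# being found by MARS's forward/backward search. Data are synthetic.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500)
y = np.where(x < 4, x, 4 + 0.2 * (x - 4)) + rng.normal(scale=0.3, size=500)

def hinge(u, t):
    return np.maximum(0.0, u - t)

knot = 4.0
X = sm.add_constant(np.column_stack([hinge(x, knot), hinge(knot, x)]))
fit = sm.OLS(y, X).fit()
print(fit.params)   # intercept and the slopes attached to the two hinge terms
```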
Neural networks are one of the more popular data mining approaches. These methods are
among the oldest data mining methods and are included in most data mining software
packages. Neural networks have been shown to be particularly effective in handling
some complexities commonly found in data. Neural networks are well known for their
ability to model nonlinear functions. Research has shown that a neural network with a
sufficient number of parameters can model any continuous nonlinear function
accurately.1 Francis (Francis, 2001) also showed that neural networks are valuable in
fitting models to data containing interactions. Neural networks are often the tools of
choice when predictive accuracy is required. Berry and Linoff (Berry and Linoff, 1997)
suggest that neural networks are popular because of their proven track record.
Neural networks are not ideal for all data sets. Warner and Misra presented several
examples where they compared neural networks to regression (Warner and Misra, 1996).
Their research showed that regression outperformed neural networks when the functional
relationship between independent and dependent variables was known. Francis (Francis,
2001) showed that when the relationship between independent and dependent variables
was linear, classical techniques such as regression and factor analysis outperformed
neural networks.
Francis (Francis, 2001) listed several complexities found in actual insurance data and
then showed how neural networks were effective in dealing with these complexities. This
paper will introduce MARS and will compare and contrast how MARS and neural
networks deal with several common data challenges. Three challenges that will be
addressed in this paper are:
The Data
This paper features the application of two data mining techniques, neural networks and
MARS, to the fraud problem. The data for the application was supplied by the
Automobile Insurers Bureau of Massachusetts (AIB). The data consists of a random
sample of 1400 closed claims that were collected from PIP (personal injury protection or
no-fault coverage) claimants in Massachusetts in 1993. The database was assembled
with the cooperation of ten large insurers. This data has been used by the AIB, the
Insurance Fraud Bureau of Massachusetts (IFB) and other researchers to investigate
fraudulent claims or probable fraudulent claims (Derrig et al., 1994, Weisberg and
Derrig, 1995, Viaene et al., 2002). While the typical data mining application would use
a much larger database, the AIB PIP data is well suited to illustrating the use of data
mining techniques in insurance. Viaene et al. used the AIB data to compare the
performance of a number of data mining and conventional classification techniques
(Viaene et al., 2002).
Two key fraud related dependent variables were collected in the study: an overall
assessment (ASSESS) of the likelihood the claim is fraudulent or abusive and a suspicion
score (SUSPICION). Each record in the data was assigned a value by an expert. The
value indicates the expert's subjective assessment as to whether the claim was legitimate
or whether fraud or abuse was suspected. Experts were asked to classify suspected fraud
or abuse claims into the following categories: exaggerated damages, opportunistic fraud
or planned fraud. As shown in Table 1, the assessment variable can take on 5 possible
values. In addition, each claim was assigned a score from 0 (none) to 10 (very high)
indicating the expert's degree of suspicion that the claim was abusive or fraudulent.
Weisberg and Derrig (Weisberg and Derrig, 1993) found that more serious kinds of
fraud, such as planned fraud were associated with higher suspicion scores than "softer"
fraud such as exaggeration of damages. They suggest that the suspicion score was able to
measure the range of"soft" versus "hard" fraud.
The database contains detailed objective claim information on each claim in the study.
This includes information about the policy inception date, the date the accident occurred,
the date it was reported, the paid and incurred loss dollars, the injury type, payments to
health care providers and the provider type. The database also contains "red flag" or
fraud indicator variables. These variables are subjective assessments of characteristics of
the claim that are believed to be related to the likelihood of fraud or abuse. More
information on the variables in the model is supplied below in the discussion of specific
models.
Table 1
Assessment Variable
Value Assessment Percent of Data
1 Probably legitimate 64%
2 Excessive treatment only 20%
3 Suspected opportunistic fraud, no injury 3%
4 Suspected opportunistic fraud, exaggerated injury 12%
5 Suspected planned fraud 1%
We may use the more inclusive term "abuse" when referring to the softer kinds of
fraudulent activity, as only a very small percentage of claims meet the strict standard of
criminal fraud (Derrig, 2002). However, misrepresentation and exaggeration of the
nature and extent of the damages, including padding of the medical bills so that the value
of the claim exceeds the tort threshold, occur relatively frequently. While these activities
are often thought of as fraud, they do not meet a legal definition of fraud. Therefore, they
will be referred to as abuse. Overall, about one third of the claims were coded as
probable abuse or fraud claims.
Nonlinear Functions
The relationships encountered in insurance data are often nonlinear. Classical statistical
modeling methods such as linear regression have had a tremendous impact on the
analysis and modeling of data. However, traditional statistical procedures often assume
that the relationships between dependent and independent variables are linear.
Traditional modeling also allows linear relationship that result from a transformation of
dependent or independent variables, so some nonlinear relationships can be
approximated. In addition, there are techniques specifically developed for fitting
nonlinear functions such as nonlinear regression. However, these techniques require that
theory or experience specify the "true" form of the nonlinear relationships. Data mining
techniques such as neural networks and MARS do not require that the relationships
between predictor and dependent variables be linear (whether or not the variables are
transformed). Both neural networks and MARS are also considered nonparametric
because they require no assumptions about the form of the relationship between
dependent and independent variables.
For this illustration, a dependent variable that is not categorical (i.e. values have a
meaningful order) was selected. The selected dependent variable was SUSPICION.
Unlike the ASSESS variable, the values on the SUSPICION variable have a meaningful
range, with higher values associated with suspicion of more serious fraud.
To illustrate methods of fitting models to nonlinear curves, a variable was selected which
1) had a significant correlation with the dependent variable, and 2) displayed a highly
nonlinear relationship. Illustrating the techniques is the objective of this example. The
data used may require significant time to collect and may therefore not be practical for an
application where the objective is to predict abuse and fraud (which would require data
that is available soon after the claim is reported). Later in the paper, models for
prospectively predicting fraud will be presented. The variable selected was the first
medical provider's bill.² A medical provider may be a doctor, a clinic, a chiropractor or a
physical therapist. Prior published research has indicated that abusive medical treatment
patterns are often key drivers of fraud (Derrig et al., 1994, Weisberg and Derrig, 1995).
Under no-fault laws, claimants will often deliberately run the medical bills up high
enough to exceed tort thresholds. In this example the relationship between the first
provider's medical bill and the value of the suspicion score will be investigated. The AIB
fraud database contains the medical bills submitted from the top two health care
providers. If more costly medicine is delivered to suspicious claims than non-suspicious
claims, the provider bills should be higher for the suspicious claims.
Figure 1 presents a scatterplot of the relationship between SUSPICION and the provider
bill. No relationship is evident from the graph. However, certain nonlinear relationships
can be difficult to detect visually.
² Note that Massachusetts PIP covers only the first $8,000 of medical payments if the claimant has health
insurance. Large bill amounts may represent data from claimants with no coverage. Bills may also exceed
$8,000 even if payments are limited. However, the value of medical bills on some claims may be
truncated because reimbursement is not expected.
Figure 1
Scatterplot of SUSPICION vs Provider Bill
[Scatterplot of individual claims: suspicion score (0-10) plotted against provider bill]
Neural networks will first be used to fit a curve to the data. A detailed description of how
neural networks analyze data is beyond the scope of this paper. Several sources on this
topic are Francis, Lawrence and Smith (Francis, 2001, Lawrence, 1994, Smith, 1996).
Although based upon how neurons function in the brain, the neural network technique
essentially fits a complex non-parametric nonlinear regression. A task at which neural
networks are particularly effective is fitting nonlinear functions. The graph below
displays the resulting function when the dependent variable SUSPICION is fit to the
provider bill by a neural network. This graph displays a function that increases quickly at
lower bill amounts and then levels off. Although the curve is flat over much of the range
of medical bills, it should be noted that the majority of bills are below $2,000 (in 1993
dollars).
Figure 2
Neural Network Fit of SUSPICION vs Provider Bill
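For readers who want to reproduce this kind of curve fitting, the sketch below fits a one-predictor
neural network in Python using scikit-learn. It is an illustration of the technique only: the provider
bill and suspicion values are simulated stand-ins for the AIB data, and the network size and settings
are assumptions rather than the configuration used in this study.

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated stand-ins for the AIB variables: a curve that rises quickly and levels off
rng = np.random.default_rng(0)
provider_bill = rng.uniform(0, 8000, size=1000)
suspicion = 5 * (1 - np.exp(-provider_bill / 1500)) + rng.normal(0, 1.5, size=1000)

# A small feedforward network; scaling the input helps the optimizer converge
net = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(5,), activation="logistic",
                 solver="lbfgs", max_iter=5000, random_state=0))
net.fit(provider_bill.reshape(-1, 1), suspicion)

# The fitted curve rises steeply at low bill amounts and then flattens, as in Figure 2
grid = np.linspace(0, 8000, 9).reshape(-1, 1)
print(np.round(net.predict(grid), 2))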
One of the most common statistical procedures for curve fitting is linear regression.
Linear regression assumes the relationship between the dependent and independent
variables is linear. Figure 3 displays the graph of a fitted regression line of SUSPICION
on provider bill. The regression forces a linear fit to SUSPICION versus the payment
amount. Thus, rather than a curve with a rapidly increasing trend line that levels off, a
line with a constant slope is fitted. If the relationship is in fact nonlinear, this procedure
is not as accurate as that of the neural network.
Figure 3
Regression Fit of SUSPICION vs Provider Bill
When the true relationship between a dependent and independent variable is nonlinear,
various approaches are available when using traditional statistical procedures for fitting
the curve. One approach is to apply a nonlinear transformation to the dependent or
independent variable. A linear regression is then fit to the transformed variables. As an
example, a log transform was applied to the provider bill variable in the AIB data. The
regression fit was of the form:
Y = B0 + B1 ln(X)
That is, the dependent variable, the suspicion score, is assumed to be a linear function of
the natural log of the independent variable, provider bill. Figure 4 displays the curve fit
using the logarithmic transformation.
Figure 4
Log Transform Fit of SUSPICION vs Provider Bill
Another common approach is to approximate the relationship with a polynomial regression of the
form:
Y = B0 + B1X + B2X² + ... + BnXⁿ
Generally, low order polynomials are used in the approximation. A cubic polynomial
(including terms up to provider bill raised to the third power) was used in the fit. Figure
5 displays a graph of a fitted polynomial regression.
Figure 5
Polynomial Regression Fit of SUSPICION vs Provider Bill
The use of polynomial regression to approximate functions is familiar to readers from its
use in Taylor series expansions for this purpose. However, the Taylor series expansion
is used to approximate a function near a point, rather than over a wide range. When
evaluating a function over a range, the maximums and inflection points of the polynomial
may not exactly match the curves of the function being approximated.
The neural network model had an R² (coefficient of determination) of 0.37 versus 0.25
for the linear model and 0.26 for the log transform. The R² of the polynomial model was
comparable to that of the neural network model. However, the fit was influenced
strongly by a small number of claims with large values. Though not shown in the graph,
at high values for the independent variable the curve declines below zero and then
increases again. This unusual behavior suggests that the fitted curve may not
approximate the "true" relationship between provider bill and suspicion score well at the
extremes of the data and may perform poorly on new claims with values outside the
range of the data used for fitting.
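A sketch of the classical baselines discussed above is given below: a straight-line regression, the
log-transform fit Y = B0 + B1 ln(X), and a cubic polynomial, compared by R². The data are simulated
stand-ins of the same kind used earlier, and np.log1p is assumed here to handle claims with $0 bills,
so the printed R² values are illustrative only and will not match those reported for the AIB data.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Simulated stand-ins for provider bill and suspicion score
rng = np.random.default_rng(0)
provider_bill = rng.uniform(0, 8000, size=1000)
suspicion = 5 * (1 - np.exp(-provider_bill / 1500)) + rng.normal(0, 1.5, size=1000)
X = provider_bill.reshape(-1, 1)

# Three classical parametric fits: linear, log transform, cubic polynomial
designs = {
    "linear": X,
    "log transform": np.log1p(X),                       # ln(1 + X) to accommodate $0 bills
    "cubic polynomial": np.column_stack([X, X**2, X**3]),
}
for name, Z in designs.items():
    fit = LinearRegression().fit(Z, suspicion)
    print(f"{name:18s} R-squared = {r2_score(suspicion, fit.predict(Z)):.2f}")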
Table 2 below shows the values of SUSPICION for ranges of the provider bill variable.
The table indicates that SUSPICION increases rapidly at low bill amounts and then levels
off at about $3,000.
Table 2
Suspicion Scores by Provider Bill
Provider Bill Number of Claims Mean Suspicion Score
$0 444 0.3
1 - 1,000 376 1.1
1,001 - 2,000 243 3.0
2,001 - 3,000 227 4.2
3,001 - 4,000 60 4.6
4,001 - 5,000 33 4.2
5,001 - 6,000 5 5.8
6,001 - 7,000 12 4.3
The examples illustrate that traditional techniques which require specific parametric
assumptions about the relationship between dependent and independent variables may
lack the flexibility to model nonlinear relationships. It should be noted, however, that
Francis (Francis, 2001) presented examples where traditional techniques performed as
well as neural networks in fitting nonlinear functions. Also, when the true relationship
between the dependent and independent variables is linear, classical statistical methods
are likely to outperform neural networks.
To continue the previous example, a function was fit by MARS. The graph below
displays the MARS fitted function. It can be seen that the curve is broken into a steeply
sloping line, which then levels off much the way the neural network fitted function did.
Figure 6
MARS Fit of SUSPICION vs Provider Bill
MARS uses an optimization procedure that fits the best piecewise regression. Simpler
functions may adequately approximate the relationship between predictor and dependent
variables and are favored over more complex functions. From the graph, it can be seen
that the best MARS regression had two pieces:
1) The curve has a steep slope between bill amounts of $0 and $2,185
2) The curve levels off at bill amounts above $2,185
The points in the data range where the curves change slope are known as knots. The
impact of knots on the model is captured by basis functions. For instance, BF1 is a basis
function. Basis functions can be viewed as similar to dummy variables in linear
regression. Dummy variables are generally used in regression analysis when the
predictor variables are categorical. For instance, the Provider bill variable can be ...
MARS can perform regressions on binary variables. When the dependent variable is
binary, MARS is run in binary mode. In binary mode, the dependent variable is
converted into a 0 (legitimate) or a 1 (suspected fraud or abuse). Ordinary least squares
regression is then performed regressing the binary variable on the predictor variables.
Logistic regression is a more common procedure when the dependent variable is binary.
Suppose that the true target variable is the probability that a given claim is abusive, and
this probability is denoted p(x). The model relating p(x) to a vector of independent
variables x is:
ln( p(x) / (1 - p(x)) ) = B0 + B1X1 + ... + BnXn
where the quantity ln( p(x) / (1 - p(x)) ) is known as the logit function or log odds. Logistic
regression can be used to produce scores that are between zero and one, consistent with
viewing the score as a probability. Binary regressions can produce predicted values
which can be less than zero and greater than one. One solution to this issue is to truncate
the predicted values at zero and one. Another solution is to add the extra step of fitting a
logistic regression to the data using the MARS predicted value as the independent
variable and the binary assessment variable as the dependent variable. The fitted
probabilities from the logistic regression can then be assigned as a score for the claim.
The neural network model was also run in binary mode and also produced fitted values
which were less than zero or greater than one. In this analysis, logistic regression was
applied to the results of both the MARS and neural network fits to convert the predicted
values into probabilities.
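The calibration step described in the preceding paragraphs can be sketched as follows: a logistic
regression of the binary assessment on a model's raw fitted values (which may fall below zero or
above one) produces fitted probabilities that can be used as the claim score. The arrays below are
simulated placeholders, not output from the MARS or neural network models.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
raw_score = rng.normal(0.4, 0.4, size=2000)             # raw fitted values; can be < 0 or > 1
true_prob = 1 / (1 + np.exp(-(4 * raw_score - 1.5)))    # assumed relationship, for simulation only
assess = rng.binomial(1, true_prob)                     # 1 = suspected fraud or abuse

# Regress the binary assessment on the raw score; the fitted probabilities become the score
calibrate = LogisticRegression()
calibrate.fit(raw_score.reshape(-1, 1), assess)
score = calibrate.predict_proba(raw_score.reshape(-1, 1))[:, 1]

print(round(score.min(), 3), round(score.max(), 3))     # now bounded between 0 and 1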
The red flag variables were supplemented with claim file variables deemed to be
available early in the life of a claim and therefore of practical value in predicting fraud
and abuse.
The variables selected for use in the full model are the same as those used by Viaene et
al. (Viaene et. al., 2002) in their comparison of statistical and data mining methods.
While a much larger number of predictor variables is available in the AIB data for
modeling fraud, the red flag and objective claim variables selected for incorporation into
their models by Viaene et al. were chosen because of early availability. Therefore they
are likely to be useful in predicting fraud and abuse soon enough in the claim's lifespan
for effective mitigation efforts to lower the cost of the claim. Tables 6 and 7 present the
red flag and claim file variables.
Table 6
Red Flag Variables
Subject       Indicator Variable   Description
Accident      ACC01                No report by police officer at scene
              ACC04                Single vehicle accident
              ACC09                No plausible explanation for accident
              ACC10                Claimant in old, low valued vehicle
              ACC11                Rental vehicle involved in accident
              ACC14                Property damage was inconsistent with accident
              ACC15                Very minor impact collision
              ACC16                Claimant vehicle stopped short
              ACC19                Insured felt set up, denied fault
Claimant      CLT02                Had a history of previous claims
              CLT04                Was an out of state accident
              CLT07                Was one of three or more claimants in vehicle
Injury        INJ01                Injury consisted of strain or sprain only
              INJ02                No objective evidence of injury
              INJ03                Police report showed no injury or pain
              INJ05                No emergency treatment was given
              INJ06                Non-emergency treatment was delayed
              INJ11                Unusual injury for auto accident
Insured       INS01                Had history of previous claims
              INS03                Readily accepted fault for accident
              INS06                Was difficult to contact/uncooperative
              INS07                Accident occurred soon after effective date
Lost Wages    LW01                 Claimant worked for self or a family member
              LW03                 Claimant recently started employment
Table 7
Claim Variables Available Early in Life of Claim
Variable      Description
AGE           Age of claimant
POLLAG        Lag from policy inception to date of accident⁸
RPTLAG        Lag from date of accident to date reported
TREATLAG      Lag from date of accident to earliest treatment by service provider
AMBUL         Ambulance charges
PARTDIS       The claimant partially disabled
TOTDIS        The claimant totally disabled
LEGALREP      The claimant represented by an attorney
One of the objectives of this research is to investigate which variables are likely to be of
value in predicting fraud and abuse. To do this, procedures are needed for evaluating the
importance of variables in predicting the target variable. Below, we present some
methods that can be used to evaluate the importance of the variables.
The effective degrees of freedom is the means by which the GCV error function puts a
penalty on adding variables to the model. The effective degrees of freedom is chosen by
the modeler. Since MARS tests many possible variables and possible basis functions, the
effective degrees of freedom used in parameterizing the model is much higher than the
actual number of basis functions in the final model. Steinberg states that research
indicates that k should be two to five times the number of basis functions in the model,
although some research suggests it should be even higher (Steinberg, 2000).
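The GCV criterion itself is not reproduced in this excerpt. For reference, a commonly quoted form
from the MARS literature (see, for example, Hastie, Tibshirani and Friedman, 2001) is

GCV(M) = [ (1/N) Σ ( yi - f̂M(xi) )² ] / [ 1 - C(M)/N ]²

where N is the number of observations, f̂M is the fitted model with M basis functions, and C(M) is
the effective number of parameters, which grows with the number of basis functions and the
per-knot penalty chosen by the modeler. The exact penalty convention varies by implementation, so
this should be read as a sketch of the criterion rather than the formula used by any particular
software package.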
The GCV can be used to rank the variables in importance. To rank the variables in
importance, the GCV is computed with and without each variable in the model.
For neural networks, a statistic known as the sensitivity can be used to assess the relative
importance of variables. The sensitivity is a measure of how much the predicted value's
error increases when the variables are excluded from the model one at a time. Potts
(Potts, 2000) and Francis (Francis, 2001) described a procedure for computing this
statistic. Many of the major data mining packages used for fitting neural networks supply
this statistic or a ranking of variables based on the statistic. Statistical procedures for
testing the significance of variables are not well developed for neural networks. One
approach is to drop the least important variables from the model, one at a time and
evaluate whether the fit deteriorates on a sample of claims that have been held out for
testing. On a large database this approach can be time consuming and inefficient, but it is
feasible on small databases such as the AIB database.
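A minimal sketch of this kind of leave-one-variable-out check appears below: refit the model
without each predictor in turn and measure how much the misclassification error on a held-out
sample increases. A logistic regression stands in for the neural network to keep the example short,
and the three predictors and their effects are simulated placeholders.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Simulated red flag indicators and a target driven mostly by the first two of them
rng = np.random.default_rng(2)
names = ["LEGALREP", "INJ01", "ACC04"]
X = rng.binomial(1, 0.3, size=(2000, 3)).astype(float)
logit = -1.5 + 2.0 * X[:, 0] + 0.8 * X[:, 1] + 0.1 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
base_err = 1 - LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

# Drop each variable in turn; a large error increase signals an important variable
for j, name in enumerate(names):
    keep = [k for k in range(X.shape[1]) if k != j]
    err = 1 - LogisticRegression(max_iter=1000).fit(X_tr[:, keep], y_tr).score(X_te[:, keep], y_te)
    print(f"{name:10s} holdout error increase: {err - base_err:+.3f}")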
Table 8 displays the ranking of variable importance from the MARS model. Table 9
displays the ranking of importance from the neural network model. The final model
fitted by MARS uses only the top 12 variables in importance. These were the variables
that were determined to have made a significant contribution to the final model. Only
variables included in the model, i.e., found to be significant are included in the tables.
Table 8
MARS Ranking of Variables
Rank Variable Description
1 LEGALREP Legal Representation
2 TRTMIS Treatment lag missing
3 ACC04 Single vehicle accident
4 INJ01 Injury consisted of strain or sprain only
5 AGE Claimant age
6 PARTDIS Claimant partially disabled
7 ACC14 Property damage was inconsistent with accident
8 CLT02 Had a history of previous claims
9 POLLAG Policy lag
10 RPTLAG Report lag
11 AMBUL Ambulance charges
12 ACC15 Very minor impact collision
The ranking of variables as determined by applying the sensitivity test to the neural
network model is shown below.
Table 9
Both the MARS and the neural network find the involvement of a lawyer to be the most
important variable in predicting fraud and abuse. Both procedures also rank as second a
missing value on treatment lag. The value on this variable is missing when the claimant
has not been to an outpatient health care provider, although in over 95% of these cases,
the claimant has visited an emergency room.⁹ Note that both medical paid and total paid
for this group is less than one third of the medical paid and total paid for claimants who
visited a provider. Thus the TRTMIS (treatment lag missing) variable appears to be a
surrogate for not using an outpatient provider. The actual lag in obtaining treatment is not
an important variable in either the MARS or neural network models.
BF1 = ( LEGALREP = 1)
BF2 = ( LEGALREP = 2)
BF3 = ( TRTLAG = missing)
BF4 = ( TRTLAG ≠ missing)
BF5 = ( INJ01 = 1) * BF2
BF7 = ( ACC04 = 1) * BF4
BF9 = ( ACC14 = 1)
BF11 = ( PARTDIS = 1) * BF4
BF15 = max(0, AGE - 36) * BF4
BF16 = max(0, 36 - AGE) * BF4
BF18 = max(0, 55 - AMBUL) * BF15
BF20 = max(0, 10 - RPTLAG) * BF4
BF21 = ( CLT02 = 1)
BF23 = POLLAG * BF21
BF24 = ( ACC15 = 1) * BF16
⁹ Because of the strong relationship between a missing value on treatment lag and the dependent variable,
and the high percentage of claims in this category which had emergency room visits, an indicator variable
for emergency room visits was tested as a surrogate. It was found not to be significant.
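To make the listing above concrete, the sketch below evaluates a few of the printed basis functions
for a single hypothetical claim record. The claim values are invented for illustration; the basis
function numbering follows the listing above, where a hinge term such as max(0, AGE - 36) is zero
below the knot and rises linearly above it.

def hinge(u):
    # MARS hinge (truncated linear) function
    return max(0.0, u)

# A hypothetical claim: represented by an attorney, treated 5 days after the accident
claim = {"LEGALREP": 2, "TRTLAG": 5, "AGE": 52, "AMBUL": 0.0,
         "RPTLAG": 3, "CLT02": 1, "POLLAG": 120}

BF1 = 1.0 if claim["LEGALREP"] == 1 else 0.0
BF2 = 1.0 if claim["LEGALREP"] == 2 else 0.0
BF3 = 1.0 if claim["TRTLAG"] is None else 0.0     # treatment lag missing
BF4 = 1.0 - BF3                                   # treatment lag present
BF15 = hinge(claim["AGE"] - 36) * BF4             # age above the knot at 36
BF16 = hinge(36 - claim["AGE"]) * BF4             # age below the knot at 36
BF18 = hinge(55 - claim["AMBUL"]) * BF15          # age and ambulance cost interaction
BF20 = hinge(10 - claim["RPTLAG"]) * BF4          # report lag effect capped at 10 days
BF21 = 1.0 if claim["CLT02"] == 1 else 0.0
BF23 = claim["POLLAG"] * BF21

print(BF1, BF2, BF15, BF18, BF20, BF23)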
Table 10
Description of Categorical Variables
Variable Value Description
LEGALREP 1 No legal representation
2 Has legal representation
INJ01 1 Injury consisted of strain or sprain only
2 Injury did not consist of strain or sprain only
ACC04 1 Single vehicle accident
2 Two or more vehicle accident
ACC14 1 Property damage was inconsistent with accident
2 Property damage was consistent with accident
PARTDIS 1 Partially disabled
2 Not partially disabled
CLT02 1 Had a history of previous claims
2 No history of previous claims
ACC15 1 Was very minor impact collision
2 Was not very minor impact collision
The basis functions and regression produced by MARS assist the analyst in
understanding the impact of the predictor variables on the dependent variable. From the
formulae above, conclusions can be drawn about the contribution of each variable.
Of the red flag variables, small contributions were made by the claimant having a
previous history of a claim¹⁰ and the accident being a minor impact collision. Of the
objective continuous variables obtained from the claim file, variables such as claimant
age, report lag and policy lag have a small impact on predicting fraud or abuse.

¹⁰ This variable only captures history of a prior claim if it was recorded by the insurance company.
Figures 11 and 12 display how MARS modeled the impact of selected continuous
variables on the probability of fraud and abuse. For claims receiving outpatient health
care, report lag has a positive impact on the probability of abuse, but its impact reaches
its maximum value at about 10 days. Note the interaction between claimant age and
ambulance costs displayed in Figure 12. For low ambulance costs, the probability of
abuse rises steeply with claimant age and maintains a relatively high probability except
for the very young and very old claimants. As ambulance costs increase, the probability
of fraud or abuse decreases, and the decrease is more pronounced at lower and higher
ages. Ambulance cost appears to be acting as a surrogate for injury severity.
Figure 11
Contribution of Report Lag to Predicted Probability
For Claims with TRTLAG Missing
[Fitted contribution plotted against report lag, 0 to 60 days]
Figure 12
Surface 1: BF16, Categorical-Ordinal Interaction (TREATLAG_mis)
[MARS surface plot of the claimant age and ambulance cost interaction]
This section on explaining the model illustrates one of the very useful qualities of MARS
compared to neural networks: the output of the model is a formula which describes the
relationships between predictor and dependent variables and which can be used to explain
the model to management. To some extent, the sensitivity measure assists us in
understanding the relationships fit by the neural network model, as it provides a way to
assess the importance of each of the variables to the prediction. However, the actual
functional relationships between independent and dependent variables are not typically
available and the model can be difficult to explain to management.¹¹
Both a MARS model and a neural network model were fit to four samples of the data.
Each time the fitted model was used to predict the probability of fraud or abuse for one
quarter of the data that was held out. The predictions from the four test samples were
then combined to allow comparison of the MARS and neural network procedures.
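The resampling scheme just described can be sketched as follows: split the claims into four folds,
fit on three quarters, score the held-out quarter, and combine the out-of-fold scores. A logistic
regression and simulated data stand in here for the actual models and the AIB claims.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
X = rng.normal(size=(1400, 8))                    # simulated predictors
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0] - 0.5 * X[:, 1])))

# Out-of-fold predictions: every claim is scored by a model that never saw it
oof = np.zeros(len(y))
for train_idx, test_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    oof[test_idx] = model.predict_proba(X[test_idx])[:, 1]

print(f"out-of-fold percent correct: {np.mean((oof > 0.5) == y):.2f}")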
Table 11 presents some results of the analysis. This table presents the R² of the regression
of ASSESS on the predicted value from the model. The table shows that the neural
network R² was higher than that of MARS. The table also displays the percentage of
observations whose values were correctly predicted by the model. The predictions are
based only on the samples of test claims. The neural network model correctly predicted
79% of the test claims, while MARS correctly predicted 77% of the test claims.
Table 11
Four Fold Cross-validation
Technique         R²      Percent Correct
MARS              0.35    0.77
Neural Network    0.39    0.79
Tables 12 and 13 display the accuracy of MARS and the neural network in classifying
fraud and abuse claims.¹² A cutoff point of 50% was used for the classification. That is,
if the model's predicted probability of a 1 on ASSESS exceeded 50%, the claim was
deemed an abuse claim. Thus, those claims in cell Actual = 1 and Predicted = 1 are the
claims assessed by experts as probably abusive which were predicted to be abusive.
Those claims in cell Actual = 1, Predicted = 0, are the claims assessed as probable abuse
claims which were predicted by the model to be legitimate.
Table 12
MARS Predicted * Actual
                 Actual
Predicted        0        1      Total
0              738      160        898
1              157      344        501
Total          895      505
Table 13
Neural Network Predicted * Actual
                 Actual
Predicted        0        1      Total
0              746      127        873
1              149      377        526
Total          895      505
Table 14 presents the sensitivity and specificity of each of the models. The sensitivity is
the percentage of events (in this case suspected abuse claims) that were predicted to be
events. The specificity is the percentage of nonevents (in this case claims believed to be
legitimate) that were predicted to be nonevents. Both of these statistics should be high
for a good model. The table indicates that both the MARS and neural network models
were more accurate in predicting nonevent or legitimate claims. The neural network
model had a higher sensitivity than the MARS model, but both were approximately equal
in their specificities. The neural network's higher overall accuracy appears to be a result
of its greater accuracy in predicting the suspected fraud and abuse claims. Note that the
sensitivity and specificity measures are dependent on the choice of a cutoff value. Thus,
if a cutoff lower than 50% were selected, more abuse claims would be accurately
predicted and fewer legitimate claims would be accurately predicted.
Table 14
Model Sensitivity Specificity
MARS 68.3 82.5
Neural Network 74.8 83.4
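The sensitivity and specificity calculations behind Table 14, and their dependence on the chosen
cutoff, can be sketched as follows. The score and assessment vectors are simulated placeholders
rather than the actual model output, so the printed figures are illustrative only.

import numpy as np

rng = np.random.default_rng(4)
actual = rng.binomial(1, 0.36, size=1400)                       # 1 = suspected abuse
score = np.clip(0.25 + 0.45 * actual + rng.normal(0, 0.2, 1400), 0, 1)

def sensitivity_specificity(score, actual, cutoff):
    pred = (score >= cutoff).astype(int)
    sens = np.mean(pred[actual == 1] == 1)     # events predicted as events
    spec = np.mean(pred[actual == 0] == 0)     # nonevents predicted as nonevents
    return sens, spec

# Lowering the cutoff raises sensitivity at the expense of specificity
for cutoff in (0.3, 0.5, 0.7):
    sens, spec = sensitivity_specificity(score, actual, cutoff)
    print(f"cutoff {cutoff:.1f}: sensitivity {sens:.2f}, specificity {spec:.2f}")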
A common procedure for visualizing the accuracy of models used for classification is the
receiver operating characteristic (ROC) curve. This is a curve of sensitivity versus
specificity (or, more accurately, 1.0 minus the specificity) over a range of cutoff points.
When the cutoff point is very high (i.e., 1.0), all claims are classified as legitimate. The
specificity is 100% (1.0 minus the specificity is 0), but the sensitivity is 0%. As the
cutoff point is lowered, the sensitivity increases, but so does 1.0 minus the specificity.
Ultimately a point is reached where all claims are predicted to be events, and the
specificity declines to zero. The baseline ROC curve (where no model is used) can be
thought of as a straight line from the origin with a 45-degree angle. If the model's
sensitivity increases faster than the specificity decreases, the curve "lifts" or rises above a
45-degree line quickly. The higher the "lift", the more accurate the model. It can be seen
from the graph of the ROC curve that both the MARS and neural network models have
significant "lift", but the neural network model has more "lift" than the MARS model.
Figure 13
[ROC curves for the MARS and neural network models: sensitivity plotted against 1.0 minus specificity, with the 45-degree baseline]
Table 15
Statistics for Area Under the ROC Curve
Test Result Variable    Area    Std Error    Asymptotic Sig.    Lower 95% Bound    Upper 95% Bound
MARS Probability        0.85    0.01         0.000              0.834              0.873
Neural Probability      0.88    0.01         0.000              0.857              0.893
Summary of Comparison
The ROC curve results suggest that in this analysis the neural network enjoyed a modest
though not statistically significant advantage over MARS in predictive accuracy. It
should be noted that the database used for this study was quite small for a data mining
application and may produce results that do not generalize to larger applications.
Steinberg (Steinberg, 2001) reports that on other applications MARS equaled or exceeded
the performance of neural networks. It should also be noted that some of the key
comparative strengths of MARS such as its ability to handle missing data were not a
significant factor in the analysis, as all but one of the variables were fully populated.¹³
In addition, MARS's capability of clustering levels of categorical variables together was
not relevant to this analysis, as no categorical variable had more than two levels.
A practical advantage that MARS enjoys over neural networks is the ease with which
results can be explained to management. Thus, one potential use for MARS is to fit a
model using neural networks and then apply MARS to the fitted values to understand the
functional relationships fitted by the neural network model. The results of such an
exercise are shown below:
BF1 = ( LEGALREP = 1)
BF2 = ( LEGALREP = 2)
BF3 = ( TRTLAG ≠ missing)
BF4 = ( TRTLAG = missing)
BF5 = ( INJ01 = 1)
BF7 = ( ACC04 = 1) * BF3
BF8 = ( ACC04 = 2) * BF3
BF9 = ( PARTDIS = 1) * BF8
BF11 = max(0, AMBUL - 182) * BF2
BF12 = max(0, 182 - AMBUL) * BF2
BF13 = ( ACC14 = 1) * BF3
BF15 = ( CLT02 = 1) * BF3
BF17 = max(0, POLLAG - 21) * BF3
BF19 = max(0, AGE - 41) * BF3
BF20 = max(0, 41 - AGE) * BF3
¹³ One of the claims was missing data on the AGE variable, and this claim was eliminated from the neural
network analysis and from comparisons of MARS and the neural network model. Had more claims been
missing the AGE variable, we would have modeled it in the neural network.
BF21 = ( INS06 = 1)
BF23 = max(0, RPTLAG - 24) * BF8
BF24 = max(0, 24 - RPTLAG) * BF8
BF25 = BF1 * BF4
BF27 = ( ACC15 = 1) * BF8
BF29 = ( INJ03 = 1) * BF2
Y = 0.098 - 0.272 * BF1 + 0.334 * BF3 + 0.123 * BF5 - 0.205 * BF7 + 0.145 * BF9
    - .623E-04 * BF11 + .455E-03 * BF12 + 0.258 * BF13 + 0.100 * BF15 + ...
This model had an R² of 0.9. Thus, it was able to explain most of the variability in the
neural network fitted model. Though the sensitivity test revealed that LEGALREP is the
most significant variable in the neural network model, its functional relationship to the
probability of fraud is unknown using standard neural network modeling techniques. As
interpreted by MARS, the absence of legal representation reduces the probability of fraud
by 0.272, even without interacting with other variables. LEGALREP also interacts with
the ambulance cost variable, INJ03 (police report shows no injury) and no use of a health
care provider (treatment lag missing). The sensitivity measure indicated that the presence
or absence of a value for treatment lag was the second most important variable. As stated
earlier, this variable can be viewed as a surrogate for use of an outpatient health care
provider. The use of an outpatient health care provider (TRTLAG ≠ missing) adds 0.334
to the probability of fraud or abuse, but this variable also interacts with the policy lag,
report lag, claimant age, partial disability, ACC04 (single vehicle accident), ACC14
(property damage inconsistent with accident) and CLT02 (history of prior claims).
The MARS model helps the user understand not only the nonlinear relationships
uncovered by the neural network model, but also the interactions which were fit
by the neural network.
A procedure frequently used by data mining practitioners when two or more approaches
are considered appropriate for an application is to construct a hybrid model or average the
results of the modeling procedures. This approach has been reported to reduce the
variance of the prediction (Salford Systems, 1999). Table 16 displays the AUROC
statistics resulting from averaging the results of the MARS and neural network models.
The table indicates that the performance of the hybrid model is about equal to the
performance of the neural network. (The graph including the ROC curve for the
combined model is not shown, as the curve is identical to Figure 13 because the neural
network and combined curves cannot be distinguished.) Salford Systems (Salford
Systems, 1999) reports that the accuracy of hybrid models often exceeds that of their
components, but usually at least equals that of the best model. Thus, hybrid models that
combine the results of two techniques may be preferred to single technique models
because uncertainty about the accuracy of the predicted values on non-sample data is
reduced.
Table 16
Statistics for Area Under the ROC Curve
Test Result Variable     Area     Std Error    Asymptotic Sig.    Lower 95% Bound    Upper 95% Bound
MARS Probability         0.853    0.01         0.000              0.834              0.873
Neural Probability       0.875    0.01         0.000              0.857              0.893
Combined Probability     0.874    0.01         0.000              0.857              0.892
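A sketch of the hybrid calculation is given below: average the two sets of predicted probabilities
and compare areas under the ROC curve, as in Table 16. The score vectors are simulated stand-ins
for the MARS and neural network output, so the printed values will not match the table.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)
actual = rng.binomial(1, 0.36, size=1400)
score_mars = np.clip(0.30 + 0.35 * actual + rng.normal(0, 0.22, 1400), 0, 1)
score_nnet = np.clip(0.30 + 0.40 * actual + rng.normal(0, 0.22, 1400), 0, 1)
score_hybrid = (score_mars + score_nnet) / 2          # simple average of the two models

for name, s in [("MARS", score_mars), ("Neural", score_nnet), ("Hybrid", score_hybrid)]:
    print(f"{name:7s} area under ROC curve = {roc_auc_score(actual, s):.3f}")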
The results of both the MARS and neural network analysis suggest that both claim file
variables (present in most claims databases) and red flag variables (common wisdom
about which variables are associated with fraud) are useful predictors of fraud and abuse.
However, this and other studies support the value of using analytical tools for identifying
potentially abusive claims. As pointed out by Derrig (Derrig, 2002), fraud models can
help insurers sort claims into categories related to the need for additional resources to
settle the claim efficiently. For instance, claims assigned a low score by a fraud and
abuse model can be settled quickly with little investigative effort on the part of adjusters.
Insurers may apply increasingly greater resources to claims with higher scores to acquire
additional information about the claimant/policyholder/provider and mitigate the total
cost of the claim. Thus, the use of a fraud model is not conceived as an all or nothing
exercise that classifies a claim as fraudulent or legitimate, but a graduated effort of
applying increasing resources to claims where there appears to be a higher likelihood of
material financial benefit from the expenditures.
Conclusion
This paper has introduced the MARS technique and compared it to neural networks.
Each technique has advantages and disadvantages and the needs of a particular
application will determine which technique is most appropriate.
One of the strengths of neural networks is their ability to model highly nonlinear data.
MARS was shown to produce results similar to neural networks in modeling a nonlinear
function. MARS was also shown to be effective at modeling interactions, another
strength of neural networks.
In dealing with nominal level variables, MARS is able to cluster together the categories
of the variables that have similar effects on the dependent variable. This is a capability
not possessed by neural networks that is extremely useful when the data contain
categorical variables with many levels such as ICD9 code.
MARS has automated capabilities for handling missing data, a common feature of large
databases. Though missing data can be modeled with neural networks using indicator
variables, automated procedures for creating such variables are not available in most
standard commercial software for fitting neural networks. Moreover, since MARS can
create interaction variables from missing variable basis functions and other variables, it
can create surrogates for the missing variables. Thus, on applications using data with
missing values on many variables, or data where the categorical variables have many
values, one may want to at least preprocess the data with MARS to create basis functions
for the missing data and categorical variables which can be used in other procedures.
A significant disadvantage of neural networks is that they are a "black box". The
functions fit by neural networks are difficult for the analyst to understand and difficult to
explain to management. One of the very useful features of MARS is that it produces a
regression like function that can be used to understand and explain the model; therefore it
may be preferred to neural networks when ease of explanation rather than predictive
accuracy is required. MARS can also be used to understand the relationships fit by other
models. In one example in this paper MARS was applied to the values fit by a neural
network to uncover the important functional relationships modeled by the neural network.
Neural networks are often selected for applications because of their predictive accuracy.
In a fraud modeling application examined in this paper the neural network outperformed
MARS, though the results were not statistically significant. The results were obtained on
a relatively small database and may not generalize to other databases. In addition, the
work of other researchers suggests that MARS performs well compared to neural
networks. However, neural networks are highly regarded for their predictive capabilities.
When predictive accuracy is a key concern, the analyst may choose neural networks
rather than MARS when neural networks significantly outperform MARS. An alternative
approach that has been shown to improve predictive accuracy is to combine the results of
two techniques, such as MARS and neural networks, into a hybrid model.
This analysis and those of other researchers support the use of intelligent techniques for
modeling fraud and abuse. The use of an analytical approach can improve the
performance of fraud detection procedures that utilize red flag variables or subjective
claim department rules by 1) determining which variables are really important in
predicting fraud, 2) assigning an appropriate weight to the variables when using them to
predict fraud or abuse, and 3) using the claim file and red flag variables in a consistent
manner across adjusters and claims.
References
Berry, Michael J. A., and Linoff, Gordon, Data Mining Techniques, John Wiley and
Sons, 1997
Brockett, Patrick L., Xianhua Xia and Richard A. Derrig, 1998, "Using Kohonen's Self-
Organizing Feature Map to Uncover Automobile Bodily Injury Claims Fraud", Journal of
Risk and Insurance, June, 65:2.
Brockett, Patrick L., Richard A. Derrig, Linda L. Golden, Arnold Levine and Mark
Alpert, "Fraud Classification Using Principal Component Analysis of RIDITs", Journal
of Risk and Insurance, September, 2002, pp. 341-371.
Dhar, Vasant and Stein, Roger, Seven Methods for Transforming Corporate Data Into
Business Intelligence, Prentice Hall, 1997
Derrig, Richard A., Herbert I. Weisberg and Xiu Chen, 1994, "Behavioral Factors and
Lotteries Under No-Fault with a Monetary Threshold: A Study of Massachusetts
Automobile Claims", Journal of Risk and Insurance, June, 1994, 61:2: 245-275.
Derrig, Richard, "Patterns, Fighting Fraud With Data", Contingencies, Sept/Oct, 1999,
pp. 40-49.
Derrig, Richard A., and Valerie Zicko, "Prosecuting Insurance Fraud: A Case Study of
The Massachusetts Experience in the 1990s", Risk Management and Insurance Research,
2002.
Derrig, Richard, "Insurance Fraud", Journal of Risk and Insurance, September, 2002, pp.
271-287
Freedman, Roy S., Klein, Robert A. and Lederman, Jess, Artificial Intelligence in the
Capital Markets, Probus Publishers, 1995
Hastie, Trevor, Tibshirani, Robert, Generalized Additive Models, Chapman and Hall,
1990
Hastie, Trevor, Tibshirani, Robert and Friedman, Jerome, The Elements of Statistical
Learning: Data Mining, Inference and Prediction, Springer, 2001
Hayward, Gregory, "Mining Insurance Data to Promote Traffic Safety and Better Match
Rates to Risk", Casualty Actuarial Society Forum, Winter 2002, pp. 31-56.
Hosmer, David W. and Lemeshow, Stanley, Applied Logistic Regression, John Wiley and
Sons, 1989
Little, Roderick and Rubin, Donald, Statistical Analysis with Missing Data, John Wiley
and Sons, 1987
Marsh, Lawrence and Cormier, David, Spline Regression Models, Sage Publications,
2002
Martin, E. B. and Morris A. J., "Artificial Neural Networks and Multivariate Statistics",
in Statistics and Neural Networks: Advances at the Interface, Oxford University Press,
1999, pp. 195 - 292
Miller, Robert and Wichern, Dean, Intermediate Business Statistics, Holt, Rinehart and
Winston, 1977
Plate, Tony A., Bert, Joel, and Band, Pierre, "Visualizing the Function Computed by a
Feedforward Neural Network", Neural Computation, June 2000, pp. 1337-1353.
Potts, William J.E., Neural Network Modeling: Course Notes, SAS Institute, 2000
Salford Systems, "Data mining with Decision Trees: Advanced CART Techniques",
Notes from Course, 1999
Speights, David B, Brodsky, Joel B., Chudova, Durya L., "Using Neural Networks to
Predict Claim Duration in the Presence of Right Censoring and Covariates", Casualty
Actuarial Society Forum, Winter 1999, pp. 255-278.
Steinberg, Dan, "An Introduction to MARS", Salford Systems, 1999
Venables, W.N. and Ripley, B.D., Modern Applied Statistics with S-PLUS, third edition,
Springer, 1999
Viaene, Stijn, Derrig, Richard, Baesens, Bart and Dedene, Guido, "A Comparison of
State-of-the-Art Classification Techniques for Expert Automobile Insurance Fraud
Detection", Journal of Risk and Insurance, September, 2002, pp. 373-421
Weisberg, Herbert I., and Richard A. Derrig, "Fraud and Automobile Insurance: A
Report on the Baseline Study of Bodily Injury Claims in Massachusetts", Journal of
Insurance Regulation, June, 1991, 9:4: 497-541.
Weisberg, Herbert I., and Richard A. Derrig, "Quantitative Methods for Detecting
Fraudulent Automobile Injury Claims", AIB Cost Containment/FraudFiling (DOI
Docket R95-12), Automobile Insurers Bureau of Massachusetts, July,1993, 49-82.
Weisberg, Herbert I., and Richard A. Derrig, "Massachusetts Automobile Bodily Injury
Tort Reform", Journal oflnsurance Regulation, Spring, 1992, 10:3:384-440.
Rainy Day:
Actuarial Software and Disaster Recovery
Rainy Day:
Actuarial Software and Disaster Recovery
Aleksey S. Popelyukhin, Ph.D.
Abstract
Tragic events with disastrous consequences happening all around the globe have made
Disaster Recovery and Continuity Planning a much higher priority for every company.
Scenarios in which data centers, paper documents and even recovery specialists themselves may
perish have become more probable.
Both actuarial workflow and actuarial software design should be affected by the disaster recovery
strategy. Actuaries may simplify the recovery task and ensure a higher rate of success if they properly
modify their applications' architecture and their approaches to documenting algorithms and
storing structured data.
The article attempts to direct actuaries to strategies that may increase the chances of complete
recovery: from separation of data and algorithms, to effective storage of actuarial objects, to
automated version management and self-documenting techniques.
Rainy Day:
Actuarial Software and Disaster Recovery
Aleksey S. Popelyukhin, Ph.D.
Failed Assumptions
Presumably, every insurance company has a backup system. Files, databases and documents are
copied to tapes or CDs and stored offsite. This gives protection against hard disk failure and rogue
viruses, and provides an audit trail.
Many of the existing backup solutions, however, are built on the assumptions that after disaster
strikes, restoration will be performed by the same personnel on the same (compatible)
hardware/software system. As the events of September 11th painfully demonstrated, these
assumptions may not exactly hold true.
Consequently, disaster recovery and business continuity plans have to address them.
A company's tapes stored offsite may survive a disastrous event, but it does not mean they can be
used effectively for the restoration. It may not be immediately clear how to perform a restoration:
on what hardware with what backup software and in which order. It may also happen that the
backup/restore software is so old it requires an older Operating System (OS) not available
anymore. It may not be evident how to reinstall software without the manual and a license key.
Moreover, there might not be anyone who remembers where to restore, what to restore and in
which order.
Sure, it is not up to actuaries to perform the restoration tasks, but it is in their best interests to
make sure their software is part of the restoration effort (including installation disks, manuals and
licenses) and that they do everything possible to simplify that effort.
Restoration Priorities
Any BIA (Business Impact Analysis) study will assign very low priority to the restoration of an
Actuarial subsystem. Indeed, experience shows (see [1]) that the most important service for
business continuity is communications.
Fig 1 (see [2])
Experience shows that restoration priorities start with e-mail and end with the actuarial
subsystem:
• e-mail/communications
• accounting
• payroll
• trade/marketing
• underwriting
• claims
• actuarial applications
There is nothing wrong with that picture: it just means that actuaries have to be ready to perform
some or all restoration tasks by themselves and not wait for IT department help. It also implies
that actuaries would be much better off if their applications were easy to restore or reproduce
even if some knowledgeable personnel were not available.
Forced upgrading
The implicit assumption behind the majority of existing recovery plans is that restoration would
be performed on the same version of hardware/OS combo that was used for backup. Or (given the
long upgrade cycles of the recent past) quite similar and compatible versions. Not anymore.
Every major OS upgrade may render backup/restore software useless; every advance in drive
technology may make backup tapes unreadable. Skip a couple of upgrade cycles and you may fail to
find the appropriate drive to read your tapes and have no software to recognize the recording format.
And if the company's computers are destroyed, the company may be forced to upgrade.
Thus, downloading patches for backup software should be done as vigorously as downloading for
anti-virus and security purposes. It is also crucial to monitor availability of the tape drives
backward-compatible with existing tapes.
The same forced upgrading trap may occur with actuarial software. During restoration, one may
discover that new computers come only with the newer OS, Utilities and Spreadsheet versions,
which are not necessarily compatible with existing files. Imagine if one had to read VisiCalc or
WordStar files today. Nobody guarantees that Oracle 7 will run on Windows XP or that Excel
will properly interpret that old trustworthy *.wk3 file. It is even more of a problem for third-party
proprietary software. It has to be maintained compatible with the latest OS, compiler, hardware
key protection software and, possibly, a spreadsheet or a database: quite a formidable task.
Sales data from the suppliers of actuarial software imply that actuaries heavily rely on "shrink-
wrapped" applications from the third parties. Development, distribution and compatibility of
these applications are controlled by their vendors. Yet, disaster recovery cannot be completed
without restoring full functionality of these programs. Actuaries cannot do much about these
applications except to make sure that they can be restored.
Actuaries may require that their license agreement include contractual obligations from the
supplier for:
• Adequate code base protection, and
• Technology Assurance.
Adequate code base protection should include measures taken by the vendor to protect the
application code with backups and offsite escrow. In addition, the vendor has to guarantee access
to the code in case it goes out of business or can no longer support an application.
Technology Assurance is a fancy name for the continuous compatibility upgrades and patches
that would guarantee application compatibility with the ever-changing software environment.
Vendors should make sufficient effort to maintain their applications capable of running under the
latest OS and interoperating with the latest spreadsheet or the underlying database.
Hardware keys
Actuaries also have to clarify a procedure for restoring third-party applications that utilize
hardware-key protection schemes. In a plausible disaster scenario, hardware keys may cease to
exist rendering an application useless. In that case
• Does the License Agreement provide for replacement keys?
• Can the vendor deliver replacement keys from Australia (England, Connecticut) fast?
• Does the vendor provide a downloadable, temporarily unprotected version?
All these questions have to be answered before the disaster strikes: this way, actuaries can avoid a
few unpleasant surprises during restoration.
In-house development
Programming cycle
Aside from analysis, actuaries perform some activities that closely resemble software
development. Indeed, no matter what computer language they are using (Lotus, PL/1, APL,
Mathematica, VBA), they are programming. Thus, as programmers, they have to conform to
development cycle routines established in a programming world. Documentation, versioning,
testing, debugging - these activities are well studied, and even automated.
Both actuarial workflow and software design should be affected by the disaster recovery strategy.
Actuaries have to design their applications in such a way that somebody else other than the
designers can understand the spreadsheet, the code and the logic.
A spreadsheet is a very popular actuarial tool. It is so versatile: it can be used as a database and as
a calculation engine, as an exchange format and as a report generator, as a programming
environment and as a rich front-end to the Internet. Actuaries use spreadsheets in all these aspects;
the problem, though, arises when they use multiple features in one file. More precisely, when
they use a single spreadsheet file as the engine for calculations and as the storage for results of
these calculations, creating multiple copies of the engine.
Actuaries do realize that input data like loss triangles, premium vectors and industry factors
come from outside and do not belong to their calculation template. What they rarely realize (or
don't realize at all) is that output results such as predicted ultimates or fitted distribution
parameters do not belong to the template either, and that they (results) have to be stored outside
just like input data.
Fig 2
Usually, it is not the case: actuaries routinely create 72 files with the same algorithm for 72 lines
of business rather than keeping one file and storing 72 answers separately. This strategy creates an
obstacle for effective:
• debugging/versioning,
• modifications/improvements,
• integrity/security,
• reporting and
• multi-user access.
Indeed, correcting an error in one file takes 72 times less time than correcting the same error in 72
copies of that file. Extracting answers for reports from 72 files requires much more effort than
summarizing 72 records in the database. And it is much harder to guarantee that nobody modified
the 57th file incorrectly.
From a recovery standpoint, restoring a single file with formulas and separate data records is
definitely easier than restoring 72 files with commingled data and formulae, especially given that
the probability of a corrupted file is 72 times greater for 72 files.
It would be wise for actuaries to modify their workflow and spreadsheet design in order to
separate data and algorithms. Rethinking their methodology in this light, actuaries inevitably will
arrive at the idea to store data along with some kind of description, that is, to treat data as
"Actuarial Objects" (see [3]).
The logical extension of this idea would be to modify a calculation template in such a way that it
understands these descriptions and acts upon them, serving like a traffic cop for the data objects.
In other words, build an engine for objects processing (see [4]).
This architecture would streamline actuarial workflow, encourage debugging and modifications,
simplify reporting and enormously improve the chances of a successful recovery.
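As a sketch of the "one engine, many stored results" idea, the fragment below appends each run's
output, together with descriptive metadata, to a single results file instead of saving a new copy of
the calculation template for every line of business. It is written in Python for brevity; the file name,
fields and figures are hypothetical, and the same pattern can equally be implemented in VBA
writing to a database table.

import csv
from datetime import date

def store_result(path, lob, method, engine_version, ultimate):
    # Append one result record (an "actuarial object") to a shared results file
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([date.today().isoformat(), lob, method,
                                engine_version, round(ultimate)])

# One calculation engine, many stored answers -- not 72 copies of the engine
store_result("results.csv", "Commercial Auto", "Chain Ladder", "4.1.3", 1234567.0)
store_result("results.csv", "Homeowners", "Chain Ladder", "4.1.3", 987654.0)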
If (in addition to all its functionality) Microsoft Excel incorporated simulations, there would be
only one actuarial software program: Excel. And, thus, no worries about file formats, conversions
and availability of the file reader programs. Fortunately or unfortunately, this is not the case, and
actuaries have to rely on applications with different file formats. The problem is that, as time
passes, it will be harder and harder to find a reader program for some obscure and proprietary
files. So, for the sake of disaster recovery actuaries should rely on the most ubiquitous file
formats.
In the foreseeable future one can count on the ability to read ASCII (including XML and HTML),
*.xls and *.doc files. Perhaps, *.dbf and *.pdf readers will be easily available too. Consequently,
it is always a good idea to store an extra copy of the most crucial data in one of the
aforementioned formats.
For example, SQL Server and Oracle tables can be dumped into ASCII*. Microsoft Access can
import/export tables in *.xls or *.dbf formats. Excel files can be seamlessly converted to XML
(structured ASCII). And, unless there are trade-secret considerations, it is always a good idea to
export VBA modules to *.bas text files.
From a disaster recovery standpoint, it is important that third-party software has the capability to
read and write to one of the ubiquitous formats (a side benefit of that capability would be the
possibility of data exchange and integration with other software programs).
* In fact, the whole database can be restored from ASCII files by running SQL scripts (ASCII) that recreate
database structure and loading tables' content from the dump file (ASCII).
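As an illustration of keeping crucial data in a ubiquitous format, the sketch below dumps a database
table to plain ASCII using Python's built-in sqlite3 and csv modules. The database and table names
are hypothetical; the same idea applies to the SQL Server and Oracle dumps mentioned above.

import csv
import sqlite3

con = sqlite3.connect("reserves.db")                    # assumed database
cur = con.execute("SELECT * FROM ultimates")            # assumed table
headers = [col[0] for col in cur.description]

# Data as ASCII: a CSV copy that any future spreadsheet or database can read
with open("ultimates_dump.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(cur.fetchall())

# Structure and content as ASCII SQL, ready to recreate and reload the table
with open("ultimates_dump.sql", "w") as f:
    for statement in con.iterdump():
        f.write(statement + "\n")
con.close()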
Version management
If a calculation algorithm is used (or is going to be used) more than once, it needs versioning.
Indeed, if a "separation of data and algorithms" paradigm is embraced and implemented, it
becomes quite practical and useful to maintain a version of the algorithm (in case of Excel: a
version of template used for calculations).
The usefulness becomes obvious once one considers saving the version of the calculation engine
along with the results of calculation. Doing so helps immensely in audit trailing, debugging and, of
course, recovery.
The practicality derives from the fact that (presumably) the number of calculation
engines/templates is limited (usually the same algorithm can be reused for analysis of multiple
contracts, LOBs and products). So maintaining version information for a few files is not an
overburdening chore.
Microsoft Office applications provide adequate facilities for versioning: Word automatically
updates "Revision number" (File/Properties/Statistics) and Excel allows custom properties to be
linked to cells inside the spreadsheet.
Fig 4
[Screenshot: the custom file properties dialog, with a Version property ("Chain Ladder", "4.1.3") linked to a cell and entries such as "Modified by" and "Checked by"]
If the user dedicates a cell in a spreadsheet to store version info and adds one line of VBA
code to the Workbook_BeforeSave event, he gets a "poor man's" versioning mechanism for
free.
Private Sub Workbook_BeforeSave(ByVal SaveAsUI As Boolean, Cancel As Boolean)
    Range("Version").Value = InputBox("Version number: ", "Properties", Range("Version"))
End Sub
Fig 5
If there is a necessity to synchronize the version number through several files, the cell linked to
the custom property can contain a formula referring to the information in the other (main)
template.
Using file properties for versioning (and, possibly, other information*) has some nice side
benefits: one can use them for targeted file searches (File/Open/Tools/Search/Advanced).
Fig 6
[Screenshot: the Advanced File Search pane (File/Open/Tools/Search/Advanced), searching on a text or property condition such as the Template property]
* It is always a good idea to dedicate an area (with named cells) in a spreadsheet to meta-information about
it and link these cells to the Custom Properties like "Created by", "Last Modified by", "Verified by", etc...
More important, however, is the fact that Version info can be stored alongside the Results
generated by the template, providing a perfect means for audit tracking.
Documentation
Knowledge about available documentation features and familiarity with restoration techniques
may help actuaries to design their software in order to greatly simplify potential recovery efforts.
Usually, big nice printed manuals and an interactive online help system* are reserved for very
large projects only. It is unreasonable to expect an actuary to write a manuscript for every Excel
spreadsheet he creates in a hurry. Nevertheless, several simple approaches can be employed to
greatly simplify restoration tasks:
• Self-documenting,
• Excel Comments,
• Code Remarks.
Self-documenting features
Fig 7
[Screenshot: the Microsoft Access Documenter dialog]
* Several packages on the market, most notably RoboHelp from eHelp, can convert Microsoft Word file(s)
into a full-featured interactive help system, either Windows or HTML based.
Documenter generates Notepad or Excel files with the information about any object in a database
with as many details as necessary.
Microsoft Access also provides facility for fields' descriptions (along with tables, queries, forms
and reports descriptions): it would be unwise not to use it.
Fig 8
[Screenshot: field descriptions in Microsoft Access table design]
Excel is, by its very nature, self-documenting: by clicking on a cell the user can see the value in the
cell and the formula behind it in the Formula Bar. To view formulae in multiple cells one can open a
second window of the same spreadsheet (Window/New Window) and switch its mode to Formula
View (Tools/Options/View/Formulas, or just press CTRL+`).
Fig 9
In recognition that building models in Excel is, essentially, some kind of programming, Microsoft
added a quintessential debugging tool to Excel: Watch Window (Tools/Formula Auditing/Show
Watch Window, or right-click on a cell/Add Watch). The Watch Window allows the user to track values
and simultaneously see formulas of multiple cells located anywhere in a spreadsheet. The tool's
value is not only in debugging, but also in ad-hoc goal seeking and audit trailing. Accompanying
it is the step-by-step Formula Evaluation tool (Tools/Formula Auditing/Evaluate Formula), which
used to be the Excel 4 macro debugging instrument.
Sure, cells are not the only place for formulae and settings: PivotTables, the Solver add-in, Links,
External Data ranges, Web Queries and even Conditional Formatting contain important
information which could be crucial for understanding the functionality of an algorithm.
To view the query behind the External Data range, right-click on it and select the Edit Query item
from the menu (or choose Data/Get External Data/Edit Query). The same procedure works for
Web Queries. What's important is that in both cases Excel provides an option to save a query as a
text file (*.dqy in case of Data Queries and *.iqy for Internet Queries). It is highly recommended
to do so. The benefit is threefold: a) queries get documented, b) it is easier to modify them in this
text form and c) it is so easy to execute them - just by double-clicking on a *.dqy or *.iqy file.
Fig 10
[Screenshot: a Web Query retrieving Treasury yield data from bonds.yahoo.com, shown alongside the saved *.iqy query settings]
Solver* is a very popular goal-seeking tool and, thankfully, Excel preserves Solver settings, but
only one set per worksheet. This means that if Solver is used multiple times on the same sheet, it is a
good practice to save its settings (Tools/Solver/Options/Save Model) in a descriptively labeled
area.
Important settings are stored in Conditionally Formatted as well as Data Validated cells. To see
validation settings and format conditions, navigate to these cells using the Edit/Go To/Special dialog
box.
* To access Solver select Tools/Solver from the Excel menu. If Solver is not listed in the menu, check
whether it's installed (run Office install) and/or enabled (checkmark in Tools/Add-ins).
Fig 11
Most users use Names (Insert/Name/Define/Add) only for naming ranges. However, Excel allows
giving a Name to any formula, even a User-Defined one. To display the list of Names with their
definitions, press F3/Paste List. Names in Excel are too important, and too convenient a tool for
documenting a spreadsheet, to be ignored. In the ideal world, there should be no unnamed
references in Excel formulas: every variable, input region and output location has to be named.
Good naming conventions, along with the habit of naming ranges and cells, may prove invaluable
not only for disaster recovery, but for debugging, modifications and education of new
employees.
Excel's creators believe in named references so much that they actually supply Names even if the user
didn't define any. Since version 97, users can use column and row labels as if they had created
range names for rows and columns (since Office XP, the same syntax works for PivotTables).
To enable this functionality, check the Tools/Options/Calculation/Accept Labels in Formulas
option.
Fig 12
Structured Comments
Excel comments, if used creatively, represent an amazingly powerful tool. Available through the
Reviewing toolbar, comments can be toggled (by moving the mouse over a commented cell) or
displayed permanently (Show Comment). They can be printed "in place" or as footnotes
(File/Page Setup/Sheet/Comments dropdown). Moreover, comments (as in any programming
environment) are invaluable for documenting the designer's intentions and understanding an
algorithm's logic.
Comments can be used for storing structured attributes of an object a la XML (see [4]):
Auto backup copy
To increase the chances of recovery of the most important Excel files, it is wise to enable a built-
in facility for the automatic creation of backup copies. By launching the "Save Options" dialog
(File/Save As/Tools/General Options) and choosing the "Always create backup" option, the user can be
assured that every time he saves the file an extra copy with the extension XLK is generated.
Still, for occurrences when files are corrupted or incompletely restored from the tapes, Excel
2002 has beefed up its file repair utility. Available through the File/Open/Open dropdown/Open
and Repair menu item, the utility does a formidable job in recovering corrupted files.
Fig 14
In the rare case when an attempt to repair fails, as a last resort one can try to paste the content of
the corrupted file into a new spreadsheet (see [6]). To do that, open two new files, select cell A1
and copy it to the clipboard, switch to the second file and, after pasting the link (Edit/Paste
Special/Paste Link), change the link to the corrupted file. In most cases, Excel allows the user to
access this way as much content as it could recover. The rest of the file (VBA modules, External
Data queries and Pivot Table cubes) can be imported from the ASCII files.
Documenting Workflow
There are many ways to document workflow. The most natural and powerful, though, is to use
"smart diagramming" software like Microsoft Visio or Micrografx iGrafx by Corel. In addition to
their ability to document, analyze and simulate workflow, these packages (empowered by VBA)
may execute some actions automatically.
Fig 15
Workflow diagrams - as important as they are for disaster recovery - provide an additional benefit as
a way to look at the actuarial process as a whole, and possibly to streamline and simplify it.
Telecommute
Telecommuters present yet another challenge for flawless disaster recovery. A whole additional
layer of network subsystems (terminal services, VPN access, firewall) has to be restored in order
to enable their access to the company's applications. Home and mobile computers and devices
represent an additional hazard for security and maintenance.
However, provided that security, connectivity, maintenance and support issues are solved, home
computers will become a decentralized, independent, distributed file storage system: an additional
chance to restore a copy of that most important lost file. Also, if configured accordingly, remote
computers may serve as a temporary replacement system until restoration of the main system is
complete. Indeed, many applications can be scaled to work on a standalone machine: major
databases have compatible "personal" versions, while Office and many third-party actuarial
applications are "personal" by nature. Synchronization with the "main" system can
possibly be achieved via import/export to/from ubiquitous file formats.
Paperless Office
If the backup is set up and working smoothly, and files are copied to tapes and stored offsite, then
there is no excuse not to scan every paper document, greatly reducing the risk of losing it. Indeed,
with advances in scanning quality and OCR (optical character recognition) accuracy, it makes
perfect sense to convert all paper documents into computer-readable files. The ubiquitous PDF
(portable document format) file preserves the look of the original, while at the same time enabling
index, catalog and search services to scan through its content as if it were a simple text file. Even
Internet search engines are now PDF-enabled, so Internet search queries are capable of looking
for information inside PDF files. Thus, scanned paper documents can be organized into a useful
searchable hierarchical "knowledge base" instead of lying in some storage boxes, being hard to
find and, probably, unused.
Once again, an action geared toward better disaster protection may turn out to have a great side
benefit, perhaps even greater than the initial purpose of the action.
Conclusion
Any type of action - from a big radical change of architecture in order to "separate data from
algorithms" to a small conventional "enabling of auto-backups" in Excel - is better than no action.
Besides, all the aforementioned recommendations help not only in the case of a devastating disaster,
but also in the event of a virus attack, malicious user actions, and staff rotation. In fact, benefits
from such preparation measures as
• clear documentation of actuarial procedures and
• streamlined algorithms
may far outweigh the potential payback from the original objective of disaster preparedness.
These measures are more than worthy by themselves. Surely, the cost of precautions should not
exceed estimated damages. However, side benefits such as audit trail capabilities, design
discipline and improved understanding of calculations can easily justify disaster preparedness
efforts.
Dedication
To Giya Aivazov and all friends and colleagues affected by September 11.
Stamford, 2001
Bibliography
[1] Annlee Hines. Planning for Survivable Networks: Ensuring Business Continuity. 2002, Wiley
Publishing
[2] http://csrc.nist.gov/publications/nistpubs/800-34/sp800-34.pdf
[3] Aleksey S. Popelyukhin. The Big Picture: Actuarial Process from the Data Processing Point
of View. 1996, Library of Congress
[4] Aleksey S. Popelyukhin. On Hierarchy of Actuarial Objects: Data Processing from the
Actuarial Point of View. Spring 1999, CAS Forum
[6] Mark Dodge. Microsoft Excel Version 2002 Inside Out. 2001, Penguin Books
Disaster Recovery websites:
[7] http://www.fema.gov
[8] http://www.disasterplan.com
[9] http://www.disasterrecoveryworld.com
Modeling Hidden Exposures in Claim Severity
via the EM Algorithm
Modeling Hidden Exposures in Claim Severity via the EM
Algorithm
Grzegorz A. Rempala
Department of Mathematics, University of Louisville,
Richard A. Derrig
Automobile Insurers Bureau of Massachusetts
Abstract
We consider the issue of modeling the so-called hidden severity exposure occurring through
either incomplete data or an unobserved underlying risk factor. We use the celebrated EM
algorithm as a convenient tool in detecting latent (unobserved) risks in finite mixture models
of claim severity and in problems where data imputation is needed. We provide examples of
applicability of the methodology based on real-life auto injury claim data and compare, when
possible, the accuracy of our methods with that of standard techniques.
1 Introduction
Actuarial analysis can be viewed as the process of studying profitability and solvency of an insurance
firm under a realistic and integrated model of key input random variables such as loss frequency
and severity, expenses, reinsurance, interest and inflation rates, and asset defaults. In a modern
analysis of financial models of property-casualty companies, these input variables typically can
be classified into financial market variables and underwriting variables (cf. e.g., D'Arcy et al.
1997). The financial variables generally refer to asset-side generated cash flows of the business,
and the underwriting variables relate to the cash flows of the liabilities side. The process of
developing any actuarial model begins with the creation of probability distributions of these input
variables, including the establishment of the proper range of values of input parameters. The use of
parameters is generally determined by the use of the parametric families of distributions, although
the non-parametric techniques have a role to play as well (see, e.g., Derrig, et al, 2001). In this
article we consider an issue of hidden or "lurking" risk factors or parameters and point out the
possible use of the celebrated EM algorithm to uncover those factors. We begin by addressing
the most basic questions concerning hidden loss distributions. To keep things in focus we will
be concerned here only with two applications to modeling the severity of loss, but the methods
discussed may be easily applied to other problems like loss frequencies, asset returns, asset defaults,
and combining those into models of Risk Based Capital, Value at Risk, and general Dynamic
Financial Analysis, including Cash Flow Testing and Asset Adequacy Analysis. Our applications
will illustrate the use of the EM algorithm (i) to impute missing values in an asset portfolio and
(ii) to screen medical bills for possible fraud or abusive practices.
consider a complete data set Z = (X, Y) and specify the joint density

\[
p(Z \mid \Theta) = p(X, Y \mid \Theta), \qquad (1)
\]

which is often called the complete likelihood. For the sake of computational simplicity it is often
more convenient to consider the logarithm of the complete likelihood

\[
\ell(\Theta \mid Z) = \log p(X, Y \mid \Theta) = \sum_{j=1}^{n} \log p(x_j, y_j \mid \Theta). \qquad (2)
\]

Note that the function above may be thought of as a random variable since it depends on the
unknown or missing information Y, which by assumption is governed by an underlying probability
distribution. Note also that, in accordance with the likelihood principle, we now regard X as
constant.
The EM algorithm as described in Dempster, Laird and Rubin (1977) consists of two steps
repeated iteratively. In its expectation step, or E-step, the algorithm first finds the expected
value of the complete log-likelihood function log p(X, Y | Θ) with respect to the unknown data Y,
given the observed data X and the current parameter estimates. That is, instead of the complete
log-likelihood (2) we consider the following

\[
Q\!\left(\Theta, \Theta^{(i-1)}\right) = E\!\left[\log p(X, Y \mid \Theta) \;\middle|\; X, \Theta^{(i-1)}\right]. \qquad (3)
\]

Note the presence of the second argument in the function Q(Θ, Θ^(i-1)). Here Θ^(i-1) stands for
the current value of the parameter Θ at iteration (i - 1), that is, the value which is used to
evaluate the conditional expectation.
After the completion of the E-step, the second step of the algorithm is to maximize the
expectation computed in the first step. This is called the maximization or M-step, at which
time the value of Θ is updated by taking

\[
\Theta^{(i)} = \arg\max_{\Theta}\; Q\!\left(\Theta, \Theta^{(i-1)}\right).
\]
inequality) that if Θ* maximizes Q(Θ, Θ^(i-1)) with respect to Θ for fixed Θ^(i-1), then

\[
\ell\!\left(\Theta^{*} \mid Z\right) \;\geq\; \ell\!\left(\Theta^{(i-1)} \mid Z\right),
\]

and each iteration of the procedure indeed increases the value of the complete log-likelihood (2). Let
us note that from the above argument it follows that a full maximization in the M-step is not
necessary: it suffices to find any value of Θ^(i) such that Q(Θ^(i), Θ^(i-1)) ≥ Q(Θ^(i-1), Θ^(i-1)).
Such procedures are called GEM (generalized EM) algorithms. For a complete set of references
see, for instance, the monograph by McLachlan and Krishnan (1997), where the issues of
convergence rates for the EM and GEM algorithms are also thoroughly discussed. For some additional
references and examples see also Wu (1983) or the monographs by Little and Rubin (1987) and
Hastie, Tibshirani, and Friedman (2001).
Table 1: 10 fictitious observed gains and losses from two risk portfolios in thousands.
essential to the analysis most often results in throwing the record out, thereby creating unknown
'hidden' biases. Likewise, financial time series data may be interrupted, unavailable, or simply lost
for securities or portfolios that are not widely tracked.
As an illustration of an application of the EM algorithm in this setting let us consider a
hypothetical example of 10 losses/gains from a two-dimensional vector of risk portfolios, which
we have generated using a bivariate normal distribution. The data is presented in Table 1 (in
thousands of dollars). As we can see parts of the last four observations are missing from the table.
In fact, for the purpose of our example, they have been removed from the generated data. We
shall illustrate the usefulness of the EM algorithm in estimating these missing values.
If we denote by X the observed (incomplete) data listed in Table 1 then, following our notation
from the previous section, we have the complete data vector Z given by

\[
Z = (z_1, \ldots, z_{10}) = \left(x_1, \ldots, x_6,\; (x_{1,7}, y_{2,7})^T,\; (x_{1,8}, y_{2,8})^T,\; (y_{1,9}, x_{2,9})^T,\; (y_{1,10}, x_{2,10})^T\right),
\]

where x_j = (x_{1,j}, x_{2,j})^T for j = 1, ..., 6 is the set of pairwise complete observations. The missing
data (corresponding to the ? marks in Table 1) is, therefore,

\[
Y = \left(y_{2,7},\; y_{2,8},\; y_{1,9},\; y_{1,10}\right). \qquad (5)
\]

Assuming, as above, that the data come from a bivariate normal distribution with mean vector
μ = (μ_1, μ_2)^T and covariance matrix Σ, the complete log-likelihood (2) becomes

\[
\ell(\Theta \mid Z) = -n\log(2\pi) - \frac{n}{2}\log\lvert\Sigma\rvert - \frac{1}{2}\sum_{j=1}^{n} (z_j - \mu)^T \Sigma^{-1} (z_j - \mu),
\qquad \text{where} \qquad
\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}.
\]
In order to describe the EM algorithm in this setting we need to find the particular form of
Q(Θ, Θ^(i-1)) defined by (3). Due to the independence of the z_i's this is equivalent, in effect, to
evaluating

\[
E_{\Theta^{(i-1)}}(Y \mid X) \qquad \text{and} \qquad E_{\Theta^{(i-1)}}(Y^2 \mid X),
\]

where Y is the underlying random variable for the missing data, assumed to be normal. From the general formulae
for conditional moments of a bivariate normal variable X = (X_1, X_2) with the set of parameters
Θ as above, we have that

\[
y_{2,k}^{(i)} = \mu_2^{(i)} + \frac{\sigma_{12}^{(i)}}{\sigma_{11}^{(i)}}\left(x_{1,k} - \mu_1^{(i)}\right)
\qquad \text{and} \qquad
\left(y^2\right)_{2,k}^{(i)} = \left(y_{2,k}^{(i)}\right)^2 + \sigma_{22.1}^{(i)}
\qquad \text{for } k = 7, 8,
\]

\[
y_{1,k}^{(i)} = \mu_1^{(i)} + \frac{\sigma_{12}^{(i)}}{\sigma_{22}^{(i)}}\left(x_{2,k} - \mu_2^{(i)}\right)
\qquad \text{and} \qquad
\left(y^2\right)_{1,k}^{(i)} = \left(y_{1,k}^{(i)}\right)^2 + \sigma_{11.2}^{(i)}
\qquad \text{for } k = 9, 10,
\]

where σ_{22.1} = σ_22 − σ_12²/σ_11 and σ_{11.2} = σ_11 − σ_12²/σ_22 are the conditional variances.
3. The M-step: given the current value of the imputed complete data vector Z^(i) = (X, Y^(i)), set

\[
M_k = \frac{1}{n}\sum_{j=1}^{n} z_{k,j}^{(i)} \qquad \text{and} \qquad M_{kl} = \frac{1}{n}\sum_{j=1}^{n} z_{k,j}^{(i)} z_{l,j}^{(i)} \qquad \text{for } k, l = 1, 2,
\]

and calculate Θ^(i+1) as

\[
\mu_k^{(i+1)} = M_k \qquad \text{and} \qquad \sigma_{kl}^{(i+1)} = M_{kl} - M_k M_l.
\]

4. Repeat steps 2 and 3 until the relative difference of the subsequent values of ℓ(Θ^(i+1) | Z^(i))
is sufficiently small.
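The function em.buck listed in Appendix A implements this iterative procedure in R. As a purely
illustrative sketch (the numbers below are hypothetical and are not the values from Table 1), it could be
called as follows, with d holding the pairwise complete observations, d1 the first components of the
records whose second component is missing, and d2 the second components of the records whose first
component is missing:

# hypothetical data laid out as in Table 1 (six complete pairs, four partial records)
d  <- cbind(c(0.5, 1.2, -0.3, 1.8, 0.9, 1.1),
            c(3.1, 4.0,  2.2, 4.4, 3.6, 3.9))
d1 <- c(0.7, 1.5)    # x1 observed, x2 missing (records 7 and 8)
d2 <- c(2.9, 4.1)    # x2 observed, x1 missing (records 9 and 10)
fit <- em.buck(d, d1, d2)   # defaults: at most B = 500 iterations, eps = .0001
fit$m    # estimated mean vector
fit$R    # estimated covariance matrix
fit$w    # data with the missing components imputed (last four rows)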
The above algorithm in its non-iterative version was first introduced by Buck (1960), who used
the method of imputation via linear regression with subsequent covariance correction to estimate
means and covariance matrices of p-dimensional random vectors in the case when some parts of the
vector components were missing. For more details about Buck's imputation procedure, we refer
to his original paper (Buck 1960) or to Chapter 3 of Little and Rubin (1987) or Chapter 2 of
McLachlan and Krishnan (1997).
The numerical illustration of the algorithm is presented in Table 2. As we can see from the
Iteration   μ1       μ2       σ11      σ12      σ22      y2,7     y2,8     y1,9     y1,10    −2ℓ
1 0.6764 3.5068 1.8170 0.3868 2.0671 3.4399 3.6069 1.0443 1.1867 65.7704
5 0.8770 3.6433 1.8618 0.8671 2.2030 3.4030 3.7685 1.5982 1.8978 64.7568
10 0.9279 3.6327 1.9463 0.9837 2.1724 3.3466 3.7433 1.7614 2.1061 64.5587
20 0.9426 3.6293 1.9757 1.0181 2.1639 3.3301 3.7345 1.8102 2.1683 64.5079
30 0.9435 3.6291 1.9775 1.0202 2.1634 3.3291 3.7339 1.8132 2.1722 64.5048
35 0.9436 3.6291 1.9776 1.0203 2.1633 3.3290 3.7339 1.8134 2.1724 64.5047
40 0.9436 3.6291 1.9777 1.0204 2.1633 3.3290 3.7339 1.8134 2.1724 64.5046
45 0.9436 3.6291 1.9777 1.0204 2.1633 3.3290 3.7339 1.8134 2.1725 64.5046
table with the accuracy of up to three significant digits, the algorithm seems to converge after
about 30 steps or so and the estimated or imputed values of (5) are given by
Let us note, for the sake of comparison, that if we were to employ the standard, "naive" linear or
polynomial regression model based on 6 complete observations in order to fit the missing values in
Table 1 we would have obtained in this case
Both y(em) and y(reg) can now be compared with the actual values removed from Table 1, which
were
y = (3.362, 3.657, 1.484, 3.410).
As we can see, in our example the EM method did reasonably well in recovering the missing values.
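A minimal sketch of how such a "naive" regression imputation could be carried out in R, assuming the
six complete pairs sit in a two-column matrix d and the partial records d1, d2 are as in the hypothetical
sketch following Algorithm 1 (all values made up for illustration):

# regress each component on the other using only the six complete pairs
fit21 <- lm(x2 ~ x1, data = data.frame(x1 = d[,1], x2 = d[,2]))
fit12 <- lm(x1 ~ x2, data = data.frame(x1 = d[,1], x2 = d[,2]))
# predict the missing second components (records 7, 8) and first components (records 9, 10)
y.reg <- c(predict(fit21, newdata = data.frame(x1 = d1)),
           predict(fit12, newdata = data.frame(x2 = d2)))
y.reg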
2.2 Massachusetts Auto Bodily Injury Liability Data. Fraud and Build-up Screen-
ing via Mixture Models
By now it is fairly well known that fraud and build-up, exaggerated injuries and/or excessive
treatment, are key components of the auto injury loss distributions (Derrig et al. 1994, Cummins
and Tennyson 1996, Abrahamse and Carroll 1999). Indeed, injury loss distributions are prime
candidates for mixture modeling, for at least the differing of payment patterns by injury type. Even
within an injury type as predominant as strain and sprain,² there can be substantial differences in
subpopulations arising from fraud and build-up. One common method of identifying these claims
has been to gather additional features of the claim, the so-called fraud indicators, and to build
models to identify those bogus claims (Brockett, et al. 1998). The acquisition of reliable indicators,
some of which may be highly subjective, is costly, and may not be efficient in uncovering abusive
patterns in injury claims (Crocker and Tennyson 1999). The use of more flexible methods such as
fuzzy logic (see more below) may overcome the lack of precision in subjective features in
an economically efficient manner by running a background algorithm on adjusters' electronic files
(see, for example, Derrig and Ostaszewski 1995, 1999).
Another approach to uncovering fraud and build up, perhaps grounded more in practical con-
siderations, is to construct a filter, or screening algorithm, for medical provider bills (Derrig 2002).
Routinely, excessive medical bills can be reduced to "reasonable and customary" levels by com-
puter algorithms that compare incoming bills to right censored billing distributions with "excessive"
being operationally defined to be above the censoring point. Less routine is the implementation
of systematic analysis of the patterns of a provider's billing practices (Major and Riedinger 1992).
Our second application of the EM algorithm is to build a first level screening device to uncover
potential abusive billing practices and the appropriate set of claims to review. We perform the
pattern analysis by uncovering abusive-like distributions within mixture models parametrized by
the estimates obtained via the EM algorithm. An illustration of the method follows.
In the table provided in Appendix B we present a set of outpatient medical provider's total
billings on the set of 348 auto bodily injury liability claims closed in Massachusetts during 2001.
For illustration purposes, 76 claims with one "outlier" provider ("A") were chosen based on a
pattern typical of abusive practice; namely, an empirical kurtosis more than five times the overall
average. The "outlier" was then combined with medical bills in claims from a random sample
of providers. The losses are recorded in thousands and are presented in column two. Column 4
identifies each medical billing amount as provider "A" or "other". We will use the EM algorithm
applied to a normal (log) mixture model attempting to uncover provider A.
The relatively large volume of provider A's claims is clearly visible in the left panel of Figure 1,
where it is presented as a portion of the overall claims
Whereas the volume of claims by itself never constitutes a basis for the suspicion of fraud or
build-up, it certainly might warrant a closer look at the data at hand, especially via some type of
²Currently, Massachusetts insured bodily injury claims are upwards of 80 percent strain and sprain claims as the
most costly part of the medical treatment. Of course, that may have a dependency on the $2,000 threshold
to file a tort claim.
Figure 1: Overall distribution of the 348 BI medical bill amounts from Appendix B compared with
that submitted by provider A. Left panel: frequency histograms (provider A's histogram in filled
bars). Right panel: density estimators (provider A's density in dashed line).
homogeneity analysis, since the second panel in Figure 1 clearly indicates the difference between
the overall claims distribution and that of the provider A. Hence in this problem we shall be looking
for a hidden exposure which could manifest itself as a non-homogenous component of the data,
albeit we shall not be assuming that this component is necessarily due to provider A. In fact, as the
initial inspection of the overall data distribution does not immediately indicate non-homogeneity we
shall not make any prior assumptions about the nature or source of the possible non-homogeneity.
Since the standard analysis of the data by fitting a kernel density estimator (see the solid curve
in the right panel of Figure 1) appears to give no definite indication of multimodality, it seems that
some more sophisticated methods are needed in order to identify any foreign components of the
claims. Whereas many different approaches to this difficult problem are possible, we have chosen
one that shall illustrate the applicability of the EM methodology in our setting. Namely, we shall
attempt to fit a log-mixture-normal distribution to the data, that is, we shall model the logarithm
of the claim outpatient medical billing distribution as a mixture of several normal variables. The
use of normal distributions here is mostly due to convenience of the EM implementation and in
more complicated real life problems can be inappropriate. However, the principle that we shall
attempt to describe here is, in general, applicable to any mixture of distributions, even including
non-parametric ones. 3
³The notion of fitting non-parametric distributions via likelihood methods, which at first may seem a contradiction
in terms, has become very popular in statistics over the last decade. This is due to intensive research into the so-
called empirical likelihood methods (see for instance a recent monograph by Owen 2001 and references therein). In
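For reference, the standard kernel density estimate mentioned above (the solid curve in the right panel
of Figure 1) can be produced in R in a couple of lines, assuming the logarithms of the bill amounts are
stored in a numeric vector log.amounts (a hypothetical name):

hist(log.amounts, freq = FALSE, main = "BI medical bill amounts (log scale)")
lines(density(log.amounts))   # kernel density estimate, as in the right panel of Figure 1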
In order to describe our method in the context of the EM algorithm we shall again relate the
problem at hand to our EM methodology introduced in Section 1. In our current setting we shall
consider the set of logarithms of the BI claim medical bills as the incomplete data X. According to
our model assumption we identify the underlying random variable X, of which X is a realization,
as a mixture of several (say, m ≥ 2) normal distributions⁴

\[
X = \sum_{j=1}^{m} Y_j X_j, \qquad (8)
\]

where Y_j ∈ {0, 1} with P(Y_j = 1) = π_j such that Σ_j π_j = 1, and the joint distribution of the
vector (Y_1, ..., Y_m) is multinomial with one trial (i.e., Σ_j Y_j = 1). The right hand side of (8) is
sometimes known as the generative representation of a mixture. Indeed, if we generate a multinomial
variable (Y_1, ..., Y_m) with probabilities of Y_j = 1 equal to π_j and, depending on the index j for
which the outcome is a unity, deliver X_j, then it can be shown that the density of X is

\[
\sum_{j=1}^{m} \pi_j\, p(x \mid \theta_j), \qquad (9)
\]

where p(· | θ_j) is a normal density with the parameter θ_j = (μ_j, σ_j).
Hence X is indeed a mixture of the X_j's. The density given by (9) is less helpful in our approach
as it doesn't explicitly involve the variables Y_j's. Moreover, fitting the set of parameters⁵
indicating whether or not x_j arrives from the distribution of X_k. In this setting the complete
log-likelihood function (2) takes the form

\[
\ell(\Theta \mid Z) = \sum_{j=1}^{n}\sum_{k=1}^{m} y_{jk}\left[\log \pi_k + \log p(x_j \mid \theta_k)\right], \qquad (11)
\]

while the corresponding expected complete log-likelihood (3) becomes

\[
Q\!\left(\Theta, \Theta^{(i-1)}\right) = \sum_{j=1}^{n}\sum_{k=1}^{m} \delta_{jk}\left[\log \pi_k + \log p(x_j \mid \theta_k)\right]. \qquad (12)
\]
As we can see from the above formulae, in this particular case Q(Θ, Θ^(i-1)) is obtained from the
complete data likelihood by substituting for the unknown y_jk's their conditional expectations δ_jk's
calculated under the current value of the estimates of Θ.⁶ The quantity δ_jk is often referred to as
the responsibility of the component X_k for the observation j. This terminology reflects the fact
that we may think about the final δ_jk as the conditional (posterior) probability of the j-th observation
arriving from the distribution of X_k.
Once we have replaced the y_jk's in (11) by the δ_jk's, the maximization step of the EM algorithm
is straightforward and applied to (12) gives the usual weighted ML estimates of the normal means,
variances, and the mixing proportions (see below for the formulae). However, in order to proceed
with the EM procedure we still need to construct the initial guesses for the set of parameters (10).
A good way to do so (for a discussion, see, for instance, Chapter 8 of Hastie et al. 2001 or Xu and
Jordan 1996) is to simply choose at random m of the observed claim values as the initial estimates
of the means, and set all the estimates of the variances to the overall sample variance. The
mixing proportion can be set uniformly over all components. This way of initiating the parameters
ensures the relative robustness of the final estimates obtained via EM against any particular initial
conditions. In fact, in our BI data example we have randomly selected several initial sets of values
for the means and in all cases have obtained convergence to the same set of estimates. Below we
present the detailed EM algorithm we have used to analyze the Massachusetts auto BI data. In
order to identify the number m of the mixture components in the model we have used the EM
method to obtain the estimates of the complete log-likelihood function (as the final values of (12))
for m = 2, 3, 4 (we had determined earlier that for m > 4 the BI mixture model becomes too
cumbersome). The results are presented in Table 3. As can be seen from the last row of the table,
m = 3 is the number of components minimizing the negative of the estimated log-likelihood (12).
Henceforth we shall, therefore, take m = 3 for the BI mixture model.
⁶It may happen that some of the values y_jk are in fact available. In such cases, we would take δ_jk = y_jk.
Table 3: Comparison of the mixture fit for the different values of m for the BI data
Parameter     m = 2     m = 3     m = 4
μ1            0.071     0.107     -0.01
μ2            1.110     0.874     0.218
μ3                      1.248     0.911
μ4                                1.258
σ1^(1/2)      1.265     1.271     1.201
σ2^(1/2)      0.252     0.178     1.349
σ3^(1/2)                0.146     0.214
σ4^(1/2)                          0.144
π1            0.470     0.481     0.250
π2            0.530     0.205     0.224
π3                      0.314     0.247
π4                                0.279
−2Q         819.909   811.381   811.655
Table 4: Selected iterations of the EM algorithm for the three-component normal mixture fitted to the BI data

Iteration    μ1      μ2      μ3    σ1^(1/2) σ2^(1/2) σ3^(1/2)   π1      π2      π3      −2Q
1          0.229   0.785   0.885   1.172   0.89    0.843   0.35    0.320   0.321   973.115
5         -0.129   0.946   1.054   1.374   0.525   0.356   0.337   0.301   0.361   854.456
6         -0.131   0.953   1.083   1.357   0.499   0.300   0.349   0.281   0.370   839.384
10        -0.041   0.917   1.137   1.324   0.456   0.223   0.396   0.217   0.387   820.903
20         0.042   0.875   1.166   1.302   0.364   0.207   0.438   0.177   0.385   817.363
30         0.064   0.876   1.184   1.29    0.301   0.200   0.453   0.176   0.372   816.143
40         0.074   0.871   1.204   1.285   0.259   0.188   0.460   0.186   0.354   814.957
50         0.084   0.868   1.226   1.281   0.222   0.17    0.467   0.197   0.336   813.367
60         0.099   0.871   1.243   1.275   0.190   0.153   0.476   0.204   0.320   811.838
64         0.105   0.873   1.247   1.272   0.180   0.147   0.48    0.205   0.315   811.454
65         0.107   0.874   1.248   1.271   0.178   0.146   0.481   0.205   0.314   811.381
Algorithm 2 (EM for normal mixtures)

1. Define the initial estimate Θ^(0) of the set of parameters (10) (see discussion above).

2. The E-step: given the current value of Θ^(i), compute the responsibilities δ_jk as

\[
\delta_{jk} = \frac{\pi_k^{(i)}\, p\!\left(x_j \mid \theta_k^{(i)}\right)}{\sum_{l=1}^{m} \pi_l^{(i)}\, p\!\left(x_j \mid \theta_l^{(i)}\right)}, \qquad j = 1, \ldots, n \ \text{ and } \ k = 1, \ldots, m. \qquad (13)
\]

3. The M-step: given the responsibilities, update the weighted estimates of the means, variances,
and mixing proportions as

\[
\mu_k^{(i+1)} = \frac{\sum_{j=1}^{n} \delta_{jk}\, x_j}{\sum_{j=1}^{n} \delta_{jk}}, \qquad
\sigma_k^{(i+1)} = \frac{\sum_{j=1}^{n} \delta_{jk}\left(x_j - \mu_k^{(i+1)}\right)^2}{\sum_{j=1}^{n} \delta_{jk}}, \qquad
\pi_k^{(i+1)} = \frac{1}{n}\sum_{j=1}^{n} \delta_{jk},
\]

for k = 1, ..., m.

4. Repeat steps 2 and 3 until the relative difference of the subsequent values of (12) is suffi-
ciently small.
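The steps above are implemented by the function em.multnorm listed in Appendix A. A minimal usage
sketch follows; the data here are simulated (with components roughly mimicking the fitted values reported
in Table 3), since the actual claim amounts are listed in Appendix B:

set.seed(1)
# simulated log bill amounts from a three-component normal mixture (illustration only)
log.amounts <- c(rnorm(170, mean = 0.1,  sd = 1.3),
                 rnorm(70,  mean = 0.9,  sd = 0.2),
                 rnorm(110, mean = 1.25, sd = 0.15))
fit <- em.multnorm(log.amounts)   # defaults: three components, eps = .0001, B = 100
fit$m; fit$s; fit$pi              # estimated means, standard deviations, mixing proportions
head(round(fit$resp, 2))          # responsibilities delta_jk for the first few observations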
Figure 2: EM Fit. Left panel: mixture of normal distributions fitted via the EM algorithm to the BI
data. Right panel: three normal components of the mixture. The values of all the parameters are
given in the last row of Table 4.
In Figures 2 and 3 we present graphically the results of the analysis of the BI data via the
mixture model with m = 3 using the EM algorithm as described above. Some selected iterations
of the EM algorithm for the three component normal mixture are presented in Table 4. In the
left panel of Figure 2 we show the fit of the normal mixture fitted to the data using Algorithm 2
(with parameter values given by the last row of Table 4). As we can see, the fit looks reasonable
and the fitted mixture distribution looks similar to the standard density estimator (solid curve in
the right panel of Figure 1). The mixture components identified by the EM method are presented
in the right panel of Figure 2 and clearly indicate non-homogeneity of the data which seems to
consist of two (in fact, three) different types of claims. This is, obviously, related to a high volume
of claims in the interval around 1.8-4.5 thousands (corresponding to the values .6-1.5 on the log
scale). This feature of the data is modeled by the two tall and thin (i.e., with small dispersion)
components of the mixture (corresponding in our notation to X2 and X3, marked as solid and
dashed curves, respectively). Let us also note the very pronounced difference (over seven-fold) in
the spread between the first and the two last components.
Figure 3: Latent risk in BI data modeled by the EM algorithm with m = 3. Left panel: set of the
responsibilities δj3. Right panel: the third component of the normal mixture compared with the
distribution of provider A's claims ("A" claims density estimator is a solid curve).
In the left panel of Figure 3 we present the set of responsibilities (δj3) of the model (or
component) X3 as calculated by the EM algorithm, superimposed on the histogram of the BI data.
The numerical values of the responsibilities for each data point are also listed in the last column of
the table in Appendix B. The relationship between the set of responsibilities obtained via the EM
procedure and the apparent lack of homogeneity of the data, demonstrated by Figure 2, is easy to
see. The high responsibilities are clustered around the claim values within two standard deviations
of the estimated mean (1.25) of the tallest distribution X3. Hence the plot of responsibilities
superimposed on the data distribution again uncovers the non-homogeneity or the risk factor
which was initially hidden. As we can see from the right panel in Figure 3 the observed non-
homogeneity may be attributed largely, as initially expected (and as the illustration intended), to
the high kurtosis of "A" claims. Indeed, the superimposing of the distribution of "A" claims (solid
curve) on the component X3 (dashed curve) in the right panel of Figure 3 reveals a reasonably close
match in the interval (.8, 1.7) or so. Outside this interval the normal approximation to the provider
A's claims fails, mostly due to the fact that the normal mixture model employed is not sufficiently
"fine tuned" in its tails to look for this particular type of distribution. The deficiency could be
perhaps rectified in this particular case by incorporating some different (non-normal) components
into the mixture model. However, our main task in this analysis was to merely uncover hidden
factors (if any) and not necessarily to model them precisely, which should be done afterwards using
some different, more sophisticated modeling approach depending on the type of problem at hand.
See, for instance, Bilmes (1998), who presents the extension of our Algorithm 2 to the so-called
general hidden Markov model (HMM). For a full review of some possible approaches to fitting
finite mixture models and the use of the EM methodology in this context, readers are referred
to the recent monograph by McLachlan and Peel (2000) which also contains some descriptions of
the currently available software for fitting a variety of non-normal mixtures via the EM method.
mixing distribution (i.e., distribution of X3). Indeed, the α-cut at about 0.5 provides us with a good
indication that "A" arises from the third mixing distribution (corresponding to the value 75% in the
table) but not from the first one (corresponding to 8% value only). These findings are consistent
with those illustrated by Figure 3. In contrast, the second mixing distribution (distribution of
X2) does not allow us to classify correctly "A" and "other" in our three-mixture model. The
low proportion of "A" claims assigned to the model X2 indicates that they are generally unlikely
to arrive from X2 which may be an indication of some further non-homogeneity among claims,
even after adjusting for the type "A". The X2 component could be, therefore, the manifestation
of some additional hidden factors, which again confirms the findings summarized in the previous
section.
bootstrap method outlined in Algorithm 3 below. The method can be shown to be asymptotically
equivalent to the normal approximation approach and is known to be often more reliable for smaller
sample sizes or for the heavily biased estimators (which will often be the case for the responsibilities
(13)). The algorithm below describes how to obtain confidence intervals for the parameters given
by (10) and (13) using the bootstrap. For some more examples and further discussion see, for instance,
McLachlan and Peel (2000) or the forthcoming paper by Rempala and Szatzschneider (2002) where
also the issue of the hypothesis testing for the number of mixture components via the parametric
bootstrap method is discussed.
Algorithm 3 (Bootstrap confidence intervals)

1. Using the values of the model parameters (10) obtained from the EM algorithm, generate
a set of pseudo-data X* (typically of the same length as the original data X).

2. With X* at hand, use Algorithm 2 in order to obtain a set of pseudo-values Θ*.

3. Using the set of the original data values X and Θ* from step 2 above, calculate the pseudo-
responsibilities δ*jk as in Algorithm 2, step 2.

4. Repeat steps 1 through 3 a large number of times (B).

5. Use the empirical quantiles of the distributions of the pseudo-values Θ* and δ*jk to obtain con-
fidence bounds for Θ and δjk.
For illustration purposes we present the set of confidence intervals for the three-mixture-normal
model parameters and the responsibilities (of X3) obtained via the above algorithm for the BI data
in Tables 6 and 7 below. The term "bootstrap estimate" in the tables refers to the average value
of the B bootstrap pseudo-values obtained in steps 2 or 3.
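A minimal sketch of steps 1, 2 and 4 in R, reusing em.multnorm from Appendix A and a previously
fitted model object fit (step 3, the pseudo-responsibilities computed on the original data, is omitted
here for brevity):

boot.em <- function(fit, B = 1000) {
  n <- length(fit$data)
  k <- length(fit$pi)
  out <- vector("list", B)
  for (b in 1:B) {
    # step 1: generate pseudo-data from the fitted mixture (parametric bootstrap)
    comp   <- sample(1:k, n, replace = TRUE, prob = fit$pi)
    x.star <- rnorm(n, mean = fit$m[comp], sd = fit$s[comp])
    # step 2: refit the mixture to the pseudo-data via Algorithm 2
    out[[b]] <- em.multnorm(x.star)
  }
  out   # step 5: take empirical quantiles of the stored pseudo-values for confidence bounds
}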
Table 6: Accuracy of the parameter estimates for the BI data with B=1000
their abusive claims, utilizing the EM algorithm. The usefulness of the EM output for classification
purposes and its connections with fuzzy logic techniques were discussed. Namely, the EM algorithm
output of posterior probabilities, called responsibilities, was reinterpreted as a fuzzy set membership
function in order to bring the machinery of fuzzy logic to bear in the classification problem. The
Monte-Carlo based method of assessing the accuracy of the model parameters fitted via the EM
algorithm, known as the parametric bootstrap, was also presented and the appropriate algorithm
for its implementation was developed. The set of functions written in the statistical language
R, implementing the EM algorithms discussed in the paper, has been included in Appendix A to
allow readers to try different actuarial situations where missing data and hidden components might
be found. A large variety of actuarial and financial applications of the presented methodology are
possible, including its incorporation into models of Risk Based Capital, Value at Risk, and general
Dynamic Financial Analysis. We hope that this paper shall promote enough interest in the EM
methodology for further exploration of those opportunities.
References
Abrahamse, Alan F. and Stephan J. Carroll (1999). The Frequency of Excess Claims for Auto-
mobile Personal Injuries, in Automobile Insurance: Road Safety, New Drivers, Risks, Insurance
Fraud and Regulation, Dionne, Georges and Claire Laberge-Nadeau, Eds., Kluwer Academic
Publishers, pp. 131-150.
Bilmes, Jeff A. (1998). A Gentle Tutorial of the EM Algorithm and its Application to Parameter
Estimation for Gaussian Mixture and Hidden Markov Models, International Computer Science
Institute, UC Berkeley.
Brockett, Patrick L., Xiaohua Xia and Richard A. Derrig (1998). Using Kohonen's Self-Organizing
Feature Map to Uncover Automobile Bodily Injury Claims Fraud, Journal of Risk and Insurance,
June, Volume 65, No. 2.
Campbell, John Y., Andrew W. Lo, and A. Craig MacKinlay (1996). The Econometrics of
Financial Markets. Princeton University Press.
Crocker, Keith J., and Sharon Tennyson (1999). Costly State Falsification or Verification? Theory
and Evidence from Bodily Injury Liability Claims, in Automobile Insurance: Road Safety, New
Drivers, Risks, Insurance Fraud and Regulation, Dionne, Georges and Claire Laberge-Nadeau,
Eds., Kluwer Academic Publishers, pp.119-131.
Cummins, J. David, and Sharon Tennyson (1996). Controlling Automobile Insurance Costs, Journal
of Economic Perspectives, Spring, Volume 6, No. 2. pp. 95-115.
D'Arcy, S. P.,Gorvett, R. W., Herbers, J. A., and Hettinger, T. E. (1997). Building a Dynamic
Financial Analysis Model that Flies. Contingencies November/December, 40-45.
Dempster, Arthur P., N. M. Laird and D. B. Rubin (1977). Maximum likelihood from incomplete
data via the EM algorithm, Journal of the Royal Statistical Society, Series B, Volume 39, No. 1, pp.
1-38.
Derrig, Richard A. (2002). Insurance Fraud. Journal of Risk and Insurance, Volume 69, No. 3, pp.
271-287.
Derrig, Richard A. and K.M. Ostaszewski (1999). Fuzzy Sets Methodologies in Actuarial Science,
Practical Applications of Fuzzy Technologies, Hans-Jurgen Zimmerman Eds, Kluwer Academic
Publishers, Boston, (November).
Derrig, Richard A., K.M. Ostaszewski, and G.A. Rempala (2001). Applications of Resampling
Methods in Actuarial Practice, Proceedings of the Casualty Actuarial Society, Volume LXXXVII,
pp. 322-364.
Derrig, Richard A., and K.M. Ostaszewski (1995): Fuzzy Techniques of Pattern Recognition in
Risk and Claim Classification, Journal of Risk and Insurance, Volume 62, No. 3. pp. 447-482.
94
Derrig, Richard A., Herbert I. Weisberg and Xiu Chen (1994). Behavioral Factors and Lotteries
Under No-Fault with a Monetary Threshold: A Study of Massachusetts Automobile Claims,
Journal of Risk and Insurance, June, Volume 61, No. 2, pp. 245-275.
Embrechts, Paul, Alexander McNeil and Daniel Straumann (2000). Correlation and Dependence
in Risk Management: Properties and Pitfalls, in Extremes and Integrated Risk Management,
Paul Embrechts, Ed., pp. 71-76, Risk Books, London.
Hastie, Trevor, Robert Tibshirani and Jerome Friedman (2001). The Elements of Statistical Learn-
ing, Springer-Verlag New York, Inc., New York.
Little, Roderick J.A., and Donald B. Rubin (1987). Statistical Analysis with Missing Data, John
Wiley & Sons, Inc., Canada.
McLachlan, Geoffrey and David Peel (2000). Finite Mixture Models, Wiley-Interscience, New York.
McLachlan, Geoffrey and Thriyambakam Krishnan (1997). The EM algorithm and extensions,
Wiley, New York.
Major, John A., and Dan R. Riedinger (1992). EFD: A Hybrid Knowledge/Statistical-Based
System for the Detection of Fraud, International Journal of Intelligent Systems, Volume 7, pp.
687-703 (reprinted in Journal of Risk and Insurance, Volume 69, No. 3, pp. 309-324, September 2002).
Owen, Art, B. (2001). Empirical Likelihood. Chapman and Hall. New York.
Rempala, Grzegorz A. and Konrad Szatzschneider (2002). Bootstrapping Parametric Models of
Mortality. Manuscript, to appear in Scandinavian Actuarial Journal.
Wang, Shaun (1998). Aggregation of Correlated Risk Portfolios: Models and Algorithms, Proceed-
ings of the Casualty Actuarial Society, Volume LXXXIII, pp. 848-939.
Wu, C. F. Jeff (1983). On the convergence properties of the EM algorithm, Annals of Statistics,
Volume 11, No. 1, pp. 95-103.
Xu, Lei and Michael I. Jordan (1996). On convergence properties of the EM algorithm for Gaussian
mixtures, Neural Computation, Volume 8, pp. 129-151.
Appendix A. R Functions
We present here the implementation of Algorithms 1 and 2 in the statistical software R, which is a
freeware version of the award-winning statistical software S+ and is available from
http://www.r-project.org. The functions below were used in the numerical examples discussed in the text.
em.buck <- function(d, d1, d2, B=500, eps=.0001) {
  # d: two-column matrix of complete observations; d1, d2: partially observed records
  n  <- length(d[,1]);
  n1 <- length(d1);
  n2 <- length(d2);
  m  <- apply(d, 2, mean);
  R  <- cov(d);
  rho <- cor(d)[1,2]; nLL.old <- eps; nLL.new <- 100;
  # start with the missing components imputed by the marginal means
  w <- rbind(d, cbind(d1, rep(m[2], n1)), cbind(rep(m[1], n2), d2));
  i <- 1; # main loop #
  while (abs(nLL.new - nLL.old)/nLL.old > eps && i <= B) {
    T1 <- sum(w[,1]); T2 <- sum(w[,2]); T12 <- sum(w[,1]*w[,2]);
    T11 <- sum(w[(n+n1+1):(n+n1+n2),1]^2
          + R[1,1]*(1-rho^2)) + sum(w[-((n+n1+1):(n+n1+n2)),1]^2);
    T22 <- sum(w[(n+1):(n+n1),2]^2 + R[2,2]*(1-rho^2)) + sum(w[-((n+1):(n+n1)),2]^2);
    R <- array(c(T11-T1^2/(n+n1+n2), T12-T1*T2/(n+n1+n2), T12-T1*T2/(n+n1+n2),
          T22-T2^2/(n+n1+n2))/(n+n1+n2), c(2,2));
    m <- c(T1/(n+n1+n2), T2/(n+n1+n2));
    rho <- R[1,2]/sqrt(R[1,1]*R[2,2]);
    # E-step: impute the missing components by their conditional expectations
    w[(n+1):(n+n1),2] <- m[2] + R[1,2]*(w[(n+1):(n+n1),1] - m[1])/R[1,1];
    w[(n+n1+1):(n+n1+n2),1] <- m[1] + R[1,2]*(w[(n+n1+1):(n+n1+n2),2] - m[2])/R[2,2];
    nLL.old <- nLL.new;
    s <- 0; for (k in 1:(n+n1+n2)) s <- (w[k,]-m) %*% solve(R) %*% (w[k,]-m) + s;
    nLL.new <- 2*(n+n1+n2)*log(2*pi) + s + (n+n1+n2)*log(abs(det(R)));
    i <- i+1; }; # end main loop #
  print(paste("n=", n, n1, n2, "Theta estimates=", m[1], m[2], R[1,1],
        R[1,2], R[2,2], "iter=", i-1, "-2LL=", nLL.new, "rho=", rho))
  return(list(m=m, R=R, iter=i-1, LL=nLL.new, w=w)); }
# output parameters: list of objects (m, R, iter, LL, w)
# m    - vector of estimated means
# R    - estimated covariance matrix
# iter - number of iterations until convergence
# w    - concatenated dataframe of d, d1, d2 along with imputed missing values

# Algorithm 2: EM for normal mixtures ##############################
# auxiliary function
isum <- function(a, p.new, m, s) { k <- length(m); ss <- 0;
  for (i in 1:k) ss <- ss + p.new[[i]]*dnorm(a, m[[i]], s[[i]]);
  return(ss) }
# facilitates calculation of LL in the main procedure below
# input parameters:
# a   - any list of numeric data
# pi  - initial estimate of mixing proportions (default value: uniform over three components)
# eps - desired convergence accuracy (default value .0001)
# B   - maximal number of iterations allowed (default value 100)
# m   - initial values of means estimates (default value: random selection from a)
#################################################################
em.multnorm <- function(a, pi=c(1/3,1/3,1/3), eps=.0001, B=100, m=sort(sample(a,3)))
{ n <- length(a); k <- length(m); s <- rep(sd(a), k);
  i <- 1; p.new <- pi;
  m0 <- m;
  logl.old <- 1;
  logl.new <- sum(log(isum(a, p.new, m, s)));
  # main loop #
  while (abs((logl.new - logl.old)/logl.old) > eps && i <= B)
  { g <- NULL;
    # E-step: responsibilities of each component for each observation
    for (t in 1:k) g <- rbind(g, p.new[[t]]*dnorm(a, m[[t]], s[[t]])/isum(a, p.new, m, s));
    # M-step: weighted estimates of the means, standard deviations and mixing proportions
    m <- g %*% a / g %*% rep(1, n);
    s <- sqrt(g %*% a^2 / g %*% rep(1, n) - m^2);
    p.old <- p.new; p.new <- g %*% rep(1, n)/n; i <- i+1;
    logl.old <- logl.new;
    logl.new <- sum(log(isum(a, p.new, m, s)));
  };
  # end main loop #
  print(paste("Theta estimates", m, s, "pi=", p.new, "iter=", i-1, "-2LL=", -2*logl.new));
  return(list(m=m, s=s, pi=p.new, iter=i-1, start=m0, logl=-2*logl.new, resp=t(g), data=a)) }
Appendix B. Massachusetts Auto Insurance Bodily Injury Liability Data
Below we present the set of Auto Insurance Data discussed in the paper. Medical bill claim amounts
are given in thousands. Responsibilities δj3 are calculated according to Algorithm 2.
No   Claimed Amt   Log(Amt)   Provider   Resp. δj3   |   No   Claimed Amt   Log(Amt)   Provider   Resp. δj3
175 2.543 0,933 Other 0.14 176 2.559 0.g40 Other 0.15
177 2,572 0,945 Other 0.16 178 2.593 0.953 Other 0.18
179 2,601 0.956 Other 0.19 180 2.616 0.962 Other 0,20
181 2.619 0.963 Other 0,20 182 2.63 0,967 Other 0.21
183 2,635 0.969 Other 0.22 184 2,635 0,969 Other 0.22
185 2,653 0,976 Other 0,24 186 2.655 0.976 Other 0.24
187 2.675 0.984 Other 0.26 188 2,679 0.985 Other 0.26
189 2.697 0,992 Other 0.28 190 2.718 1,000 Other 0.31
191 2,73 1.004 Other 0.32 192 2,734 1,006 Other 0.32
193 2,755 1,013 Other 0,35 194 2.758 1.015 Other 0.35
195 2.773 1,020 Other 0,37 196 2.775 1.021 Other 0.37
197 2,78 1,022 Other 0,38 198 2,785 1,024 A 0.38
199 2.795 1.028 Other 0.40 200 2.805 1.031 Other 0.41
201 2,805 1.031 Other 0.41 202 2.808 1,032 A 0,41
203 2.88 1.058 Other 0.49 204 2.881 1.058 Other 0.50
205 2,881 1,058 A 0.50 206 2.924 1.073 A 0.54
207 2,93 1.075 Other 0.55 208 2.934 1.076 A 0,55
209 2.94 1.078 Other 0,56 210 2,972 1.089 Other 0,59
211 2.975 1.090 Other 0.59 212 3 1.0gg Other 0.62
213 3 1.099 A 0.62 214 3,025 1.107 Other 0,64
215 3,058 1,118 Other 0.67 216 3.082 1,126 A 0,68
217 3,085 1.127 Other 0,69 218 3.095 1.130 Other 0.6g
219 3.1 1.131 Other 0,70 220 3,102 1,132 A 0.70
221 3.106 1.133 Other 0.70 222 3.135 1.143 Other 0,72
223 3.17 1.154 Other 0.74 224 3,187 1,15g Other 0.75
225 3,192 1,161 A 0.75 226 3.193 1.161 Other 0.75
227 3.2 1.163 Other 0.76 228 3,21 1,166 Other 0.76
229 3.23 1.172 Other 0,77 230 3,23 1.172 Other 0.77
231 3.23 1.172 A 0,77 232 3,232 1.173 Other 0.77
233 3.235 1,174 Other 0.78 234 3.243 1.176 A 0,78
235 3.248 1.178 A 0.78 236 3.249 1,178 Other 0,78
237 3.26 1.182 Other 0.79 238 3.261 1,182 Other 0,79
239 3.272 1.185 A 0.79 240 3.29 1.191 Other 0.80
241 3,295 1.192 Other 0.80 242 3,304 1.195 Other 0.80
243 3,332 1,204 A 0.81 244 3.333 1.204 Other 0.81
245 3.338 1.205 Other 0.81 246 3,34 1.206 Other 0,82
247 3,341 1,206 A 0,82 248 3,349 1.209 A 0.82
249 3.349 1.209 A 0.82 250 3,349 1.209 A 0,82
251 3,353 1.210 A 0.82 252 3.36 1,212 Other 0,82
253 3,378 1.217 A 0.83 254 3,385 1,219 A 0.83
255 3.387 1,220 A 0.83 258 3.416 1.228 Other 0.84
257 3.429 1.232 A 0.84 258 3.438 1.235 A 0.84
259 3.444 1.237 A 0.84 260 3,469 1.244 A 0.85
261 3,473 1,245 A 0.85 262 3,473 1.245 A 0.85
263 3.475 1,246 A 0.85 264 3.477 1,246 A 0.85
265 3.505 1,254 Other 0.85 266 3.517 1,258 A 0,85
267 3.518 1.258 Other 0.85 268 3.527 1.260 A 0,85
Where is My Market? How to Use Data to Find
and Validate New Commercial Lines Market
Niches
Where is My Market?
How to Use Data to Find and Validate New Commercial Lines
Market Niches
By Lisa Sayegh, MBA, ARM
Entering a new insurance market is not a decision to be taken lightly. Market segment
analysis is a lengthy process, and finding the right data is just the beginning. Being able to
make meaningful comparisons of data from various sources and across insurance lines is
the key to identifying profitable markets. Fortunately, there are data sources and tools
available that can help with the analysis, as well as provide quantifiable assessments of
your niche-market recommendations. Here are some critical elements to keep in mind as
you go through the process.
Why segment?
The segmentation process guides the marketer through defining market characteristics to
enable comparison across slices of the market, or segments, from which attractive
markets can be identified. Consider two examples. A major insurer successfully used
segmentation to identify optimum markets with established distribution channels by
overlaying selected segments to agent territories. Another company learned that an
adjacent state was not nearly as attractive for growth as expected, for reasons other than
the regulatory environment of the state. Their initial expectation of the adjacent state was
based solely on lower distribution costs. It turned out that by segmentation analysis, they
found tremendous promise in a third region. This new segment was so attractive that it
justified increased distribution costs to access the market.
104
analysis should help your business canvas the markets and identify the most profitable
segments. Once these initial segments are proposed, together with your business
experience and knowledge of your business, data can be collected and monitored to help
justify your selections. The data provides management with quantifiable assessments of
markets and eases their decision process. It also should be emphasized that it is
necessary to review current market penetrations on an ongoing basis to identify changing
business environments.
All companies intuitively segment to a certain degree, but conducting the delineated
segmentation analysis in this article can result in higher profitability. As per the example
above, a company identified a profitable niche and decided to extend their program to the
adjacent state. Via this analysis, the insurer found that expanding into a third region - not
their first choice - provided a better opportunity. Particularly in these changing times, a
company can easily miss profitable opportunities without segmentation analysis.
There are many sources of industry standard data that marketers can use to benchmark
against, much of it not insurance-specific. Some brokers will provide comparative
performance data of your company against their entire book. Many insurance associations
may have pertinent information. One that comes to mind is the National Restaurant
Association that conducts its own survey of insurance costs. (Note that only participants in
105
the survey can access the information). The Risk and Insurance Management Society
(RIMS) conducts a survey of the self-insured market that provides some industry slices.
Industry can be a critical segment definition as it enables you to compare results across
lines of insurance. Industry premium and loss information is typically categorized according
to a risk classification code that is specific to the line of insurance. This risk-classification-code-
based data does not lend itself to easy market analysis. Comparison across individual
lines of insurance with their different classifications is difficult, and so it is difficult to identify
trends that can be applied across industries. A prime example is the commercial auto line,
which is categorized by weight of vehicle and distance of travel. If you identify a particular
truck as very profitable, how can you apply that to other industries when you are not sure
which industries use that type of truck?
The validation of the selected segments generally starts with non-insurance data as it is
more abundant and cheaper. As things get refined, it moves to insurer data which is more
scarce and more expensive.
Now that the segments have been identified, the marketer needs to support his/her
findings. At this stage, it is important to use insurance specific information for the
validation. Few industries have to calculate expenses that are paid over a period of years
when determining profitability. Business demographics are valuable up to a point. A
proposed market segment may show a large number of potential customers, but this may
or may not correlate with premium potential in the market. The number of
establishments shows the number of potential customers. Compare the premium to
number of establishments to ascertain the average premium and decide if that is a market
segment that fits your business approach. Let us say your distribution channel supports
companies with higher premium policies. Then you wouldn't be interested in a segment
with a large number of establishments resulting in low average premiums. Make sure to
check that the average premium per establishment is reflective of the market segment
activity. A few outliers such as a few jumbo companies can skew the data and distort an
average for a segment that may still be an option for your company.
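As a small illustration of that calculation in R (all figures below are hypothetical and purely for
illustration):

segments <- data.frame(
  segment        = c("Segment A", "Segment B"),
  premium        = c(3200000, 450000),   # estimated annual premium in the segment ($)
  establishments = c(1600, 900)          # number of potential customers
)
segments$avg.premium <- segments$premium / segments$establishments
segments   # 2000 vs. 500: very different fits for a channel targeting higher-premium policies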
106
It has always been a challenge for insurance marketers to obtain profitability data for
analyzing markets. Pricing for policies is based on aggregated insurance information using
classification codes.
Classification codes reflect insurance risk and since risk characteristics vary by line of
insurance, classification codes vary by line. As an example, ISO's classification scheme for
General Liability has different class codes for restaurants with or without cooking, and/or
with alcohol. The Commercial Fire and Allied line has a single code applying to all
restaurants - including bars as well. For analyzing markets, class codes may not provide
optimum information to enable comparison across niches. Even after identifying a
classification code that provides good results, it would be difficult to apply these results to
a particular industry and then apply to other industries. Therefore, it is helpful to be able to
study the data by industry segment. To compensate for the lack of actual insurance
statistics by industry, models were developed to "translate" insurance data collected by
classification code to reflect an industry group.
One product that offers data for industry specific loss ratios by line of business for the
three major lines is ISO's Market Profiler derived from modeling. This data can be used for
profitability segment analysis.
The graph analyzes one year of activity. Other factors play a role in selecting optimum
segments for your business, but at a glance, the segments for SIC 5947 in Illinois and
again in Illinois for SIC 5845 show large premium opportunity with a comparatively
reasonable loss ratio. Also attractive is SIC 5942 in Minnesota showing high profitability
but with lower premium potential. The go/no go decision may depend on the ease of
access to the Minnesota market.
107
[Chart: Premium vs. Loss Ratio - premium volume ($000s) plotted against loss ratio for the selected
SIC/state segments.]
108
Other analyses that can be conducted include evaluating trends - both historical and
forecasted. Does the segment show reasonable sustained growth over a period of time or
less reliable spiked growth? Forecasted data tells you if a trend is expected to continue.
Some forecasters concentrate on industry, some on geography. Especially of interest to
insurers when selecting markets would be expected premium growth or contraction.
Trend By Line
Conventional Premium ($000's)

Line of Business           1997       1998       1999       2000       2001       2002       2003
Workers' Compensation    5,742.$$   6,196.22   6,689.67   6,527.55   6,580.21   6,727.10   6,822.86
  Annual % increase         7.45%      7.89%      7.96%     -2.42%      0.81%      2.23%      1.42%
General Liability        6,606.47   6,759.85   7,328.01   7,827.93   8,204.38   8,652.13   9,106.23
  Annual % increase        11.24%      2.32%      8.40%      6.82%      4.81%      5.46%      5.25%
Commercial Property     13,350.30  14,014.55  14,880.96  15,725.05  16,161.39  16,743.64  17,326.78
  Annual % increase         1.96%      4.98%      6.18%      5.67%      2.77%      3.60%      3.48%
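The "Annual % increase" rows are simple year-over-year changes (the 1997 increase shown in the table
is relative to a 1996 figure not shown). For example, using the General Liability premium figures from
the table above, they can be reproduced in R as:

gl.prem <- c(6606.47, 6759.85, 7328.01, 7827.93, 8204.38, 8652.13, 9106.23)  # 1997-2003, $000's
round(100 * (gl.prem[-1] / gl.prem[-length(gl.prem)] - 1), 2)
# 2.32  8.40  6.82  4.81  5.46  5.25   (percent increase from each year to the next)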
These analyses raise "red flags" that warrant further investigation - the marketers should
draw conclusions based on multiple data results and business savvy.
109
conducting the analysis 10 years ago, would Internet companies have even shown up
on the radar?
An easy "sell" to management is cross selling. Can your identified segments or
programs utilize the current customer base? The familiar is the most comfortable.
and other valuable workers compensation information; and the Bureau of Labor
Statistics for a wide variety of other information.
• Association information - As mentioned above, RIMS is a source for self-insured
information for those marketers targeting very large accounts. The National
Restaurant Association is an example of an industry group that collects insurance
information.
• Forecasting - Economic forecasting information can be used when assessing future
trends. Even if not insurance specific, it can help identify areas of growth. For
premium forecasting, ISO's Marketwatch tracks the change in renewal pricing which
can be used for a trend analysis.
• News
• Your own data - Your own company data has tremendous value. If the information
you need is not available in your current system, check to see if the information is
on applications being received by the company but not coded.
When ready to integrate data across data sources, identify a denominator that applies to
both sources. The easiest seems to be industry identification. Currently we use the
government-defined Standard Industrial Classification, known as SIC. There are over 900 4-digit
SICs, or industry slices, available. SICs are being replaced by the North American Industry
Classification System (NAICS). NAICS codes are expected to be easier to apply to insurance
applications because the system is more process oriented as well as more relevant to our
current economy.
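As a sketch of what integrating on that denominator might look like, the following Python/pandas snippet joins two aggregate files on a common SIC key; the file names and column names are hypothetical placeholders, not references to any particular product.

```python
# Minimal sketch: integrate an internal experience file and an external industry
# file on a shared SIC key. File and column names are hypothetical.
import pandas as pd

internal = pd.read_csv("company_experience_by_sic.csv")   # columns: sic, premium, losses
external = pd.read_csv("industry_counts_by_sic.csv")      # columns: sic, establishments, payroll

combined = internal.merge(external, on="sic", how="inner")
combined["loss_ratio"] = combined["losses"] / combined["premium"]
print(combined.sort_values("loss_ratio").head())
```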
Integration across multiple data sources can be on an aggregate level and still provide
very useful information. Other denominators for integration can be business size and
geography. The Census Bureau's County Business Patterns data defines over 10 business-size
categories that are commonly accepted in the industry.
available include postal data which offers zip code-to-county mapping. And as mentioned
above, you may be capturing valuable data already such as length in business, etc.
Conclusion
As the lead time for successful returns from a new program or market decreases and more
stress is placed on pre-qualifying new market choices, the benefit of accessing accurate
data and insurance industry data about market segments is increasingly important. A
marketer can utilize the data most effectively by following a plan that includes defining and
analyzing the needs and strengths of the company, and then identifying additional market
segments that would most benefit from those strengths.
Does Credit Score Really Explain Insurance
Losses? Multivariate Analysis from a Data
Mining Point of View
Does Credit Score Really Explain Insurance Losses?
Multivariate Analysis from a Data Mining Point of View
by
Abstract
About the Authors
Cheng-Sheng Peter Wu, F.C.A.S., A.S.A., M.A.A.A., is a director in the Advanced
Quantitative Services practice of Deloitte & Touche's Actuarial and Insurance Consulting
Group. He is based in the Los Angeles, CA office. Mr. Wu received his master's degrees
in chemical engineering and statistics from the Pennsylvania State University. Mr. Wu
has published several papers in automotive engineering, tribology (lubrication
engineering), statistics, and actuarial science.
Messrs. Wu and Guszcza's address is: Deloitte & Touche LLP, 350 South Grand
Avenue, Los Angeles, CA 90071
Acknowledgments
The authors would like to thank Matt Carrier, Ravi Kumar, Jan Lommele, Sara
Schlenker, and an anonymous referee for their thoughtful comments.
Introduction
One of the more important recent developments in the U.S. insurance industry has been
the rapidly growing use of credit scores to price and underwrite personal auto and
homeowners insurance. But this development has not come without controversy.
Perhaps the most important criticism raised is that there exists no convincing causal
picture connecting poor credit history with high insurance loss potential [1-5]. Partly for
this reason, many insurance regulators and consumer advocates have expressed doubts
that the observed correlations between credit scores and insurance loss history truly
reflect an underlying reality. Some critics have suggested that these correlations might be
spurious relationships that would not survive more sophisticated (multivariate) statistical
analyses.
Given the business significance and statistical nature of this topic, it is curious that
actuaries have not participated more actively in the debate. We are aware of only two
actuarial studies that have been published so far: one published by Tillinghast, which
was associated with the NAIC credit study [4], and the other by Monaghan [5].
The aim of this paper is to review these studies and complement them with a qualitative
description of our own experiences in this area. For reasons of confidentiality, we are not
able to share detailed quantitative results in this forum. Our focus will be on the use of
credit in the line of personal auto, but many of our comments will hold true for other lines
of insurance. We will begin with several historical comments on the development of
auto classification ratemaking in the United States, and with comments on the actuarial
issues relating to the use of credit in auto ratemaking.
Personal auto ratemaking came a long way in the 20th century [6]. Prior to World War II,
auto ratemaking involved only three classes: adult, youthful operator, and business use.
The three decades after the war saw a proliferation of new class categories such as
vehicle characteristics (symbol, model year) and refined driver classifications.
Today, a typical personal auto rating plan contains hundreds, if not thousands of classes
involving the following variables:
• Vehicle Characteristics: this typically includes a vehicle symbol system as well
as a model year rating structure.
• Miscellaneous surcharges/discounts: this is where rating plans vary the most
from company to company. Special surcharges or discounts are used to reflect
policy characteristics or advances in motor vehicle technology. Commonly seen
discounts include multi-car discounts, homeowner discounts, safe driver
discounts, anti-lock brake discounts, anti-theft discounts, affinity group factors,
and so on.
In addition to the above class variables, a typical rating plan is not complete without a
tier rating structure. A tier structure is designed to address rating inadequacies that an
insurer believes exists in a class plan. For example, an insurer might create three
companies for its preferred, standard, and high-risk books, and the rate differential for
such companies can range from -20% to 20%. Such differentials are typically applied at
the policy level, across all coverages. Tier rating factors can include characteristics that
are not used in the class plan, such as how long an insured has been with the insurer.
They can also include certain interactions of class factors, such as youthful drivers with
poor driving records.
As class plan structures have become more complex, the problem of estimating rates for
each combination of class variables has become more difficult. This is because many of
the variables used to define rating factors are not statistically independent. For this
reason, factors based on univariate analyses of the variables are not necessarily
appropriate for a multi-dimensional rating structure. Some form of multivariate analysis
is called for.
To take a concrete example, suppose that an existing rating plan charges youthful drivers
3 times that of mature drivers. Furthermore, we analyzed loss (pure premium) relativities
by driver age group, and noticed that the youthful driver group has losses per exposure 4
times that of the mature driver group. But it does not follow that the youthful driver
rating factor should be raised to 4. This is because other variables used in the class plan
might be correlated with age group variable. For example, youthful drivers have more
accidents and violations; they are more likely to drive sports cars; they are more likely to
be unmarried, and so on. They are therefore likely to be surcharged along these other
dimensions of the rating plan. To give them a driver age rating factor of 4 would
possibly be to over-rate them.
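The sketch below makes this concrete with a purely hypothetical (invented) two-way array in which the relativity between youthful and mature drivers is exactly 3.0 within each driving-record class, yet the univariate pure premium relativity comes out higher because youthful drivers are over-represented in the surcharged class:

```python
# Hypothetical illustration: the relativity is 3.0 holding driving record fixed,
# but the univariate comparison of youthful vs. mature pure premiums overstates it.
pure_premium = {  # (age, record) -> pure premium per exposure
    ("mature", "clean"): 100.0, ("mature", "points"): 150.0,
    ("youth",  "clean"): 300.0, ("youth",  "points"): 450.0,
}
exposure = {
    ("mature", "clean"): 800.0, ("mature", "points"): 200.0,
    ("youth",  "clean"): 400.0, ("youth",  "points"): 600.0,
}

def average_pure_premium(age):
    cells = [key for key in exposure if key[0] == age]
    total_exposure = sum(exposure[k] for k in cells)
    return sum(exposure[k] * pure_premium[k] for k in cells) / total_exposure

univariate = average_pure_premium("youth") / average_pure_premium("mature")
within_class = pure_premium[("youth", "clean")] / pure_premium[("mature", "clean")]
print(round(univariate, 2), within_class)   # 3.55 vs. the "true" factor of 3.0
```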
This issue -- that non-orthogonal rating variables call for multivariate statistical analyses
-- lies at the heart of the debate over credit. In addition, this issue is perhaps the key
theme in the methodological development of classification ratemaking since the 1960's.
bias procedure, involves assuming a mathematical relationship between the rating factors
and pure premium.
The mathematics of minimum bias is pure algebra: Bailey and Simon derived their
models without positing statistical models. In his 1988 paper, Robert Brown [10] showed
that commonly used minimum bias formulas could be derived from statistical models via
maximum likelihood. Stephen Mildenhall's 1999 paper [11] is the most rigorous
examination to date of the statistical underpinnings of the minimum bias method. Thanks
to Brown, Mildenhall, and others [12, 13], it is now abundantly clear that Bailey-type
actuarial analyses are in fact special cases of Generalized Linear Models. Multi-
dimensional classification ratemaking projects should therefore be viewed as exercises in
multivariate statistical modeling.
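As a rough illustration of the method (not a reproduction of any of the cited calculations), the sketch below iterates Bailey's multiplicative minimum bias equations on a hypothetical two-way array of pure premiums; as Brown and Mildenhall showed, factors of this kind can equivalently be obtained from a corresponding GLM.

```python
# Bailey's multiplicative minimum bias iteration on hypothetical two-way data:
# observed pure premiums r[i, j] with exposure weights w[i, j].
import numpy as np

r = np.array([[100.0, 150.0],     # e.g., rows = driver age class, columns = record
              [300.0, 460.0]])
w = np.array([[800.0, 200.0],
              [400.0, 600.0]])

x = np.ones(r.shape[0])           # row factors
y = np.ones(r.shape[1])           # column factors

for _ in range(50):               # alternate the zero-bias equations until stable
    x = (w * r).sum(axis=1) / (w * y[np.newaxis, :]).sum(axis=1)
    y = (w * r).sum(axis=0) / (w * x[:, np.newaxis]).sum(axis=0)

print(np.round(x / x[0], 3))      # row relativities to the first class
print(np.round(y / y[0], 3))      # column relativities to the first class
```

In practice the factors are normalized to a chosen base class, and the same machinery can be run on premium-weighted loss ratios, as in the Bailey exhibits later in this paper.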
During the 1970's and 1980's, when classification ratemaking was undergoing its
methodological development, no major rating variables were introduced. This changed
in the late 1980's and 1990's when credit scores were introduced to personal lines
insurance [1].
Raw credit information is supplied by several major credit bureaus, including Choice
Point, TransUnion, and Experian. These companies collect individuals' credit data and in
turn sell this data in the form of credit reports. Credit reports contain a wealth of
information that can be grouped into four classifications:
• General information
• Trade line information
• Inquiries
• Public Records and Collections
The raw fields on these reports can be combined in many ways to create a plethora of
random variables. Examples include number of trades, months since oldest trade, amount
past due, trade line balance-to-limit ratio, number of inquiries, number of collections, and
number of lawsuits. Using various statistical techniques (such as multiple regression,
principal components analysis, clustering, Classification and Regression Trees) these
random variables can in turn be combined to create credit scores.
Using credit scores to segment risks is hardly a new idea. For many years the lending
industry has used such scores to underwrite loan applications. The Fair, Isaac Company
is a leading vendor of one such score, called the FICO score.
Linking credit scores to personal auto and homeowners profitability, however, was a new
idea when credit scores were introduced to the insurance industry approximately 15 years ago. A
typical credit score used in personal lines insurance might be calculated based on 10 to 30
variables. Conning's latest report [1] indicates that today more than 90% of insurance
companies use credit scores or credit information in one way or another.
As noted above, the growing use of credit scores in insurance underwriting and
ratemaking has garnered controversy along many fronts. We will set aside the political
and social aspects of the debate and focus on the more purely actuarial issue: do credit
scores really help explain insurance profitability? As we will discuss further, answering
this question in the affirmative involves more than simply demonstrating a correlation
between credit and loss ratio.
In the remainder of this paper, we will review the answers given to this question by the
Tillinghast [4] and Monaghan [5] studies, and then add our own perspective. But first, it
would be good to briefly discuss some general actuarial and statistical issues.
Loss (Pure Premium) Relativity vs. Loss Ratio (Profitability) Relativity: The distinction
between these concepts might not be clear to a non-actuarial audience, but it is absolutely
critical. Because premium reflects all of the components of a rating plan, a correlation
between a new variable (say, credit score) and loss ratio indicates the degree to which this
variable can explain losses not already explained by the existing rating plan. For
example, a critic might question the power of credit scores by claiming that credit is
correlated with driver age. Since driver age is already in the class plan, there is no need
to include credit as well. This argument would have some validity if it were in response
to a pure premium relativity analysis. However, it would have much less validity if the
relativity were based on loss ratios. Returning to the above example, the premium for
youthful drivers is already 3 times that of mature drivers. Therefore a correlation
between credit and loss ratio indicates the extent to which credit explains losses not
already explained by the youthful driver surcharge.
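A minimal numerical sketch of the distinction, using the youthful-driver figures from the example above (pure premium four times the mature level, premium already three times the mature rate):

```python
# Pure premium relativity vs. loss ratio relativity for the youthful-driver example.
mature_pure_premium, youthful_pure_premium = 1.0, 4.0
mature_rate, youthful_rate = 1.0, 3.0     # premium per exposure under the current plan

pp_relativity = youthful_pure_premium / mature_pure_premium                    # 4.0
lr_relativity = (youthful_pure_premium / youthful_rate) / (mature_pure_premium / mature_rate)

print(pp_relativity, round(lr_relativity, 2))   # 4.0 vs. 1.33
# The loss ratio relativity of 1.33 reflects only the loss variation that the
# existing 3x surcharge has not already priced for.
```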
Non-Independent Rating Variables: We believe that this is the key issue of the debate
over the explanatory power of credit score. Intuitively, two variables are independent when
knowing the value of one tells you nothing about the distribution of the other.
Non-independence is common in insurance data. For example, youthful
drivers have more accidents and violations than do mature drivers; mature drivers have
more cars on their policies than do youthful drivers; and the number of drivers is correlated
with the number of vehicles. We can therefore expect that credit score will exhibit dependencies
with other insurance variables, such as driver age, gender, rating territory, auto symbol,
and so on.
Univariate v. Multivariate Analyses: In the case of independent random variables,
univariate analyses of each variable are entirely sufficient -- a multivariate analysis
would add nothing in this case. Failure of independence, on the other hand, demands
multivariate analysis. Furthermore, the results of multivariate analyses can be surprising.
Below, we will give a hypothetical example in which an apparently strong relationship
between credit and loss disappears entirely in a multivariate context.
With these general remarks in hand, let us turn to the Tillinghast [4] and Monaghan [5]
studies.
Tillinghast's Study
Tillinghast's credit study was undertaken on behalf of the Fair, Isaac Company for use in
its discussions with the National Association of Insurance Commissioners (NAIC). The
purpose of the study was to establish a relationship between Insurance Bureau credit
scores and personal auto and homeowners insurance loss experience. Tillinghast received the following
information for each of nine personal lines insurance companies:
For the most part, the credit score intervals were constructed to contain roughly equal
amounts of premium. The results for these 9 companies are given in Exhibit 1.
Clearly, the information provided to Tillinghast only allowed for a univariate study, and
this is all Tillinghast set out to perform. Tillinghast's report displays tables containing
each interval's loss ratio relativity alongside the interval's midpoint. These numbers are
also displayed graphically. The report comments, "From simply viewing the graphs..., it
seems clear that higher loss ratio relativities are associated with lower Insurance Bureau
Scores."
No detailed information is provided on the data used, or about the 9 companies that
provided the data. Therefore we cannot comment on how credible the results are. The
loss ratio relativity curves are somewhat bumpy for certain of the 9 companies, and the
loss ratio spreads vary somewhat from company to company. But the patterns are clear
enough to strongly suggest that the relativity spreads are robust, and not merely
company-specific fluctuations in the data.
Furthermore, the relativities produced by credit are fairly large. The 10% of the
companies' books with the best credit have anywhere from -20% to -40% loss ratio
relativities. The worst 10% have relativities ranging from +30% to +75%. These loss
ratio spreads compare favorably with those resulting from traditional rating variables.
For example, based on our experience, about 20% to 30% of a standard auto book will
have point surcharges for accidents or violations. The average surcharge might range
from 15% to 40%. Therefore, the loss ratio spread indicated in the study is no smaller than
that produced by accident and violation point surcharges. In addition, the credit loss ratio spread can
largely support the rate differentials commonly seen in tier rating. Examples such
as this make it clear why insurers are embracing the use of credit scores.
However, univariate results like these must be interpreted with care. A classic illustration of
the danger, known as Simpson's paradox, comes from the 1973 Berkeley graduate admissions
data [16]. Suppose it was reported that 1100 men and 1100 women applied for admission to Berkeley in
1973. Of these people, 210 men were accepted for admission, while only 120 women
were accepted. Based on this data, 19% of the men were accepted, while only 11% of the
women were accepted. This is a univariate analysis (somewhat) analogous to
Tillinghast's, and it seems to prove decisively that there was serious gender bias in
Berkeley's 1973 graduate admissions.
But in fact this univariate analysis does not tell the whole story. When the admissions
were broken down by division (suppose for simplicity that there were only two divisions:
Arts & Sciences and Engineering) the data looked more like this:

                       Arts & Sciences              Engineering                    Total
            Applied  Accepted  Rate       Applied  Accepted  Rate       Applied  Accepted  Rate
Men             100        10   10%          1000       200   20%          1100       210   19%
Women          1000       100   10%           100        20   20%          1100       120   11%
Now our analysis is multivariate, by virtue of the fact that we are including division
applied to, in addition to gender. The multivariate analysis quite clearly shows that the
acceptance rate for men and women within each division was identical. But because a
greater proportion of women applied to the division with the lower admission rate (Arts
& Sciences), fewer women overall were accepted.
This is a very simple example of what can go wrong when one's data does not contain all
relevant variables: an apparent correlation between two variables can disappear when a
third variable is introduced.
In order to make the link to regression analysis, let us analyze this data at the un-grouped
level. The reader can reproduce the following results with a simple spreadsheet exercise.
Create 2200 data points with a {0,1}-valued target variable (ACCEPTED) and two {0,1}-
valued predictive variables (MALE, ENGINEERING). 1000 of the points are males who
applied to engineering {MALE=1, ENGINEERING=1}. For 200 of these points
ACCEPTED=1, for the remaining 800 ACCEPTED=0, and so on.
Beta t-statistic
Intercept .1091 10.1953
MALE .0818 5.0689
As expected, this univariate regression analysis indicates that gender is highly predictive
of acceptance into graduate school, and indeed it is: a greater proportion of males were
accepted! However this analysis is potentially misleading because it does not help
explain why males are accepted at a higher rate.
Beta t-statistic
Intercept .1 9.1485
MALE 0 0
ENGINEERING .1 3.8112
When the truly relevant variable is introduced, the spurious association between gender
and acceptance goes away (the beta and t-statistics for MALE are both 0). This multiple
regression approach on un-grouped data is illustrative of our data mining work involving
credit and other predictive variables.
(Of course logistic regression is usually a more appropriate way to model a binary target
variable such as application acceptance or auto claim incidence. But such an analysis
could not easily be replicated in a spreadsheet. Because ordinary multiple regression
gives the same results in this simple case, it is sufficient for our illustrative purpose.
Readers are encouraged to try logistic regression, from which precisely the same
conclusion will be reached.)
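For readers who prefer code to a spreadsheet, the sketch below reproduces the exercise with ordinary least squares in Python; the use of statsmodels is our choice, and the group counts follow the figures given in the text.

```python
# Rebuild the 2,200 un-grouped records and fit the two regressions described above.
import numpy as np
import statsmodels.api as sm

rows = []
# (MALE, ENGINEERING, number applied, number accepted)
for male, engineering, applied, accepted in [(1, 1, 1000, 200), (1, 0, 100, 10),
                                             (0, 1, 100, 20),   (0, 0, 1000, 100)]:
    for i in range(applied):
        rows.append((male, engineering, 1 if i < accepted else 0))

data = np.array(rows)
accepted = data[:, 2]

# Univariate regression of ACCEPTED on MALE: beta matches the .0818 shown above.
print(sm.OLS(accepted, sm.add_constant(data[:, [0]])).fit().params)

# Adding ENGINEERING drives the MALE coefficient (and its t-statistic) to zero.
print(sm.OLS(accepted, sm.add_constant(data[:, [0, 1]])).fit().params)
```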
Returning to the Tillinghast study, consider the following scenario: suppose our credit
variable has two levels (good/bad). Rather than academic division, suppose that the
"true" confounding variable is urban/rural (territory). Thus good/bad correspond to
male/female in the Berkeley example, and urban/rural corresponds to arts/engineering.
Rather than acceptance into school, the target variable is now having a personal auto
claim. Now our data is:
                        Exposures                   Claims                Claim Frequency
                 Rural   Urban   Total      Rural   Urban   Total      Rural   Urban   Total
Good credit       1000     100    1100        100      20     120        10%     20%     11%
Poor credit        100    1000    1100         10     200     210        10%     20%     19%
If we similarly re-label the terms of our regressions, we will again see that (in this purely
hypothetical example) the GOOD_CREDIT indicator loses its apparent significance once
the URBAN indicator is introduced.
Of course, given our discussion of the difference between a pure premium study and a
loss ratio study, it is not entirely fair to call the Tillinghast study "univariate". Recall
that Tillinghast's target variable was loss ratio relativity, not claim frequency. In the
above example, suppose all claims have a uniform size of $1000, and further suppose that
the territorial rates are $2000 for urban territories, and $1000 for rural territories. Now
the loss ratio relativity in each cell will be exactly 1.0. In this (again, purely
hypothetical) case, Tillinghast's methodology would (correctly) show no relationship
between credit and loss ratio relativity.
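A quick arithmetic check of that purely hypothetical case:

```python
# With a uniform $1,000 claim size and territorial rates of $2,000 (urban) and
# $1,000 (rural), every cell's loss ratio is 10%, so every relativity is 1.0.
cells = {  # (credit, territory): (exposures, claims)
    ("good", "rural"): (1000, 100), ("good", "urban"): (100, 20),
    ("poor", "rural"): (100, 10),   ("poor", "urban"): (1000, 200),
}
rate = {"rural": 1000.0, "urban": 2000.0}
claim_size = 1000.0

for (credit, territory), (exposures, claims) in cells.items():
    loss_ratio = claims * claim_size / (exposures * rate[territory])
    print(credit, territory, loss_ratio)   # 0.10 in every cell
```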
In other words, to the extent that all possible confounding variables are perfectly
accounted for in premium, Tillinghast's "univariate" analysis is implicitly a multivariate
analysis, and is therefore convincing. But realistically, this may not be the case. For
example, in our work we regularly regress loss ratio on such zip code-based variables as
population density and median population age. If territory were entirely accounted for in
premium, such variables would never appear statistically significant. But in fact they
sometimes do. Therefore a true multivariate study is desirable even if loss ratio is used as
the target variable.
Monaghan's Study
Monaghan's study for auto is based on three calendar years of data (1993-95). Each
record in his database contains premiums and losses accumulated over this entire three-
year period, so different records may reflect different lengths of exposure. Losses are evaluated
at 6/30/1995. For this reason, losses on different records might be evaluated at varying
states of maturity. Losses include reserves, salvage and subrogation recoveries, and
allocated loss adjustment expenses. The credit information used in this study was a
"snapshot view" taken at the policy inception date. Approximately 170,000 records were
used in the analysis. The total premium and loss in these records were $393 million and
$300 million, respectively.
The amount of data in Monaghan's study is very large. While we don't know all the
details about the data, the large amount of premium indicates that it is probably based on
a countrywide population. Our experience on auto data indicates that on average there
will be 150 to 400 claims per $1 million in premium, depending on the geographic
concentration, program type, and policy type (liability only vs. full coverage) represented
in the data. This suggests that there will be on the order of a hundred thousand claims in
Monaghan's study. According to actuarial credibility theory [17], Monaghan's data
should provide very credible results.
While not conclusive for the reasons given above, this part of Monaghan's study is
helpful in that it unpacks credit score into its component variables. The relationship
between credit score and loss ratio is not entirely the result of some mysterious or proprietary
interaction of the component credit variables. Rather, each of these component variables
is individually somewhat predictive of insurance losses. For the record, the results
Monaghan reports in this section are consistent with our experience working with credit
data.
Note that these univariate results -- as well as Monaghan's multivariates to be described
below -- are in terms of loss ratio relativity. Therefore, Monaghan's work (like the
Tillinghast study) indicates the degree to which credit is able to capture loss variation not
captured by the existing rating plan.
Monaghan displays several two-way tables showing loss ratio relativity by credit group
and an underwriting variable. The auto underwriting variables he displays in conjunction
with credit include past driving record, driver age, territory, and classical underwriting
profile. The last variable is a composite variable combining marital status, multicar,
homeowner, and clean driving record. (Monaghan supplies similar tables for
homeowners rating variables. We will not review the specifics of these tables here.)
In no case did Monaghan's inclusion of the rating factor cause the relationship of credit
with loss ratio to disappear (as in the Simpson illustration above). Indeed, Monaghan's
tables contain some very telling relationships. For example, the loss ratio relativity of
drivers with clean driving histories and poor credit was 1.36. In contrast the relativity for
drivers with good credit and poor driving records was only 0.70!
The results of the GLM analyses are striking, and they buttress Monaghan's claims. For
example, the multiplicative Bailey factors arising from the credit/driving record analysis
are 1.709, 1.339, 1.192, and 1.0 for credit groups A-D. These are quite close to the
univariate loss ratio relativities that can be calculated from Monaghan's data (1.757,
1.362, 1.204, 1.0). This is excellent confirmation that credit is largely uncorrelated with
driving record: the multiplicative Bailey factors are almost the same as the factors that
would arise from a univariate analysis!
Furthermore, the GLM parameter estimates are quite large relative to their standard
errors. Also, the Chi-squared statistics for the four credit groups are high, and the
associated p-values are very low. These observations add statistical rigor to the claim
that the loss ratio "lift" resulting from credit score is "real". These observations hold
equally well for the other two variables as well. Finally, performing an additive Bailey
analysis (normal/identity GLM - not shown) produces qualitatively similar results.
Monaghan reports that he produced such two-way tables for a large number of other
traditional underwriting characteristics. He says, "there were no variables that produced
even roughly uniform results across the credit characteristics."
For several years, we have applied data mining methodology and a range of predictive
modeling techniques to build insurance profitability and underwriting models for writers
of both commercial and personal lines insurance. Credit variables and credit scores are
typically included along with a comprehensive set of other traditional and non-traditional
insurance variables. Because of the truly multivariate context in which we employ credit
information, our findings lend further support to the conclusions reached in the
Tillinghast and Monaghan studies. For reasons of confidentiality, we are not at liberty to
share quantitative results in this paper. However, we shall describe our methodology and
modeling results in a qualitative way.
By the end of this process, literally hundreds of predictive variables will have been
created from the internal and external data sources. The goal is to create upfront as many
variables as possible that might be related to insurance loss and profitability. These
variables represent a wide range of characteristics about each policyholder.
Typically we design our analysis files in such a way that each data record is at a policy-
term level. For example, personal auto policies usually have a six-month term. If a
policy has two years of experience in our study, we will generate four 6-month term data
points in the study. This design, which is different from that of Monaghan's study, will
give each record equal weight for the term in the analysis process. All of the predictive
variables, including the credit variables, are evaluated as of the beginning of the term-
effective date.
Target variables, including loss ratio, frequency, and severity, are created in parallel with
the predictive variables. Losses are usually evaluated a fixed number of months from the
term effective date. The reason for this is to minimize any chance of bias appearing in
the target variables due to varying loss maturities. In addition, we will incorporate
various actuarial techniques that we deem necessary to adjust the target information.
Such adjustments include loss trending, premium on-leveling, re-rating, loss capping, cat
loss exclusion, and so on.
Once the generation of target and predictive variables has been accomplished, we will
merge all the information together to produce a policy-term level database. This database
contains all of the predictive variables, as well as such target information as claim
frequency, claim severity, loss ratio, capped loss ratio, and so forth. The database is then
used to produce univariate reports showing the relationship of each predictive variable
with the target information. This is essentially a collection of reports containing one
Tillinghast-type study for each of the hundreds of predictive variables. This database is a
useful exploratory data analysis (EDA) prelude to the multivariate modeling phase of our
projects.
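A minimal sketch of one such univariate report is shown below; the column names, banding choice, and use of pandas are illustrative assumptions, not a description of the authors' actual system.

```python
# One "Tillinghast-type" report: loss ratio relativity and frequency by band of a
# single predictive variable, computed from a policy-term level DataFrame.
import pandas as pd

def univariate_report(policies: pd.DataFrame, variable: str, n_bands: int = 10) -> pd.DataFrame:
    banded = policies.assign(band=pd.qcut(policies[variable], q=n_bands, duplicates="drop"))
    report = banded.groupby("band").agg(premium=("premium", "sum"),
                                        losses=("incurred_loss", "sum"),
                                        claims=("claim_count", "sum"))
    report["loss_ratio"] = report["losses"] / report["premium"]
    report["lr_relativity"] = report["loss_ratio"] / (report["losses"].sum() / report["premium"].sum())
    report["frequency"] = report["claims"] / report["premium"]
    return report

# e.g., univariate_report(policy_term_data, "credit_score")
```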
This database of univariate results also provides invaluable information for multivariate
modeling regarding (1) whether to discard the variable right away because it has little or no
variation or because there is some business or other reason to do so; (2) how to cap the
variable either above or below; (3) what to do with missing values; and (4) whether to
treat the variable as a continuous or categorical random variable. Other needed
transformations might be suggested by this univariate study.
Once the Exploratory Data Analysis stage is completed, we are ready to begin the
modeling process. The first sub-phase of this process is to search for an optimal multiple
regression model. Criteria used to judge "optimality" include (but are not limited to)
strong t-statistics, parameter estimates that agree with business intuition, and not
overfitting data used to estimate the parameters. This model serves as a useful
benchmark for comparison purposes. In addition, the parameter estimates, and the t- and
F-statistics generated by regression models are useful for such interpretive issues as the
topic of this paper.
Once the optimal regression model has been selected, we turn to more advanced model
building techniques such as Neural Networks [18-20], Generalized Linear Models [8-13],
Classification and Regression Trees (CART) [21] and Multivariate Adaptive Regression
Splines (MARS) [22]. These more advanced techniques can potentially provide more
accurate predictions than a multiple regression model, but this additional predictive
power often comes at a cost: more complex models can be harder to interpret and explain
to underwriters, upper management, and insurance regulators.
We use a train/test methodology to build and evaluate models. This means that the
modeling dataset is randomly divided into two samples, called the training and test
samples. A number of models are fit on the training sample, and these models are used to
"score" the test sample. The test sample therefore contains both the actual loss ratio (or
any other target variable) as well as the predicted loss ratio, despite the fact that it was
not used to fit the model. The policies in the test sample are then sorted by the score, and
then broken into (for example) ten equal-sized pieces, called deciles. Loss ratio,
frequency, and capped loss ratio are computed for each decile. These numbers constitute
lift curves. A model with a low loss ratio for the "best" decile and a very high loss ratio
for the "worst" decile is said to have "large lift". We believe that the lift curves are as
meaningful for measuring the business value of models as such traditional statistical
measures as mean absolute deviation or R². The purpose of setting aside a test set for
model evaluation is to avoid "overfit". (Of course a lift curve can also be computed on
the training dataset. Naturally, this lift will be unrealistically large.) A third sample,
called a validation sample, sometimes will also be set aside to produce an unbiased
estimate of the future performance of the final selected model.
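The sketch below outlines the train/test split and decile lift calculation; the column names, the 50/50 split, and the use of a plain linear model from scikit-learn are illustrative assumptions rather than the authors' production methodology.

```python
# Train/test split, scoring of the held-out sample, and a decile lift curve.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

def decile_lift(data: pd.DataFrame, predictors: list) -> pd.DataFrame:
    train, test = train_test_split(data, test_size=0.5, random_state=0)

    model = LinearRegression().fit(train[predictors], train["loss_ratio"])
    test = test.assign(score=model.predict(test[predictors]))

    # Sort the held-out policies by model score and cut them into ten equal pieces.
    test["decile"] = pd.qcut(test["score"].rank(method="first"), 10, labels=range(1, 11))

    lift = test.groupby("decile").apply(
        lambda g: pd.Series({"premium": g["premium"].sum(),
                             "actual_loss_ratio": g["incurred_loss"].sum() / g["premium"].sum()}))
    return lift   # low loss ratio in the "best" decile, high in the "worst" -> "large lift"
```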
We have performed several large data mining projects that included credit variables and
credit scores. Similar to the Tillinghast study and Monaghan's study, we have studied
data from various sources, different distribution channels, and different geographic
concentrations. Our studies are very large in size, similar to Monaghan's study, usually
with several hundred thousand data points that contain a total of hundreds of millions of
dollars of premium. Our approach is tailored to the use of large datasets, the use of
train/test methodology, the use of lift curves to evaluate models, and the exploratory use
of a variety of modeling techniques. These are all hallmarks of the data mining approach
to statistical problems. We believe that our analyses are true multivariate analyses that
yield very robust and credible results. It is precisely this kind of analysis that makes it
possible to decisively answer the question: does credit really help explain insurance
losses and profitability?
First, through our univariate databases we note that composite credit score and many of
its associated credit variables invariably show strong univariate relationships with
frequency, severity, and loss ratio. Our univariate experience is entirely consistent with
that of Tillinghast and Monaghan.
Turning to our multivariate modeling work, the estimates and statistics coming from our
multiple regression models are useful for evaluating the importance of credit relative to
the other variables considered in our model building process. Several points are worth
making. First, credit variables consistently show up as among the most important
variables at each step of the modeling process. As noted by Tillinghast and Monaghan,
they dependably show strong univariate relationships with loss ratio. Furthermore, they
are typically among the first variables to come out of a stepwise regression analysis.
Second, the parameter estimates for credit variables are consistently among the strongest
of the parameters in our regression models. As illustrated in the Simpson's paradox
example, credit score would have a small beta estimate and t-statistic were it a mere
proxy for another variables or some combination of other variables. But this is not the
case. Rather, we have repeatedly seen that credit adds predictive power even in the
presence of a comprehensive universe of traditional and non-traditional predictive
variables, all used in conjunction with one another, on a large dataset.
We are basing our conclusion in part on the t-statistics of the credit variables in our
underwriting/pricing regression models. To this one might object: "but one of the
assumptions of regression analysis is a normally distributed target variable. It is obvious
that loss ratio is not normally distributed, therefore your t-statistics are meaningless." In
response, it is true that loss ratios are not normally distributed. Nevertheless, the models
we build using regression analysis reliably produce strong lift curves on test and
validation data. Therefore, our models do "work" (in the sense of making useful
predictions) in spite of the lack of normality.
It is also true that because of the lack of normality, we cannot use our models' t-statistics
to set up traditional hypothesis tests. But neither our analyses nor our conclusions are
based on hypothesis tests. We interpret t-statistics as measures of the relative importance
of the variables in a model. Consider ranking the variables in a regression model by the
absolute value of their t-statistics. The resulting order of the variables is the same as the
order that would result from ranking the variables by their marginal contribution to the
model's R² (in other words, the additional R² that is produced by adding the variable after
all of the other variables have been included in the model). This interpretation of t-statistics
does not depend on the normality assumption.
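That equivalence is easy to check empirically; the sketch below, on simulated data (statsmodels is our choice of tool), ranks predictors by |t| in the full model and by their marginal contribution to R², and the two orderings agree.

```python
# Ranking by |t| in the full OLS model matches ranking by marginal R^2 contribution.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, k = 5000, 5
X = rng.normal(size=(n, k))
y = X @ np.array([0.5, 0.2, -0.3, 0.05, 0.0]) + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(X)).fit()
t_rank = np.argsort(-np.abs(full.tvalues[1:]))        # skip the intercept

marginal_r2 = []
for j in range(k):
    reduced = sm.OLS(y, sm.add_constant(np.delete(X, j, axis=1))).fit()
    marginal_r2.append(full.rsquared - reduced.rsquared)
r2_rank = np.argsort(-np.array(marginal_r2))

print(t_rank, r2_rank)                                # identical orderings
```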
• Our models effectively predict insurance losses. The evidence for this is repeated,
unambiguous empirical observations: these models dependably distinguish
profitable from unprofitable policies on out-of-sample data. In other words, they
produce strong lift curves on test and validation datasets.
• Furthermore, credit variables are among the more important variables in these
models. This is evidenced by the following observations: (i) the univariate
relationship between credit and loss ratio is as strong or stronger than that of the
other variables in the model. (ii) Credit variables reliably appear in a stepwise
regression performed using all of the available variables. (iii) Credit variables
typically have among the largest t-statistics of any of the variables in the model.
• Supporting the above observations, removing the credit variable(s) from a model
generally results in a somewhat dampened lift curve.
The implication of the above two bullets is that credit variables add measurable
and non-redundant predictive power to the other variables in the model.
Therefore, we believe that the observed correlation between credit and loss ratio
cannot be explained away as a multivariate effect that would go away with the
addition of other available variables.
Furthermore, this is true not just of the final selected regression model, but of most or all
of the models produced along the way. In addition, we have noticed that this result applies
across different lines of insurance, both personal and commercial. For this
reason, we feel comfortable saying that credit bears an unambiguous relationship to
insurance loss, and is not a mere proxy for other available kinds of information.
It is beyond the scope of this paper to comment on the societal fairness of using credit for
insurance pricing and underwriting. From a statistical and actuarial point of view, it
seems to us that the matter is settled: credit does bear a real relationship to insurance
losses.
Our experience does indicate that credit score is a powerful variable when it is used alone
for a standard rating plan. In addition, our large-scale data mining results suggest that
just about any model developed to predict insurance profitability will be somewhat
stronger with credit than without credit. Typically credit score, when added to an
existing set of non-credit predictive variables, will be associated with a relatively large
beta estimate and t-statistic. Consistent with this, the resulting model will have higher
"lift" than its counterpart without credit.
The results we have described might create an impression that credit variables are an
essential part of any insurance predictive modeling project. But this would be an
exaggeration. Our experience also shows that pricing and underwriting models created
without credit variables can still be extremely good. The key to building a non-credit
predictive model is to fully utilize as many available internal data sources as possible,
incorporate other types of external information, use large amounts of data, and apply
multivariate modeling methodologies. Given all the regulatory and public policy issues
surrounding insurers' use of credit, such non-credit models provide the insurance industry
with a valuable alternative to using credit scores for pricing and underwriting.
Our data mining projects are multivariate predictive modeling projects that involve
hundreds of variables being used to analyze many thousands of records. Many of these
variables are credit variables, which play an important role even in this broad context.
Our experience using credit scores and credit variables in a truly multivariate statistical
setting has allowed us to add a new perspective to the debate over credit.
The use of credit in insurance underwriting and ratemaking might seem like a rather
specialized topic. But we believe the issue reflects two important trends in the
development of actuarial science. First, credit scores come from a non-traditional data
source. The advent of the Internet makes it likely that other new data sources will
become relevant to actuarial practice. Credit information is probably just the beginning.
The second issue is the increasingly multivariate nature of actuarial work. Credit scores
themselves are inherently "multivariate" creatures in that they are composites built from
several underlying credit variables. In addition, recall that we have reviewed and
discussed three ways of studying the relationship between credit scores and insurance
losses and profitability. Each study has been progressively more multivariate than its
predecessor. This reflects the methodological development of classification ratemaking
from univariate to multivariate statistical analyses (Generalized Linear Modeling).
In our opinion, the adoption of modern data mining and predictive modeling
methodologies in actuarial practice is the next logical step in this development. Bailey's
minimum bias method might seem like actuarial science's in-house answer to
multivariate statistics. On the contrary, Mildenhall's paper makes it clear that
conceptually, nothing separates minimum bias from work done by mainstream
statisticians in any number of other contexts. But why stop at Generalized Linear
Modeling?
We live in an information age. The availability of new data sources and cheap computing
power, together with the recent innovations in predictive modeling techniques allow
actuaries to analyze data in ways that were unimaginable a generation ago. To
paraphrase a famous logician, actuaries inhabit "a paradise of data". This, together with
our insurance savvy and inherently multivariate perspective, puts us in an excellent
position to benefit from the data mining revolution.
Given the success of credit scores and predictive modeling, we expect actuaries to be
enlisted to push this type of work even further. Here are examples of future questions we
anticipate being asked of actuaries:
• Are we currently getting the most predictive power out of the internal and
external information/data sources that we are currently using? Are we really
analyzing data in a rigorous multivariate fashion?
• What other powerful variables and data sources are "out there" that we are not
aware of? How do we go beyond credit?
• Are there other ways insurance companies (and indeed other kinds of companies)
can leverage predictive modeling? For example, predictive modeling has a
proven record of success in such applications as target marketing, customer
retention/defection analysis, predicting cross-sales, customer profiling, and
customer lifetime value. These are all important projects at which actuaries can
excel. Furthermore, they are not insurance-specific. An actuary with expertise in
these areas could transfer his or her skills to other industries.
To conclude, our multivariate predictive modeling work supports the widely held belief
that credit scores help explain insurance losses, and that they go beyond other sources of
information available to insurers. However it is unclear to what extent insurers will be
permitted to use credit for future pricing and underwriting. For this reason, insurers
might want to consider non-credit scoring models as an alternative to traditional credit
scores. For actuaries, the use of credit scores and predictive modeling is the beginning of
a new era in insurance pricing and underwriting.
References
10. Brown, R. L., "Minimum Bias with Generalized Linear Models", Proceedings of
Casualty Actuarial Society, Vol. LXXV, Casualty Actuarial Society, (1988).
11. Mildenhall, S.J., "A Systematic Relationship Between Minimum Bias and
Generalized Linear Models," Proceedings of Casualty Actuarial Society, Vol.
LXXXVI, Casualty Actuarial Society, (1999).
12. Holler, K.D., Sommer, D.B.; and Trahair, G., "Something Old, Something New
in Classification Ratemaking with a Novel Use of GLMS for Credit Insurance,"
CAS Forum, Casualty Actuarial Society, (1999).
13. Brockman, M. J., Wright, T. S., "Statistical Motor Rating: Making Effective Use
of Your Data," Journal of the Institute of Actuaries, Vol. 119, Part III.
15. Pearson, K., Lee, A., Bramley-Moore, L., "Genetic (Reproductive) Selection:
Inheritance of Fertility in Man", Philosophical Transactions of the Royal Society
A, 73: 534-539, (1899).
16. Bickel, P. J., Hammel, E. A., O'Connell, J. W., "Sex Bias in Graduate Admissions:
Data from Berkeley", Science, (1975).
17. Mahler, H.C., and Dean, C.G., "Credibility", Foundations of Casualty Actuarial
Science, Casualty Actuarial Society, Chapter 8, (2001).
18. Francis, L., "Neural Networks Demystified," CAS Forum, Casualty Actuarial
Society, (2001)
19. Wu, C. P., "Artificial Neural Networks - the Next Generation of Regression"
Actuarial Review, Casualty Actuarial Society, Vol. 22, No. 4, (1995).
20. Zizzamia, F., Wu, C. P., "Driven by Data: Making Sense Out of Neural
Networks", Contingencies, American Academy of Actuaries, May/June, (1998).
21. Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J., "Classification and
Regression Trees", Monterey: Wadsworth and Brooks/Cole, (1984).
Exhibit 1
Tillinghast-NAIC Study of Credit Score [4]

Each of the nine company panels shows: Score Interval, Midpoint, Earned Premium, Loss Ratio Relativity.

813 or more   850.0  10.2%  0.657   |   840 or more   854.0  10.0%  0.607   |   826 or more   845.0  10.0%  0.723
768-812       790.0   9.9%  0.584   |   823-839       831.0  10.0%  0.813   |   803-826       814.5  10.0%  0.903
732-767       749.5  11.0%  0.692   |   806-822       814.0  10.0%  0.626   |   782-803       792.5  10.0%  0.895
701-731       716.0  10.0%  0.683   |   789-805       797.0  10.0%  1.342   |   759-782       770.5  10.0%  0.795
675-700       687.5  10.4%  1.184   |   771-788       779.5  10.0%  1.059   |   737-759       748.0  10.0%  1.073
651-674       662.5   9.8%  0.793   |   748-770       759.0  10.0%  1.019   |   710-737       723.5  10.0%  0.941
626-650       638.0   9.9%  1.332   |   721-747       734.0  10.0%  1.322   |   680-710       695.0  10.0%  0.912
601-625       613.0  10.0%  1.280   |   686-720       703.0  10.0%  0.810   |   640-680       660.0  10.0%  1.115
560-600       580.0   9.4%  1.214   |   635-685       660.0  10.0%  0.986   |   583-640       611.5  10.0%  1.221
559 or less   525.0   8.6%  1.752   |   635 or less   592.0   9.9%  1.417   |   583 or less   535.0  10.0%  1.421

832 or more   859.0  10.0%  0.672   |   845 or more   857.0  10.0%  0.800   |   810 and up    837.5  19.7%  0.656
803-832       817.5  10.0%  1.027   |   830-845       837.5  10.0%  0.919   |   765-809       777.0  20.1%  0.795
767-803       785.0  10.0%  0.823   |   814-830       822.0  10.0%  0.740   |   715-764       739.5  20.8%  0.911
739-767       753.0  10.0%  1.036   |   798-814       806.0  10.0%  0.733   |   645-714       679.5  20.2%  1.066
720-739       729.5  10.0%  0.775   |   779-798       788.5  10.0%  0.855   |   Below 645     6~.0   19.2%  1.593
691-720       705.5  10.0%  1.000   |   757-779       768.0  10.0%  0.889   |
668-691       679.5  10.0%  1.041   |   730-757       743.5  10.0%  0.993   |
637-668       652.5  10.0%  1.023   |   695-730       712.5  10.0%  1.143   |
602-637       619.5  10.0%  1.251   |   643-695       669.0  10.0%  1.300   |
602 or less   571.0  10.0%  0.135   |   643 or less   600.0  10.0%  1.628   |

750 and up    795.0  21.3%  0.783   |   755 or more   775.0   8.9%  0.767   |   780 and up    815.0  16.8%  0.637
685-749       717.0  25.8%  0.900   |   732-754       743.0   9.3%  0.798   |   745-779       762.0  13.7%  0.715
630-684       657.0  19.6%  1.083   |   714-731       722.5   9.6%  0.859   |   710-744       727.0  13.9%  0.734
560-629       594.5  19.3%  1.150   |   698-713       705.5   9.9%  0.969   |   670-709       689.5  15.0%  0.807
Below 560     520.0  13.9%  1.200   |   682-697       689.5  10.3%  0.922   |   635-669       652.0  12.1%  0.909
                                    |   666-681       673.5   9.7%  0.978   |   590-634       612.0  11.2%  1.241
                                    |   647-665       656.0  10.5%  1.070   |   530-589       559.5   9.8%  1.357
                                    |   625-646       635.5  10.2%  1.107   |   Below 530     495.0   7.5%  2.533
                                    |   592-624       608.0  10.7%  1.122   |
                                    |   591 or less   562.0  10.8%  1.324   |
Exhibit 2
Bailey Analysis of Monaghan's Two-Way Study
Credit Score vs. Driving Record
                           Credit Group A    Credit Group B    Credit Group C    Credit Group D        Overall          Bailey
Prior Driving Record        Prem    LR        Prem    LR        Prem    LR        Prem    LR        Prem     LR       LR Rel   Factor
No incidents                28.4    93%       66.0    71%       30.70   64%       45.80   53%      170.90   68.6%     1.000    1.000
1 minor                      8.0    94%       17.3    68%        7.50   68%        8.40   50%       41.20   69.4%     1.012    0.987
1 at-fault accident          3.7   101%        7.7    74%        4.10   68%        5.90   65%       21.40   75.0%     1.094    1.096
1 non-fault accident         6.6   109%       14.8    81%        7.30   70%        9.90   70%       38.60   80.9%     1.180    1.176
2 minors                     2.5    86%        6.0    59%        1.90   41%        2.40   43%       12.80   58.6%     0.855    0.827
2 incidents (any)            6.5   108%       13.5    96%        6.60   82%        7.90   64%       34.50   88.3%     1.287    1.268
All other (> 2 incid.)      18.6   114%       33.7    95%       10.80   83%       11.50   66%       74.60   93.5%     1.364    1.289
* Because the log link function was used, the GLM parameter estimate must be exponentiated
Exhibit 3
Bailey Analysis of Monaghan's Two-Way Study
Credit Score vs. Driver Age
                           Credit Group A    Credit Group B    Credit Group C    Credit Group D        Overall          Bailey
Age of Driver               Prem    LR        Prem    LR        Prem    LR        Prem    LR        Prem     LR       LR Rel   Factor
<25                          3.8   121%       23.6    75%        1.40   51%        1.90   53%       30.70   78.2%     1.000    1.000
25-34                       21.1   103%       55.8    79%       22.60   66%        8.90   63%      108.40   79.6%     1.018    1.023
35-39                       13.0   100%       21.8    81%       12.90   65%       13.00   54%       60.70   75.9%     0.970    1.007
40-44                       12.4   109%       18.5    82%       10.40   76%       15.60   52%       56.90   78.6%     1.004    1.055
45-49                        9.8    93%       14.6    83%        8.20   76%       14.80   58%       47.40   76.1%     0.972    1.036
50-59                        9.2    97%       14.4    78%        7.90   68%       16.50   53%       48.00   71.4%     0.913    0.985
60+                          3.8   110%        8.3    75%        4.90   81%       20.00   67%       37.00   75.1%     0.959    1.129
* Because the log link function was used, the GLM parameter estimate must be exponentiated
Exhibit 4
Bailey Analysis of Monaghan's Two-Way Study
Credit Score vs. Classical Underwriting Profile
* Because the log link function was used, the GLM parameter estimate must be exponentiated
Credit & Surety Pricing and the Effects of Financial Market Convergence
By: Athula Alwis, ACAS, American Re and Chris Steinbach, FCAS, Swiss Re
Abstract:
This paper describes how the convergence of the insurance and financial markets is affecting
Credit & Surety insurance. It explains why prior experience has become an unreliable
measure of exposure and how this paradigm shift affects the pricing of Credit & Surety
products. It proposes a new exposure based method for analyzing Credit & Surety that
combines the best practices of insurance and financial market pricing theory. Discussions
about its implementation as well as sample calculations for both primary and reinsurance
pricing are included. This paper also discusses the new breed of Commercial Surety bonds
that have been recently developed to compete with traditional financial products. Finally, the
paper addresses the need for better and more sophisticated risk management techniques for
the industry.
1 Introduction
There is a revolution occurring in Credit & Surety. The convergence of the insurance and
financial markets is resulting in dramatic changes to these insurance products. There has
been an explosion of new forms and some new coverages as insurers attempt to compete
with financial institutions. There is new creativity to coverage structures as insurers rethink
traditional practices and their applicability in today's environment. Increased competition by
financial institutions for business that was traditionally considered insurance is the end result.
All of these changes present new opportunities and new risks for the industry. The final
outcome must be a revolution in our practices, which affects the actuarial profession in two
ways.
First, as our products become increasingly sophisticated, our risk management practices must
keep pace. We cannot rely on naïve diversification as much as we have in the past. This
became apparent in the past year as unprecedented credit events educated us as to the true
nature of our exposures and the weaknesses of our current risk management systems. Most
Credit & Surety insurers have since made a concerted effort to improve their credit risk
management systems to suit the new environment.
Second, convergence has resulted in competition between the insurance and financial
industries, creating arbitrage opportunities between insurance and financial markets pricing
theories. Insurance and financial markets pricing theories are very different and can produce
completely different results for the same risk. Recent experience has shown that insurers,
more often than not, are the losers when arbitrage occurs. Many insurers have witnessed
entire segments of their portfolio perform poorly, particularly with regards to their new
products. This has caused some Credit & Surety insurers to reconsider what they write and
how, and for others to reconsider whether they want to be in this business at all.
The challenges that actuaries currently face in both risk evaluation and risk management are
problems that the financial markets have already conquered. So, financial markets theory is
the natural place for actuaries to turn for solutions. Over the past few years, financial markets
theory has been finding its way into Credit & Surety insurers and reinsurers alike. This paper
describes the financial market theories that can be applied to Credit and Surety, the benefits
they bring, and how they could be implemented.
Surety is unique in the insurance industry in that it is the only three-party insurance
instrument. It is a performance obligation, meaning it is a joint undertaking between the
principal and the surety to fulfill the performance of a contractual obligation. The principal is
primarily responsible for the obligation and the surety guarantees fulfillment. If the principal
fails to fulfill the obligation, then the surety steps into the shoes of the principal to complete the
obligation. Surety obligations are divided into two general categories. Contract Surety
guarantees the completion of a construction project, such as a road or building. Contract
Surety is the largest segment of the Surety market because all government construction must
be bonded. All other Surety products are called Commercial or Miscellaneous Surety. This
covers a wide assortment of obligations, such as Bail bonds, the delivery of natural gas paid
for in advance, the environmental reclamation of a strip mine, or the proper administration of a
self-insured Worker's Compensation program. This is a smaller, but rapidly growing, segment
of the Surety market.
Credit Insurance is a demand obligation, meaning it indemnifies the insured for un-collectable
receivables if there is default. It is commonly used in retail, since many stores do not pay for
the merchandise on their shelves until they themselves sell it to consumers. Another common
use is in shipping, since merchandise is typically not paid for until it is delivered and inspected.
Note that the majority of the Credit Insurance market is outside the United States. However,
the US credit insurance market is growing rapidly. Inside the United States, companies
typically use banking products such as loans and letters of credit instead of credit insurance.
Financial Guarantee Insurance is a demand obligation that consists of two distinct categories.
The first involves policies that insure against defaults of financial obligations, such as
insurance guaranteeing payment of the principal and interest of a municipal bond. The
second involves insurance against certain fluctuations of the financial markets, such as
insurance ensuring a minimum performance for an investment. Note that in many states, this
second category is not permissible because it lacks a valid insurable interest. Financial
Guarantee insurance was regulated in the 1980s because of New York State's concerns that
the line's rising popularity and enormous policy limits could result in insurer defaults that would
swamp the state's insolvency fund. So, New York regulated the line, requiring that Financial
Guarantee writers be well-capitalized mono-line insurers that are not eligible for insolvency
fund protection. Today, New York's strict regulations effectively control how Financial
Guarantees are written. However, the other states and the rating agencies also work to exert
their influence over the line, significantly complicating the regulatory landscape.
Closely related to Financial Guarantees is a collection of minor lines that often are treated
separately. These include Residual Value, Mortgage Guarantee, Credit Unemployment,
Student Loan Guarantees and many life insurance schemes. These are often considered
separate from Financial Guarantee simply because they were already regulated when the
Financial Guarantee regulations were written. However, it pays to do research when working
in these lines because Financial Guarantee regulations are still evolving and different states
have different opinions.
Credit Derivatives are financial instruments that pay when default occurs, whether or not the
default results in a loss. Credit Derivatives are financial products, and as such do not require
that a valid insurable interest exist. The most common Credit Derivative pays out on default
the notional amount of a bond in exchange for receipt of the actual bond, so the loss is the
difference between the notional of the bond and the market value of the underlying security.
Credit Derivatives can be quite complex. They do not require underlying securities, so they
are ideal hedging instruments for credit insurance risk. They also can be constructed to have
additional triggers, such as a rise in the price of gas or a fluctuation in currency rates.
Traditionally, Surety Bonds, Credit Insurance, Financial Guarantees and Credit Derivatives
have been distinct products. In the past few years, the boundaries between these products
have blurred considerably. They are now part of a continuous spectrum of products that
insure financially related obligations. They start with Surety, where the insurer is entitled to be
very active in managing the insured risk, and end with Credit Derivatives, where the insurer is
entirely passive. This blurring has enabled products from different financial sectors to
compete with each other. It also permits insurers to tailor products with varying degrees of
insurer supervision, fiduciary duty, and regulatory control.
The biggest development in the past few years has been the explosion of the Commercial
Surety market. Commercial Surety products traditionally have been relatively simple bonds
with modest limits, such as Bail bonds and License & Permit bonds. But, recently they have
evolved into sophisticated financial products with complex triggers and limits of hundreds of
millions of dollars. One area that has generated increasing activity in recent years has been
the use of Commercial Surety to mimic other types of financial instruments. In many cases,
the Commercial Surety obligations tread very close to Financial Guarantees as defined by the
New York State Department of Insurance. For this reason, the insurance industry has begun
to call them "Synthetic Financial Guarantees."
The simplest financial market application of Commercial Surety involves using bonds to
secretly credit enhance financial products. For example, banks frequently are involved in the
short-term leasing industry. Banks typically securitize their lease portfolios, paying hefty rates
if the portfolio contains many poor or mediocre credits. But, in this scenario, the bank requires
the lessors to purchase Surety bonds that guarantee that all lease payments will be made.
This enhances the credit quality of the portfolio, dramatically reducing the risk of loss. The
bank now pays significantly lower rates for the securitization. This ultimately is cheaper for the
lessees because insurers charge less than the capital markets to assume this risk, allegedly
because insurers are able to wield influence over the risk. The insurance products would
appear to be standard Lease Bonds, except for the fact that the ulterior motive is to credit
enhance a financial instrument.
The applications can get significantly more complicated. But, part of the additional
complication is due to the fact that knowledge of the intricacies of insurance law is very
important. Whether the policy is enforceable in the manner for which it is intended depends
on much more than just the wording of the policy. For example, suppose Eastern Power
Company sells one year of electricity to Western Power Company for $100mm, to be paid in
advance on January 1, 2003. Simultaneously, Western Power sells the same one year of
electricity to Eastern Power for $105mm, to be paid in arrears on January 1, 2004. The two
contracts cancel each other out, resulting in no effect other than the difference in payment
terms. Western Power also purchases a Surety bond that guarantees that Eastern Power will
pay the $105mm they owe. What this deal effectively reduces to is that Western Power has
loaned Eastern Power $100mm at 5% and convinced an insurer to take the default risk. Here,
a lack of full disclosure to the insurer would be very material in the event of a loss. Surety is
a performance obligation and the Surety could argue that the performance of the underlying
obligation was never intended, so they owe nothing.
Commercial Surety has dramatically grown in popularity. Commercial Surety is now being
used to secure letters of credit, to secure bank lines, and to enhance credit. Favorable historic
loss ratios and limited ability to grow other parts of the book have created the incentive for
most major Surety writers to grow their Commercial Surety books. Furthermore, clients have
flocked to Commercial Surety because it offers straightforward financial protection in a
favorable regulatory environment. The ability of the market to arbitrage the various rating
methodologies has also been a key factor.
The ability to use Commercial Surety for arbitrage purposes has revealed striking differences
in how the various markets price their products. For example, it is not unusual for insurers to
see Commercial Surety bonds that sell for a quarter of what Credit Derivatives sell for, with
nearly identical terms. The problem that causes this discrepancy is that insurers generally do
not differentiate risks as well as the financial markets do. Insurance pricing focuses primarily
on making sure that the overall rate is adequate while financial market pricing focuses more
on risk differentiation. This difference can be best demonstrated via the data each industry
uses for establishing rate relativities. Insurance rate relativities are generally based on the
company's own limited experience while financial market rate relativities are based on long
periods of rating agency (industry) data. The larger data set enables the financial industry to
calculate relativities that have a greater resolution than what the insurance industry calculates,
creating the arbitrage opportunity. The result is that insurers typically overprice short-term/good
credits and underprice long-term/poor credits relative to the financial markets.
Recent events, particularly those involving the use of Commercial Surety bonds to mimic
Financial Guarantees, have gone a long way toward dulling the popularity of these new
products for both insurer and insured alike. Several insurance companies are currently
addressing severe anti-selection problems in their portfolio. Several others are in court
dealing with the fact that insurers and financial institutions have different customs and
practices for their products - a fact that has significantly confused their customers. Insurers
have reacted to these problems by either pulling out of the Credit & Surety market entirely or
by significantly curtailing their Commercial Surety writings. But, these problems have not
killed the demand for Commercial Surety bonds, and the reduced supply probably will not last
long.
Credit risk management practices of the financial markets have always been more advanced
than those of the insurance industry. Until recently, insurance credit risk management has been
largely limited to purchasing reinsurance and managing the book to a targeted loss ratio. In
contrast, credit risk management in the financial markets is a wide collection of tools. The
similarity of Credit & Surety insurance to financial markets products permits the insurance
industry to borrow financial market's risk management techniques. Several techniques
transfer particularly well.
The first requirement is to make sure that the portfolio is not excessively exposed to any single
credit event. The most cost-effective way to do this is by implementing concentration limits by
counter party, industry sector and country. These concentration limits must take into
consideration the quality of the credit risks. Since poorer credits have higher frequencies and
severities during economic downturns, the concentration limits should be lower
for poorer quality risks. In this way it is possible to keep the expected loss for an event
relatively constant throughout the portfolio. It is also important to take correlation into
consideration when establishing concentration limits. Setting all of the concentration limits
lower than an independence analysis would suggest, or establishing a tiered system of
concentration limits can achieve this.
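To make the idea concrete, the following is a minimal sketch, in Python, of a tiered concentration-limit check by counterparty; the limit schedule, counterparty names, and portfolio figures are hypothetical illustrations, not values from this paper.

```python
# Illustrative check of tiered concentration limits by counterparty rating.
# Poorer credits get lower limits so that the expected loss per credit event
# stays roughly level across the portfolio.  All inputs are hypothetical.

LIMIT_BY_RATING = {  # maximum aggregate notional per single counterparty
    "AAA": 100e6, "AA": 80e6, "A": 60e6,
    "BBB": 40e6, "BB": 20e6, "B": 10e6,
}

portfolio = [  # (counterparty, rating, notional)
    ("Acme Utilities", "BBB", 25e6),
    ("Acme Utilities", "BBB", 20e6),
    ("Borealis Mining", "B", 8e6),
]

def concentration_breaches(bonds, limits):
    """Return counterparties whose total notional exceeds their tiered limit."""
    totals = {}
    for name, rating, notional in bonds:
        key = (name, rating)
        totals[key] = totals.get(key, 0.0) + notional
    return {name: (total, limits[rating])
            for (name, rating), total in totals.items()
            if total > limits[rating]}

print(concentration_breaches(portfolio, LIMIT_BY_RATING))
# -> {'Acme Utilities': (45000000.0, 40000000.0)}
```

Lower limits for poorer ratings, or a second schedule keyed to industry sector and country, can be layered on the same structure to reflect correlation.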
Since the portfolio and the economy are both always changing, it is also important to have
mechanisms for repositioning the portfolio over time. Three methods are commonly used.
First, it is important to manage bond durations by counter party, industry sector and country.
An insurer who manages their durations well can progressively reduce their exposure to a
deteriorating segment of the business by not writing new bonds. This is known as an "orderly
exit." Second, covenants can be placed in the contracts that require the insureds to post
collateral if certain thresholds are breached. These thresholds can be established to generally
coincide with the deterioration of that part of the portfolio, keeping the total exposure under the
concentration limits and withdrawing the exposure while the risks are still solvent. However,
covenants are losing their effectiveness because they have become too popular a solution,
contributing to the trend of marginally solvent companies crashing dramatically into
bankruptcy. Third, the partial derivatives of the expected loss relative to changes in various
economic indices measure the sensitivity of the portfolio to macroeconomic events. In
financial market risk management theory, the partial derivatives are known as the "Greeks"
because a particular Greek letter typically represents each distinct partial derivative. Analysis
of the Greeks assists insurers in managing their risk to macroeconomic events, such as a rise
in interest rates.
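As an illustration of the finite-difference calculation behind the Greeks, the sketch below assumes a toy expected-loss function in which default rates scale with an interest-rate index; the functional form, the elasticity, and the portfolio figures are assumptions made purely for the example, not a model from this paper.

```python
# Finite-difference sensitivity ("Greek") of portfolio expected loss to an
# economic index.  The loss model is a toy assumption: default rates are
# scaled up as the interest-rate index rises above its base level.

def expected_loss(portfolio, rate_index, base_index=0.05, elasticity=2.0):
    """Toy expected loss where default rates scale with the rate index."""
    scale = 1.0 + elasticity * (rate_index - base_index) / base_index
    return sum(notional * default_rate * scale * (1.0 - recovery)
               for notional, default_rate, recovery in portfolio)

portfolio = [  # (notional, annual default rate, recovery rate) -- hypothetical
    (20e6, 0.0716, 0.10),
    (12e6, 0.0468, 0.10),
    (8e6,  0.0087, 0.10),
]

# Central finite difference: d(expected loss) / d(interest-rate index)
h = 0.0001
greek = (expected_loss(portfolio, 0.05 + h)
         - expected_loss(portfolio, 0.05 - h)) / (2 * h)
print(f"Expected loss at a 5% index: {expected_loss(portfolio, 0.05):,.0f}")
print(f"Sensitivity to the rate index: {greek:,.0f} per unit change")
```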
Insurers are also getting better at actively managing their credit risk profile with reinsurance,
retrocessions, credit default swaps, and other financial instruments. This is a powerful
technique for managing the portfolio because it is able to change the risk profile after the fact.
But, it is more difficult to implement in practice than it appears. Reinsurance is becoming
increasingly expensive and credit markets are not that liquid. Thus, the required credit
protection is often not affordable or available. This is because the names that have exhausted
the credit capacity of the insurance company have also generally exhausted the credit
capacity of the other credit markets as well. Furthermore, the insurer always runs the risk of
having the reinsurance/swap cost more than the premium the insurer collected. For this
reason it is important that the pricing of the insurance product specifically incorporate the cost
of any risk transfer or hedging activity. Also, those using hedges must note that they are often
inefficient. The trigger for the insurance policy and the trigger for the hedge are usually
slightly different. This inefficiency must be factored into the hedge and the premium charged
for the insurance product if hedges are being used. Finally, reinsurance and hedge
transactions are usually conducted with reinsurers and financial institutions. This is a problem
because most reinsurers and financial institutions are themselves peak credit risks in the
insurance portfolio. The insurance company must also manage what would happen if the
counter party goes bankrupt, causing the reinsurance or hedge to fail.
While substantial similarities between insurance companies and financial institutions enable
the insurance industry to borrow liberally from financial risk management theory, there is also
an important difference to note. For financial institutions, credit risk is highly correlated and
dominates the portfolio. Credit risk management practices focus on fencing in this risk. For
insurance companies, credit risk diversifies with the other risks that the insurer writes, such as
CAT (Catastrophe) covers. Insurance credit risk management can take advantage of this
diversification.
Credit risk management in financial institutions focuses on fencing in the potential damage
from highly correlated losses. Periodically, defaults occur in a highly correlated way and this
is known as the credit cycle. The cycle begins when bad economic news causes a large
amount of money to flee the credit markets, causing the cost of credit to increase suddenly and
dramatically. This causes many companies that were only barely surviving to fail
simultaneously. Their failure in turn adds additional financial stress to their creditors,
customers and suppliers, causing more failures. The failures ripple through the market, taking
out many of the financially weak and some of the financially strong. Credit cycles often center
on the specific countries and industry sectors that generated the initial bad economic news.
Financial risk management is heavily focused on quantifying the amount of loss the institution
is potentially exposed to when credit becomes scarce, causing counter-parties to go bankrupt.
It is managed in a manner very similar to the way insurers manage earthquakes.
When insurers manage the credit cycle, they have the added luxury that the credit cycle and
the underwriting cycle are natural hedges; that is, they anti-correlate. Both are driven by the
availability of capital. When capital is scarce and credit dries up, counter parties go bankrupt
and financial markets suffer catastrophic losses. However, when capital is scarce and
capacity dries up, insurance premiums rise and insurance markets are at their most profitable.
The opposite relationship holds when capital is plentiful. As a result, it is possible for insurers
to implement a risk management strategy that integrates credit-related products, other
insurance products and the investment portfolio results. This strategy aims to immunize the
portfolio by balancing the effect of the credit and underwriting cycles. Currently this idea is
more theory than practice, although several companies are implementing risk aggregation
models that would permit them to implement such a risk management strategy. These models
are effectively detailed DFA (Dynamic Financial Analysis) models of the corporation and all of
its parts.
It is often thought that the only goal of risk management is making sure that the company
survives to see tomorrow. But, an equally important goal is to be able to determine which
products add value. Traditionally, "value" has been measured via profit & loss reports. But
this approach is really only able to identify which products are unprofitable or under-
performing relative to historic norms. It is not able to reliably identify the products that create a
drag on the stock price. In order to identify these products, a system that measures a
product's contribution to the ROE (Return on Equity) is required. Several companies are
experimenting with such a measurement system.
The reason most insurance companies today are not able to identify the products that
decrease their ROE is because most companies do not have the risk measurement systems
capable of quantifying the amount of capital each risk requires. The question becomes
particularly complicated for Credit & Surety products since these have many risk
characteristics that the other lines can often downplay, such as correlation and hedging
activities. The most immediate obstacle is the fact that some companies do not yet capture
the information required for such an analysis. The new risk management techniques are
extremely data intensive, and sophisticated inventory systems are required to implement
them.
The risk profile for Credit & Surety is extremely complicated. It is easy for the insurer to
accidentally take on an unacceptable amount of risk, requiring the utilization of an
unacceptable amount of capital. It is also easy to make simple mistakes, such as paying more
for reinsurance than the premium collected, which quickly dooms the insurer to certain loss.
Measuring profitability requires understanding how much capital the risk requires relative to
the profit that the risk generates and including the effects of all of the reinsurance and hedging
purchased. This comparison can only be done within the framework of an advanced risk
management system.
4. Pricing
Credit & Surety has always been viewed as a form of property insurance because it shares
the defining characteristics of property business. Most important is the fact that the severity
distribution is relative to the limit of insurance. But, it also shares numerous other
characteristics: there is a very wide variation in the limits commonly purchased (several orders
of magnitude), and Credit & Surety is subject to large shock losses as well as catastrophes
(known as the "loss cycle").
However, Credit & Surety does have distinguishing characteristics of its own. First is the fact
that the loss cycle (the catastrophe) is not random, but appears at eight to twelve year
intervals. This means that the way the loss cycle is incorporated into the pricing is different
than regular property catastrophe pricing. Second is the fact that Credit & Surety underwriting
requires more judgment than other types of property underwriting because insurers
understand the causes of fire much better than the causes of insolvency. As a result, there
can be enormous variations in experience from one insurer to another as underwriting
practices differ. Third is the fact that Credit & Surety is "underwritten to a zero loss ratio."
This does not mean that insurers have years without losses. What it implies is that the goal of
Credit & Surety is to actively monitor the risks and to proactively respond to problems in order
to prevent losses from happening. As a result, Credit & Surety underwriting focuses primarily
on reducing loss frequency (i.e. default risk). Therefore, when the loss experience of two
primary companies differs, most of the difference is in their frequencies.
Because of the similarities between Credit & Surety and property, most actuaries approach
Credit & Surety pricing in the same manner as the other property lines. Both experience
rating and exposure rating methods are commonly used. The benefit of having two different
methodologies is that when the assumptions underlying one methodology fail, then the other
methodology can generally be relied upon. But for Credit & Surety, the assumptions
underlying both methodologies are equally questionable. Thus, Credit & Surety pricing has
always been somewhat of an art form. The fact that Credit & Surety pricing requires this
judgment is a particular weakness during a soft market, because it is not unusual for market
pressures to compromise actuarial judgement.
Experience rating is theoretically appealing because it calculates the correct rate for a portfolio
based on its own experience. This means that we do not need to make many assumptions
about the applicability of the data when we price. But experience rating does have
weaknesses. Primarily, it is a demanding methodology with regards to data quantity and
quality. It requires reasonably extensive data, restricting its applicability to larger volumes of
business and longer time intervals. It also requires reasonably good quality data. Shock
losses need to be massaged to match long-term expectations and the loss cycle needs to be
carefully built in. But shock losses and loss cycles are rare, so actuaries must choose between
long time periods full of ancient data or short time periods that lack credible experience. The
adjustments required to get around these problems are judgmental and threaten the credibility
of the analysis. It is unfortunate that experience rating's major weakness is Credit & Surety's
major characteristic.
Exposure rating is theoretically appealing because it permits the use of industry experience.
This permits the experience of shorter time periods to be more credible. Furthermore, the way
Credit & Surety is underwritten means that industry severity data should not need to be
manipulated when applied in an exposure rating. Only the frequency estimates should require
judgement. However, both are difficult to calculate in practice. First, sharing of data is not
common in the Credit & Surety industry. There is not a lot of experience available, and those
who do have books large enough to have credible experience want to use this as a
competitive advantage over those who do not. Furthermore, when experience is shared (e.g.,
through the Surety Association of America (SAA) or reinsurers), it usually is without the corresponding
exposure values. So, it is difficult to compile industry data. The industry also does not have a
uniform standard for recording data. There are a wide variety of definitions for "loss" and an
even wider variety of definitions for "exposure." So, if and when one is able to compile a
collection of industry experience, it does not have quite as much meaning as we would like.
This reduces the selection of exposure rating parameters to an act of judgement.
Historically, experience rating has been the approach used by insurers when reviewing the
rate adequacy of their book and by reinsurers when pricing reinsurance. This is because the
lack of credible exposure rating parameters is generally a greater problem than the judgement
required for experience rating. Primary companies have always used some form of exposure
rating for pricing individual insureds, but they seldom even look at this data when reviewing
the profitability of their entire portfolio. This is partly due to the fact that exposure rating
systems typically require so many soft factors that the results are unsuitable for the purpose of
portfolio analysis. Portfolio reviews are almost exclusively performed via experience rating.
This is in stark contrast to the financial markets that rely heavily on the exposure rating when
performing portfolio analyses.
Combining traditional exposure rating with modern financial markets pricing theory results in a
Credit & Surety pricing methodology that is considerably more flexible than traditional
insurance pricing methodologies. This development is made possible by the fact that insurers
and reinsurers are now adopting financial markets risk management methodologies, making
new data available for pricing. The mixed approach combines the best practices of both
theories.
The characteristic that most distinguishes financial markets pricing theory from insurance
pricing theory is the way exposure is measured. Credit & Surety insurance currently follows
the property tradition by using the policy limit or a PML (probable maximum loss) as the
exposure base. The financial markets use an exposure base that is significantly more
sophisticated. As with insurance, it starts with the policy limit modified to reflect the value
realistically exposed. This effectively gives us a PML. The financial markets then further
modify the quantity to reflect the credit rating of the counter party. Better credit ratings have
lower losses with respect to the amount exposed. Finally, correlation is introduced to get the
correct measure of aggregate risk.
The goal of the financial markets approach to exposure measurement is to precisely quantify
the expected loss of a risk with as little subjective judgement as possible. This would appear
to be impossible when you consider all of the qualitative risk assessments that must go into
the analysis. For this reason, the financial markets have established the use of public credit
ratings as a way of validating the judgement of the analysts. The credit ratings contain all of
the judgmental factors so that the other components can be entirely objective. Making the
credit ratings public knowledge permits analysts to be able to compare their assessments with
those of other analysts and ensure that their assessments are not wildly different from the rest
of the market. The consistency in approach and public application compensates for the
necessary subjectivity of financial markets pricing. Credit & Surety pricing could benefit from
this approach towards pricing. The change would also enable insurers to incorporate their
Credit & Surety exposures into their credit risk management framework, giving them a more
complete picture of their company's exposure to stress from the financial markets.
However, there are differences between the two markets that hinder the combining of their
theories. Two important differences are the fact that insurance often has triggers that differ
materially from simple financial default and insurers have significantly more control over the
risk. For example, if a construction company defaults, then the Surety will look for ways to
keep projects going forward either by loaning money to the contractor or by finding a
replacement contractor. The loss will emerge over time according to decisions the Surety
makes about how to handle it. It is even possible for default to ultimately result in no loss at
all. On the other hand, in the financial markets, default results in payment according to the
obligation. Therefore, while the risk profiles of insurers are strongly correlated with the risk
profile of the financial market, they are also markedly different.
Another important difference between insurance and financial markets theory that must be
reflected is the fact that insurance companies regularly review their base rate while the
financial markets do not. Financial markets price each risk separately, but they do not review
the portfolio in total and calculate base rate changes. Financial markets pricing theory
focuses more on differentiating risks than on making sure that the aggregate return is
adequate. In financial markets theory, the company does not attempt to set the average
return but rather lets market forces dictate what that return should be. It is assumed that the
rate is adequate because market forces will push all unprofitable business into the lower credit
ratings. The goal then is to provide the risk differentiation information required to make the
market operate efficiently. In order for this theory to be useful to insurers, the financial
market's ability to differentiate risks must be married with the insurance market's ability to
measure whether the correct aggregate premium is collected.
The differences between insurance and the financial market can be incorporated into the
pricing by using the financial markets pricing as a benchmark and adding a deviation factor for
the insurance differences. In its most general form, the expected loss as an insurance product
can be represented as follows:
Expected Loss (insurance) = f( E[L], α )
where E[L] is the expected loss of the financial markets benchmark. Here, α represents a
factor measuring the differences in the loss triggers and other advantages of writing insurance.
The value for α should vary with the type of product being modeled. The function used to apply
α could model expected loss in total, frequency and severity separately, or some variation
thereof. Refer to Appendix 1, Page 2 for a simple example.
Establishing a value for α that is appropriate for the insurance product is critical to this
exercise. Two main approaches are possible. The first is to use the historical data of that
product and to back into the α that reconciles the experience and the benchmark. In other words,
estimate the expected loss as a financial product (using historical exposures and ratings at a
given point in time) and compare that number to historical surety losses developed to ultimate
for the same time period. The second is to establish α judgmentally by comparing that
product to others for which α is known. For example, if f(E[L], α) = αE[L], then a general rule
of thumb is: for high-risk Commercial Surety (bonds that act as financial instruments), α is
one; for very low risk Commercial Surety, α is close to zero; for Contract Surety and Credit
Insurance, α is somewhere in the middle.
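A minimal sketch of the first approach follows, assuming the simple multiplicative form f(E[L], α) = αE[L]; the exposures, rating-implied default rates, and the ultimate loss figure are hypothetical.

```python
# Backing a value for alpha out of historical experience, assuming the simple
# form f(E[L], alpha) = alpha * E[L].  All figures below are hypothetical.

def benchmark_expected_loss(exposures):
    """Expected loss priced as a financial product (no insurance adjustment)."""
    return sum(notional * default_rate * (1.0 - recovery)
               for notional, default_rate, recovery in exposures)

# Historical exposures with the rating-implied default rates in force at the time.
historical_exposures = [
    (25e6, 0.0305, 0.10),
    (40e6, 0.0087, 0.10),
    (15e6, 0.0468, 0.10),
]

# Actual surety losses for the same period, developed to ultimate.
ultimate_losses = 650_000

alpha = ultimate_losses / benchmark_expected_loss(historical_exposures)
print(f"Implied alpha: {alpha:.2f}")
```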
Credit Default Swaps are the preferred benchmark because Credit Default Swaps are
standardized and actively traded on the open market. This permits the insurance company to
see what the market's consensus opinion of that credit's risk is without having to adjust the
data for specialized terms and conditions. This is called the "price discovery process." The
riskiness implied by the market price can then be compared to the riskiness as measured by
the commercial credit ratings and the riskiness as measured by the insurer's own credit
models. Another benefit is that the market reacts to information faster than any other credit
rating process. This makes the pricing more responsive to current events and the ability to
keep up with the market prevents arbitrage opportunities. Finally, Credit Default Swaps are
becoming an increasingly popular tool for hedging credit risk exposures. Using Credit Default
Swaps as the basis for the pricing and risk management process makes the hedging
calculations easier.
The calculations can be accomplished with varying sophistication. This paper presents a
simplistic approach that can be applied to any Credit & Surety product. The calculations in
this paper will be based on the following definitions:
• Notional Amount: the amount exposed to loss (the bond or policy limit).
• Term: the length, in years, of the underlying obligation.
• Default Rate: the probability of default over the term, taken from rating agency tables or market data.
• Recovery Rate: the expected portion of the defaulted amount that is recovered.
• α: the factor adjusting the financial markets benchmark for the characteristics of the insurance product.
Please note that we used a somewhat narrower definition of default rates in this paper.
Moody's and S&P, at times, use a broader definition of default rates depending on the purpose
of the exercise.
Expected Loss = Notional Amount x Default Rate x α x (1 - Recovery Rate)
This is the expected loss for the principal. Including α permits us to reflect the bond's unique
characteristics in the pricing. Omitting α gives us the expected loss for a comparable credit
default swap.
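For clarity, the formula can be transcribed directly; the sample bond below (notional, default rate, α, recovery rate) is hypothetical.

```python
# Expected loss for a bond, following the formula above.
def expected_loss(notional, default_rate, alpha, recovery_rate):
    return notional * default_rate * alpha * (1.0 - recovery_rate)

# Hypothetical bond: $10mm notional, 2% default rate, alpha of 0.7, 40% recovery.
print(f"{expected_loss(10e6, 0.02, 0.7, 0.40):,.0f}")   # 84,000
# Setting alpha = 1 gives the expected loss of a comparable credit default swap.
print(f"{expected_loss(10e6, 0.02, 1.0, 0.40):,.0f}")   # 120,000
```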
When applying these formulae for risk management purposes, it is important to take into
account the correlation within and between industry sectors. Correlation also exists between
regions/countries and between Credit & Surety insurance and the company investment
portfolio. It is important to include all sources of credit risk in this calculation, including all
corporate bonds that your investment department has purchased. There are several
methodologies to perform this calculation. Two are frequently used:
a. Downgrade the credit ratings of securities in sectors that have exceeded specific
concentration thresholds. For example: 10% concentration → one-notch downgrade, 15%
concentration → two-notch downgrade, et cetera. This gives correlation a cost (the cost of
the required hedge), enabling underwriters to manage correlation within the pricing formulae
and creating a disincentive for adding more of this risk. (A sketch of this notching rule
appears after this list.)
b. Create a simulation model that accurately reflects the characteristics of the original
portfolio, including correlation. Note that many different options exist for the design of the
correlation engine.
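The following is a minimal sketch of methodology (a), using the illustrative thresholds above and a simplified rating scale; the example sector shares are hypothetical.

```python
# Sketch of methodology (a): downgrade the credit ratings of securities in
# sectors whose portfolio concentration exceeds specified thresholds.
# The rating scale is simplified; thresholds follow the example in the text.

RATING_SCALE = ["Aaa", "Aa", "A", "Baa", "Ba", "B", "Caa"]

THRESHOLDS = [(0.15, 2), (0.10, 1)]   # (sector concentration, notches down)

def notches_for_concentration(sector_share):
    for threshold, notches in THRESHOLDS:
        if sector_share >= threshold:
            return notches
    return 0

def downgraded_rating(rating, sector_share):
    idx = RATING_SCALE.index(rating) + notches_for_concentration(sector_share)
    return RATING_SCALE[min(idx, len(RATING_SCALE) - 1)]

# A power-sector credit rated Baa in a portfolio 12% concentrated in power:
print(downgraded_rating("Baa", 0.12))   # -> Ba (one-notch downgrade)
print(downgraded_rating("Baa", 0.16))   # -> B  (two-notch downgrade)
```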
Finally, it is important to note that pricing is not independent of risk management. The outputs
of the pricing exercise are the inputs for the risk management system. For this reason, it is
important to design any pricing system so that both needs are met. In general, the output
from the pricing exercise should contain the following:
• Average portfolio default rate and rating
• A distribution of default rates (or ratings)
• Average notional amount
• Expected loss
• A distribution of losses
• Expected excess loss
• A distribution of excess losses
Most primary companies use industry based rating tables for small risks, such as the Surety
Association of America's Surety Loss Cost tables, and their own proprietary rating systems for
large risks. Increasingly, these proprietary systems refer to the credit rating of the risk being
insured.
One can use a modified credit default swap pricing methodology as the approach for pricing
bonds. Consider the example of an insured that wants to insure a $25mm receivable from a
power company, payable in 5 years. (Appendix 1, Page 1) It is a high-risk bond that behaves
very much like a financial guarantee. Suppose the power company has a Moody's credit
rating of Baa3. Referring to Moody's Default Rate table, the five-year default rate for Baa3
credits is 3.05%. Since the insurance policy behaves similarly to a financial guarantee, (~ is
chosen to be one. The expected loss is thus $686K. Reflecting five years of investment
income gives us a discounted expected loss of $538K.
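The arithmetic of this example can be reproduced as follows; the 5% annual discount rate is an assumption chosen for illustration (the paper does not state the rate used), and the 10% recovery rate is inferred from the $686K figure.

```python
# The $25mm Baa3 example above, reproduced numerically.
notional      = 25e6
default_rate  = 0.0305   # Moody's five-year default rate for Baa3
alpha         = 1.0      # the policy behaves like a financial guarantee
recovery_rate = 0.10     # inferred from the $686K expected loss
term_years    = 5
discount_rate = 0.05     # assumed discount rate for illustration

expected_loss = notional * default_rate * alpha * (1 - recovery_rate)
discounted    = expected_loss / (1 + discount_rate) ** term_years

print(f"Expected loss:            {expected_loss:,.0f}")   # ~686,250
print(f"Discounted expected loss: {discounted:,.0f}")      # ~537,700
```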
4.4 Proportional Reinsurance Application
Both quota share and surplus share reinsurance are common in Credit & Surety. Quota share
reinsurance is the easiest to price since cedants are able to provide all of the pricing
information listed in section 4.2. The most difficult part of the quota share pricing exercise is in
modeling the commission terms, since they generally are a function of the treaty results.
When computing the appropriate aggregate loss distribution, it is critical to accurately reflect
the correlation within the portfolio. Surplus share reinsurance is more difficult to price because
the ceded amount varies with the bond limit. A standard way to approach this is to restate the
exposure profile to reflect the surplus share terms and to then price the treaty as if it were a
100% quota share.
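A minimal sketch of that restatement follows, assuming a hypothetical retained line, number of lines, and bond list; the ceded portion of each bond is then priced with the same expected-loss formula, as if it were a 100% quota share of the restated exposures.

```python
# Sketch: restate an exposure profile to reflect surplus share terms, then
# price the restated profile as a 100% quota share.  All inputs hypothetical.

RETAINED_LINE = 2e6     # cedant keeps the first $2mm of each bond
NUM_LINES     = 4       # reinsurer takes up to 4 lines ($8mm) above that

def ceded_share(limit, retained_line=RETAINED_LINE, num_lines=NUM_LINES):
    """Fraction of the bond ceded under the surplus share treaty."""
    ceded_amount = min(max(limit - retained_line, 0.0), num_lines * retained_line)
    return ceded_amount / limit

bonds = [  # (limit, default rate, alpha, recovery rate)
    (20e6,  0.0716, 0.7, 0.10),
    (5e6,   0.0087, 0.7, 0.10),
    (1.5e6, 0.0468, 0.7, 0.10),
]

# Expected loss ceded to the treaty, priced like a 100% quota share of the
# restated (ceded) exposures.
ceded_expected_loss = sum(
    ceded_share(limit) * limit * default_rate * alpha * (1 - recovery)
    for limit, default_rate, alpha, recovery in bonds
)
print(f"Ceded expected loss: {ceded_expected_loss:,.0f}")
```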
It is increasingly common for reinsurers to request a complete listing of all of the credits in the
portfolios so that the reinsurer can incorporate the information in their credit risk management
system. This detail of data also enables the reinsurer to independently assess the adequacy
of the primary company risk evaluation and management process. Lead reinsurers typically
review the historical accuracy of the cedant's pricing relative to the results that the insurer
experienced. The reinsurer calculates a cedant-specific α, in addition to the α's it uses for the
products, to adjust for the primary's underwriting quality. The pricing then proceeds as
described above.
The credit default swap approach can also be an effective way to approach the pricing of
excess reinsurance. Consider the example of a portfolio presented in Appendix 1, Page 2.
Excess reinsurance covers losses occurring this year, so the term is always one, no matter
how long the underlying obligations run. (Beware of the optional tail coverage!) We
look up this product in the pricing tables to get the default values for α and the recovery rate.
This information is then enough to compute the expected loss for the excess reinsurance
layers.
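A minimal sketch of the layer calculation implied by Appendix 1, Page 2: each credit's severity given default is notional x (1 - recovery rate), the slice of that severity falling in the layer is weighted by default rate x α. Credit "A" from the exhibit is used as the example, and the layer definitions follow the exhibit's column headings.

```python
# Layer expected losses for one credit, in the style of Appendix 1, Page 2.
# Severity per default = notional x (1 - recovery rate); the layer takes the
# part of that severity between its attachment and exhaustion points.

def layer_expected_loss(notional, default_rate, alpha, recovery_rate,
                        attachment, width):
    severity = notional * (1.0 - recovery_rate)
    severity_in_layer = min(max(severity - attachment, 0.0), width)
    return default_rate * alpha * severity_in_layer

# Credit "A" from the exhibit: $20mm notional, 7.16% default rate,
# alpha of 70%, 10% recovery.
args = (20e6, 0.0716, 0.70, 0.10)
for label, attachment, width in [("1M x 1M", 1e6, 1e6),
                                 ("3M x 2M", 2e6, 3e6),
                                 ("5M x 5M", 5e6, 5e6)]:
    print(label, f"{layer_expected_loss(*args, attachment, width):,.0f}")
# 1M x 1M 50,120   3M x 2M 150,360   5M x 5M 250,600  (matching the exhibit)
```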
A major benefit of this methodology is that the relationship between pricing and risk
management is considerably clearer. It is now obvious to the cedant how different risk
management rules will affect their reinsurance costs. For example, two observations are
immediately apparent in the sample exhibit on Appendix 1, Page 2.
First, credit A could potentially cause a loss that greatly exceeds the amount of excess
reinsurance purchased. The insurer did not purchase enough excess reinsurance to
adequately protect itself. But, high layer reinsurance is expensive, especially if it is not well
used. Just like lines of credit issued by banks, reinsurers typically charge capacity fees for
excess layers that have low activity because they might be used. A more cost effective way to
manage the portfolio is to not let any one risk get that large in the first place.
Second, credits A, B and C all have inadequate credit quality for their size. Notice how the
expected loss for the 5x5 layer is almost entirely due to these three risks. It is unlikely that
these three risks are able to support all of a reinsurer's capital and frictional costs by
themselves. So, the 5x5 layer is probably uneconomical for the cedant. A more cost effective
way to manage the portfolio is to place lower maximum limits on poor credits and to establish
an orderly exit process to address deteriorating credits.
Incorporating additional information into the exhibit enables us to perform even more
analyses. For example, comparing the direct premiums collected for the bonds with the
ground up expected loss calculation gives us a diagnostic for reviewing c~. Other information
that is potentially useful includes bond type, industry group, collateral, hedges, retrocessions,
and the Greeks.
Aggregate stop loss reinsurance (Agg-stops) is the insurance version of a collateralized debt
obligation (CDO). Both involve collecting a large portfolio of risks and then slicing the
portfolio into horizontal tranches. Since correlated risk exacerbates aggregate loss, most of
the correlated risk ends up residing in the upper tranches while most of the uncorrelated risk
ends up residing in the lower tranches. Therefore, the function of Agg-stops for Credit &
Surety portfolios is to strip the correlation from the portfolio. And since the correlated risk is
the largest consumer of credit capacity, Agg-stops release significant amounts of credit
capacity for the primary insurer, but at the cost of consuming significant amounts of credit
capacity from the reinsurer.
Since the risk in an Agg-stop treaty is almost entirely correlated risk, it is critical that the model
used for pricing Agg-stops has a sophisticated treatment of correlation. Typically, a simulation
model is used so that each necessary term, such as sub-limits by industry group, can be
modeled accurately. Simulations also permit the reinsurer to analyze the effect of including the treaty in
their portfolio. This permits the reinsurer to more accurately assess their cost of capital loads
for the treaty.
5 Specific Issues
5.1 Frequency
The ability to get improved frequency estimates is a key reason why many insurers have
begun to adopt the credit scoring algorithms of the rating agencies. One major advantage
rating agency algorithms have over other pricing methodologies is that the rating agencies
have the most complete and longest running histories publicly available. A second major
advantage is that the rating agencies are relatively quick to reflect any changes in probabilities
of default in their credit ratings. This allows financial companies to continually revise their
assessment of the quality of their portfolio without having to continually re-rate all of the credits
themselves. Alternatively, an even more responsive indicator of change is the credit spreads
in the market. A credit spread is the difference between the yield on Treasury notes and the
yield on a similar bond issued by the company. Since credit spreads rise monotonically as
credit ratings fall, the market's spreads can be used to establish the market's consensus credit
rating. The credit derivative market is a common place for financial institutions to get
consistent information about credit spreads. It is estimated that up to 90% of the activity on
the credit derivative market is solely for "price discovery" purposes.
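As a small illustration of reading a consensus rating from an observed spread, the sketch below maps a spread to the nearest entry in a table of typical spreads by rating; the spread table is hypothetical, and real spread curves vary by maturity and over time.

```python
# Map an observed credit spread to a market-consensus rating by finding the
# closest entry in a table of typical spreads.  The table is hypothetical.

TYPICAL_SPREAD_BPS = {  # rating -> typical spread over Treasuries, in bps
    "Aaa": 40, "Aa": 55, "A": 80, "Baa": 130, "Ba": 300, "B": 500,
}

def consensus_rating(observed_spread_bps):
    return min(TYPICAL_SPREAD_BPS,
               key=lambda r: abs(TYPICAL_SPREAD_BPS[r] - observed_spread_bps))

print(consensus_rating(115))   # -> Baa
print(consensus_rating(420))   # -> B
```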
While the credit scoring approach has its benefits, it also has its limitations. For example, it is
important to remember that credit ratings are designed for assessing the pricing of debt
instruments, not insurance. Also, rating agencies have also been known to approach the
same calculations in different ways for different publications, depending on what the
information is intended for. Rating agency information must be used with caution.
If pulling default statistics from a publication, it is important to note precisely what the statistics
measure. This is not always clear. Sometimes the statistics are pure frequency statistics and
sometimes they measure expected loss costs. Furthermore, since frequency and severity
strongly correlate, the different default statistics are not always easily distinguished. A
detailed knowledge of how the statistics were calculated and the assumptions underlying them
is necessary before attempting to use them in pricing.
Credit ratings have a fair amount of subjectivity to them. Rating agencies judgmentally
segregate the credits into rating categories and then calculate statistics on the categories.
The subjectivity of the data means that there are trends that must be identified and
compensated for. For example, from 1984 to 1991, the annual default rate for Moody's B1
rated securities always stayed within the range of 4.36% to 8.54%. From 1992 to 2000, the
annual default rate for Moody's B1 rated securities always stayed within the range of 0.00% to
4.57%. Was this the result of a changing economic environment or a change in the definition
of a B1 security? A review of the aggregate default rate for all corporate bond issuers
demonstrates that the two periods were not significantly different. Therefore, we can conclude
that the change in experience is due, at least in part, to a change in the definition of a B1
security.
While credit-scoring models can be used to improve credit default rate predictions, they
cannot always produce accurate frequency predictions for the insurance products we are
pricing. This is because the insurance industry's definitions of default can differ markedly
from the rating agencies'. To a rating agency, default means the failure to service debt. To an insurer it can
mean many things, such as the failure to pay a bill or the failure to fulfill a bonded obligation.
Insurance default rates can be greater or less than commercial debt default rates, depending
on the nature of the insured obligation. For this reason, credit-scoring models must be used
with great care if they are to be useful in surety pricing.
5.2 Severity
Severity (recovery rates) can be analyzed using data and models similar to those used for
frequency. Recovery rates vary with both credit rating and debt seniority, thus they are
specific to the insured and the instrument being priced. Severity distributions are harder to fit
than frequency distributions because they are more complex: for frequency, we only need to
be concerned with the average probability of default, while for severity we need the full
distribution. However, the increased complexity of fitting severity curves is partly mitigated by
the fact that Credit & Surety underwriting places an overwhelming focus on frequency.
When pricing for retentions or excess layers, it is important to put a distribution around the
average recovery rate. Then, the expected loss cost for each exposure in the portfolio is
calculated using the Limited Expected Values of the recovery rate distribution. Typically, a
Beta distribution is used if it is impossible for the loss to exceed the notional amount. If it is
possible for the loss to exceed the notional amount (e.g., Contract Surety), then property
distributions typically are used. If the list of exposures is not known, then a LogNormal
distribution is typically used for representing the distribution of potential exposure sizes.
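The following is a minimal Monte Carlo sketch of this calculation, assuming a hypothetical bond and Beta parameters: the recovery rate is simulated, the loss is limited to the amount above the retention, and the result is weighted by default rate and α.

```python
# Monte Carlo sketch: expected loss above a retention with a Beta-distributed
# recovery rate.  All parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

notional     = 10e6
default_rate = 0.02
alpha        = 0.7
retention    = 1e6          # insured's retention (deductible)
a, b         = 2.0, 3.0     # Beta parameters: mean recovery = a/(a+b) = 40%

recoveries = rng.beta(a, b, size=100_000)
losses     = notional * (1.0 - recoveries)          # severity given default
# Limited expected value of the loss excess of the retention:
lev_excess = np.mean(np.maximum(losses - retention, 0.0))

expected_loss = default_rate * alpha * lev_excess
print(f"Expected loss above the retention: {expected_loss:,.0f}")
```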
For primary insurers, an accurate representation of the recovery rate distribution is essential if
the insured has a significant retention or posts significant collateral. The recovery rate curves
will determine how much credit to give to the collateral and retention. Inaccurate curves
increase the risk of over/under pricing the business. A review of the insurer's hit ratios by
retention would indicate whether inaccuracies exist.
Establishing accurate recovery rate distributions for new products poses a particularly difficult
challenge. This is generally accomplished by borrowing distributions from other related
products. Insurers can improve the variety of their severity distributions in the following
manner: First, fit a recovery rate distribution for each category that has sufficient experience.
Use the same form of distribution for each fit so that the equations are identical and only the
parameters change. Plot the parameters onto a grid, labeling each point with the product that
generated it. Then, recovery rate curves for new products can be selected judgmentally from
the grid by placing a point onto the grid that makes sense relative to the existing portfolio of
products.
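One simple way to carry out these per-category fits is a method-of-moments Beta fit, so that every category shares the same functional form and only the parameters change; the sample recovery data below are hypothetical.

```python
# Method-of-moments Beta fits of recovery rates by product category.  Each
# category shares the same distributional form; only (a, b) change.
import numpy as np

def fit_beta_moments(recoveries):
    """Return Beta(a, b) parameters matching the sample mean and variance."""
    m, v = np.mean(recoveries), np.var(recoveries)
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common

samples = {  # hypothetical observed recovery rates by product
    "Contract Surety":   np.array([0.55, 0.70, 0.40, 0.65, 0.50]),
    "Commercial Surety": np.array([0.20, 0.35, 0.10, 0.25, 0.30]),
}

for product, recoveries in samples.items():
    a, b = fit_beta_moments(recoveries)
    print(f"{product}: a={a:.2f}, b={b:.2f}")
# The (a, b) pairs can then be plotted on a grid and a point chosen
# judgmentally for a new product relative to the existing products.
```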
For reinsurers, the exact shape of the recovery rate distribution is often not that important. For
most reinsurance applications, only the mean and variance of the recovery rate distributions
are significant. This is because we are applying the same curve to a large number of
exposures and the Law of Large Numbers smoothes out the inaccuracies of the distribution. It
is important that the first two moments are correct, but the higher moments are often
smoothed out. However, note that the Law of Large Numbers breaks down in the high excess
layers. If pricing these layers, it is important that the tail of the recovery distribution be
adequately represented; otherwise the layers will be under-priced.
Reinsurers must also pay attention to whether the distribution of exposures is changing or can
change. A trend in average exposure size will materially affect the excess severity
distribution. Furthermore, the existence of excess reinsurance often provides the incentive for
cedants to put the coverage to greater use. An excess layer that does not currently have
many exposures in it may have significantly more by the end of the term. Therefore, it is
common for reinsurers to charge a capacity fee (similar to the fee banks charge for keeping
lines of credit open) for excess layers that are lightly exposed. This pays for the potential for
the cedant to write more bonds that expose the layer. Note that such a fee is not required if
the layer is written on a cessions basis.
5.3 The Loss Cycle
A loss cycle is a period in which Credit & Surety loss activity dramatically increases. During a loss
cycle, loss ratios typically are double or treble their historic levels. Loss cycles generally are
caused by credit cycles but may also be caused by other contagious events, such as a rapid
contraction in the amount of government spending on capital projects. Loss cycles typically
focus around a particular industry and region, meaning that there are multiple overlapping
cycles that could potentially affect an insurer's results. Preparing the insurance company for
future loss cycles is one of the most difficult tasks a Credit & Surety actuary must perform.
Surety & Credit has two main loss cycles. First, large contractors and most non-construction
companies finance their operations through credit. A contraction of the credit market causes
the less stable corporations to fail. However, small contractors tend to finance their operations
by kiting funds from one job to the next. A reduction in the amount of new construction has
the same effect on this market as a contraction of credit has on the market as a whole.
Typically, the availability of credit drives the loss cycle for the large Contract Surety,
Commercial Surety and Credit insurance markets while the amount of new construction drives
the loss cycle for the small Contract Surety market.
Financial Guarantees and Credit Derivatives also have loss cycles that are largely driven by
the credit cycle, and thus are strongly correlated with large contractors and Commercial
Surety. However, a large part of the Financial Guarantee market is municipals and these
behave very differently. The risk for municipals is that the politicians do not want to subject the
public to the pain required to maintain their financial obligations.
The existence of the loss cycle complicates the pricing of these products. The actuary must
keep both the long term and short-term horizons in mind when pricing. For example, if the
loss ratio averages 30% in normal years and 80% during a loss cycle, and if the loss cycle
comes once every decade and lasts for two years, then the long-term loss ratio is 40% (= 0.8
x 30% + 0.2 x 80%). Therefore, in order to make money over the long term, an insurance
company must charge between 10% and 33% more for its products than it would if it were
taking a purely myopic view toward pricing (depending on whether expenses are loaded as
fixed or variable). The market does not easily support such pricing. Thus, strict discipline by
actuaries and underwriters is required.
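The arithmetic of this example is written out below; mapping the 10% and 33% figures to fixed and variable expense loadings is our reading of the parenthetical remark above.

```python
# Arithmetic of the loss cycle pricing example.
normal_lr, cycle_lr = 0.30, 0.80
cycle_share = 0.20                       # 2 years out of every 10
long_term_lr = (1 - cycle_share) * normal_lr + cycle_share * cycle_lr
print(f"Long-term loss ratio: {long_term_lr:.0%}")                     # 40%

# Additional premium needed relative to pricing only for a normal year:
print(f"With variable expenses: {long_term_lr / normal_lr - 1:.0%}")   # ~33%
print(f"With fixed expenses:    {long_term_lr - normal_lr:.0%}")       # 10%
```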
In reinsurance, managing the horizons also consists of paying attention to the "banks" that
insurers have developed with the reinsurer. The bank is the amount of excess funds that the
reinsurer has collected over the good years in order to pay for the bad. Without the building
up of banks, reinsurers cannot be profitable over the long term. Reinsurance rates should
reflect the size of the bank that the insurer has. Returning to the above example, if the insurer
has a fully funded bank, then the reinsurer can charge a rate contemplating a 30% loss ratio.
But, if the insurer has no bank at all, then the reinsurer should charge a rate contemplating a
40% loss ratio (or higher).
Even if the insurer/reinsurer intends to withdraw from the market when the loss cycle begins,
they generally do not get a chance to withdraw until their contracts end. That means that the
insurers & reinsurers must first witness the beginning of the loss cycle before knowing it is
time to withdraw, and by then it is generally too late to avoid the bulk of it. The loss cycle is
relatively short - it is over before much action can be taken.
Loss cycles also have another insidious side that makes identifying them particularly difficult.
Loss cycles have the tendency to be devastating to insureds that already have open credit-
related claims, meaning that these claims are severely exacerbated and the resulting
extraordinary loss is recorded with the date of the original claim. Therefore, the loss cycle is
actually much shorter than the actuarial loss experience suggests. For example, in the most
recent loss cycle, losses grew in 2000, peaked in 2001 and may have begun to decline in
2002. But, the loss cycle was not apparent to the market until late 2001. Most of the losses in
2001 and all of the losses in 2000 are due to the aggravation of losses that already were in
claim. In the context of the actuarial loss history, the loss cycle was not identifiable until it was
already half over.
The loss cycle has often been compared to hurricanes and other natural catastrophes and
they both are managed in similar ways. But, they are very different to a pricing actuary.
Unlike most other catastrophes, the loss cycle is not a Poisson process. If we have a loss
cycle this year, then it will be a few years before we have another one. Loss cycles require
weak companies and excessive competition. It takes time for these economic conditions to
redevelop once a loss cycle occurs. However, the fact that there are many different types of
loss cycles makes the time between loss cycles very difficult to predict. For example, the time
between the last two Contract Surety cycles was about 13 years (1987 to 2000). But, if we
include Commercial Surety, the period drastically shortens. The last Commercial Surety loss
cycle (credit cycle) was in 1992. The fact that we have multiple different types of loss cycles
does add some Poisson-style risk to the pricing, but does not make it a full Poisson process.
Understanding the loss cycle is a vital part of the pricing process. It ultimately determines
whether the insurer makes money or not. It is a particularly difficult component to price
because the long time periods that separate loss cycles limit the usefulness of loss
histories. Predictions are as much art as science and crystal balls invariably find their way into
the process. Some actuaries have expressed confidence in the new economy and believe that
the durations between loss cycles are increasing. Others point to the increasing reliance
businesses have on the credit markets as a fundamental destabilizing force, which should
shorten the durations between loss cycles and increase their severities. Today, there is no
consensus. The only general conclusion that can be drawn is that insurers tend to be
too optimistic. Historically, too many have found themselves with inadequate banks when the
loss cycle begins.
6 Conclusion
There are many new products at the intersection of the insurance and financial markets, and
some of the traditional insurance products now have financial flavors. The traditional
insurance methods for evaluating and managing these risks have become outdated. The
goal of this paper is not to give a definitive proposal, but to invite actuaries, underwriters and
senior managers to look at these products from a new perspective. The biggest danger to
insurance is in not changing. This was made very evident by the enormous exposures
insurers had to Enron and by the fact that many of the resulting claims by Enron's obligees
were entirely unanticipated. In conclusion, we strongly believe that following the lead of
financial markets could help the insurance industry quantify and manage Credit & Surety risks
more effectively and more efficiently. This will ensure the long-term availability of sufficient
capital, and thus capacity, for this line of business.
Acknowledgements
The authors would like to gratefully acknowledge Scott Orr, Frank Bonner, Marianne Wilbert,
Scott MacColl, John Boulton, Kathleen Miller-Boreman, and Roger Amrein for reviewing early
drafts of this work and offering their feedback. We would like to thank Nick Pastor, David
Smith and Chris Walker of the CAS Ratemaking Committee for providing constructive
feedback. Finally, we thank our spouses Dr. Letha Cherath and Ms. Akiko Ito for their
support and encouragement. The authors remain solely responsible for errors that still exist
in the paper.
Appendix 1
Page 1
(5) Duration: 5 years [A single payment of $25mm is due in 5 years.]
(8) α: 100% [Bond is a no-recourse demand obligation.]
Appendix 1
Page 2
Credit & Surety Pricing and the Effects of Financial Market Convergence
Columns: (0) Index, (1) Name of Credit, (2) Notional Amount, (3) Term, (4) S&P Rating, (5) Moody's Rating, (6) Selected Rating, (7) Default Rate, (8) α, (9) Recovery Rate, (10) Ground Up Expected Loss, (11) Reinsurance Loss 1M X 1M, (12) Reinsurance Loss 3M X 2M, (13) Reinsurance Loss 5M X 5M
 1  A  20,000,000  1.000  B+    B2    B2     7.160%  70%  10%  902,160  50,120  150,360  250,600
 2  B  12,000,000  1.000  B+    B1    B1     4.680%  70%  10%  353,808  32,760   98,280  163,800
 3  C   9,000,000  1.000  B     B3    B3    11.620%  70%  10%  658,854  81,340  244,020  252,154
 4  D   8,000,000  1.000  BB+   Ba1   Ba1    0.870%  70%  10%   43,848   6,090   18,270   13,398
 5  F   7,500,000  1.000  BBB   Baa2  Baa2   0.170%  70%  10%    8,033   1,190    3,570    2,083
 6  G   7,000,000  1.000  BB+   Ba1   Ba1    0.870%  70%  10%   38,367   6,090   18,270    7,917
 7  H   7,000,000  1.000  A     A2    A2     0.011%  70%  10%      479      76      228       99
 8  I   6,800,000  1.000  BB+   Ba1   Ba1    0.870%  70%  10%   37,271   6,090   18,270    6,821
 9  J   6,500,000  1.000  BB+   Ba1   Ba1    0.870%  70%  10%   35,627   6,090   18,270    5,177
10  K   6,000,000  1.000  A     A2    A2     0.011%  70%  10%      411      76      228       30
11  L   5,000,000  1.000  BB    Ba1   Ba1    0.870%  70%  10%   27,405   6,090   15,225
12  M   5,000,000  1.000  BBB   Baa2  Baa2   0.170%  70%  10%    5,355   1,190    2,975
13  N   5,000,000  1.000  BBB-  Baa3  Baa3   0.420%  70%  10%   13,230   2,940    7,350
14  O   4,000,000  1.000  BB-   Ba1   Ba1    0.870%  70%  10%   21,924   6,090    9,744
15  P   3,500,000  1.000  B+    B1    B1     4.680%  70%  10%  103,194  32,760   37,674
16  Q   2,000,000  1.000  AA    Aa2   Aa2    0.001%  70%  10%       17       8
17  R   2,000,000  1.000  BB+   Ba1   Ba1    0.870%  70%  10%   10,962   4,872
18  S   1,500,000  1.000  BB    Ba2   Ba2    1.560%  70%  10%   14,742   3,822
19  T   1,500,000  1.000  B     B2    B2     7.160%  70%  10%   67,662  17,542
20  U   1,500,000  1.000  BBB-  Baa3  Baa3   0.420%  70%  10%    3,969   1,029
Abstract
This paper presents a methodology that represents a significant enhancement to current pricing
practices. The goal of this methodology is to estimate the impact that a rate change will have on
a company's policyholder retention and the resulting profitability of this transformed book of
business. The paper will present the basics of this methodology as well as where future work
will need to be done to bring this methodology into mainstream pricing. The work that the
authors have done in this area has focused on Private Passenger Auto Insurance but these
techniques could be applied to other lines of business.
Introduction
There is a wealth of actuarial literature regarding appropriate methodologies for using exposure
and claims data in order to calculate indicated rates. Techniques have been developed to address
difficult issues such as small volumes of data, years that are particularly immature and high
excess layers of coverage. All of these techniques ultimately produce a set of actuarially
indicated rates and rating factors. When it comes to deciding on the rates and rating factors that
will actually be used in the marketplace, however, a new dynamic begins to enter the picture.
A revised set of rates will impact the profitability of the company's book of business in a number
of different ways. There is the obvious impact that the revised rates will have on the premiums
that policyholders are paying. There is also the more intangible impact of the policyholder
reaction to the rate change. A rate change exceeding a certain threshold will likely send a
customer shopping for an alternate insurer. Depending on the alternative premiums that are
available in the market, that customer may decide to insure with another company. If a rate
change produces a large number of such non-renewals within the company's book of business,
the revised rates could impair the intended benefits of the rate change. Alternatively, if the non-
renewals that occur are in classes of business that are particularly unprofitable for the company,
its profitability could actually be enhanced by the non-renewal activity.
Companies often have a number of ad hoc "rules of thumb" for determining the amount of a rate
change that the market will bear, but very few rigorous models exist that attempt to estimate the
likely customer reaction to a rate change. An approach to pricing that considers the impact of
the new rates not only on the average premium charged but also on the renewal behavior of
policyholders can thus be a significant step forward in determining appropriate prices and likely
future profitability. The question that must be asked is "Is there a family of models that can
capture the renewal behavior of policyholders?"
This paper will present methodologies that allow the impact of policyholder retention to be
considered in the pricing process. As a result, the rates being considered in such an approach
may not be the same as the actuarially indicated rates. However, since no actuarial
method produces an indicated rate that is precisely correct in all situations, there will always be a
reasonable range of actuarially sound rates. The methodologies presented in this paper
demonstrate how the decision regarding which rate to implement can be made with more rigor
than is possible with the current approaches used in the industry.
A. What is It?
A family of techniques that has been successfully applied to model similar behavior in the past is
called Agent Based Modeling. Simply put, in using these techniques, models are built which
contain factors, agents and rules. Factors are the quantitative measures of the system that is
being modeled. In the example of modeling customer reaction to rate changes, the factors would
encompass the rates and rating factors for a company and its competitors. They would also
include the loss potential of the various classes of business, which would be used to determine
the profitability of those business classes. The agents in the model are the units between which interactions take
place. In the modeling of rate change reactions, agents would consist of customers, competitors,
insurance agents, etc. The rules describe how the different agents in the model will interact.
One of the problems encountered in applying agent based modeling to insurance is
nomenclature. We have agents in the model and agents who are selling policies. To complicate
matters further, the insurance agents are themselves one of the agent types in the model.
Throughout this paper, to keep the terminology unambiguous, an agent in the model will be
referred to as an economic agent, while an agent selling insurance will be referred to as an insurance agent.
Economic agents are assigned behavioral rules based on a combination of historical data,
surveys, focus groups, and analysis. The models are run under various scenarios, and the results
can be used to help determine a strategic direction, offering insights that cannot be discerned
with the current "rules of thumb" type of approach.
An instructive earlier application of agent based modeling involved the retirement decisions that
people make in response to changes in Social Security benefit rules. The original estimates
anticipated that average retirement ages would shift quickly toward younger ages once the rules
changed. The actual experience, however, was somewhat different. People's average retirement
ages did move toward younger ages, but the transformation took nearly three decades, which was
much longer than expected. What was missing from the original estimations that caused the
actual experience to differ so greatly?
The original models were predicated on the assumption that the primary factor that influenced
retirement age was the laws surrounding Social Security benefits. Reality, however, is more
complicated than that. After researching the factors that influenced retirement ages, it was
determined that a major factor in the decision to retire is the retirement decisions of other people
in an individual's social network. By constructing models that consider these social interactions,
a set of economic agents and rules was developed that accurately predicted the retirement age
decisions of a population of individuals. This seemingly complex decision making of individuals
could thus be accurately modeled, in the aggregate, with the proper alignment of economic
agents and straightforward decision rules.
If models can be constructed that accurately predict the retirement decisions of a population, it is
not difficult to imagine the construction of models that accurately predict the decisions of a group
of policyholders to remain with their current insurer or to switch to
another. In order to construct such a model, the first step is to describe the process that an
insured will utilize in deciding to renew his policy or switch to another insurer.
1. The insured receives his renewal notice approximately 45 days prior to policy expiration.
2. If the premium decreases or increases modestly, the insured will likely renew the policy with
his current insurer.
3. If the premium increases significantly, the insured will likely begin shopping around.
4. The insured will do some market research by calling other insurance agents or getting quotes
over the phone or internet.
5. Depending upon the savings that can be realized, the insured will either stay or move.
This is a rather simple model as it relies solely on price as the factor upon which the decision is
made. In reality, the process is more complex, as other factors such as quality of service, brand
name recognition, and financial stability also enter into the decision of where to buy insurance.
However, many recent studies have shown that price is the most significant factor. Thus, once
models can be constructed that accurately model behavior based on price, more complex models
can subsequently be constructed that would consider elements other than price. Methods used to
modify the basic model in consideration of these other elements will be discussed further in
section II.F.
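As an illustration, the price-only model can be expressed as a single decision rule keyed to the five steps above. The function below is a minimal sketch; its thresholds, premiums, and quotes are hypothetical assumptions rather than values from any study.

def renewal_decision(old_premium, new_premium, market_quotes,
                     shop_threshold=0.08, required_savings=0.05):
    """Return 'renew' or 'switch' for one insured, using price as the only factor."""
    # Steps 1-2: the renewal notice arrives; a decrease or modest increase is simply renewed.
    increase = new_premium / old_premium - 1.0
    if increase <= shop_threshold:
        return "renew"
    # Steps 3-4: a significant increase triggers shopping; gather competing quotes.
    best_quote = min(market_quotes)
    # Step 5: switch only if the realized savings are large enough to justify moving.
    savings = 1.0 - best_quote / new_premium
    return "switch" if savings >= required_savings else "renew"

# Example: a 15% increase with a competitor quoting about 6% below the new premium.
print(renewal_decision(1000.0, 1150.0, [1200.0, 1080.0]))   # -> 'switch'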
Throughout this paper, private passenger auto insurance will be used as an example line of
business. These techniques could apply to other lines as well.
C. Economic Agent Based Approach versus Current Approach
The current "rules of thumb" approach may have been good enough at one time. It may also be
true that this approach will be acceptable today in a situation where the rate change is simple. An
example would be a rate change that applies only to the base rates. However, one of the trends
for virtually all lines of business is that rate structures have become more refined over time.
Using automobile insurance as an example, the number of different possible combinations of rate
classes is so great that it is not possible to assess all of the changes that individual policyholders
will experience in a rate change where base rates, territorial factors, driver classification factors
and accident surcharges all change at the same time.
The economic agent based approach requires a model that analyzes the impact of a rate change at
the individual insured level, taking into account class, territory, etc. The rate impact on detailed
classifications can be assessed and thus the likely behavior of members of each of the
classifications can also be assessed. Combining this retention information with information
regarding the profitability of each of these individual classes produces a powerful tool. This tool
can be used to test a number of different rate scenarios in order to determine an optimal
combination of profitability and retention.
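A rough sketch of how such a tool might compare rate scenarios is shown below. The class-level data, the assumed retention curve, and the candidate scenarios are all hypothetical; a real model would draw these from the company's own experience and from the agent based simulation itself.

import math

# Hypothetical class-level data: current premium, expected loss cost, and policy count.
classes = {
    "preferred": {"premium": 800.0,  "loss_cost": 520.0,  "count": 6000},
    "standard":  {"premium": 1100.0, "loss_cost": 880.0,  "count": 3000},
    "nonstd":    {"premium": 1900.0, "loss_cost": 1750.0, "count": 1000},
}

def retention_rate(rate_change, sensitivity=10.0, midpoint=0.22):
    """Assumed retention curve: retention declines smoothly as the rate change grows."""
    return 1.0 / (1.0 + math.exp(sensitivity * (rate_change - midpoint)))

def evaluate(scenario):
    """Project retained policies and underwriting profit for a set of class-level rate changes."""
    profit, retained = 0.0, 0.0
    for name, data in classes.items():
        change = scenario.get(name, 0.0)
        kept = data["count"] * retention_rate(change)
        new_premium = data["premium"] * (1.0 + change)
        profit += kept * (new_premium - data["loss_cost"])
        retained += kept
    return retained, profit

# Compare a few candidate rate scenarios on retention versus profitability.
scenarios = {
    "uniform +5%":        {"preferred": 0.05, "standard": 0.05, "nonstd": 0.05},
    "target nonstandard": {"preferred": 0.00, "standard": 0.03, "nonstd": 0.15},
}
for label, scenario in scenarios.items():
    retained, profit = evaluate(scenario)
    print(f"{label:20s} retained {retained:8.0f} policies, projected profit {profit:12.0f}")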
For the application we created, the agent based modeling (ABM) approach has advantages over
traditional economic approaches to estimating buyers' price elasticity of demand. Traditional
approaches would require empirical studies of policyholder reaction to rate changes and the
construction of elasticity curves from this analysis. While such approaches are useful in a stable
economic and competitive environment, these conditions rarely exist for an extended period of
time. The ABM approach makes it possible to separate the impact of the economy on a
policyholder's propensity to shop from the impact of price competition on the policyholder's
ability to find an alternate policy at a lower price.
Another advantage of the ABM approach is that it allows for the modeling of emergent behavior.
These are behavioral effects that may seem irrational at the individual level but are exhibited
when the behavior of a group is analyzed as a whole. An example of this phenomenon is the
observed tendency of some groups of insureds to leave when they are presented with a rate
decrease. This seems irrational at an individual level, but the phenomenon is accepted as
regularly occurring.
An additional key issue to note is that, while the model operates at the individual insured level,
the goal of the model is to project the aggregate behavior of an entire book of policyholders.
Thus, precise modeling of the behavior of each individual insured is not required in order to
accurately model the overall behavior of a book of policyholders.
II. An Actual Model in Operation
In constructing such a model, the first decision that must be made is "What are the appropriate
economic agents to include in the model?" In the case of the retention/profitability model, there
are four economic agents in the model. These economic agents are:
1. The policyholders
2. The company that is considering changing its rates
3. The other companies that form the competition in the state
4. The insurance agents who are selling policies
As previously mentioned, factors are the quantitative measures of the system that is being
modeled. For a retention model, the factors would comprise the company's new and old rate sets
as well as the rates of market competitors. In addition, the claim frequencies and severities by
major risk class will also need to be entered into the model. The methodology used to process
this information will be described more fully in section II.D.
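As a purely illustrative example of how these factor inputs might be organized (the rate tables, competitor labels, and frequency/severity figures below are hypothetical placeholders):

# Hypothetical factor inputs for a retention model: the company's old and new rates,
# competitor rates, and claim frequency/severity by major risk class.
model_factors = {
    "company_rates": {
        "old": {("adult", "territory_01"): 820.0, ("youth", "territory_01"): 2150.0},
        "new": {("adult", "territory_01"): 860.0, ("youth", "territory_01"): 2380.0},
    },
    "competitor_rates": {
        "Competitor_A": {("adult", "territory_01"): 845.0, ("youth", "territory_01"): 2290.0},
        "Competitor_B": {("adult", "territory_01"): 910.0, ("youth", "territory_01"): 2200.0},
    },
    "claim_experience": {
        "adult": {"frequency": 0.06, "severity": 9500.0},
        "youth": {"frequency": 0.14, "severity": 11200.0},
    },
}

# Example: the rate change a given class/territory cell would experience.
cell = ("youth", "territory_01")
change = model_factors["company_rates"]["new"][cell] / model_factors["company_rates"]["old"][cell] - 1.0
print(f"Rate change for {cell}: {change:.1%}")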
Once the economic agents in the model are determined, the rules for interaction must then be
determined. Using the structure of the model described above, the rules required for the model
can be developed.
1. Policyholder/Company Interaction
When the policyholder receives his renewal notice, a number of factors will determine his
likelihood of shopping for an alternate insurer. These include the amount of a rate increase
that he sees, his satisfaction with the handling of a claim (if this occurred during the most
recent policy period), his satisfaction with policyholder service that he may have received
throughout the policy period (e.g., for a change in vehicle), past rate changes that the
policyholder has experienced, and the position in the underwriting cycle. The focus of this paper
is the amount of rate increase that the policyholder experiences and the impact that the change
has on the policyholder's propensity to shop and to switch his policy.
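One simple way to express this relationship, offered only as an illustrative functional form and not as the calibration used in the authors' model, is a smooth probability of shopping that rises with the size of the rate change:

import math

def shop_probability(rate_change, midpoint=0.12, steepness=25.0):
    """Hypothetical propensity-to-shop curve: near zero for rate decreases,
    rising steeply once the increase passes the midpoint."""
    return 1.0 / (1.0 + math.exp(-steepness * (rate_change - midpoint)))

# Illustrative behavior across a range of rate changes.
for change in (-0.05, 0.00, 0.05, 0.10, 0.15, 0.25):
    print(f"rate change {change:+.0%}: probability of shopping {shop_probability(change):.2f}")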
The likelihood to shop is related to the concept of the price elasticity of demand. Since