Insurance Fraud Claim Detection - Predictive Model
NISHCHITA, MANISH, VAISHNAVI
Problem Statement
Global Insure, a leading insurance company, processes thousands of claims annually. However, a
significant percentage of these claims turn out to be fraudulent, resulting in considerable financial
losses. The company’s current process for identifying fraudulent claims involves manual
inspections, which is time-consuming and inefficient. Fraudulent claims are often detected too
late in the process, after the company has already paid out significant amounts. Global Insure
wants to improve its fraud detection process using data-driven insights to classify claims as
fraudulent or legitimate early in the approval process. This would minimise financial losses and
optimise the overall claims handling process.
Objective
The objective is to build a model to classify insurance claims as either fraudulent or legitimate
based on historical claim details and customer profiles. By using features such as claim amounts,
customer profiles, claim types and approval times, the company aims to predict the claims that
are likely to be fraudulent before they are approved.
Comparison Summary
Metric         Linear Regression    Random Forest
Accuracy       76.19%               80.95%
Precision      54.43%               60.87%
Recall         59.72%               77.78%
Specificity    82.09%               82.09%
F1 Score       56.95%               68.29%
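For reference, the five metrics in the table all follow from the four cells of a confusion matrix. The sketch below shows the arithmetic. The counts are hypothetical: the source does not give the confusion matrix, and these values were chosen only because they reproduce the Random Forest column when rounded to two decimals.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return the five comparison metrics as fractions between 0 and 1."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)        # share of flagged claims that are fraud
    recall = tp / (tp + fn)           # share of fraud claims that are flagged
    specificity = tn / (tn + fp)      # share of legitimate claims left alone
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts (assumption): picked to match the reported
# Random Forest percentages, not taken from the notebook.
random_forest = classification_metrics(tp=56, fp=36, fn=16, tn=165)

for name, value in random_forest.items():
    print(f"{name}: {value:.2%}")
```

Note that specificity depends only on the legitimate claims (true negatives and false positives), which is how two models can share the same specificity while differing on recall.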
Conclusion:
• Random Forest outperforms Linear Regression for this binary fraud classification task on
accuracy, precision, recall and F1 score; specificity is the same (82.09%) for both models.
• Linear Regression can serve as a baseline, but it is not reliable due to its assumptions and poor
classification ability.
• For better fraud detection:
  • Improve data preprocessing.
  • Optimise hyperparameters.
  • Consider more robust models such as XGBoost, Logistic Regression with regularisation, or
    cost-sensitive learning techniques.
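Two of these suggestions, hyperparameter optimisation and cost-sensitive learning, can be combined in a few lines with scikit-learn. This is a minimal sketch under stated assumptions, not the notebook's code: the data is synthetic (generated by `make_classification`), and the parameter grid and the choice of recall as the tuning score are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the claims data (assumption): about 25% fraud.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.75, 0.25], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Illustrative grid; the values are not tuned for any real data set.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 6],
    "min_samples_leaf": [1, 5],
}

# class_weight="balanced" makes errors on the rarer fraud class cost more,
# which is one simple form of cost-sensitive learning.
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="recall",  # tune for catching fraud rather than raw accuracy
    cv=3,
)
search.fit(X_train, y_train)

test_recall = search.score(X_test, y_test)  # recall on held-out data
print(search.best_params_)
print(f"held-out recall: {test_recall:.2f}")
```

Scoring on recall reflects the goal of catching fraud early; a team more concerned about false alarms could tune on F1 or precision instead.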
How can we analyse historical claim data to
detect patterns that indicate fraudulent claims?
Approach:
Exploratory Data Analysis (EDA): Checked for unusual distributions, high claim amounts, repeated IP
addresses, missing documentation and correlations between variables, and dropped variables that did
not help prediction.
Feature Engineering: Created derived variables such as “Claim Amount / Annual Income” or “Number of
Prior Claims”.
Anomaly Detection: Used statistical techniques such as box plots to spot outliers or rare behaviours.
Model Training: Applied classification models to learn historical patterns of fraud vs non-fraud, using
both Linear Regression and Random Forest techniques.
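The feature engineering and anomaly detection steps can be sketched together: derive the claim-to-income ratio, then flag values beyond the box-plot whiskers (the 1.5 × IQR rule). The records and field names below are toy values invented for illustration, not rows from the claims data.

```python
import statistics

# Toy records with hypothetical field names (assumption).
claims = [
    {"claim_id": 1, "claim_amount": 5_000, "annual_income": 50_000},
    {"claim_id": 2, "claim_amount": 7_200, "annual_income": 60_000},
    {"claim_id": 3, "claim_amount": 3_600, "annual_income": 45_000},
    {"claim_id": 4, "claim_amount": 10_500, "annual_income": 70_000},
    {"claim_id": 5, "claim_amount": 5_500, "annual_income": 50_000},
    {"claim_id": 6, "claim_amount": 40_000, "annual_income": 40_000},
]

# Feature engineering: derived variable "Claim Amount / Annual Income".
for claim in claims:
    claim["claim_to_income"] = claim["claim_amount"] / claim["annual_income"]

# Anomaly detection: the same rule a box plot uses for its whiskers.
ratios = [claim["claim_to_income"] for claim in claims]
q1, _, q3 = statistics.quantiles(ratios, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outlier_ids = [
    claim["claim_id"]
    for claim in claims
    if not lower_fence <= claim["claim_to_income"] <= upper_fence
]
print(outlier_ids)  # -> [6]
```

An outlier flagged this way is a candidate for review, not proof of fraud; the derived ratio can also be passed to the classifier as an ordinary feature.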
Which features are the most predictive of fraudulent
behaviour?
Based on feature importance from the Random Forest model:
collision_type_Rear Collision,
collision_type_Side Collision,
claim_per_vehicle,
insured_hobbies_bungie-jumping, etc.
These features are statistically significant in the logistic regression model (very low p-values).
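A ranking like the one above is typically read from a fitted forest's `feature_importances_` attribute in scikit-learn. The sketch below shows the mechanics on synthetic data; the feature names are placeholders, so the resulting order says nothing about the real claims data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder names and synthetic data (assumption), used only to show the steps.
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances; they are normalised to sum to 1.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:<12}{importance:.3f}")
```

Impurity-based importances can favour high-cardinality features, so cross-checking against the logistic regression p-values, as done here, is a reasonable sanity check.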
Based on past data, can we predict the
likelihood of fraud for an incoming claim?
Yes — the trained model is capable of predicting fraud likelihood for incoming claims,
provided those claims have the same structured input fields and preprocessing steps (e.g.,
encoding, scaling) applied as during training.
The notebook uses a Random Forest Classifier and a Logistic Regression model to identify
fraudulent claims. The models are trained on the historical labelled data provided in the base
data set.
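One way to guarantee that incoming claims receive the same encoding and scaling as the training data is to wrap preprocessing and model in a single scikit-learn `Pipeline`. The sketch below is an assumption about how this could be done, not the notebook's code; the column names and the six training rows are toy values.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy historical data with hypothetical column names (assumption).
history = pd.DataFrame({
    "collision_type": ["Rear Collision", "Side Collision", "Front Collision",
                       "Rear Collision", "Side Collision", "Front Collision"],
    "total_claim_amount": [62_000, 5_400, 48_000, 71_000, 6_100, 52_000],
    "fraud_reported": [1, 0, 0, 1, 0, 0],
})
features = ["collision_type", "total_claim_amount"]

preprocess = ColumnTransformer([
    # handle_unknown="ignore" keeps scoring from failing on unseen categories.
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["collision_type"]),
    ("numeric", StandardScaler(), ["total_claim_amount"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])
model.fit(history[features], history["fraud_reported"])

# An incoming claim must carry the same input fields as the training data.
incoming = pd.DataFrame({
    "collision_type": ["Rear Collision"],
    "total_claim_amount": [65_000],
})
fraud_probability = model.predict_proba(incoming[features])[:, 1]
print(fraud_probability)
```

Because the fitted encoder and scaler live inside the pipeline, scoring a new claim cannot silently skip or re-fit a preprocessing step.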
What insights can be drawn from the model
that can help in improving the fraud detection
process?
1. Fraud is predictable from certain feature values, such as "Total Loss" or "Minor Damage".
2. Class imbalance is a challenge: it raises the risk of overfitting, and the model should favour
catching fraud even at the expense of a few false alarms.
3. Certain incident types drive higher fraud rates: single-vehicle and multi-vehicle collisions have
a notably higher fraud rate than parked-car or theft incidents.
4. Business teams can understand and refine the logic based on domain knowledge.
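The trade-off in point 2, catching more fraud at the cost of a few false alarms, is usually controlled through the decision threshold applied to the model's fraud probability. The sketch below illustrates the effect; the labels and scores are made up for the example.

```python
def recall_and_false_alarms(y_true, scores, threshold):
    """Flag claims scoring at or above the threshold; return (recall, false alarms)."""
    flagged = [score >= threshold for score in scores]
    true_pos = sum(f and t == 1 for f, t in zip(flagged, y_true))
    false_neg = sum((not f) and t == 1 for f, t in zip(flagged, y_true))
    false_pos = sum(f and t == 0 for f, t in zip(flagged, y_true))
    return true_pos / (true_pos + false_neg), false_pos


# Made-up labels (1 = fraud) and predicted fraud probabilities (assumption).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.90, 0.60, 0.35, 0.55, 0.30, 0.20, 0.10, 0.05]

for threshold in (0.5, 0.3):
    recall, false_alarms = recall_and_false_alarms(y_true, scores, threshold)
    print(f"threshold={threshold}: recall={recall:.2f}, false alarms={false_alarms}")
```

In this toy case, lowering the threshold from 0.5 to 0.3 catches the third fraudulent claim while adding one false alarm. Where to set the threshold is the kind of decision business teams can refine with domain knowledge.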