
Insurance Fraud Claim Detection - Predictive Model

Global Insure aims to enhance its fraud detection process by developing a predictive model to classify insurance claims as fraudulent or legitimate using historical data. The Random Forest model outperforms Linear Regression in key metrics, suggesting the need for improved data preprocessing and consideration of more robust models. Insights from the model indicate that certain features and incident types are predictive of fraud, highlighting the importance of addressing class imbalance in the data.

Uploaded by

Ashirvad Vatsa

Insurance Fraud Claim Detection: Predictive Model
Nishchita, Manish, Vaishnavi
Problem Statement
Global Insure, a leading insurance company, processes thousands of claims annually. However, a
significant percentage of these claims turn out to be fraudulent, resulting in considerable financial
losses. The company’s current process for identifying fraudulent claims involves manual
inspections, which is time-consuming and inefficient. Fraudulent claims are often detected too
late in the process, after the company has already paid out significant amounts. Global Insure
wants to improve its fraud detection process using data-driven insights to classify claims as
fraudulent or legitimate early in the approval process. This would minimise financial losses and
optimise the overall claims handling process.
Objective
The objective is to build a model to classify insurance claims as either fraudulent or legitimate
based on historical claim details and customer profiles. By using features such as claim amounts,
customer profiles, claim types and approval times, the company aims to predict the claims that
are likely to be fraudulent before they are approved.
Comparison Summary

Metric         Linear Regression    Random Forest
Accuracy       76.19%               80.95%
Precision      54.43%               60.87%
Recall         59.72%               77.78%
Specificity    82.09%               82.09%
F1 Score       56.95%               68.29%
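The five metrics in the table can all be derived from the four cells of a confusion matrix. The sketch below shows the formulas. The counts passed in are illustrative: they were chosen to be consistent with the Random Forest column, and the source does not report the raw counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall, specificity and F1 as fractions."""
    precision = tp / (tp + fp)   # share of flagged claims that are fraud
    recall = tp / (tp + fn)      # share of fraud claims that are flagged
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),  # share of legitimate claims left alone
        "f1": 2 * precision * recall / (precision + recall),
    }


# Illustrative counts only (assumption, not taken from the source).
metrics = classification_metrics(tp=56, fp=36, tn=165, fn=16)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Recall and specificity are computed on different rows of the confusion matrix, which is why a model can improve recall while specificity stays unchanged.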


Conclusion:
• Random Forest outperforms Linear Regression on accuracy, precision, recall and F1 score for this binary fraud classification task; the two models tie on specificity (82.09%).
• Linear Regression can serve as a baseline, but it is not reliable here: it is designed to predict a continuous value rather than a class, and its classification results are weaker.

• For better fraud detection:
  • Improve data preprocessing.
  • Optimize hyperparameters.
  • Consider more robust models like XGBoost, Logistic Regression with regularization, or cost-sensitive learning techniques.
How can we analyse historical claim data to
detect patterns that indicate fraudulent claims?

Approach:

Exploratory Data Analysis (EDA): Check for unusual distributions, high claim amounts, repeated IP
addresses, missing documentation and correlations between variables, and drop variables that do not
help prediction.

Feature Engineering: Create derived variables like “Claim Amount / Annual Income” or “Number of Prior
Claims”.
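A minimal sketch of the two derived variables named above, using plain Python dictionaries. The field names (`customer_id`, `claim_amount`, `annual_income`) and the sample records are hypothetical, and the claims are assumed to be listed in chronological order.

```python
from collections import Counter


def add_derived_features(claims):
    """Add a claim-to-income ratio and a running count of prior claims."""
    prior_counts = Counter()
    enriched = []
    for claim in claims:  # assumed to be sorted oldest first
        row = dict(claim)
        income = claim["annual_income"]
        # Guard against a zero or missing income instead of dividing by zero.
        row["claim_to_income"] = claim["claim_amount"] / income if income else None
        # Number of earlier claims from the same customer.
        row["prior_claims"] = prior_counts[claim["customer_id"]]
        prior_counts[claim["customer_id"]] += 1
        enriched.append(row)
    return enriched


# Hypothetical sample records.
claims = [
    {"customer_id": "C1", "claim_amount": 5000, "annual_income": 50000},
    {"customer_id": "C2", "claim_amount": 30000, "annual_income": 40000},
    {"customer_id": "C1", "claim_amount": 12000, "annual_income": 50000},
]
features = add_derived_features(claims)
```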

Anomaly Detection: Use statistical techniques like box plots to spot outliers or rare behaviours.
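A box plot marks as outliers the points that fall more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule directly with the standard library; the claim amounts are made-up values for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside the box-plot fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical claim amounts with one unusually large claim.
claim_amounts = [1200, 1500, 1700, 1800, 2000, 2100, 2300, 2500, 15000]
outliers = iqr_outliers(claim_amounts)
print(outliers)
```

An outlier found this way is a candidate for review, not proof of fraud; a legitimate total-loss claim can also be unusually large.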

Model Training: Apply classification models to learn historical patterns of fraud vs non-fraud. Both
Linear Regression and Random Forest techniques were used.
Which features are the most predictive of fraudulent
behaviour?

Based on feature importance from the Random Forest model:

• collision_type_Rear Collision
• collision_type_Side Collision
• claim_per_vehicle
• insured_hobbies_bungie-jumping, etc.

These features are also statistically significant in the logistic regression model (very low p-values).
Based on past data, can we predict the
likelihood of fraud for an incoming claim?

Yes. The trained model can predict fraud likelihood for incoming claims,
provided those claims have the same structured input fields and go through the same
preprocessing steps (e.g., encoding, scaling) that were applied during training.

The notebook uses a Random Forest Classifier and a Logistic Regression model to identify
fraudulent claims. The model is trained on the historical labelled data provided in the base
data set.
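The requirement that an incoming claim goes through the same preprocessing as the training data can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the category list reuses the two `collision_type` values named in the feature list, while the `claim_amount` field and its mean and standard deviation are hypothetical stand-ins for values that would be saved at training time.

```python
# Parameters that would be fitted on the training data and stored with the model.
# The category values come from the feature names above; the numbers are hypothetical.
TRAIN_CATEGORIES = {"collision_type": ["Rear Collision", "Side Collision"]}
TRAIN_SCALING = {"claim_amount": (5000.0, 2000.0)}  # (mean, standard deviation)


def preprocess(claim):
    """Encode and scale one incoming claim using the stored training parameters."""
    row = {}
    # One-hot encode: one 0/1 column per category seen during training.
    for field, categories in TRAIN_CATEGORIES.items():
        for category in categories:
            row[f"{field}_{category}"] = 1 if claim.get(field) == category else 0
    # Standardise with the training mean and deviation, never the new claim's own.
    for field, (mean, std) in TRAIN_SCALING.items():
        row[field] = (claim[field] - mean) / std
    return row


incoming = {"collision_type": "Rear Collision", "claim_amount": 7000.0}
model_input = preprocess(incoming)
print(model_input)
```

A category that was never seen during training produces all zeros in its one-hot columns, so the model still receives the column layout it was trained on.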
What insights can be drawn from the model
that can help in improving the fraud detection
process?
1. Fraud is predictable from certain feature values, such as "Total Loss" or "Minor Damage".

2. Class imbalance is a challenge: it raises the risk of overfitting, and the model should favour
catching fraud even at the expense of a few false alarms.
3. Certain incident types drive higher fraud rates: single-vehicle and multi-vehicle collisions have
a notably higher fraud rate than parked-car or theft incidents.

4. Business teams can understand and refine the logic based on domain knowledge.
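One common way to act on the class-imbalance insight is to weight each class inversely to its frequency, so that a missed fraud case costs more during training than a false alarm. The sketch below computes such weights with the formula n_samples / (n_classes * class_count), which is the heuristic behind scikit-learn's `class_weight="balanced"` option. The label counts are hypothetical, and the source does not say whether the notebook used class weights.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 0 = legitimate, 1 = fraudulent.
labels = [0] * 75 + [1] * 25
weights = balanced_class_weights(labels)
print(weights)
```

With these weights the rarer fraud class counts three times as much per example as the legitimate class, which pushes the model toward higher recall at some cost in precision.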
