Auto Insurance Fraud Detection Using K-NN Machine Learning
Fraud is one of the largest and best-known problems that insurers face. This article focuses on the claim data of a car insurance company. Fraudulent claims can be very expensive for an insurer, so it is important to know which claims are legitimate and which are not. Insurance companies cannot check every claim by hand, since this would simply cost too much time and money. In this article, we take advantage of the largest asset insurers have in the fight against fraud: data. We use the attributes the insurer records about the claims, the insured people and other circumstances. Separating claims into groups and comparing the fraud rates within those groups provides new insights. Furthermore, we use machine learning to predict which claims are likely to be fraudulent. This narrows down the list of claims that need a further check and enables an insurer to detect more fraudulent claims. We start by gathering and exploring the data in Microsoft Power BI, an easy-to-use business intelligence tool that lets us make insightful visualizations of the data. The K-NN algorithm is implemented in the software package R, which is more flexible and offers almost endless possibilities for operating on the data.
All analysis in this article is performed on a dataset of claims for a car insurance company in the United States. The data consists of 1000 individual claims. The main variable of interest is reported_fraud. This variable is labelled 1 if a claim is reported to be fraudulent and 0 otherwise. Each claim in the data is described by 40 different attributes, which are represented as columns. We can divide the descriptive attributes into four main categories: the insured person, the policy of that person, the description of the incident and the characteristics of the car involved in the incident. The attributes are of both numeric and categorical nature. Examples are the age of the insured, the insured amounts and premiums, the job of the insured, the number of vehicles involved in the incident and the brand of the car for which the claim was made.
We load the dataset into Power BI to clean and analyze it. The first statistic we explore is the fraud rate of the whole dataset (the percentage of all claims reported as fraudulent), which is 24.7%. The fact that almost a quarter of the claims is reported as fraud indicates that fraudulent claims are a very serious problem for this insurer. The next step is to make some simple visualizations of the data to gain more insight. Figure 1 shows several averages separately for claims that are reported as fraud and claims that are not. We do not observe very large differences between fraudulent and non-fraudulent claims in this figure. The most notable one is that the claim amounts of fraudulent claims are higher on average, for injury claims as well as for vehicle damage claims. The difference is largest for the latter. It makes sense that the amounts of fraudulent claims are somewhat higher, since people intend to make money from those claims. Furthermore, the umbrella limit is on average quite a bit higher for fraudulent claims. An umbrella insurance policy is extra liability coverage that goes beyond the limits of the insured’s auto insurance. The umbrella policy kicks in after the regular limit has been reached, and it is most useful for people whose assets have a high total worth. Note that the insured person has an umbrella policy for only 20% of the claims in the data; for all others, the umbrella limit is set to zero. We observe no clear difference in the number of months an insured has been a client before making a claim, as shown at the bottom right. So this gives no direct evidence that customers commit more fraud in the first months of their contract.
The information in figure 2 focuses on the incidents themselves rather than on the characteristics of the insured. At the bottom left, the fraud rate is displayed for each type of incident. The fraud rate for single-vehicle and multi-vehicle collisions is approximately three times as high as for parked-car and vehicle-theft incidents. To put these numbers into perspective, we added a chart that shows the proportion of each incident type among all claims in the data. Single-vehicle and multi-vehicle collisions together make up over 80% of the claims, so these incidents must be considered carefully in the remainder of the analysis. Even more striking is the graph that shows the fraud rate for different severities of incidents. The fraud rate for claims involving major damage is more than 5 times as high as for all other severities. The fact that this group accounts for 27 per cent of the claims makes it an important one to keep in mind.
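The group-wise fraud rates discussed above were produced in Power BI. As a rough illustration of the same calculation, here is a minimal Python sketch. The column name incident_type and the toy rows are assumptions made up for this example and are not taken from the insurer's data; only reported_fraud (1 = fraud, 0 = no fraud) follows the description of the dataset.

```python
from collections import defaultdict


def fraud_rate_by_group(claims, group_key, label_key="reported_fraud"):
    """Return {group: share of claims in that group labelled as fraud}."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for claim in claims:
        group = claim[group_key]
        totals[group] += 1
        frauds[group] += claim[label_key]  # label is 1 for fraud, 0 otherwise
    return {group: frauds[group] / totals[group] for group in totals}


# Made-up toy rows for illustration only, NOT the real claims.
toy_claims = [
    {"incident_type": "Single Vehicle Collision", "reported_fraud": 1},
    {"incident_type": "Single Vehicle Collision", "reported_fraud": 0},
    {"incident_type": "Parked Car", "reported_fraud": 0},
    {"incident_type": "Parked Car", "reported_fraud": 0},
]

rates = fraud_rate_by_group(toy_claims, "incident_type")
print(rates)
```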
THE K-NN ALGORITHM EXPLAINED
After these visualizations, which give us a better understanding of the data at hand, we now move to the prediction of fraudulent claims. We make use of the K-nearest neighbours (K-NN) algorithm. The dependent variable of interest is the fraud variable, which is “yes” if fraud is reported and “no” if no fraud is reported. Because this value is categorical, the problem at hand is a classification problem. The goal is to develop a model that predicts whether a claim is fraudulent or not based on the attributes of that claim. In this section, we explain the basics of the K-NN algorithm. Since our data includes a variable that indicates whether a claim is fraudulent or not, the data is labelled. This gives us the opportunity to use a supervised learning algorithm. This type of algorithm uses a labelled training dataset to learn to predict unlabelled data based on certain attributes. The main idea of the K-NN algorithm is to predict the class (fraud or non-fraud) of an observation (in our case a claim) based on its K nearest neighbours, where K is a number that can be set. Nearness is based on a distance measure, which evaluates the distance between the attributes of two observations. Once the K nearest neighbours are determined, the mode of their labels is predicted as the class of the observation of interest. This process is repeated for each observation. A very simple example of nearest neighbours is shown in figure 3. The 3 nearest neighbours of the unknown observation (the star) are examined and all belong to the red class; therefore, the star is predicted to belong to the red class too.
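The voting idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the R implementation used for the analysis: the function name knn_predict and the toy points are made up, straight-line distance is used as the distance measure, and ties in the vote are resolved in favour of the label that appears first among the nearest neighbours.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` as the most common label among
    its k nearest training points (straight-line distance)."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # most_common(1) returns the mode of the neighbour labels
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data in the spirit of the red/blue example (made up for illustration).
points = [(1, 1), (1, 2), (2, 1), (5, 5), (6, 5), (5, 6)]
labels = ["red", "red", "red", "blue", "blue", "blue"]

star = (1.5, 1.5)
print(knn_predict(points, labels, star, k=3))
```

With K=3, the three closest points to the star are all red, so the sketch predicts red, which matches the reasoning in the example above.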
The next question is how to determine that distance. Different measures of distance are possible, but we use the most common one: the Euclidean distance. This measure is based on the Pythagorean formula and gives the length of the straight line between two points. For a 2-dimensional space, the formula for the Euclidean distance is:
$$d(p, q) = \sqrt{(q_x - p_x)^2 + (q_y - p_y)^2}$$
where q is the first and p is the second point, and the subscripts denote the x- and y-coordinates in a 2-dimensional plane. So, when we use only two attributes in our K-NN algorithm, the distance can be calculated with this formula. When we include more attributes, the formula extends to:
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$
where n is the number of attributes.
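As a small worked example of the Euclidean distance, the following Python sketch writes out the sum of squared differences explicitly. The function name euclidean_distance and the sample points are chosen for illustration only.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number
    of attributes: square root of the summed squared differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of attributes")
    return math.sqrt(sum((q_i - p_i) ** 2 for p_i, q_i in zip(p, q)))


# Two attributes: the classic 3-4-5 right triangle.
print(euclidean_distance((0, 0), (3, 4)))

# Three attributes: the same formula with one more squared term.
print(euclidean_distance((1, 2, 3), (1, 2, 5)))
```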
Now that the main principle behind K-NN is clear, we can start preparing our data for the algorithm. Before we can apply K-NN, we have to deal with three problems in our dataset: the scale of the variables differs, some variables are categorical instead of numeric, and the data contains more non-fraudulent than fraudulent claims. This section explains why these are problems for K-NN and how we address them. After that, we apply K-NN to the data. We use the software package R to perform all operations on the data in this section.
We start by looking only at the numeric attributes. Examples are the annual premium, the number of cars involved in the incident and the amount of money that is claimed. One can imagine that claim amounts are far larger in absolute terms than the number of cars involved. More importantly, differences in claim amounts can run into the thousands, whereas the number of cars involved is always between zero and five. As a result, the K-NN algorithm will weigh the claim amounts much more heavily than the number of cars, since it relies only on distances. To address this problem, we normalize all numeric variables to values between 0 and 1 by applying the following formula:
$$x^{*}_{\text{norm}} = \frac{x^{*} - \min(x)}{\max(x) - \min(x)}$$
to each value x* of each attribute x. In this way, we keep the relative distances between the values of an attribute, and all numeric attributes are weighed equally by the algorithm.
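The min-max normalization described here can be sketched as follows in Python. The function name and the sample amounts are made up for illustration. Returning zeros for a constant attribute is a design choice of this sketch to avoid dividing by zero, not something taken from the analysis.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Constant attribute: every value is the same, so map all to 0.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


# Made-up claim amounts and vehicle counts on very different scales.
claim_amounts = [1000, 3000, 5000]
vehicles_involved = [1, 2, 3]

print(min_max_normalize(claim_amounts))
print(min_max_normalize(vehicles_involved))
```

After rescaling, both attributes span the same range, so neither dominates the distance calculation.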
The second problem has to do with the categorical variables. Unfortunately, we cannot directly measure the distance between different genders or different types of incidents, because these attributes simply do not contain numbers. The easiest and most common solution to this problem is to create dummies for each variable. A dummy is 0 (false) or 1 (true), indicating whether a claim belongs to a certain category. If an attribute consists of 3 different categories, we need to create 2 dummy variables, each corresponding to one of the categories. If both dummies are 0, the claim belongs to the 3rd category. Through this process, we can include categorical variables in the K-NN algorithm. As we observed in figure 2, categorical variables such as the incident type could be helpful in determining whether a claim is fraudulent or not.
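The dummy encoding described above, where one category serves as the baseline and gets no column of its own, might look like this in Python. The function name make_dummies, the is_ prefix and the category values are assumptions for this sketch only.

```python
def make_dummies(values, baseline):
    """Encode a categorical attribute as 0/1 dummy columns.

    Every category except `baseline` gets its own dummy; a row with
    all dummies equal to 0 belongs to the baseline category.
    """
    categories = sorted(set(values) - {baseline})
    return [
        {f"is_{category}": int(value == category) for category in categories}
        for value in values
    ]


# Made-up attribute with 3 categories, so 2 dummies are created.
incident_types = ["collision", "theft", "parked", "collision"]
encoded = make_dummies(incident_types, baseline="parked")

for row in encoded:
    print(row)
```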
After normalizing the data and creating dummies, we divide the dataset into a training set and a test set. The train/test division is 80%/20%. The training set is used to build the model. That model is used to predict whether the claims in the test set are fraudulent or not, and the predictions are compared to the real classification to assess the performance of the model. We run the algorithm for values of K ranging from 1 to 40, since it cannot be known beforehand which value gives the best prediction. Some performance metrics are displayed in the graphs in figure 4. The overall accuracy, which is the percentage of all predictions that are correct, is maximal around K=12 at about 80 per cent. A statistic that is especially important for the problem at hand is the true positive rate. This is the percentage of fraudulent cases that are correctly predicted as fraudulent. We can see that this value reaches a maximum of around 40%. Hence, the model can only identify 40% of the fraudulent cases as fraud. Since it is far more costly to the insurer to incorrectly identify a fraudulent claim as non-fraudulent, this is a major drawback of the model. The true negative rate is high, so the model is very good at predicting the non-fraud cases as non-fraud. Unfortunately, this is not the central objective of our analysis.
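To make the evaluation step concrete, here is a minimal Python sketch of an 80%/20% split and of the three metrics mentioned above. The function names, the fixed random seed and the toy labels are assumptions for illustration. The sketch also assumes that both classes occur in the test labels, otherwise the rates would divide by zero.

```python
import random


def split_train_test(rows, test_share=0.2, seed=42):
    """Shuffle the rows and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


def classification_rates(actual, predicted, positive=1):
    """Accuracy, true positive rate and true negative rate."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "true_positive_rate": tp / (tp + fn),  # share of fraud that is caught
        "true_negative_rate": tn / (tn + fp),  # share of non-fraud kept clean
    }


train, test = split_train_test(range(10))

# Made-up labels: 4 fraud cases of which 2 are caught, 6 non-fraud cases.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
rates = classification_rates(actual, predicted)
print(rates)
```

In this toy case the accuracy looks acceptable while half of the fraud cases are missed, which is the same pattern the article describes for the imbalanced data.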
This brings us to the last of the three problems stated above. We have to deal with so-called ‘imbalanced’ data. This means that the observations in the data are unequally divided over the different classes of the dependent variable. The car insurance data consists of approximately 75% non-fraudulent claims and 25% fraudulent claims. For K-NN this means, intuitively, that the chance of a non-fraudulent nearest neighbour is higher than that of a fraudulent one, simply because the data is imbalanced. This could bias the algorithm towards predicting non-fraud more often than it should, which is exactly what we observe in figure 4. To address this problem, we want to make the dataset more balanced. Random oversampling is a suitable solution for the problem at hand. This method randomly duplicates observations from the minority class (fraud) and adds them to the training dataset until the fraud/non-fraud ratio is 50/50. We perform no actions on the test set, to make sure that the performance can be measured correctly.
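Random oversampling as described above can be sketched in Python as follows. The function name, the seed and the toy data are assumptions for this example. The sketch assumes that the minority class has at least one observation, and it should be applied to the training set only.

```python
import random


def random_oversample(rows, labels, minority_label, seed=42):
    """Duplicate randomly chosen minority rows until both classes
    have the same number of observations."""
    rng = random.Random(seed)
    minority_rows = [
        row for row, label in zip(rows, labels) if label == minority_label
    ]
    majority_count = sum(1 for label in labels if label != minority_label)
    extra_needed = max(majority_count - len(minority_rows), 0)
    # Sample with replacement from the existing minority rows.
    extra_rows = [rng.choice(minority_rows) for _ in range(extra_needed)]
    new_rows = list(rows) + extra_rows
    new_labels = list(labels) + [minority_label] * len(extra_rows)
    return new_rows, new_labels


# Made-up training set: 2 fraud rows (label 1) and 6 non-fraud rows.
train_rows = ["a", "b", "c", "d", "e", "f", "g", "h"]
train_labels = [1, 1, 0, 0, 0, 0, 0, 0]

balanced_rows, balanced_labels = random_oversample(
    train_rows, train_labels, minority_label=1
)
print(balanced_labels.count(1), balanced_labels.count(0))
```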
Now we run K-NN for all values of K from 1 to 60 on the oversampled training dataset. Predicting for the test set gives us the following results:
As can be observed at the bottom left, the true positive rate has increased to over 75% for higher values of K. The price paid for this is a lower true negative rate, which is now under 75% for K=40, compared with approximately 99% when the model was trained without oversampling. For that same K, the true positive rate increased from 10% to 75%, which is a very important improvement. In the end, the goal of our analysis is to identify fraudulent cases. It is far more costly to miss cases of fraud than to investigate some cases that turn out not to be fraudulent.
Our last step in fitting a model to the car-insurance data is determining a good value for K. In our view, the true positive rate is the most important performance metric for this problem. We find the highest true positive rate to be 84.4% for K=6. The same results show, however, that the overall accuracy and the true negative rate are very low for this value of K: the accuracy is 63% and the true negative rate is 55%. Therefore, we decided not to take the peak at K=6 but to look for another high value of the true positive rate. The highest true positive rate for K>10 is 82.6% at K=17. There, the overall accuracy is 72.5% and the true negative rate is 68.3%, which are both clear improvements compared to the values for K=6, while the true positive rate is only slightly lower. For this reason, we decide to use the model with K=17.
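The selection rule used here, taking the highest true positive rate among values of K above a threshold, can be written down as a short Python sketch. Only the two values of K whose metrics are quoted in the text are included; the function name pick_k and the dictionary layout are assumptions for illustration.

```python
# Metrics quoted in the text for the oversampled training set.
reported_results = {
    6: {"true_positive_rate": 0.844, "accuracy": 0.63, "true_negative_rate": 0.55},
    17: {"true_positive_rate": 0.826, "accuracy": 0.725, "true_negative_rate": 0.683},
}


def pick_k(results, min_k):
    """Return the K above `min_k` with the highest true positive rate."""
    candidates = {k: metrics for k, metrics in results.items() if k > min_k}
    return max(candidates, key=lambda k: candidates[k]["true_positive_rate"])


chosen_k = pick_k(reported_results, min_k=10)
print(chosen_k)
```

Without the K>10 restriction the same rule would pick K=6, which is the peak the article deliberately avoids because of its low accuracy and true negative rate.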
Fraud identification is a persistent issue for all insurers. We have built a case study for the car-insurance industry using data that holds a lot of information on all claims. First, we gathered and cleaned the data in Power BI. Then, we separated different groups of claims and looked for notable differences in fraud rates between them by making some simple, insightful visualizations. We believe it is crucial to develop an understanding of the data before applying machine learning algorithms that make predictions based on many different characteristics of the claims. We chose the K-nearest neighbours algorithm to predict whether a claim is fraudulent or not, for various reasons. One of them is that the algorithm is intuitive, which makes it easy to explain to non-data scientists at a company so that they can trust it and understand what is going on. Another reason is that it does not need a training period: it just stores the data and uses it to predict. Some disadvantages of the algorithm are that it is sensitive to imbalanced data, that it is designed for numerical attributes only and that the data needs to be scaled. We addressed all these problems and identified more than 75% of the fraudulent cases in the test set. Of course, there is always room for improvement and further development. An important question that should be asked is whether this data was labelled correctly in the first place. How can we be sure that all fraud cases were discovered in the past? To address this issue, we could think of using an unsupervised algorithm, which does not use labels. The steps used in this article appear to work well for this dataset of a car-insurance company, and we believe they could easily be applied to fraud prediction in other types of insurance and in other industries that deal with large quantities of data and fraud.
Provost, F., & Fawcett, T. (2013). Data Science for Business (1st ed.). Sebastopol, California: O’Reilly.
R (Version 3.6.3) [Software]. (2020). Retrieved from https://www.r-project.org