Detecting transaction fraud
In this guide, you'll learn how to detect fraudulent transactions in a banking dataset using a logistic regression model.
Financial datasets are important to many researchers and companies exploring the fraud detection domain.
Here we present a synthetic dataset generated with a simulator called PaySim. PaySim uses aggregated data from private datasets to generate one that resembles normal transaction activity, then injects malicious behaviour so that the performance of fraud detection methods can later be evaluated.
PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company that provides the mobile financial service, which is currently running in more than 14 countries around the world.
This synthetic dataset is scaled down to 1/4 of the original dataset and was created just for Kaggle.
The dataset can be found here.
Below is an example row from the dataset, with an explanation of each field:
1, PAYMENT, 1060.31, C429214117, 1089.0, 28.69, M1591654462, 0.0, 0.0, 0, 0
- step: maps a unit of time in the real world. Here, 1 step is 1 hour; there are 744 steps in total (a 30-day simulation).
- type: one of CASH-IN, CASH-OUT, DEBIT, PAYMENT, or TRANSFER.
- amount: amount of the transaction in local currency.
- nameOrig: the customer who started the transaction.
- oldbalanceOrg: the initial balance before the transaction.
- newbalanceOrig: the new balance after the transaction.
- nameDest: the customer who is the recipient of the transaction.
- oldbalanceDest: the recipient's initial balance before the transaction. Note that there is no information for customers whose name starts with M (merchants).
- newbalanceDest: the recipient's new balance after the transaction. Again, there is no information for customers whose name starts with M (merchants).
- isFraud: marks transactions made by fraudulent agents inside the simulation. In this dataset, the fraudulent agents aim to profit by taking control of customers' accounts, emptying the funds by transferring them to another account, and then cashing out of the system.
- isFlaggedFraud: the business model aims to control massive transfers from one account to another and flags illegal attempts. In this dataset, an illegal attempt is an attempt to transfer more than 200,000 in a single transaction.
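To make the field list concrete, here is a minimal sketch in pandas that builds a one-row frame from the example row above. The column names follow the field list; the CSV file name in the comment is an assumption, so adjust it to wherever you saved the Kaggle download.

```python
import pandas as pd

# Column names taken from the field list above.
columns = ["step", "type", "amount", "nameOrig", "oldbalanceOrg",
           "newbalanceOrig", "nameDest", "oldbalanceDest",
           "newbalanceDest", "isFraud", "isFlaggedFraud"]

# The example row from above, parsed into a one-row frame for illustration.
row = [1, "PAYMENT", 1060.31, "C429214117", 1089.0, 28.69,
       "M1591654462", 0.0, 0.0, 0, 0]
df = pd.DataFrame([row], columns=columns)

# In practice you would load the full file instead (file name is an assumption):
# df = pd.read_csv("paysim_log.csv")
print(df[["step", "type", "amount", "isFraud"]])
```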
Very often datasets are imbalanced. That is, the number of instances for each of the classes in the target variable that you want to predict is not proportional to the real importance of each class in your problem. Usually, the class of interest is not the majority class. Imagine a dataset containing clickstream data that you want to use to create a predictive advertising application. The number of instances of users that did not click on an ad would probably be much higher than the number of click-through instances. So when you build a statistical machine-learning model of an imbalanced dataset, the majority (i.e., most prevalent) class will outweigh the minority classes. These datasets usually lead you to build predictive models with suboptimal classification performance. This problem is known as the class-imbalance problem and occurs in a multitude of domains (fraud prevention, intrusion detection, churn prediction, etc).
A simple solution to cope with imbalanced datasets is re-sampling. That is, undersampling the majority class or oversampling the minority classes. In the Data Intelligence Module, you can easily implement re-sampling by using multi-datasets and sampling for each class differently.
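Outside the Data Intelligence Module, the undersampling idea can be sketched in a few lines of pandas. This toy example (the data is invented for illustration) samples the majority class down to the size of the minority class:

```python
import pandas as pd

# Toy imbalanced frame: 8 legitimate rows, 2 fraudulent ones.
df = pd.DataFrame({"amount": range(10),
                   "isFraud": [0] * 8 + [1] * 2})

fraud = df[df["isFraud"] == 1]
legit = df[df["isFraud"] == 0]

# Undersample the majority class to match the minority class size.
legit_down = legit.sample(n=len(fraud), random_state=42)
balanced = pd.concat([legit_down, fraud])
print(balanced["isFraud"].value_counts())
```

Oversampling is the mirror image: sample the minority class with `replace=True` up to the majority count instead.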
Another way, which does not discard any information and works closer to the root of the problem, is to use weights. That is, weighting instances according to the importance they have in your problem.
As you can see, weighting makes us aware of predicted-class outcomes that are under-represented in the input data and would otherwise be overshadowed by over-represented values.
To do this, select balanced weights, as shown below.
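The article configures balanced weights in a UI; as a rough equivalent (an assumption, not the tool used here), scikit-learn implements the same idea via `class_weight="balanced"`, which weights each class inversely to its frequency:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced labels: 90 legitimate, 10 fraudulent (synthetic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# "balanced" computes weight_c = n_samples / (n_classes * n_samples_in_class_c),
# so class 0 gets 100/(2*90) ~ 0.56 and class 1 gets 100/(2*10) = 5.0.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(weights)

model = LogisticRegression(class_weight="balanced").fit(X, y)
```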
Right after training, we can visualize the decision tree as a tree graph. Hovering the mouse over a node opens a prediction path next to it, with the values and fields. In this example, we can see that the probability of fraud is around 93% based on the values shown. It is worth remembering that the model has not yet been tested; this is only preliminary information whose purpose is more visual than accurate.
After training the model, we need to evaluate its performance. It is somewhat challenging to assess performance on an imbalanced dataset, even one balanced during training, so we must choose the metric that most faithfully reflects the reality of the proposed problem.
Among the options below, I chose accuracy because it shows the number of correctly classified instances out of the total number of measured instances.
In other words, the trained model can classify the classes, not fraud (0) and fraud (1), with 98.3% accuracy. Its confusion matrix shows the counts of correctly and incorrectly identified classes.
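Both numbers can be reproduced with scikit-learn. The labels below are invented for illustration (they are not the article's 98.3% result); the point is only to show how accuracy and the confusion matrix are computed:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative labels: 1 = fraud, 0 = not fraud (synthetic, not the article's data).
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
print(f"accuracy: {acc:.1%}")  # 8 of 10 correct -> 80.0%
print(cm)  # rows: true class, columns: predicted class
```

Note that on heavily imbalanced data the confusion matrix is worth reading alongside accuracy: a model that predicts "not fraud" for everything can still score very high accuracy.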
Making predictions and interpreting the results is the main goal of machine learning, so that we can generate value for our business. The image below shows how much importance our model assigned to each of the fields used for training.
Note that the TYPE field accounts for about 41% of the prediction, meaning its impact on fraud identification is nearly half of all the fields used for learning. That is, whether a transaction was made through debit or payment, for example, has enormous relevance for identifying fraud.
The other fields follow in the same way. Common sense might suggest that the transferred AMOUNT would be the most important field, but according to the model it is not: it corresponds to only about 10% of the prediction. This is a new, intelligent way of using data and statistics to inform and support business decisions.
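Field importances like these can be read off a trained tree model via scikit-learn's `feature_importances_`. The sketch below uses synthetic stand-in data (the real importances, TYPE ~41% and AMOUNT ~10%, come from the article's model, not from this toy example), where fraud is deliberately tied to the first column:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; column names are hypothetical.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))          # columns: type_code, amount, step
y = (X[:, 0] > 1).astype(int)          # fraud depends only on the first column

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# Importances sum to 1; here the first feature should dominate by construction.
for name, imp in zip(["type", "amount", "step"], tree.feature_importances_):
    print(f"{name}: {imp:.0%}")
```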