Introduction: This report explores a credit card fraud detection dataset from Kaggle.com. The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions. It contains only numerical input variables, which are the result of a PCA transformation. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are ‘Time’ and ‘Amount’. Feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature ‘Amount’ is the transaction amount. Feature ‘Class’ is the response variable; it takes value 1 in case of fraud and 0 otherwise.
Summary of the dataset:
## Time V1 V2
## Min. : 0 Min. :-56.40751 Min. :-72.71573
## 1st Qu.: 54202 1st Qu.: -0.92037 1st Qu.: -0.59855
## Median : 84692 Median : 0.01811 Median : 0.06549
## Mean : 94814 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.:139321 3rd Qu.: 1.31564 3rd Qu.: 0.80372
## Max. :172792 Max. : 2.45493 Max. : 22.05773
## V3 V4 V5
## Min. :-48.3256 Min. :-5.68317 Min. :-113.74331
## 1st Qu.: -0.8904 1st Qu.:-0.84864 1st Qu.: -0.69160
## Median : 0.1799 Median :-0.01985 Median : -0.05434
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 1.0272 3rd Qu.: 0.74334 3rd Qu.: 0.61193
## Max. : 9.3826 Max. :16.87534 Max. : 34.80167
## V6 V7 V8
## Min. :-26.1605 Min. :-43.5572 Min. :-73.21672
## 1st Qu.: -0.7683 1st Qu.: -0.5541 1st Qu.: -0.20863
## Median : -0.2742 Median : 0.0401 Median : 0.02236
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.3986 3rd Qu.: 0.5704 3rd Qu.: 0.32735
## Max. : 73.3016 Max. :120.5895 Max. : 20.00721
## V9 V10 V11
## Min. :-13.43407 Min. :-24.58826 Min. :-4.79747
## 1st Qu.: -0.64310 1st Qu.: -0.53543 1st Qu.:-0.76249
## Median : -0.05143 Median : -0.09292 Median :-0.03276
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.59714 3rd Qu.: 0.45392 3rd Qu.: 0.73959
## Max. : 15.59500 Max. : 23.74514 Max. :12.01891
## V12 V13 V14
## Min. :-18.6837 Min. :-5.79188 Min. :-19.2143
## 1st Qu.: -0.4056 1st Qu.:-0.64854 1st Qu.: -0.4256
## Median : 0.1400 Median :-0.01357 Median : 0.0506
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.6182 3rd Qu.: 0.66251 3rd Qu.: 0.4931
## Max. : 7.8484 Max. : 7.12688 Max. : 10.5268
## V15 V16 V17
## Min. :-4.49894 Min. :-14.12985 Min. :-25.16280
## 1st Qu.:-0.58288 1st Qu.: -0.46804 1st Qu.: -0.48375
## Median : 0.04807 Median : 0.06641 Median : -0.06568
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.64882 3rd Qu.: 0.52330 3rd Qu.: 0.39968
## Max. : 8.87774 Max. : 17.31511 Max. : 9.25353
## V18 V19 V20
## Min. :-9.498746 Min. :-7.213527 Min. :-54.49772
## 1st Qu.:-0.498850 1st Qu.:-0.456299 1st Qu.: -0.21172
## Median :-0.003636 Median : 0.003735 Median : -0.06248
## Mean : 0.000000 Mean : 0.000000 Mean : 0.00000
## 3rd Qu.: 0.500807 3rd Qu.: 0.458949 3rd Qu.: 0.13304
## Max. : 5.041069 Max. : 5.591971 Max. : 39.42090
## V21 V22 V23
## Min. :-34.83038 Min. :-10.933144 Min. :-44.80774
## 1st Qu.: -0.22839 1st Qu.: -0.542350 1st Qu.: -0.16185
## Median : -0.02945 Median : 0.006782 Median : -0.01119
## Mean : 0.00000 Mean : 0.000000 Mean : 0.00000
## 3rd Qu.: 0.18638 3rd Qu.: 0.528554 3rd Qu.: 0.14764
## Max. : 27.20284 Max. : 10.503090 Max. : 22.52841
## V24 V25 V26
## Min. :-2.83663 Min. :-10.29540 Min. :-2.60455
## 1st Qu.:-0.35459 1st Qu.: -0.31715 1st Qu.:-0.32698
## Median : 0.04098 Median : 0.01659 Median :-0.05214
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.43953 3rd Qu.: 0.35072 3rd Qu.: 0.24095
## Max. : 4.58455 Max. : 7.51959 Max. : 3.51735
## V27 V28 Amount Class
## Min. :-22.565679 Min. :-15.43008 Min. : 0.00 0:284315
## 1st Qu.: -0.070840 1st Qu.: -0.05296 1st Qu.: 5.60 1: 492
## Median : 0.001342 Median : 0.01124 Median : 22.00
## Mean : 0.000000 Mean : 0.00000 Mean : 88.35
## 3rd Qu.: 0.091045 3rd Qu.: 0.07828 3rd Qu.: 77.17
## Max. : 31.612198 Max. : 33.84781 Max. :25691.16
The raw dataset consists of 284807 transaction records, of which 492 are fraudulent. There are no missing values in the dataset. I also found 1081 duplicate records. These duplicates will be removed before I create the t-SNE plot, to avoid error messages.
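The duplicate check described above can be sketched as follows. This is only an illustration in plain Python on made-up rows; the helper name `drop_duplicate_rows` is hypothetical and is not taken from the original analysis.

```python
def drop_duplicate_rows(rows):
    """Return rows with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # tuples are hashable, so they can be stored in a set
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up rows of the form (Time, Amount, Class); not real dataset values.
toy_rows = [
    (0, 10.0, 0),
    (0, 25.5, 0),
    (1, 99.99, 0),
    (0, 25.5, 0),  # exact duplicate of the second row
]

deduped = drop_duplicate_rows(toy_rows)
n_duplicates = len(toy_rows) - len(deduped)
print(f"{n_duplicates} duplicate row(s) removed, {len(deduped)} rows kept")
```

Only rows that match on every column are treated as duplicates here, which is an assumption about how the 1081 duplicates were counted.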
Explore the Class:
The dataset is highly imbalanced: fraudulent records account for only 0.172% of all transactions.
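The imbalance can be checked directly from the class counts in the summary output. The short sketch below also shows why plain accuracy is a poor metric for this data; the variable names are illustrative.

```python
# Class counts taken from the summary output above.
n_normal = 284315
n_fraud = 492
n_total = n_normal + n_fraud

fraud_rate = n_fraud / n_total

# A classifier that labels every transaction "normal" is almost always
# right, so plain accuracy says very little about fraud detection.
always_normal_accuracy = n_normal / n_total

print(f"total transactions: {n_total}")
print(f"fraud rate: {fraud_rate:.4%}")
print(f"accuracy of an always-normal classifier: {always_normal_accuracy:.3%}")
```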
Explore the Time:
Histogram of Time per minute
Histogram of Time per hour
The density of Time
The largest value of Time is 172792 seconds, which is roughly 48 hours. There appear to be two peaks and two troughs during these two days. I assume the peaks occur in the daytime and the troughs at night. I wonder if I can transform Time into Hour, a categorical variable representing the hour of the day, assuming the time starts from 12:00am.
Most transactions are made between 9:00 and 22:00, which is the rush period.
Explore the Amount:
Histogram of Amount
The distribution of Amount is highly skewed. Plotted on a log scale, it looks like a roughly normal but bimodal distribution.
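A minimal sketch of a log transform for Amount is shown below. The report does not state the exact formula, so the `+ 1` offset is an assumption made here because the minimum Amount is 0 and log10(0) is undefined; the function name `log_amount` is hypothetical.

```python
import math


def log_amount(amount):
    """log10 of the amount, shifted by 1 so that an amount of 0 maps to 0.

    The +1 offset is an assumption; the report does not state which offset,
    if any, was used.
    """
    return math.log10(amount + 1.0)


# Quartiles, mean and maximum of Amount from the summary output above.
for amount in (0.00, 5.60, 22.00, 77.17, 88.35, 25691.16):
    print(f"{amount:>10.2f} -> {log_amount(amount):.3f}")
```

On this scale the maximum of 25691.16 sits between 4 and 5 instead of dwarfing the bulk of the data, which is what makes the histogram readable.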
Let’s plot the histograms of V1-V28.
Explore V1
## [1] "V1 kurtosis=35.486088 skewness=-3.280650"
The red line is the density function of a normal distribution with V1’s mean and standard deviation. Clearly, V1 is not normally distributed. V1 is highly left-skewed, so I can’t see the details. Let’s apply the transformation log10(-x+3), which converts the long tail into a better shape.
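The effect of the log10(-x+3) transformation on skewness and kurtosis can be sketched on a small made-up sample. The moment formulas below assume the non-excess definition of kurtosis (a normal distribution gives 3), which the near-3 values reported for V13 and V15 suggest; the sample values are invented and are not V1 data.

```python
import math


def central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    # Non-excess kurtosis: equals 3 for a normal distribution.
    return central_moment(values, 4) / central_moment(values, 2) ** 2


def reflect_log10(values, shift=3.0):
    """Apply log10(-x + shift); requires every x to be below shift."""
    return [math.log10(-v + shift) for v in values]


# Made-up left-skewed sample with a long negative tail.
sample = [2.0, 1.5, 1.0, 0.5, 0.0, -1.0, -3.0, -7.0, -20.0, -50.0]
transformed = reflect_log10(sample)

print(f"raw:         skewness={skewness(sample):.3f} kurtosis={kurtosis(sample):.3f}")
print(f"transformed: skewness={skewness(transformed):.3f} kurtosis={kurtosis(transformed):.3f}")
```

The shift of 3 works for V1 because its maximum is 2.45, so -x+3 stays positive. Note that negating x also mirrors the axis, so the long tail ends up on the opposite side of the transformed histogram.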
## [1] "V1_A kurtosis=2.401071 skewness=-0.128408"
This plot shows V1 after the transformation. The kurtosis and skewness are much reduced. There appear to be three peaks where the data cluster.
Explore V2
## [1] "V2 kurtosis=98.771404 skewness=-4.624841"
V2 has a high-kurtosis distribution. I am interested in the center of V2: what does it look like when the values are around the mean? Let’s plot another histogram with some outliers removed.
There are two peaks in the center of V2.
Explore V3
## [1] "V3 kurtosis=29.619062 skewness=-2.240144"
V3 also has a high-kurtosis distribution around a mean of zero. Visually, V3 looks much closer to a normal distribution.
Explore V4
## [1] "V4 kurtosis=5.635388 skewness=0.676289"
V4 looks the closest to a normal distribution so far. V4 has several peaks, but I don’t see anything notable.
Explore V5
## [1] "V5 kurtosis=209.900906 skewness=-2.425889"
V5 also has a high kurtosis distribution. Let’s explore the center of V5.
V5 has a smooth distribution in the center. The center looks much closer to a normal distribution.
Explore V6
## [1] "V6 kurtosis=45.641724 skewness=1.826571"
V6 also has high kurtosis, but with a small subpeak on the right tail. Let’s zoom in on the center again.
Explore V7
## [1] "V7 kurtosis=408.600275 skewness=2.553894"
V7 has an extremely high kurtosis. Plot the center again.
V7 looks like a normal distribution but with three peaks in the center.
Explore V8
## [1] "V8 kurtosis=223.583080 skewness=-8.521899"
V8 has an extremely high kurtosis. Plot the center again.
The center of V8 is right-skewed.
Explore V9
## [1] "V9 kurtosis=6.731224 skewness=0.554677"
V9 looks like a normal distribution.
Explore V10
## [1] "V10 kurtosis=34.987656 skewness=1.187134"
Looks like V10 has several subpeaks. Let’s see the center again.
V10 has two subpeaks on the right tail.
Explore V11
## [1] "V11 kurtosis=4.633872 skewness=0.356504"
It looks like V11 is very close to a normal distribution.
Explore V12
## [1] "V12 kurtosis=23.241493 skewness=-2.278389"
V12 is somewhat close to a normal distribution.
Explore V13
## [1] "V13 kurtosis=3.195275 skewness=0.065233"
The density of V13 almost matches a normal distribution. I believe V13 is normally distributed.
Explore V14
## [1] "V14 kurtosis=26.879022 skewness=-1.995165"
V14 is also close to a normal distribution.
Explore V15
## [1] "V15 kurtosis=3.284743 skewness=-0.308421"
It looks like V15 also comes from a normal distribution.
Explore V16
## [1] "V16 kurtosis=13.418927 skewness=-1.100960"
V16 is very close to a normal distribution.
Explore V17
## [1] "V17 kurtosis=97.798034 skewness=-3.844894"
V17 is also close to a normal distribution.
Explore V18
## [1] "V18 kurtosis=5.578275 skewness=-0.259879"
V18 is also very close to a normal distribution.
Explore V19
## [1] "V19 kurtosis=4.724918 skewness=0.109191"
V19 is also very close to a normal distribution.
Explore V20
## [1] "V20 kurtosis=274.011334 skewness=-2.037145"
V20 has a high kurtosis. Let’s explore the center.
The center of V20 looks like a normal distribution.
Explore V21
## [1] "V21 kurtosis=210.283380 skewness=3.592972"
V21 also has a high kurtosis. Let’s explore the center.
The center of V21 also looks like a normal distribution.
Explore V22
## [1] "V22 kurtosis=5.832896 skewness=-0.213256"
V22 is also close to a normal distribution.
Explore V23
## [1] "V23 kurtosis=443.080912 skewness=-5.875109"
V23 also has an extremely high kurtosis. Let’s explore the center.
The center of V23 also looks like a normal distribution.
Explore V24
## [1] "V24 kurtosis=3.618839 skewness=-0.552496"
V24 has many subpeaks.
Explore V25
## [1] "V25 kurtosis=7.290316 skewness=-0.415790"
V25 is also close to a normal distribution but it has two peaks in the center.
Explore V26
## [1] "V26 kurtosis=3.918969 skewness=0.576690"
V26 is somewhat close to a normal distribution.
Explore V27
## [1] "V27 kurtosis=247.984919 skewness=-1.170203"
V27 has a very high kurtosis. Let’s look at the center.
The center of V27 is pretty close to a normal distribution.
Explore V28
## [1] "V28 kurtosis=936.381095 skewness=11.192032"
V28 has an extremely high kurtosis. Let’s see the center.
The center of V28 has a right-skewed distribution.
Boxplot of V1-V28
Can’t see the box? Let’s make another boxplot of V1-V28 with most outliers removed.
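The report does not say which rule was used to drop outliers before re-plotting. One common choice, swapped in here purely as an assumption, is Tukey's fences: keep only values within 1.5 interquartile ranges of the quartiles. The function name `within_tukey_fences` is hypothetical.

```python
import statistics


def within_tukey_fences(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR].

    This rule is an assumption for illustration; the report does not state
    how outliers were removed for the second boxplot.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Made-up sample: a compact center plus two extreme points.
toy_values = [float(v) for v in range(1, 21)] + [1000.0, -500.0]
center = within_tukey_fences(toy_values)

print(f"kept {len(center)} of {len(toy_values)} values")
print(f"range after filtering: {min(center)} to {max(center)}")
```

Filtering like this is only for plotting. As noted later in the report, the raw values are kept for the analysis itself because the outliers are what a fraud detector is looking for.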
The plots show that most distributions have low skewness and zero mean. Some have high kurtosis (e.g. V27, V28), and some are close to normal (e.g. V13).
The dataset contains 284807 transaction records over two days. The transactions are ordered by Time. Fraudulent transactions account for only 0.172% of all transactions. The median and mean of the transaction amount are both less than 100; the maximum amount is 25691.16 and the minimum is 0. V1-V28 are zero-mean distributions, many with high skewness or high kurtosis.
I observe that V1-V28 contain many outliers. Removing them would make it easier to concentrate on the center of each distribution. However, since this study is intended to detect outliers, I will keep all the raw data in my analysis.
The main feature of interest is Class, and all independent variables are potentially useful for predicting frauds. I am interested in whether frauds show different patterns in Amount as well as in the features V1-V28.
Time may help to detect frauds. I wonder whether frauds have a different distribution over time compared to normal transactions, for example, whether more frauds occur at night.
Time counts the seconds elapsed between the current transaction and the first transaction. I need to transform Time into a more meaningful variable than a running count. The peak time of transactions seems periodic with a 24-hour cycle, so I create a categorical Hour variable that extracts the hour of the day from the elapsed time, assuming the first transaction occurs at 12:00am.
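The Hour extraction can be sketched in a few lines. This assumes, as the report does, that the first transaction happened at 12:00am; the function name `time_to_hour` is hypothetical.

```python
SECONDS_PER_HOUR = 60 * 60
HOURS_PER_DAY = 24


def time_to_hour(seconds_elapsed):
    """Map seconds since the first transaction to an hour of day (0 to 23).

    Assumes the first transaction occurred at 12:00am.
    """
    return int(seconds_elapsed // SECONDS_PER_HOUR) % HOURS_PER_DAY


# Minimum, quartiles and maximum of Time from the summary output above.
for t in (0, 54202, 84692, 139321, 172792):
    print(f"Time={t:>6} s -> Hour={time_to_hour(t)}")
```

The modulo step is what folds the second day back onto the same 0 to 23 scale, so that transactions at the same clock hour on both days share one category.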
I log-transformed the left-skewed V1 and the right-skewed Amount to make the data easier to visualize. The transformed V1 shows a distribution with three peaks. For features with multiple peaks, e.g. V24, I don’t know how to transform the distribution into a smoother shape.
Explore Time vs. Class
The blue bins are normal transactions, whose number grows during the day and falls at night. The red bins are fraudulent transactions, which do not seem to follow a day and night pattern. The number of frauds in the daytime is a bit higher than at night, but there is no significant drop at night. I believe frauds have a different distribution over Time compared to normal transactions. Let’s see the density plot.
The plot shows that frauds have a higher density at night than normal transactions.
Explore Hour vs. Class
If frauds have a different distribution over Time, what about Hour? Let’s plot the graphs again using Hour.
Clearly, frauds are spread throughout the day regardless of the hour. However, the number of frauds at night is no more than about 100, while there are still tens of thousands of normal transactions at night. Hour or Time alone is not sufficient to identify a fraud, so I need to explore more features. Let’s look at Amount.
Explore Amount and transformed Amount vs. Class
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.00 9.25 122.21 105.89 2125.87
I can see that the Amount of most frauds is very small. The distributions of transaction amount for frauds and nonfrauds seem similar, so I think Amount alone is still not sufficient to predict a fraud.
Explore V1-V28 vs. Class
Since V1-V28 are PCA components with no direct meaning and have similar distributions, I will plot the density distributions of V1-V28 by Class as a group. I am interested in whether the distributions differ between frauds and nonfrauds.
Let’s summarize what I found.
V1-V4, V9-V12, V14 and V16-V18 have clearly different distributions for frauds. There are red areas under the fraud density that do not overlap much with the blue areas. I think a detector that focuses on transactions in those areas can catch many frauds.
V5-V7, V19 and V21 have different distributions for frauds, but less of the fraud density lies outside the nonfraud density. These features seem less important for identifying frauds.
For V8, V13, V15, V20 and V22-V28, the area under the fraud density is almost entirely overlapped by the nonfraud density. I think these features might not be useful for detecting frauds, but I will keep exploring them in the multivariate analysis section.
Explore Amount vs. Hour
Intuitively, Amount should be higher in the daytime, since people are more active. Let’s check with a boxplot.
The plot shows that most transaction amounts are less than 200, and the median amount varies around 20 during the day. As I expected, Amount is higher in the daytime.
Frauds seem to occur roughly uniformly throughout the day, independent of day and night. Since the number of normal transactions drops at night, the probability that a transaction is a fraud increases slightly at night.
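The reasoning above, a steady fraud count divided by a falling transaction count, can be made concrete with a small sketch. The counts below are made up to mimic the pattern and are not the real hourly figures; `fraud_rate_by_hour` is a hypothetical helper.

```python
from collections import Counter


def fraud_rate_by_hour(records):
    """records: iterable of (hour, is_fraud) pairs. Returns {hour: fraud share}."""
    totals = Counter()
    frauds = Counter()
    for hour, is_fraud in records:
        totals[hour] += 1
        if is_fraud:
            frauds[hour] += 1
    return {hour: frauds[hour] / totals[hour] for hour in sorted(totals)}


# Made-up counts: the same number of frauds at 3:00 and 14:00, but ten
# times fewer normal transactions at night.
toy_records = (
    [(3, False)] * 1000 + [(3, True)] * 2
    + [(14, False)] * 10000 + [(14, True)] * 2
)

rates = fraud_rate_by_hour(toy_records)
for hour, rate in rates.items():
    print(f"hour {hour:>2}: fraud share = {rate:.4%}")
```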
The smallest fraud amount is 0 and the largest is 2125.87. I don’t see any specific amount that carries a significantly higher probability of fraud.
The features V1-V28 seem more informative, because some of them show different distributions between frauds and nonfrauds. I would like to explore the interactions among V1-V28 as well as Amount and Time, to see whether any hidden pattern exists only in higher-dimensional plots.
The median transaction amount is higher during the day than at night. Daytime transactions tend to be both more numerous and larger.
The features V1-V4, V9-V12, V14 and V16-V18 have clearly distinct density shapes across the two classes. I think these features are very important for detecting frauds.
Amount vs. Time by Class
Let’s plot the time series of the transactions.
The red points mark the occurrence of frauds. The red points are always surrounded by white points, so we can’t conclude that frauds follow a pattern different from normal transactions. I see that almost all high-Amount transactions (>3000) are made in the daytime. Interestingly, no fraud is higher than 3000.
Time series plot V1-V28 by Class
Let’s plot more time series on other features.
Let’s summarize what I found from the plots above:
Where the red points are not surrounded by white points, or are far away from them, I think a supervised learning model can draw a boundary to separate the frauds. Based on the plots above, I would like to select the features V1-V5, V7-V12, V14 and V16-V18, which most clearly separate the red points from the clusters of white points.
The features V9, V11-V15 and V17 show a clear shift during a specific time of day. I am curious about the hours when the shift occurs.
The features V4 and V26 show a shift between days. V4 has a larger variance on the second day than on the first, and V26 has a lower mean on the second day than on the first. On the first day of V4, many frauds lie outside the cluster of nonfrauds, but on the second day, because of the shift, many frauds no longer stand out from the nonfraud cluster. Hence, I would like to create a new feature, Day, to represent the day shift.
Create a new feature Day
Let’s see the summary of Day.
## 0 1
## 144786 140021
There are 144786 transactions on the first day and 140021 transactions on the second day.
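A minimal sketch of a Day feature is given below. It assumes the split is made at exactly 24 hours (86400 seconds) after the first transaction, which the report does not state explicitly; the function name `time_to_day` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def time_to_day(seconds_elapsed):
    """Return 0 for the first 24 hours and 1 for the second 24 hours.

    Splitting at exactly 86400 seconds is an assumption for illustration.
    """
    return int(seconds_elapsed // SECONDS_PER_DAY)


for t in (0, 86399, 86400, 172792):
    print(f"Time={t:>6} s -> Day={time_to_day(t)}")
```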
Explore V9 V11-V15 V17 vs. Time by Hour
Let’s explore the hour shifts inside V9 V11-V15 V17.
We see that all the shifts occur every day from 1:00 to 7:00. Interestingly, transactions that ‘forget’ to shift their V12 value back to normal in the daytime are probably regarded as frauds.
Pairs plot of all features by Class
## [1] "Time" "V1" "V2" "V3" "V4" "V5"
## [7] "V6" "V7" "V8" "V9" "V10" "V11"
## [13] "V12" "V13" "V14" "V15" "V16" "V17"
## [19] "V18" "V19" "V20" "V21" "V22" "V23"
## [25] "V24" "V25" "V26" "V27" "V28" "Amount"
## [31] "Class" "Hour" "Amount_A" "V1_A" "Day"
The image size is very large, I’ve saved a high resolution version here
The pairs plot shows that normal transactions have no significant correlation between features. However, for frauds some features are correlated.
Explore correlations
Let’s make a heat matrix plot to better understand the correlation.
The plots show that the features are almost uncorrelated for normal transactions, while frauds show strong correlations among the features V1-V5, V7, V9-V12, V14 and V16-V19. I think the correlations help to reduce redundant features but may not be useful for classifying frauds.
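Computing correlations separately for each class, rather than on the whole dataset, is what exposes the pattern described above. The sketch below uses Pearson correlation on made-up values chosen so that the two features are uncorrelated in class 0 and perfectly correlated in class 1; the helper names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_by_class(feature_a, feature_b, labels):
    """Pearson correlation of two features, computed within each class."""
    result = {}
    for cls in sorted(set(labels)):
        a = [x for x, y in zip(feature_a, labels) if y == cls]
        b = [x for x, y in zip(feature_b, labels) if y == cls]
        result[cls] = pearson(a, b)
    return result


# Made-up values: no linear relation in class 0, a perfect one in class 1.
feature_a = [-1.0, 1.0, -1.0, 1.0, -2.0, -4.0, -6.0, -8.0]
feature_b = [-1.0, -1.0, 1.0, 1.0, -1.0, -2.0, -3.0, -4.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

by_class = correlation_by_class(feature_a, feature_b, labels)
for cls, r in by_class.items():
    print(f"class {cls}: r = {r:.3f}")
```

Near-zero correlation among the normal transactions is expected, since principal components are uncorrelated over the data they were fitted on and normal transactions make up almost all of it.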
t-SNE plot
## Read the 10473 x 40 data matrix successfully!
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Normalizing input...
## Building tree...
## - point 0 of 10473
## - point 10000 of 10473
## Done in 14.17 seconds (sparsity = 0.012366)!
## Learning embedding...
## Iteration 50: error is 97.800035 (50 iterations in 6.41 seconds)
## Iteration 100: error is 87.909937 (50 iterations in 6.75 seconds)
## Iteration 150: error is 83.627513 (50 iterations in 6.76 seconds)
## Iteration 200: error is 82.766357 (50 iterations in 6.62 seconds)
## Iteration 250: error is 82.395014 (50 iterations in 6.56 seconds)
## Iteration 300: error is 2.997858 (50 iterations in 6.12 seconds)
## Iteration 350: error is 2.573443 (50 iterations in 6.22 seconds)
## Iteration 400: error is 2.332048 (50 iterations in 6.26 seconds)
## Iteration 450: error is 2.170681 (50 iterations in 6.03 seconds)
## Iteration 500: error is 2.053819 (50 iterations in 6.10 seconds)
## Iteration 550: error is 1.966327 (50 iterations in 6.03 seconds)
## Iteration 600: error is 1.897763 (50 iterations in 6.15 seconds)
## Iteration 650: error is 1.845017 (50 iterations in 6.11 seconds)
## Iteration 700: error is 1.804008 (50 iterations in 6.14 seconds)
## Iteration 750: error is 1.773503 (50 iterations in 6.15 seconds)
## Iteration 800: error is 1.750443 (50 iterations in 6.31 seconds)
## Iteration 850: error is 1.735577 (50 iterations in 6.17 seconds)
## Iteration 900: error is 1.723695 (50 iterations in 6.22 seconds)
## Iteration 950: error is 1.715242 (50 iterations in 6.18 seconds)
## Iteration 1000: error is 1.709080 (50 iterations in 6.54 seconds)
## Fitting performed in 125.83 seconds.
I chose the features V1-V5, V7, V9-V12, V14, V16-V18, Hour and Day to run the t-SNE algorithm, since these features show stronger fraud patterns. The t-SNE plot contains all fraud points and 10000 sampled nonfrauds. It shows two major clusters of frauds (upper and right), as well as individual frauds whose features look very similar to normal transactions and are therefore hard to identify.
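The sampling step that precedes t-SNE (keep every fraud, draw a fixed number of nonfrauds) can be sketched as follows. The function name `sample_for_tsne`, the seed, and the toy data are assumptions for illustration; the t-SNE fit itself is not shown.

```python
import random


def sample_for_tsne(rows, labels, n_normal=10000, seed=0):
    """Keep all fraud rows (label 1) plus a random sample of normal rows."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    fraud_idx = [i for i, y in enumerate(labels) if y == 1]
    normal_idx = [i for i, y in enumerate(labels) if y == 0]
    sampled_normal = rng.sample(normal_idx, min(n_normal, len(normal_idx)))
    keep = sorted(fraud_idx + sampled_normal)  # preserve the original order
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Made-up data: 5 frauds among 105 single-feature rows.
toy_labels = [1] * 5 + [0] * 100
toy_rows = [(float(i),) for i in range(len(toy_labels))]

sub_rows, sub_labels = sample_for_tsne(toy_rows, toy_labels, n_normal=20)
print(f"{len(sub_rows)} rows kept, {sum(sub_labels)} of them frauds")
```

The 10473 rows in the log above are consistent with 10000 sampled nonfrauds plus the frauds that presumably remain after duplicates are removed. Downsampling only the majority class keeps the run time manageable while leaving every fraud visible in the plot.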
The time series plots of the features are more helpful for seeing how the transaction distribution varies as time elapses. I also observe hour shifts and day shifts in some features. The plots confirm the useful features I found in the bivariate analysis section.
From the correlation heat matrix, I see that some features are highly correlated, e.g. V16-V18. I will not consider dropping any features before I build a baseline model.
Finally, I would like to select the most useful features to build a model: the features V1-V5, V7, V9-V12, V14 and V16-V18 have distinct and separated distributions between frauds and nonfrauds; Hour interacts with the features V9, V11, V12, V14 and V17; Day interacts with V4.
Features like V12 and V13 have a periodic shift at 1:00-7:00 every day, and their distributions change while the shift occurs. V4 and V26 have a day shift, so each day has a different distribution.
Yes. I created a Python script that builds a baseline neural network model for fraud detection.
The model scores around 0.8 AUPRC and is able to detect about 80% of frauds without inconveniencing many customers. However, raising the detection rate above 80% is very difficult, because a huge number of customers would have to be inspected to discover only a few more frauds.
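The trade-off described above can be illustrated by counting how many of the highest-scored transactions must be inspected to reach a given share of caught frauds. The labels and scores below are made up and do not come from the report's model; `alerts_needed_for_recall` is a hypothetical helper.

```python
def alerts_needed_for_recall(labels, scores, target_recall):
    """Number of top-scored transactions to inspect to reach target_recall.

    labels: 1 for fraud, 0 for normal. scores: higher means more suspicious.
    """
    n_fraud = sum(labels)
    if n_fraud == 0:
        raise ValueError("at least one fraud is required")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    caught = 0
    for rank, i in enumerate(order, start=1):
        caught += labels[i]
        if caught / n_fraud >= target_recall:
            return rank
    return len(labels)


# Made-up example: four frauds score high, one hides among the normals.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
toy_scores = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

for recall in (0.8, 1.0):
    n_alerts = alerts_needed_for_recall(toy_labels, toy_scores, recall)
    print(f"recall {recall:.0%}: inspect {n_alerts} transactions")
```

In this toy case, catching the last fraud means inspecting every transaction, which mirrors the steep cost of pushing recall past the point where the remaining frauds look like normal transactions.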
Plot one shows the transaction Amount over the two days; the red points are fraudulent transactions.
Plot two indicates a distribution shift on V12 from 1:00 to 7:00.
The t-SNE plot reduces the high-dimensional feature space to two dimensions. The plot shows two clusters of red points, which are fraudulent transactions.
The credit card dataset contains two days of transactions, of which only 0.172% are frauds. I started by exploring individual features and the relationships among multiple features, and eventually selected the best features for a model. I also built a baseline model that is able to detect 80% of frauds without inconveniencing many customers.
I struggled to select the features that best distinguish frauds. Some features are strongly correlated, but I have no background information besides Time and Amount to explain the correlations. I am still looking for high-dimensional visualization tools to better see any hidden fraud pattern across all features.
Because frauds are very rare, I use AUPRC as the metric to evaluate a model. My model achieves an average score of 0.8 and detects about 80% of frauds. I think it is very difficult to make a breakthrough above this score. The remaining 20% of frauds are unfortunately well camouflaged: their values of V1-V28 are all close to zero, the mean of normal transactions. Hence, I conclude that the existing features are not sufficient to uncover all frauds. Collecting more features and more transaction records on different days is recommended to build a better classification model.
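For reference, one common way to estimate AUPRC is average precision: the mean of the precision values measured at each rank where a fraud is found. The sketch below is a plain-Python illustration on made-up scores, not the evaluation code used for the reported 0.8; ties between scores are not handled specially.

```python
def average_precision(labels, scores):
    """Average precision, a common estimator of the area under the PR curve.

    labels: 1 for fraud, 0 for normal. scores: higher means more suspicious.
    """
    n_fraud = sum(labels)
    if n_fraud == 0:
        raise ValueError("at least one fraud is required")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_positives += 1
            precision_sum += true_positives / rank  # precision at this rank
    return precision_sum / n_fraud


# Made-up example: frauds are ranked first and third.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]

ap = average_precision(toy_labels, toy_scores)
print(f"average precision = {ap:.4f}")
```

Unlike accuracy, this score ignores the large number of correctly classified normal transactions, so it stays informative when frauds are only a tiny fraction of the data.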
Future work will investigate the fraudulent cases that the model fails to detect.