Credit Card Fraud Detection Exploration and Analysis by Pengchong Tang

Introduction: This report explores a credit card fraud detection dataset from Kaggle.com. The dataset contains transactions made with credit cards by European cardholders in September 2013. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions. It contains only numerical input variables, which are the result of a PCA transformation. Features V1, V2, … V28 are the principal components obtained with PCA; the only features not transformed with PCA are ‘Time’ and ‘Amount’. Feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature ‘Amount’ is the transaction amount. Feature ‘Class’ is the response variable; it takes value 1 in case of fraud and 0 otherwise.

Univariate Plots Section

Summary of the dataset:

##       Time              V1                  V2           
##  Min.   :     0   Min.   :-56.40751   Min.   :-72.71573  
##  1st Qu.: 54202   1st Qu.: -0.92037   1st Qu.: -0.59855  
##  Median : 84692   Median :  0.01811   Median :  0.06549  
##  Mean   : 94814   Mean   :  0.00000   Mean   :  0.00000  
##  3rd Qu.:139321   3rd Qu.:  1.31564   3rd Qu.:  0.80372  
##  Max.   :172792   Max.   :  2.45493   Max.   : 22.05773  
##        V3                 V4                 V5            
##  Min.   :-48.3256   Min.   :-5.68317   Min.   :-113.74331  
##  1st Qu.: -0.8904   1st Qu.:-0.84864   1st Qu.:  -0.69160  
##  Median :  0.1799   Median :-0.01985   Median :  -0.05434  
##  Mean   :  0.0000   Mean   : 0.00000   Mean   :   0.00000  
##  3rd Qu.:  1.0272   3rd Qu.: 0.74334   3rd Qu.:   0.61193  
##  Max.   :  9.3826   Max.   :16.87534   Max.   :  34.80167  
##        V6                 V7                 V8           
##  Min.   :-26.1605   Min.   :-43.5572   Min.   :-73.21672  
##  1st Qu.: -0.7683   1st Qu.: -0.5541   1st Qu.: -0.20863  
##  Median : -0.2742   Median :  0.0401   Median :  0.02236  
##  Mean   :  0.0000   Mean   :  0.0000   Mean   :  0.00000  
##  3rd Qu.:  0.3986   3rd Qu.:  0.5704   3rd Qu.:  0.32735  
##  Max.   : 73.3016   Max.   :120.5895   Max.   : 20.00721  
##        V9                 V10                 V11          
##  Min.   :-13.43407   Min.   :-24.58826   Min.   :-4.79747  
##  1st Qu.: -0.64310   1st Qu.: -0.53543   1st Qu.:-0.76249  
##  Median : -0.05143   Median : -0.09292   Median :-0.03276  
##  Mean   :  0.00000   Mean   :  0.00000   Mean   : 0.00000  
##  3rd Qu.:  0.59714   3rd Qu.:  0.45392   3rd Qu.: 0.73959  
##  Max.   : 15.59500   Max.   : 23.74514   Max.   :12.01891  
##       V12                V13                V14          
##  Min.   :-18.6837   Min.   :-5.79188   Min.   :-19.2143  
##  1st Qu.: -0.4056   1st Qu.:-0.64854   1st Qu.: -0.4256  
##  Median :  0.1400   Median :-0.01357   Median :  0.0506  
##  Mean   :  0.0000   Mean   : 0.00000   Mean   :  0.0000  
##  3rd Qu.:  0.6182   3rd Qu.: 0.66251   3rd Qu.:  0.4931  
##  Max.   :  7.8484   Max.   : 7.12688   Max.   : 10.5268  
##       V15                V16                 V17           
##  Min.   :-4.49894   Min.   :-14.12985   Min.   :-25.16280  
##  1st Qu.:-0.58288   1st Qu.: -0.46804   1st Qu.: -0.48375  
##  Median : 0.04807   Median :  0.06641   Median : -0.06568  
##  Mean   : 0.00000   Mean   :  0.00000   Mean   :  0.00000  
##  3rd Qu.: 0.64882   3rd Qu.:  0.52330   3rd Qu.:  0.39968  
##  Max.   : 8.87774   Max.   : 17.31511   Max.   :  9.25353  
##       V18                 V19                 V20           
##  Min.   :-9.498746   Min.   :-7.213527   Min.   :-54.49772  
##  1st Qu.:-0.498850   1st Qu.:-0.456299   1st Qu.: -0.21172  
##  Median :-0.003636   Median : 0.003735   Median : -0.06248  
##  Mean   : 0.000000   Mean   : 0.000000   Mean   :  0.00000  
##  3rd Qu.: 0.500807   3rd Qu.: 0.458949   3rd Qu.:  0.13304  
##  Max.   : 5.041069   Max.   : 5.591971   Max.   : 39.42090  
##       V21                 V22                  V23           
##  Min.   :-34.83038   Min.   :-10.933144   Min.   :-44.80774  
##  1st Qu.: -0.22839   1st Qu.: -0.542350   1st Qu.: -0.16185  
##  Median : -0.02945   Median :  0.006782   Median : -0.01119  
##  Mean   :  0.00000   Mean   :  0.000000   Mean   :  0.00000  
##  3rd Qu.:  0.18638   3rd Qu.:  0.528554   3rd Qu.:  0.14764  
##  Max.   : 27.20284   Max.   : 10.503090   Max.   : 22.52841  
##       V24                V25                 V26          
##  Min.   :-2.83663   Min.   :-10.29540   Min.   :-2.60455  
##  1st Qu.:-0.35459   1st Qu.: -0.31715   1st Qu.:-0.32698  
##  Median : 0.04098   Median :  0.01659   Median :-0.05214  
##  Mean   : 0.00000   Mean   :  0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.43953   3rd Qu.:  0.35072   3rd Qu.: 0.24095  
##  Max.   : 4.58455   Max.   :  7.51959   Max.   : 3.51735  
##       V27                  V28                Amount         Class     
##  Min.   :-22.565679   Min.   :-15.43008   Min.   :    0.00   0:284315  
##  1st Qu.: -0.070840   1st Qu.: -0.05296   1st Qu.:    5.60   1:   492  
##  Median :  0.001342   Median :  0.01124   Median :   22.00             
##  Mean   :  0.000000   Mean   :  0.00000   Mean   :   88.35             
##  3rd Qu.:  0.091045   3rd Qu.:  0.07828   3rd Qu.:   77.17             
##  Max.   : 31.612198   Max.   : 33.84781   Max.   :25691.16

The raw dataset consists of 284807 transaction records, of which 492 are fraudulent. There are no missing values in the dataset. I also found 1081 duplicate records; these duplicates will be removed when I create the t-SNE plot, to avoid misleading results.
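Although the exploration here is done in R, the model script mentioned later in this report is Python, so here is a hedged Python sketch of the missing-value and duplicate checks. A toy frame stands in for the real data; the filename `creditcard.csv` is an assumption.

```python
import pandas as pd

# Toy rows standing in for the real data; with the actual file this would be
# df = pd.read_csv("creditcard.csv")  (filename assumed)
df = pd.DataFrame({
    "Time":   [0, 0, 5, 5],
    "Amount": [9.99, 9.99, 3.50, 3.50],
    "Class":  [0, 0, 0, 1],
})

missing = df.isna().sum().sum()   # total count of missing values
dups = df.duplicated().sum()      # rows identical to an earlier row
deduped = df.drop_duplicates()    # deduplicated copy, used for the t-SNE plot
print(missing, dups, len(deduped))
```

Note that `duplicated()` only flags rows that repeat an earlier row, so the count matches the number of records removed by `drop_duplicates()`.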

Univariate Analysis

Explore the Class:

The dataset is highly imbalanced: fraudulent records account for only 0.172% of all transactions.

Explore the Time:

Histogram of Time per minute

Histogram of Time per hour

The density of Time

The largest Time value is 172792 seconds, which is roughly 48 hours. There appear to be two peaks as well as two saddles during these two days. I assume the peaks occur in the daytime and the saddles at night. I wonder if I can transform Time into Hour, a categorical variable representing the hour of the day, assuming the time starts from 12:00am.

Most transactions are committed during the busy hours from 9:00 to 22:00.

Explore the Amount:

Histogram of Amount

The distribution of Amount is highly skewed. Plotted on a log scale, it appears as a bimodal, roughly normal-shaped distribution.
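The report does not state its exact log transform for Amount; as one plausible sketch in Python, log10(x + 1) handles the zero-amount transactions visible in the summary above:

```python
import numpy as np

# A few Amount values taken from the summary table above
amount = np.array([0.00, 5.60, 22.00, 88.35, 25691.16])

# log10(x + 1) is one common choice for this kind of skew (an assumption,
# not necessarily the report's transform); the +1 keeps zero amounts defined
log_amount = np.log10(amount + 1)
print(log_amount)
```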

Let’s plot the histograms of V1-V28.

Explore V1

## [1] "V1 kurtosis=35.486088 skewness=-3.280650"

The red line is the density function of a normal distribution with V1’s mean and standard deviation. Obviously V1 is not normally distributed. V1 is highly left-skewed, so I can’t view the details. Let’s apply the transformation log10(-x+3), which reshapes the long tail.

## [1] "V1_A kurtosis=2.401071 skewness=-0.128408"

This plot shows V1 after the transformation. The kurtosis and skewness are much reduced, and three peaks appear where the data cluster.
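The effect of the transform on the moments can be sketched with synthetic data; a negated lognormal stands in for the left-skewed V1 here (illustrative only, the numbers will differ from the real feature):

```python
import numpy as np

def skew_kurt(x):
    # Sample skewness and (non-excess) kurtosis via standardized moments
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean(), (z ** 4).mean()

rng = np.random.default_rng(0)
# Negated lognormal: a heavy left tail standing in for V1 (synthetic stand-in)
v1 = 1.0 - rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

# The report's transform; valid here because v1 < 3 everywhere
v1_a = np.log10(3.0 - v1)

s_raw, k_raw = skew_kurt(v1)
s_tr, k_tr = skew_kurt(v1_a)
print(s_raw, k_raw)   # strongly left-skewed, high kurtosis
print(s_tr, k_tr)     # both moments move toward a normal's (0, 3)
```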

Explore V2

## [1] "V2 kurtosis=98.771404 skewness=-4.624841"

V2 has a high-kurtosis distribution. I am interested in the center of V2: what does it look like around the mean? Let’s plot another histogram with some outliers removed.

There are two peaks in the center of V2.

Explore V3

## [1] "V3 kurtosis=29.619062 skewness=-2.240144"

V3 also has a high kurtosis distribution around mean zero. It looks like V3 is much closer to a normal distribution.

Explore V4

## [1] "V4 kurtosis=5.635388 skewness=0.676289"

V4 looks the closest to a normal distribution so far. It has several peaks, but nothing stands out.

Explore V5

## [1] "V5 kurtosis=209.900906 skewness=-2.425889"

V5 also has a high kurtosis distribution. Let’s explore the center of V5.

V5 has a smooth distribution in the center, which looks much closer to a normal distribution.

Explore V6

## [1] "V6 kurtosis=45.641724 skewness=1.826571"

V6 also has high kurtosis but with a small subpeak on the right tail. Let’s zoom in on the center again.

Explore V7

## [1] "V7 kurtosis=408.600275 skewness=2.553894"

V7 has an extremely high kurtosis. Plot the center again.

V7 looks like a normal distribution but with three peaks in the center.

Explore V8

## [1] "V8 kurtosis=223.583080 skewness=-8.521899"

V8 has an extremely high kurtosis. Plot the center again.

The center of V8 is right-skewed.

Explore V9

## [1] "V9 kurtosis=6.731224 skewness=0.554677"

V9 looks like a normal distribution.

Explore V10

## [1] "V10 kurtosis=34.987656 skewness=1.187134"

Looks like V10 has several subpeaks. Let’s see the center again.

V10 has two subpeaks on the right tail.

Explore V11

## [1] "V11 kurtosis=4.633872 skewness=0.356504"

It looks like V11 is very close to a normal distribution.

Explore V12

## [1] "V12 kurtosis=23.241493 skewness=-2.278389"

V12 is fairly close to a normal distribution.

Explore V13

## [1] "V13 kurtosis=3.195275 skewness=0.065233"

The density of V13 almost matches a normal distribution. I believe V13 is normally distributed.

Explore V14

## [1] "V14 kurtosis=26.879022 skewness=-1.995165"

V14 is also close to a normal distribution.

Explore V15

## [1] "V15 kurtosis=3.284743 skewness=-0.308421"

It looks like V15 also comes from a normal distribution.

Explore V16

## [1] "V16 kurtosis=13.418927 skewness=-1.100960"

V16 is very close to a normal distribution.

Explore V17

## [1] "V17 kurtosis=97.798034 skewness=-3.844894"

V17 is also close to a normal distribution.

Explore V18

## [1] "V18 kurtosis=5.578275 skewness=-0.259879"

V18 is also very close to a normal distribution.

Explore V19

## [1] "V19 kurtosis=4.724918 skewness=0.109191"

V19 is also very close to a normal distribution.

Explore V20

## [1] "V20 kurtosis=274.011334 skewness=-2.037145"

V20 has a high kurtosis. Let’s explore the center.

The center of V20 looks like a normal distribution.

Explore V21

## [1] "V21 kurtosis=210.283380 skewness=3.592972"

V21 also has a high kurtosis. Let’s explore the center.

The center of V21 also looks like a normal distribution.

Explore V22

## [1] "V22 kurtosis=5.832896 skewness=-0.213256"

V22 is also close to a normal distribution.

Explore V23

## [1] "V23 kurtosis=443.080912 skewness=-5.875109"

V23 also has an extremely high kurtosis. Let’s explore the center.

The center of V23 also looks like a normal distribution.

Explore V24

## [1] "V24 kurtosis=3.618839 skewness=-0.552496"

V24 has many subpeaks.

Explore V25

## [1] "V25 kurtosis=7.290316 skewness=-0.415790"

V25 is also close to a normal distribution but it has two peaks in the center.

Explore V26

## [1] "V26 kurtosis=3.918969 skewness=0.576690"

V26 is fairly close to a normal distribution.

Explore V27

## [1] "V27 kurtosis=247.984919 skewness=-1.170203"

V27 has a very high kurtosis. Let’s look at the center.

The center of V27 is pretty close to a normal distribution.

Explore V28

## [1] "V28 kurtosis=936.381095 skewness=11.192032"

V28 has an extremely high kurtosis. Let’s see the center.

The center of V28 has a right-skewed distribution.

Boxplot of V1-V28

Can’t see the box? Let’s make another boxplot of V1-V28 with most outliers removed.

The plots show most distributions have low skewness and zero mean; some have high kurtosis (e.g. V27, V28), and some are close to normal distributions (e.g. V13).

What is the structure of your dataset?

The dataset contains 284807 transaction records over two days. The transactions are ordered by Time. The fraudulent transactions account for only 0.172% of all transactions. The median and mean of the transaction amount are both less than 100, the maximum amount is 25691.16, and the minimum is 0. V1-V28 are zero-mean distributions that are either highly skewed or of high kurtosis.

I observe that V1-V28 contain many outliers. Removing these outliers would make it easier to concentrate on the center of each distribution. However, since this study intends to detect outliers, I keep all the raw data in my analysis.

What is/are the main feature(s) of interest in your dataset?

The main feature is Class, and all independent variables are potentially useful features for predicting frauds. I am interested in whether the frauds show different patterns in Amount as well as in the features V1-V28.

What other features in the dataset do you think will help support your  investigation into your feature(s) of interest?

Time may help detect the frauds. I wonder whether frauds have a different distribution compared to normal transactions; for example, more frauds may occur at night.

Did you create any new variables from existing variables in the dataset?

Time counts the seconds elapsed between the current transaction and the first transaction. I need to transform Time into a more meaningful variable than a raw count. The peak times of transactions seem periodic with a 24-hour cycle, so I create a categorical Hour variable that extracts the calculated hour of the day from the elapsed time, assuming the first transaction occurs at 12:00am.
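A minimal sketch of that derivation, assuming Time is seconds elapsed from 12:00am on the first day:

```python
def to_hour(seconds):
    """Hour of day (0-23) from seconds elapsed since the first transaction,
    assuming the first transaction occurs at 12:00am."""
    return (seconds // 3600) % 24

# First transaction, one hour in, start of day two, and the last Time value
print(to_hour(0), to_hour(3600), to_hour(86400), to_hour(172792))
```

The modulo folds both days onto the same 24-hour cycle, which is what makes Hour usable as a categorical variable.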

Of the features you investigated, were there any unusual distributions?  Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I log-transformed the left-skewed V1 and the right-skewed Amount to visualize the data more easily. The transformed V1 shows a distribution with three peaks. For the features with multiple peaks, e.g. V24, I don’t know how to transform the distribution into a smoother shape.

Bivariate Plots Section

Explore Time vs. Class

The blue bins are normal transactions, whose count grows during the day and falls at night. The red bins are fraudulent transactions, which do not seem to follow a day-and-night pattern. The number of frauds in the daytime is slightly higher than at night, but there is no significant drop at night. I believe the frauds have a different distribution over Time compared to normal transactions. Let’s see the density plot.

The plot shows the frauds have a higher density at night than the normal transactions.

Explore Hour vs. Class

If the frauds have different distribution on Time, what about Hour? Let’s plot the graphs again on Hour.

Clearly, the frauds spread throughout the day regardless of time. However, the count of frauds at night is roughly no more than 100, while there are still tens of thousands of normal transactions at night. Hour or Time alone is not sufficient to identify a fraud; I need to explore more features. Let’s see Amount.

Explore Amount and transformed Amount vs. Class

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    9.25  122.21  105.89 2125.87

The Amount of most frauds is very small. The distributions of transaction amounts for frauds and nonfrauds look similar, so I think Amount alone is still not sufficient to predict a fraud.

Explore V1-V28 vs. Class

Since V1-V28 are PCA components with no intrinsic meaning and have similar distributions, I will plot the density distributions of V1-V28 by Class as a group. I am interested in whether the distributions vary between frauds and nonfrauds.

Let’s summarize what I found.

V1-V4, V9-V12, V14, and V16-V18 have clearly different distributions for the frauds. There are red areas under the fraud density functions that barely overlap the blue areas. If a detector focuses on the transactions in those areas, it can catch many frauds.

V5-V7, V19, and V21 also have different distributions for the frauds, but the fraud density functions have smaller areas that do not overlap the nonfraud density. These features seem less important for identifying frauds.

The areas under the density functions of V8, V13, V15, V20, and V22-V28 are almost completely overlapped by the nonfrauds. These features might not be useful for detecting frauds, but I will keep exploring them in the multivariate analysis section.

Explore Amount vs. Hour

Intuitively, the Amount would be higher in the daytime, since people are more active then. But let’s see the boxplot.

The plot shows most transaction amounts are less than 200, and the median amount varies around 20 over the day. As I expected, the Amount is higher in the daytime.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

It seems frauds can occur uniformly at any time of day, regardless of day or night. Since the number of normal transactions drops at night, the probability that a transaction is a fraud increases slightly at night.

The smallest fraud amount is 0 and the largest is 2126. I don’t see any specific amount with a significantly higher probability of indicating a fraud.

The features V1-V28 seem more informative, because a portion of them show different distributions between frauds and nonfrauds. I would like to explore the interactions among V1-V28 as well as Amount and Time, to see whether any hidden pattern exists only in a higher-dimensional plot.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

The median transaction amount in the daytime is higher than at night. Daytime transactions tend to be higher in both count and amount.

What was the strongest relationship you found?

The features V1-V4, V9-V12, V14, and V16-V18 have clearly distinct density shapes across the two classes. I think these features are very important for detecting frauds.

Multivariate Plots Section

Amount vs. Time by Class

Let’s plot the time series of the transactions.

The red points show the occurrences of frauds. The red points always seem surrounded by white points, so we can’t conclude any pattern in which frauds behave differently from normal transactions. Almost all the high-Amount transactions (>3000) are committed in the daytime. Interestingly, none of the frauds is higher than 3000.

Time series plot V1-V28 by Class

Let’s plot more time series on other features.

Let’s summarize what I found from the plots above:

Looking at the red points, if they are not surrounded by, or are far away from, any white points, I think a supervised learning model would be able to draw a boundary separating the frauds. Based on the plots above, I would select the features V1-V5, V7-V12, V14, and V16-V18, which clearly separate most red points from the white point clusters.

The features V9, V11-V15, and V17 have a clear shift during a specific time of day. I am curious about the hours when the shift occurs.

The features V4 and V26 have a day shift. The second day of V4 has a larger variance than the first day, and the second day of V26 has a lower mean than the first day. Looking at the first day of V4, many frauds lie outside the cluster of nonfrauds, but on the second day, because of the shift, many frauds no longer stand out far from the nonfraud cluster. Hence, I would like to create a new feature Day to represent the day shift.

Create a new feature Day

Let’s see the summary of Day.

##      0      1 
## 144786 140021

There are 144786 transactions on the first day while 140021 transactions on the second day.
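Deriving Day mirrors the Hour derivation, but without the 24-hour wrap; a minimal sketch, again assuming Time starts at 12:00am on day one:

```python
def to_day(seconds):
    # 0 for the first 24 hours (86400 seconds), 1 for the second day
    return seconds // 86400

# First transaction, last second of day one, first second of day two, last Time
print(to_day(0), to_day(86399), to_day(86400), to_day(172792))
```

Applied to the full Time column, counting the values of Day should reproduce the 144786/140021 split above.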

Explore V9 V11-V15 V17 vs. Time by Hour

Let’s explore the hour shifts inside V9 V11-V15 V17.

We see that all the shifts occur every day from 1:00 to 7:00. Interestingly, the transactions that ‘forget’ to shift their V12 value back to normal in the daytime are probably the ones regarded as frauds.

Pairs plot of all features by Class

##  [1] "Time"     "V1"       "V2"       "V3"       "V4"       "V5"      
##  [7] "V6"       "V7"       "V8"       "V9"       "V10"      "V11"     
## [13] "V12"      "V13"      "V14"      "V15"      "V16"      "V17"     
## [19] "V18"      "V19"      "V20"      "V21"      "V22"      "V23"     
## [25] "V24"      "V25"      "V26"      "V27"      "V28"      "Amount"  
## [31] "Class"    "Hour"     "Amount_A" "V1_A"     "Day"

The image size is very large, I’ve saved a high resolution version here

The pairs plot shows that the normal transactions have no significant correlations between features. The frauds, however, have some correlated features.

Explore correlations

Let’s make a heat matrix plot to better understand the correlation.

The plots show that the features are almost uncorrelated for normal transactions, while the frauds have strong correlations among the features V1-V5, V7, V9-V12, V14, and V16-V19. I think the correlations help to remove redundant features but may not be useful for classifying the frauds.
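The idea of computing correlations separately per class can be sketched with synthetic stand-ins (not the real features): the "fraud" rows here share a latent factor, the "normal" rows do not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000

# Synthetic stand-ins: "fraud" rows share a latent factor, "normal" rows do not
latent = rng.normal(size=n)
fraud = pd.DataFrame({"V16": latent + rng.normal(scale=0.3, size=n),
                      "V17": latent + rng.normal(scale=0.3, size=n),
                      "Class": 1})
normal = pd.DataFrame({"V16": rng.normal(size=n),
                       "V17": rng.normal(size=n),
                       "Class": 0})
df = pd.concat([fraud, normal], ignore_index=True)

# Correlation computed separately per class, as in the heat matrix plots
corr_fraud = df[df.Class == 1][["V16", "V17"]].corr().loc["V16", "V17"]
corr_normal = df[df.Class == 0][["V16", "V17"]].corr().loc["V16", "V17"]
print(round(corr_fraud, 2), round(corr_normal, 2))
```

The per-class split matters: pooling both classes into one correlation matrix would wash out the fraud-only structure, since frauds are only 0.172% of the data.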

t-SNE plot

## Read the 10473 x 40 data matrix successfully!
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Normalizing input...
## Building tree...
##  - point 0 of 10473
##  - point 10000 of 10473
## Done in 14.17 seconds (sparsity = 0.012366)!
## Learning embedding...
## Iteration 50: error is 97.800035 (50 iterations in 6.41 seconds)
## Iteration 100: error is 87.909937 (50 iterations in 6.75 seconds)
## Iteration 150: error is 83.627513 (50 iterations in 6.76 seconds)
## Iteration 200: error is 82.766357 (50 iterations in 6.62 seconds)
## Iteration 250: error is 82.395014 (50 iterations in 6.56 seconds)
## Iteration 300: error is 2.997858 (50 iterations in 6.12 seconds)
## Iteration 350: error is 2.573443 (50 iterations in 6.22 seconds)
## Iteration 400: error is 2.332048 (50 iterations in 6.26 seconds)
## Iteration 450: error is 2.170681 (50 iterations in 6.03 seconds)
## Iteration 500: error is 2.053819 (50 iterations in 6.10 seconds)
## Iteration 550: error is 1.966327 (50 iterations in 6.03 seconds)
## Iteration 600: error is 1.897763 (50 iterations in 6.15 seconds)
## Iteration 650: error is 1.845017 (50 iterations in 6.11 seconds)
## Iteration 700: error is 1.804008 (50 iterations in 6.14 seconds)
## Iteration 750: error is 1.773503 (50 iterations in 6.15 seconds)
## Iteration 800: error is 1.750443 (50 iterations in 6.31 seconds)
## Iteration 850: error is 1.735577 (50 iterations in 6.17 seconds)
## Iteration 900: error is 1.723695 (50 iterations in 6.22 seconds)
## Iteration 950: error is 1.715242 (50 iterations in 6.18 seconds)
## Iteration 1000: error is 1.709080 (50 iterations in 6.54 seconds)
## Fitting performed in 125.83 seconds.

I chose the features V1-V5, V7, V9-V12, V14, V16-V18, Hour, and Day to run a t-SNE algorithm, since these features show stronger fraud patterns. The t-SNE plot contains all fraud points and 10000 sampled nonfrauds. The plot shows two major clusters of frauds (upper and right) as well as other individual frauds whose features may look very similar to normal transactions and are therefore hard to identify.
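The sampling step (all frauds plus a sample of nonfrauds, deduplicated) can be sketched as follows; the toy frame and its fraud rate are stand-ins, and the embedding itself is only indicated in a comment rather than run.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Toy frame in place of the full dataset (fraud rate exaggerated for the demo)
df = pd.DataFrame({"V1": rng.normal(size=5000),
                   "Class": (rng.random(5000) < 0.02).astype(int)})

# Keep every fraud, sample the nonfrauds (10000 in the report; 500 here),
# and drop duplicates so identical rows don't distort the embedding
frauds = df[df.Class == 1]
nonfrauds = df[df.Class == 0].sample(n=500, random_state=0)
subset = pd.concat([frauds, nonfrauds]).drop_duplicates()
# subset would then be passed to a t-SNE implementation,
# e.g. sklearn.manifold.TSNE(n_components=2, perplexity=30)
print(len(subset), subset.Class.sum())
```

Downsampling the majority class like this keeps the t-SNE run tractable while preserving every fraud point for inspection.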

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The time series plots are more helpful for seeing how the transaction distributions vary as time elapses. I also observe the hour shifts and day shifts of some features. The plots confirm the useful features found in the bivariate analysis section.

From the correlation heat matrix, I see some features are highly correlated, e.g. V16-V18. I will not consider dropping features until I have built a baseline model.

Finally, I would like to select the most useful features for building a model: the features V1-V5, V7, V9-V12, V14, and V16-V18 have distinct, separated distributions between frauds and nonfrauds; Hour interacts with V9, V11, V12, V14, and V17; Day interacts with V4.

Were there any interesting or surprising interactions between features?

Features like V12 and V13 have a periodic shift at 1:00-7:00 every day, and their distributions vary when the shift occurs. V4 and V26 have a day shift, so each day has a different distribution.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

Yes. I created a Python script that builds a baseline neural network model for fraud detection.

The model scores around 0.8 AUPRC and is able to detect about 80% of frauds without flagging many legitimate customers. However, pushing the rate above 80% is very difficult, because a huge number of customers would have to be inspected to discover only a few more frauds.
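For the imbalanced setting, AUPRC (average precision) is the evaluation metric; scikit-learn's average_precision_score computes it directly, and as a sketch the step-wise quantity can also be computed by hand:

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve: the mean of
    precision evaluated at the rank of each true positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            ap += tp / rank
    return ap / sum(y_true)

# A perfect ranking scores 1.0; ranking the only positive last scores 0.5
print(average_precision([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))
print(average_precision([0, 1], [0.9, 0.1]))
```

Unlike accuracy, this metric cannot be gamed by predicting "nonfraud" for everything, which is why it suits a 0.172% positive rate.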


Final Plots and Summary

Plot One

Description One

Plot one shows the transaction Amount over the two days; the red points are fraudulent transactions.

Plot Two

Description Two

Plot two indicates a distribution shift on V12 from 1:00 to 7:00.

Plot Three

Description Three

The t-SNE plot reduces the high-dimensional feature space to two dimensions. The plot shows two clusters of red points, which are fraudulent transactions.


Reflection

The credit card dataset contains two days of transactions, of which only 0.172% are frauds. I started by exploring individual features and the relationships among multiple features, eventually selecting the best features for a model. I also built a baseline model that detects 80% of frauds without flagging many legitimate customers.

I struggled with selecting the features that distinguish frauds best. Some features are strongly correlated, but I have no background information besides Time and Amount to explain the correlations. I am still looking for high-dimensional visualization tools to better reveal any hidden fraud pattern across all features.

Because frauds are very rare, I use AUPRC as the metric to evaluate a model. My model achieves an average score of 0.8 and detects 80% of frauds. However, I think it is very difficult to break through above this score. Unfortunately, the remaining 20% of frauds are well camouflaged: their V1-V28 values are all close to zero, the mean of normal transactions. Hence, I assert the existing features are not sufficient to uncover all frauds. Collecting more features and more transaction records over different days is recommended for building a better classification model.

Future work will investigate the fraudulent cases that the model fails to detect.

References

https://www.kaggle.com/dalpozz/creditcardfraud

https://cran.r-project.org/web/packages/tsne/tsne.pdf

https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf

https://cran.r-project.org/web/packages/moments/moments.pdf