In this file, we'll perform some exploratory data analysis and data cleaning.
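The package loads and data import aren't shown here, so as a point of reference, here is a minimal sketch of the setup this analysis assumes. The packages are inferred from the functions used later (%>% and drop_na from the tidyverse, train/dummyVars/confusionMatrix from caret, roc from pROC, xgb.importance from xgboost); the CSV file name is hypothetical.
library(tidyverse)   # pipes, drop_na, ggplot2
library(caret)       # train, dummyVars, confusionMatrix
library(pROC)        # ROC curves
library(xgboost)     # xgb.importance
dataSet <- read.csv("loan_data.csv", stringsAsFactors = FALSE)  # hypothetical file name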
dim(dataSet)
## [1] 42537 111
table(sapply(dataSet[1,],class))
##
## character logical numeric
## 23 55 33
We have 111 features and roughly 42,000 observations. However, some of these observations and/or features might not be useful, so we need to take a closer look.
Let’s visualize missing values in this data set.
ggplot_missing <- function(x){
  #### This function produces a plot of the missing data pattern
  #### in x. It is a modified version of a function in the 'neato' package
  if(!require(reshape2)){warning('you need to install reshape2')}
  x %>%
    is.na %>%
    melt %>%
    ggplot(data = .,
           aes(x = Var2,
               y = Var1)) +
    geom_raster(aes(fill = value)) +
    scale_fill_grey(name = "",
                    labels = c("Present","Missing")) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle=45, vjust=0.5)) +
    labs(x = "Variables in Dataset",
         y = "Rows / observations")
}
ggplot_missing(dataSet)
## Loading required package: reshape2
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
There's quite a bit of missing data there. I'll remove the features that are entirely missing, since I can't do anything with those, and then look at what's left.
dataSet = dataSet[,colSums(is.na(dataSet))<nrow(dataSet)]
table(sapply(dataSet,class))
##
## character logical numeric
## 23 1 33
Some motivation for this project: let's take a look at "purpose", which appears to be a factor with 15 levels, and see if we can use the other features to build a classifier for it. This could be useful for, say, targeted advertising that mentions the loan purposes most likely to be relevant to a potential customer. To that end, I'm going to take a subset of this data that excludes features which wouldn't be available for prospects who aren't yet customers. Some of those fields, like loan amount, could be highly predictive, but they wouldn't actually be useful for this model from a business perspective: if a customer has already applied for a loan of a specific amount, we already know what their purpose is.
dataSetReduced = dataSet %>%
select(purpose, home_ownership, annual_inc,delinq_2yrs,
inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,
pub_rec,pub_rec_bankruptcies,revol_bal,revol_util,total_acc)
sapply(dataSetReduced, function(x)length(unique(x)))
## purpose home_ownership annual_inc
## 15 6 5598
## delinq_2yrs inq_last_6mths mths_since_last_delinq
## 13 29 96
## mths_since_last_record open_acc pub_rec
## 114 45 7
## pub_rec_bankruptcies revol_bal revol_util
## 4 22710 1120
## total_acc
## 84
sapply(dataSetReduced,function(x){sum(is.na(x))})
## purpose home_ownership annual_inc
## 2 2 6
## delinq_2yrs inq_last_6mths mths_since_last_delinq
## 31 31 26928
## mths_since_last_record open_acc pub_rec
## 38886 31 31
## pub_rec_bankruptcies revol_bal revol_util
## 1367 2 92
## total_acc
## 31
We have to decide what to do with these missing values. Fortunately, most of these features have very few missing values, so removing those rows should have a minimal impact. Two features stand out here: mths_since_last_record and mths_since_last_delinq. I'd like to look at these more closely.
unique(dataSetReduced$mths_since_last_delinq)
## [1] NA 35 38 61 8 20 18 68 45 48 41 40 74 25 53 39 10 26 56
## [20] 77 28 52 24 16 60 54 23 9 11 13 65 19 80 22 59 79 44 64
## [39] 57 14 63 49 15 73 70 29 51 5 75 55 2 30 47 33 69 4 43
## [58] 21 27 46 81 78 82 31 76 62 72 42 50 3 12 67 36 34 58 17
## [77] 71 66 32 6 37 7 1 83 86 115 96 103 120 106 89 107 85 97 95
## [96] 0
unique(dataSetReduced$mths_since_last_record)
## [1] NA 113 105 97 33 93 52 85 90 91 114 92 117 87 45 83 118 38
## [19] 101 100 112 110 88 79 77 107 102 98 95 103 96 116 111 89 108 29
## [37] 106 115 53 86 57 63 94 109 99 104 76 61 28 23 75 47 82 21
## [55] 62 44 80 67 119 42 34 66 58 22 56 72 64 50 69 49 74 35
## [73] 12 26 78 54 37 73 11 31 59 32 81 68 55 39 51 70 30 41
## [91] 71 40 43 27 65 46 19 17 25 13 48 36 7 60 14 6 18 0
## [109] 20 120 129 5 24 15
Since these features measure the length of time since the last event of interest (delinquencies and public records), the missing values seem to indicate that the event has never happened. I will recode these into 0/1 indicator features that flag whether the individual has ever had a delinquency or public record.
m = is.na(dataSetReduced$mths_since_last_delinq)
dataSetReduced$mths_since_last_delinq[m==TRUE] = 0   # NA -> 0: no delinquency on record
dataSetReduced$mths_since_last_delinq[m==FALSE] = 1  # month value -> 1: has had a delinquency
n = is.na(dataSetReduced$mths_since_last_record)
dataSetReduced$mths_since_last_record[n==TRUE] = 0   # NA -> 0: no public record
dataSetReduced$mths_since_last_record[n==FALSE] = 1  # month value -> 1: has a public record
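As a quick sanity check (just a sketch using base R), both columns should now contain only 0s and 1s, with no NAs remaining:
# Verify the indicator recode: expect only 0/1 values and no NAs
table(dataSetReduced$mths_since_last_delinq, useNA = "ifany")
table(dataSetReduced$mths_since_last_record, useNA = "ifany")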
Now I'll drop the remaining records with missing values and do some conversion and fixing of data types:
dataSetReduced = drop_na(dataSetReduced)
# Remove the trailing % sign from revol_util and convert it to numeric
dataSetReduced$revol_util = gsub('.{1}$', '', dataSetReduced$revol_util)
dataSetReduced$revol_util = as.numeric(dataSetReduced$revol_util)
# One-hot encode home_ownership, then add purpose back as a factor
dataSetReduced$home_ownership = as.factor(dataSetReduced$home_ownership)
dataSetReducedTemp = dataSetReduced[,-1]
dmy <- dummyVars(" ~ .", data = dataSetReducedTemp)
dataSetReducedTemp <- data.frame(predict(dmy, newdata = dataSetReducedTemp))
dataSetReducedTemp$purpose = as.factor(dataSetReduced$purpose)
dataSetReduced = dataSetReducedTemp
summary(dataSetReduced$purpose)
## car credit_card debt_consolidation educational
## 1582 5293 19199 365
## home_improvement house major_purchase medical
## 3113 410 2265 727
## moving other renewable_energy small_business
## 605 4193 106 1886
## vacation wedding
## 390 978
I'm noticing here that the class distribution is quite imbalanced: around half of the records are labeled debt_consolidation. This may present an issue going forward.
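To quantify the imbalance, here is a quick sketch of the class proportions, using only objects already defined above:
# Share of records in each purpose category, largest first
sort(prop.table(table(dataSetReduced$purpose)), decreasing = TRUE)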
Applying a training/test split:
Y = dataSetReduced$purpose
X = select(dataSetReduced, -purpose)
set.seed(1)
trainSplit = createDataPartition(y = Y, p = 0.8, list = FALSE)
Ytrain = Y[trainSplit]
Xtrain = X[trainSplit,]
Ytest = Y[-trainSplit]
Xtest = X[-trainSplit,]
training = dataSetReduced[ trainSplit,]
testing = dataSetReduced[-trainSplit,]
trControl = trainControl(method = "cv", number = 5)
First, I will try out KNN to classify purpose. This requires some preprocessing: KNN is distance-based, so the predictors are centered and scaled.
tuneGrid = expand.grid(k = c(1,2,10,50,100,150,200))
knnOut = train(x = Xtrain, y = Ytrain, method = "knn", tuneGrid = tuneGrid, trControl = trControl,preProcess = c("center","scale"))
Looking at the results from knn:
YhatKnn = predict(knnOut, Xtest)
table(YhatKnn, Ytest)
## Ytest
## YhatKnn car credit_card debt_consolidation educational
## car 0 0 0 0
## credit_card 0 2 0 0
## debt_consolidation 288 1046 3782 69
## educational 0 0 0 0
## home_improvement 19 8 35 1
## house 0 0 0 0
## major_purchase 4 0 6 0
## medical 0 0 0 0
## moving 0 0 0 0
## other 5 2 16 3
## renewable_energy 0 0 0 0
## small_business 0 0 0 0
## vacation 0 0 0 0
## wedding 0 0 0 0
## Ytest
## YhatKnn home_improvement house major_purchase medical moving other
## car 0 0 0 0 0 0
## credit_card 0 1 0 0 0 2
## debt_consolidation 546 73 403 136 113 796
## educational 0 0 0 0 0 0
## home_improvement 66 4 32 5 2 21
## house 0 0 0 0 0 0
## major_purchase 4 1 6 2 2 9
## medical 0 0 0 0 0 0
## moving 0 0 0 0 0 0
## other 6 3 12 2 4 10
## renewable_energy 0 0 0 0 0 0
## small_business 0 0 0 0 0 0
## vacation 0 0 0 0 0 0
## wedding 0 0 0 0 0 0
## Ytest
## YhatKnn renewable_energy small_business vacation wedding
## car 0 0 0 0
## credit_card 0 2 0 0
## debt_consolidation 19 345 75 179
## educational 0 0 0 0
## home_improvement 2 20 1 10
## house 0 0 0 0
## major_purchase 0 2 0 2
## medical 0 0 0 0
## moving 0 0 0 0
## other 0 8 2 4
## renewable_energy 0 0 0 0
## small_business 0 0 0 0
## vacation 0 0 0 0
## wedding 0 0 0 0
Indeed, the class imbalance is an issue: KNN is more or less predicting debt_consolidation for everything. In addition, it appears we have too few data points relative to the number of features for KNN to be effective. I will try tree-based methods going forward, which may handle this issue better.
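To put a number on that, a quick sketch comparing the KNN model's overall test accuracy to the baseline of always predicting the majority class:
# Overall test accuracy of the KNN model
mean(as.character(YhatKnn) == as.character(Ytest))
# Baseline: accuracy of always predicting the most common class in the test set
max(table(Ytest)) / length(Ytest)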
First, I'm going to rework the supervisor to narrow my focus to classifying between "debt consolidation" and "single purpose" (e.g. home repair, education, etc.) loans. After some consideration of what we might be trying to achieve with this project, I decided that this split would be much easier to work with while still providing considerable business value. Marketing debt consolidation products to debt-laden consumers is, in some ways, the polar opposite of marketing products designed to increase the purchasing power of consumers without as much existing debt to worry about, so distinguishing between these two categories of consumers is valuable.
levels(Ytrain)[levels(Ytrain)!="debt_consolidation"] = "single_purpose"
levels(Ytest)[levels(Ytest)!="debt_consolidation"] = "single_purpose"
Ytrain = relevel(Ytrain,"debt_consolidation")
Ytest = relevel(Ytest,"debt_consolidation")
Here, I'm trying a classification tree with pruning, tuning the complexity parameter cp via cross-validation:
tuneGrid = expand.grid(cp = c(0.0001,0.001, 0.01, 0.1))
rpartOut = train(x = Xtrain, y = Ytrain,
method = "rpart",
tuneGrid = tuneGrid,
trControl = trControl)
plot(rpartOut$finalModel,margin= rep(.1,4))
text(rpartOut$finalModel, cex = 0.4, digits = 1)
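For a more legible picture of the pruned tree, the rpart.plot package could be used in place of base plot; this is just a sketch and assumes rpart.plot is installed:
# Cleaner rendering of the final rpart model
library(rpart.plot)
rpart.plot(rpartOut$finalModel, cex = 0.5)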
Next, I'm going to try a random forest (a bagging-based ensemble of trees), followed by a boosting method (xgboost).
set.seed(1)
tuneGridRf = data.frame(mtry = round(sqrt(ncol(Xtrain))))
rfOut = train(x = Xtrain, y = Ytrain,
method = "rf",
tuneGrid = tuneGridRf,
trControl = trControl)
set.seed(1)
tuneGrid = data.frame('nrounds'=c(100,150,200),
'max_depth' = 6,
'eta' = .01,
'gamma' = 0,
'colsample_bytree' = 1,
'min_child_weight' = 0,
'subsample' = .5)
boostOut = train(x = Xtrain, y = Ytrain,
method = "xgbTree", verbose = 0,
tuneGrid = tuneGrid,
trControl = trControl)
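Before moving to the test set, the cross-validated results of the three models can be compared side by side with caret's resamples() helper. This is a quick sketch that relies only on the model objects created above:
# Compare 5-fold CV accuracy and kappa across the three models
cvResults <- resamples(list(tree = rpartOut, rf = rfOut, xgb = boostOut))
summary(cvResults)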
Now let’s take a look at these results with some ROC curves. I’m also going to look at various decision thresholds for each model.
YhatRfProbs = predict(rfOut, Xtest,type="prob")
YhatRPartProbs = predict(rpartOut, Xtest,type="prob")
YhatBoostProbs = predict(boostOut, Xtest,type="prob")
rocCurve = roc(Ytest, YhatRfProbs$single_purpose)
## Setting levels: control = debt_consolidation, case = single_purpose
## Setting direction: controls < cases
rocCurve2 = roc(Ytest, YhatRPartProbs$single_purpose)
## Setting levels: control = debt_consolidation, case = single_purpose
## Setting direction: controls < cases
rocCurve3 = roc(Ytest, YhatBoostProbs$single_purpose)
## Setting levels: control = debt_consolidation, case = single_purpose
## Setting direction: controls < cases
plot(rocCurve, legacy.axes=TRUE,print.thres=c(0.3,0.44,0.5),col='black',main = "thresholds = Random Forest")
lines(rocCurve2, col = 'red')
lines(rocCurve3, col = 'green')
legend(x=0,y = 0.5,legend=c("Class. Tree","Random Forest","Xgboost"),col=c("red","black","green"),lty=1,cex=0.8)
plot(rocCurve2, legacy.axes=TRUE,print.thres=c(0.432,0.6),col='red', main = "thresholds = Class. Tree")
lines(rocCurve, col = 'black')
lines(rocCurve3, col = 'green')
legend(x=0,y = 0.5,legend=c("Class. Tree","Random Forest","Xgboost"),col=c("red","black","green"),lty=1,cex=0.8)
plot(rocCurve3, legacy.axes=TRUE,print.thres=c(0.44,0.47,0.5),col='green',main = "thresholds = Xgboost")
lines(rocCurve2, col = 'red')
lines(rocCurve, col = 'black')
legend(x=0,y = 0.5,legend=c("Class. Tree","Random Forest","Xgboost"),col=c("red","black","green"),lty=1,cex=0.8)
These classifiers all perform fairly similarly. For this hypothetical business problem, I am going to choose the default of 0.5 as a threshold to maximize accuracy.
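To back up "fairly similarly" with a number, the AUCs of the three ROC objects created above can be compared directly with pROC's auc():
# Area under each ROC curve (higher is better)
auc(rocCurve)   # random forest
auc(rocCurve2)  # classification tree
auc(rocCurve3)  # xgboost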
YhatRf = as.factor(ifelse(YhatRfProbs[,2] >= 0.5,"single_purpose","debt_consolidation")) # Convert predicted probabilities to class labels at the 0.5 threshold
YhatRpart = as.factor(ifelse(YhatRPartProbs[,1] >= 0.5,"debt_consolidation","single_purpose"))
YhatBoost = as.factor(ifelse(YhatBoostProbs[,2] >= 0.5,"single_purpose","debt_consolidation"))
confusionMatrix(reference = Ytest, data = YhatRf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction debt_consolidation single_purpose
## debt_consolidation 2421 1824
## single_purpose 1418 2555
##
## Accuracy : 0.6055
## 95% CI : (0.5948, 0.6161)
## No Information Rate : 0.5329
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2127
##
## Mcnemar's Test P-Value : 1.136e-12
##
## Sensitivity : 0.6306
## Specificity : 0.5835
## Pos Pred Value : 0.5703
## Neg Pred Value : 0.6431
## Prevalence : 0.4671
## Detection Rate : 0.2946
## Detection Prevalence : 0.5165
## Balanced Accuracy : 0.6070
##
## 'Positive' Class : debt_consolidation
##
confusionMatrix(reference = Ytest, data = YhatRpart)
## Confusion Matrix and Statistics
##
## Reference
## Prediction debt_consolidation single_purpose
## debt_consolidation 2324 1705
## single_purpose 1515 2674
##
## Accuracy : 0.6082
## 95% CI : (0.5975, 0.6188)
## No Information Rate : 0.5329
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2154
##
## Mcnemar's Test P-Value : 0.0008663
##
## Sensitivity : 0.6054
## Specificity : 0.6106
## Pos Pred Value : 0.5768
## Neg Pred Value : 0.6383
## Prevalence : 0.4671
## Detection Rate : 0.2828
## Detection Prevalence : 0.4903
## Balanced Accuracy : 0.6080
##
## 'Positive' Class : debt_consolidation
##
confusionMatrix(reference = Ytest, data = YhatBoost)
## Confusion Matrix and Statistics
##
## Reference
## Prediction debt_consolidation single_purpose
## debt_consolidation 2500 1877
## single_purpose 1339 2502
##
## Accuracy : 0.6087
## 95% CI : (0.598, 0.6192)
## No Information Rate : 0.5329
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2207
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6512
## Specificity : 0.5714
## Pos Pred Value : 0.5712
## Neg Pred Value : 0.6514
## Prevalence : 0.4671
## Detection Rate : 0.3042
## Detection Prevalence : 0.5326
## Balanced Accuracy : 0.6113
##
## 'Positive' Class : debt_consolidation
##
Now we’re getting much more useful results with these models.
I would also like to look at feature importance to see which features contribute most to these predictions.
boostImportance = xgb.importance(model = boostOut$finalModel)
rfImportance = data.frame(rfOut$finalModel$importance)
rpartImportance = data.frame(rpartOut$finalModel$variable.importance/sum(rpartOut$finalModel$variable.importance))
boostImportance
## Feature Gain Cover Frequency
## 1: revol_bal 0.3795661770 0.3169186403 0.2058592435
## 2: revol_util 0.2618471067 0.2175931656 0.1791446851
## 3: annual_inc 0.1263372281 0.1560599465 0.2134919744
## 4: total_acc 0.0765458607 0.0829598776 0.1378381412
## 5: open_acc 0.0611268411 0.0852286784 0.1007969469
## 6: home_ownership.RENT 0.0363603676 0.0640877519 0.0358064878
## 7: inq_last_6mths 0.0306427452 0.0507815580 0.0572454821
## 8: home_ownership.MORTGAGE 0.0091530312 0.0115087258 0.0166124144
## 9: delinq_2yrs 0.0047609539 0.0029040951 0.0169491525
## 10: pub_rec_bankruptcies 0.0045920784 0.0049886779 0.0079694691
## 11: mths_since_last_delinq 0.0041360980 0.0025473904 0.0112246043
## 12: mths_since_last_record 0.0026427669 0.0027321219 0.0085306993
## 13: home_ownership.OWN 0.0019873056 0.0012619101 0.0071837468
## 14: home_ownership.OTHER 0.0001959571 0.0001725382 0.0011224604
## 15: pub_rec 0.0001054823 0.0002549222 0.0002244921
rfImportance
## MeanDecreaseGini
## home_ownership.MORTGAGE 105.57663
## home_ownership.OTHER 13.46809
## home_ownership.OWN 84.77347
## home_ownership.RENT 110.00604
## annual_inc 2340.03653
## delinq_2yrs 250.42551
## inq_last_6mths 737.83009
## mths_since_last_delinq 276.80219
## mths_since_last_record 68.80754
## open_acc 1332.64678
## pub_rec 73.70613
## pub_rec_bankruptcies 68.55729
## revol_bal 2976.05144
## revol_util 2818.87040
## total_acc 1805.57385
rpartImportance
## rpartOut.finalModel.variable.importance.sum.rpartOut.finalModel.variable.importance.
## revol_bal 0.5163784229
## revol_util 0.2760834115
## annual_inc 0.0546097921
## open_acc 0.0417013228
## home_ownership.RENT 0.0391649992
## home_ownership.MORTGAGE 0.0335993991
## total_acc 0.0235749521
## pub_rec_bankruptcies 0.0041384896
## inq_last_6mths 0.0032724429
## pub_rec 0.0029547725
## mths_since_last_record 0.0029172960
## delinq_2yrs 0.0006317620
## mths_since_last_delinq 0.0005502834
## home_ownership.OWN 0.0004226540
All three models agree that the three most important features, by a wide margin, are revol_bal, revol_util, and annual_inc. This makes a lot of sense: revol_bal and revol_util give an individual's revolving balance and that balance relative to their total available credit, respectively, and both should be strongly predictive of a need for debt consolidation. Income also heavily influences the relative burden imposed by a given level of debt, so it's no surprise that annual income is a highly important feature.
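To make this easier to see at a glance, the importances could also be plotted; a sketch using helpers from xgboost and caret:
# Gain-based importance from the boosted model
xgb.plot.importance(boostImportance, measure = "Gain")
# caret's scaled importance for the random forest
plot(varImp(rfOut))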
Finally, I will choose a model to export for use in a Shiny app that helps hypothetical employees decide which marketing materials to send to a potential customer. Based on its accuracy (and since we aren't concerned with model interpretability for this problem), I will use the xgboost model.
saveRDS(boostOut,"./final_model.rds")
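On the Shiny side, the app would load this file once at start-up and use it to score incoming prospects. A minimal sketch, with the first test-set row standing in for a hypothetical new prospect (the Shiny app itself is not shown here):
# Load the exported caret model and score a (stand-in) new prospect
finalModel <- readRDS("./final_model.rds")
newProspect <- Xtest[1, ]  # hypothetical prospect; must have the same columns as Xtrain
predict(finalModel, newdata = newProspect, type = "prob")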