In this file, we'll perform some exploratory data analysis and data cleaning.
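The package loads and data import aren't shown here, so as a point of reference, here is a minimal sketch of the setup this analysis assumes. The packages are inferred from the functions used later (%>% and drop_na from the tidyverse, train/dummyVars/confusionMatrix from caret, roc from pROC, xgb.importance from xgboost); the CSV file name is hypothetical.
library(tidyverse)   # pipes, drop_na, ggplot2
library(caret)       # train, dummyVars, confusionMatrix
library(pROC)        # ROC curves
library(xgboost)     # xgb.importance
dataSet <- read.csv("loan_data.csv", stringsAsFactors = FALSE)  # hypothetical file name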
dim(dataSet)
## [1] 42537 111
table(sapply(dataSet[1,],class))
##
## character logical numeric
## 23 55 33
We have 111 features and roughly 42,000 observations. However, some of these observations and/or features might not be useful, so we need to take a closer look.
Let’s visualize missing values in this data set.
ggplot_missing <- function(x){
  #### This function produces a plot of the missing data pattern
  #### in x. It is a modified version of a function in the 'neato' package
  if(!require(reshape2)){warning('you need to install reshape2')}
  x %>%
    is.na %>%
    melt %>%
    ggplot(data = .,
           aes(x = Var2,
               y = Var1)) +
    geom_raster(aes(fill = value)) +
    scale_fill_grey(name = "",
                    labels = c("Present","Missing")) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle=45, vjust=0.5)) +
    labs(x = "Variables in Dataset",
         y = "Rows / observations")
}
ggplot_missing(dataSet)
## Loading required package: reshape2
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
There's quite a bit of missing data there. I'll remove the features that are entirely missing, since I can't do anything with those, and then look at what's left.
dataSet = dataSet[,colSums(is.na(dataSet))<nrow(dataSet)]
table(sapply(dataSet,class))
##
## character logical numeric
## 23 1 33
Some motivation for this project: let's take a look at "purpose", which appears to be a factor with 15 levels, and see if we can use the other features to build a classifier for it. This could be useful for, say, targeted advertising that mentions the loan purposes most likely to be relevant to a potential customer. To that end, I'm going to take a subset of this data that excludes features which wouldn't be available for prospects who aren't yet customers. Some of those fields, like loan amount, could be highly predictive, but they wouldn't actually be useful for this model from a business perspective: if a customer has already applied for a loan of a specific amount, we already know what their purpose is.
dataSetReduced = dataSet %>%
select(purpose, home_ownership, annual_inc,delinq_2yrs,
inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,
pub_rec,pub_rec_bankruptcies,revol_bal,revol_util,total_acc)
sapply(dataSetReduced, function(x)length(unique(x)))
## purpose home_ownership annual_inc
## 15 6 5598
## delinq_2yrs inq_last_6mths mths_since_last_delinq
## 13 29 96
## mths_since_last_record open_acc pub_rec
## 114 45 7
## pub_rec_bankruptcies revol_bal revol_util
## 4 22710 1120
## total_acc
## 84
sapply(dataSetReduced,function(x){sum(is.na(x))})
## purpose home_ownership annual_inc
## 2 2 6
## delinq_2yrs inq_last_6mths mths_since_last_delinq
## 31 31 26928
## mths_since_last_record open_acc pub_rec
## 38886 31 31
## pub_rec_bankruptcies revol_bal revol_util
## 1367 2 92
## total_acc
## 31
We have to decide what to do with these missing values. Fortunately, most of these features have very few missing values, so removing those rows should have a minimal impact. Two features stand out here: mths_since_last_record and mths_since_last_delinq. I'd like to look at these more closely.
unique(dataSetReduced$mths_since_last_delinq)
## [1] NA 35 38 61 8 20 18 68 45 48 41 40 74 25 53 39 10 26 56
## [20] 77 28 52 24 16 60 54 23 9 11 13 65 19 80 22 59 79 44 64
## [39] 57 14 63 49 15 73 70 29 51 5 75 55 2 30 47 33 69 4 43
## [58] 21 27 46 81 78 82 31 76 62 72 42 50 3 12 67 36 34 58 17
## [77] 71 66 32 6 37 7 1 83 86 115 96 103 120 106 89 107 85 97 95
## [96] 0
unique(dataSetReduced$mths_since_last_record)
## [1] NA 113 105 97 33 93 52 85 90 91 114 92 117 87 45 83 118 38
## [19] 101 100 112 110 88 79 77 107 102 98 95 103 96 116 111 89 108 29
## [37] 106 115 53 86 57 63 94 109 99 104 76 61 28 23 75 47 82 21
## [55] 62 44 80 67 119 42 34 66 58 22 56 72 64 50 69 49 74 35
## [73] 12 26 78 54 37 73 11 31 59 32 81 68 55 39 51 70 30 41
## [91] 71 40 43 27 65 46 19 17 25 13 48 36 7 60 14 6 18 0
## [109] 20 120 129 5 24 15
Since these features measure the length of time since the last event of interest (delinquencies and public records), the missing values seem to indicate that the event has never happened. I will recode these into 0/1 indicator features that flag whether the individual has ever had a delinquency or public record.
m = is.na(dataSetReduced$mths_since_last_delinq)
dataSetReduced$mths_since_last_delinq[m==TRUE] = 0   # NA -> 0: no delinquency on record
dataSetReduced$mths_since_last_delinq[m==FALSE] = 1  # month value -> 1: has had a delinquency
n = is.na(dataSetReduced$mths_since_last_record)
dataSetReduced$mths_since_last_record[n==TRUE] = 0   # NA -> 0: no public record
dataSetReduced$mths_since_last_record[n==FALSE] = 1  # month value -> 1: has a public record
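As a quick sanity check (just a sketch using base R), both columns should now contain only 0s and 1s, with no NAs remaining:
# Verify the indicator recode: expect only 0/1 values and no NAs
table(dataSetReduced$mths_since_last_delinq, useNA = "ifany")
table(dataSetReduced$mths_since_last_record, useNA = "ifany")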
Now I'll drop the remaining records with missing values and do some conversion and fixing of data types:
dataSetReduced = drop_na(dataSetReduced)
# Remove the trailing % sign from revol_util and convert it to numeric
dataSetReduced$revol_util = gsub('.{1}$', '', dataSetReduced$revol_util)
dataSetReduced$revol_util = as.numeric(dataSetReduced$revol_util)
# One-hot encode home_ownership, then add purpose back as a factor
dataSetReduced$home_ownership = as.factor(dataSetReduced$home_ownership)
dataSetReducedTemp = dataSetReduced[,-1]
dmy <- dummyVars(" ~ .", data = dataSetReducedTemp)
dataSetReducedTemp <- data.frame(predict(dmy, newdata = dataSetReducedTemp))
dataSetReducedTemp$purpose = as.factor(dataSetReduced$purpose)
dataSetReduced = dataSetReducedTemp
summary(dataSetReduced$purpose)
## car credit_card debt_consolidation educational
## 1582 5293 19199 365
## home_improvement house major_purchase medical
## 3113 410 2265 727
## moving other renewable_energy small_business
## 605 4193 106 1886
## vacation wedding
## 390 978
I'm noticing here that the class distribution is quite imbalanced: around half of the records are labeled debt_consolidation. This may present an issue going forward.
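To quantify the imbalance, here is a quick sketch of the class proportions, using only objects already defined above:
# Share of records in each purpose category, largest first
sort(prop.table(table(dataSetReduced$purpose)), decreasing = TRUE)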
Applying a training/test split:
Y = dataSetReduced$purpose
X = select(dataSetReduced, -purpose)
set.seed(1)
trainSplit = createDataPartition(y = Y, p = 0.8, list = FALSE)
Ytrain = Y[trainSplit]
Xtrain = X[trainSplit,]
Ytest = Y[-trainSplit]
Xtest = X[-trainSplit,]
training = dataSetReduced[ trainSplit,]
testing = dataSetReduced[-trainSplit,]
trControl = trainControl(method = "cv", number = 5)
First, I will try out KNN to classify purpose. This requires some preprocessing: KNN is distance-based, so the predictors are centered and scaled.
tuneGrid = expand.grid(k = c(1,2,10,50,100,150,200))
knnOut = train(x = Xtrain, y = Ytrain, method = "knn", tuneGrid = tuneGrid, trControl = trControl,preProcess = c("center","scale"))
Looking at the results from knn:
YhatKnn = predict(knnOut, Xtest)
table(YhatKnn, Ytest)
## Ytest
## YhatKnn car credit_card debt_consolidation educational
## car 0 0 0 0
## credit_card 0 2 0 0
## debt_consolidation 288 1046 3782 69
## educational 0 0 0 0
## home_improvement 19 8 35 1
## house 0 0 0 0
## major_purchase 4 0 6 0
## medical 0 0 0 0
## moving 0 0 0 0
## other 5 2 16 3
## renewable_energy 0 0 0 0
## small_business 0 0 0 0
## vacation 0 0 0 0
## wedding 0 0 0 0
## Ytest
## YhatKnn home_improvement house major_purchase medical moving other
## car 0 0 0 0 0 0
## credit_card 0 1 0 0 0 2
## debt_consolidation 546 73 403 136 113 796
## educational 0 0 0 0 0 0
## home_improvement 66 4 32 5 2 21
## house 0 0 0 0 0 0
## major_purchase 4 1 6 2 2 9
## medical 0 0 0 0 0 0
## moving 0 0 0 0 0 0
## other 6 3 12 2 4 10
## renewable_energy 0 0 0 0 0 0
## small_business 0 0 0 0 0 0
## vacation 0 0 0 0 0 0
## wedding 0 0 0 0 0 0
## Ytest
## YhatKnn renewable_energy small_business vacation wedding
## car 0 0 0 0
## credit_card 0 2 0 0
## debt_consolidation 19 345 75 179
## educational 0 0 0 0
## home_improvement 2 20 1 10
## house 0 0 0 0
## major_purchase 0 2 0 2
## medical 0 0 0 0
## moving 0 0 0 0
## other 0 8 2 4
## renewable_energy 0 0 0 0
## small_business 0 0 0 0
## vacation 0 0 0 0
## wedding 0 0 0 0
Indeed, the class imbalance is an issue: KNN is more or less predicting debt_consolidation for everything. In addition, it appears we have too few data points relative to the number of features for KNN to be effective. I will try tree-based methods going forward, which may handle this issue better.
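To put a number on that, a quick sketch comparing the KNN model's overall test accuracy to the baseline of always predicting the majority class:
# Overall test accuracy of the KNN model
mean(as.character(YhatKnn) == as.character(Ytest))
# Baseline: accuracy of always predicting the most common class in the test set
max(table(Ytest)) / length(Ytest)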
First, I'm going to rework the supervisor to narrow my focus to classifying between "debt consolidation" and "single purpose" (e.g. home repair, education, etc.) loans. After some consideration of what we might be trying to achieve with this project, I decided that this split would be much easier to work with while still providing considerable business value. Marketing debt consolidation products to debt-laden consumers is, in some ways, the polar opposite of marketing products designed to increase the purchasing power of consumers without as much existing debt to worry about, so distinguishing between these two categories of consumers is valuable.
levels(Ytrain)[levels(Ytrain)!="debt_consolidation"] = "single_purpose"
levels(Ytest)[levels(Ytest)!="debt_consolidation"] = "single_purpose"
Ytrain = relevel(Ytrain,"debt_consolidation")
Ytest = relevel(Ytest,"debt_consolidation")
Here, I'm trying a classification tree with pruning, tuning the complexity parameter cp via cross-validation:
tuneGrid = expand.grid(cp = c(0.0001,0.001, 0.01, 0.1))
rpartOut = train(x = Xtrain, y = Ytrain,
method = "rpart",
tuneGrid = tuneGrid,
trControl = trControl)
plot(rpartOut$finalModel,margin= rep(.1,4))
text(rpartOut$finalModel, cex = 0.4, digits = 1)
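For a more legible picture of the pruned tree, the rpart.plot package could be used in place of base plot; this is just a sketch and assumes rpart.plot is installed:
# Cleaner rendering of the final rpart model
library(rpart.plot)
rpart.plot(rpartOut$finalModel, cex = 0.5)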
Next, I'm going to try a random forest (a bagging-based ensemble of trees), followed by a boosting method (xgboost).
set.seed(1)
tuneGridRf = data.frame(mtry = round(sqrt(ncol(Xtrain))))
rfOut = train(x = Xtrain, y = Ytrain,
method = "rf",
tuneGrid = tuneGridRf,
trControl = trControl)
set.seed(1)
tuneGrid = data.frame('nrounds'=c(100,150,200),
'max_depth' = 6,
'eta' = .01,
'gamma' = 0,
'colsample_bytree' = 1,
'min_child_weight' = 0,
'subsample' = .5)
boostOut = train(x = Xtrain, y = Ytrain,
method = "xgbTree", verbose = 0,
tuneGrid = tuneGrid,
trControl = trControl)
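Before moving to the test set, the cross-validated results of the three models can be compared side by side with caret's resamples() helper. This is a quick sketch that relies only on the model objects created above:
# Compare 5-fold CV accuracy and kappa across the three models
cvResults <- resamples(list(tree = rpartOut, rf = rfOut, xgb = boostOut))
summary(cvResults)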
Now let’s take a look at these results with some ROC curves. I’m also going to look at various decision thresholds for each model.
YhatRfProbs = predict(rfOut, Xtest,type="prob")
YhatRPartProbs = predict(rpartOut, Xtest,type="prob")
YhatBoostProbs = predict(boostOut, Xtest,type="prob")
rocCurve = roc(Ytest, YhatRfProbs$single_purpose)
## Setting levels: control = debt_consolidation, case = single_purpose
## Setting direction: controls < cases
rocCurve2 = roc(Ytest, YhatRPartProbs$single_purpose)
## Setting levels: control = debt_consolidation, case = single_purpose
## Setting direction: controls < cases
rocCurve3 = roc(Ytest, YhatBoostProbs$single_purpose)
## Setting levels: control = debt_consolidation, case = single_purpose
## Setting direction: controls < cases
plot(rocCurve, legacy.axes=TRUE,print.thres=c(0.3,0.44,0.5),col='black',main = "thresholds = Random Forest")
lines(rocCurve2, col = 'red')
lines(rocCurve3, col = 'green')
legend(x=0,y = 0.5,legend=c("Class. Tree","Random Forest","Xgboost"),col=c("red","black","green"),lty=1,cex=0.8)
plot(rocCurve2, legacy.axes=TRUE,print.thres=c(0.432,0.6),col='red', main = "thresholds = Class. Tree")
lines(rocCurve, col = 'black')
lines(rocCurve3, col = 'green')
legend(x=0,y = 0.5,legend=c("Class. Tree","Random Forest","Xgboost"),col=c("red","black","green"),lty=1,cex=0.8)
plot(rocCurve3, legacy.axes=TRUE,print.thres=c(0.44,0.47,0.5),col='green',main = "thresholds = Xgboost")
lines(rocCurve2, col = 'red')
lines(rocCurve, col = 'black')
legend(x=0,y = 0.5,legend=c("Class. Tree","Random Forest","Xgboost"),col=c("red","black","green"),lty=1,cex=0.8)
These classifiers all perform fairly similarly. For this hypothetical business problem, I am going to choose the default of 0.5 as a threshold to maximize accuracy.
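To back up "fairly similarly" with a number, the AUCs of the three ROC objects created above can be compared directly with pROC's auc():
# Area under each ROC curve (higher is better)
auc(rocCurve)   # random forest
auc(rocCurve2)  # classification tree
auc(rocCurve3)  # xgboost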
YhatRf = as.factor(ifelse(YhatRfProbs[,2] >= 0.5,"single_purpose","debt_consolidation")) # Convert predicted probabilities to class labels at the 0.5 threshold
YhatRpart = as.factor(ifelse(YhatRPartProbs[,1] >= 0.5,"debt_consolidation","single_purpose"))
YhatBoost = as.factor(ifelse(YhatBoostProbs[,2] >= 0.5,"single_purpose","debt_consolidation"))
confusionMatrix(reference = Ytest, data = YhatRf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction debt_consolidation single_purpose
## debt_consolidation 2421 1824
## single_purpose 1418 2555
##
## Accuracy : 0.6055
## 95% CI : (0.5948, 0.6161)
## No Information Rate : 0.5329
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2127
##
## Mcnemar's Test P-Value : 1.136e-12
##
## Sensitivity : 0.6306
## Specificity : 0.5835
## Pos Pred Value : 0.5703
## Neg Pred Value : 0.6431
## Prevalence : 0.4671
## Detection Rate : 0.2946
## Detection Prevalence : 0.5165
## Balanced Accuracy : 0.6070
##
## 'Positive' Class : debt_consolidation
##
confusionMatrix(reference = Ytest, data = YhatRpart)
## Confusion Matrix and Statistics
##
## Reference
## Prediction debt_consolidation single_purpose
## debt_consolidation 2324 1705
## single_purpose 1515 2674
##
## Accuracy : 0.6082
## 95% CI : (0.5975, 0.6188)
## No Information Rate : 0.5329
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2154
##
## Mcnemar's Test P-Value : 0.0008663
##
## Sensitivity : 0.6054
## Specificity : 0.6106
## Pos Pred Value : 0.5768
## Neg Pred Value : 0.6383
## Prevalence : 0.4671
## Detection Rate : 0.2828
## Detection Prevalence : 0.4903
## Balanced Accuracy : 0.6080
##
## 'Positive' Class : debt_consolidation
##
confusionMatrix(reference = Ytest, data = YhatBoost)
## Confusion Matrix and Statistics
##
## Reference
## Prediction debt_consolidation single_purpose
## debt_consolidation 2500 1877
## single_purpose 1339 2502
##
## Accuracy : 0.6087
## 95% CI : (0.598, 0.6192)
## No Information Rate : 0.5329
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2207
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6512
## Specificity : 0.5714
## Pos Pred Value : 0.5712
## Neg Pred Value : 0.6514
## Prevalence : 0.4671
## Detection Rate : 0.3042
## Detection Prevalence : 0.5326
## Balanced Accuracy : 0.6113
##
## 'Positive' Class : debt_consolidation
##
Now we’re getting much more useful results with these models.
I would also like to look at feature importance to see which features contribute most to these predictions.
boostImportance = xgb.importance(model = boostOut$finalModel)
rfImportance = data.frame(rfOut$finalModel$importance)
rpartImportance = data.frame(rpartOut$finalModel$variable.importance/sum(rpartOut$finalModel$variable.importance))
boostImportance
## Feature Gain Cover Frequency
## 1: revol_bal 0.3795661770 0.3169186403 0.2058592435
## 2: revol_util 0.2618471067 0.2175931656 0.1791446851
## 3: annual_inc 0.1263372281 0.1560599465 0.2134919744
## 4: total_acc 0.0765458607 0.0829598776 0.1378381412
## 5: open_acc 0.0611268411 0.0852286784 0.1007969469
## 6: home_ownership.RENT 0.0363603676 0.0640877519 0.0358064878
## 7: inq_last_6mths 0.0306427452 0.0507815580 0.0572454821
## 8: home_ownership.MORTGAGE 0.0091530312 0.0115087258 0.0166124144
## 9: delinq_2yrs 0.0047609539 0.0029040951 0.0169491525
## 10: pub_rec_bankruptcies 0.0045920784 0.0049886779 0.0079694691
## 11: mths_since_last_delinq 0.0041360980 0.0025473904 0.0112246043
## 12: mths_since_last_record 0.0026427669 0.0027321219 0.0085306993
## 13: home_ownership.OWN 0.0019873056 0.0012619101 0.0071837468
## 14: home_ownership.OTHER 0.0001959571 0.0001725382 0.0011224604
## 15: pub_rec 0.0001054823 0.0002549222 0.0002244921
rfImportance
## MeanDecreaseGini
## home_ownership.MORTGAGE 105.57663
## home_ownership.OTHER 13.46809
## home_ownership.OWN 84.77347
## home_ownership.RENT 110.00604
## annual_inc 2340.03653
## delinq_2yrs 250.42551
## inq_last_6mths 737.83009
## mths_since_last_delinq 276.80219
## mths_since_last_record 68.80754
## open_acc 1332.64678
## pub_rec 73.70613
## pub_rec_bankruptcies 68.55729
## revol_bal 2976.05144
## revol_util 2818.87040
## total_acc 1805.57385
rpartImportance
## rpartOut.finalModel.variable.importance.sum.rpartOut.finalModel.variable.importance.
## revol_bal 0.5163784229
## revol_util 0.2760834115
## annual_inc 0.0546097921
## open_acc 0.0417013228
## home_ownership.RENT 0.0391649992
## home_ownership.MORTGAGE 0.0335993991
## total_acc 0.0235749521
## pub_rec_bankruptcies 0.0041384896
## inq_last_6mths 0.0032724429
## pub_rec 0.0029547725
## mths_since_last_record 0.0029172960
## delinq_2yrs 0.0006317620
## mths_since_last_delinq 0.0005502834
## home_ownership.OWN 0.0004226540
All three models agree that the three most important features, by a wide margin, are revol_bal, revol_util, and annual_inc. This makes a lot of sense: revol_bal and revol_util give an individual's revolving balance and that balance relative to their total available credit, respectively, and both should be strongly predictive of a need for debt consolidation. Income also heavily influences the relative burden imposed by a given level of debt, so it's no surprise that annual income is a highly important feature.
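To make this easier to see at a glance, the importances could also be plotted; a sketch using helpers from xgboost and caret:
# Gain-based importance from the boosted model
xgb.plot.importance(boostImportance, measure = "Gain")
# caret's scaled importance for the random forest
plot(varImp(rfOut))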
Finally, I will choose a model to export for use in a Shiny app that helps hypothetical employees decide which marketing materials to send to a potential customer. Based on its accuracy (and since we aren't concerned with model interpretability for this problem), I will use the xgboost model.
saveRDS(boostOut,"./final_model.rds")
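On the Shiny side, the app would load this file once at start-up and use it to score incoming prospects. A minimal sketch, with the first test-set row standing in for a hypothetical new prospect (the Shiny app itself is not shown here):
# Load the exported caret model and score a (stand-in) new prospect
finalModel <- readRDS("./final_model.rds")
newProspect <- Xtest[1, ]  # hypothetical prospect; must have the same columns as Xtrain
predict(finalModel, newdata = newProspect, type = "prob")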