Loan Default Prediction was a Kaggle competition to predict whether a loan would default and, in the case of a default, how much of the loan would be lost. My code is here.

The data set consists of 700-odd features. Some of the data consisted of very large numbers, which python imported as objects instead of integers. I coped with this by writing a "convert" function which returned the object data as ints or floats.

Not much happened in the competition for a few weeks because just predicting which loans would default was a hard problem. I initially tried to reduce the data using PCA or by eliminating features which were highly correlated, but even with the reduced data, none of the tools in sklearn did any better than just guessing "no default" for every loan. Yasser Tabandeh eventually pointed out that the binomial model from the glm package in statsmodels, applied to two particular features, was a good predictor of defaults. But why did sklearn fail while glm succeeded? One reason is that the values of a feature in the test data were often much larger than the values of the same feature in the training data. dolaameng pointed out that "the local neighbor models are good at interpolation but very bad at extrapolation."
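The original "convert" function is not reproduced in this write-up. A minimal sketch of the idea, assuming the data is loaded into a pandas DataFrame and using `pd.to_numeric` for the parsing, might look like this:

```python
import pandas as pd

def convert(df):
    """Coerce object-typed columns to ints or floats where possible.

    Columns that pandas imported as 'object' (e.g. because they held
    very large numbers) are parsed as numeric; genuinely non-numeric
    columns are left untouched.
    """
    for col in df.select_dtypes(include="object").columns:
        try:
            df[col] = pd.to_numeric(df[col])
        except (ValueError, TypeError):
            pass  # leave non-numeric columns as they are
    return df
```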

Like many other Kagglers, my strategy was first to classify loans as either "default" or "non-default"
and then apply a regression tool to predict the loss for those loans that defaulted.
The substantial differences between the values of features in the test data and in the training data suggested using the binomial generalized linear model in statsmodels to classify the loans as "default" or "non-default". Logistic regression might also work well.

Since no single feature is especially predictive of whether a loan defaults
and since there are several features which differ only slightly, it makes sense to examine
pairs of features to see if the two features together are valuable. Using two nested loops to
investigate all of these combinations was time-consuming (an overnight computation), but it
allowed me to identify the "golden features".
Since the binomial glm classifier returns a value between 0 and 1, I used the area under the ROC curve (AUC) to quantify the quality of the prediction. An AUC of 0.5 corresponds to random guessing, while an AUC of 1 corresponds to classifying every loan correctly. I also needed to select a cutoff value for labelling the loans as "default" or "non-default". I chose 0.66 because it produced the best results in the subsequent step: using regression to predict the loss for defaults.
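A toy illustration of the AUC calculation and the 0.66 cutoff, using made-up probabilities and labels:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up predicted default probabilities and true labels
probs = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2])
y_true = np.array([0, 0, 1, 1, 1, 0])

auc = roc_auc_score(y_true, probs)  # 1.0 would be a perfect ranking

# Label a loan "default" when its predicted probability exceeds the cutoff
cutoff = 0.66
labels = (probs > cutoff).astype(int)
```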

After training the statsmodels classifier to predict defaults with some success, I used sklearn's
RandomForestRegressor to predict the loss.
My most useful insight here was to train the classifier on my training subset, then apply the classifier to that same subset and use all of the loans that the classifier identified as "default" to train my regressor. This meant that the regressor's training set included loans that the classifier had incorrectly labelled as "default". Of the regression tools I tried, RandomForestRegressor did the best job of estimating the loss. It turned out to be
very profitable to include these mislabelled loans when training the regressor.

Copyright (c) 2013 small yellow duck. All rights reserved.