Deep Learning Project

This Python project explores Lending Club data to predict whether a borrower will pay off their loan. The data is cleaned and engineered for use with a neural network model.


Library and Data Imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
df = pd.read_csv('../DATA/lending_club_loan_two.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 396030 entries, 0 to 396029
Data columns (total 27 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   loan_amnt             396030 non-null  float64
 1   term                  396030 non-null  object 
 2   int_rate              396030 non-null  float64
 3   installment           396030 non-null  float64
 4   grade                 396030 non-null  object 
 5   sub_grade             396030 non-null  object 
 6   emp_title             373103 non-null  object 
 7   emp_length            377729 non-null  object 
 8   home_ownership        396030 non-null  object 
 9   annual_inc            396030 non-null  float64
 10  verification_status   396030 non-null  object 
 11  issue_d               396030 non-null  object 
 12  loan_status           396030 non-null  object 
 13  purpose               396030 non-null  object 
 14  title                 394275 non-null  object 
 15  dti                   396030 non-null  float64
 16  earliest_cr_line      396030 non-null  object 
 17  open_acc              396030 non-null  float64
 18  pub_rec               396030 non-null  float64
 19  revol_bal             396030 non-null  float64
 20  revol_util            395754 non-null  float64
 21  total_acc             396030 non-null  float64
 22  initial_list_status   396030 non-null  object 
 23  application_type      396030 non-null  object 
 24  mort_acc              358235 non-null  float64
 25  pub_rec_bankruptcies  395495 non-null  float64
 26  address               396030 non-null  object 
dtypes: float64(12), object(15)
memory usage: 81.6+ MB

Exploratory Data Analysis

The following countplot shows the distribution of the target variable, loan_status.

The classes are imbalanced: fully paid loans far outnumber charged-off loans. Ideally, the distribution would be more even.

sns.countplot(x='loan_status',data=df)
<AxesSubplot:xlabel='loan_status', ylabel='count'>
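To quantify the imbalance, we can check the normalized class counts (a quick sketch; value_counts with normalize=True is standard pandas):

# Fraction of loans in each class of the target
df['loan_status'].value_counts(normalize=True)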

The following histogram shows the number of loans at each dollar amount.
There are noticeable spikes at $5,000 increments.

plt.figure(figsize=(12,8))
sns.histplot(df['loan_amnt'],bins=40)
<AxesSubplot:xlabel='loan_amnt', ylabel='Count'>
df.corr()
                      loan_amnt   int_rate  installment  annual_inc        dti   open_acc  \
loan_amnt              1.000000   0.168921     0.953929    0.336887   0.016636   0.198556  \
int_rate               0.168921   1.000000     0.162758   -0.056771   0.079038   0.011649  \
installment            0.953929   0.162758     1.000000    0.330381   0.015786   0.188973  \
annual_inc             0.336887  -0.056771     0.330381    1.000000  -0.081685   0.136150  \
dti                    0.016636   0.079038     0.015786   -0.081685   1.000000   0.136181  \
open_acc               0.198556   0.011649     0.188973    0.136150   0.136181   1.000000  \
pub_rec               -0.077779   0.060986    -0.067892   -0.013720  -0.017639  -0.018392  \
revol_bal              0.328320  -0.011280     0.316455    0.299773   0.063571   0.221192  \
revol_util             0.099911   0.293659     0.123915    0.027871   0.088375  -0.131420  \
total_acc              0.223886  -0.036404     0.202430    0.193023   0.102128   0.680728  \
mort_acc               0.222315  -0.082583     0.193694    0.236320  -0.025439   0.109205  \
pub_rec_bankruptcies  -0.106539   0.057450    -0.098628   -0.050162  -0.014558  -0.027732  \

                        pub_rec  revol_bal  revol_util  total_acc   mort_acc  pub_rec_bankruptcies
loan_amnt             -0.077779   0.328320    0.099911   0.223886   0.222315             -0.106539
int_rate               0.060986  -0.011280    0.293659  -0.036404  -0.082583              0.057450
installment           -0.067892   0.316455    0.123915   0.202430   0.193694             -0.098628
annual_inc            -0.013720   0.299773    0.027871   0.193023   0.236320             -0.050162
dti                   -0.017639   0.063571    0.088375   0.102128  -0.025439             -0.014558
open_acc              -0.018392   0.221192   -0.131420   0.680728   0.109205             -0.027732
pub_rec                1.000000  -0.101664   -0.075910   0.019723   0.011552              0.699408
revol_bal             -0.101664   1.000000    0.226346   0.191616   0.194925             -0.124532
revol_util            -0.075910   0.226346    1.000000  -0.104273   0.007514             -0.086751
total_acc              0.019723   0.191616   -0.104273   1.000000   0.381072              0.042035
mort_acc               0.011552   0.194925    0.007514   0.381072   1.000000              0.027239
pub_rec_bankruptcies   0.699408  -0.124532   -0.086751   0.042035   0.027239              1.000000

The following heatmap gives a clear view of which features are most strongly correlated with one another.

plt.figure(figsize=(12,8))
sns.heatmap(df.corr(),cmap='viridis',annot=True)
<AxesSubplot:>

The installment amount is highly correlated with the loan amount, which makes sense, since installments are computed from the loan amount.
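To see this relationship directly, we can draw a scatterplot (a quick sketch in the same seaborn style as above):

plt.figure(figsize=(12,8))
# Expect a tight, nearly linear band given the 0.95 correlation
sns.scatterplot(x='installment',y='loan_amnt',data=df)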

df.columns
Index(['loan_amnt', 'term', 'int_rate', 'installment', 'grade', 'sub_grade',
       'emp_title', 'emp_length', 'home_ownership', 'annual_inc',
       'verification_status', 'issue_d', 'loan_status', 'purpose', 'title',
       'dti', 'earliest_cr_line', 'open_acc', 'pub_rec', 'revol_bal',
       'revol_util', 'total_acc', 'initial_list_status', 'application_type',
       'mort_acc', 'pub_rec_bankruptcies', 'address'],
      dtype='object')
grade_order = list(df['grade'].sort_values().unique())
subgrade_order = list(df['sub_grade'].sort_values().unique())

The following shows the number of fully paid versus charged-off accounts by grade.
Higher grades have proportionally more fully paid accounts than lower grades.

sns.countplot(x='grade',data=df,hue='loan_status',order=grade_order)
<AxesSubplot:xlabel='grade', ylabel='count'>

The following countplot shows the number of loans by sub-grade; the distribution is right-skewed, peaking in the B sub-grades.

plt.figure(figsize=(13,5))
sns.countplot(x='sub_grade',data=df,order=subgrade_order,palette='coolwarm')
<AxesSubplot:xlabel='sub_grade', ylabel='count'>

When we separate by the target variable in the following plot, those with higher grades and sub-grades have proportionally more fully paid loans, as expected.

plt.figure(figsize=(13,5))
sns.countplot(x='sub_grade',data=df,order=subgrade_order,palette='coolwarm',hue='loan_status')
<AxesSubplot:xlabel='sub_grade', ylabel='count'>

Those with F and G grades have closer to 50/50 odds of paying off their loans.

fg_df = df[(df['grade']=='F')|(df['grade']=='G')]
plt.figure(figsize=(13,5))
sns.countplot(x='sub_grade',data=fg_df,hue='loan_status',order=fg_df['sub_grade'].sort_values().unique(),palette='coolwarm')
<AxesSubplot:xlabel='sub_grade', ylabel='count'>

Feature Engineering

Here we prepare the data for the neural network. First, the target loan_status is encoded as a numeric column, loan_repaid.

def paid_charged(pc):
    # Encode the target: 1 for 'Fully Paid', 0 for 'Charged Off'
    if pc == 'Fully Paid':
        return 1
    else:
        return 0
df['loan_repaid'] = df['loan_status'].apply(paid_charged)
df[['loan_repaid','loan_status']]
       loan_repaid  loan_status
0                1   Fully Paid
1                1   Fully Paid
2                1   Fully Paid
3                1   Fully Paid
4                0  Charged Off
...            ...          ...
396025           1   Fully Paid
396026           1   Fully Paid
396027           1   Fully Paid
396028           1   Fully Paid
396029           1   Fully Paid

[396030 rows x 2 columns]
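As an aside, the same encoding could be written as a one-line mapping (a sketch; it assumes loan_status takes only these two values, which holds here):

# One-line equivalent of paid_charged using a dictionary map
df['loan_repaid'] = df['loan_status'].map({'Fully Paid': 1, 'Charged Off': 0})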

Interest rate has a negative correlation with being repaid, while annual income and the number of mortgage accounts are among the features most positively correlated with repayment.

df.corr()['loan_repaid'][:-1].sort_values().plot(kind='bar')
<AxesSubplot:>

Missing Data

The following shows missing data as a percentage of the 396,030 total rows. The missing values in title, revol_util, and pub_rec_bankruptcies each account for less than 1% of rows, so those rows can be dropped without impacting the dataset too much. The other three columns (emp_title, emp_length, and mort_acc) may need to have data imputed.

100 * df.isnull().sum() / len(df)
loan_amnt               0.000000
term                    0.000000
int_rate                0.000000
installment             0.000000
grade                   0.000000
sub_grade               0.000000
emp_title               5.789208
emp_length              4.621115
home_ownership          0.000000
annual_inc              0.000000
verification_status     0.000000
issue_d                 0.000000
loan_status             0.000000
purpose                 0.000000
title                   0.443148
dti                     0.000000
earliest_cr_line        0.000000
open_acc                0.000000
pub_rec                 0.000000
revol_bal               0.000000
revol_util              0.069692
total_acc               0.000000
initial_list_status     0.000000
application_type        0.000000
mort_acc                9.543469
pub_rec_bankruptcies    0.135091
address                 0.000000
loan_repaid             0.000000
dtype: float64

We will take a closer look at emp_title, emp_length, and mort_acc.

df['emp_title'].nunique()
173105
df['emp_title'].value_counts()
Teacher                                     4389
Manager                                     4250
Registered Nurse                            1856
RN                                          1846
Supervisor                                  1830
                                            ... 
Warner Brothers                                1
Liscened  Practical Nirse                      1
Crews Lake Middle  School                      1
Kaiser - Southern California Permanente        1
HIV Testing/Counseling Coordinator             1
Name: emp_title, Length: 173105, dtype: int64

With over 173,000 unique values, employment title would be difficult to address effectively for our purposes: any insight gleaned would not justify the computing cost and time spent re-categorizing the titles. We will delete this column altogether.

df.drop('emp_title',axis=1,inplace=True)
df['emp_length'].unique()
array(['10+ years', '4 years', '< 1 year', '6 years', '9 years',
       '2 years', '3 years', '8 years', '7 years', '5 years', '1 year',
       nan], dtype=object)
sorted(df['emp_length'].dropna().unique())
['1 year',
 '10+ years',
 '2 years',
 '3 years',
 '4 years',
 '5 years',
 '6 years',
 '7 years',
 '8 years',
 '9 years',
 '< 1 year']
order = ['< 1 year','1 year','2 years','3 years','4 years','5 years',
         '6 years', '7 years', '8 years', '9 years','10+ years']
plt.figure(figsize=(10,6))
sns.countplot(x='emp_length',data=df,order=order)
<AxesSubplot:xlabel='emp_length', ylabel='count'>

What is the relationship between our target variable and this feature?

plt.figure(figsize=(10,6))
sns.countplot(x='emp_length',data=df,order=order,hue='loan_status')
<AxesSubplot:xlabel='emp_length', ylabel='count'>
df.columns
Index(['loan_amnt', 'term', 'int_rate', 'installment', 'grade', 'sub_grade',
       'emp_length', 'home_ownership', 'annual_inc', 'verification_status',
       'issue_d', 'loan_status', 'purpose', 'title', 'dti', 'earliest_cr_line',
       'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc',
       'initial_list_status', 'application_type', 'mort_acc',
       'pub_rec_bankruptcies', 'address', 'loan_repaid'],
      dtype='object')
emp_co = df[df['loan_status']=="Charged Off"].groupby("emp_length").count()['loan_status']
emp_fp = df[df['loan_status']=="Fully Paid"].groupby("emp_length").count()['loan_status']
emp_len = emp_co/emp_fp
emp_len
emp_length
1 year       0.248649
10+ years    0.225770
2 years      0.239560
3 years      0.242593
4 years      0.238213
5 years      0.237911
6 years      0.233341
7 years      0.241887
8 years      0.249625
9 years      0.250735
< 1 year     0.260830
Name: loan_status, dtype: float64
emp_len.plot(kind='bar')
<AxesSubplot:xlabel='emp_length'>

The ratio of charged-off to fully paid loans is roughly constant (about 0.23 to 0.26) across employment lengths, so this feature carries little signal. We will drop this column as well.

df = df.drop('emp_length',axis=1)

The title column is essentially a free-text duplicate of the purpose column; both are string columns, so we drop title.

df = df.drop('title',axis=1)
df['mort_acc'].value_counts()
0.0     139777
1.0      60416
2.0      49948
3.0      38049
4.0      27887
5.0      18194
6.0      11069
7.0       6052
8.0       3121
9.0       1656
10.0       865
11.0       479
12.0       264
13.0       146
14.0       107
15.0        61
16.0        37
17.0        22
18.0        18
19.0        15
20.0        13
24.0        10
22.0         7
21.0         4
25.0         4
27.0         3
23.0         2
32.0         2
26.0         2
31.0         2
30.0         1
28.0         1
34.0         1
Name: mort_acc, dtype: int64
print("Correlation with the mort_acc column")
df.corr()['mort_acc'].sort_values()[:-1]
Correlation with the mort_acc column

int_rate               -0.082583
dti                    -0.025439
revol_util              0.007514
pub_rec                 0.011552
pub_rec_bankruptcies    0.027239
loan_repaid             0.073111
open_acc                0.109205
installment             0.193694
revol_bal               0.194925
loan_amnt               0.222315
annual_inc              0.236320
total_acc               0.381072
Name: mort_acc, dtype: float64
df.groupby('total_acc').mean()['mort_acc']
total_acc
2.0      0.000000
3.0      0.052023
4.0      0.066743
5.0      0.103289
6.0      0.151293
           ...   
124.0    1.000000
129.0    1.000000
135.0    3.000000
150.0    2.000000
151.0    0.000000
Name: mort_acc, Length: 118, dtype: float64
tot_acc_avg = df.groupby('total_acc').mean()['mort_acc']
tot_acc_avg[2.0]
0.0

Missing mort_acc values will be imputed with the mean mort_acc of rows sharing the same total_acc value.

def fill_mort_acc(total_acc,mort_acc):
    # Replace a missing mort_acc with the group mean for that total_acc value
    if np.isnan(mort_acc):
        return tot_acc_avg[total_acc]
    else:
        return mort_acc
df['mort_acc'] = df.apply(lambda x: fill_mort_acc(x['total_acc'],x['mort_acc']),axis=1)
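A row-wise apply is slow over roughly 400,000 rows; the same imputation can be vectorized (a sketch using fillna with a mapped Series):

# Vectorized equivalent: look up each row's total_acc group mean and use it
# only where mort_acc is missing
df['mort_acc'] = df['mort_acc'].fillna(df['total_acc'].map(tot_acc_avg))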

We have addressed the most significant missing data. We can drop the remaining rows with missing values, since doing so will not significantly impact the dataset.

df = df.dropna()

We now have no more missing data points.

df.isnull().sum()
loan_amnt               0
term                    0
int_rate                0
installment             0
grade                   0
sub_grade               0
home_ownership          0
annual_inc              0
verification_status     0
issue_d                 0
loan_status             0
purpose                 0
dti                     0
earliest_cr_line        0
open_acc                0
pub_rec                 0
revol_bal               0
revol_util              0
total_acc               0
initial_list_status     0
application_type        0
mort_acc                0
pub_rec_bankruptcies    0
address                 0
loan_repaid             0
dtype: int64

Non-numeric Features

We can now address the non-numeric data types in our dataset so that the model can consume them.

list(df.select_dtypes(exclude='number').columns)
['term',
 'grade',
 'sub_grade',
 'home_ownership',
 'verification_status',
 'issue_d',
 'loan_status',
 'purpose',
 'earliest_cr_line',
 'initial_list_status',
 'application_type',
 'address']

term feature

The 'term' column can be converted to a numeric column by extracting the number of months at the start of each entry.

df['term'][0].split()
['36', 'months']
df['term'] = df['term'].apply(lambda x: int(x.split()[0]))
df['term'].value_counts()
36    301247
60     93972
Name: term, dtype: int64

grade and sub_grade features

grade is already encoded within sub_grade, so grade is dropped.

df = df.drop('grade',axis=1)

sub_grade is converted to dummy variables and the original sub_grade column is deleted.

subs = pd.get_dummies(df['sub_grade'])
df.drop('sub_grade',axis=1,inplace=True)
df = pd.concat([df,subs],axis=1)

verification_status, application_type, initial_list_status, purpose features

Again, we will convert to dummy variables.

ver_stat = pd.get_dummies(df['verification_status'])
app_type = pd.get_dummies(df['application_type'])
init_list_stat = pd.get_dummies(df['initial_list_status'])
purp = pd.get_dummies(df['purpose'])
df = df.drop(['verification_status','application_type','initial_list_status','purpose'],axis=1)
df = pd.concat([df,ver_stat,app_type,init_list_stat,purp],axis=1)
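For reference, the get_dummies/drop/concat pattern can be collapsed into a single call if applied before the original columns are dropped (a sketch on a hypothetical pre-drop copy df_pre; note that this form prefixes each dummy with its source column name):

# One-call alternative to the four-step conversion above;
# df_pre stands in for the DataFrame before the columns were dropped
cat_cols = ['verification_status','application_type','initial_list_status','purpose']
df_pre = pd.get_dummies(df_pre, columns=cat_cols)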
list(df.select_dtypes(exclude='number').columns)
['home_ownership', 'issue_d', 'loan_status', 'earliest_cr_line', 'address']

home_ownership feature

df['home_ownership'].value_counts()
MORTGAGE    198022
RENT        159395
OWN          37660
OTHER          110
NONE            29
ANY              3
Name: home_ownership, dtype: int64

The NONE and ANY categories are too rare to give useful insight, so they are folded into OTHER.

def home_own_type(ownership):
    # Fold the rare NONE and ANY categories into OTHER
    if ownership.lower() in ['none','any']:
        return 'OTHER'
    else:
        return ownership

df['home_ownership'] = df['home_ownership'].apply(home_own_type)
# Named home_dummies so it does not shadow the function above
home_dummies = pd.get_dummies(df['home_ownership'])
df = df.drop('home_ownership',axis=1)
df = pd.concat([df,home_dummies],axis=1)

address feature

We will extract zip codes from the address column and create dummy variables.

df['address'][0][-5:]
'22690'
df['zip_code'] = df['address'].apply(lambda x: x[-5:])
zip_codes = pd.get_dummies(df['zip_code'])
df = df.drop('zip_code',axis=1)
df = pd.concat([df,zip_codes],axis=1)
df = df.drop('address',axis=1)
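The same extraction can be vectorized with the pandas string accessor (a sketch; like the apply above, it assumes every address ends with the zip code):

# Vectorized equivalent of the apply above
df['zip_code'] = df['address'].str[-5:]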

There are three columns remaining with non-numeric data types.

df.select_dtypes(exclude='number').columns
Index(['issue_d', 'loan_status', 'earliest_cr_line'], dtype='object')

issue_d feature

issue_d is the loan issue date, which is not a predictive feature for our target variable. We will drop this column.

df = df.drop('issue_d',axis=1)

earliest_cr_line feature

We will extract the year from the earliest_cr_line feature and use that as a numeric feature.

df['earliest_cr_line'][0][-4:]
'1990'
df['earliest_cr_year'] = df['earliest_cr_line'].apply(lambda x: int(x[-4:]))
df = df.drop('earliest_cr_line',axis=1)
df.select_dtypes(exclude='number').columns
Index(['loan_status'], dtype='object')

Working with our Model

Train Test Split

from sklearn.model_selection import train_test_split

loan_repaid already encodes the target numerically, so we can drop loan_status, which is essentially a duplicate of those values.

df = df.drop('loan_status',axis=1)

X holds the features; y holds the target variable.

X = df.drop('loan_repaid',axis=1).values
y = df['loan_repaid'].values

Note that this sample is taken after X and y were built, so the model below still trains on the full dataset (the training-set shape later confirms this); the smaller df is used only to draw a single example in the final section. To actually train on a subsample due to computing resources, the sampling would need to happen before constructing X and y.

df = df.sample(frac=0.1,random_state=101)
print(len(df))
39522
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101)

Normalizing Data

We scale the features to the [0, 1] range, fitting the scaler on the training set only and applying the same transform to the test set so that no test-set information leaks into training.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Creating Model/Neural Network

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout
from tensorflow.keras.callbacks import EarlyStopping
X_train.shape
(316175, 85)

We will employ early stopping on the validation loss to cut training short once it stops improving.

early_stop = EarlyStopping(monitor='val_loss',mode='auto',verbose=1,patience=25)
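# A possible refinement (not used here): restore_best_weights=True would roll
# the model back to the weights from the epoch with the lowest val_loss:
# early_stop = EarlyStopping(monitor='val_loss',mode='auto',verbose=1,
#                            patience=25,restore_best_weights=True)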
model = Sequential()

# Input layer with rectified linear unit activation function
model.add(Dense(78,activation='relu'))
model.add(Dropout(0.5))

# Hidden layers
model.add(Dense(39,activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(19,activation='relu'))
model.add(Dropout(0.5))

# Output layer with sigmoid activation function
model.add(Dense(1,activation='sigmoid'))

# Model will use adam optimizer.
model.compile(loss='binary_crossentropy',optimizer='adam')
model.fit(x=X_train,y=y_train,epochs=200,
          validation_data=(X_test,y_test),
          verbose=1,callbacks=[early_stop])
Epoch 1/200
9881/9881 [==============================] - 9s 954us/step - loss: 0.2907 - val_loss: 0.2663
Epoch 2/200
9881/9881 [==============================] - 9s 942us/step - loss: 0.2669 - val_loss: 0.2648
Epoch 3/200
9881/9881 [==============================] - 10s 962us/step - loss: 0.2659 - val_loss: 0.2640
Epoch 4/200
9881/9881 [==============================] - 9s 952us/step - loss: 0.2654 - val_loss: 0.2639
Epoch 5/200
9881/9881 [==============================] - 9s 941us/step - loss: 0.2651 - val_loss: 0.2632
Epoch 6/200
9881/9881 [==============================] - 9s 953us/step - loss: 0.2646 - val_loss: 0.2635
Epoch 7/200
9881/9881 [==============================] - 9s 952us/step - loss: 0.2644 - val_loss: 0.2625
Epoch 8/200
9881/9881 [==============================] - 10s 966us/step - loss: 0.2647 - val_loss: 0.2631
Epoch 9/200
9881/9881 [==============================] - 10s 967us/step - loss: 0.2643 - val_loss: 0.2628
Epoch 10/200
9881/9881 [==============================] - 9s 954us/step - loss: 0.2642 - val_loss: 0.2633
Epoch 11/200
9881/9881 [==============================] - 10s 970us/step - loss: 0.2642 - val_loss: 0.2634
Epoch 12/200
9881/9881 [==============================] - 10s 966us/step - loss: 0.2643 - val_loss: 0.2630
Epoch 13/200
9881/9881 [==============================] - 10s 1ms/step - loss: 0.2643 - val_loss: 0.2622
Epoch 14/200
9881/9881 [==============================] - 10s 1ms/step - loss: 0.2645 - val_loss: 0.2630
Epoch 15/200
9881/9881 [==============================] - 10s 964us/step - loss: 0.2642 - val_loss: 0.2626
Epoch 16/200
9881/9881 [==============================] - 9s 961us/step - loss: 0.2641 - val_loss: 0.2627
Epoch 17/200
9881/9881 [==============================] - 9s 954us/step - loss: 0.2638 - val_loss: 0.2634
Epoch 18/200
9881/9881 [==============================] - 10s 969us/step - loss: 0.2640 - val_loss: 0.2631
Epoch 19/200
9881/9881 [==============================] - 10s 962us/step - loss: 0.2638 - val_loss: 0.2620
Epoch 20/200
9881/9881 [==============================] - 10s 984us/step - loss: 0.2639 - val_loss: 0.2622
Epoch 21/200
9881/9881 [==============================] - 10s 973us/step - loss: 0.2635 - val_loss: 0.2623
Epoch 22/200
9881/9881 [==============================] - 9s 960us/step - loss: 0.2635 - val_loss: 0.2628
Epoch 23/200
9881/9881 [==============================] - 10s 970us/step - loss: 0.2633 - val_loss: 0.2624
Epoch 24/200
9881/9881 [==============================] - 10s 966us/step - loss: 0.2638 - val_loss: 0.2636
Epoch 25/200
9881/9881 [==============================] - 9s 959us/step - loss: 0.2637 - val_loss: 0.2629
Epoch 26/200
9881/9881 [==============================] - 10s 962us/step - loss: 0.2637 - val_loss: 0.2633
Epoch 27/200
9881/9881 [==============================] - 10s 971us/step - loss: 0.2637 - val_loss: 0.2647
Epoch 28/200
9881/9881 [==============================] - 10s 975us/step - loss: 0.2635 - val_loss: 0.2630
Epoch 29/200
9881/9881 [==============================] - 10s 967us/step - loss: 0.2631 - val_loss: 0.2623
Epoch 30/200
9881/9881 [==============================] - 9s 957us/step - loss: 0.2632 - val_loss: 0.2645
Epoch 31/200
9881/9881 [==============================] - 10s 963us/step - loss: 0.2632 - val_loss: 0.2620
Epoch 32/200
9881/9881 [==============================] - 10s 978us/step - loss: 0.2631 - val_loss: 0.2637
Epoch 33/200
9881/9881 [==============================] - 9s 959us/step - loss: 0.2633 - val_loss: 0.2630
Epoch 34/200
9881/9881 [==============================] - 10s 974us/step - loss: 0.2633 - val_loss: 0.2636
Epoch 35/200
9881/9881 [==============================] - 10s 965us/step - loss: 0.2636 - val_loss: 0.2622
Epoch 36/200
9881/9881 [==============================] - 10s 970us/step - loss: 0.2628 - val_loss: 0.2619
Epoch 37/200
9881/9881 [==============================] - 9s 961us/step - loss: 0.2627 - val_loss: 0.2629
Epoch 38/200
9881/9881 [==============================] - 10s 966us/step - loss: 0.2632 - val_loss: 0.2628
Epoch 39/200
9881/9881 [==============================] - 10s 967us/step - loss: 0.2629 - val_loss: 0.2618
Epoch 40/200
9881/9881 [==============================] - 9s 960us/step - loss: 0.2633 - val_loss: 0.2627
Epoch 41/200
9881/9881 [==============================] - 10s 962us/step - loss: 0.2630 - val_loss: 0.2625
Epoch 42/200
9881/9881 [==============================] - 9s 956us/step - loss: 0.2636 - val_loss: 0.2621
Epoch 43/200
9881/9881 [==============================] - 10s 972us/step - loss: 0.2634 - val_loss: 0.2627
Epoch 44/200
9881/9881 [==============================] - 10s 965us/step - loss: 0.2631 - val_loss: 0.2628
Epoch 45/200
9881/9881 [==============================] - 10s 964us/step - loss: 0.2632 - val_loss: 0.2622
Epoch 46/200
9881/9881 [==============================] - 10s 962us/step - loss: 0.2631 - val_loss: 0.2638
Epoch 47/200
9881/9881 [==============================] - 9s 960us/step - loss: 0.2634 - val_loss: 0.2624
Epoch 48/200
9881/9881 [==============================] - 9s 956us/step - loss: 0.2626 - val_loss: 0.2623
Epoch 49/200
9881/9881 [==============================] - 10s 968us/step - loss: 0.2632 - val_loss: 0.2627
Epoch 50/200
9881/9881 [==============================] - 10s 962us/step - loss: 0.2628 - val_loss: 0.2640
Epoch 51/200
9881/9881 [==============================] - 10s 968us/step - loss: 0.2627 - val_loss: 0.2632
Epoch 52/200
9881/9881 [==============================] - 9s 961us/step - loss: 0.2628 - val_loss: 0.2624
Epoch 53/200
9881/9881 [==============================] - 10s 967us/step - loss: 0.2631 - val_loss: 0.2618
Epoch 54/200
9881/9881 [==============================] - 9s 954us/step - loss: 0.2628 - val_loss: 0.2621
Epoch 55/200
9881/9881 [==============================] - 10s 961us/step - loss: 0.2629 - val_loss: 0.2644
Epoch 56/200
9881/9881 [==============================] - 10s 966us/step - loss: 0.2636 - val_loss: 0.2630
Epoch 57/200
9881/9881 [==============================] - 10s 966us/step - loss: 0.2635 - val_loss: 0.2623
Epoch 58/200
9881/9881 [==============================] - 9s 957us/step - loss: 0.2630 - val_loss: 0.2624
Epoch 59/200
9881/9881 [==============================] - 10s 988us/step - loss: 0.2626 - val_loss: 0.2639
Epoch 60/200
9881/9881 [==============================] - 10s 974us/step - loss: 0.2628 - val_loss: 0.2636
Epoch 61/200
9881/9881 [==============================] - 10s 965us/step - loss: 0.2631 - val_loss: 0.2628
Epoch 62/200
9881/9881 [==============================] - 10s 963us/step - loss: 0.2629 - val_loss: 0.2636
Epoch 63/200
9881/9881 [==============================] - 10s 1ms/step - loss: 0.2635 - val_loss: 0.2626
Epoch 64/200
9881/9881 [==============================] - 10s 1ms/step - loss: 0.2630 - val_loss: 0.2624
Epoch 00064: early stopping

<tensorflow.python.keras.callbacks.History at 0x7fb361785ba8>

Saving this model for future use.

from tensorflow.keras.models import load_model
model.save('lending_club_model.h5')
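To reload it in a later session (a minimal sketch using the load_model import above):

# Returns the saved, compiled Keras model
loaded_model = load_model('lending_club_model.h5')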

Evaluating Model Performance

model_loss = pd.DataFrame(model.history.history)
model_loss.plot()
<AxesSubplot:>
predictions = model.predict_classes(X_test)
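# Note: predict_classes was removed in later TensorFlow releases; with a
# single sigmoid output, thresholding the predicted probabilities is the
# equivalent: predictions = (model.predict(X_test) > 0.5).astype('int32')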
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
[[ 6831  8827]
 [   90 63296]]
              precision    recall  f1-score   support

           0       0.99      0.44      0.61     15658
           1       0.88      1.00      0.93     63386

    accuracy                           0.89     79044
   macro avg       0.93      0.72      0.77     79044
weighted avg       0.90      0.89      0.87     79044
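The report reflects the class imbalance noted earlier: recall for charged-off loans (class 0) is only 0.44, while fully paid loans are caught almost perfectly. Techniques such as class weighting or resampling could be explored to improve minority-class recall.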

Predicting One Input

import random
random.seed(101)
# Note: random.randint is inclusive on both ends, so len(df) - 1 would be the
# safer upper bound here; with this seed the draw happens to land in range.
random_ind = random.randint(0,len(df))

new_customer = df.drop('loan_repaid',axis=1).iloc[random_ind]
new_customer
loan_amnt           25000.00
term                   36.00
int_rate                7.90
installment           782.26
annual_inc          62000.00
                      ...   
48052                   0.00
70466                   1.00
86630                   0.00
93700                   0.00
earliest_cr_year     1991.00
Name: 385487, Length: 85, dtype: float64
# The single input must be scaled with the same MinMaxScaler used in training
model.predict_classes(scaler.transform(new_customer.values.reshape(1,85)))
array([[1]], dtype=int32)

It appears this person is likely to repay their loan.

df.iloc[random_ind]['loan_repaid']
1.0

They did in fact repay their loan.
