Data cleaning and preparation is a vital, important first step when starting a machine learning project. Scientists spend much of their time cleaning data as opposed to messing around with algorithms and models.
In this article we’ll walk you through data cleaning in Python, examining data sets, selecting columns for features, exploring data visually, and encoding the features for machine learning. Utilizing a Jupyter Notebook in this tutorial is suggested.
For more about data cleaning, check out these courses:
Loans that are approved are listed on their website. When an investor decides to fund a loan, the borrower pays monthly installments to Lending Club, where Lending Club then redistributes these to investors. Investors don’t have to wait until the full amount is paid to start seeing returns- a win for everyone!
Some borrowers, however, default on their loan. We’ll address this problem as we clean some data from Lending Club for the purposes of this tutorial on cleaning for machine learning. Pretend we have been asked to build a model to predict whether borrowers are likely to pay off or default their loans.
Lending Club releases data for it’s approved and declined loans on their website. To make sure we are using the same data sets, we have mirrored the data for this tutorial on data.world.
On the Lending Club website you have the option to choose different year ranges to download data sets for all types of loans. There’s a data dictionary on the bottom of the Lending Club page as well, which has information on each different column names. This dictionary is useful for knowing what each column represents. It contains two sheets:
LoanStats sheet: describes the approved loans dataset
RejectStats sheet: describes the rejected loans dataset
We will use the LoanStats sheet for the purpose of this tutorial. The approved loans data has information on current, completed, and defaulted loans. We’ll use the approved loans data from 2007 to 2011, but similar steps would be needed to clean any of the data on their site.
First we import the libraries we’ll be using and set parameters for ease of reading. We’ll assume a solid grasp of the fundamentals of working with data in Python, so if you need a refresh on those skills you can browse our course listings.
import pandas as pd
import numpy as np
pd.set_option('max_columns', 120)
pd.set_option('max_colwidth', 5000)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams['figure.figsize'] = (12,8)
Now we’ll load the data into pandas. We’ve named our downloaded data set lending_club_loans.csv. We need to load it into pandas DataFrame in order to explore it. We’ll do some easy cleaning tasks to remove info we don’t need that will slow our process down.
We will:
Name the filtered data set loans_2007, and at the end of this section save it as loans_2007.csv to separate it from the raw data. This is a good practice to follow, as it keeps your original data in case you need to retrieve something later.
Perform these steps:
# skip row 1 so pandas can parse the data properly.
loans_2007 = pd.read_csv('data/lending_club_loans.csv', skiprows=1, low_memory=False)
half_count = len(loans_2007) / 2
loans_2007 = loans_2007.dropna(thresh=half_count,axis=1) # Drop any column with more than 50% missing values
loans_2007 = loans_2007.drop(['url','desc'],axis=1) # These columns are not useful for our purposes
Use the pandas head() method to display the first three rows of the loans_2007 DataFrame.
loans_2007.head(3)
idmember_idloan_amntfunded_amntfunded_amnt_invtermint_rateinstallmentgradesub_gradeemp_titleemp_lengthhome_ownershipannual_incverification_statusissue_dloan_statuspymnt_planpurposetitlezip_codeaddr_statedtidelinq_2yrsearliest_cr_linefico_range_lowfico_range_highinq_last_6mthsopen_accpub_recrevol_balrevol_utiltotal_accinitial_list_statusout_prncpout_prncp_invtotal_pymnttotal_pymnt_invtotal_rec_prncptotal_rec_inttotal_rec_late_feerecoveriescollection_recovery_feelast_pymnt_dlast_pymnt_amntlast_credit_pull_dlast_fico_range_highlast_fico_range_lowcollections_12_mths_ex_medpolicy_codeapplication_typeacc_now_delinqchargeoff_within_12_mthsdelinq_amntpub_rec_bankruptciestax_liens010775011296599.05000.05000.04975.036 months10.65%162.87BB2NaN10+ yearsRENT24000.0VerifiedDec-2011Fully Paidncredit_cardComputer860xxAZ27.650.0Jan-1985735.0739.01.03.00.013648.083.7%9.0f0.00.05863.1551875833.845000.00863.160.00.000.00Jan-2015171.62Sep-2016744.0740.00.01.0INDIVIDUAL0.00.00.00.00.0110774301314167.02500.02500.02500.060 months15.27%59.83CC4Ryder< 1 yearRENT30000.0Source VerifiedDec-2011Charged Offncarbike309xxGA1.000.0Apr-1999740.0744.05.03.00.01687.09.4%4.0f0.00.01008.7100001008.71456.46435.170.0117.081.11Apr-2013119.66Sep-2016499.00.00.01.0INDIVIDUAL0.00.00.00.00.0210771751313524.02400.02400.02400.036 months15.96%84.33CC5NaN10+ yearsRENT12252.0Not VerifiedDec-2011Fully Paidnsmall_businessreal estate business606xxIL8.720.0Nov-2001735.0739.02.02.00.02956.098.5%10.0f0.00.03005.6668443005.672400.00605.670.00.000.00Jun-2014649.91Sep-2016719.0715.00.01.0INDIVIDUAL0.00.00.00.00.0
Use pandas .shape attribute to view the number of samples and features we’re dealing with at this stage:
loans_2007.shape(42538, 56)
Our data is now all set up, so it’s time to explore and understand what feature each column represents. Having a bad understanding of these features could cause us to make mistakes down the road, which would impede on our data analysis and modeling process.
We’ll use the LendingClub data dictionary. Create a DataFrame to contain the names of the columns, data type, first row’s values, and description from the dictionary. We’ve pre-converted the data dictionary from Excel format to a CSV. Load it and take a look.
data_dictionary = pd.read_csv('LCDataDictionary.csv') # Loading in the data dictionary print(data_dictionary.shape[0])
print(data_dictionary.columns.tolist())117
['LoanStatNew', 'Description']data_dictionary.head()
data_dictionary = data_dictionary.rename(columns={'LoanStatNew': 'name', 'Description': 'description'})
LoanStatNewDescription0acc_now_delinqThe number of accounts on which the borrower is now delinquent.1acc_open_past_24mthsNumber of trades opened in past 24 months.2addr_stateThe state provided by the borrower in the loan application3all_utilBalance to credit limit on all trades4annual_incThe self-reported annual income provided by the borrower during registration.
Join the first row of loans_2007 to the data_dictionary DataFrame to give a preview DataFrame with the following columns:
loans_2007_dtypes = pd.DataFrame(loans_2007.dtypes,columns=['dtypes'])
loans_2007_dtypes = loans_2007_dtypes.reset_index()
loans_2007_dtypes['name'] = loans_2007_dtypes['index']
loans_2007_dtypes = loans_2007_dtypes[['name','dtypes']]
loans_2007_dtypes['first value'] = loans_2007.loc[0].values
preview = loans_2007_dtypes.merge(data_dictionary, on='name',how='left')preview.head()
namedtypesfirst valuedescription0idobject1077501A unique LC assigned ID for the loan listing.1member_idfloat641.2966e+06A unique LC assigned Id for the borrower member.2loan_amntfloat645000The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.3funded_amntfloat645000The total amount committed to that loan at that point in time.4funded_amnt_invfloat644975The total amount committed by investors for that loan at that point in time.
When we printed the shape of loans_2007, it had 56 columns, so we know that this preview DataFrame has 56 rows, one explaining each column in loans_2007.
It can be a lot to try to explore all the rows at once. We will break it into three parts and look at smaller parts each time. We want to pay attention to any column that:
We need to focus on data leakage, which can make the model overfit. The model would be learning from features that won’t be available when we make future loan predictions, so we need to make sure our model is trained using only available data from the point of a loan application.
Now we’ll display the first 19 rows and analyze them.
preview[:19]
namedtypesfirst valuedescription0idobject1077501A unique LC assigned ID for the loan listing.1member_idfloat641.2966e+06A unique LC assigned Id for the borrower member.2loan_amntfloat645000The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.3funded_amntfloat645000The total amount committed to that loan at that point in time.4funded_amnt_invfloat644975The total amount committed by investors for that loan at that point in time.5termobject36 monthsThe number of payments on the loan. Values are in months and can be either 36 or 60.6int_rateobject10.65%Interest Rate on the loan7installmentfloat64162.87The monthly payment owed by the borrower if the loan originates.8gradeobjectBLC assigned loan grade9sub_gradeobjectB2LC assigned loan subgrade10emp_titleobjectNaNThe job title supplied by the Borrower when applying for the loan.*11emp_lengthobject10+ yearsEmployment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.12home_ownershipobjectRENTThe home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER.13annual_incfloat6424000The self-reported annual income provided by the borrower during registration.14verification_statusobjectVerifiedIndicates if income was verified by LC, not verified, or if the income source was verified15issue_dobjectDec-2011The month which the loan was funded16loan_statusobjectFully PaidCurrent status of the loan17pymnt_planobjectnIndicates if a payment plan has been put in place for the loan18purposeobjectcredit_cardA category provided by the borrower for the loan request.
We can conclude from analyzing this that the following things must be removed:
Lending Club uses a borrower’s grade and payment term to assess interest rates. This causes variations in interest rate within a grade.
For our model it’s useful to focus on clusters of borrowers over individuals. That’s what grading does- it segments borrowers based on behaviors, which is why we’ll keep the grade column and drop interest in int_rate and sub_grade. Drop these from the DatFrame before moving to the next group of columns.
The next 19 columns:
preview[19:38]
namedtypesfirst valuedescription19titleobjectComputerThe loan title provided by the borrower20zip_codeobject860xxThe first 3 numbers of the zip code provided by the borrower in the loan application.21addr_stateobjectAZThe state provided by the borrower in the loan application22dtifloat6427.65A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.23delinq_2yrsfloat640The number of 30+ days past-due incidences of delinquency in the borrower’s credit file for the past 2 years24earliest_cr_lineobjectJan-1985The month the borrower’s earliest reported credit line was opened25fico_range_lowfloat64735The lower boundary range the borrower’s FICO at loan origination belongs to.26fico_range_highfloat64739The upper boundary range the borrower’s FICO at loan origination belongs to.27inq_last_6mthsfloat641The number of inquiries in past 6 months (excluding auto and mortgage inquiries)28open_accfloat643The number of open credit lines in the borrower’s credit file.29pub_recfloat640Number of derogatory public records30revol_balfloat6413648Total credit revolving balance31revol_utilobject83.7%Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.32total_accfloat649The total number of credit lines currently in the borrower’s credit file33initial_list_statusobjectfThe initial listing status of the loan. Possible values are – W, F34out_prncpfloat640Remaining outstanding principal for total amount funded35out_prncp_invfloat640Remaining outstanding principal for portion of total amount funded by investors36total_pymntfloat645863.16Payments received to date for total amount funded37total_pymnt_invfloat645833.84Payments received to date for portion of total amount funded by investors
Take note of the fico_range_low and fico_range_high columns. We’ll talk more about them after looking at the last column group. If you’re working with newer data, it may not include FICO score data.
For this second group, we can refine by dropping these columns:
Remove these 5 columns from the DataFrame:
drop_cols = [ 'zip_code','out_prncp','out_prncp_inv',
'total_pymnt','total_pymnt_inv']
loans_2007 = loans_2007.drop(drop_cols, axis=1)
Now onto the third group.
preview[38:]
namedtypesfirst valuedescription38total_rec_prncpfloat645000Principal received to date39total_rec_intfloat64863.16Interest received to date40total_rec_late_feefloat640Late fees received to date41recoveriesfloat640post charge off gross recovery42collection_recovery_feefloat640post charge off collection fee43last_pymnt_dobjectJan-2015Last month payment was received44last_pymnt_amntfloat64171.62Last total payment amount received45last_credit_pull_dobjectSep-2016The most recent month LC pulled credit for this loan46last_fico_range_highfloat64744The upper boundary range the borrower’s last FICO pulled belongs to.47last_fico_range_lowfloat64740The lower boundary range the borrower’s last FICO pulled belongs to.48collections_12_mths_ex_medfloat640Number of collections in 12 months excluding medical collections49policy_codefloat641publicly available policy_code=1\nnew products not publicly available policy_code=250application_typeobjectINDIVIDUALIndicates whether the loan is an individual application or a joint application with two co-borrowers51acc_now_delinqfloat640The number of accounts on which the borrower is now delinquent.52chargeoff_within_12_mthsfloat640Number of charge-offs within 12 months53delinq_amntfloat640The past-due amount owed for the accounts on which the borrower is now delinquent.54pub_rec_bankruptciesfloat640Number of public record bankruptcies55tax_liensfloat640Number of tax liens
In this last group of columns, we need to drop the following, all of which leak data from the future:
Drop our last group of columns:
drop_cols = ['total_rec_prncp','total_rec_int',
'total_rec_late_fee','recoveries',
'collection_recovery_fee', 'last_pymnt_d'
'last_pymnt_amnt']
loans_2007 = loans_2007.drop(drop_cols, axis=1)
Now we have a cleaner, much more useful data set for the building of our model. All irrelevant data has been cleansed out of the system, which will allow it to run smoother.
We will now discuss the fico_range_low, fico_range_high, last_fico_range_low, and last_fico_range_high columns so you can better understand them.
FICO scores are credit scores, a number used by banks and credit cards to represent how credit-worthy someone is. The FICO score is the most used credit score in the US. When a borrower applies for a loan, LendingClub gets the borrower’s credit score from FICO and they store those values as fico_range_low, fico_range_high. After that, any updates to the borrowers score are recorded as last_fico_range_low, and last_fico_range_high.
Understanding the data you’re using is so important. The original author of this article found a Stanford University project in 2014 while researching this data set. In the project, the group listed the current credit score among late fees and recovery fees as fields they added to the features by mistake. They claimed they later learned these columns leak future information.
Another group from Stanford used our same Lending Club dataset. Using the FICO columns, the group dropped only last_fico_range_low in their model. The second group’s findings reported that the last_fico_range_high was one of the more important features in finding accurate results.
So that leads us to the question- do the FICO credit scores leak information from the future? A column leaks information when the data it holds isn’t available when using the model to predict. In this specific case, that would be when we use the model on future loan applications to try to predict borrower behavior.
For more in-depth information on FICO scores for LendingClub loans, check out this article. The post notes that while noticing the trend of FICO scores is a good predictor of whether a loan will default, LendingClub continues to update FICO scores after a loan is funded. We can use the first FICO scores (fico_range_low and fico_range_high) but we cannot use last_fico_range_low and last_fico_range_high, because LendingClub may update these after the borrower submits their application.
These are the values in the two columns we can use:
print(loans_2007['fico_range_low'].unique())
print(loans_2007['fico_range_high'].unique())[ 735. 740. 690. 695. 730. 660. 675. 725. 710. 705. 720. 665. 670. 760. 685. 755. 680. 700. 790. 750. 715. 765. 745. 770. 780. 775. 795. 810. 800. 815. 785. 805. 825. 820. 630. 625. nan 650. 655. 645. 640. 635. 610. 620. 615.]
[ 739. 744. 694. 699. 734. 664. 679. 729. 714. 709. 724. 669. 674. 764. 689. 759. 684. 704. 794. 754. 719. 769. 749. 774. 784. 779. 799. 814. 804. 819. 789. 809. 829. 824. 634. 629. nan 654. 659. 649. 644. 639. 614. 624. 619.]
Let’s get rid of the missing values.
fico_columns = ['fico_range_high','fico_range_low']
print(loans_2007.shape[0])
loans_2007.dropna(subset=fico_columns,inplace=True)
print(loans_2007.shape[0])
loans_2007[fico_columns].plot.hist(alpha=0.5,bins=20);42538
42535
Now create a column for the average of fico_range_low and fico_range_high columns. Name it fico_average. This is not the average FICO score for each borrower, but rather an average of the high and low range that we know the borrower is in.
loans_2007['fico_average'] = (loans_2007['fico_range_high'] + loans_2007['fico_range_low']) / 2
Let’s check what we just did.
cols = ['fico_range_low','fico_range_high','fico_average']
loans_2007[cols].head()
fico_range_lowfico_range_highfico_average0735.0739.0737.01740.0744.0742.02735.0739.0737.03690.0694.0692.04695.0699.0697.0
Good! We got the mean calculations and everything right. Now, we can go ahead and drop fico_range_low, fico_range_high, last_fico_range_low, and last_fico_range_high columns.
drop_cols = ['fico_range_low','fico_range_high','last_fico_range_low', 'last_fico_range_high']
loans_2007 = loans_2007.drop(drop_cols, axis=1)
loans_2007.shape(42535, 33)
By analyzing this data and the columns, we have reduced the number of columns without losing any important data for our model. This is why data cleaning is so important.
We must decide on the right column to use as a target column. Our goal is to predict who will pay off their loan and who will default, and we need to find a column that reflects this. We learned from the description of columns in the preview DataFrame that loan_status is the only field in the main data set that describes a loan status, so we’ll use this column as the target column.
preview[preview.name == 'loan_status']
namedtypesfirst valuedescription16loan_statusobjectFully PaidCurrent status of the loan
This column contains text values that need to be converted to numerical values to be able to use for training a model. We’ll use the DataFrame method value_counts() to return the frequency of the unique values in the loan_status column.
loans_2007["loan_status"].value_counts()Fully Paid 33586
Charged Off 5653
Does not meet the credit policy. Status:Fully Paid 1988
Does not meet the credit policy. Status:Charged Off 761
Current 513
In Grace Period 16
Late (31-120 days) 12
Late (16-30 days) 5
Default 1
Name: loan_status, dtype: int64
The loan status has nine different possible values. We can read about most of the different loan statuses on the LendingClub website as well as posts on the Lend Academy and Orchard forums.
We’ll pull that data together in a table below so we can see the unique values, their frequency in the data set, and get a clearer idea of what each means:
meaning = [
"Loan has been fully paid off.",
"Loan for which there is no longer a reasonable expectation of further payments.",
"While the loan was paid off, the loan application today would no longer meet the credit policy and wouldn't be approved on to the marketplace.",
"While the loan was charged off, the loan application today would no longer meet the credit policy and wouldn't be approved on to the marketplace.",
"Loan is up to date on current payments.",
"The loan is past due but still in the grace period of 15 days.",
"Loan hasn't been paid in 31 to 120 days (late on the current payment).",
"Loan hasn't been paid in 16 to 30 days (late on the current payment).",
"Loan is defaulted on and no payment has been made for more than 121 days."]
status, count = loans_2007["loan_status"].value_counts().index, loans_2007["loan_status"].value_counts().values
loan_statuses_explanation = pd.DataFrame({'Loan Status': status,'Count': count,'Meaning': meaning})[['Loan Status','Count','Meaning']]
loan_statuses_explanation
Loan StatusCountMeaning0Fully Paid33586Loan has been fully paid off.1Charged Off5653Loan for which there is no longer a reasonable expectation of further payments.2Does not meet the credit policy. Status:Fully Paid1988While the loan was paid off, the loan application today would no longer meet the credit policy and wouldn’t be approved on to the marketplace.3Does not meet the credit policy. Status:Charged Off761While the loan was charged off, the loan application today would no longer meet the credit policy and wouldn’t be approved on to the marketplace.4Current513Loan is up to date on current payments.5In Grace Period16The loan is past due but still in the grace period of 15 days.6Late (31-120 days)12Loan hasn’t been paid in 31 to 120 days (late on the current payment).7Late (16-30 days)5Loan hasn’t been paid in 16 to 30 days (late on the current payment).8Default1Loan is defaulted on and no payment has been made for more than 121 days.
Our goal is to create a machine learning model that can learn from past loans and predict what loans will be paid off or not. In the above table, solely the Fully Paid and Charged Off values describe the final loan outcome. The other values describe still ongoing loans.
The Default status resembles the Charged Off status but to LendingClub, loans that are charged off have no chance of being repaid. Defaulted loans have a smaller chance. We should use only samples where the loan_status column is 'Fully Paid' or 'Charged Off'.
We want to be able to predict which of 'Fully Paid' or 'Charged Off' a loan will fall under so we can treat this as binary classification. Remove all the loans that don’t contain 'Fully Paid' or 'Charged Off' as the loan’s status and change the 'Fully Paid' values to 1 for positive cases and the 'Charged Off' values to 0 for negative cases.
Out of the ~42,000 rows we’ll thus remove just over 3,000.
There are multiple ways to transform all of the values in a column. We’ll use the DataFrame method replace().
loans_2007 = loans_2007[(loans_2007["loan_status"] == "Fully Paid") |
(loans_2007["loan_status"] == "Charged Off")]
mapping_dictionary = {"loan_status":{ "Fully Paid": 1, "Charged Off": 0}}
loans_2007 = loans_2007.replace(mapping_dictionary)
fig, axs = plt.subplots(1,2,figsize=(14,7))
sns.countplot(x='loan_status',data=filtered_loans,ax=axs[0])
axs[0].set_title("Frequency of each Loan Status")
filtered_loans.loan_status.value_counts().plot(x=None,y=None, kind='pie', ax=axs[1],autopct='%1.2f%%')
axs[1].set_title("Percentage of each Loan status")
plt.show()
These plots indicate that a significant number of borrowers in our data set paid off their loan. We are more interested in identifying these “defaulters,” as in this model we’re trying to maximize investment returns.
Not lending to people would aid in increasing returns, so we’ll clean our data while focusing on building a model that can identify likely defaulters at the application point.
Finally, let’s find columns that have only one unique value and remove them. These columns won’t be useful, as they don’t add any information. This will make the next steps easier as well.
The pandas Series method nunique() returns the number of unique values, excluding any null values. We can use this method across the data set to remove these columns in one step.
loans_2007 = loans_2007.loc[:,loans_2007.apply(pd.Series.nunique) != 1]
There may be some columns with more than one unique value, but one value that has insignificant frequency in the data set. Find and drop any columns with unique values that show up less than four times:
for col in loans_2007.columns:
if (len(loans_2007[col].unique()) < 4):
print(loans_2007[col].value_counts())
print()36 months 29096
60 months 10143
Name: term, dtype: int64
Not Verified 16845
Verified 12526
Source Verified 9868
Name: verification_status, dtype: int64
1 33586
0 5653
Name: loan_status, dtype: int64
n 39238
y 1
Name: pymnt_plan, dtype: int64
The payment plan column (pymnt_plan) has two unique values, 'y' and 'n', with 'y' occurring only once. Let’s drop this column:
print(loans_2007.shape[1])
loans_2007 = loans_2007.drop('pymnt_plan', axis=1)
print("We've been able to reduce the features to => {}".format(loans_2007.shape[1]))25
We've been able to reduce the features to => 24
Lastly, let’s use pandas to save our freshly-cleaned DataFrame as a CSV file:
loans_2007.to_csv("processed_data/filtered_loans_2007.csv",index=False)
Our data set is now much cleaner and easier to work with. This isn’t the final step though!
Now we’ll prepare the filtered_loans_2006.csv data for machine learning. We will handle missing values, convert categorical columns to numeric columns, and remove any other columns we don’t need.
Before feeding our data into a machine learning algorithm, we must handle missing values and categorical features. The mathematics underlying most machine learning models assumes that the data is numerical and contains no missing values. To reinforce this requirement, scikit-learn will return an error if you try to train a model using data that contain missing values or non-numeric values when working with models like linear regression and logistic regression.
Here’s what we’ll be doing in this step:
First we load in our data from last section’s final output:
filtered_loans = pd.read_csv('processed_data/filtered_loans_2007.csv')
print(filtered_loans.shape)
filtered_loans.head()(39239, 24)
loan_amntterminstallmentgradeemp_lengthhome_ownershipannual_incverification_statusloan_statuspurposetitleaddr_statedtidelinq_2yrsearliest_cr_lineinq_last_6mthsopen_accpub_recrevol_balrevol_utiltotal_acclast_credit_pull_dpub_rec_bankruptciesfico_average05000.036 months162.87B10+ yearsRENT24000.0Verified1credit_cardComputerAZ27.650.0Jan-19851.03.00.013648.083.7%9.0Sep-20160.0737.012500.060 months59.83C< 1 yearRENT30000.0Source Verified0carbikeGA1.000.0Apr-19995.03.00.01687.09.4%4.0Sep-20160.0742.022400.036 months84.33C10+ yearsRENT12252.0Not Verified1small_businessreal estate businessIL8.720.0Nov-20012.02.00.02956.098.5%10.0Sep-20160.0737.0310000.036 months339.31C10+ yearsRENT49200.0Source Verified1otherpersonelCA20.000.0Feb-19961.010.00.05598.021%37.0Apr-20160.0692.045000.036 months156.46A3 yearsRENT36000.0Source Verified1weddingMy wedding loan I promise to pay backAZ11.200.0Nov-20043.09.00.07963.028.3%12.0Jan-20160.0732.0
We will compute the number of missing values and determine how to handle them. We can return the number of missing values across the DataFrame by doing this:
null_counts = filtered_loans.isnull().sum()
print("Number of null values in each column:\n{}".format(null_counts))Number of null values in each column:
loan_amnt 0
term 0
installment 0
grade 0
emp_length 0
home_ownership 0
annual_inc 0
verification_status 0
loan_status 0
purpose 0
title 10
addr_state 0
dti 0
delinq_2yrs 0
earliest_cr_line 0
inq_last_6mths 0
open_acc 0
pub_rec 0
revol_bal 0
revol_util 50
total_acc 0
last_credit_pull_d 2
pub_rec_bankruptcies 697
fico_average 0
dtype: int64
While most of the columns have 0 missing values, title has 9 missing values, revol_util has 48, and pub_rec_bankruptcies contains 675 rows with missing values.
Remove columns entirely where more than 1% (392) of the rows for that column contain a null value. We’ll also remove the remaining rows containing null values. We’ll lose some data by doing this but we’ll be able to keep extra features to use for prediction (as we won’t have to drop those columns).
Keep the title and revol_util columns, removing rows containing missing values. Drop the pub_rec_bankruptcies column since more than 1% of the rows have a missing value for this column.
Here’s what to do:
In code, that looks like:
filtered_loans = filtered_loans.drop("pub_rec_bankruptcies",axis=1)
filtered_loans = filtered_loans.dropna()
There are, of course, many ways to deal with missing values. This is a vital step in data cleaning for machine learning. For our purposes in this tutorial we are set with this step, but if you want to learn more about handling missing values when cleaning data try our Data Cleaning Advanced course for Python, linked at the top of this article.
We want to end here with a data set ready for machine learning. This data must have no missing values and all values in columns must be numeric. We’ve already handled the missing values, so now we must find the number of columns that are of the object data type so we can make those values numeric.
print("Data types and their frequency\n{}".format(filtered_loans.dtypes.value_counts()))Data types and their frequency
float64 11
object 11
int64 1
dtype: int64
We have 11 object columns that contain text that must be converted into numeric features. Select solely the object columns using the DataFrame method select_dtype. Display a sample row to get a better sense of how the values in each column are formatted.
object_columns_df = filtered_loans.select_dtypes(include=['object'])
print(object_columns_df.iloc[0])term 36 months
grade B
emp_length 10+ years
home_ownership RENT
verification_status Verified
purpose credit_card
title Computer
addr_state AZ
earliest_cr_line Jan-1985
revol_util 83.7%
last_credit_pull_d Sep-2016
Name: 0, dtype: object
Note that the revol_util column contains numeric values, but is formatted as an object column. We learned from the description of columns in the preview DataFrame that revol_util is a “revolving line utilization rate or the amount of credit the borrower is using relative to all available credit” (read more here). We must format revol_util as a numeric value. Here’s what to do:
filtered_loans['revol_util'] = filtered_loans['revol_util'].str.rstrip('%').astype('float')
Moving on, these columns seem to represent categorical values:
To check, confirm by checking the number of unique values in each of them. Based on the first row’s values for purpose and title, it seems these two columns show the same information. We must explore their unique value counts to prove if this is true.
Note the first row’s values for both earliest_cr_line and last_credit_pull_d columns contain date values that would require a good amount of feature engineering for them to be potentially useful:
For some analyses, doing this feature engineering might be worth the time, but for the purposes of this tutorial we’ll remove these date columns from the DataFrame.
First, let’s explore the unique value counts of the six columns that seem to contain categorical values:
cols = ['home_ownership', 'grade','verification_status', 'emp_length', 'term', 'addr_state']
for name in cols:
print(name,':')
print(object_columns_df[name].value_counts(),'\n')home_ownership :
RENT 18677
MORTGAGE 17381
OWN 3020
OTHER 96
NONE 3
Name: home_ownership, dtype: int64
grade :
B 11873
A 10062
C 7970
D 5194
E 2760
F 1009
G 309
Name: grade, dtype: int64
verification_status :
Not Verified 16809
Verified 12515
Source Verified 9853
Name: verification_status, dtype: int64
emp_length :
10+ years 8715
< 1 year 4542
2 years 4344
3 years 4050
4 years 3385
5 years 3243
1 year 3207
6 years 2198
7 years 1738
8 years 1457
9 years 1245
n/a 1053
Name: emp_length, dtype: int64
term :
36 months 29041
60 months 10136
Name: term, dtype: int64
addr_state :
CA 7019
NY 3757
FL 2831
TX 2693
NJ 1825
IL 1513
PA 1493
VA 1388
GA 1381
MA 1322
OH 1197
MD 1039
AZ 863
WA 830
CO 777
NC 772
CT 738
MI 718
MO 677
MN 608
NV 488
SC 469
WI 447
OR 441
AL 441
LA 432
KY 319
OK 294
KS 264
UT 255
AR 241
DC 211
RI 197
NM 187
WV 174
HI 170
NH 169
DE 113
MT 84
WY 83
AK 79
SD 61
VT 53
MS 19
TN 17
IN 9
ID 6
IA 5
NE 5
ME 3
Name: addr_state, dtype: int64
Most of these columns contain discrete categorical values we can encode as dummy variables and keep. The addr_state column, in contrast, has too many unique values, so it’s better for our purposes to lose it.
Next let’s look at the unique value counts for the purpose and title columns to decide which columns we should keep.
for name in ['purpose','title']:
print("Unique Values in column: {}\n".format(name))
print(filtered_loans[name].value_counts(),'\n')Unique Values in column: purpose
debt_consolidation 18355
credit_card 5073
other 3921
home_improvement 2944
major_purchase 2178
small_business 1792
car 1534
wedding 940
medical 688
moving 580
vacation 377
house 372
educational 320
renewable_energy 103
Name: purpose, dtype: int64
Unique Values in column: title
Debt Consolidation 2142
Debt Consolidation Loan 1670
Personal Loan 650
Consolidation 501
debt consolidation 495
Credit Card Consolidation 354
Home Improvement 350
Debt consolidation 331
Small Business Loan 317
Credit Card Loan 310
Personal 306
Consolidation Loan 255
Home Improvement Loan 243
personal loan 231
personal 217
Loan 210
Wedding Loan 206
Car Loan 198
consolidation 197
Other Loan 187
Credit Card Payoff 153
Wedding 152
Major Purchase Loan 144
Credit Card Refinance 143
Consolidate 126
Medical 120
Credit Card 115
home improvement 109
My Loan 94
Credit Cards 92
...
toddandkim4ever 1
Remainder of down payment 1
Building a Financial Future 1
Higher interest payoff 1
Chase Home Improvement Loan 1
Sprinter Purchase 1
Refi credit card-great payment record 1
Karen's Freedom Loan 1
Business relocation and partner buyout 1
Update My New House 1
tito 1
florida vacation 1
Back to 0 1
Bye Bye credit card 1
britschool 1
Consolidation 16X60 1
Last Call 1
Want to be debt free in "3" 1
for excellent credit 1
loaney 1
jamal's loan 1
Refying Lending Club-I LOVE THIS PLACE! 1
Consoliation Loan 1
Personal/ Consolidation 1
Pauls Car 1
Road to freedom loan 1
Pay it off FINALLY! 1
MASH consolidation 1
Destination Wedding 1
Store Charge Card 1
Name: title, dtype: int64
The purpose and title columns contain overlapping information, but the purpose column contains fewer discrete values and is cleaner, so we’ll keep that column and drop the title column.
Drop the columns we’ve decided not to keep so far:
drop_cols = ['last_credit_pull_d','addr_state','title','earliest_cr_line']
filtered_loans = filtered_loans.drop(drop_cols,axis=1)
First we need to understand the two types of categorical features in our dataset and how we can convert them to numerical features.
We cannot do that with the column purpose. It wouldn’t make sense to say:
Now in our dataset we have these columns:
There are different ways to handle these two types. To map the ordinal values to integers, we can use the pandas DataFrame method replace() to map both grade and emp_length to appropriate numeric values:
mapping_dict = {
"emp_length": {
"10+ years": 10,
"9 years": 9,
"8 years": 8,
"7 years": 7,
"6 years": 6,
"5 years": 5,
"4 years": 4,
"3 years": 3,
"2 years": 2,
"1 year": 1,
"< 1 year": 0,
"n/a": 0
},
"grade":{
"A": 1,
"B": 2,
"C": 3,
"D": 4,
"E": 5,
"F": 6,
"G": 7
}
}
filtered_loans = filtered_loans.replace(mapping_dict)
filtered_loans[['emp_length','grade']].head()
Emp_lengthgrade010210321033103431
Now let’s move on to the Nominal Values. Converting nominal features into numerical features requires encoding them as dummy variables. This will look like:
Now encode the nominal columns that we have in our data set:
nominal_columns = ["home_ownership", "verification_status", "purpose", "term"]
dummy_df = pd.get_dummies(filtered_loans[nominal_columns])
filtered_loans = pd.concat([filtered_loans, dummy_df], axis=1)
filtered_loans = filtered_loans.drop(nominal_columns, axis=1)filtered_loans.head()
loan_amntinstallmentgradeemp_lengthannual_incloan_statusdtidelinq_2yrsinq_last_6mthsopen_accpub_recrevol_balrevol_utiltotal_accfico_averagehome_ownership_MORTGAGEhome_ownership_NONEhome_ownership_OTHERhome_ownership_OWNhome_ownership_RENTverification_status_Not Verifiedverification_status_Source Verifiedverification_status_Verifiedpurpose_carpurpose_credit_cardpurpose_debt_consolidationpurpose_educationalpurpose_home_improvementpurpose_housepurpose_major_purchasepurpose_medicalpurpose_movingpurpose_otherpurpose_renewable_energypurpose_small_businesspurpose_vacationpurpose_weddingterm_ 36 monthsterm_ 60 months05000.0162.8721024000.0127.650.01.03.00.013648.083.79.0737.000001001010000000000001012500.059.833030000.001.000.05.03.00.01687.09.44.0742.000001010100000000000000122400.084.3331012252.018.720.02.02.00.02956.098.510.0737.0000011000000000000010010310000.0339.3131049200.0120.000.01.010.00.05598.021.037.0692.000001010000000000100001045000.0156.461336000.0111.200.03.09.00.07963.028.312.0732.0000010100000000000000110
Finally, let’s look at our final output from this step to make sure all the features are the same length, contain no null value, and are numerical. We’ll use pandas’s info method to inspect the filtered_loans DataFrame:
filtered_loans.info()<class 'pandas.core.frame.dataframe'="">
Int64Index: 39177 entries, 0 to 39238
Data columns (total 39 columns):
loan_amnt 39177 non-null float64
installment 39177 non-null float64
grade 39177 non-null int64
emp_length 39177 non-null int64
annual_inc 39177 non-null float64
loan_status 39177 non-null int64
dti 39177 non-null float64
delinq_2yrs 39177 non-null float64
inq_last_6mths 39177 non-null float64
open_acc 39177 non-null float64
pub_rec 39177 non-null float64
revol_bal 39177 non-null float64
revol_util 39177 non-null float64
total_acc 39177 non-null float64
fico_average 39177 non-null float64
home_ownership_MORTGAGE 39177 non-null uint8
home_ownership_NONE 39177 non-null uint8
home_ownership_OTHER 39177 non-null uint8
home_ownership_OWN 39177 non-null uint8
home_ownership_RENT 39177 non-null uint8
verification_status_Not Verified 39177 non-null uint8
verification_status_Source Verified 39177 non-null uint8
verification_status_Verified 39177 non-null uint8
purpose_car 39177 non-null uint8
purpose_credit_card 39177 non-null uint8
purpose_debt_consolidation 39177 non-null uint8
purpose_educational 39177 non-null uint8
purpose_home_improvement 39177 non-null uint8
purpose_house 39177 non-null uint8
purpose_major_purchase 39177 non-null uint8
purpose_medical 39177 non-null uint8
purpose_moving 39177 non-null uint8
purpose_other 39177 non-null uint8
purpose_renewable_energy 39177 non-null uint8
purpose_small_business 39177 non-null uint8
purpose_vacation 39177 non-null uint8
purpose_wedding 39177 non-null uint8
term_ 36 months 39177 non-null uint8
term_ 60 months 39177 non-null uint8
dtypes: float64(12), int64(3), uint8(24)
memory usage: 5.7 MB
</class>
We’re all set! Congrats, you’ve just cleaned a huge data set for machine learning and learned some data skills in the process. There’s still one more step we need to be done, though.
Save, save, save! It’s always smart to store the final output of each part of your workflow in a separate csv file. This allows you to make changes in the processing flow without having to recalculate everything we’ve already done.
As we did before, we can store our DataFrame as a CSV using the handy pandas to_csv() function.
filtered_loans.to_csv("processed_data/cleaned_loans_2007.csv",index=False)
Here we have covered the steps you need to work through a large data set and clean and prep the data for a machine learning project. But there’s always more to learn, and many places to learn from.
If you are happy with your data cleaning skills and you want to work more with this data set, check out our interactive machine learning walkthrough course that covers the next steps in working with this Lending Club data.
For more information and learning opportunities about data cleaning, check out these courses below. Happy learning!