Project Summary

What is my project?

My project is about credit modeling, a common data science problem that involves modeling a borrower’s credit risk. All the data I work with comes from Lending Club, a marketplace that connects borrowers seeking moderately sized loans with investors willing to fund them. My main job is to go through Lending Club’s approved loan data from 2007 to 2011, clean it into a single dataset that keeps only the useful information, generate features from it, and then train a suitable model and make the final predictions.

What are my research targets and research methods?

My primary research target is to predict whether a loan will be paid back on time, using the data I obtained from Lending Club. Since most investors want to earn money from loans while avoiding losses, I focused on the mindset of a conservative investor, who only wants to invest in loans that have a good chance of being paid off on time. To reach this goal, I used the following methods. First, to make later modeling more efficient and less error-prone, I reduced the size of the original dataset by removing columns that contain redundant information, aren’t useful for modeling, require too much processing to be useful, or leak information from the future. For instance, in the loan status column, which directly describes whether a loan was paid off or defaulted, only the fully paid and charged off values should be kept, since only they describe the final outcome of a loan. Rows with other values, such as “Grace period” or “Late”, should be removed. After saving the filtered data as filtered_loans_2007.csv, I moved on to the next step: handling missing values and converting categorical columns to numerical ones. Using the isnull method of a pandas DataFrame, I could easily count the missing values in each column. Then I used the “rstrip” string method to strip the trailing percent sign (%) from values stored as text, so that they could be converted to numbers. Finally, because the “loan_status” column is imbalanced, with very different numbers of loans that were paid off and loans that were not paid off on time, plain accuracy would be misleading. I therefore evaluated the model by its recall (true positive rate) and fall-out (false positive rate), aiming for high recall and low fall-out.
After that, I used logistic regression, an algorithm that is easy to interpret and quick to train, to generate the predictions. However, those predictions were overly optimistic (overfit), because I made them on the same data that I had trained the model on. Therefore, I used k-fold cross-validation to get a more realistic estimate of the model’s performance.
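The cleaning steps described above might look roughly like the sketch below. It is only an illustration under stated assumptions: the small table is made up rather than taken from the real Lending Club file, and the column names int_rate and emp_length, as well as the exact spelling of the status labels, are assumed for the example.

```python
import pandas as pd

# Made-up sample rows standing in for the Lending Club data (illustrative only).
# The column names "int_rate" and "emp_length" are assumptions for this sketch.
loans = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Late (31-120 days)",
                    "Fully Paid", "In Grace Period"],
    "int_rate": ["10.65%", "15.27%", "13.49%", "7.90%", "12.00%"],
    "emp_length": ["10+ years", None, "1 year", "3 years", "< 1 year"],
})

# Keep only loans with a final outcome, then encode the target:
# 1 = fully paid, 0 = charged off.
final_outcomes = {"Fully Paid": 1, "Charged Off": 0}
loans = loans[loans["loan_status"].isin(list(final_outcomes))].copy()
loans["loan_status"] = loans["loan_status"].map(final_outcomes)

# Count the missing values in each remaining column.
null_counts = loans.isnull().sum()

# Strip the trailing percent sign, then convert the text to a float.
loans["int_rate"] = loans["int_rate"].str.rstrip("%").astype("float")

print(null_counts)
print(loans)
```

In this toy table, only the three rows with a final outcome survive the filter, and the missing-value count shows which columns still need attention before modeling.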
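A minimal sketch of this modeling step follows, assuming scikit-learn is the library used (the summary above does not name it). The features and target are randomly generated stand-ins, not Lending Club data, and the choice of 5 folds is also an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in data: 200 "loans" with 3 numeric features (not real data).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
# Target: 1 = paid off, 0 = not paid off, loosely tied to the first feature.
noise = rng.normal(scale=0.5, size=200)
target = (features[:, 0] + noise > -1.0).astype(int)

model = LogisticRegression()

# With k-fold cross-validation, every loan is predicted by a model that
# never saw that loan during training. The fold count (5) is an assumption.
folds = KFold(n_splits=5, shuffle=True, random_state=1)
predictions = cross_val_predict(model, features, target, cv=folds)

print(predictions[:10])
```

Because each prediction comes from a model trained on the other folds, error rates computed from these predictions are less optimistic than rates computed on the training data itself.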

What was the biggest challenge that I encountered? How did I overcome it?

Since this was my first time working with Python, or with any computer science tool, basic Python concepts such as term definitions and syntax rules confused me a lot. Furthermore, some of the terminology used by Lending Club, and the role certain functions played in the overall project, were hard for me to grasp. However, I am passionate about learning and doing research in this area, and I see such obstacles as something to overcome in order to improve. So I spent a great deal of my free time going through those formulas and principles and trying to understand them. I also asked my tutors for extra help with the more complex issues.

What was the most interesting part of my project?

In my opinion, the most interesting part of my project was when I started working with the true positive rate and false positive rate, for two reasons. First, by that point I had accumulated enough knowledge to understand their purpose and meaning more easily. Second, I really love solving math problems, including calculations of any kind. Since these rates rely on some fundamental math, I enjoyed both the process of doing the calculations and the final result.

What are my research result(s) and conclusion(s)?

My final model has a false positive rate of 7 percent and a true positive rate of 20 percent. This means that a conservative investor can make money with it as long as the interest rate is high enough to offset the losses from the 7 percent of borrowers who default, and the pool of 20 percent of borrowers is large enough to earn enough interest to offset those losses. Moreover, although the model excludes more loans than a typical strategy would, there is only about a 15 percent chance that a borrower defaults.
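For reference, the two rates can be computed from predictions and actual outcomes as in the sketch below. The helper name rates and the ten example loans are hypothetical and chosen only to keep the arithmetic easy to follow; they are not my project’s data and do not reproduce the 7 and 20 percent figures.

```python
def rates(predictions, actual):
    """Return (true positive rate, false positive rate).

    1 means "paid off" and 0 means "not paid off".
    Assumes both classes appear at least once in `actual`.
    """
    pairs = list(zip(predictions, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)  # good loans approved
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)  # good loans rejected
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)  # bad loans approved
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)  # bad loans rejected
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr


# Hypothetical example: 5 loans that were paid off, 5 that were not.
actual = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

tpr, fpr = rates(predictions, actual)
print(tpr, fpr)
```

The true positive rate is the share of loans that were actually paid off that the model approved, and the false positive rate is the share of loans that were not paid off that the model approved anyway.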

What have I learned in this project? Anything more I want to add?

I have learned how Python works and how useful it is, as well as the value of machine learning, which is a very useful tool and a field that matches my interests in both engineering and mathematics. I have also gained a deeper understanding of the strategies investors use on Lending Club, as well as some methods for calculating predictions.

Overall, I truly learned a lot through this project. At first I was only interested in pure mathematical problem solving, but this project made me recognize the significance and usefulness of computer science, especially Python.