Deal with imbalance in an NLP classification task by extracting numerical features


Competition Introduction

In the Elo Merchant Category Recommendation competition, Kagglers develop algorithms to identify and serve the most relevant opportunities to individuals by uncovering signal in customer loyalty. This will improve customers’ lives, help Elo reduce unwanted campaigns, and create the right experience for customers. The competition entrance is here.

Motivation

Using a LightGBM model, this blog dives into an initial exploration of the training and testing datasets. New features are built by extracting numerical features from “first_active_month” in order to recover some information from the time series.
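As a minimal sketch of that idea (assuming the standard train.csv layout from the competition; the reference date and exact transformations are illustrative, not necessarily the ones used in the final kernel):

```python
import pandas as pd

# Load the card-level training data; "first_active_month" is a date column.
train = pd.read_csv("train.csv", parse_dates=["first_active_month"])

today = pd.Timestamp.today()
reference_date = pd.Timestamp("2018-02-01")  # assumed fixed reference date

# Numerical features recovered from the date column
train["first_active_month_elapsed_time_today"] = (today - train["first_active_month"]).dt.days
train["first_active_month_elapsed_time_specific"] = (reference_date - train["first_active_month"]).dt.days
train["first_active_month_year"] = train["first_active_month"].dt.year

# One-hot month-name features, e.g. "first_active_month_monthsession_January"
month_dummies = pd.get_dummies(train["first_active_month"].dt.month_name(),
                               prefix="first_active_month_monthsession")
train = pd.concat([train, month_dummies], axis=1)
```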

Contents

  1. Questions for better business understanding
  2. Results evaluation: conclusion and discussion
  3. References

1. Questions for better business understanding

At the beginning, a good understanding of the data is always valuable. In this blog we only consider two datasets: the training dataset and the testing dataset. By asking questions, here is what I found:

Question 1.1: What scale and what kind of data are necessary for customer loyalty prediction model?

Fig1: scale and memory usage of datasets

A: This question is meant to provide a proper evaluation of how many resources the company needs to invest to deploy a customer loyalty prediction model. Because a MemoryError is raised when checking the transaction dataset locally, I chose to evaluate the memory usage of these data files using Google Colab.

The scale and memory usage of each dataset are shown in Fig1. As shown above, all the needed data can be divided into three parts: card feature data, transaction data and merchant data. Card feature data and merchant data are both relatively small compared with transaction data, which holds nearly 30 million records with a memory usage of more than 8 GB in total in this competition.
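A rough sketch of how such a check can be done on Google Colab (file names follow the competition’s data page; this is my own illustration of the measurement, not the exact notebook behind Fig1):

```python
import pandas as pd

# Report row count and deep memory usage for each competition file.
for name in ["train.csv", "test.csv", "merchants.csv",
             "historical_transactions.csv", "new_merchant_transactions.csv"]:
    df = pd.read_csv(name)
    mem_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
    print(f"{name}: {df.shape[0]:,} rows, {mem_mb:.1f} MB")
```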

Question 1.2: In which way would the application of a customer loyalty prediction model impact different business units?

A: Diving into this question reveals the necessary changes that would help the company work better.

If a customer loyalty prediction model is applied, the finance unit may need to expand or reconstruct the existing database to record more features as well as missing values; the sales unit needs to use the model’s conclusions for decisions and collect customer reviews; the technology unit is supposed to assure model stability and improve the model by constructing new features from finance data and reviews from sales. This process can be visualized as a business cycle; more details on model deployment are provided in this blog here.

Question 1.3: How to evaluate the influence of deploying a customer loyalty prediction model? Is the modeling target really convincing?

Fig2: distribution of “target”

A: This question aims at making the application of the customer loyalty prediction model more controllable and scalable.

Take a look at Fig 2: the column “target”, the modeling target, is stored as a numerical data type, which means loss functions like MSE can be used for evaluation. Checking its distribution reveals outliers around -30 in the modeling target, which means not all target data is convincing.
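A quick sketch of that check, assuming the train dataframe loaded earlier (the -30 cutoff simply follows the observation above):

```python
import matplotlib.pyplot as plt

# Histogram of the modeling target, as in Fig2.
train["target"].hist(bins=100)
plt.xlabel("target")
plt.ylabel("count")
plt.show()

# Count the suspicious values far below the main distribution.
n_outliers = (train["target"] < -30).sum()
print(f"outliers below -30: {n_outliers} of {len(train)}")
```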

2. Results evaluation: conclusion and discussion

As in the first section, some questions I found are reviewed and answered in this step. Generally, the more good questions are reviewed, the better the understanding of model evaluation you gain.

Fig3: feature importance

Fig4: part of feature importance extracted from “first_active_month”

Taking a further look at Fig3, “first_active_month_elapsed_time_today” contributes the most to the model prediction while “first_active_month_monthsession_January” contributes the least. Among the features extracted from the date data shown in Fig4, “first_active_month_elapsed_time_today” also gains the most importance and “first_active_month_monthsession_January” the least, which means valuable information is hidden in “first_active_month”. Isn’t that an interesting finding?
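A minimal sketch of training the LightGBM regressor and reading off the feature importance behind Fig3 and Fig4 (hyperparameters and the excluded columns here are placeholders, not the tuned values):

```python
import lightgbm as lgb
import pandas as pd

# Use all engineered columns except identifiers, the raw date, and the target.
features = [c for c in train.columns
            if c not in ["card_id", "first_active_month", "target"]]

model = lgb.LGBMRegressor(n_estimators=1000, learning_rate=0.01)
model.fit(train[features], train["target"])

# Rank features by split importance.
importance = pd.Series(model.feature_importances_, index=features).sort_values(ascending=False)
print(importance.head(10))   # most important, e.g. the elapsed-time features
print(importance.tail(10))   # least important, e.g. the month dummies
```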

When we compare the distribution of the same feature between the training dataset and the testing dataset, new findings emerge. “first_active_month_elapsed_time_today” and “first_active_month_elapsed_time_specific” show measurably different distributions between training data and testing data; the first one is also the feature picked out by the two questions above. On the other hand, the remaining features do not show a considerable difference in data distribution between the two datasets, which is relatively consistent with the result shown in the feature importance figure. For more details, please refer to my GitHub here.
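One way to sketch such a comparison, assuming a test dataframe built with the same engineered columns as train (the KS test is my own addition here, not something taken from the original kernel):

```python
import matplotlib.pyplot as plt
from scipy.stats import ks_2samp

feat = "first_active_month_elapsed_time_today"

# Overlay normalized histograms of the same feature in train and test.
plt.hist(train[feat], bins=50, alpha=0.5, density=True, label="train")
plt.hist(test[feat], bins=50, alpha=0.5, density=True, label="test")
plt.legend()
plt.title(feat)
plt.show()

# Two-sample Kolmogorov-Smirnov test as a numeric check of the shift.
stat, p_value = ks_2samp(train[feat].dropna(), test[feat].dropna())
print(f"KS statistic={stat:.3f}, p-value={p_value:.3g}")
```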

What’s next?

Up to now, we have briefly gone through the whole data processing procedure. Next, I’d like to add more features by considering the other files provided by this competition. It is also advisable for you to start a mining journey following the existing kernels listed in the references below. By the way, there are a few questions that I want to share with you which I think may be worth devoting time to:

  • How many features are suitable? Generally, obtaining more features does not guarantee the higher submission score you expect.
  • For this model, missing values would not cause LightGBM to fail. If using LightGBM, how should missing values be handled? Impute them or just keep them as they are? (See the sketch after this list.)
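To make the second question concrete, here is a small sketch of both options, assuming X_train and y_train are already prepared; it only illustrates the two choices, not a recommendation:

```python
import lightgbm as lgb
from sklearn.impute import SimpleImputer

# Option 1: keep NaN as-is; LightGBM learns a default split direction for missing values.
model_native = lgb.LGBMRegressor(use_missing=True)
model_native.fit(X_train, y_train)

# Option 2: impute first (here with the median) and train on the filled data.
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X_train)
model_imputed = lgb.LGBMRegressor()
model_imputed.fit(X_filled, y_train)
```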

Hope everyone has a magical data mining time! 🙂

3. References

  1. My first kernel (3.699) by Chau Ngoc Huynh
  2. A Closer Look at Date Variables by Robin Denz
  3. LGB + FE (LB 3.707) by Konrad Banachewicz