Define your goals
What can you get out of your participation?
- To learn more about an interesting problem
- To get acquainted with new software tools
- To hunt for a medal
Working with ideas
- Organize ideas in some structure
- Select the most important and promising ideas
- Try to understand the reasons why something does/doesn’t work
Initial pipeline
- Get familiar with the problem domain
- Start with a simple (or even primitive) solution
- Debug the full pipeline
  - From reading the data to writing the submission file
- "From simple to complex"
  - I prefer to start with Random Forest rather than Gradient Boosted Decision Trees (a minimal baseline sketch follows this list)
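A sketch of such an end-to-end baseline, assuming hypothetical `train.csv`/`test.csv` files with `id` and `target` columns (adjust the names to the actual competition data):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical file and column names; adjust to the competition data.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
features = [c for c in train.columns if c not in ("id", "target")]

# Primitive but complete: default parameters, no feature engineering.
model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
model.fit(train[features], train["target"])

# Writing the submission file exercises the full path end to end.
pd.DataFrame({
    "id": test["id"],
    "target": model.predict_proba(test[features])[:, 1],
}).to_csv("submission.csv", index=False)
```

Once this runs and produces a valid submission, every later improvement can be checked against a known-working end-to-end path.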
Data loading
- Do basic preprocessing and convert csv/txt files into hdf5/npy for much faster loading
- Do not forget that by default data is stored in 64-bit arrays; most of the time you can safely downcast it to 32 bits
- Large datasets can be processed in chunks (see the sketch below)
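A sketch covering all three tips, assuming an all-float feature table; `process` is a hypothetical per-chunk function:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("train.csv")  # slow text parsing, done only once

# pandas defaults to 64-bit dtypes; float32 usually suffices and
# halves memory and disk usage.
for col in df.select_dtypes(include="float64").columns:
    df[col] = df[col].astype(np.float32)

np.save("train.npy", df.to_numpy())  # binary format, near-instant reload
# data = np.load("train.npy")

# Datasets that do not fit in memory can be streamed chunk by chunk.
for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
    process(chunk)  # hypothetical per-chunk function
```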
Performance evaluation
- Extensive validation is not always needed
- Start with the fastest models, such as LightGBM (see the sketch below)
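For instance, a single holdout split with a default LightGBM model gives a rough score in seconds; `X` and `y` are assumed to be the already-loaded features and target:

```python
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# One holdout split is often enough for early, rough model comparisons.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

model = lgb.LGBMClassifier(n_estimators=200)
model.fit(X_tr, y_tr)
print("holdout AUC:", roc_auc_score(y_va, model.predict_proba(X_va)[:, 1]))
```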
Everything is a hyperparameter
Sort all parameters by these principles:
- Importance
- Feasibility
- Understanding
Note: changing one parameter can affect the whole pipeline
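One way to act on this ordering is a greedy search that tunes parameters one at a time, most important first. A sketch with a hypothetical priority list (the parameter names and grids are illustrative, not prescriptive); `X` and `y` are assumed loaded:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical priority list: most impactful parameters first.
search_order = [
    ("n_estimators",     [100, 300, 1000]),
    ("max_depth",        [None, 8, 16]),
    ("min_samples_leaf", [1, 5, 20]),
]

best = {}
for name, values in search_order:  # tune one parameter at a time, greedily
    scores = {}
    for v in values:
        model = RandomForestClassifier(**best, **{name: v}, random_state=0)
        scores[v] = cross_val_score(model, X, y, cv=3).mean()
    best[name] = max(scores, key=scores.get)  # keep the winner, move on
print("selected parameters:", best)
```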
Tricks
- Fast and dirty is always better
- Don’t pay too much attention to code quality
- Keep things simple: save only important things
- If you feel uncomfortable with the given computational resources, rent a larger server
- Use good variable names
- If your code is hard to read, you will definitely have problems sooner or later
- Keep your research reproducible (a seed-fixing sketch appears at the end of this list)
  - Fix the random seed
  - Write down exactly how each feature was generated
  - Use a version control system (VCS, for example git)
- Reuse code
  - It is especially important to use the same code for the train and test stages
- Read papers
  - For example, how to optimize AUC
- Read forums and examine kernels first
- Code organization
  - keeping it clean
  - macros
  - test/val
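The seed-fixing sketch referenced above: a minimal helper that pins the common sources of randomness in a Python pipeline (extend it for any deep learning framework in use):

```python
import os
import random
import numpy as np

def fix_seed(seed=42):
    """Fix the common sources of randomness used in the pipeline."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # If a DL framework is also used, seed it too, e.g.:
    # torch.manual_seed(seed)

fix_seed(42)
```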
Pipeline detail
| Procedure | Days |
|---|---|
| Understand the problem | 1–2 days |
| Exploratory data analysis | 1–2 days |
| Define CV strategy | 1 day |
| Feature engineering | until the last 3–4 days |
| Modeling | until the last 3–4 days |
| Ensembling | last 3–4 days |
Understand the problem broadly
- Type of problem
- How big is the dataset?
- What is the metric?
- Is there relevant previous code?
- Hardware needed (CPU, GPU, ...)
- Software needed (TensorFlow, scikit-learn, XGBoost, LightGBM)
EDA
See the separate blog post, Exploratory data analysis.
Define CV strategy
- This step is critical
- Is time important? Use time-based validation
- Different entities than in the train set? Use StratifiedKFold validation
- Is it completely random? Use random validation
- Or a combination of all the above
- Use the leaderboard to test (see the splitter sketch below)
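A sketch of picking a splitter that matches the scenario; `X` and `y` are assumed loaded:

```python
from sklearn.model_selection import KFold, StratifiedKFold, TimeSeriesSplit

# Choose the splitter that mirrors how the test set was carved out.
splitter = TimeSeriesSplit(n_splits=5)  # time matters: time-based validation
# splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# splitter = KFold(n_splits=5, shuffle=True, random_state=0)  # fully random

for tr_idx, va_idx in splitter.split(X, y):
    pass  # train on X[tr_idx], y[tr_idx]; validate on X[va_idx], y[va_idx]
```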