EDA checklist

  • Get domain knowledge
  • Check whether the data is intuitive, i.e. whether values are plausible (anomaly detection)
    • Add an is_incorrect feature that flags rows with implausible values
  • Understand how the data was generated
    • It is crucial to understand the generation process to set up a proper validation scheme
  • Two things to do with anonymized features
  • Visualization
    • Tools for individual features exploration
      • Histograms: plt.hist(x)
      • Plot (index vs value): plt.plot(x, '.')
      • Statistics: df.describe() or x.mean() or x.var()
      • Other tools: x.value_counts() or x.isnull()
    • Tools for feature relationships
      • Pairs:
        • plt.scatter(x1, x2)
        • pd.plotting.scatter_matrix(df)
        • df.corr(), displayed with plt.matshow(df.corr())
      • Groups:
        • Clustering
        • Plot (index vs feature statistics): df.mean().sort_values().plot()
  • Data cleaning
    • Remove duplicated and constant features
      • Constant features: traintest.nunique(axis=0) == 1 (axis=0 counts distinct values per column)
      • Duplicated numeric features: traintest.T.drop_duplicates()
      • Duplicated categorical features: for f in categorical_feats: traintest[f] = traintest[f].factorize()[0], then traintest.T.drop_duplicates()
    • Check whether duplicated rows have the same label
    • Check whether the dataset is shuffled
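
The data cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive recipe: the toy traintest frame, the labels series, the column names, and the rolling window size are all hypothetical and chosen only to make the checks visible.

```python
import pandas as pd

# Hypothetical toy data standing in for the concatenated train + test frame.
traintest = pd.DataFrame({
    "a": [1, 2, 3, 1],
    "a_copy": [1, 2, 3, 1],               # exact duplicate of "a"
    "const": [7, 7, 7, 7],                # constant feature
    "cat": ["x", "y", "z", "x"],
    "cat_renamed": ["p", "q", "r", "p"],  # same categories, different level names
})
labels = pd.Series([0, 1, 0, 1], name="target")  # hypothetical labels
categorical_feats = ["cat", "cat_renamed"]       # assumed known in advance

# 1. Constant features: a single distinct value per column.
#    axis=0 counts per column; dropna=False treats NaN as a value too.
n_unique = traintest.nunique(axis=0, dropna=False)
constant_cols = n_unique[n_unique == 1].index.tolist()
features = traintest.drop(columns=constant_cols)

# 2. Duplicated features: label-encode categoricals first, so columns that
#    differ only in level names compare equal, then compare columns.
encoded = features.copy()
for f in categorical_feats:
    encoded[f] = encoded[f].factorize()[0]
duplicated_cols = encoded.columns[encoded.T.duplicated()].tolist()
features = features.drop(columns=duplicated_cols)

# 3. Identical rows with different labels: group by all remaining features
#    and count distinct labels per group (groups with NaN keys are skipped).
data = features.assign(target=labels)
labels_per_row = data.groupby(list(features.columns))["target"].nunique()
conflicting_rows = labels_per_row[labels_per_row > 1]

# 4. Shuffling: a rolling mean of the target over the row index should look
#    flat if the rows are shuffled; a trend or steps suggest an ordering.
rolling_target = labels.rolling(window=2, min_periods=1).mean()

print("constant:", constant_cols)
print("duplicated:", duplicated_cols)
print("rows with conflicting labels:", len(conflicting_rows))
```

On real data, rolling_target.plot() with a much larger window makes any ordering easier to see. Transposing a wide frame for the duplicate check can be slow, so it may be worth running it on a sample of rows first.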