EDA checklist

Get domain knowledge
Check if the data is intuitive (anomaly detection)
- Add an is_incorrect feature that flags rows with implausible values
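A minimal sketch of the is_incorrect flag, assuming a toy table. The column names (age, clicks, impressions) and the validity rules are hypothetical and only stand in for whatever domain knowledge says is impossible:

```python
import pandas as pd

# Hypothetical toy data: column names and valid ranges are assumptions.
df = pd.DataFrame({
    "age": [25, 41, 336, 18, -3],
    "clicks": [3, 10, 2, 7, 1],
    "impressions": [10, 8, 5, 7, 4],
})

# Rule 1 (assumed): an age outside 0..120 is implausible.
bad_age = ~df["age"].between(0, 120)
# Rule 2 (assumed): clicks cannot exceed impressions.
bad_clicks = df["clicks"] > df["impressions"]

# Keep the rows, but mark them so a model can use the flag as a feature.
df["is_incorrect"] = (bad_age | bad_clicks).astype(int)
print(df)
```

Flagging instead of dropping keeps the rows available in case the errors themselves carry signal.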
Understand how the data was generated
- It is crucial to understand the generation process to set up a proper validation scheme
Two things to do with anonymized features
- Try to decode the features
  - Guess the true meaning of the feature
- Guess the feature types
  - Each type needs its own preprocessing
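One rough way to guess feature types is to look at the dtype and the number of distinct values. The helper below is a heuristic sketch, not a rule: guess_feature_type and its max_categories threshold are hypothetical, and the labels it returns should be checked by hand.

```python
import pandas as pd

def guess_feature_type(s, max_categories=10):
    """Heuristic type guess; max_categories is an arbitrary threshold."""
    s = s.dropna()
    n_unique = s.nunique()
    if n_unique <= 2:
        return "binary"
    if not pd.api.types.is_numeric_dtype(s):
        return "categorical"
    if n_unique <= max_categories:
        # Few distinct numbers: could be an encoded category or an ordinal scale.
        return "categorical or ordinal"
    return "numeric"

# Hypothetical anonymized features.
df = pd.DataFrame({
    "x1": [0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0],
    "x2": list("abcabcabcabc"),
    "x3": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    "x4": [0.5 * i for i in range(12)],
})

guesses = {col: guess_feature_type(df[col]) for col in df.columns}
print(guesses)
```

Looking at df.dtypes and x.value_counts() alongside these guesses usually settles the ambiguous cases.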
Visualization
- Tools for exploring individual features
  - Histograms plt.hist(x)
  - Plot (index versus value) plt.plot(x, '.')
  - Statistics df.describe() or x.mean() or x.var()
  - Other tools x.value_counts() or x.isnull()
- Tools for feature relationships
  - Pairs
    - plt.scatter(x1, x2)
    - pd.plotting.scatter_matrix(df)
    - df.corr() or plt.matshow(df.corr())
  - Groups:
    - Clustering
    - Plot (index vs feature statistics) df.mean().sort_values().plot()
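The tools in the Visualization list can be tried together on a small synthetic frame. This is a sketch under assumptions: the columns x1, x2, x3 are made up, and plot_overview is a hypothetical helper that imports matplotlib only when it is called.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(size=200),
    "x2": rng.normal(size=200),
})
# x3 is built from x1, so the pair should show up as strongly correlated.
df["x3"] = 2 * df["x1"] + rng.normal(scale=0.1, size=200)

# Individual features
summary = df.describe()
n_missing = df.isnull().sum()
# The bin counts that plt.hist would draw for the same data and bins.
hist_counts, bin_edges = np.histogram(df["x1"], bins=20)

# Feature relationships
corr = df.corr()
sorted_means = df.mean().sort_values()

def plot_overview(df):
    """Draw the plots from the checklist (requires matplotlib)."""
    import matplotlib.pyplot as plt
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))
    axes[0, 0].hist(df["x1"], bins=20)                     # histogram
    axes[0, 1].plot(df["x1"].to_numpy(), ".")              # index versus value
    axes[1, 0].scatter(df["x1"], df["x3"])                 # pair of features
    df.mean().sort_values().plot(style=".", ax=axes[1, 1]) # sorted feature means
    pd.plotting.scatter_matrix(df)                         # all pairs, own figure
    plt.matshow(df.corr())                                 # correlation matrix
    plt.show()
```

Calling plot_overview(df) in a notebook shows the figures; the numeric objects (summary, corr, sorted_means) can be inspected without any plotting backend.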
Data cleaning
- Remove duplicated and constant features
  - Constant features: traintest.nunique(axis=0) == 1
  - Duplicated features: traintest.T.drop_duplicates()
  - Duplicated categorical features: for f in categorical_feats: traintest[f] = traintest[f].factorize()[0], then traintest.T.drop_duplicates()
- Check whether identical rows have the same label
- Check whether the dataset is shuffled
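The first two cleaning checks can be sketched as follows. The toy frames and column names (f1, c1, target, and so on) are hypothetical; factorize is used so that two categorical columns with different labels but the same grouping end up as identical columns.

```python
import pandas as pd

# Hypothetical combined train+test features.
traintest = pd.DataFrame({
    "f1": [1, 2, 3, 4],
    "f2": [7, 7, 7, 7],          # constant
    "f3": [1, 2, 3, 4],          # exact duplicate of f1
    "c1": ["a", "b", "a", "c"],
    "c2": ["x", "y", "x", "z"],  # same grouping as c1, different labels
})
categorical_feats = ["c1", "c2"]

# Constant features: exactly one distinct value in the column.
n_unique = traintest.nunique(axis=0, dropna=False)
constant = n_unique[n_unique == 1].index.tolist()
traintest = traintest.drop(columns=constant)

# Encode categories by order of first appearance, so relabeled copies match.
for f in categorical_feats:
    traintest[f] = traintest[f].factorize()[0]

# Duplicated features: transpose, drop duplicate rows, transpose back.
traintest = traintest.T.drop_duplicates().T

# Identical rows with different labels (hypothetical train set).
train = pd.DataFrame({
    "f1": [1, 1, 2, 2],
    "f2": [0, 0, 5, 5],
    "target": [1, 0, 1, 1],
})
labels_per_row = train.groupby(["f1", "f2"])["target"].nunique()
conflicting = labels_per_row[labels_per_row > 1]
print(constant, list(traintest.columns), conflicting, sep="\n")
```

Transposing is slow on wide or very long frames, so on large data it may be worth hashing columns instead. For the shuffle check, one common approach is to plot the target (or a rolling mean of it) against the row index and look for trends.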
