EDA checklist
admin
November 12, 2020
- Get domain knowledge
- Check whether the data is intuitive (anomaly detection)
- Add an is_incorrect feature that flags rows with implausible values
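As a minimal sketch of the is_incorrect idea, the snippet below flags rows whose value falls outside a plausible range. The `age` column, the toy values, and the 0 to 120 bounds are assumptions made up for illustration; in practice the bounds come from domain knowledge.

```python
import pandas as pd

# Hypothetical toy data: an age of 336 or -1 is almost certainly an entry error.
df = pd.DataFrame({"age": [25, 41, 336, 30, -1]})

# Assumed plausible range for this illustration only.
AGE_MIN, AGE_MAX = 0, 120

# Flag suspicious rows instead of silently dropping them,
# so a model can still use the fact that the value looked wrong.
df["is_incorrect"] = ~df["age"].between(AGE_MIN, AGE_MAX)

print(df)
```

Keeping the flag as a feature, rather than deleting the rows, leaves the decision about how to treat them to the model and the validation results.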
- Understand how the data was generated
- It is crucial to understand the generation process to set up a proper validation scheme
- Two things to do with anonymized features
- Try to decode the features
- Guess the feature types
- Each type needs its own preprocessing
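One possible starting point for guessing the types of anonymized features is a per-column summary of dtype, number of unique values, and number of missing values. The column names `x1`, `x2`, `x3` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical anonymized features.
df = pd.DataFrame({
    "x1": [0.12, 3.4, 5.6, 7.8],   # looks continuous
    "x2": [1, 2, 1, 2],            # few distinct integers: maybe categorical or binary
    "x3": ["a", "b", "a", "c"],    # strings: categorical or text
})

# One row per feature: storage type, cardinality, and missing count.
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_unique": df.nunique(),
    "n_missing": df.isnull().sum(),
})

print(summary)
```

A low `n_unique` on a numeric column is only a hint that the feature may be categorical; it is a guess to confirm with plots such as the ones listed under Visualization.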
- Visualization
- Tools for exploring individual features
- Histograms
plt.hist(x)
- Plot (index versus value)
plt.plot(x, '.')
- Statistics
df.describe() or x.mean() or x.var()
- Other tools
x.value_counts() or x.isnull()
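The single-feature tools above can be combined roughly as follows. This is a sketch on synthetic data: the random series and the injected missing values are assumptions, not part of any real dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic feature with a few missing values injected for illustration.
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=1000))
x.iloc[::50] = np.nan

fig, (ax_hist, ax_index) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: shows the distribution; vary the bin count to avoid being misled.
ax_hist.hist(x.dropna(), bins=50)
ax_hist.set_title("Histogram")

# Index versus value: horizontal bands suggest repeated values,
# vertical patterns suggest the rows are not shuffled.
ax_index.plot(x.index, x.values, ".")
ax_index.set_title("Index vs. value")

# Summary statistics and missing-value count.
print(x.describe())
print("missing:", x.isnull().sum())

# In an interactive session, call plt.show() to display the figure.
```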
- Tools for feature relationships
- Pairs
plt.scatter(x1, x2)
pd.plotting.scatter_matrix(df)
plt.matshow(df.corr())
- Groups:
- Clustering
- Plot (index vs feature statistics)
df.mean().sort_values().plot()
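As a rough sketch of the relationship tools above, the snippet below computes a correlation matrix, picks out the most strongly correlated pair, and sorts features by their mean. The columns `a`, `b`, `c` are synthetic and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Synthetic features: "b" is built from "a", "c" is independent with a shifted mean.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(loc=5.0, size=500),
})

# Pairwise correlations; plt.matshow(corr) would show this matrix as an image.
corr = df.corr()

# Mask the diagonal (each feature with itself) and find the strongest pair.
off_diagonal = corr.where(~np.eye(len(corr), dtype=bool))
pairs = off_diagonal.abs().unstack().dropna()
top_pair = pairs.idxmax()
print("most correlated pair:", top_pair)

# Feature statistics sorted by value; calling .plot() on this Series
# gives the "index vs feature statistics" plot, where flat steps hint at groups.
means = df.mean().sort_values()
print(means)
```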
- Data cleaning
- Remove duplicated and constant features
traintest.nunique(axis=0) == 1
traintest.T.drop_duplicates()
for f in categorical_feats: traintest[f] = traintest[f].factorize()[0]
then traintest.T.drop_duplicates()
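Put together, the cleaning steps above might look like the following sketch. The `traintest` frame and the `categorical_feats` list are toy stand-ins. Note that `nunique(axis=0)` counts distinct values per column, which is what a constant-feature check needs, and that `factorize()` returns a `(codes, uniques)` pair, so the codes are taken with `[0]`.

```python
import pandas as pd

# Toy stand-in for the concatenated train and test features.
traintest = pd.DataFrame({
    "f1": [1, 2, 3, 4],
    "f2": [7, 7, 7, 7],          # constant feature
    "f3": [1, 2, 3, 4],          # exact duplicate of f1
    "c1": ["a", "b", "a", "c"],
    "c2": ["x", "y", "x", "z"],  # same as c1 up to relabeling
})
categorical_feats = ["c1", "c2"]

# 1. Drop constant features (one distinct value in the column).
constant_cols = traintest.columns[traintest.nunique(axis=0) == 1]
traintest = traintest.drop(columns=constant_cols)

# 2. Label-encode categoricals in order of first appearance, so columns
#    that differ only in level names end up with identical codes.
for f in categorical_feats:
    traintest[f] = traintest[f].factorize()[0]

# 3. Drop duplicated columns by de-duplicating the rows of the transpose.
traintest = traintest.T.drop_duplicates().T

print(traintest.columns.tolist())
```

Transposing can be slow on wide or very large frames, so on real data this step may need to run on a sample or column by column.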
- Check whether identical rows have the same label
- Check whether the dataset is shuffled
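Both checks can be sketched as below. The `train` frame, the `target` column, and the window size are hypothetical, and the rolling mean is just one possible way to look for order effects, not the only one.

```python
import pandas as pd

# Hypothetical training data: the first two rows share features but not labels.
train = pd.DataFrame({
    "f1": [1, 1, 2, 2, 3],
    "f2": [0, 0, 5, 5, 9],
    "target": [0, 1, 1, 1, 0],
})
feature_cols = ["f1", "f2"]

# Identical feature rows with more than one distinct label.
labels_per_row = train.groupby(feature_cols)["target"].nunique()
conflicting = labels_per_row[labels_per_row > 1]
print(conflicting)

# Shuffle check on a deliberately ordered target: in shuffled data the
# rolling mean should hover around the global mean; here it sweeps from 0 to 1.
ordered = pd.Series([0] * 50 + [1] * 50, name="target")
rolling = ordered.rolling(window=10).mean()
spread = rolling.max() - rolling.min()
print("global mean:", ordered.mean(), "rolling spread:", spread)
```

A large spread relative to the global mean suggests the row order carries information, which matters when choosing a validation split.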