
Transforming Data:
1.Split column
2.Lambda function & re.findall
3.Replace
1.Split column
Here we use the Titanic dataset as an example. Suppose we want to extract the title (Mr, Miss, …) from the Name column to create a new feature.

df['Name_split']=df['Name'].str.split(',')
df['Name_split']=df['Name_split'].str.get(1)
df['Title']=df['Name_split'].str.split('.').str.get(0).str.strip()
or, in a single line:
df['Title']=df['Name'].apply(lambda x:x.split(',')[1].split('.')[0].strip())
The first version works in three steps:
1.Split the Name column by ‘,’ and assign the result to ‘Name_split’.
2.Get the second part of ‘Name_split’, which is everything after the surname.
3.Split ‘Name_split’ by ‘.’, get the first part, and strip the leading space. What remains is the title.
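The three steps can be tried on a handful of rows without loading the full dataset. This is a minimal sketch: the three names are sample rows in the Titanic "Surname, Title. Given names" format, and the intermediate variable `after_comma` is an illustrative name, not part of the original code.

```python
import pandas as pd

# Three sample rows in the "Surname, Title. Given names" format.
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    ]
})

# Steps 1 and 2: split on the comma and keep the part after the surname.
after_comma = df["Name"].str.split(",").str.get(1)

# Step 3: split on the period, keep the first part, trim the leading space.
df["Title"] = after_comma.str.split(".").str.get(0).str.strip()

print(df["Title"].tolist())  # ['Mr', 'Miss', 'Mrs']
```

Without the final `.str.strip()`, each title keeps the space that followed the comma (' Mr' instead of 'Mr'), which would make later comparisons against 'Mr' fail.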

2.lambda function & re.findall
import re
df['Title']=df['Name'].apply(lambda x: re.findall(r' ([A-Za-z]+)\.',x)[0])
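Two details decide what a title pattern captures: the leading space and whether the period is escaped. The sketch below compares an unescaped and an escaped pattern on one sample name, then shows pandas' `str.extract` as a possible vectorized alternative to `apply` with a lambda. The sample names and the variable names are illustrative assumptions.

```python
import re
import pandas as pd

name = "Braund, Mr. Owen Harris"

# Unescaped '.' matches any character, so the first hit is the surname.
loose = re.findall(r"([A-Za-z]+).", name)[0]

# Leading space plus an escaped '\.' anchors the match to the title.
strict = re.findall(r" ([A-Za-z]+)\.", name)[0]

print(loose, strict)  # Braund Mr

# The same pattern applied to a whole column with str.extract.
df = pd.DataFrame({"Name": [name, "Heikkinen, Miss. Laina"]})
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
print(df["Title"].tolist())  # ['Mr', 'Miss']
```

`str.extract` returns NaN for rows where nothing matches, whereas indexing `[0]` on an empty `findall` result raises an IndexError.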
Regular expression reference:
dollar sign: \$
arbitrary number of digits: \d*
decimal point: \.
x number of digits: \d{x}
any capital letter: [A-Z]
arbitrary number of alphanumeric characters: \w*
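To see how these pieces combine, here is a small sketch that builds two patterns from them. The price string and the sample phrase are made-up inputs for illustration only.

```python
import re

# \$ literal dollar sign, \d* any number of digits,
# \. literal decimal point, \d{2} exactly two digits.
price = re.compile(r"\$\d*\.\d{2}")
print(bool(price.fullmatch("$17.89")))  # True
print(bool(price.fullmatch("$17.8")))   # False (only one digit after the point)

# [A-Z] one capital letter, \w* any number of word characters after it.
capitalized = re.compile(r"[A-Z]\w*")
print(capitalized.findall("Mr Owen harris"))  # ['Mr', 'Owen']
```

Note that `\w` matches letters, digits, and the underscore, so it is slightly broader than "alphanumeric".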
3.Replace
Once we have the ‘Title’ column, it is a good idea to reduce its number of categories, because many titles occur only once or twice.
df['Title'].value_counts()
Mr 517
Miss 182
Mrs 125
Master 40
Dr 7
Rev 6
Major 2
Mlle 2
Col 2
Capt 1
Jonkheer 1
Mme 1
Don 1
Ms 1
Sir 1
Lady 1
the Countess 1
Name: Title, dtype: int64
1.Use replace() method
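# 'the Countess' is what the split approach produces; the regex approach gives 'Countess'
df['Title'] = df['Title'].replace(['Lady', 'Countess', 'the Countess', 'Capt', 'Col', 'Don', 'Dr',
'Major', 'Rev', 'Sir', 'Jonkheer'], 'Rare')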
df['Title'] = df['Title'].replace('Mlle', 'Miss')
df['Title'] = df['Title'].replace('Ms', 'Miss')
df['Title'] = df['Title'].replace('Mme', 'Mrs')
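The counts below reflect only the first replace() call. Once the remaining three calls run, Mlle and Ms are folded into Miss (185) and Mme into Mrs (126):
Mr 517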
Miss 182
Mrs 125
Master 40
Rare 23
Mlle 2
Ms 1
Mme 1
Name: Title, dtype: int64
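Several one-to-one replacements can also be written as a single `replace()` call that takes a dict of old to new values. The sketch below uses a short hand-made Series rather than the real column, so the values are illustrative only.

```python
import pandas as pd

# A small hand-made sample of titles, not the real Titanic column.
titles = pd.Series(["Mr", "Mlle", "Ms", "Mme", "Dr", "Miss"], name="Title")

# One call: each key is replaced by its value, other entries are left as-is.
merged = titles.replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs", "Dr": "Rare"})

print(merged.tolist())  # ['Mr', 'Miss', 'Miss', 'Mrs', 'Rare', 'Miss']
```

Values that do not appear as keys ('Mr' and 'Miss' here) pass through unchanged, which is the main behavioral difference from `map()`.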
2.Use map() method
Title_Dict = {}
Title_Dict.update(dict.fromkeys(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer'))
Title_Dict.update(dict.fromkeys(['Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty'))
Title_Dict.update(dict.fromkeys(['Mme', 'Ms', 'Mrs'], 'Mrs'))
Title_Dict.update(dict.fromkeys(['Mlle', 'Miss'], 'Miss'))
Title_Dict.update(dict.fromkeys(['Mr'], 'Mr'))
Title_Dict.update(dict.fromkeys(['Master','Jonkheer'], 'Master'))
df['Title'] = df['Title'].map(Title_Dict)
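One thing to watch with `map()`: any value missing from the dict becomes NaN instead of being left alone. The sketch below shows this with a shortened dict and a hypothetical title 'Prof' that is not in it, then falls back to the original value with `fillna`. The names `title_map`, `mapped`, and `filled` are illustrative.

```python
import pandas as pd

# Shortened mapping for illustration; 'Prof' is a hypothetical unmapped title.
title_map = {"Mr": "Mr", "Mlle": "Miss", "Capt": "Officer"}
titles = pd.Series(["Mr", "Mlle", "Capt", "Prof"])

mapped = titles.map(title_map)
print(mapped.isna().tolist())  # [False, False, False, True]

# Fallback: keep the original title wherever the dict had no entry.
filled = mapped.fillna(titles)
print(filled.tolist())  # ['Mr', 'Miss', 'Officer', 'Prof']
```

Checking `mapped.isna().sum()` after mapping is a quick way to confirm the dict covers every title in the column.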



