月落丹枫

Data Transformation:

1.split column
2.lambda function & re.findall
3.replace

1.Split column

Here we use the Titanic dataset as an example. Suppose we want to extract the title (Mr, Miss, ...) from the Name column to create a new feature.

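# Step 1: split each name on the comma
df['Name_split']=df['Name'].str.split(',')

# Step 2: keep the part after the comma
df['Name_split']=df['Name_split'].str.get(1)

# Step 3: split on the period, keep the first part, and strip the leading space
df['Title']=df['Name_split'].str.split('.').str.get(0).str.strip()
or, in a single line:
df['Title']=df['Name'].apply(lambda x:x.split(',')[1].split('.')[0].strip())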

1.Split the Name column by ‘,’ and assign it to ‘Name_split’

2.Get the second part of ‘Name_split’

3.Split the ‘Name_split’ by ‘.’ and get the first part of it
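
The three steps above can be tried on a few rows without loading the full dataset. This is a minimal sketch, assuming pandas is installed. The frame `sample` and the column `Title_lambda` are hypothetical names used only for illustration, and the rows follow the Titanic "Surname, Title. Given names" format.

```python
import pandas as pd

# Tiny hand-made sample in the "Surname, Title. Given names" format
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    ]
})

# Step 1: split on the comma -> ['Braund', ' Mr. Owen Harris']
name_split = sample["Name"].str.split(",")

# Step 2: keep the part after the comma -> ' Mr. Owen Harris'
after_comma = name_split.str.get(1)

# Step 3: split on the period, keep the first piece, trim the leading space
sample["Title"] = after_comma.str.split(".").str.get(0).str.strip()

# One-line equivalent with apply + lambda
sample["Title_lambda"] = sample["Name"].apply(
    lambda x: x.split(",")[1].split(".")[0].strip()
)

print(sample[["Title", "Title_lambda"]])
```

Without the final strip, every title would keep the space that follows the comma (' Mr' instead of 'Mr'), which later breaks exact matches in replace() or map().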

2.lambda function & re.findall

import re
# Capture the word that follows a space and ends with a literal period, e.g. 'Mr'
df['Title']=df['Name'].apply(lambda x: re.findall(r' ([A-Za-z]+)\.',x)[0])
#regular expression
dollar sign: \$
arbitrary number of digits: \d*
decimal point: \.
x number of digits: \d{x}
any capital letter: [A-Z]
arbitrary number of alphanumeric characters: \w*
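
In a regular expression an unescaped `.` matches any character, while `\.` matches only a literal period. The sketch below uses Python's standard `re` module to show the difference on one name, then combines a few of the pieces listed above into a single pattern. The sample strings and variable names are made up for illustration.

```python
import re

name = "Braund, Mr. Owen Harris"

# Unescaped '.' matches any character, so the first hit is the surname
loose = re.findall(r"([A-Za-z]+).", name)

# Space + letters + escaped period only matches a word that ends in '.'
strict = re.findall(r" ([A-Za-z]+)\.", name)

print(loose[0])   # Braund
print(strict[0])  # Mr

# Dollar sign, any number of digits, a decimal point, exactly two digits
price_pattern = r"\$\d*\.\d{2}"
prices = re.findall(price_pattern, "Fare: $71.28, refund: $7.25")
print(prices)     # ['$71.28', '$7.25']
```

Writing patterns as raw strings (the `r` prefix) keeps the backslashes from being interpreted by Python before the regex engine sees them.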

3.replace

After extracting the 'Title' column, it helps to reduce the number of categories in it, because many titles appear only once or twice.

df.Title.value_counts()
Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Major             2
Mlle              2
Col               2
Capt              1
Jonkheer          1
Mme               1
Don               1
Ms                1
Sir               1
Lady              1
the Countess      1
Name: Title, dtype: int64

1.Use replace() method

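# Merge the rare titles into a single 'Rare' category.
# 'the Countess' is listed as well because the split-based extraction keeps the article.
df['Title'] = df['Title'].replace(['Lady', 'Countess', 'the Countess', 'Capt', 'Col', 'Don', 'Dr',
'Major', 'Rev', 'Sir', 'Jonkheer'], 'Rare')

After this first call, df.Title.value_counts() gives:

Mr        517
Miss      182
Mrs       125
Master     40
Rare       23
Mlle        2
Ms          1
Mme         1
Name: Title, dtype: int64

The remaining variants are then folded into the common titles:

df['Title'] = df['Title'].replace('Mlle', 'Miss')
df['Title'] = df['Title'].replace('Ms', 'Miss')
df['Title'] = df['Title'].replace('Mme', 'Mrs')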
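
As a quick check of how replace() behaves, the sketch below runs the same kind of calls on a small hand-made Series (the values are illustrative and not taken from the dataset; `titles`, `step1` and `step2` are hypothetical names). It also shows that a single dict can stand in for several chained calls, and that values not mentioned are left unchanged.

```python
import pandas as pd

titles = pd.Series(["Mr", "Mlle", "Dr", "Ms", "Mme", "Miss", "Rev"], name="Title")

# A list of values can be replaced by one value in a single call
step1 = titles.replace(["Dr", "Rev"], "Rare")

# A dict replaces several values at once; anything not listed stays as it is
step2 = step1.replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})

print(step2.value_counts())
```

Because replace() matches whole values exactly, a title with stray whitespace such as ' Dr' would not be caught, so it is worth stripping the column first.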

2.Use map() method

Title_Dict = {}
Title_Dict.update(dict.fromkeys(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer'))
Title_Dict.update(dict.fromkeys(['Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty'))
Title_Dict.update(dict.fromkeys(['Mme', 'Ms', 'Mrs'], 'Mrs'))
Title_Dict.update(dict.fromkeys(['Mlle', 'Miss'], 'Miss'))
Title_Dict.update(dict.fromkeys(['Mr'], 'Mr'))
Title_Dict.update(dict.fromkeys(['Master','Jonkheer'], 'Master'))

df['Title'] = df['Title'].map(Title_Dict)
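
One difference from replace() is worth checking: Series.map() with a dict returns NaN for any value that is not a key, so a title missing from the dictionary silently disappears. A minimal sketch, using a hand-made Series and a shortened, hypothetical dictionary (`title_dict`) in which 'Countess' is left out on purpose:

```python
import pandas as pd

titles = pd.Series(["Mr", "Mlle", "Capt", "Countess"], name="Title")

# Shortened mapping for illustration; 'Countess' is deliberately not a key
title_dict = {}
title_dict.update(dict.fromkeys(["Capt", "Col"], "Officer"))
title_dict.update(dict.fromkeys(["Mlle", "Miss"], "Miss"))
title_dict.update(dict.fromkeys(["Mr"], "Mr"))

# Values without a matching key become NaN
mapped = titles.map(title_dict)
print(mapped.isna().sum())

# Optional fallback: keep the original title where the mapping has no entry
mapped_filled = mapped.fillna(titles)
print(mapped_filled.tolist())
```

Counting the NaN values right after map() is a cheap way to confirm that the dictionary covers every title produced by the extraction step.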