
label smoothing:
- MOTIVATION:
In the standard cross-entropy loss:
$$Loss = -\sum_{k=1}^{K} q(k|x)\,\log p(k|x)$$
Here the ground-truth distribution $q(k|x)$ is usually a one-hot vector: the entry for the correct class is 1 and every other entry is 0. This choice causes two problems.
- it may result in over-fitting: if the model learns to assign full probability to the ground-truth label for each training example, it is not guaranteed to generalize.
- it encourages the differences between the largest logit and all others to become large, and this, combined with the bounded gradient $\frac{\partial\, Loss}{\partial z_{k}}$, reduces the ability of the model to adapt. Intuitively, this happens because the model becomes too confident about its predictions.
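
For clarity (this step is the standard softmax cross-entropy result, not spelled out in the original notes): with softmax outputs $p(k|x)$, the gradient of the loss with respect to logit $z_k$ is

$$\frac{\partial\, Loss}{\partial z_{k}} = p(k|x) - q(k|x)$$

which is bounded between $-1$ and $1$. With one-hot targets, pushing $p(y|x)\to 1$ drives all of these gradients toward 0, so a very confident model receives almost no corrective signal.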
- SOLUTION
Replace $q(k|x)=\delta_{k,y}$ in the original cross-entropy loss with $q'(k|x)=(1-\epsilon)\,\delta_{k,y} + \epsilon\,\mu(k)$,
where $\mu(k)$ is the uniform distribution over the $K$ classes, i.e. $\mu(k)=1/K$.
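
As a quick illustration (the numbers below are made up for this sketch, not from the original notes): with $K=5$ classes and $\epsilon=0.1$, the smoothed target keeps most of the mass on the true class and spreads $\epsilon$ uniformly over all classes.

```python
import torch

# Hypothetical example: K = 5 classes, smoothing epsilon = 0.1.
K, eps = 5, 0.1
one_hot = torch.tensor([0., 0., 1., 0., 0.])   # delta_{k,y} for y = 2
uniform = torch.full((K,), 1.0 / K)            # mu(k) = 1/K
smoothed = (1 - eps) * one_hot + eps * uniform
print(smoothed)  # tensor([0.0200, 0.0200, 0.9200, 0.0200, 0.0200])
```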
- IMPLEMENT
```python
def compute_loss(self, model, net_output, sample, reduce=True):
    lprobs = model.get_normalized_probs(net_output, log_probs=True)
    lprobs = lprobs.view(-1, lprobs.size(-1))
    target = model.get_targets(sample, net_output).view(-1, 1)
    non_pad_mask = target.ne(self.padding_idx)
    # -log p(y|x): the usual NLL term, with padding positions masked out
    nll_loss = -lprobs.gather(dim=-1, index=target)[non_pad_mask]
    # -sum_k log p(k|x): summed over all classes, used for the uniform part of q'
    smooth_loss = -lprobs.sum(dim=-1, keepdim=True)[non_pad_mask]
    if reduce:
        nll_loss = nll_loss.sum()
        smooth_loss = smooth_loss.sum()
    # eps_i = epsilon * mu(k) = epsilon / K
    eps_i = self.eps / lprobs.size(-1)
    loss = (1. - self.eps) * nll_loss + eps_i * smooth_loss
    return loss, nll_loss
```
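
Note that `smooth_loss` sums $-\log p(k|x)$ over all $K$ classes, so weighting it by `eps_i = eps / K` supplies the uniform part $\epsilon\,\mu(k)$ of $q'$, while `(1 - eps) * nll_loss` supplies the $(1-\epsilon)\,\delta_{k,y}$ part. Below is a minimal standalone sketch of the same computation in plain PyTorch, outside the fairseq criterion class; the helper name and the sanity-check values are only for illustration.

```python
import torch
import torch.nn.functional as F

def label_smoothed_nll_loss(lprobs, target, eps, ignore_index=None):
    """Label-smoothed NLL loss given log-probabilities of shape (N, K)."""
    nll_loss = -lprobs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)
    smooth_loss = -lprobs.sum(dim=-1)              # sum over all K classes
    if ignore_index is not None:
        mask = target.ne(ignore_index)
        nll_loss, smooth_loss = nll_loss[mask], smooth_loss[mask]
    eps_i = eps / lprobs.size(-1)                  # epsilon * mu(k) = eps / K
    return (1.0 - eps) * nll_loss.sum() + eps_i * smooth_loss.sum()

# Sanity check: the gather/sum trick equals cross-entropy against explicit smoothed targets.
torch.manual_seed(0)
logits = torch.randn(4, 10)
target = torch.tensor([1, 3, 5, 7])
lprobs = F.log_softmax(logits, dim=-1)
loss = label_smoothed_nll_loss(lprobs, target, eps=0.1)

eps, K = 0.1, logits.size(-1)
q = torch.full_like(lprobs, eps / K)               # epsilon * mu(k) on every class
q.scatter_(1, target.unsqueeze(-1), 1 - eps + eps / K)  # extra (1 - eps) on the true class
explicit = -(q * lprobs).sum()
assert torch.allclose(loss, explicit)
```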