Neural network tricks (overfitting)

label smoothing:

  • MOTIVATION:
    In the conventional cross-entropy loss
    $$Loss = -\sum_{k=1}^{K} q(k|x)\log p(k|x)$$
    the ground-truth label distribution $q(k|x)$ is usually a one-hot vector: the entry for the correct class is 1 and all other entries are 0. This leads to two problems.

    1. It may result in over-fitting: if the model learns to assign the full probability to the ground-truth label for each training example, it is not guaranteed to generalize.
    2. It encourages the difference between the largest logit and all the others to become large, and this, combined with the bounded gradient $\frac{\partial Loss}{\partial z_k}$, reduces the ability of the model to adapt. Intuitively, this happens because the model becomes too confident about its predictions.
  • SOLUTION
    Replace $q(k|x)=\delta_{k,y}$ in the cross-entropy above with $q'(k|x)=(1-\epsilon)\delta_{k,y} + \epsilon\mu(k)$,
    where $\mu(k)$ is the uniform distribution over the $K$ labels (see the worked identity and sketch after the code below).
  • IMPLEMENT
    def compute_loss(self, model, net_output, sample, reduce=True):
        # log-probabilities over the vocabulary, flattened to (batch * seq_len, vocab)
        lprobs = model.get_normalized_probs(net_output, log_probs=True)
        lprobs = lprobs.view(-1, lprobs.size(-1))
        target = model.get_targets(sample, net_output).view(-1, 1)
        non_pad_mask = target.ne(self.padding_idx)
        # nll_loss: -log p(y|x), the usual negative log-likelihood of the gold label
        nll_loss = -lprobs.gather(dim=-1, index=target)[non_pad_mask]
        # smooth_loss: -sum_k log p(k|x), the term contributed by the uniform mu(k)
        smooth_loss = -lprobs.sum(dim=-1, keepdim=True)[non_pad_mask]
        if reduce:
            nll_loss = nll_loss.sum()
            smooth_loss = smooth_loss.sum()
        # spread eps uniformly over the K = lprobs.size(-1) classes
        eps_i = self.eps / lprobs.size(-1)
        loss = (1. - self.eps) * nll_loss + eps_i * smooth_loss
        return loss, nll_loss
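
Plugging $q'(k|x)$ into the cross-entropy makes the split used in the code explicit (with $\mu(k)=1/K$):
$$Loss' = -\sum_{k=1}^{K} q'(k|x)\log p(k|x) = (1-\epsilon)\big(-\log p(y|x)\big) + \frac{\epsilon}{K}\sum_{k=1}^{K}\big(-\log p(k|x)\big)$$
The first term is nll_loss and the second sum is smooth_loss above, with eps_i = eps / K. As an illustrative example (numbers chosen here, not from the paper): for $K=4$, $\epsilon=0.1$ and true class $y=2$, the smoothed target is $q'=(0.025,\ 0.925,\ 0.025,\ 0.025)$.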
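
A minimal self-contained PyTorch sketch of the same computation (variable names and shapes here are illustrative, not the fairseq API); it checks numerically that the nll/smooth decomposition equals cross-entropy against the smoothed targets:

    import torch
    import torch.nn.functional as F

    def label_smoothed_nll(lprobs, target, eps):
        # lprobs: (N, K) log-probabilities, target: (N,) gold class indices
        K = lprobs.size(-1)
        nll_loss = -lprobs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)  # -log p(y|x)
        smooth_loss = -lprobs.sum(dim=-1)                                          # -sum_k log p(k|x)
        return (1. - eps) * nll_loss + (eps / K) * smooth_loss

    # toy check on random logits
    logits = torch.randn(3, 4)
    target = torch.tensor([2, 0, 1])
    eps = 0.1
    lprobs = F.log_softmax(logits, dim=-1)

    # build the smoothed target distribution q'(k|x) directly
    q_prime = torch.full_like(lprobs, eps / 4)                     # epsilon * uniform part
    q_prime.scatter_(1, target.unsqueeze(1), (1 - eps) + eps / 4)  # (1 - eps) on the true class
    direct = -(q_prime * lprobs).sum(dim=-1)

    assert torch.allclose(direct, label_smoothed_nll(lprobs, target, eps))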

Code link
Paper link

warmup

(learning rate) warmup
(Maxout Dropout, DropConnect) dropout

Data Augmentation