cs224d notes (5,6)

Lecture 5

Max-margin objective function (window classification)

For a single window

  • The objective requires the true (positive) window to score at least 1 higher than a corrupt (negative) window
    • xxx |<= 1 =>| ooo (a margin of at least 1 between the two scores)
  • Advantage: once J reaches 0 for a window, its gradient is 0, so backpropagation can be skipped for that example
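
In symbols (a minimal statement, with s the score of the true window and s_c the score of the corrupt window, as in the lecture):

```latex
% Max-margin (hinge) objective for a single window:
%   s   = score of the true (positive) window
%   s_c = score of the corrupt (negative) window
J = \max\left(0,\ 1 - s + s_c\right)
% If s >= s_c + 1 the margin is satisfied, J = 0, and no gradient needs to be propagated.
```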

Back propagation example

Compute the derivatives of J with respect to U, W, b, and x (see the sketch below)

… a lot more
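
A minimal numpy sketch of one such derivative computation, assuming (as an illustration, not taken verbatim from the notes) a one-hidden-layer scorer s = U^T tanh(Wx + b) trained with the hinge objective above; x is the concatenated word vectors of the true window and x_c a corrupt window:

```python
import numpy as np

def f(z):            # tanh nonlinearity
    return np.tanh(z)

def f_prime(a):      # derivative of tanh, written in terms of its output a = tanh(z)
    return 1.0 - a ** 2

def window_score_grads(x, x_c, W, b, U):
    """Hinge loss J = max(0, 1 - s + s_c) for one true window x and one corrupt
    window x_c, with gradients w.r.t. U, W, b and both inputs."""
    def forward(v):
        a = f(W @ v + b)           # hidden activation, shape (H,)
        return a, float(U @ a)     # score s = U^T a
    a, s = forward(x)
    a_c, s_c = forward(x_c)
    J = max(0.0, 1.0 - s + s_c)
    grads = {name: np.zeros_like(p)
             for name, p in (("U", U), ("W", W), ("b", b), ("x", x), ("x_c", x_c))}
    if J > 0:                      # if J == 0, all gradients are 0: skip backprop
        for name, v, a_v, sign in (("x", x, a, -1.0), ("x_c", x_c, a_c, +1.0)):
            delta = sign * U * f_prime(a_v)    # dJ/dz at the hidden layer, shape (H,)
            grads["U"] += sign * a_v           # dJ/dU
            grads["W"] += np.outer(delta, v)   # dJ/dW
            grads["b"] += delta                # dJ/db
            grads[name] += W.T @ delta         # dJ/dx (or dJ/dx_c)
    return J, grads
```

The same pattern extends to deeper networks by propagating delta further back, layer by layer.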

Lecture 6

Neural tips & tricks

Multi-task learning / Weight sharing

see “Natural Language Processing (Almost) from Scratch”, Collobert et al. 2011

General Strategy for Successful NNets

  1. Select appropriate network structure

    1. Single words, fixed window, sentence-based, document-level; bag of words, recursive vs. recurrent, CNN

    2. Nonlinearity (definitions sketched in the code after this list)

      1. sigmoid (logistic): not a good choice

      2. tanh: best in many models

      3. ReLU (rectified linear): avoids vanishing gradients

      4. hard tanh

      5. softsign

  2. Gradient checks & model simplification (implement step by step; see the finite-difference sketch after this list)

  3. Parameter Initialization

    1. If z is large in magnitude, the derivative of the nonlinearity can be close to 0 (e.g. for tanh), so little gradient flows back

    2. hidden bias initialized to 0

    3. output bias initialized to optimal value if weights were 0 (e.g. mean target or inverse sigmoid of mean target)

    4. x between -1 and 1

    5. Weights W initialized so that z is small enough to be in the linear regime of the nonlinearity. Initialize W ~ Uniform(-r, r), with r inversely proportional to the square root of fan-in (previous layer size) plus fan-out (next layer size):

      r = sqrt(6 / (fan-in + fan-out)) for tanh, and 4 times bigger for sigmoid (see Glorot & Bengio, AISTATS 2010; sketched in the code after this list)

    6. For ReLU, to avoid units that always output 0, the bias can be initialized to a small positive value

  4. Mini-batch SGD (plain SGD updates after every single example, while mini-batch SGD updates after each small batch)

    1. size of batch: 20 to 1000

    2. helps with parallelization

    3. Momentum: add a fraction μ of the previous update (the velocity v) to the current one, building up velocity in directions of consistent gradient (update rule sketched in the code after this list)

      • the velocity v is initialized to 0
      • a common value is μ = 0.9
      • momentum is often increased after some epochs (0.5 -> 0.99)
    4. Learning Rates

      1. reduce by 0.5 when validation error stops improving

      2. Adagrad: adaptive learning rate for each parameter (update rule sketched in the code after this list)

        Problem: as training goes on, the accumulated sum of squared gradients keeps growing, so the learning rate becomes very small. Solution: reset the sum, or switch to Adam

    5. Regularize

      • Reduce model size!
      • L1, L2 regularization
      • early stopping: use the parameters that gave the best validation error (keep the weights from the last 50 iterations)
      • Dropout (Hinton et al. 2012; see the sketch in the code after this list)
        • Training time: at each instance of evaluation, randomly set 50% of the inputs to each neuron to 0
        • Test time: halve the model weights (since now twice as many inputs are active)
    6. Y. Bengio (2012), “Practical Recommendations for Gradient-Based Training of Deep Architectures”

      1. Hyperparameter search: set a range for each hyperparameter and use random search
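
The nonlinearities listed under “Select appropriate network structure” above, written out as plain numpy functions (a sketch; the hard tanh and softsign definitions follow their standard forms):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):                      # rectified linear unit
    return np.maximum(0.0, z)

def hard_tanh(z):                 # linear on [-1, 1], clipped outside
    return np.clip(z, -1.0, 1.0)

def softsign(z):                  # smooth, bounded in (-1, 1)
    return z / (1.0 + np.abs(z))
```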
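
For the gradient-check step above, a minimal centered finite-difference check (a sketch assuming the loss takes a flat 1-D parameter vector; the epsilon and tolerance values are conventional choices, not from the notes):

```python
import numpy as np

def gradient_check(loss_fn, grad_fn, theta, eps=1e-4, tol=1e-5):
    """Compare the analytic gradient grad_fn(theta) against centered finite
    differences of loss_fn(theta), one parameter at a time."""
    analytic = grad_fn(theta)
    numeric = np.zeros_like(theta)
    for i in range(theta.size):
        old = theta[i]
        theta[i] = old + eps
        loss_plus = loss_fn(theta)
        theta[i] = old - eps
        loss_minus = loss_fn(theta)
        theta[i] = old                                   # restore the parameter
        numeric[i] = (loss_plus - loss_minus) / (2 * eps)
    rel_err = np.abs(analytic - numeric) / np.maximum(1e-8, np.abs(analytic) + np.abs(numeric))
    return rel_err.max() <= tol
```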
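
For the weight-initialization item above, the Uniform(-r, r) scheme with the Glorot & Bengio (2010) range (the ×4 factor for sigmoid comes from the same paper; the function name is an illustrative choice):

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, nonlinearity="tanh", rng=np.random):
    """W ~ Uniform(-r, r) with r = sqrt(6 / (fan_in + fan_out)) for tanh units,
    and 4 times that for sigmoid units (Glorot & Bengio, AISTATS 2010)."""
    r = np.sqrt(6.0 / (fan_in + fan_out))
    if nonlinearity == "sigmoid":
        r *= 4.0
    return rng.uniform(-r, r, size=(fan_out, fan_in))
```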
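
For the momentum and Adagrad items above, the two update rules as single-step sketches (the learning-rate defaults and the 1e-8 stabilizer are illustrative, not from the notes):

```python
import numpy as np

def momentum_step(theta, grad, v, lr=0.01, mu=0.9):
    """Classical momentum: the velocity v (initialized to zeros) accumulates a
    decaying sum of past gradients; mu = 0.9 is a common choice."""
    v = mu * v - lr * grad
    return theta + v, v

def adagrad_step(theta, grad, g2_sum, lr=0.01, eps=1e-8):
    """Adagrad: a per-parameter learning rate that shrinks with the accumulated
    squared gradients. As g2_sum keeps growing, the effective step becomes very
    small, hence the advice above to reset the sum or switch to Adam."""
    g2_sum = g2_sum + grad ** 2
    return theta - lr * grad / (np.sqrt(g2_sum) + eps), g2_sum
```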
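
For the dropout item above, a sketch matching the description in the notes: drop half of the inputs at training time, and scale by (1 - p) at test time, which for p = 0.5 is the same as halving the weights. The mask-based implementation is an illustrative choice.

```python
import numpy as np

def dropout(a, p=0.5, train=True, rng=np.random):
    """Dropout (Hinton et al. 2012): at training time each unit is zeroed with
    probability p; at test time all units are kept but scaled by (1 - p)."""
    if train:
        mask = (rng.rand(*a.shape) >= p).astype(a.dtype)
        return a * mask
    return a * (1.0 - p)
```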

Language Models

Definition: a language model computes a probability for a sequence of words, often conditioned on a window of n previous words
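
In symbols (the standard chain-rule factorization and the fixed-window Markov approximation the definition refers to):

```latex
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
                 \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n}, \dots, w_{t-1})
```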

  • Original neural language model (“A Neural Probabilistic Language Model”, Bengio et al. 2003)

    • Problem: fixed window of context

      To solve the problem:

Recurrent Neural Networks

Main idea: use the same set of weights W at all time steps (forward step sketched in the code below)

  • initialization

  • Training RNNs is hard: vanishing or exploding gradients
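
A minimal numpy sketch of the recurrence behind the main idea above: the same weight matrices are applied at every time step. The matrix names (W_hh, W_hx, W_s) and the softmax output layer are conventional notation, not taken verbatim from the notes; in backpropagation through time, repeated multiplication by W_hh is what makes gradients vanish or explode.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(xs, W_hh, W_hx, W_s, h0):
    """Run a simple RNN over a list of input vectors xs, reusing the same
    W_hh, W_hx, W_s at every time step."""
    h = h0
    y_hats = []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_hx @ x_t)   # hidden state update
        y_hats.append(softmax(W_s @ h))      # predicted distribution over the vocabulary
    return y_hats, h
```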