cs224d notes (5,6)

Lecture 5

Max-margin objective function (window classification)

For a single window

  • The objective requires the true (positive) window to score at least 1 higher than a corrupt (negative) window
    • xxx |<= 1 =>| ooo (a margin of at least 1 between the two scores)
  • Advantage: once J reaches 0 for a window, its gradient is 0, so backpropagation can be skipped for that example
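
In symbols (a minimal statement, with s the score of the true window and s_c the score of the corrupt window, as in the lecture):

```latex
% Max-margin (hinge) objective for a single window:
%   s   = score of the true (positive) window
%   s_c = score of the corrupt (negative) window
J = \max\left(0,\ 1 - s + s_c\right)
% If s >= s_c + 1 the margin is satisfied, J = 0, and no gradient needs to be propagated.
```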

Back propagation example

Compute the derivatives of J with respect to U, W, b, and x (see the sketch below)

… a lot more
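
A minimal numpy sketch of one such derivative computation, assuming (as an illustration, not taken verbatim from the notes) a one-hidden-layer scorer s = U^T tanh(Wx + b) trained with the hinge objective above; x is the concatenated word vectors of the true window and x_c a corrupt window:

```python
import numpy as np

def f(z):            # tanh nonlinearity
    return np.tanh(z)

def f_prime(a):      # derivative of tanh, written in terms of its output a = tanh(z)
    return 1.0 - a ** 2

def window_score_grads(x, x_c, W, b, U):
    """Hinge loss J = max(0, 1 - s + s_c) for one true window x and one corrupt
    window x_c, with gradients w.r.t. U, W, b and both inputs."""
    def forward(v):
        a = f(W @ v + b)           # hidden activation, shape (H,)
        return a, float(U @ a)     # score s = U^T a
    a, s = forward(x)
    a_c, s_c = forward(x_c)
    J = max(0.0, 1.0 - s + s_c)
    grads = {name: np.zeros_like(p)
             for name, p in (("U", U), ("W", W), ("b", b), ("x", x), ("x_c", x_c))}
    if J > 0:                      # if J == 0, all gradients are 0: skip backprop
        for name, v, a_v, sign in (("x", x, a, -1.0), ("x_c", x_c, a_c, +1.0)):
            delta = sign * U * f_prime(a_v)    # dJ/dz at the hidden layer, shape (H,)
            grads["U"] += sign * a_v           # dJ/dU
            grads["W"] += np.outer(delta, v)   # dJ/dW
            grads["b"] += delta                # dJ/db
            grads[name] += W.T @ delta         # dJ/dx (or dJ/dx_c)
    return J, grads
```

The same pattern extends to deeper networks by propagating delta further back, layer by layer.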

Lecture 6

Neural tips & tricks

Multi-task learning / Weight sharing

see “Natural Language Processing (Almost) from Scratch”, Collobert et al. 2011

General Strategy for Successful NNets

  1. Select appropriate network structure

    1. Single words, fixed window, sentence-based, document-level; bag of words, recursive vs. recurrent, CNN

    2. Nonlinearity (definitions sketched in the code after this list)

      1. sigmoid (logistic): not a good choice

      2. tanh: best in many models

      3. ReLU (rectified linear): avoids vanishing gradients

      4. hard tanh

      5. softsign

  2. Gradient checks & model simplification (implement step by step; see the finite-difference sketch after this list)

  3. Parameter Initialization

    1. If z is large in magnitude, the derivative of the nonlinearity can be close to 0 (e.g. for tanh), so little gradient flows back

    2. hidden bias initialized to 0

    3. output bias initialized to optimal value if weights were 0 (e.g. mean target or inverse sigmoid of mean target)

    4. x between -1 and 1

    5. Weights W initialized so that z is small enough to be in the linear regime of the nonlinearity. Initialize W ~ Uniform(-r, r), with r inversely proportional to the square root of fan-in (previous layer size) plus fan-out (next layer size):

      r = sqrt(6 / (fan-in + fan-out)) for tanh, and 4 times bigger for sigmoid (see Glorot & Bengio, AISTATS 2010; sketched in the code after this list)

    6. For ReLU, to avoid units that always output 0, the bias can be initialized to a small positive value

  4. Mini-batch SGD (plain SGD updates after every single example, while mini-batch SGD updates after each small batch)

    1. size of batch: 20 to 1000

    2. helps with parallelization

    3. Momentum: add a fraction μ of the previous update (the velocity v) to the current one, building up velocity in directions of consistent gradient (update rule sketched in the code after this list)

      • the velocity v is initialized to 0
      • a common value is μ = 0.9
      • momentum is often increased after some epochs (0.5 -> 0.99)
    4. Learning Rates

      1. reduce by 0.5 when validation error stops improving

      2. Adagrad: adaptive learning rate for each parameter (update rule sketched in the code after this list)

        Problem: as training goes on, the accumulated sum of squared gradients keeps growing, so the learning rate becomes very small. Solution: reset the sum, or switch to Adam

    5. Regularize

      • Reduce model size!
      • L1, L2 regularization
      • early stopping: use the parameters that gave the best validation error (keep the weights from the last 50 iterations)
      • Dropout (Hinton et al. 2012; see the sketch in the code after this list)
        • Training time: at each instance of evaluation, randomly set 50% of the inputs to each neuron to 0
        • Test time: halve the model weights (since now twice as many inputs are active)
    6. Y. Bengio (2012), “Practical Recommendations for Gradient-Based Training of Deep Architectures”

      1. Hyperparameter search: set a range for each hyperparameter and use random search
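
The nonlinearities listed under “Select appropriate network structure” above, written out as plain numpy functions (a sketch; the hard tanh and softsign definitions follow their standard forms):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):                      # rectified linear unit
    return np.maximum(0.0, z)

def hard_tanh(z):                 # linear on [-1, 1], clipped outside
    return np.clip(z, -1.0, 1.0)

def softsign(z):                  # smooth, bounded in (-1, 1)
    return z / (1.0 + np.abs(z))
```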
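
For the gradient-check step above, a minimal centered finite-difference check (a sketch assuming the loss takes a flat 1-D parameter vector; the epsilon and tolerance values are conventional choices, not from the notes):

```python
import numpy as np

def gradient_check(loss_fn, grad_fn, theta, eps=1e-4, tol=1e-5):
    """Compare the analytic gradient grad_fn(theta) against centered finite
    differences of loss_fn(theta), one parameter at a time."""
    analytic = grad_fn(theta)
    numeric = np.zeros_like(theta)
    for i in range(theta.size):
        old = theta[i]
        theta[i] = old + eps
        loss_plus = loss_fn(theta)
        theta[i] = old - eps
        loss_minus = loss_fn(theta)
        theta[i] = old                                   # restore the parameter
        numeric[i] = (loss_plus - loss_minus) / (2 * eps)
    rel_err = np.abs(analytic - numeric) / np.maximum(1e-8, np.abs(analytic) + np.abs(numeric))
    return rel_err.max() <= tol
```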
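
For the weight-initialization item above, the Uniform(-r, r) scheme with the Glorot & Bengio (2010) range (the ×4 factor for sigmoid comes from the same paper; the function name is an illustrative choice):

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, nonlinearity="tanh", rng=np.random):
    """W ~ Uniform(-r, r) with r = sqrt(6 / (fan_in + fan_out)) for tanh units,
    and 4 times that for sigmoid units (Glorot & Bengio, AISTATS 2010)."""
    r = np.sqrt(6.0 / (fan_in + fan_out))
    if nonlinearity == "sigmoid":
        r *= 4.0
    return rng.uniform(-r, r, size=(fan_out, fan_in))
```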
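
For the momentum and Adagrad items above, the two update rules as single-step sketches (the learning-rate defaults and the 1e-8 stabilizer are illustrative, not from the notes):

```python
import numpy as np

def momentum_step(theta, grad, v, lr=0.01, mu=0.9):
    """Classical momentum: the velocity v (initialized to zeros) accumulates a
    decaying sum of past gradients; mu = 0.9 is a common choice."""
    v = mu * v - lr * grad
    return theta + v, v

def adagrad_step(theta, grad, g2_sum, lr=0.01, eps=1e-8):
    """Adagrad: a per-parameter learning rate that shrinks with the accumulated
    squared gradients. As g2_sum keeps growing, the effective step becomes very
    small, hence the advice above to reset the sum or switch to Adam."""
    g2_sum = g2_sum + grad ** 2
    return theta - lr * grad / (np.sqrt(g2_sum) + eps), g2_sum
```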
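
For the dropout item above, a sketch matching the description in the notes: drop half of the inputs at training time, and scale by (1 - p) at test time, which for p = 0.5 is the same as halving the weights. The mask-based implementation is an illustrative choice.

```python
import numpy as np

def dropout(a, p=0.5, train=True, rng=np.random):
    """Dropout (Hinton et al. 2012): at training time each unit is zeroed with
    probability p; at test time all units are kept but scaled by (1 - p)."""
    if train:
        mask = (rng.rand(*a.shape) >= p).astype(a.dtype)
        return a * mask
    return a * (1.0 - p)
```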

Language Models

Definition: a language model computes a probability for a sequence of words, often conditioned on a window of n previous words
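
In symbols (the standard chain-rule factorization and the fixed-window Markov approximation the definition refers to):

```latex
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
                 \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n}, \dots, w_{t-1})
```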

  • Original neural language model (“A Neural Probabilistic Language Model”, Bengio et al. 2003)

    • Problem: fixed window of context

      To solve the problem:

Recurrent Neural Networks

Main idea: use the same set of weights W at all time steps (forward step sketched in the code below)

  • initialization

  • Training RNNs is hard: vanishing or exploding gradients
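
A minimal numpy sketch of the recurrence behind the main idea above: the same weight matrices are applied at every time step. The matrix names (W_hh, W_hx, W_s) and the softmax output layer are conventional notation, not taken verbatim from the notes; in backpropagation through time, repeated multiplication by W_hh is what makes gradients vanish or explode.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(xs, W_hh, W_hx, W_s, h0):
    """Run a simple RNN over a list of input vectors xs, reusing the same
    W_hh, W_hx, W_s at every time step."""
    h = h0
    y_hats = []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_hx @ x_t)   # hidden state update
        y_hats.append(softmax(W_s @ h))      # predicted distribution over the vocabulary
    return y_hats, h
```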