
Lecture 5
Max-margin objective function (window classification)
For a single window
- It means the positive (true) window must score at least 1 higher than the negative (corrupt) window
- xxx |<= 1 =>| ooo  (positive windows separated from corrupt ones by a margin of 1)
- Advantage: once J becomes 0 the gradient is 0, so backpropagation stops for that example
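In symbols (my own shorthand for what the lecture describes, with s the score of the true window and s_c the score of a corrupt window):

```latex
% Single-window max-margin objective
J = \max\bigl(0,\; 1 - s + s_c\bigr)
% J = 0 (and its gradient is 0) once the true window beats the corrupt one
% by the margin of 1, which is why backpropagation can stop in that case.
```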
Back propagation example
Compute the derivatives with respect to U, W, b, x
… a lot more
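A minimal numpy sketch of these derivatives for a scorer of the form s = Uᵀ tanh(Wx + b); the function and variable names are mine, not the lecture's:

```python
import numpy as np

def score_and_cache(x, W, b, U):
    """Forward pass of the window scorer s = U^T tanh(W x + b)."""
    z = W @ x + b          # pre-activation, shape (H,)
    a = np.tanh(z)         # hidden activation, shape (H,)
    s = U @ a              # scalar window score
    return s, (x, z, a)

def backprop(cache, W, U, dJ_ds):
    """Gradients of J w.r.t. U, W, b, x for one window.

    With the max-margin loss, dJ_ds is -1 for the true window and +1 for the
    corrupt window when J > 0, and 0 otherwise.
    """
    x, z, a = cache
    dU = dJ_ds * a                                 # dJ/dU
    delta = dJ_ds * U * (1.0 - np.tanh(z) ** 2)    # dJ/dz (tanh derivative)
    dW = np.outer(delta, x)                        # dJ/dW
    db = delta                                     # dJ/db
    dx = W.T @ delta                               # dJ/dx, flows into word vectors
    return dU, dW, db, dx
```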
Lecture 6
Neural tips & tricks
Multi-task learning / Weight sharing
see “NLP (Almost) from Scratch”, Collobert et al. 2011
General Strategy for Successful NNets
- Select appropriate network structure
  - single words, fixed window, sentence based, document level; bag of words, recursive vs. recurrent, CNN
- Nonlinearity
  - sigmoid (logistic): not good
  - tanh: best in many models
  - ReLU (rectified linear): avoids vanishing gradients
  - hard tanh
  - soft sign
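Quick reference definitions of the nonlinearities above (a sketch; the lecture only names them):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def hard_tanh(z):
    return np.clip(z, -1.0, 1.0)

def soft_sign(z):
    return z / (1.0 + np.abs(z))
```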
- Gradient checks & model simplification (implement step by step)
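A sketch of the numerical gradient check this bullet refers to, comparing backprop gradients against centered finite differences (function names are mine):

```python
import numpy as np

def gradient_check(f, x, analytic_grad, eps=1e-5, tol=1e-6):
    """Compare an analytic gradient against centered finite differences.

    f: scalar-valued function of a flat float array x
    analytic_grad: gradient of f at x computed by backprop
    """
    num_grad = np.zeros_like(x)
    for i in range(x.size):
        old = x[i]
        x[i] = old + eps
        f_plus = f(x)
        x[i] = old - eps
        f_minus = f(x)
        x[i] = old
        num_grad[i] = (f_plus - f_minus) / (2 * eps)
    rel_err = np.abs(num_grad - analytic_grad) / np.maximum(
        1e-8, np.abs(num_grad) + np.abs(analytic_grad))
    return rel_err.max() < tol, rel_err.max()
```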
- Parameter Initialization
  - If z is large, the derivative can be close to 0 (e.g. for tanh)
  - Hidden biases initialized to 0
  - Output bias initialized to the optimal value if the weights were 0 (e.g. mean target, or inverse sigmoid of mean target)
  - Inputs x scaled to lie between -1 and 1
  - Weights W initialized so that z is small enough to stay in the linear regime: W ~ Uniform(-r, r), with r inversely proportional to fan-in (previous layer size) and fan-out (next layer size) for tanh, and 4 times bigger for sigmoid (see Glorot, AISTATS 2010); a sketch follows this list
  - For ReLU, to avoid units getting stuck at 0, initialize the bias to a small positive value
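A sketch of this recipe using the Glorot/Xavier range r = sqrt(6 / (fan_in + fan_out)); the helper name is mine:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, nonlinearity="tanh", rng=np.random.default_rng(0)):
    """Uniform(-r, r) weights with r from Glorot & Bengio (AISTATS 2010)."""
    r = np.sqrt(6.0 / (fan_in + fan_out))
    if nonlinearity == "sigmoid":
        r *= 4.0                       # 4 times bigger range for sigmoid
    return rng.uniform(-r, r, size=(fan_out, fan_in))

W = glorot_uniform(fan_in=300, fan_out=100)   # keeps z in the linear regime
b_hidden = np.zeros(100)                      # hidden bias at 0
b_hidden_relu = np.full(100, 0.01)            # small positive bias for ReLU units
```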
- Mini-batch SGD (plain SGD updates after each single example, mini-batch SGD after a small batch)
  - batch size: 20 to 1000
  - helps parallelization
- Momentum: add a fraction (μ) of the previous update to the current one, building up velocity v in directions of consistent gradient (update sketched below)
  - v is initialized at 0
  - common μ = 0.9
  - momentum is often increased after some epochs (0.5 -> 0.99)
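The momentum update in code (a sketch; names and defaults other than μ = 0.9 are mine):

```python
def sgd_momentum_step(theta, v, grad, lr=0.01, mu=0.9):
    """One mini-batch SGD step with classical momentum.

    theta: parameters, v: velocity (initialized to zeros), grad: mini-batch gradient.
    """
    v = mu * v - lr * grad     # build up velocity along consistent gradient directions
    theta = theta + v
    return theta, v
```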
- Learning Rates
  - reduce by 0.5 when validation error stops improving
  - Adagrad: an adaptive learning rate for each parameter
    - Problem: as training goes on, the effective learning rate becomes very small. Solution: reset the accumulated sum (or use Adam); a sketch follows this list
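A sketch of the Adagrad update (names are mine); the squared-gradient cache is what eventually shrinks the effective learning rate:

```python
import numpy as np

def adagrad_step(theta, grad, cache, lr=0.01, eps=1e-8):
    """cache accumulates squared gradients per parameter (initialized to zeros)."""
    cache = cache + grad ** 2
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache
```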
- Regularize
  - Reduce model size!
  - L1, L2 regularization
  - Early stopping: use the parameters that gave the best validation error (keep the weights from the last ~50 iterations)
  - Dropout (Hinton et al. 2012)
    - Training time: at each evaluation of an example, randomly set 50% of the inputs to each neuron to 0
    - Test time: halve the model weights (since twice as many inputs are now active)
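A sketch of this dropout scheme (my own helper names):

```python
import numpy as np

def dropout_train(x, p_drop=0.5, rng=np.random.default_rng(0)):
    """Training time: zero out each input to a layer with probability p_drop."""
    mask = rng.random(x.shape) >= p_drop
    return x * mask

def dropout_test_weights(W, p_drop=0.5):
    """Test time: scale weights by the keep probability (halve them for p_drop = 0.5)."""
    return W * (1.0 - p_drop)
```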
- Y. Bengio (2012), “Practical Recommendations for Gradient-Based Training of Deep Architectures”
  - Hyperparameter search: set a range for each hyperparameter and sample randomly (see the sketch below)
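A sketch of random hyperparameter search over set ranges (the ranges and names here are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hyperparams():
    return {
        "lr": 10 ** rng.uniform(-4, -1),            # log-uniform learning rate
        "hidden_size": int(rng.integers(50, 500)),
        "dropout": rng.uniform(0.0, 0.5),
    }

trials = [sample_hyperparams() for _ in range(20)]  # evaluate each on the dev set
```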
Language Models
Definition: a language model computes a probability for a sequence of words, often conditioned on a window of the n previous words
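Written out (my notation), with the window approximation applied to the chain rule:

```latex
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
                   \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n}, \dots, w_{t-1})
```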
- Original neural language model (“A Neural Probabilistic Language Model”, Bengio et al. 2003)
  - Problem: fixed window of context
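Roughly, the Bengio et al. (2003) model concatenates the embeddings of the fixed window of context words into x and scores the next word as (my paraphrase of the paper's formulation):

```latex
\hat{y} = \operatorname{softmax}\bigl(b + W x + U \tanh(d + H x)\bigr)
% The window size, and hence the shapes of x, H, and W, is fixed in advance,
% which is the limitation noted above.
```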
To solve this problem:
- Recurrent Neural Networks
  - Main idea: use the same set of weights W at all time steps
  - initialization
  - Training RNNs is hard: vanishing or exploding gradients (recurrence sketched below)
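The recurrence behind these notes (my notation): the same weights are applied at every time step, and multiplying by them repeatedly is what makes gradients vanish or explode:

```latex
h_t = \sigma\!\left(W_{hh}\, h_{t-1} + W_{hx}\, x_t\right), \qquad
\hat{y}_t = \operatorname{softmax}\left(W_{S}\, h_t\right)
```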



