Brief introduction:
Connectionist temporal classification (CTC) is used for sequence labelling on top of the last layer of an RNN. For an input sequence of length T, the network outputs a softmax vector at every time step as the prediction probabilities over the labels. CTC interprets these outputs as a probability distribution over label sequences, and the network is trained by minimizing the negative log-likelihood of the true label sequences over a training set.
The exponential number of possible alignments between a label sequence and an observation sequence is kept in check by an evaluation strategy inspired by the forward-backward algorithm for HMMs.
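To make the objective concrete, here is a minimal sketch (assuming PyTorch, with index 0 used for the blank label; sizes and values are purely illustrative) that evaluates the CTC negative log-likelihood of a label sequence given per-frame log-softmax outputs:

```python
import torch
import torch.nn as nn

# Illustrative sizes: T frames, batch of 1, n = (n - 1) classes + 1 blank.
T, batch, n = 12, 1, 5

# Per-frame log-softmax outputs, shape (T, batch, n), standing in for the RNN's softmax layer.
log_probs = torch.randn(T, batch, n).log_softmax(dim=2)

# A hypothetical true label sequence (no blanks) of length 4 over the classes 1..n-1.
targets = torch.tensor([[1, 3, 2, 4]])
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([4])

# CTC loss = -log p(l | x): the negative log-likelihood of the label sequence
# under the distribution the network defines over all label sequences.
ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```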
Description of CTC in detail:
- Assume a sequence with length T, where every frame is assigned one of n labels (n − 1 classes or blank).
- i.e. let x ∈ (R^m)^T be an input sequence of length T and y ∈ (R^n)^T the sequence of outputs produced by a network that classifies each frame into one of n − 1 classes or blank.
- Let L^T be the set of sequences of length T over the alphabet {1, . . . , n}. Then a sequence π ∈ L^T of labels of the same length as x has, conditioned on the observation x, the probability
  p(π | x) = ∏_{t=1..T} y^t_{πt},
  where y^t_k is the network output for label k at time t; such a sequence π is called a path.
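A small numpy sketch of this path probability, assuming a (T, n) matrix of per-frame softmax outputs and (as an illustrative encoding) index 0 for the blank label:

```python
import numpy as np

def path_probability(y, path):
    """p(pi | x) = prod over t of y[t, pi_t] for a single alignment (path) of length T.

    y    -- (T, n) array of per-frame softmax outputs (hypothetical values below)
    path -- length-T sequence of label indices, with 0 denoting blank here
    """
    T = y.shape[0]
    return float(np.prod([y[t, path[t]] for t in range(T)]))

# Tiny example: T = 3 frames, n = 3 labels (blank, 'a', 'b').
y = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.7, 0.1],
              [0.5, 0.2, 0.3]])
print(path_probability(y, [1, 1, 0]))  # p("a a blank" | x) = 0.3 * 0.7 * 0.5
```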
- Define ß: L^T → L^{≤T} as the function that maps a path to a label sequence by first merging repeated labels and then removing the blanks, for example ß(aa−−ab−) = ß(a−−−ab−) = aab.
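A sketch of ß, assuming labels are encoded as integers with 0 standing for blank (a hypothetical encoding, matching the sketch above):

```python
from itertools import groupby

BLANK = 0  # assumed index of the blank label

def collapse(path):
    """ß: merge repeated labels, then drop blanks, e.g. ß(aa--ab-) = aab."""
    merged = [label for label, _ in groupby(path)]        # merge repeats
    return [label for label in merged if label != BLANK]  # remove blanks

# With 1 = 'a' and 2 = 'b', both paths below collapse to "aab".
print(collapse([1, 1, 0, 0, 1, 2, 0]))  # [1, 1, 2]
print(collapse([1, 0, 0, 0, 1, 2, 0]))  # [1, 1, 2]
```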
- Then the probability of any label sequence l of length at most T can be computed by summing over all paths that collapse to it:
  p(l | x) = Σ_{π ∈ ß^{-1}(l)} p(π | x).
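For tiny examples this sum can be evaluated by brute force over all n^T paths, which also makes the combinatorial explosion mentioned next very visible; a sketch (same illustrative encoding as above):

```python
from itertools import groupby, product
import numpy as np

BLANK = 0

def collapse(path):
    merged = [k for k, _ in groupby(path)]
    return [k for k in merged if k != BLANK]

def label_sequence_probability(y, l):
    """p(l | x) = sum of prod_t y[t, pi_t] over all length-T paths pi with ß(pi) = l.

    Brute-force enumeration of all n**T paths -- only feasible for tiny T,
    which is exactly what the forward-backward recursion below avoids.
    """
    T, n = y.shape
    total = 0.0
    for pi in product(range(n), repeat=T):  # all n**T alignments
        if collapse(pi) == list(l):
            total += np.prod([y[t, pi[t]] for t in range(T)])
    return float(total)

y = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.7, 0.1],
              [0.5, 0.2, 0.3]])
print(label_sequence_probability(y, [1]))     # p("a" | x)
print(label_sequence_probability(y, [1, 2]))  # p("ab" | x) = 0.231
```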
- For any t, the probability of a labelling l can also be expressed in terms of forward and backward variables, which are defined next.
Training by maximizing the log-likelihood with this formula naively would require summing over all paths corresponding to a label, and "in general there are very many of these" (Graves). However, the combinatorial explosion can be avoided by dynamic programming similar to the forward-backward algorithm for HMMs. For a labelling l, define the total probability of its first s labels l1:s at time t as
  αt(s) = Σ_{π: ß(π1:t) = l1:s} ∏_{t'=1..t} y^{t'}_{πt'},
where α is called the forward variable.
Define a backward variable β for suffix probabilities in an analogous way.
The important insight is that αt(s) can be computed recursively from αt−1(s) and αt−1(s − 1) (and, in the usual formulation over the blank-augmented label sequence l', from αt−1(s − 2) as well when l's is neither a blank nor equal to l's−2), and similarly for β.
Then for any t the probability of a labelling l can be rewritten as
  p(l | x) = Σs αt(s) βt(s) / y^t_{l's},
where the sum runs over the positions s of the blank-augmented sequence l' and the division removes the output probability at time t, which is counted in both αt(s) and βt(s).
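A numpy sketch of the forward-backward recursion, written (as in the standard CTC formulation) over the blank-augmented sequence l' obtained by inserting blanks between labels and at both ends; the blank index 0 and the example values are assumptions. Both αt(s) and βt(s) here include the output probability at frame t, which is why the identity at the end divides by y^t_{l's}:

```python
import numpy as np

BLANK = 0

def augment(l):
    """Insert blanks between labels and at both ends: l -> l'."""
    lp = [BLANK]
    for k in l:
        lp += [k, BLANK]
    return lp

def forward_backward(y, l):
    """alpha[t, s]: total probability of all prefixes pi_{1:t} aligned to l'_{1:s};
    beta[t, s]  : total probability of all suffixes pi_{t:T} aligned to l'_{s:}."""
    T = y.shape[0]
    lp = augment(l)
    S = len(lp)

    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, lp[0]]
    if S > 1:
        alpha[0, 1] = y[0, lp[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s >= 1:
                a += alpha[t - 1, s - 1]
            if s >= 2 and lp[s] != BLANK and lp[s] != lp[s - 2]:
                a += alpha[t - 1, s - 2]   # skip the blank between two different labels
            alpha[t, s] = a * y[t, lp[s]]

    beta = np.zeros((T, S))
    beta[T - 1, S - 1] = y[T - 1, lp[S - 1]]
    if S > 1:
        beta[T - 1, S - 2] = y[T - 1, lp[S - 2]]
    for t in range(T - 2, -1, -1):
        for s in range(S):
            b = beta[t + 1, s]
            if s + 1 < S:
                b += beta[t + 1, s + 1]
            if s + 2 < S and lp[s] != BLANK and lp[s] != lp[s + 2]:
                b += beta[t + 1, s + 2]
            beta[t, s] = b * y[t, lp[s]]

    return alpha, beta, lp

y = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.7, 0.1],
              [0.5, 0.2, 0.3]])
alpha, beta, lp = forward_backward(y, [1, 2])   # labelling "ab"
T, S = alpha.shape

# p(l | x) read off the end of the forward pass ...
print(alpha[T - 1, S - 1] + alpha[T - 1, S - 2])            # 0.231

# ... and the same value recovered at every t from alpha and beta.
for t in range(T):
    print(sum(alpha[t, s] * beta[t, s] / y[t, lp[s]] for s in range(S)))
```

The same example evaluated with the brute-force sum above also gives 0.231, which is a convenient sanity check for the recursion.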
The loss function, the negative log-likelihood −ln p(l | x), and its gradient with respect to the network outputs can be derived from the forward and backward variables above, and the network is optimized with stochastic gradient descent (SGD).
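A sketch of one SGD step under these assumptions, using PyTorch's built-in nn.CTCLoss; the model architecture, sizes, and data below are placeholders rather than anything prescribed by the source:

```python
import torch
import torch.nn as nn

# Illustrative sizes: T frames of m-dimensional features, n = (n - 1) classes + blank.
T, batch, m, hidden, n = 50, 4, 13, 32, 28

class FrameClassifier(nn.Module):
    """Hypothetical bidirectional LSTM with a per-frame softmax output layer."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(m, hidden, bidirectional=True)   # expects (T, batch, m)
        self.proj = nn.Linear(2 * hidden, n)

    def forward(self, x):
        h, _ = self.rnn(x)
        return self.proj(h).log_softmax(dim=2)              # (T, batch, n) log-probs

model = FrameClassifier()
criterion = nn.CTCLoss(blank=0)                             # loss = -log p(l | x)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# Dummy batch standing in for real features and transcriptions.
x = torch.randn(T, batch, m)
targets = torch.randint(1, n, (batch, 10))                  # label sequences of length 10
input_lengths = torch.full((batch,), T, dtype=torch.long)
target_lengths = torch.full((batch,), 10, dtype=torch.long)

# One SGD step; the gradient of -log p(l | x) with respect to the network
# outputs is computed internally from the forward and backward variables.
optimizer.zero_grad()
loss = criterion(model(x), targets, input_lengths, target_lengths)
loss.backward()
optimizer.step()
print(loss.item())
```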
Decoding after CTC output:
- Best path decoding
  - take the most probable output at every time step and collapse the resulting path with ß; simple and fast, but not guaranteed to find the most probable labelling (see the sketch after this list)
- Prefix-search decoding
  - not recommended due to its worst-case runtime
- Viterbi decoding
  - Recommended. It produces the most likely sequence of hidden states (as used for HMMs).
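A minimal numpy sketch of best path decoding (same illustrative encoding as earlier, blank = 0); as noted above, the greedy per-frame path only approximates the most probable labelling:

```python
import numpy as np
from itertools import groupby

BLANK = 0

def best_path_decode(y):
    """Pick the most probable label at every frame, then collapse with ß."""
    path = np.argmax(y, axis=1)                  # greedy per-frame argmax
    merged = [int(k) for k, _ in groupby(path)]  # merge repeats
    return [k for k in merged if k != BLANK]     # drop blanks

y = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.7, 0.1],
              [0.5, 0.2, 0.3]])
print(best_path_decode(y))   # argmax path [0, 1, 0] collapses to [1], i.e. "a"
```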