Attention Mechanism

Attention Definition

RNN Encoder-Decoder

input: $x_1, x_2, \dots, x_{T_x}$
output: $y_1, y_2, \dots, y_{T_y}$
encoder hidden states: $h_t = f(x_t, h_{t-1})$, giving $h_1, h_2, \dots, h_{T_x}$
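As a concrete reference point, the encoder recurrence $h_t = f(x_t, h_{t-1})$ can be sketched with a vanilla RNN cell in NumPy. The weight names (`W_xh`, `W_hh`, `b_h`) are illustrative, not taken from any of the cited papers.

```python
import numpy as np

def rnn_encoder(x_seq, W_xh, W_hh, b_h):
    """Run a vanilla RNN over the input sequence and return all hidden states.

    Implements h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h), with h_0 = 0."""
    hidden = np.zeros(W_hh.shape[0])
    states = []
    for x_t in x_seq:
        hidden = np.tanh(W_xh @ x_t + W_hh @ hidden + b_h)
        states.append(hidden)
    return np.stack(states)  # shape: (T_x, hidden_dim)
```

A GRU or LSTM cell replaces the `tanh` update in practice; the interface (sequence in, one state per position out) is the same.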

Bahdanau et al. (ICLR, 2015) [1]


encoder: bidirectional GRU; the forward and backward hidden states are concatenated into annotations $h_j = [\overrightarrow{h}_j; \overleftarrow{h}_j]$

decoder: alignment model

align: scores $e_{ij} = a(s_{i-1}, h_j)$, weights $\alpha_{ij} = \exp(e_{ij}) / \sum_{k=1}^{T_x} \exp(e_{ik})$, context $c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$

output: $s_i = f(s_{i-1}, y_{i-1}, c_i)$, and $y_i$ is predicted from $g(y_{i-1}, s_i, c_i)$
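The alignment model above can be sketched as additive (MLP) attention in NumPy. `W_a`, `U_a`, and `v_a` are illustrative names for the learned alignment parameters; this is a sketch of the scoring and weighting step, not the full decoder.

```python
import numpy as np

def additive_attention(s_prev, enc_states, W_a, U_a, v_a):
    """Bahdanau-style additive attention.

    Scores: e_j = v_a^T tanh(W_a s_prev + U_a h_j); weights via softmax;
    returns the context vector c = sum_j alpha_j h_j and the weights."""
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j)
                       for h_j in enc_states])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()          # softmax over source positions
    context = weights @ enc_states    # convex combination of encoder states
    return context, weights
```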

Luong et al. (EMNLP, 2015) [2]


align: $\mathrm{score}(s_i, h_j)$ computed as dot ($s_i^\top h_j$), general ($s_i^\top W_a h_j$), or concat

output: attentional state $\tilde{h}_i = \tanh(W_c [c_i; s_i])$, then $y_i$ from $\mathrm{softmax}(W_s \tilde{h}_i)$

global: attend over all source positions

local: attend over a window centered on a predicted aligned position $p_i$
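A minimal sketch of Luong's global attention with the `dot` and `general` scoring functions. `W_a` follows the paper's notation; the function itself is illustrative.

```python
import numpy as np

def luong_global_attention(s_t, enc_states, mode="dot", W_a=None):
    """Global attention: score every encoder state h_j against decoder state s_t.

    "dot":     score = s_t . h_j        (states must share dimensionality)
    "general": score = s_t^T W_a h_j    (learned bilinear form)"""
    if mode == "dot":
        scores = enc_states @ s_t
    elif mode == "general":
        scores = (enc_states @ W_a.T) @ s_t  # h_j^T W_a^T s_t == s_t^T W_a h_j
    else:
        raise ValueError(mode)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()          # softmax over source positions
    context = weights @ enc_states
    return context, weights
```

Local attention would additionally mask or reweight `weights` to a window around a predicted position $p_i$.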

CNN-RNN [4]

Text generation: a CNN encodes the image into a grid of feature vectors, and an RNN decoder attends over them while generating the caption

soft attention: differentiable weighted average over all feature locations, trained end-to-end with backpropagation

hard attention
one-hot attention weights: a single location is sampled per step, so training uses a variational lower bound with REINFORCE-style gradients
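The contrast can be sketched directly: soft attention returns the expected feature under the softmax distribution, while hard attention samples one location, which is equivalent to applying a one-hot weight vector. Names are illustrative.

```python
import numpy as np

def soft_attend(logits, features):
    """Soft attention: differentiable expected feature under softmax weights."""
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ features               # weighted average of all locations

def hard_attend(logits, features, rng):
    """Hard attention: sample ONE location; equivalent to a one-hot weight vector."""
    w = np.exp(logits - logits.max())
    w /= w.sum()
    idx = rng.choice(len(w), p=w)     # stochastic, not differentiable
    return features[idx]
```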

Transformer (Vaswani et al., 2017) [3]


Scaled Dot-Product Attention: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(QK^\top / \sqrt{d_k}\right) V$
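The formula translates almost line-for-line into NumPy; the $\sqrt{d_k}$ scaling keeps the dot products from pushing the softmax into saturated regions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (T_q, T_k) similarity matrix
    return softmax(scores) @ V        # each output row: convex combo of V rows
```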

Multi-Head Attention: $\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$, $\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O$
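A compact self-attention sketch of the multi-head computation: project the input once, split the projections into heads, attend per head, concatenate, and apply the output projection. Using one combined projection matrix per role (rather than separate $W_i$ per head) is a common equivalent formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Multi-head self-attention over X of shape (T, d_model).

    All weight matrices are (d_model, d_model); head dim = d_model // n_heads."""
    T, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for i in range(n_heads):
        sl = slice(i * d_head, (i + 1) * d_head)       # this head's channels
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, sl])
    return np.concatenate(heads, axis=-1) @ W_o        # concat + output proj
```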

LayerNorm [8]: normalize each position's features to zero mean and unit variance, then apply a learned scale and shift; used around each sublayer as $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$
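Layer normalization in a few lines; unlike batch norm, the statistics are computed per example over the feature dimension, so it needs no batch statistics at inference time.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the last (feature) axis, then apply learned scale/shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```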

Feed Forward: position-wise two-layer network, $\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$
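The position-wise FFN is just two affine maps with a ReLU in between, applied identically at every sequence position:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied per position (row of x)."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```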

Positional Encoding: sinusoidal, $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{\mathrm{model}}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{\mathrm{model}}})$

Subword input units: byte-pair encoding (Sennrich et al., 2016) [6]
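The sinusoidal encoding fills even feature dimensions with sines and odd ones with cosines at geometrically spaced frequencies (sketch assumes an even `d_model`):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings, shape (max_len, d_model); d_model even."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dims
    pe[:, 1::2] = np.cos(angles)                  # odd dims
    return pe
```

The encoding is added to the token embeddings, since self-attention alone is order-invariant.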

Universal Transformer [5]


recurrence in depth: the same self-attention + transition block is applied repeatedly across steps, with timestep and position embeddings added at each step

output: the number of refinement steps can be fixed, or learned per position with Adaptive Computation Time (ACT) halting
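The depth recurrence can be sketched as applying one shared block $T$ times, in contrast to a standard Transformer's stack of distinct layers. This sketch omits residual connections, layer norm, projections, and timestep embeddings for brevity; `W_t` is an illustrative transition weight.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shared_block(H, W_t):
    """One shared step: unprojected self-attention, then a tanh transition."""
    d = H.shape[-1]
    A = softmax(H @ H.T / np.sqrt(d)) @ H
    return np.tanh(A @ W_t)

def universal_transformer(H, W_t, n_steps):
    """Apply the SAME block n_steps times (recurrence in depth)."""
    for _ in range(n_steps):
        H = shared_block(H, W_t)
    return H
```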

Reference

[1] Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015.
[2] Luong et al., "Effective Approaches to Attention-based Neural Machine Translation", EMNLP 2015.
[3] Vaswani et al., "Attention Is All You Need", NeurIPS 2017.
[4] Xu et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015.
[5] Dehghani et al., "Universal Transformers", ICLR 2019.
[6] Sennrich et al., "Neural Machine Translation of Rare Words with Subword Units", ACL 2016.
[7] Gehring et al., "Convolutional Sequence to Sequence Learning", ICML 2017.
[8] Ba et al., "Layer Normalization", arXiv 2016.