Paper Notes

I recently read two famous 2017 papers: Facebook's "Convolutional Sequence to Sequence Learning" and Google's "Attention is All You Need". They changed my attitude and actually got me to enjoy reading papers, which should tell you how good they are. At their core, both abandon RNNs for the Seq2Seq machine translation task. This post is a walkthrough of "Convolutional Sequence to Sequence Learning".

Intended readers: people who are already interested in this paper but still have parts they don't fully understand.

Paper Summary

  • Drops the RNN and uses only CNNs for the Seq2Seq (machine translation) task, though the paper is full of CNN-vs-RNN comparisons.
  • An RNN is a chain structure and cannot be trained in parallel; a CNN (hierarchical structure) can, which also greatly reduces computational complexity.

The paper's table of contents is as follows:

1. Introduction
2. Recurrent Sequence to Sequence Learning
3. A Convolutional Architecture
|- 3.1 Position Embedding
|- 3.2 Convolutional Block Structure
|- 3.3 Multi-step Attention
|- 3.4 Normalization Strategy
|- 3.5 Initialization
4. Experimental Setup
|- 4.1 Datasets
|- 4.2 Model Parameters and Optimization
|- 4.3 Evaluation
5. Results
|- 5.1 Recurrent vs. Convolutional Models
|- 5.2 Ensemble Results
|- 5.3 Generation Speed
|- 5.4 Position Embedding
|- 5.5 Multi-step Attention
|- 5.6 Kernel size and Depth
|- 5.7 Summarization
6. Conclusion and Future Work

(This post mainly expands on Section 3 in detail. The format is a walkthrough: quote a sentence from the paper, then explain how I understand it, keeping only the sentences I think matter. It's best read with some prior familiarity with the paper, or with the paper open on one side and this post on the other; every quoted passage below is taken verbatim from the paper. I've read plenty of papers before and usually just scribbled notes on paper; this is my first written set of paper notes.)

1. Introduction

CNN vs RNN

  1. Chain vs. hierarchical structure, and parallel computation
  2. Context dependence
  3. Fixed-length inputs and outputs
  4. Computational complexity

An RNN is a chain structure; a CNN is a hierarchical structure.

Compared to recurrent layers, convolutions create representations for fixed size contexts, however, the effective context size of the network can easily be made larger by stacking several layers on top of each other. This allows to precisely control the maximum length of dependencies to be modeled.

  • A single convolution only covers a fixed-size context, but stacking several layers on top of each other easily enlarges the effective context so longer text can be handled (see the sketch after this list).
  • This structure makes it possible to precisely control the maximum length of the dependencies being modeled.
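
A minimal sketch (my own, not from the paper) of how the effective context grows when stride-1 convolutions of kernel width $k$ are stacked on top of each other:

```python
# Rough sketch: effective context of stacked stride-1 convolutions
# with kernel width k (not code from the paper).

def effective_context(num_layers: int, kernel_width: int) -> int:
    """Input positions visible to one output position after stacking
    `num_layers` convolutions of width `kernel_width` (stride 1)."""
    # Each additional layer widens the window by (kernel_width - 1).
    return num_layers * (kernel_width - 1) + 1

for layers in (1, 2, 4, 8):
    print(layers, effective_context(layers, kernel_width=5))
# 1 -> 5, 2 -> 9, 4 -> 17, 8 -> 33: each layer has a fixed context,
# but stacking quickly covers longer text.
```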

Convolutional networks do not depend on the computations of the previous time step and therefore allow parallelization over every element in a sequence. This contrasts with RNNs which maintain a hidden state of the entire past that prevents parallel computation within a sequence.

  • Every word in the sentence can be computed in parallel, since a convolution does not depend on the previous time step's computation. This is the opposite of an RNN, whose hidden state over the entire past prevents parallelization within a sequence.

Multi-layer convolutional neural networks create hierarchical representations over the input sequence in which nearby input elements interact at lower layers while distant elements interact at higher layers.

To capture more information, a multi-layer CNN is used: lower layers handle words that are close together in the sentence, while higher layers cover words that are farther apart.

Hierarchical structure provides a shorter path to capture long-range dependencies compared to the chain structure modeled by recurrent networks, e.g. we can obtain a feature representation capturing relationships within a window of n words by applying only O(n/k) convolutional operations for kernels of width k, compared to a linear number O(n) for recurrent neural networks.

This hierarchical structure shortens the path for capturing long-range dependencies: roughly $O(n/k)$ convolution operations for kernels of width $k$, versus a linear $O(n)$ for an RNN.
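
A back-of-the-envelope example with made-up numbers (mine, not the paper's): with kernel width $k$ and stride 1, $L$ stacked convolutions cover a window of $L(k-1)+1$ words, so relating words that are $n$ positions apart needs about

$$L \approx \frac{n-1}{k-1} = O(n/k)$$

layers. For $n = 25$ and $k = 5$ that is roughly $24/4 = 6$ stacked convolutions, versus $25$ sequential steps for an RNN.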

Inputs to a convolutional network are fed through a constant number of kernels and non-linearities, whereas recurrent networks apply up to n operations and non-linearities to the first word and only a single set of operations to the last word.

CNN: every input passes through a constant number of kernels and non-linearities.

RNN: the amount of computation depends on the position; the first word goes through up to $n$ operations and non-linearities, while the last word receives only a single set of operations.

In this paper we propose an architecture for sequence to sequence modeling that is entirely convolutional. Our model is equipped with gated linear units (Dauphin et al., 2016) and residual connections (He et al., 2015a). We also use attention in every decoder layer and demonstrate that each attention layer only adds a negligible amount of overhead.

  • Entirely convolutional.
  • Along the way: gated linear units (GLU), residual connections, and attention in every decoder layer (a rough block sketch follows this list).
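
A minimal PyTorch-style sketch of one convolutional block with a GLU and a residual connection, just to make the ingredients concrete. This is my own simplification (the `ConvGLUBlock` name, dimensions, and padding are assumptions); the real model also handles attention, causal padding in the decoder, weight normalization, and scaling:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvGLUBlock(nn.Module):
    """One convolutional block in the spirit of the paper: a 1-D convolution
    producing 2*d channels, a GLU gate, and a residual connection.
    Simplified sketch, not the authors' exact implementation."""

    def __init__(self, d_model: int, kernel_width: int):
        super().__init__()
        # 2*d_model output channels: one half is values, the other half gates (GLU).
        self.conv = nn.Conv1d(d_model, 2 * d_model, kernel_width,
                              padding=kernel_width // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model, seq_len)
        residual = x
        h = self.conv(x)
        h = F.glu(h, dim=1)   # GLU: A * sigmoid(B) over the two channel halves
        return h + residual   # residual connection

# Stacking several such blocks gives the hierarchical context described above.
block = ConvGLUBlock(d_model=256, kernel_width=3)
out = block(torch.randn(2, 256, 10))   # -> shape (2, 256, 10)
```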

2. Recurrent Sequence to Sequence Learning

A recap of the RNN seq2seq computation; I'll go over it very briefly since I've covered it separately before.

  • Input Sequence: $x = (x_1,…,x_m)$
  • Encoder Embedding: $w = (w_1,…,w_m)$
  • State Representation: $z = (z_1,…,z_m)$

================================================= [Encoder]

  • Conditional Input: $c = (c_1,…,c_i,…)$

================================================= [Decoder]

  • Hidden State: $h = (h_1,…,h_n)$
  • Decoder Embedding: $g = (g_1,…,g_n)$
  • Output Sequence: $y = (y_1,…,y_n)$

  1. Because this is an encoder-decoder structure, the computations mirror each other on the encoder (top) and decoder (bottom) sides of the list above.

    • $w$ and $g$ are the embeddings of the input sequence and the output sequence, respectively (a toy sketch mapping these symbols onto code follows below).
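
To tie the notation together, here is a toy, self-contained sketch of the recurrent seq2seq loop (my own GRU-based simplification with dot-product attention; the name `ToySeq2Seq` and all dimensions are made up, and this is not the paper's model):

```python
import torch
import torch.nn as nn

# Toy sketch (my own code, not the paper's model) that maps the notation above
# onto a GRU-based encoder-decoder with dot-product attention:
#   x -> w (encoder embeddings) -> z (state representations),
#   g (decoder embeddings), c (conditional input), h (decoder states), y (outputs).
class ToySeq2Seq(nn.Module):
    def __init__(self, vocab: int = 100, d: int = 32):
        super().__init__()
        self.embed_src = nn.Embedding(vocab, d)      # x -> w
        self.embed_tgt = nn.Embedding(vocab, d)      # previous y -> g
        self.encoder = nn.GRU(d, d, batch_first=True)
        self.decoder = nn.GRUCell(2 * d, d)          # input: [g_i ; c_i]
        self.readout = nn.Linear(d, vocab)

    def forward(self, x: torch.Tensor, y_prev: torch.Tensor) -> torch.Tensor:
        w = self.embed_src(x)                        # (B, m, d)
        z, _ = self.encoder(w)                       # (B, m, d)
        h = torch.zeros(x.size(0), z.size(-1))       # h_0
        logits = []
        for i in range(y_prev.size(1)):
            g = self.embed_tgt(y_prev[:, i])                      # g_i
            scores = torch.bmm(z, h.unsqueeze(-1)).squeeze(-1)    # attention scores over z
            c = (z * torch.softmax(scores, -1).unsqueeze(-1)).sum(1)  # conditional input c_i
            h = self.decoder(torch.cat([g, c], dim=-1), h)        # h_{i+1} from (h_i, g_i, c_i)
            logits.append(self.readout(h))                        # distribution over y_{i+1}
        return torch.stack(logits, dim=1)

model = ToySeq2Seq()
out = model(torch.randint(0, 100, (2, 7)), torch.randint(0, 100, (2, 5)))
# out.shape == (2, 5, 100)
```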

3. A Convolutional Architecture

The main content only really starts here.

I'll use diagrams wherever possible.