[Deep Learning NLP Paper Notes] "Interpretable Adversarial Perturbation in Input Embedding Space for Text"

Abstract

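In computer vision, adversarial training is commonly used to generate perturbations and make models more robust, but applying it directly in the word embedding space loses interpretability. The method proposed in this paper restricts the direction of the perturbation applied to each word in the embedding space, which preserves interpretability.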

1 Introduction

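For adversarial examples in the image domain, Goodfellow et al. proposed adversarial training (AdvT). The main idea is to train the model on both the clean original examples and the adversarial examples so that it classifies both correctly. Adversarial training improves the model's generalization, which suggests that the loss on adversarial examples acts as a good regularizer during training.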

[Miyato et al., 2017] proposed that the key to applying this adversarial training to NLP is to work in the continuous embedding space rather than in the discrete input space of texts. With this approach, the adversarial perturbations can be obtained just by computing the gradient of the loss function.

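This approach has a drawback, however: a perturbation in the embedding space cannot be mapped back to real text, so it lacks interpretability.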

The main idea we propose is to constrain only the direction of the perturbation applied to each word in the word embedding space. Figure 1 illustrates the method intuitively.

[Figure 1]

2 Related Work

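[Jia and Liang, 2017] fool reading comprehension systems by appending sentences, written via crowdsourcing, to the end of the passage
[Belinkov and Bisk, 2018; Hosseini et al., 2017] propose AdvT methods that randomly generate character-level swaps
[Samanta and Mehta, 2017] replace words with synonyms to generate a large number of adversarial texts

So far, the methods for generating adversarial examples in NLP are very different from those in CV:

[Jia and Liang, 2017] create adversarial texts by hand
[Samanta and Mehta, 2017] rely on a dictionary

Our baseline method is based on [Miyato et al., 2017], "Adversarial training methods for semi-supervised text classification", ICLR 2017.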

3 Target Tasks and Baseline Models

Figure 2 shows the architecture of the baseline neural network:

[Figure 2]

3.1 Common Notation

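$X$ denotes the input sentence.
$\mathcal{V}$ denotes the vocabulary of input words, and $x^{(t)} \in \mathcal{V}$ denotes the $t$-th word of a given input sentence $X$.
If the sentence $X$ has $T$ words, it can be written as $X = (x^{(1)}, \dots, x^{(T)})$, abbreviated $(x^{(t)})_{t=1}^{T}$.
$\mathcal{Y}$ denotes the set of output labels. The output $Y$ is assumed to be a sequence of labels: $Y = (y^{(t)})_{t=1}^{T}$, $y^{(t)} \in \mathcal{Y}$.
$w^{(t)}$ denotes the word embedding vector of $x^{(t)}$. The vector has dimension $D$, so $w^{(t)} \in \mathbb{R}^{D}$.
The sequence of word embedding vectors for $X$ (the input sentence) can therefore be written as $\tilde{X} = (w^{(t)})_{t=1}^{T}$, where $w^{(t)}$ is the embedding of $x^{(t)}$.
For each $y \in \mathcal{Y}$, $\tilde{y}$ denotes its ID, which generally ranges from 1 to $|\mathcal{Y}|$, i.e. $\tilde{y} \in \{1, \dots, |\mathcal{Y}|\}$.
$\tilde{y}^{(t)}$ then denotes one element of the label-ID sequence, and $\tilde{Y} = (\tilde{y}^{(t)})_{t=1}^{T}$ denotes a whole label-ID sequence.
The training set $\mathcal{D}$ consists of pairs of a word embedding sequence and a label-ID sequence, i.e. $\mathcal{D} = \{(\tilde{X}^{(n)}, \tilde{Y}^{(n)})\}_{n=1}^{N}$, where $N$ is the size of the training set.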

3.2 Baseline model for text classification

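We encode the input $\tilde{X}$ with an RNN that uses LSTM units. The (forward) LSTM unit computes the hidden state at step $t$ as $h^{(t)} = \mathrm{LSTM}(w^{(t)}, h^{(t-1)})$ (Eq. 1), where $h^{(0)}$ is assumed to be the zero vector. Given the input $\tilde{X}$, the conditional probability of the output $\tilde{Y}$ (here a single label ID $\tilde{y}$) is:

$$p(\tilde{Y} \mid \tilde{X}, W) = \frac{\exp(o_{\tilde{y}})}{\sum_{m=1}^{|\mathcal{Y}|} \exp(o_m)} \tag{2}$$

where $o$ is computed from the $T$-th hidden state by a standard feed-forward network, i.e. $o = \mathrm{NN}(h^{(T)})$. Its dimension is $|\mathcal{Y}|$, and $o_m$ denotes the $m$-th component of $o$.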
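The softmax step that turns the output vector into class probabilities can be sketched as follows. This is only an illustration: the function name `softmax` and the three scores in `o` are made up for the example, and the code uses 0-based class indices where the notes use IDs starting at 1.

```python
import math

def softmax(logits):
    """Turn a list of scores o_1..o_|Y| into class probabilities."""
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(o - top) for o in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output vector o for a 3-class problem.
o = [2.0, 0.5, -1.0]
p = softmax(o)
print(p)  # the probabilities sum to 1 and keep the ordering of the scores
```

Subtracting the largest score before exponentiating does not change the result but avoids overflow for large scores.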

3.3 Baseline model for sequence labeling

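We encode $\tilde{X}$ with a bi-directional LSTM. The hidden state at step $t$ combines the forward and backward LSTMs: $h^{(t)} = \mathrm{concat}(\overrightarrow{h}^{(t)}, \overleftarrow{h}^{(t)})$ (Eq. 3), where $\overrightarrow{h}^{(t)} = \mathrm{LSTM}(w^{(t)}, \overrightarrow{h}^{(t-1)})$ and $\overleftarrow{h}^{(t)} = \mathrm{LSTM}(w^{(t)}, \overleftarrow{h}^{(t+1)})$. We assume that $\overrightarrow{h}^{(0)}$ and $\overleftarrow{h}^{(T+1)}$ are zero vectors. We also assume that the probability factorizes over the steps $t$, so it can be computed as:

$$p(\tilde{Y} \mid \tilde{X}, W) = \prod_{t=1}^{T} p(\tilde{y}^{(t)} \mid \tilde{X}, W) \tag{4}$$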

3.4 Training

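Training solves the following optimization problem:

$$\hat{W} = \operatorname*{argmin}_{W} J(\mathcal{D}, W) \tag{5}$$

where $W$ denotes all parameters of the RNN model, $J(\mathcal{D}, W)$ is the loss over the whole training set $\mathcal{D}$, and $\ell(\tilde{X}, \tilde{Y}, W)$ is the loss of a single training example. So we have:

$$J(\mathcal{D}, W) = \frac{1}{|\mathcal{D}|} \sum_{(\tilde{X}, \tilde{Y}) \in \mathcal{D}} \ell(\tilde{X}, \tilde{Y}, W), \qquad \ell(\tilde{X}, \tilde{Y}, W) = -\log p(\tilde{Y} \mid \tilde{X}, W) \tag{6}$$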
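As a small worked example of the training loss, the sketch below averages the per-example negative log-likelihood over a toy dataset. The helper names `example_loss` and `dataset_loss` and the probabilities in `batch` are invented for illustration, and the model is replaced by fixed predicted distributions. Labels are 0-based here.

```python
import math

def example_loss(probs, gold):
    """Per-example loss: -log p(gold | X, W), given the predicted distribution."""
    return -math.log(probs[gold])

def dataset_loss(batch):
    """Dataset loss J: the average of the per-example losses."""
    return sum(example_loss(probs, gold) for probs, gold in batch) / len(batch)

# Hypothetical (predicted distribution, gold label) pairs.
batch = [
    ([0.7, 0.2, 0.1], 0),
    ([0.1, 0.8, 0.1], 1),
    ([0.3, 0.3, 0.4], 2),
]
J = dataset_loss(batch)
print(J)
```

The more probability the model puts on the gold label, the closer each term gets to zero.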

4 Adversarial Training in Embedding Space

Adversarial training (AdvT) is a regularization method that improves the robustness of classification against adversarial examples. [Miyato et al., 2017] were the first to apply it to text, and called it AdvT-Text.

Let $w^{(t)}$ be the embedding of the $t$-th word of the input $\tilde{X}$ and $r^{(t)}$ its perturbation vector, which is assumed to be $D$-dimensional. Figure 3 shows the architecture of AdvT-Text:

[Figure 3]

It can be compared with the baseline network in Figure 2.

Let $r$ denote the concatenation of the vectors $r^{(t)}$ over all $t$. $\tilde{X}_{+r} = (w^{(t)} + r^{(t)})_{t=1}^{T}$ denotes the input after the perturbation is added.

To obtain the most effective (worst-case) perturbation $r_{\mathrm{AdvT}}$, we maximize the negative log-likelihood, which is equivalent to minimizing the log-likelihood:

$$r_{\mathrm{AdvT}} = \operatorname*{argmax}_{r,\ \|r\| \le \epsilon} \ell(\tilde{X}_{+r}, \tilde{Y}, W) \tag{7}$$

where $\epsilon$ is a tunable hyperparameter that controls the norm of the perturbation. The AdvT-Text loss can then be defined as:

$$J_{\mathrm{AdvT}}(\mathcal{D}, W) = \frac{1}{|\mathcal{D}|} \sum_{(\tilde{X}, \tilde{Y}) \in \mathcal{D}} \ell(\tilde{X}_{+r_{\mathrm{AdvT}}}, \tilde{Y}, W) \tag{8}$$

where $r_{\mathrm{AdvT}}$ is the perturbation obtained from Eq. 7.

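[Personal understanding]:

Put simply, there are two levels.

Level 1, find the optimal perturbation for each word embedding vector: many perturbation vectors are possible, so Eq. 7 is used first to find the one that maximizes the loss (Eq. 6). This is called the optimal perturbation.
Level 2, define the loss of the whole model: add up (and average) the per-example losses obtained after the optimal perturbation has been added to each word embedding vector, and use the result as the loss of the whole model.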

In general, for a complex network it is infeasible to solve Eq. 7 exactly for the optimal perturbation. [Goodfellow et al., 2015] therefore proposed linearizing $\ell(\tilde{X}, \tilde{Y}, W)$ around $\tilde{X}$ as an approximation. For our RNN model, the optimal perturbation at step $t$ can be computed with the following non-iterative solution:

$$r^{(t)}_{\mathrm{AdvT}} = \epsilon \frac{g^{(t)}}{\|g\|_2}, \qquad g^{(t)} = \nabla_{w^{(t)}} \ell(\tilde{X}, \tilde{Y}, W) \tag{9}$$

where $g$ is the concatenation of $g^{(t)}$ over all $t$.
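The non-iterative solution amounts to rescaling the gradients so that the concatenated perturbation has norm $\epsilon$. Below is a minimal sketch that assumes the gradients with respect to each word embedding are already available (in practice they would come from backpropagation). The function name `advt_perturbation` and the toy gradients are hypothetical.

```python
import math

def advt_perturbation(grads, epsilon):
    """grads[t] plays the role of g^(t), the loss gradient w.r.t. the t-th word embedding.

    Returns one perturbation per word, r^(t) = epsilon * g^(t) / ||g||_2, where the
    norm is taken over the concatenation of all g^(t).
    """
    norm = math.sqrt(sum(x * x for g in grads for x in g))
    if norm == 0.0:
        # Zero gradient: no direction increases the linearized loss.
        return [[0.0] * len(g) for g in grads]
    return [[epsilon * x / norm for x in g] for g in grads]

# Two words with 2-dimensional embeddings; the concatenated gradient has norm 5.
grads = [[3.0, 0.0], [0.0, 4.0]]
r = advt_perturbation(grads, epsilon=5.0)
print(r)
```

Because the norm is shared across positions, words with larger gradients receive a larger share of the perturbation budget.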

Finally, we jointly minimize the objectives $J(\mathcal{D}, W)$ and $J_{\mathrm{AdvT}}(\mathcal{D}, W)$:

$$\hat{W} = \operatorname*{argmin}_{W} \left\{ J(\mathcal{D}, W) + \lambda J_{\mathrm{AdvT}}(\mathcal{D}, W) \right\} \tag{10}$$

where $\lambda$ is the coefficient that balances the two losses.
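To see the combined objective on a single example, here is a toy sketch. It does NOT use the LSTM from the notes: as a stand-in (an assumption made purely so the gradient can be written by hand) the classifier is a linear softmax layer over the mean of the word embeddings. All names and numbers are hypothetical.

```python
import math

def loss_and_grads(embs, W_out, gold):
    """Toy model (assumption): logits = W_out @ mean(embs), loss = -log softmax(logits)[gold]."""
    T, D = len(embs), len(embs[0])
    mean = [sum(e[d] for e in embs) / T for d in range(D)]
    logits = [sum(w * m for w, m in zip(row, mean)) for row in W_out]
    top = max(logits)
    exps = [math.exp(o - top) for o in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    loss = -math.log(probs[gold])
    # d loss / d logits = probs - onehot(gold); the mean spreads it equally over the T words.
    dlogits = [p - (1.0 if c == gold else 0.0) for c, p in enumerate(probs)]
    g_t = [sum(dl * row[d] for dl, row in zip(dlogits, W_out)) / T for d in range(D)]
    return loss, [list(g_t) for _ in range(T)]

def advt_objective(embs, W_out, gold, epsilon, lam):
    """Clean loss, adversarial loss, and clean + lam * adversarial for one example."""
    clean, grads = loss_and_grads(embs, W_out, gold)
    norm = math.sqrt(sum(x * x for g in grads for x in g))
    scale = epsilon / norm if norm > 0.0 else 0.0
    perturbed = [[w + scale * x for w, x in zip(e, g)] for e, g in zip(embs, grads)]
    adv, _ = loss_and_grads(perturbed, W_out, gold)
    return clean, adv, clean + lam * adv

embs = [[0.5, -0.2], [0.1, 0.3]]
W_out = [[1.0, -1.0], [-0.5, 0.8]]
clean, adv, total = advt_objective(embs, W_out, gold=0, epsilon=1.0, lam=1.0)
print(clean, adv, total)
```

For this toy model the loss is convex in the embeddings, so stepping along the gradient can only raise it. With a real network the linearization is just an approximation.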

5 Interpretable Adversarial Perturbation

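Our approach instead constrains the direction of the perturbation vector, so that adding the perturbation to a word's embedding moves it toward the embedding of another word. The perturbation can therefore be mapped back to a word that actually exists. We call this interpretable version of AdvT-Text iAdvT-Text.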

5.1 Definition of Interpretable AdvT-Text

Let $w_k$ be the word embedding vector of the $k$-th word in the vocabulary. We define the direction vector from $w^{(t)}$ (the embedding of the $t$-th word of the input sentence) to $w_k$ as:

$$d_k^{(t)} = \frac{w_k - w^{(t)}}{\|w_k - w^{(t)}\|_2} \tag{11}$$

$d_k^{(t)}$ is a unit vector for all $k$ and $t$, i.e. $\|d_k^{(t)}\|_2 = 1$. The one exception is when the $t$-th word of the input sentence is the $k$-th word of the vocabulary: then $w_k = w^{(t)}$ and $d_k^{(t)}$ becomes the zero vector.
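A minimal sketch of the direction vectors follows, with the zero-vector case handled explicitly. The function name `direction_vectors` and the tiny 2-dimensional vocabulary are made up for the example.

```python
import math

def direction_vectors(w_t, vocab_embs):
    """Unit vectors pointing from w_t to every vocabulary embedding w_k.

    The row for a word whose embedding equals w_t is the zero vector.
    """
    dirs = []
    for w_k in vocab_embs:
        diff = [a - b for a, b in zip(w_k, w_t)]
        norm = math.sqrt(sum(x * x for x in diff))
        dirs.append([x / norm for x in diff] if norm > 0.0 else [0.0] * len(diff))
    return dirs

# Hypothetical 3-word vocabulary; the input word is vocabulary entry 0.
vocab_embs = [[0.0, 0.0], [3.0, 4.0], [0.0, 2.0]]
d = direction_vectors(vocab_embs[0], vocab_embs)
print(d)
```

Normalizing discards the distance to each word, so only the direction toward it matters.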

Let $\alpha^{(t)}$ be a $|\mathcal{V}|$-dimensional vector and $\alpha_k^{(t)}$ its $k$-th component, so that $\alpha^{(t)} = (\alpha_k^{(t)})_{k=1}^{|\mathcal{V}|}$. Define $r(\alpha^{(t)})$ as the perturbation generated for the $t$-th word of $\tilde{X}$:

$$r(\alpha^{(t)}) = \sum_{k=1}^{|\mathcal{V}|} \alpha_k^{(t)} d_k^{(t)} \tag{12}$$

$\alpha_k^{(t)}$ is the weight of the direction from the $t$-th word of the input to the $k$-th word of the vocabulary. In the same way as $\tilde{X}_{+r}$, we can define $\tilde{X}_{+r(\alpha)}$ as:

$$\tilde{X}_{+r(\alpha)} = \left(w^{(t)} + r(\alpha^{(t)})\right)_{t=1}^{T} \tag{13}$$
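The weighted sum of directions and the perturbed sentence can be sketched as follows. The names `perturbation` and `perturb_sentence` and all values are hypothetical. The directions are axis-aligned unit vectors so the arithmetic is easy to follow by hand.

```python
def perturbation(alpha_t, dirs_t):
    """Perturbation for one word: the sum over k of alpha_t[k] * dirs_t[k]."""
    dim = len(dirs_t[0])
    return [sum(a * d[i] for a, d in zip(alpha_t, dirs_t)) for i in range(dim)]

def perturb_sentence(embs, alphas, dirs):
    """Add each word's perturbation to its embedding."""
    return [
        [w + r for w, r in zip(w_t, perturbation(alpha_t, dirs_t))]
        for w_t, alpha_t, dirs_t in zip(embs, alphas, dirs)
    ]

# One word, two candidate directions.
embs = [[1.0, 1.0]]
dirs = [[[1.0, 0.0], [0.0, 1.0]]]
alphas = [[0.5, -0.25]]
print(perturb_sentence(embs, alphas, dirs))  # [[1.5, 0.75]]
```

Unlike an unconstrained perturbation, this one is fully described by the weights, which have one entry per candidate word.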

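As in Eq. 7, we look for the worst-case weights of the direction vectors (that is, the optimal perturbation under the direction constraint), which maximize the following loss:

$$\alpha_{\mathrm{iAdvT}} = \operatorname*{argmax}_{\alpha,\ \|\alpha\| \le \epsilon} \ell(\tilde{X}_{+r(\alpha)}, \tilde{Y}, W) \tag{14}$$

We can then define the overall loss of iAdvT-Text:

$$J_{\mathrm{iAdvT}}(\mathcal{D}, W) = \frac{1}{|\mathcal{D}|} \sum_{(\tilde{X}, \tilde{Y}) \in \mathcal{D}} \ell(\tilde{X}_{+r(\alpha_{\mathrm{iAdvT}})}, \tilde{Y}, W) \tag{15}$$

By analogy with Eq. 10, the overall optimization problem becomes:

$$\hat{W} = \operatorname*{argmin}_{W} \left\{ J(\mathcal{D}, W) + \lambda J_{\mathrm{iAdvT}}(\mathcal{D}, W) \right\} \tag{16}$$

To reduce the cost of computing the optimal perturbation, an approximation analogous to Eq. 9 can be used:

$$\alpha^{(t)}_{\mathrm{iAdvT}} = \epsilon \frac{g^{(t)}}{\|g\|_2}, \qquad g^{(t)} = \nabla_{\alpha^{(t)}} \ell(\tilde{X}_{+r(\alpha)}, \tilde{Y}, W) \tag{17}$$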

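As with $r^{(t)}_{\mathrm{AdvT}}$, the intuitive reading of $\alpha^{(t)}_{\mathrm{iAdvT}}$ is the (normalized) strength with which each direction increases the loss. We therefore expect it to single out the word whose direction gives the most effective adversarial perturbation.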
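One way to compute the approximate weights is via the chain rule. Assuming the gradient is evaluated at $\alpha = 0$ (an assumption of this sketch, not something the notes state), the derivative of the loss with respect to $\alpha_k^{(t)}$ is the dot product of $d_k^{(t)}$ with the gradient with respect to $w^{(t)}$. The function name `iadvt_weights` and the numbers are hypothetical.

```python
import math

def iadvt_weights(grads_w, dirs, epsilon):
    """grads_w[t]: loss gradient w.r.t. the t-th word embedding; dirs[t][k]: direction to word k.

    Chain rule at alpha = 0 (assumed): d loss / d alpha[t][k] = dirs[t][k] . grads_w[t].
    The result is rescaled so that the concatenated weight vector has norm epsilon.
    """
    g = [
        [sum(a * b for a, b in zip(d_k, g_t)) for d_k in dirs_t]
        for g_t, dirs_t in zip(grads_w, dirs)
    ]
    norm = math.sqrt(sum(x * x for row in g for x in row))
    if norm == 0.0:
        return [[0.0] * len(row) for row in g]
    return [[epsilon * x / norm for x in row] for row in g]

# One word, three candidate directions (the last is the zero vector of the word itself).
grads_w = [[3.0, 4.0]]
dirs = [[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]]
alpha = iadvt_weights(grads_w, dirs, epsilon=1.0)
print(alpha)  # roughly [[0.6, 0.8, 0.0]]
```

The candidate with the largest weight (index 1 here) is the word whose direction is best aligned with the loss gradient.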

5.2 Practical computation

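The most expensive computation is the sum over all words in Eq. 12, i.e. computing the direction vectors (Eq. 11) from $w^{(t)}$ to every word in the vocabulary. Computing all $|\mathcal{V}|$ directions for every word is especially costly. We therefore let $\mathcal{V}^{(t)}$ be a separate vocabulary for step $t$, such that $\mathcal{V}^{(t)} \subseteq \mathcal{V}$ and $|\mathcal{V}^{(t)}| \ll |\mathcal{V}|$. In our experiments, $\mathcal{V}^{(t)}$ contains only words close to $w^{(t)}$ in the embedding space. Distant words are treated as irrelevant and dropped.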

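[Personal understanding]:

When reading a word of the input sentence, build the vocabulary only from that word's near neighbours in the embedding space (loosely, its synonym-like words). Then only the direction vectors from that word to those neighbours need to be computed.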
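The restricted vocabulary can be sketched as a nearest-neighbour lookup. The notes do not say which distance or how many neighbours are used, so Euclidean distance and the parameter `k` are assumptions of this example. The function name `nearest_neighbors` and the embeddings are also hypothetical.

```python
import math

def nearest_neighbors(word_index, vocab_embs, k):
    """Indices of the k vocabulary words closest to vocab_embs[word_index].

    Euclidean distance is an assumption; the word itself is excluded.
    """
    w_t = vocab_embs[word_index]
    dists = [
        (math.dist(w_t, w), i)
        for i, w in enumerate(vocab_embs)
        if i != word_index
    ]
    return [i for _, i in sorted(dists)[:k]]

# Hypothetical 4-word vocabulary with 2-dimensional embeddings.
vocab_embs = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
print(nearest_neighbors(0, vocab_embs, k=2))  # [1, 2]
```

Nearest neighbours in an embedding space are not necessarily synonyms. They can include antonyms such as better and worse, which is what makes the replacements adversarial.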

5.3 Extension to semi-supervised learning

6 Experiments

We run experiments on three tasks:

Sentiment classification (SEC): classify a text as positive or negative
Category classification (CAC)
Grammatical error detection (GED): detect ungrammatical words

6.1 Datasets

SEC: the IMDB, Elec, and Rotten Tomatoes datasets
CAC: the DBpedia and RCV1 datasets
GED: the First Certificate in English dataset (FCE-public)

6.2 Model settings

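For ease of comparison, the model configuration for the SEC task is the same as in [Miyato et al., 2017], and the configuration for the GED task is the same as in [Rei and Yannakoudakis, 2016; Kaneko et al., 2017]. See Figure 3 for details.

The RNN is pre-trained as a language model [Bengio et al., 2000]. The initial word embeddings and LSTM weights follow [Miyato et al., 2017]. To reduce the cost of computing the softmax loss, we train the language model with Adaptive Softmax [Grave et al., 2017]. We also use an early stopping criterion [Caruana et al., 2000]. Optimizer: Adam.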

Hyperparameter settings:

[Table: hyperparameter settings]

For AdvT-Text and VAT-Text we choose …, while for our method we choose …. For all methods we set ….

6.3 Evaluation by task performance

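Table 3 shows the performance on the IMDB dataset (measured by error rate):

[Table 3]

We had expected our method to hurt performance, because it places a strong constraint on the perturbation direction, but it did not. This suggests that the directions between real words in the embedding space carry useful information that improves generalization.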

Table 6 shows the performance on the GED task (measured by $F_{0.5}$):

[Table 6]
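For reference, $F_{0.5}$ is the general $F_\beta$ score with $\beta = 0.5$, which weights precision more heavily than recall. A small sketch follows. The function name `f_beta` and the precision and recall values are illustrative and are not results from the paper.

```python
def f_beta(precision, recall, beta=0.5):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); returns 0 when both are 0."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

# Same pair of numbers, swapped: high precision is rewarded more than high recall.
print(f_beta(0.8, 0.4))  # about 0.667
print(f_beta(0.4, 0.8))  # about 0.444
```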

6.4 Visualization of sentence-level perturbation

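We visualize the adversarial examples produced by iAdvT-Text, see Figure 4:

[Figure 4]

The leftmost column is the clean text, and the two coloured columns are the adversarial texts produced by the two methods. For our method, we take the largest weight $\alpha_k^{(t)}$ for each $t$ and use it to pick the replacement word. For AdvT-Text, we compute the cosine similarity between the original word and the candidate perturbed words and pick the candidate with the highest similarity as the replacement. On the SEC task, iAdvT-Text successfully finds a direction that replaces better with worse to increase the loss. On the GED task, where the word practice is a grammatical error, iAdvT-Text also finds the replacement word play.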

6.5 Adversarial texts

To obtain adversarial text, we find the largest perturbation and replace the original word with the word that this perturbation points to.
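A minimal sketch of this replacement step, assuming the weights toward each candidate word are already computed. Picking the single position with the largest weight in the whole sentence is my reading of "the largest perturbation". The function name `most_adversarial_swap`, the sentence, the candidate words, and the weights are all made up, apart from the better/worse pair mentioned in the notes.

```python
def most_adversarial_swap(words, alphas, candidates):
    """Replace the word whose strongest direction weight is largest overall.

    alphas[t][k] is the weight from the t-th word toward candidates[t][k].
    """
    weight, t, k = max(
        (a, t, k)
        for t, row in enumerate(alphas)
        for k, a in enumerate(row)
    )
    swapped = list(words)
    swapped[t] = candidates[t][k]
    return swapped, t, weight

words = ["this", "movie", "is", "better"]
candidates = [["that", "the"], ["film", "show"], ["was", "be"], ["worse", "good"]]
alphas = [[0.01, 0.02], [0.05, 0.01], [0.0, 0.03], [0.9, 0.1]]
swapped, position, weight = most_adversarial_swap(words, alphas, candidates)
print(swapped)  # ['this', 'movie', 'is', 'worse']
```

Because every weight is tied to a concrete vocabulary word, the perturbation can be read off as a word substitution.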
