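spaCy: strange similarity between two sentences (answers)

【Question title】: spaCy, strange similarity between two sentences
【Posted】: 2019-02-06 09:31:37
【Question description】:

I downloaded the en_core_web_lg model and am trying to compute the similarity between two sentences: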

import spacy

nlp = spacy.load('en_core_web_lg')

search_doc = nlp("This was very strange argument between american and british person")

main_doc = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")

print(main_doc.similarity(search_doc))

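This returns a very strange value:

0.9066019751888448

These two sentences should not be 90% similar; their meanings are very different.

Why does this happen? Do I need to add some kind of additional vocabulary to get a more reasonable similarity score?

【Comments】:

Tags: python nlp spacy

【Solution 1】: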

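The spaCy documentation on vector similarity explains the basic idea:
each word has a vector representation learned from the contexts it appears in (Word2Vec-style embeddings), and these representations are trained on corpora, as described in the documentation.

The embedding of a full sentence is then simply the average of the vectors of all its words. If many of those words lie in the same semantic region (for example filler words such as "he", "was", "this", ...) and the remaining vocabulary "cancels out", you can end up with a similarity like the one in your case.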
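To make the averaging effect concrete, here is a toy sketch in plain Python. The 3-dimensional "word vectors" are made up for illustration and are not taken from en_core_web_lg; the helper names `mean_vector` and `cosine` are likewise just for this example. The filler words share one direction, while the two content words point in unrelated directions.

```python
import math

def mean_vector(vectors):
    """Component-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors (assumption): fillers cluster together,
# the content words "argument" and "school" are orthogonal.
toy_vectors = {
    "he": [1.0, 0.1, 0.0],
    "was": [1.0, 0.0, 0.1],
    "this": [1.0, 0.1, 0.1],
    "very": [1.0, 0.0, 0.0],
    "argument": [0.0, 1.0, 0.0],
    "school": [0.0, 0.0, 1.0],
}

sentence_a = ["this", "was", "very", "argument"]
sentence_b = ["he", "was", "very", "school"]

avg_a = mean_vector([toy_vectors[w] for w in sentence_a])
avg_b = mean_vector([toy_vectors[w] for w in sentence_b])

# The content words alone have cosine 0, yet the averaged
# sentence vectors come out above 0.9.
content_sim = cosine(toy_vectors["argument"], toy_vectors["school"])
sentence_sim = cosine(avg_a, avg_b)
print(content_sim, sentence_sim)
```

The only words that distinguish the two toy sentences contribute one quarter of each average, so the shared filler direction dominates the result.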

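The question, rightly, is what you can do about it. From my perspective, you could come up with a more sophisticated similarity measure. Since search_doc and main_doc carry additional information, such as the original sentence, you could adjust the score with a length-difference penalty, or alternatively compare shorter pieces of the sentences and compute pairwise similarities (in which case the question becomes which parts to compare).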

Sadly, for now there is no clean way to simply resolve this issue.
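As a rough sketch of the two ideas mentioned above (pairwise comparison and a length penalty), something like the following could be a starting point. The function names `best_match_similarity` and `length_penalty`, and the particular formulas, are my own assumptions for illustration; they are not part of spaCy and not necessarily what the answer had in mind.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def best_match_similarity(vectors_a, vectors_b):
    """For each vector on one side, take its best cosine match on the
    other side, average those scores, and symmetrize both directions."""
    if not vectors_a or not vectors_b:
        raise ValueError("both inputs need at least one vector")

    def directed(src, dst):
        return sum(max(cosine(u, v) for v in dst) for u in src) / len(src)

    return 0.5 * (directed(vectors_a, vectors_b) + directed(vectors_b, vectors_a))

def length_penalty(len_a, len_b):
    """Ratio of the shorter length to the longer one, in (0, 1]."""
    return min(len_a, len_b) / max(len_a, len_b)

# Toy 2-d vectors (made up): one side has two "words", the other has one.
a = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0, 0.0]]
score = best_match_similarity(a, b) * length_penalty(len(a), len(b))
print(score)
```

With spaCy, the inputs could be built as `[t.vector for t in search_doc if t.has_vector and not t.is_stop]`. Which tokens or chunks to compare remains the open question noted above.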

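【Comments】:

The clean way is either to have a more meaningful vector representation or to judge similarity using only the meaningful words (see the answer below).

【Solution 2】:

spaCy constructs a sentence embedding by averaging the word embeddings. Because an ordinary sentence contains many words that carry little meaning (called stop words), you get poor results. You can remove them like this: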

search_doc = nlp("This was very strange argument between american and british person")
main_doc = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")

search_doc_no_stop_words = nlp(' '.join([str(t) for t in search_doc if not t.is_stop]))
main_doc_no_stop_words = nlp(' '.join([str(t) for t in main_doc if not t.is_stop]))

print(search_doc_no_stop_words.similarity(main_doc_no_stop_words))

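Or keep only the nouns, since they carry the most information (here doc stands for either of the parsed documents above):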

doc_nouns = nlp(' '.join([str(t) for t in doc if t.pos_ in ['NOUN', 'PROPN']]))
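The same filtering can also be done without joining the tokens back into a string and parsing it again, by averaging the vectors of the kept tokens directly. The sketch below assumes token objects that expose `vector`, `has_vector`, `is_stop`, `is_punct` and `pos_`, as spaCy tokens do; the helper names `content_vector` and `content_similarity` are hypothetical.

```python
import math

def content_vector(doc, keep_pos=None):
    """Average the vectors of tokens that are neither stop words nor
    punctuation. If keep_pos is given, keep only those part-of-speech tags.
    Returns None when no token survives the filter."""
    kept = [
        t.vector
        for t in doc
        if t.has_vector
        and not t.is_stop
        and not t.is_punct
        and (keep_pos is None or t.pos_ in keep_pos)
    ]
    if not kept:
        return None
    n = len(kept)
    return [float(sum(component)) / n for component in zip(*kept)]

def content_similarity(doc_a, doc_b, keep_pos=None):
    """Cosine similarity of the filtered average vectors (0.0 if undefined)."""
    u = content_vector(doc_a, keep_pos)
    v = content_vector(doc_b, keep_pos)
    if u is None or v is None:
        return 0.0
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical usage with the documents from the question:
# content_similarity(search_doc, main_doc)
# content_similarity(search_doc, main_doc, keep_pos={"NOUN", "PROPN"})
```

This keeps the original tokenization and tags, whereas re-parsing the joined string may tokenize or tag the text differently.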

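【Comments】:

Reading this and other posts cleared up my misconception that stop words are removed automatically when computing document similarity. This answer is great because it focuses on the actual content, reduces noise words, and makes the similarity computation faster.

【Solution 3】:

As @dennlinger pointed out, spaCy's sentence embedding is just the average of the individual word vectors. So if a sentence contains opposing words such as "good" and "bad", their vectors may cancel each other out, which results in a not-so-good sentence embedding. If your use case is specifically about getting sentence embeddings, you should try the SOTA approaches below.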

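Google's Universal Sentence Encoder: https://tfhub.dev/google/universal-sentence-encoder/2
Facebook's InferSent encoder: https://github.com/facebookresearch/InferSent

I have tried both of these embeddings. They give good results most of the time, and both use word embeddings as the base for building sentence embeddings.

Cheers!

【Comments】:

【Solution 4】:

As others have said, you may want to use Universal Sentence Encoder or InferSent.

For Universal Sentence Encoder, you can install a pre-built spaCy model that manages the wrapping of TFHub, so you only need to install the package with pip, and vectors and similarity will work as expected.

You can follow the instructions in this repository (I am the author): https://github.com/MartinoMensio/spacy-universal-sentence-encoder-tfhub

Install the model: pip install https://github.com/MartinoMensio/spacy-universal-sentence-encoder-tfhub/releases/download/en_use_md-0.2.0/en_use_md-0.2.0.tar.gz#en_use_md-0.2.0
Load and use the model: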

import spacy
# this loads the wrapper
nlp = spacy.load('en_use_md')

# your sentences
search_doc = nlp("This was very strange argument between american and british person")

main_doc = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")

print(main_doc.similarity(search_doc))
# this will print 0.310783598221594

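【Comments】:

Please disclose that you are the author of the package you mention (although it is fairly obvious).
Thanks @m02ph3u5, I have added a mention.
When using a model like this, should I still remove stop words, or keep them as part of the necessary context?
@benino You don't need to remove stop words or lemmatize. Universal Sentence Encoder can work directly on your unprocessed text.
Hi @MartinoMensio, thanks for your reply. How can I load "en_use_md" from disk? My server is not connected to the Internet. Is there a workaround? 1. Which file should I download, and then how do I spacy.load...? Thanks.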