Remove loops for sentence comparison in NLP: answers

【Question title】: Remove loops for sentence comparison in NLP
【Posted】: 2019-06-18 19:52:42
【Question】:

I am using BERT to compare text similarity, with the following code:

from bert_embedding import BertEmbedding
import numpy as np
# Note: scipy's `cosine` returns the cosine distance (1 - similarity), despite the alias used here
from scipy.spatial.distance import cosine as cosine_similarity

bert_embedding = BertEmbedding()
TEXT1 = "As expected from MIT-level of course: it's interesting, challenging, engaging, and for me personally quite enlightening. This course is second part of 5 courses in  micromasters program. I was interested in learning about supply chain (purely personal interest, my work touch this topic but not directly) and stumbled upon this course, took it, and man-oh-man...I just couldn't stop learning. Now I'm planning to take the rest of the courses. Average time/effort per week should be around 8-10 hours, but I tried to squeeze everything into just 5 hours since I have very limited free time. You will need 2-3 hours per week for the lecture videos, 2 hours for practice problems, and another 2 hours for the weekly homework. This course offers several topics around demand forecasting and inventory. Basic knowledge of probability and statistics is needed. It will help if you take the prerequisite course: supply chain analytics. But if you've already familiar with basic concept of statistics, you can pick yourself along the way. The lectures are very interesting and engaging, it gives you a lot of knowledge but also throw in some business perspective, so it's very relatable and applicable! The practice problems can help strengthen the understanding of the given knowledge and the homework are very challenging compared to other online-courses I have taken. This course is the best quality I have taken so far, and I have taken several (3-4 MOOCs) from other provider."
TEXT1 = TEXT1.split('.')

sentence2 = ["CHALLENGING COURSE "]

From there I want to use cosine distance to find the best match for sentence2 among the sentences of TEXT1:

best_match = {'sentence':'','score':''}
best = 0
for sentence in TEXT1: 
  #sentence = sentence.replace('SUPPLY CHAIN','')
  if len(sentence) < 5:
    continue
  avg_vec1 = calculate_avg_vec([sentence])
  avg_vec2 = calculate_avg_vec(sentence2)

  score = cosine_similarity(avg_vec1,avg_vec2)
  if score > best:
    best_match['sentence'] =  sentence
    best_match['score'] =  score
    best = score

best_match

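The code works, but since I want to compare sentence2 not only against TEXT1 but against N texts, I need it to run faster. Is it possible to vectorize this loop? Or is there another way to speed it up?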

【Comments】:

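Since the order doesn't need to be preserved, turn both into sets and compare them.
@d_kennetz I'm not clear on what you are suggesting.
Have you profiled your code? Which part takes more time: calculate_avg_vec or cosine_similarity?
Have you tried a parallel loop? I think it could help, since sentence2 is the only thing that would have to be shared between threads.
I don't see where bert_embedding is used later in your code. What you could do is run one loop that converts your sentences into embeddings, then use sklearn's sklearn.metrics.pairwise.cosine_similarity to compute the similarity matrix.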

Tags: python numpy bert-language-model

【Solution 1】:

Cosine similarity is defined as the dot product of two normalized vectors.

So the whole comparison is essentially one matrix multiplication followed by an argmax to get the index of the best match.
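To make the definition concrete, here is a small numpy check that the dot product of two L2-normalized vectors equals the usual formula a·b / (|a| |b|), and that stacking normalized vectors as rows scores all of them in a single product. The vectors are made-up illustration values, not BERT output, and `normalize` is a helper introduced only for this sketch.

```python
import numpy as np

def normalize(v):
    """Scale a vector (or each row of a matrix) to unit L2 norm."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Made-up 3-dimensional vectors, purely for illustration.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 1.0])

# Textbook formula: a.b / (|a| |b|)
sim_formula = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
# Same value, computed as a plain dot product of the normalized vectors.
sim_normalized = normalize(a) @ normalize(b)

# Rows of X are candidate vectors; one product scores them all against b.
X = normalize(np.array([[1.0, 2.0, 3.0],
                        [2.0, 0.0, 1.0],
                        [-2.0, 0.0, -1.0]]))
scores = X @ normalize(b)        # shape (3,): one similarity per row
best_index = int(scores.argmax())
```

Here the second row is `b` itself, so it scores 1 and wins the argmax, while the third row points the opposite way and scores -1.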

I'll use numpy, although, as mentioned in the comments, you could probably plug this into the BERT model itself using pytorch or tensorflow.

First, we define a normalized average vector:

def calculate_avg_norm_vec(sentence):
    vs = sentence2vectors(sentence) # TODO: use Bert embedding
    vm = vs.mean(axis=0)
    return vm/np.linalg.norm(vm)

Then we build a matrix whose rows are the vectors of all the sentences:

# np.apply_along_axis does not work on a 1-D list of strings, so stack the vectors explicitly
X = np.stack([calculate_avg_norm_vec(s) for s in all_sentences])
target = calculate_avg_norm_vec(target_sentence)

Finally, we multiply the X matrix by the target vector and take the argmax:

index_of_sentence = np.dot(X, target).argmax()  # X is (n, d) and target is (d,), so the product is (n,)

You may want to check that the axes and indices fit your data, but this is the overall scheme.
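Putting these steps together, here is a minimal end-to-end sketch, extended to several texts at once as the question asks. It rests on assumptions: `toy_sentence_vector` is a hypothetical stand-in for the averaged BERT embedding (a simple bag-of-words count) so that the example runs without a model, and `normalize_rows`, `best_match` and the sample texts are likewise made up for illustration. For real use, only the embedding function should need swapping.

```python
import numpy as np

DIM = 64      # arbitrary size for the toy vectors; real BERT vectors are larger
_vocab = {}   # word -> column index, filled in as new words are seen

def toy_sentence_vector(sentence):
    """Hypothetical stand-in for an averaged BERT embedding: bag-of-words counts."""
    vec = np.zeros(DIM)
    for word in sentence.lower().split():
        idx = _vocab.setdefault(word, len(_vocab))
        vec[idx % DIM] += 1.0
    return vec

def normalize_rows(m):
    """Scale each row to unit L2 norm, leaving all-zero rows as zeros."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.where(norms == 0, 1.0, norms)

def best_match(texts, target_sentence, min_len=5):
    """Return (text index, sentence, score) for the sentence closest to the target."""
    # Flatten all texts into one sentence list, remembering each sentence's source text.
    sentences, owners = [], []
    for i, text in enumerate(texts):
        for s in text.split("."):
            s = s.strip()
            if len(s) >= min_len:   # skip very short fragments, as the original loop does
                sentences.append(s)
                owners.append(i)
    X = normalize_rows(np.stack([toy_sentence_vector(s) for s in sentences]))
    target = normalize_rows(toy_sentence_vector(target_sentence)[None, :])[0]
    scores = X @ target             # every cosine similarity in one product
    k = int(scores.argmax())
    return owners[k], sentences[k], float(scores[k])

texts = [
    "The lectures are interesting. The homework is a challenging course load.",
    "Delivery was late. The packaging was fine.",
]
text_index, sentence, score = best_match(texts, "CHALLENGING COURSE")
```

The target is embedded once and every sentence of every text is scored by a single matrix-vector product, so the Python-level loop remains only around the embedding step. With a real model, that step could probably be batched as well.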

【Comments】: