Answers: Python Nested Loop Alternative

【Question title】: Python Nested Loop Alternative
【Posted】: 2021-05-25 11:14:58
【Question description】:

I have two large lists of text: X = [30,000 entries] and Y = [400 entries].

I want to use cosine similarity to find the texts that are similar across the two lists. Below is the code I tried, using nested for loops:

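from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = CountVectorizer().fit(X + Y)  # one shared vocabulary for both lists
found_words = []
for x in X:
    for y in Y:
        # transform() expects a list of documents and returns a 1-row sparse matrix
        vector1 = vectorizer.transform([x.lower()])
        vector2 = vectorizer.transform([y.lower()])
        sim = cosine_similarity(vector1, vector2)[0, 0]
        if sim > 0.9:
            found_words.append(x.capitalize())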

The code above works, but it takes a very long time to run. Is there another approach that is efficient in both time and space complexity? Thanks.

【Question comments】:

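There is a similar answer here: stackoverflow.com/questions/13908518/…
I did look at it, but because my lists are large, it takes the same time to run as the nested for loops.
stackoverflow.com/questions/21850508/…
What happens when you move the `vector1 = ...` line to before `for y in Y:`?
OK, it looks like you need multiprocessing. In any case, you should call s.lower() only once per string, so you may want Y = [y.lower() for y in Y] before the loop.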

Tags: python list for-loop nlp cosine-similarity

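【Solution 1】:

Instead of the cosine, you can compute the dot product of the normalized vectors. The vectorization can then be done once, up front, before this operation.

Here is my attempt at reproducing this with random vectors: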

import numpy as np

# assume vector dimension is 100:
a = np.random.random([30000, 100])  # X vectors
b = np.random.random([400, 100])  # Y vectors

# scale each vector to unit length
a = np.array([[_v / np.linalg.norm(_v)] for _v in a])  # shape (30000, 1, 100)
b = np.array([[_v / np.linalg.norm(_v)] for _v in b])  # shape (400, 1, 100)

# contracting the last two axes gives every pairwise dot product
sims = np.tensordot(a, b, axes=([1, 2], [1, 2]))  # shape (30000, 400)

print(np.where(sims > 0.87)[0])  # indices of matched items in X

I lowered the threshold to 0.87 so that my random vectors would produce some matches.
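
The same similarity matrix can also be built from plain 2-D arrays: dividing each row by its norm and taking a single matrix product gives every pairwise cosine at once. The following is a minimal sketch of that variant, not the answer's exact code; the seeded generator `rng`, the reduced row count, and the names `a_unit`, `b_unit`, and `matched_rows` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator (assumption, for repeatability)
a = rng.random((3000, 100))      # stand-in for the X vectors
b = rng.random((400, 100))       # stand-in for the Y vectors

# Scale every row to unit length; keepdims=True lets the division broadcast row-wise.
a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)

# One matrix product yields all pairwise cosine similarities.
sims = a_unit @ b_unit.T         # shape (3000, 400)

# Spot check one pair against the textbook cosine formula.
direct = a[0] @ b[0] / (np.linalg.norm(a[0]) * np.linalg.norm(b[0]))
assert np.isclose(sims[0, 0], direct)

# Rows of `a` that have at least one match above the threshold.
matched_rows = np.unique(np.where(sims > 0.87)[0])
```

Working with 2-D arrays avoids the extra singleton axis and the Python-level list comprehension, so the normalization itself is also vectorized.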

Replace the random a and b with your vectorized text:

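from sklearn.feature_extraction.text import CountVectorizer

# fit one shared vocabulary so both lists map to the same columns
vectorizer = CountVectorizer().fit(X + Y)  # lowercases by default
a = vectorizer.transform(X).toarray()  # shape (30000, d)
b = vectorizer.transform(Y).toarray()  # shape (400, d)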

Finally, you also need to use the indices into X to get back the actual source texts.

x_indices, _ = np.where(sims > 0.9)
x_indices = np.unique(x_indices)  # avoid possible duplicate matches
found_words = [X[i] for i in x_indices]
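
Putting the pieces together on real text, one possible end-to-end version is sketched below. It relies on scikit-learn's `cosine_similarity`, which accepts sparse matrices and normalizes internally, so the count vectors never need to be converted to dense arrays. The toy lists `X` and `Y` and the names `x_counts` and `y_counts` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the two lists in the question (assumed data).
X = ["The quick brown fox", "hello world", "completely different text"]
Y = ["the QUICK brown fox", "Hello World"]

# Fit one vocabulary over both lists so the columns line up.
# CountVectorizer lowercases its input by default.
vectorizer = CountVectorizer().fit(X + Y)
x_counts = vectorizer.transform(X)   # sparse, shape (len(X), vocabulary size)
y_counts = vectorizer.transform(Y)   # sparse, shape (len(Y), vocabulary size)

# All pairwise similarities in one call; the result is a dense (len(X), len(Y)) array.
sims = cosine_similarity(x_counts, y_counts)

# Keep every X entry that is close to at least one Y entry.
found_words = [x for x, row in zip(X, sims) if row.max() > 0.9]
print(found_words)
```

Each text is vectorized exactly once here, instead of once per (x, y) pair as in the nested loop.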

If you have access to a CUDA-capable Nvidia GPU, you can use it for faster, parallel tensor operations. You can access the device through torch:

import numpy as np
import torch
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer().fit(X + Y)  # one shared vocabulary
a = vectorizer.transform(X).toarray()  # shape (30000, d)
b = vectorizer.transform(Y).toarray()  # shape (400, d)

# normalize the vectors and also convert them to tensor types
a = torch.tensor(np.array([[_v / np.linalg.norm(_v)] for _v in a]), device='cuda')  # shape (30000, 1, d)
b = torch.tensor(np.array([[_v / np.linalg.norm(_v)] for _v in b]), device='cuda')  # shape (400, 1, d)

sims = torch.tensordot(a, b, dims=([1, 2], [1, 2])).cpu().numpy()
# shape (30000, 400)

x_indices, _ = np.where(sims > 0.9)
x_indices = np.unique(x_indices)  # avoid possible duplicate matches
found_words = [X[i] for i in x_indices]

【Comments】: