python: How to calculate the cosine similarity of two word lists? (Answers)

【Question title】: python: How to calculate the cosine similarity of two word lists?
【Posted】: 2015-05-03 08:44:55
【Question】:

I want to calculate the cosine similarity of two lists, such as the following:

A = [u'home (private)', u'bank', u'bank', u'building(condo/apartment)','factory']

B = [u'home (private)', u'school', u'bank', u'shopping mall']

I know that the cosine similarity of A and B should be

3/(sqrt(7)*sqrt(4)).

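I tried reshaping the lists into something like "home bank bank building factory", which looks like a sentence. However, some elements (e.g. home (private)) contain spaces themselves and some contain parentheses, so I find it hard to count the word occurrences.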

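Do you know how to count word occurrences in a list this complicated, so that for list B the occurrences can be represented as

{'home (private)': 1, 'school': 1, 'bank': 1, 'shopping mall': 1}?

Or do you know how to calculate the cosine similarity of these two lists?

Thank you very much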

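【Comments】:

How do you define cosine similarity? Where do the values in 3/(sqrt(7)*sqrt(4)) come from?
I only know one way to define cosine similarity, namely dot(A, B) / (|A| * |B|). Taking A = [2, 1, 1, 1, 0, 0] and B = [1, 1, 0, 0, 1, 1], the cosine similarity is 3/(sqrt(7)*sqrt(4)).

Tags: python string list cosine-similarity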

【Solution 1】:

First, build a dictionary (the technical term for a list of all the distinct words in a set or corpus).

vocab = {}
i = 0

# loop through each list, find distinct words and map them to a
# unique number starting at zero

for word in A:
    if word not in vocab:
        vocab[word] = i
        i += 1


for word in B:
    if word not in vocab:
        vocab[word] = i
        i += 1

The vocab dictionary now maps each word to a unique number starting at zero. We will use these numbers as indices into an array (or vector).
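To make the mapping concrete, here is a minimal sketch that builds the same kind of vocabulary for the two lists from the question. It uses `dict.setdefault` as a compact alternative to the explicit `if word not in vocab` check; that substitution is an illustrative choice, not part of the answer. The printed order assumes Python 3.7 or later, where dictionaries preserve insertion order.

```python
A = ['home (private)', 'bank', 'bank', 'building(condo/apartment)', 'factory']
B = ['home (private)', 'school', 'bank', 'shopping mall']

vocab = {}
for word in A + B:
    # setdefault only inserts when the word is new;
    # len(vocab) is then the next unused index
    vocab.setdefault(word, len(vocab))

for word, index in vocab.items():
    print(index, word)
# 0 home (private)
# 1 bank
# 2 building(condo/apartment)
# 3 factory
# 4 school
# 5 shopping mall
```

Each element is treated as one opaque string, so spaces and parentheses inside an element such as 'home (private)' need no special handling.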

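In the next step, we create something called a term-frequency vector for each input list. We will use a library called numpy here. It is a very popular choice for this kind of scientific computing. If you are interested in cosine similarity (or other machine learning techniques), it is worth your time to learn.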

import numpy as np

# create a numpy array (vector) for each input, filled with zeros
a = np.zeros(len(vocab))
b = np.zeros(len(vocab))

# loop through each input and create a corresponding vector for it
# this vector counts occurrences of each word in the dictionary

for word in A:
    index = vocab[word] # get index from dictionary
    a[index] += 1 # increment count for that index

for word in B:
    index = vocab[word]
    b[index] += 1

The last step is the actual calculation of the cosine similarity.

# use numpy's dot product to calculate the cosine similarity
sim = np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b))

The variable sim now contains your answer. You can pull out each of these sub-expressions and verify that they match your original formula.
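As a rough check of that claim, the sketch below writes out the two term-frequency vectors by hand and evaluates each sub-expression separately. The vectors assume the index order that the vocab loops produce for the question's lists (home (private), bank, building(condo/apartment), factory, school, shopping mall); the variable names `dot_ab`, `norm_a_sq` and `norm_b_sq` are introduced here for illustration only.

```python
import numpy as np

# Term-frequency vectors for A and B, written out by hand.
# Index order (assumed): home (private), bank, building(condo/apartment),
# factory, school, shopping mall
a = np.array([1.0, 2.0, 1.0, 1.0, 0.0, 0.0])
b = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 1.0])

dot_ab = np.dot(a, b)      # numerator: 1*1 + 2*1 = 3
norm_a_sq = np.dot(a, a)   # 1 + 4 + 1 + 1 = 7
norm_b_sq = np.dot(b, b)   # 1 + 1 + 1 + 1 = 4

sim = dot_ab / np.sqrt(norm_a_sq * norm_b_sq)
expected = 3 / (np.sqrt(7) * np.sqrt(4))

print(dot_ab, norm_a_sq, norm_b_sq)   # 3.0 7.0 4.0
print(np.isclose(sim, expected))      # True
```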

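With some refactoring, this technique scales reasonably well (a relatively large number of input lists, each with a relatively large number of distinct words). For really large corpora (such as Wikipedia), you should look at natural language processing libraries built for this kind of thing. There are a few good ones.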
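One possible shape for that refactoring is sketched below: the vocabulary is built once over any number of lists, the counts go into a single 2-D array, and all pairwise similarities come from one matrix product. The function names `build_vocab`, `count_matrix` and `pairwise_cosine` are hypothetical helpers made up for this sketch, and it assumes no input list is empty (an empty list would give a zero norm and a division by zero).

```python
import numpy as np

def build_vocab(word_lists):
    """Map every distinct word across all lists to a unique index."""
    vocab = {}
    for words in word_lists:
        for word in words:
            vocab.setdefault(word, len(vocab))
    return vocab

def count_matrix(word_lists, vocab):
    """Return one term-frequency row per input list."""
    counts = np.zeros((len(word_lists), len(vocab)))
    for row, words in enumerate(word_lists):
        for word in words:
            counts[row, vocab[word]] += 1
    return counts

def pairwise_cosine(counts):
    """Cosine similarity between every pair of rows (rows must be non-zero)."""
    norms = np.linalg.norm(counts, axis=1)
    return counts.dot(counts.T) / np.outer(norms, norms)

lists = [
    ['home (private)', 'bank', 'bank', 'building(condo/apartment)', 'factory'],
    ['home (private)', 'school', 'bank', 'shopping mall'],
    ['bank', 'bank'],
]
sims = pairwise_cosine(count_matrix(lists, build_vocab(lists)))
print(sims[0, 1])  # similarity of the first two lists, about 0.567
```

For a large vocabulary the dense count array becomes mostly zeros, which is one reason dedicated libraries tend to use sparse matrices instead.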

【Comments】:

【Solution 2】:

from collections import Counter

# word-lists to compare
a = [u'home (private)', u'bank', u'bank', u'building(condo/apartment)','factory']
b = [u'home (private)', u'school', u'bank', u'shopping mall']

# count word occurrences
a_vals = Counter(a)
b_vals = Counter(b)

# convert to word-vectors
words  = list(a_vals.keys() | b_vals.keys())
a_vect = [a_vals.get(word, 0) for word in words]        # e.g. [0, 0, 1, 1, 2, 1] (order follows set iteration)
b_vect = [b_vals.get(word, 0) for word in words]        # e.g. [1, 1, 1, 0, 1, 0] (same order as a_vect)

# find cosine
len_a  = sum(av*av for av in a_vect) ** 0.5             # sqrt(7)
len_b  = sum(bv*bv for bv in b_vect) ** 0.5             # sqrt(4)
dot    = sum(av*bv for av,bv in zip(a_vect, b_vect))    # 3
cosine = dot / (len_a * len_b)                          # 0.5669467
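
The same steps can be wrapped in a reusable function. The sketch below is one way to do that, not the answerer's code: it skips the intermediate vectors and works on the Counter objects directly, since only words present in both lists contribute to the dot product. The function name `cosine_similarity` and the choice to return 0.0 for an empty list are assumptions made for this example.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(words_a, words_b):
    """Cosine similarity of two word lists based on word counts."""
    counts_a = Counter(words_a)
    counts_b = Counter(words_b)

    # Words missing from either list contribute 0 to the dot product,
    # so it is enough to iterate over the shared words.
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)

    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty list is treated as "no similarity"
    return dot / (norm_a * norm_b)

a = ['home (private)', 'bank', 'bank', 'building(condo/apartment)', 'factory']
b = ['home (private)', 'school', 'bank', 'shopping mall']
result = cosine_similarity(a, b)
print(round(result, 7))  # 0.5669467
```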

【Comments】:

Thank you very much for your answer. It looks cool, but at words = list(a_vals.keys() | b_vals.keys()) the interpreter says "TypeError: unsupported operand type(s) for |: 'list' and 'list'". Any ideas?
Sorry, I tested it in Python 3.4. For 2.x you would do words = list(set(a_vals) | set(b_vals)).
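
For reference, the set-based form from the last comment also runs on Python 3, because iterating over a Counter yields its keys. The sketch below uses it and additionally sorts the words so the vector order is reproducible; the sorting is an addition for this example and was not suggested in the thread.

```python
from collections import Counter

a_vals = Counter(['home (private)', 'bank', 'bank',
                  'building(condo/apartment)', 'factory'])
b_vals = Counter(['home (private)', 'school', 'bank', 'shopping mall'])

# set(counter) collects the keys, so the union works on Python 2 and 3;
# sorted() fixes the order, which a plain set does not guarantee
words = sorted(set(a_vals) | set(b_vals))

# A Counter returns 0 for words it has never seen
a_vect = [a_vals[w] for w in words]
b_vect = [b_vals[w] for w in words]

print(words)
print(a_vect)  # [2, 1, 1, 1, 0, 0]
print(b_vect)  # [1, 0, 0, 1, 1, 1]
```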