【Question title】: Calculate cosine similarity between words
【Posted】: 2017-03-19 04:27:06
【Question】:

If we have two lists of strings:

A = "Hello how are you? The weather is fine. I'd like to go for a walk.".split()
B = "bank, weather, sun, moon, fun, hi".split(",")

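The words in list A form the basis of my word vectors. How can I calculate a cosine similarity score for each word in B?

What I have done so far: I can calculate the cosine similarity of two complete lists with the following function: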

import math

def counter_cosine_similarity(c1, c2):
    # c1 and c2 are collections.Counter objects mapping terms to counts
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
    return dotprod / (magA * magB)
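
For reference, a minimal sketch of how this function could be called on the two lists, assuming each list is first converted to a `collections.Counter` and that the leading spaces left behind by `split(",")` are stripped. The names `tokens_a`, `tokens_b` and `similarity` are illustrative and not part of the original post.

```python
import math
from collections import Counter

def counter_cosine_similarity(c1, c2):
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0) ** 2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0) ** 2 for k in terms))
    return dotprod / (magA * magB)

tokens_a = "Hello how are you? The weather is fine. I'd like to go for a walk.".split()
# Assumption: strip whitespace so that " weather" becomes "weather"
tokens_b = [w.strip() for w in "bank, weather, sun, moon, fun, hi".split(",")]

# One score for the two lists as a whole, based on shared tokens
similarity = counter_cosine_similarity(Counter(tokens_a), Counter(tokens_b))
print(similarity)
```

Only "weather" appears in both lists, so the score is small; this measures overlap between the lists, not between individual words.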

But how do I incorporate my vector basis, and how do I calculate the similarity for the individual terms in B?

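【Comments】:

  • What do you mean by "calculate a cosine similarity score for each word in B"? As you can see from the parameters of counter_cosine_similarity, similarity is defined between two vectors, so I assume you want it between two words. Do you want the similarity of each pair of words, one from A and one from B?

Tags: python cosine-similarity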


【Solution 1】:
import math
from collections import Counter

ListA = "Hello how are you? The weather is fine. I'd like to go for a walk.".split()
# Split on ", " so the entries do not keep a leading space
ListB = "bank, weather, sun, moon, fun, hi".split(", ")

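def cosdis(v1, v2):
    # v1 and v2 are (Counter, character set, vector length) tuples from word2vec();
    # only characters present in both words contribute to the dot product.
    common = v1[1].intersection(v2[1])
    return sum(v1[0][ch] * v2[0][ch] for ch in common) / v1[2] / v2[2]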

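def word2vec(word):
    # Represent the word as a vector of character counts
    cw = Counter(word)
    # Distinct characters, used to find the overlap between two words
    sw = set(cw)
    # Euclidean length of the count vector
    lw = math.sqrt(sum(c * c for c in cw.values()))
    return cw, sw, lw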

def removePunctuations(str_input):
    # Raw string, because "\," is not a valid escape sequence in a normal literal
    punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    return "".join(char for char in str_input if char not in punctuations)


for i in ListA:
    for j in ListB:
        print(cosdis(word2vec(removePunctuations(i)), word2vec(removePunctuations(j))))
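
Note that the code above compares words as bags of characters, not by meaning: each word becomes a vector of character counts. The following is a compact sketch of the same idea that reports, for every word in B, the closest word in A. The helpers `char_cosine` and `clean` are hypothetical names introduced here, and `clean` assumes that removing everything in `string.punctuation` (a slightly larger set than the one in the answer) and stripping whitespace is acceptable.

```python
import math
import string
from collections import Counter

def char_cosine(word1, word2):
    """Cosine similarity between the character-count vectors of two words."""
    c1, c2 = Counter(word1), Counter(word2)
    # Only characters that occur in both words contribute to the dot product
    dot = sum(c1[ch] * c2[ch] for ch in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(n * n for n in c1.values()))
    norm2 = math.sqrt(sum(n * n for n in c2.values()))
    if norm1 == 0 or norm2 == 0:
        # An empty word has no direction; treat it as dissimilar to everything
        return 0.0
    return dot / (norm1 * norm2)

def clean(word):
    """Remove punctuation characters and surrounding whitespace."""
    return word.translate(str.maketrans("", "", string.punctuation)).strip()

list_a = [clean(w) for w in
          "Hello how are you? The weather is fine. I'd like to go for a walk.".split()]
list_b = [clean(w) for w in "bank, weather, sun, moon, fun, hi".split(",")]

for b in list_b:
    # Pick the word from list_a whose character counts are closest to b
    best = max(list_a, key=lambda a: char_cosine(a, b))
    print(f"{b!r}: closest is {best!r} ({char_cosine(best, b):.3f})")
```

With this measure, "weather" scores 1.0 against "weather", while "sun" and "fun" score 2/3 against each other because they share two of their three letters, regardless of meaning.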
