【Question title】:Return 'similar score' based on two dictionaries' similarity in Python?
【Posted】:2021-05-20 23:52:24
【Question description】:

I know that the following function returns how similar two strings are:

from difflib import SequenceMatcher
def similar(a, b):
    output=SequenceMatcher(None, a, b).ratio()
    return output

In [37]: similar("Hey, this is a test!","Hey, man, this is a test, man.")
Out[37]: 0.76
In [38]: similar("This should be one.","This should be one.")
Out[38]: 1.0

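But is it possible to score two dictionaries based on the similarity of their keys and the corresponding values? Not the number of keys in common, nor a list of what they have in common, but a score from 0 to 1, just like the string example above.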

I am trying to find the similarity score between ratings['Shane'] and ratings['Joe'] in this dictionary:

ratings={'Shane': {'127 Hours': 3.0, 'Avatar': 4.0, 'Nonstop': 5.0}, 'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}

I am using Python 2.7.10.

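【Comments】:

  • So what is the expected output? The number of common keys (for example, the keys are the same except for Taken 3)? Or the actual values? And what about multi-level dictionaries?
  • Have a look at en.m.wikipedia.org/wiki/Jaccard_index, a fairly simple tool based on sets.
  • The result will depend on your metric.
  • @fodma1 I was hoping to find something that takes everything into account.
  • Do you want some kind of correlation coefficient?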
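
The correlation coefficient suggested in the last comment could look like the minimal sketch below. It is not from the thread: the function name pearson_on_common_keys is hypothetical, and restricting the calculation to movies that both people rated is an assumption. Pearson correlation ranges from -1 to 1 rather than 0 to 1, so a rescaling such as (r + 1) / 2 would be needed if a 0..1 score is required.

```python
from math import sqrt

def pearson_on_common_keys(d1, d2):
    """Pearson correlation of the values for keys present in both dicts.

    Hypothetical helper: returns None when the correlation is undefined.
    """
    common = sorted(set(d1) & set(d2))
    n = len(common)
    if n < 2:
        return None  # fewer than two shared keys: correlation is undefined
    xs = [float(d1[k]) for k in common]
    ys = [float(d2[k]) for k in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant rating vector has no defined correlation
    return cov / sqrt(var_x * var_y)

ratings = {'Shane': {'127 Hours': 3.0, 'Avatar': 4.0, 'Nonstop': 5.0},
           'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}

r = pearson_on_common_keys(ratings['Shane'], ratings['Joe'])
print(r)  # negative here: on the shared movies the two sets of ratings move in opposite directions
```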

Tags: python dictionary similarity


【Solution 1】:
import math

ratings={'Shane': {'127 Hours': 3.0, 'Avatar': 4.0, 'Nonstop': 5.0}, 'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}

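def cosine_similarity(vec1, vec2):
    # Pairs the values purely by position, over the first len(vec1) entries.
    # Both lists must therefore already be in the same key order,
    # and vec1 must not be longer than vec2.
    sum11, sum12, sum22 = 0, 0, 0
    for i in range(len(vec1)):
        x = vec1[i]
        y = vec2[i]
        sum11 += x * x
        sum22 += y * y
        sum12 += x * y
    return sum12 / math.sqrt(sum11 * sum22)

# The values come out in each dictionary's own iteration order; keys are not matched up here.
list1 = list(ratings['Shane'].values())
list2 = list(ratings['Joe'].values())

sim = cosine_similarity(list1, list2)
print(sim)

Output:

0.9205746178983233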

Update: when I use:

ratings={'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
         'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}

Output: 0.9574271077563381
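
Because the snippet above pairs values by list position, its result depends on the order in which each dictionary yields its values. A minimal sketch that matches values by key instead, restricted to the movies both people rated, might look like this. The function name cosine_on_common_keys and the choice to return 0.0 when nothing is shared are assumptions, not part of the original answer.

```python
import math

def cosine_on_common_keys(d1, d2):
    """Cosine similarity over the keys present in both dicts (hypothetical helper)."""
    common = set(d1) & set(d2)
    if not common:
        return 0.0  # assumption: no shared keys means no similarity
    dot = sum(d1[k] * d2[k] for k in common)
    norm1 = math.sqrt(sum(d1[k] ** 2 for k in common))
    norm2 = math.sqrt(sum(d2[k] ** 2 for k in common))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm1 * norm2)

joe = {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}
shane_original = {'127 Hours': 3.0, 'Avatar': 4.0, 'Nonstop': 5.0}
shane_updated = {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0}

sim_original = cosine_on_common_keys(shane_original, joe)
sim_updated = cosine_on_common_keys(shane_updated, joe)
print(sim_original)  # about 0.9206, the same as the first output above
print(sim_updated)   # about 0.9615 when the values are matched by key
```

Ignoring movies that only one person rated is a design choice; it measures agreement on shared movies only and says nothing about how much the two lists overlap.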

Update 2: normalizing the vector lengths and taking the keys into account

from math import sqrt


ratings={'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
         'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0},
         'Bob': {'Panic Room':5.0,'Nonstop':5.0}}


def square_rooted(x):
    # Euclidean norm of x, rounded to 3 decimal places
    return round(sqrt(sum(a * a for a in x)), 3)


def cosine_similarity(x, y):
    # Use the dict with more keys as the reference (input1); on a tie, y is the reference.
    if len(x) > len(y):
        input1, input2 = x, y
    else:
        input1, input2 = y, x

    vector1 = list(input1.values())

    # Build vector2 in the same key order as vector1.
    # Keys missing from input2 get 0; keys that appear only in input2 are ignored.
    vector2 = [float(input2.get(k, 0)) for k in input1]

    numerator = sum(a * b for a, b in zip(vector2, vector1))
    denominator = square_rooted(vector1) * square_rooted(vector2)
    return round(numerator / float(denominator), 3)


print("Similarity between Shane and Joe")
print(cosine_similarity(ratings['Shane'], ratings['Joe']))

print("Similarity between Joe and Bob")
print(cosine_similarity(ratings['Joe'], ratings['Bob']))

print("Similarity between Shane and Bob")
print(cosine_similarity(ratings['Shane'], ratings['Bob']))

Output:

Similarity between Shane and Joe
0.853
Similarity between Joe and Bob
0.346
Similarity between Shane and Bob
0.615

A good explanation of the differences between Jaccard and cosine similarity: https://datascience.stackexchange.com/questions/5121/applications-and-differences-for-jaccard-similarity-and-cosine-similarity

I am using Python 3.4.

Note: I have assigned 0 to missing values, but you could also assign some other appropriate value. Reference: http://www.analyticsvidhya.com/blog/2015/02/7-steps-data-exploration-preparation-building-model-part-2/
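
As a rough illustration of the note above, the sketch below builds both vectors over the union of the keys and makes the value used for missing ratings a parameter. The function name cosine_similarity_union and the fill parameter are hypothetical and not part of the original answer. Unlike the code above, keys that appear only in the smaller dictionary (such as Bob's 'Panic Room') also count toward that dictionary's norm, so the numbers can differ from the outputs listed earlier.

```python
from math import sqrt

def cosine_similarity_union(d1, d2, fill=0.0):
    """Cosine similarity over the union of keys; missing values become `fill`."""
    keys = sorted(set(d1) | set(d2))
    v1 = [float(d1.get(k, fill)) for k in keys]
    v2 = [float(d2.get(k, fill)) for k in keys]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = sqrt(sum(a * a for a in v1))
    norm2 = sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an empty or all-zero vector is similar to nothing
    return dot / (norm1 * norm2)

ratings = {'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
           'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0},
           'Bob': {'Panic Room': 5.0, 'Nonstop': 5.0}}

print(cosine_similarity_union(ratings['Shane'], ratings['Joe']))
print(cosine_similarity_union(ratings['Shane'], ratings['Bob']))
# Example of a different fill value: treat an unrated movie as a neutral 2.5
print(cosine_similarity_union(ratings['Shane'], ratings['Bob'], fill=2.5))
```

Working over the union of keys makes the function symmetric in its two arguments, which is usually what is wanted from a similarity score.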

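【Comments】:

  • Aren't you ignoring the values and looking only at the keys? Or am I missing something?
  • @JLPeyret, ah, cool, I changed it to use the values. Simple, right?
  • "can't multiply sequence by non-int of type 'str'" is what I keep getting on the line "sum11 += x*x".
  • When I change the dictionary to this: ratings={'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0}, 'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}, it says they are exactly the same and outputs "1.0", even though they are not. I only changed the 127 Hours key from 3.0 to 5.0.
  • @TrivisionZero But that doesn't make sense. If I rate one movie 5 and you rate a different movie 5, there should be no similarity (unless you are comparing how enthusiastic we are).
【Solution 2】: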

https://en.m.wikipedia.org/wiki/Jaccard_index

Now for some cleaned-up sample code.

def jac(s1,s2):
    """the jaccard index between 2 sets"""
    s_union = s1.union(s2)
    s_inter = s1.intersection(s2)

    len_union = len(s_union)
    if not len_union:
        return 0

    return len(s_inter)*1.0/len_union

from itertools import permutations

ratings={'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
     'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0},
     'Bob': {'Panic Room':5.0,'Nonstop':5.0}}

def common_movie(dict0, dict1):
    """have we given the same movies the same ratings? (compares (movie, rating) pairs)"""
    set0 = set(dict0.items())
    set1 = set(dict1.items())
    return jac(set0, set1)

def movies_and_ratings(dict0, dict1):
    """how do our movies and ratings line up?"""

    set_keys0 = set(dict0.keys())
    set_keys1 = set(dict1.keys())

    key_commonality = jac(set_keys0, set_keys1)

    set0 = set(dict0.items())
    set1 = set(dict1.items())

    item_commonality = jac(set0, set1)

    # ok, so now we give some credit for a key match, even if key + data don't match
    return 0.3 * key_commonality + 0.7 * item_commonality

def common_movie_ratings(dict0, dict1):
    """how do our ratings correspond on the same movies? (compares the sets of rating values, not per-movie pairs)"""

    set_keys0 = set(dict0.keys())
    set_keys1 = set(dict1.keys())

    set_common = set_keys0.intersection(set_keys1)

    set0 = set([v for k, v in dict0.items() if k in set_common])
    set1 = set([v for k, v in dict1.items() if k in set_common])

    return jac(set0, set1)

for pair in permutations(ratings.keys(), 2):

    dict0, dict1 = ratings[pair[0]], ratings[pair[1]]
    # print(...) with a single string argument works on both Python 2 and Python 3
    print("\n %s vs %s" % pair)

    # make no assumption about the key/value order coming out of a
    # dictionary, so sort the items before printing them
    print("  %s" % sorted(dict0.items()))
    print("  %s" % sorted(dict1.items()))

    print("     common_movie    :%s" % common_movie(dict0, dict1))
    print("     movies_and_ratings:%s" % movies_and_ratings(dict0, dict1))
    print("     common_movie_ratings  :%s" % common_movie_ratings(dict0, dict1))

Output:

 Shane vs Bob
  [('127 Hours', 5.0), ('Avatar', 4.0), ('Nonstop', 5.0)]
  [('Nonstop', 5.0), ('Panic Room', 5.0)]
     common_movie    :0.25
     movies_and_ratings:0.25
     common_movie_ratings  :1.0

 Shane vs Joe
  [('127 Hours', 5.0), ('Avatar', 4.0), ('Nonstop', 5.0)]
  [('127 Hours', 5.0), ('Avatar', 5.0), ('Nonstop', 3.0), ('Taken 3', 4.0)]
     common_movie    :0.166666666667
     movies_and_ratings:0.341666666667
     common_movie_ratings  :0.333333333333

 Bob vs Shane
  [('Nonstop', 5.0), ('Panic Room', 5.0)]
  [('127 Hours', 5.0), ('Avatar', 4.0), ('Nonstop', 5.0)]
     common_movie    :0.25
     movies_and_ratings:0.25
     common_movie_ratings  :1.0

 Bob vs Joe
  [('Nonstop', 5.0), ('Panic Room', 5.0)]
  [('127 Hours', 5.0), ('Avatar', 5.0), ('Nonstop', 3.0), ('Taken 3', 4.0)]
     common_movie    :0.0
     movies_and_ratings:0.06
     common_movie_ratings  :0.0

 Joe vs Shane
  [('127 Hours', 5.0), ('Avatar', 5.0), ('Nonstop', 3.0), ('Taken 3', 4.0)]
  [('127 Hours', 5.0), ('Avatar', 4.0), ('Nonstop', 5.0)]
     common_movie    :0.166666666667
     movies_and_ratings:0.341666666667
     common_movie_ratings  :0.333333333333

 Joe vs Bob
  [('127 Hours', 5.0), ('Avatar', 5.0), ('Nonstop', 3.0), ('Taken 3', 4.0)]
  [('Nonstop', 5.0), ('Panic Room', 5.0)]
     common_movie    :0.0
     movies_and_ratings:0.06
     common_movie_ratings  :0.0

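【Comments】:

  • This doesn't seem to work in my case... "AttributeError: 'set' object has no attribute 'intersect'"
  • I'm not at a computer, so you may need to check the syntax. I've corrected intersect to intersection. But I've used Jaccard before, and it gives very nice scores in the 0..1 range.
  • Now it gives an "unsupported operand type(s) for /: 'set' and 'set'" error in the division part.
  • Argh. Sorry about that. You need to look at the sizes. I've added len to both. My bad.
  • @JLPeyret Integer division... you need to specify the Python version.
【Solution 3】: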
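
To illustrate the integer-division point from the last comment: on Python 2, dividing two ints with / truncates, so a Jaccard index computed as len(...) / len(...) would come out as 0 for any partial overlap. A minimal sketch that behaves the same on Python 2 and Python 3 is shown below; the function name jaccard is hypothetical, and converting with float() is just one option (putting from __future__ import division at the top of the module is another).

```python
def jaccard(s1, s2):
    """Jaccard index of two sets, as a float on both Python 2 and Python 3."""
    union = s1 | s2
    if not union:
        return 0.0  # both sets empty: mirror the answer's choice of returning 0
    # float() forces true division even where / on two ints would truncate
    return float(len(s1 & s2)) / len(union)

shane_movies = {'127 Hours', 'Avatar', 'Nonstop'}
joe_movies = {'127 Hours', 'Taken 3', 'Avatar', 'Nonstop'}
print(jaccard(shane_movies, joe_movies))  # 3 shared titles out of 4 distinct titles
```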

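This is an implementation of the Jaccard similarity described in the Data Science Stack Exchange post mentioned above.

Suppose you have the output of a Counter from the collections library, which counts how many times each key appears in an iterable, like this: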

d1 = {'a': 2, 'b': 1}
d2 = {'a': 1, 'c': 1}

def get_jaccard_similarity(d1, d2):
    """Weighted Jaccard similarity between two dicts of non-negative counts."""
    if not isinstance(d1, dict) or not isinstance(d2, dict):
        raise TypeError('d1 and d2 should be of type dict'
                        f' and not {type(d1).__name__}, {type(d2).__name__}')
    if not d1 and not d2:
        return 1  # two empty dicts are treated as identical
    if not d1 or not d2:
        return 0  # exactly one is empty: nothing in common
    all_keys = set(d1) | set(d2)
    # Per key, the shared count is the minimum and the total count is the maximum.
    common = sum(min(d1.get(k, 0), d2.get(k, 0)) for k in all_keys)
    total = sum(max(d1.get(k, 0), d2.get(k, 0)) for k in all_keys)
    return common / total

print(get_jaccard_similarity(d1, d2))

Output: 0.25

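The Data Science Stack Exchange post derives Jaccard similarity from the notion of sets. I believe this implementation gives the same result for sets (dictionaries whose values all equal 1), except that it additionally weights each key by the number of times it appears in the two (Counter) dictionaries.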
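
Since the inputs are described as Counter output, the same min/max weighting can probably be expressed with Counter's own operators: & keeps the minimum count per key and | keeps the maximum. The sketch below is an alternative formulation, not the answer's code; the function name weighted_jaccard is hypothetical, and it assumes the counts are non-negative, because Counter's & and | drop entries whose result is not positive.

```python
from collections import Counter

def weighted_jaccard(c1, c2):
    """Weighted Jaccard similarity using Counter intersection (&) and union (|)."""
    c1, c2 = Counter(c1), Counter(c2)
    union_total = sum((c1 | c2).values())   # sum of per-key maxima
    if union_total == 0:
        return 1.0  # assumption: two empty inputs count as identical, as in the function above
    inter_total = sum((c1 & c2).values())   # sum of per-key minima
    return inter_total / float(union_total)

d1 = {'a': 2, 'b': 1}
d2 = {'a': 1, 'c': 1}
print(weighted_jaccard(d1, d2))  # 1 shared count out of 4 total, i.e. 0.25

# With all counts equal to 1 this reduces to the ordinary set-based Jaccard index
print(weighted_jaccard(Counter('abc'), Counter('bcd')))  # 2 shared of 4 distinct, i.e. 0.5
```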

【Comments】:
