【Question title】: How to find similarity between strings in lists in Python
【Posted】: 2019-09-25 00:28:39
【Question description】:

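I am comparing two DataFrame columns in Python. The goal is to find, for each element of the first column, the best match in the second column. The first column contains 19,000 rows, and for each of its strings I need to determine the best match in the second column. That means 19,000 rows, each checked against 19,000 candidates, with the constraint that the match must be a different string, not the string itself.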

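I started with a simple comparison, finding the best match for a single string in a list, and that worked. I then passed a list as the query, but because that compares a list (rather than a string) against a list, it raises "TypeError: expected string or bytes-like object". Finally, I tried a loop, but got the same error. Is there a way to build a list with the expected results? There may be a better approach using another library, but so far I have not found one. Here is the code so far: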

#simple example
from fuzzywuzzy import process
string = "appl"
compare = ["adfad.","apple","asple","tab"]
Ratios = process.extract(string,compare)
print(Ratios)
[('apple', 89), ('asple', 67), ('tab', 29), ('adfad.', 22)]

highest = process.extractOne(string,compare)
print(highest)
('apple', 89)

#data frame
from fuzzywuzzy import process
dataframecolumn = ["appl","tb"]
compare = ["adfad.","apple","asple","tab"]
Ratios = process.extract(dataframecolumn,compare)
TypeError: expected string or bytes-like object

#expected (but I need a list)
highest = process.extractOne(dataframecolumn[0],compare)
print(highest)
('apple', 89)
highest = process.extractOne(dataframecolumn[1],compare)
print(highest)
('tab', 80)

#Result expected
results = ["apple, 89","tab, 80"]

#Error
myl = ["appl","tb"]
compare = ["adfad.","apple","asple","tab"]
results = []
for x in myl:
    results.append(process.extractOne(myl,compare)[1])
TypeError: expected string or bytes-like object

【Question discussion】:

    Tags: python string matching similarity


    【Solution 1】:
    from operator import itemgetter
    from fuzzywuzzy import process

    dataframecolumn = ["appl","tb"]
    compare = ["adfad.","apple","asple","tab"]
    # Call extract once per query string, not once on the whole list
    Ratios = [process.extract(x,compare) for x in dataframecolumn]
    # Keep the highest-scoring (match, score) tuple for each query
    print([max(ratios, key=itemgetter(1)) for ratios in Ratios])

    # Or as a one-liner
    #Ratios = [max(process.extract(x,compare),key = itemgetter(1)) for x in dataframecolumn]
    

    If extract always returns its results sorted by score (highest first), the call to max can be avoided:

    Ratios = [process.extract(x, compare)[0] for x in dataframecolumn]
    

    Output:

    [('apple', 89), ('tab', 80)]
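
The same per-element pattern can also be sketched with the standard library alone, for readers who want to try it without fuzzywuzzy installed. This is a stand-in, not the answer's method: `similarity` and `best_matches` are hypothetical helper names, and the score comes from `difflib.SequenceMatcher`, so values may differ from fuzzywuzzy's default scorer on other inputs. The result is formatted as "match, score" strings, the shape the question asked for.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Hypothetical scorer: SequenceMatcher ratio scaled to an integer 0-100.
    return round(100 * SequenceMatcher(None, a, b).ratio())

def best_matches(queries, choices, scorer=similarity):
    # Score each query string against every choice and keep the top-scoring one.
    results = []
    for query in queries:
        best = max(choices, key=lambda choice: scorer(query, choice))
        results.append(f"{best}, {scorer(query, best)}")
    return results

dataframecolumn = ["appl", "tb"]
compare = ["adfad.", "apple", "asple", "tab"]
results = best_matches(dataframecolumn, compare)
print(results)
```

The key point is the same as in the answer: the scorer receives one string at a time, never the whole list. Because `scorer` is a parameter, fuzzywuzzy's `fuzz.ratio`, which takes two strings and returns an integer score, could be passed in its place.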

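    If you want to skip exact matches and get only fuzzy matches, skip every match with a score of 100 and take the first match below 100, since the results are already sorted.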

    dataframecolumn = ["apple","tb"]
    compare = ["adfad","apple","asple","tab"]
    Ratios = [process.extract(x,compare) for x in dataframecolumn]
    result = []
    for ratio in Ratios:
        # Matches are sorted by score, so take the first one that is not exact
        for match in ratio:
            if match[1] != 100:
                result.append(match)
                break
    print(result)
    

    【Discussion】:

    • What if I want the second result? The idea is to compare a column against itself. For example, dataframecolumn = ["apple","tb"], compare = ["adfad.","apple","asple","tab"] should give "asple".
    • Ratios = [process.extract(x, compare)[1] for x in dataframecolumn] outputs [('asple', 67), ('adfad.', 0)], but it should be [('asple', 67), ('tab', 80)]
    • @ecp If you want to skip exact matches, just skip the scores of 100. Check the update.
    • In some cases the solution does not return a result for every element. For example: from fuzzywuzzy import process dataframecolumn = ["apple","tb"] compare = ["apple","apple"] Ratios = [process.extract(x,compare) for x in dataframecolumn] result = list() for ratio in Ratios: for match in ratio: if match[1] != 100: result.append(match) break print (result)
    • When there are duplicates, "apple" (the 2nd one) should be returned in the given example. How can I do that? Thanks again!!
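
One possible way to handle the duplicate case raised in the last two comments is to exclude candidates by position instead of by score, so that a second identical "apple" still counts as a valid match. The sketch below is an illustration under assumptions, not part of the original answer: `similarity` and `best_other_match` are hypothetical helpers, and the scorer is the standard library's `difflib.SequenceMatcher` rather than fuzzywuzzy, so scores may differ from those shown earlier in the thread.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Hypothetical scorer: SequenceMatcher ratio scaled to an integer 0-100.
    return round(100 * SequenceMatcher(None, a, b).ratio())

def best_other_match(values, scorer=similarity):
    # For each position i, return the best (match, score) among all OTHER
    # positions. Only the element's own index is skipped, so an identical
    # string stored at a different index remains a candidate.
    results = []
    for i, query in enumerate(values):
        candidates = [(other, scorer(query, other))
                      for j, other in enumerate(values) if j != i]
        if candidates:
            results.append(max(candidates, key=lambda pair: pair[1]))
        else:
            results.append(None)  # a single-element column has nothing to match
    return results

column = ["apple", "apple", "asple", "tab"]
matches = best_other_match(column)
print(matches)
```

Filtering on the index rather than on a score of 100 is what lets the duplicate through. Note that this compares every pair, so a 19,000-row column needs roughly 361 million scorer calls and may take a long time to run.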