【Question title】: Fuzzy matching inside a column
【Posted】: 2019-04-04 13:54:27
【Question description】:

Suppose I have a list of sports like this:

sports=["futball","fitbal","football","tennis","tenis","tenisse","footbal","zennis","ping-pong"]

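I want to build a DataFrame that matches each element of sports to its closest element, provided the fuzzy match scores better than 0.5 (that is, above 50 on the 0-100 scale that fuzz.ratio returns) and is not just a match with itself. (I would like to use the function fuzzywuzzy.fuzz.ratio(x, y) for this.)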

The result should look like this:

pd.DataFrame({"sport":sports,"closest_match":["futball","futball","football","tennis","tennis","tennis","futball","tennis","ping-pong"]})

    sport      closest_match
0   futball    futball
1   fitbal     futball
2   football   football
3   tennis     tennis
4   tenis      tennis
5   tenisse    tennis
6   footbal    futball
7   zennis     tennis
8   ping-pong  ping-pong

Thanks

【Question comments】:

    Tags: python pandas fuzzy fuzzywuzzy


    【Solution 1】:

    Here is a solution using itertools.combinations:

    import itertools

    import pandas as pd
    from fuzzywuzzy import fuzz

    sports = ["futball", "fitbal", "football", "tennis", "tenis", "tenisse", "footbal", "zennis", "ping-pong"]

    # Score every unordered pair once; fuzz.ratio returns an integer from 0 to 100.
    pairs = [(a, b, fuzz.ratio(a, b)) for a, b in itertools.combinations(sports, 2)]
    # Keep only the pairs that score above 50.
    dist = [p for p in pairs if p[2] > 50]

    df = pd.DataFrame(dist, columns=["sport", "closest", "ratio"])

    # Note: agg('max') takes the maximum of each column independently, so the
    # 'closest' and 'ratio' reported for a sport may come from different pairs.
    df = df.groupby(['sport'])[['closest', 'ratio']].agg('max').reset_index()
    print(df)
    

    Output:

          sport   closest  ratio
    0    fitbal  football     77
    1  football   footbal     93
    2   futball  football     80
    3     tenis    zennis     83
    4   tenisse    zennis     62
    5    tennis    zennis     91
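
    Because agg('max') is applied to each column separately, the closest value and the ratio shown for a sport are not guaranteed to belong to the same pair. If the goal is the single best-scoring partner per sport, one possible approach is to select whole rows with groupby(...).idxmax(). The sketch below is only an illustration: the frame and its scores are made-up placeholders, not real fuzz.ratio output.

```python
import pandas as pd

# Toy pair table; names and ratio values are illustrative placeholders only.
pairs = pd.DataFrame(
    {
        "sport": ["a", "a", "b", "b"],
        "closest": ["x", "y", "z", "w"],
        "ratio": [60, 90, 75, 55],
    }
)

# idxmax gives, for each sport, the index label of its highest-ratio row,
# so 'closest' and 'ratio' are taken from the same row.
best_rows = pairs.groupby("sport")["ratio"].idxmax()
best = pairs.loc[best_rows].reset_index(drop=True)
print(best)
```

    The same two lines could be applied to the pair table built in the answer before it is aggregated.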
    

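    【Comments】:

    • Thanks! But I would like to end up with only one modality; it looks like a clustering task...
    • What do you mean by only one modality?
    • One modality for football, for example, one for tennis, one for ping-pong. Here there are still zennis, tenisse, tenis, etc.
    • The best score?
    • mmmh yes, I think that's a good idea, since it reduces the number of modalities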
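
    The clustering idea raised in these comments could be sketched roughly as follows: link every pair of words that scores above the threshold, treat each connected group as one cluster, and label every word with one representative of its group. This is a minimal sketch under stated assumptions, not the answer's method. The helper names (similarity, cluster_labels) are hypothetical, the scorer is a difflib-based stand-in so the example runs without fuzzywuzzy (fuzz.ratio can be passed as scorer instead), and the rule for picking a representative is an arbitrary choice.

```python
import itertools
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in scorer on a 0-100 scale (assumption); fuzz.ratio can replace it."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def cluster_labels(words, scorer=similarity, threshold=50):
    """Map each word to a representative of its connected group of similar words."""
    parent = {w: w for w in words}

    def find(w):
        # Follow parent links to the root of w's group, shortening the path on the way.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    # Merge the groups of every pair that scores above the threshold.
    for a, b in itertools.combinations(words, 2):
        if scorer(a, b) > threshold:
            parent[find(a)] = find(b)

    # Collect the members of each group under its root.
    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)

    # Representative (hypothetical rule): the member with the highest
    # total score against the other members of its group.
    labels = {}
    for members in groups.values():
        rep = max(members, key=lambda m: sum(scorer(m, o) for o in members if o != m))
        for m in members:
            labels[m] = rep
    return labels


sports = ["futball", "fitbal", "football", "tennis", "tenis", "tenisse",
          "footbal", "zennis", "ping-pong"]
labels = cluster_labels(sports)
for sport in sports:
    print(sport, "->", labels[sport])
```

    Which spelling ends up as the representative depends on the scorer and on the selection rule, so a different rule (for example, the most frequent spelling in the real data) may suit the task better. The resulting mapping can be attached to a DataFrame column with Series.map(labels).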