Pandas dataframe or SQLite fuzzy search: answers

【Question title】: Pandas dataframe or SQLite fuzzy search
【Posted】: 2021-03-08 07:28:36
【Question description】:

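I am scraping several sports betting websites in order to compare the odds each site offers for every match.

My question is how to identify the match_id of a match that already exists in the database when the team names are written differently.
Feel free to suggest any approach, even one that does not use dataframes or SQLite.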

The columns of the matches table are:
match_id: int, sport: string, home_team: string, away_team: string, date: string (dd/mm/YYYY)

So, for each new match, I want to check whether it already exists in the database.
New match = (sport_to_check, home_team_to_check, away_team_to_check, date_to_check)
My pseudocode looks like this:

    SELECT match_id FROM matches
    WHERE sport = (sport_to_check)
    AND date = (date_to_check)
    AND (fuzz(home_team, home_team_to_check) > 80 OR fuzz(away_team, away_team_to_check) > 80) -- the fuzzy score evaluation

If no match is found, a new row is inserted.

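I believe there is no way to mix Python and SQL like this, which is why I call it "pseudocode". I could also pull the matches table into a Pandas dataframe and evaluate it there, if that is feasible (how?...).
At any given time, the matches table will hold no more than a few thousand records.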
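
As a side note on mixing Python and SQL: the standard-library sqlite3 module lets a Python function be registered with `Connection.create_function` and then called from a query, so the pseudocode above can be approximated quite closely. The following is only a minimal sketch under some assumptions: `fuzz` here is a `difflib` stand-in rather than a fuzzywuzzy scorer (so the example runs without extra packages), the threshold of 60 is an arbitrary value that would need tuning, and `find_match_id` and `find_or_insert` are hypothetical helper names.

```python
import sqlite3
from difflib import SequenceMatcher


def fuzz(a, b):
    """Similarity from 0 to 100. difflib stand-in, NOT a fuzzywuzzy scorer."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


conn = sqlite3.connect(":memory:")
# Register the Python function so that SQL can call it as fuzz(x, y).
conn.create_function("fuzz", 2, fuzz)

conn.execute(
    "CREATE TABLE matches (match_id INTEGER PRIMARY KEY, sport TEXT, "
    "home_team TEXT, away_team TEXT, date TEXT)"
)
conn.executemany(
    "INSERT INTO matches VALUES (?, ?, ?, ?, ?)",
    [
        (84, "football", "confianca", "cuiaba esporte clube", "24/11/2020"),
        (385, "football", "stevenage", "hull", "29/11/2020"),
    ],
)

# Same shape as the pseudocode, plus an ORDER BY so that the
# best-scoring candidate wins when several rows pass the threshold.
FIND_SQL = """
SELECT match_id FROM matches
WHERE sport = :sport AND date = :date
  AND (fuzz(home_team, :home) > :threshold
       OR fuzz(away_team, :away) > :threshold)
ORDER BY fuzz(home_team, :home) + fuzz(away_team, :away) DESC
LIMIT 1
"""


def find_match_id(conn, sport, home, away, date, threshold=60):
    """Return the match_id of the best fuzzy match, or None."""
    params = {"sport": sport, "date": date, "home": home,
              "away": away, "threshold": threshold}
    row = conn.execute(FIND_SQL, params).fetchone()
    return row[0] if row else None


def find_or_insert(conn, sport, home, away, date, threshold=60):
    """Return an existing match_id, inserting a new row if none is found."""
    match_id = find_match_id(conn, sport, home, away, date, threshold)
    if match_id is None:
        cur = conn.execute(
            "INSERT INTO matches (sport, home_team, away_team, date) "
            "VALUES (?, ?, ?, ?)",
            (sport, home, away, date),
        )
        match_id = cur.lastrowid
    return match_id
```

Calling `find_or_insert(conn, "football", "confianca-se", "cuiaba mt", "24/11/2020")` would look up row A from the examples in this question. With only a few thousand rows, and the sport and date filters applied first, calling a Python function per candidate row should be cheap enough.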

Let me give some examples of the expected output, where the solution is denoted by "find(row)".
The matches table in the database is:

    +----------+------------+-----------------------------+----------------------+------------+
    | match_id | sport      | home_team                   | visitor_team         | date       |
    +----------+------------+-----------------------------+----------------------+------------+
    | 84       | football   | confianca                   | cuiaba esporte clube | 24/11/2020 |
    | 209      | football   | cs alagoana                 | operario pr          | 24/11/2020 |
    | 184      | football   | grenoble foot 38            | as nancy lorraine    | 24/11/2020 |
    | 7        | football   | sv turkgucu-ataspor munchen | saarbrucken          | 24/11/2020 |
    | 414      | handball   | dinamo bucareste            | usam nimes           | 24/11/2020 |
    | 846      | handball   | benidorm                    | naturhouse la rioja  | 25/11/2020 |
    | 874      | handball   | cegledi                     | ferencvarosi tc      | 25/11/2020 |
    | 418      | handball   | lemvig-thyboron             | kif kolding          | 25/11/2020 |
    | 740      | ice hockey | tps                         | kookoo               | 25/11/2020 |
    | 385      | football   | stevenage                   | hull                 | 29/11/2020 |
    +----------+------------+-----------------------------+----------------------+------------+

And the new matches to evaluate:

    +----------------+------------+---------------------+---------------------+------------+
    | row (for demo) | sport      | home_team           | visitor_team        | date       |
    +----------------+------------+---------------------+---------------------+------------+
    | A              | football   | confianca-se        | cuiaba mt           | 24/11/2020 |
    | B              | football   | csa                 | operario            | 24/11/2020 |
    | C              | football   | grenoble            | nancy               | 24/11/2020 |
    | D              | football   | sv turkgucu ataspor | 1 fc saarbrucken    | 24/11/2020 |
    | E              | handball   | dinamo bucuresti    | nimes               | 24/11/2020 |
    | F              | handball   | bm benidorm         | bm logrono la rioja | 25/11/2020 |
    | G              | handball   | cegledi kkse        | ftc budapest        | 25/11/2020 |
    | H              | handball   | lemvig              | kif kobenhavn       | 25/11/2020 |
    | I              | ice hockey | turku ps            | kookoo kouvola      | 25/11/2020 |
    | J              | football   | stevenage borough   | hull city           | 29/11/2020 |
    | K              | football   | west brom           | sheffield united    | 28/11/2020 |
    +----------------+------------+---------------------+---------------------+------------+

Output:

find(A) returns: 84  
find(B) returns: 209  
find(C) returns: 184  
find(D) returns: 7  
find(E) returns: 414  
find(F) returns: 846  
find(G) returns: 874  
find(H) returns: 418  
find(I) returns: 740  
find(J) returns: 385  
find(K) returns: (something like "not found" => I would then insert the new row)

Thanks!

【Question discussion】:

Tags: python sqlite dataframe fuzzy-search fuzzywuzzy

【Solution 1】:

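Basically, I would filter the original table by the given date and sport, then use fuzzywuzzy to find the best match on the home and visitor team names among the remaining rows: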

Setup:

import pandas as pd

cols = ['match_id','sport','home_team','visitor_team','date']

df1 = pd.DataFrame([
['84','football','confianca','cuiaba esporte clube','24/11/2020'],
['209','football','cs alagoana','operario pr','24/11/2020'],
['184','football','grenoble foot 38','as nancy lorraine','24/11/2020'],
['7','football','sv turkgucu-ataspor munchen','saarbrucken','24/11/2020'],
['414','handball','dinamo bucareste','usam nimes','24/11/2020'],
['846','handball','benidorm','naturhouse la rioja','25/11/2020'],
['874','handball','cegledi','ferencvarosi tc','25/11/2020'],
['418','handball','lemvig-thyboron','kif kolding','25/11/2020'],
['740','ice hockey','tps','kookoo','25/11/2020'],
['385','football','stevenage','hull','29/11/2020']], columns=cols)


cols = ['row','sport','home_team','visitor_team','date']

df2 = pd.DataFrame([
['A','football','confianca-se','cuiaba mt','24/11/2020'],
['B','football','csa','operario','24/11/2020'],
['C','football','grenoble','nancy','24/11/2020'],
['D','football','sv turkgucu ataspor','1 fc saarbrucken','24/11/2020'],
['E','handball','dinamo bucuresti','nimes','24/11/2020'],
['F','handball','bm benidorm','bm logrono la rioja','25/11/2020'],
['G','handball','cegledi kkse','ftc budapest','25/11/2020'],
['H','handball','lemvig','kif kobenhavn','25/11/2020'],
['I','ice hockey','turku ps','kookoo kouvola','25/11/2020'],
['J','football','stevenage borough','hull city','29/11/2020'],
['K','football','west brom','sheffield united','28/11/2020']], columns=cols)

Code:

import string

import pandas as pd
from fuzzywuzzy import fuzz


def calculate_ratio(row):
    # token_set_ratio ignores word order and repeated words, which helps
    # when one site adds or drops part of a team name.
    return fuzz.token_set_ratio(row['col1'], row['col2'])


def find(df1, df2, row_search):
    # Look up the new match by its demo label (A, B, ...).
    alpha = df2.query('row == "{row_search}"'.format(row_search=row_search))
    sport = alpha.iloc[0]['sport']
    date = alpha.iloc[0]['date']
    home_team = alpha.iloc[0]['home_team']
    visitor_team = alpha.iloc[0]['visitor_team']

    # Keep only the stored matches with the same sport and date.
    beta = df1.query('sport == "{sport}" & date == "{date}"'.format(sport=sport, date=date))

    if len(beta) == 0:
        return 'Not found.'

    # Compare "home visitor" of every candidate with the new match.
    temp = pd.DataFrame({
        'match_id': list(beta['match_id']),
        'col1': list(beta['home_team'] + ' ' + beta['visitor_team']),
        'col2': [home_team + ' ' + visitor_team] * len(beta),
    })
    temp['score'] = temp.apply(calculate_ratio, axis=1)

    # Return the best-scoring candidate; no minimum score is enforced.
    temp = temp.sort_values('score', ascending=False)
    return temp.head(1).iloc[0]['match_id']


# Rows A to K of df2.
for row_alpha in string.ascii_uppercase[0:11]:
    outcome = find(df1, df2, row_alpha)
    print('{row_alpha} --> {outcome}'.format(row_alpha=row_alpha, outcome=outcome))

Output:

A --> 84
B --> 209
C --> 184
D --> 7
E --> 414
F --> 846
G --> 874
H --> 418
I --> 740
J --> 385
K --> Not found.
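
One caveat about the code above: it reports "Not found." only when no stored match shares the sport and date. Otherwise it returns the best-scoring candidate, however low the score. A minimum score may therefore be worth adding. Below is a rough, dependency-free sketch of that idea. It rests on assumptions: `similarity` is a `difflib` stand-in for `fuzz.token_set_ratio`, `find_with_threshold` is a hypothetical name, plain dicts replace the dataframes, and the cutoff of 60 is arbitrary and should be tuned on real data.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Score from 0 to 100 (difflib stand-in, not a fuzzywuzzy scorer)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def find_with_threshold(matches, new_match, min_score=60):
    """Return the best match_id scoring at least min_score, else None.

    matches and new_match are dicts with the keys sport, home_team,
    visitor_team and date (matches also carry match_id).
    """
    target = "{} {}".format(new_match["home_team"], new_match["visitor_team"])
    best_id, best_score = None, min_score
    for m in matches:
        # Same pre-filter as the answer: sport and date must be equal.
        if m["sport"] != new_match["sport"] or m["date"] != new_match["date"]:
            continue
        score = similarity("{} {}".format(m["home_team"], m["visitor_team"]), target)
        if score >= best_score:
            best_id, best_score = m["match_id"], score
    return best_id


stored = [
    {"match_id": 84, "sport": "football", "home_team": "confianca",
     "visitor_team": "cuiaba esporte clube", "date": "24/11/2020"},
    {"match_id": 209, "sport": "football", "home_team": "cs alagoana",
     "visitor_team": "operario pr", "date": "24/11/2020"},
]

row_a = {"sport": "football", "home_team": "confianca-se",
         "visitor_team": "cuiaba mt", "date": "24/11/2020"}
# A hypothetical unrelated match on the same day as the stored rows.
unrelated = {"sport": "football", "home_team": "west brom",
             "visitor_team": "sheffield united", "date": "24/11/2020"}

print(find_with_threshold(stored, row_a))
print(find_with_threshold(stored, unrelated))
```

Returning None instead of a string makes the "insert a new row" branch a simple `is None` check on the caller's side.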

【Discussion】: