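【Posted】: 2021-03-08 07:28:36
【Question】:
I am scraping several sports betting websites so that I can compare the odds each site offers for every match.
My question is how to identify the match_id of a match that already exists in the database when the team names are written differently.
Feel free to suggest any approach, even one that does not use dataframes or SQLite.
The columns of the matches table are:
match_id: int, sport: string, home_team: string, away_team: string, date: string (dd/mm/YYYY)
So, for each new match, I want to check whether it already exists in the database.
new match = (sport_to_check, home_team_to_check, away_team_to_check, date_to_check)
My pseudocode looks like this: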
SELECT match_id FROM matches
WHERE sport = (sport_to_check)
AND date = (date_to_check)
AND (fuzz(home_team, home_team_to_check) > 80 OR fuzz(away_team, away_team_to_check) > 80) -- fuzzy score evaluation
If no match is found, insert a new row.
I believe there is no way to mix Python and SQL like this, which is why I call it "pseudocode".
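For what it is worth, Python's standard sqlite3 module can expose a Python function to SQL through Connection.create_function, so the pseudocode above is closer to runnable than it looks. The sketch below is only an illustration under assumptions: fuzz_ratio is a hypothetical helper built on difflib.SequenceMatcher as a standard-library stand-in for fuzzywuzzy's fuzz.ratio, and the in-memory table holds just two of the sample rows.

```python
import sqlite3
from difflib import SequenceMatcher

def fuzz_ratio(a, b):
    """Return a 0-100 similarity score (stand-in for fuzzywuzzy's fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

conn = sqlite3.connect(":memory:")
# Register the Python function so that SQL can call it as fuzz(x, y).
conn.create_function("fuzz", 2, fuzz_ratio)

conn.execute(
    "CREATE TABLE matches (match_id INTEGER PRIMARY KEY, sport TEXT, "
    "home_team TEXT, away_team TEXT, date TEXT)"
)
conn.executemany(
    "INSERT INTO matches VALUES (?, ?, ?, ?, ?)",
    [
        (209, "football", "cs alagoana", "operario pr", "24/11/2020"),
        (385, "football", "stevenage", "hull", "29/11/2020"),
    ],
)

query = """
    SELECT match_id FROM matches
    WHERE sport = ? AND date = ?
      AND (fuzz(home_team, ?) > 80 OR fuzz(away_team, ?) > 80)
"""
# Row B from the sample data: (sport, date, home_team, away_team).
result = conn.execute(query, ("football", "24/11/2020", "csa", "operario")).fetchall()
print(result)  # [(209,)]
```

Here row B is found through the away team only ("operario" against "operario pr"); "csa" against "cs alagoana" scores well below 80 with a character-level ratio, so the choice of scorer matters more than the SQL plumbing.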
I could also pull the matches table into a pandas DataFrame and evaluate it there, if that is feasible (how?).
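On the in-memory route, one possible sketch uses difflib.get_close_matches from the standard library instead of a SQL function. It assumes the rows are plain tuples in the column order of the matches table, such as those returned by cursor.fetchall(); rows taken from a DataFrame (for instance via itertuples(index=False)) could presumably be passed in the same way. The function name find_in_memory and the cutoff of 0.8 are assumptions made for illustration.

```python
from difflib import get_close_matches

def find_in_memory(rows, new_match, cutoff=0.8):
    """Return the match_id of a stored match resembling new_match, else None.

    rows: iterable of (match_id, sport, home_team, away_team, date) tuples.
    new_match: (sport, home_team, away_team, date).
    """
    sport, home, away, date = new_match
    # Exact filter first: only games of the same sport on the same day compete.
    candidates = [r for r in rows if r[1] == sport and r[4] == date]
    by_home = {r[2]: r[0] for r in candidates}
    by_away = {r[3]: r[0] for r in candidates}
    # Mirror the OR in the pseudocode: a close home name, else a close away name.
    for name, lookup in ((home, by_home), (away, by_away)):
        close = get_close_matches(name, list(lookup), n=1, cutoff=cutoff)
        if close:
            return lookup[close[0]]
    return None

rows = [
    (414, "handball", "dinamo bucareste", "usam nimes", "24/11/2020"),
    (846, "handball", "benidorm", "naturhouse la rioja", "25/11/2020"),
]
found_e = find_in_memory(rows, ("handball", "dinamo bucuresti", "nimes", "24/11/2020"))
found_k = find_in_memory(rows, ("football", "west brom", "sheffield united", "28/11/2020"))
print(found_e, found_k)  # 414 None
```

A character-level ratio like this copes with spelling variants ("bucuresti" against "bucareste") but not with added or dropped words: "hull city" against "hull" scores only about 0.62. A token-based scorer, for example fuzzywuzzy's fuzz.token_set_ratio, may be a better fit for those cases.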
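The matches table will never hold more than a few thousand records at any given time.
Let me give some examples of the expected output, where the solution is represented by "find(row)".
The matches table in the database is: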
+----------+------------+-----------------------------+----------------------+------------+
| match_id | sport      | home_team                   | visitor_team         | date       |
+----------+------------+-----------------------------+----------------------+------------+
|       84 | football   | confianca                   | cuiaba esporte clube | 24/11/2020 |
|      209 | football   | cs alagoana                 | operario pr          | 24/11/2020 |
|      184 | football   | grenoble foot 38            | as nancy lorraine    | 24/11/2020 |
|        7 | football   | sv turkgucu-ataspor munchen | saarbrucken          | 24/11/2020 |
|      414 | handball   | dinamo bucareste            | usam nimes           | 24/11/2020 |
|      846 | handball   | benidorm                    | naturhouse la rioja  | 25/11/2020 |
|      874 | handball   | cegledi                     | ferencvarosi tc      | 25/11/2020 |
|      418 | handball   | lemvig-thyboron             | kif kolding          | 25/11/2020 |
|      740 | ice hockey | tps                         | kookoo               | 25/11/2020 |
|      385 | football   | stevenage                   | hull                 | 29/11/2020 |
+----------+------------+-----------------------------+----------------------+------------+
And the new matches to evaluate:
+----------------+------------+---------------------+---------------------+------------+
| row (for demo) | sport      | home_team           | visitor_team        | date       |
+----------------+------------+---------------------+---------------------+------------+
| A              | football   | confianca-se        | cuiaba mt           | 24/11/2020 |
| B              | football   | csa                 | operario            | 24/11/2020 |
| C              | football   | grenoble            | nancy               | 24/11/2020 |
| D              | football   | sv turkgucu ataspor | 1 fc saarbrucken    | 24/11/2020 |
| E              | handball   | dinamo bucuresti    | nimes               | 24/11/2020 |
| F              | handball   | bm benidorm         | bm logrono la rioja | 25/11/2020 |
| G              | handball   | cegledi kkse        | ftc budapest        | 25/11/2020 |
| H              | handball   | lemvig              | kif kobenhavn       | 25/11/2020 |
| I              | ice hockey | turku ps            | kookoo kouvola      | 25/11/2020 |
| J              | football   | stevenage borough   | hull city           | 29/11/2020 |
| K              | football   | west brom           | sheffield united    | 28/11/2020 |
+----------------+------------+---------------------+---------------------+------------+
Output:
find(A) returns: 84
find(B) returns: 209
find(C) returns: 184
find(D) returns: 7
find(E) returns: 414
find(F) returns: 846
find(G) returns: 874
find(H) returns: 418
find(I) returns: 740
find(J) returns: 385
find(K) returns: (something like "not found" => I would then insert the new row)
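To make the expected behaviour concrete, here is one hedged sketch of find(row) run against the sample data. It is not a definitive solution: token_overlap is a hypothetical scorer (the share of the shorter name's words that also occur in the other name), find_or_insert is an illustrative helper, the threshold of 80 comes from the pseudocode, and the column is called away_team as in the schema even though the sample tables label it visitor_team.

```python
import re
import sqlite3

def token_overlap(a, b):
    """0-100 score: share of the shorter name's tokens that also appear in the other."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    if not ta or not tb:
        return 0
    return round(100 * len(ta & tb) / min(len(ta), len(tb)))

def find(conn, row, threshold=80):
    """Return the match_id of the best candidate scoring above threshold, else None."""
    sport, home, away, date = row
    best_id, best_score = None, threshold
    candidates = conn.execute(
        "SELECT match_id, home_team, away_team FROM matches "
        "WHERE sport = ? AND date = ?",
        (sport, date),
    )
    for match_id, db_home, db_away in candidates:
        # Same OR rule as the pseudocode: either team name may carry the match.
        score = max(token_overlap(db_home, home), token_overlap(db_away, away))
        if score > best_score:
            best_id, best_score = match_id, score
    return best_id

def find_or_insert(conn, row):
    """Return the existing match_id, inserting the row first if nothing matches."""
    match_id = find(conn, row)
    if match_id is None:
        cur = conn.execute(
            "INSERT INTO matches (sport, home_team, away_team, date) "
            "VALUES (?, ?, ?, ?)",
            row,
        )
        match_id = cur.lastrowid
    return match_id

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matches (match_id INTEGER PRIMARY KEY, sport TEXT, "
    "home_team TEXT, away_team TEXT, date TEXT)"
)
conn.executemany("INSERT INTO matches VALUES (?, ?, ?, ?, ?)", [
    (84, "football", "confianca", "cuiaba esporte clube", "24/11/2020"),
    (209, "football", "cs alagoana", "operario pr", "24/11/2020"),
    (184, "football", "grenoble foot 38", "as nancy lorraine", "24/11/2020"),
    (7, "football", "sv turkgucu-ataspor munchen", "saarbrucken", "24/11/2020"),
    (414, "handball", "dinamo bucareste", "usam nimes", "24/11/2020"),
    (846, "handball", "benidorm", "naturhouse la rioja", "25/11/2020"),
    (874, "handball", "cegledi", "ferencvarosi tc", "25/11/2020"),
    (418, "handball", "lemvig-thyboron", "kif kolding", "25/11/2020"),
    (740, "ice hockey", "tps", "kookoo", "25/11/2020"),
    (385, "football", "stevenage", "hull", "29/11/2020"),
])
new_rows = {
    "A": ("football", "confianca-se", "cuiaba mt", "24/11/2020"),
    "B": ("football", "csa", "operario", "24/11/2020"),
    "C": ("football", "grenoble", "nancy", "24/11/2020"),
    "D": ("football", "sv turkgucu ataspor", "1 fc saarbrucken", "24/11/2020"),
    "E": ("handball", "dinamo bucuresti", "nimes", "24/11/2020"),
    "F": ("handball", "bm benidorm", "bm logrono la rioja", "25/11/2020"),
    "G": ("handball", "cegledi kkse", "ftc budapest", "25/11/2020"),
    "H": ("handball", "lemvig", "kif kobenhavn", "25/11/2020"),
    "I": ("ice hockey", "turku ps", "kookoo kouvola", "25/11/2020"),
    "J": ("football", "stevenage borough", "hull city", "29/11/2020"),
    "K": ("football", "west brom", "sheffield united", "28/11/2020"),
}
results = {label: find(conn, row) for label, row in new_rows.items()}
new_id = find_or_insert(conn, new_rows["K"])  # K has no match, so it is inserted
```

On this sample the sketch returns the ids listed above for A to J and None for K. It relies on the OR rule: abbreviations such as "csa" or "tps" share no whole word with the stored name and score 0, so rows B and I are recognised only through the other team. On busier days a single shared word could also produce a false positive, so the scorer and threshold would need tuning on real data.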
Thanks!
【Discussion】:
Tags: python sqlite dataframe fuzzy-search fuzzywuzzy