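【Question title】: Fuzzy Matching Two Columns in the Same Dataframe Using Python
【Posted】: 2019-04-05 19:53:53
【Question】:

I have two datasets in the same dataframe, each listing companies. One dataset is from 2017 and the other is from this year. I am trying to match the two company datasets against each other and figured fuzzy matching (FuzzyWuzzy) was the best way to do it. Using a partial ratio, I simply want columns with the following values: the name of last year's company, the highest fuzzy matching ratio, and this year's company associated with that highest score. The original dataframe is assigned to the variable "data", with last year's company names under the column "Company" and this year's company names under the column "Company name". To accomplish this, I tried to write a function using the extractOne fuzzy matching process and then apply that function to each value/row in the dataframe. I would then add the results to my original dataframe.

Here is the code: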

names_array=[]
ratio_array=[]
def match_names(last_year,this_year):
    for row in last_year:
        x=process.extractOne(row,this_year)
        names_array.append(x[0])
        ratio_array.append(x[1])
    return names_array,ratio_array


#last year company names dataset
last_year=data['Company'].dropna().values

#this year company names dataset

this_year=data['Company name'].values

name_match,ratio_match=match_names(last_year,this_year)

data['this_year']=pd.Series(name_match)
data['match_rating']=pd.Series(ratio_match)

data.to_csv("test.csv")

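However, every time I run this code, the two columns I create do not show up in the csv. In fact, "test.csv" is just the same dataframe as before, even though the computer shows it as recently created. If anyone could point out the problem or help in any way, it would be greatly appreciated.

Edit (dataframe preview):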

                                Company                Company name
0                   BODYPHLO  SPORTIQUE                         NaN
1                        JOSEPH A PERRY                         NaN
2                PCH RESORT TENNIS SHOP                         NaN
3              GREYSTONE GOLF CLUB INC.                         NaN
4                 MUSGROVE COUNTRY CLUB                         NaN
5           CITY OF PELHAM RACQUET CLUB                         NaN
6                 NORTHRIVER YACHT CLUB                         NaN
7                           LAKE FOREST                         NaN
8                   TNL TENNIS PRO SHOP                         NaN
9                SOUTHERN ATHLETIC CLUB                         NaN
10           ORANGE BEACH TENNIS CENTER                         NaN

Then, after the "Company" entries (last year's dataset) end, the "Company name" column (this year's dataset) begins like so:

4168                                NaN                LEWIS TENNIS
4169                                NaN          CHUCKS PRO SHOP AT
4170                                NaN                CHUCK KINYON
4171                                NaN   LAKE COUNTRY RACQUET CLUB
4172                                NaN   SPORTS ACADEMY & RAC CLUB

【Comments】:

  • Could you please include the first 10 rows or so of your dataframe using data.head(10)?

Tags: python pandas fuzzywuzzy


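【Answer 1】:

Your dataframe structure is odd, considering that one column only starts once the other ends, but we can make it work. Let's take the following sample dataframe for the data you have provided: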

                        Company               Company name
0           BODYPHLO  SPORTIQUE                        NaN
1                JOSEPH A PERRY                        NaN
2        PCH RESORT TENNIS SHOP                        NaN
3      GREYSTONE GOLF CLUB INC.                        NaN
4         MUSGROVE COUNTRY CLUB                        NaN
5   CITY OF PELHAM RACQUET CLUB                        NaN
6         NORTHRIVER YACHT CLUB                        NaN
7                   LAKE FOREST                        NaN
8           TNL TENNIS PRO SHOP                        NaN
9        SOUTHERN ATHLETIC CLUB                        NaN
10   ORANGE BEACH TENNIS CENTER                        NaN
11                          NaN               LEWIS TENNIS
12                          NaN         CHUCKS PRO SHOP AT
13                          NaN               CHUCK KINYON
14                          NaN  LAKE COUNTRY RACQUET CLUB
15                          NaN  SPORTS ACADEMY & RAC CLUB

Then perform the matching:

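import pandas as pd
from fuzzywuzzy import process, fuzz

# Candidate names to match against: this year's companies, NaN rows removed
known_list = data['Company name'].dropna()

def find_match(x):
    # Best-scoring candidate for this row's last-year company name
    match = process.extractOne(x['Company'], known_list, scorer=fuzz.partial_token_sort_ratio)
    return pd.Series([match[0], match[1]])

# Only rows with a last-year name are matched; all other rows get NaN in the new columns
data[['this year','match_rating']] = data.dropna(subset=['Company']).apply(find_match, axis=1, result_type='expand')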

This yields:

                        Company Company name                  this year  \
0           BODYPHLO  SPORTIQUE          NaN  SPORTS ACADEMY & RAC CLUB   
1                JOSEPH A PERRY          NaN         CHUCKS PRO SHOP AT   
2        PCH RESORT TENNIS SHOP          NaN               LEWIS TENNIS   
3      GREYSTONE GOLF CLUB INC.          NaN  LAKE COUNTRY RACQUET CLUB   
4         MUSGROVE COUNTRY CLUB          NaN  LAKE COUNTRY RACQUET CLUB   
5   CITY OF PELHAM RACQUET CLUB          NaN  LAKE COUNTRY RACQUET CLUB   
6         NORTHRIVER YACHT CLUB          NaN  LAKE COUNTRY RACQUET CLUB   
7                   LAKE FOREST          NaN  LAKE COUNTRY RACQUET CLUB   
8           TNL TENNIS PRO SHOP          NaN               LEWIS TENNIS   
9        SOUTHERN ATHLETIC CLUB          NaN  SPORTS ACADEMY & RAC CLUB   
10   ORANGE BEACH TENNIS CENTER          NaN               LEWIS TENNIS   

    match_rating  
0           47.0  
1           43.0  
2           67.0  
3           43.0  
4           67.0  
5           72.0  
6           48.0  
7           64.0  
8           67.0  
9           50.0  
10          67.0 
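
For readers who want to see the idea behind `process.extractOne` without installing FuzzyWuzzy, here is a minimal sketch using only the standard library's `difflib`. It is not FuzzyWuzzy's algorithm, so the scores will differ from the ratings above, and `best_match` is a hypothetical helper name introduced only for this illustration. It scores every candidate against the query and keeps the highest one:

```python
from difflib import SequenceMatcher

def best_match(query, choices):
    """Return (choice, score) for the choice most similar to query; score is 0-100."""
    def score(choice):
        # SequenceMatcher.ratio() is a similarity in [0, 1]; scale it to 0-100
        return round(100 * SequenceMatcher(None, query.lower(), choice.lower()).ratio())
    best = max(choices, key=score)
    return best, score(best)

# Small stand-ins for the two columns, with the NaN rows already removed
last_year = ["LAKE FOREST", "TNL TENNIS PRO SHOP"]
this_year = ["LEWIS TENNIS", "CHUCK KINYON", "LAKE COUNTRY RACQUET CLUB"]

# One (match, rating) pair per last-year name
matches = {name: best_match(name, this_year) for name in last_year}
for name, (match, rating) in matches.items():
    print(f"{name!r} -> {match!r} ({rating})")
```

As in the answer, the candidate list is built once, outside the per-row function, so each lookup only has to score the query against it.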

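【Comments】:

  • So I added a sample of the dataset to the original question. Logically this makes complete sense, but for some reason when I apply it and send "data" to csv or try to print "data" in Jupyter, nothing shows up. Is that because of the way my dataframe is structured, or something else?
  • Your dataframe structure is odd, but I have edited my answer to fit your case (since I don't have your full dataframe, the matches in my answer will be nonsense).
  • Thank you! This is perfect. Thanks for all the help.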
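
Since the sample matches above are admittedly nonsense, one way to keep weak matches out of the result is to apply a minimum score. The sketch below shows the idea with the standard library only. `similarity`, `match_or_none` and the default cutoff of 80 are assumptions made for illustration, not part of the answer. FuzzyWuzzy's own `process.extractOne` accepts a `score_cutoff` argument that should serve the same purpose.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity between two strings, scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def match_or_none(query, choices, cutoff=80):
    """Return (best_choice, score), or (None, score) if the best score is below cutoff."""
    best = max(choices, key=lambda choice: similarity(query, choice))
    rating = similarity(query, best)
    return (best, rating) if rating >= cutoff else (None, rating)

this_year = ["LEWIS TENNIS", "CHUCK KINYON", "LAKE COUNTRY RACQUET CLUB"]

# An exact name clears any cutoff up to 100
print(match_or_none("LEWIS TENNIS", this_year))
# A weaker candidate is reported as None when it falls below the cutoff
print(match_or_none("JOSEPH A PERRY", this_year))
```

The right cutoff depends on the data and on the scorer, so it is worth inspecting a sample of borderline matches before settling on a value.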