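【Question Title】: How to fill column by finding best match between list and other column in pandas?
【Posted】: 2020-12-30 03:13:36
【Question】:

I'm using Pandas/Python to process a bank account spreadsheet that contains posting dates, transaction descriptions, and amounts. I want to create a new column called "VENDOR Name" that reads the transaction description and fills in the best match from the vendor names stored in the list vendors. Below is an example of what I have tried (using a function I found on Stack Overflow). The descriptions have been changed to remove sensitive information, but the format is the same. I have a vendor spreadsheet called vendor_type.csv that contains a much larger list of vendors than I'm showing here. I would still convert it to a list with vendors = vendors_df['vendor_name'].tolist(), and the format would be the same as below.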

In [1]: import pandas as pd
   ...: import numpy as np
   ...: import re

In [2]: df = pd.DataFrame({'Posting Date': ['2020-02-20', '2020-02-20', '2020-02-20', '2020-02-21', '2020-02-21'],
   ...:                   'Description': ['CHECK 12345', 'CHECK 1234', 'FPL DIRECT DEBIT ELEC PYMT', 'CHECK 9874', 'ADP PAYROLL FEES ADP - FEES'],
   ...:                   'Amount': [-500, -700, -400, -600, -90]})

In [3]: print(df)
  Posting Date                  Description  Amount
0   2020-02-20                  CHECK 12345    -500
1   2020-02-20                   CHECK 1234    -700
2   2020-02-20   FPL DIRECT DEBIT ELEC PYMT    -400
3   2020-02-21                   CHECK 9874    -600
4   2020-02-21  ADP PAYROLL FEES ADP - FEES     -90

In [4]: vendors = ['PAYROLL CHECK', 'FPL', 'ADP Payroll fees']
   ...: pattern = '|'.join(vendors)

In [5]: def pattern_searcher(search_str:str, search_list:str):
   ...:     search_obj = re.search(search_list, search_str)
   ...:     if search_obj:
   ...:         return_str = search_str[search_obj.start(): search_obj.end()]
   ...:     else:
   ...:         return_str = 'NA'
   ...:     return return_str
   ...:     

In [6]: df['VENDOR Name'] = df['Description'].apply(lambda x: pattern_searcher(search_str=x, search_list=pattern))

In [7]: print(df)
  Posting Date                  Description  Amount VENDOR Name
0   2020-02-20                  CHECK 12345    -500          NA
1   2020-02-20                   CHECK 1234    -700          NA
2   2020-02-20   FPL DIRECT DEBIT ELEC PYMT    -400         FPL
3   2020-02-21                   CHECK 9874    -600          NA
4   2020-02-21  ADP PAYROLL FEES ADP - FEES     -90          NA

The final result should look like this:

  Posting Date                  Description  Amount       VENDOR Name
0   2020-02-20      CHECK 12345 VENDOR_NAME    -500      CHECK-VENDOR
1   2020-02-20                   CHECK 1234    -700     PAYROLL CHECK
2   2020-02-20   FPL DIRECT DEBIT ELEC PYMT    -400               FPL
3   2020-02-21                   CHECK 9874    -600     PAYROLL CHECK
4   2020-02-21  ADP PAYROLL FEES ADP - FEES     -90  ADP Payroll fees

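I'd still like to use the function above to classify the transactions (since it partly works), but that isn't required. I'd also like to use regex rules that can scale if the vendor list grows. I'm stuck here and would greatly appreciate any insight into how I can do this.

Thanks.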

【Question Discussion】:

  • So what about df["Description"].str.extract(f"({pattern})", flags=re.I)?
  • @HenryYik That recognizes ADP Payroll fees, but not any of the checks.
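
For reference, the suggestion in the comments could be sketched as below. This is only an illustration under some assumptions: re.escape is added in case vendor names contain regex metacharacters, and the canonical lookup dictionary is a hypothetical helper for mapping the matched text back to the spelling used in vendors. It shows why the checks stay unmatched: "PAYROLL CHECK" never appears literally in a description such as "CHECK 1234".

```python
import re

import pandas as pd

df = pd.DataFrame(
    {
        "Description": [
            "CHECK 12345",
            "CHECK 1234",
            "FPL DIRECT DEBIT ELEC PYMT",
            "CHECK 9874",
            "ADP PAYROLL FEES ADP - FEES",
        ]
    }
)
vendors = ["PAYROLL CHECK", "FPL", "ADP Payroll fees"]

# Assumption: vendor names are literal strings, so escape them before joining.
pattern = "|".join(re.escape(v) for v in vendors)

# Case-insensitive extraction; rows without a literal vendor match become NaN.
extracted = df["Description"].str.extract(f"({pattern})", flags=re.I, expand=False)

# Hypothetical helper: map the matched text back to the vendor list's spelling.
canonical = {v.lower(): v for v in vendors}
df["VENDOR Name"] = extracted.str.lower().map(canonical)
```

With this sample data, only the FPL and ADP rows receive a vendor; the three CHECK rows remain NaN, which matches the behaviour described in the reply.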

Tags: python python-3.x regex pandas dataframe


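【Solution 1】:

You don't want to match a pattern (regex). You want to find the similarity between the vendor name and the description. This can be done in several ways, but I really like fuzzywuzzy: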

import pandas

from typing import Optional
from fuzzywuzzy import fuzz, process


# Your input data
df = pandas.DataFrame(
    {
        "Posting Date": [
            "2020-02-20",
            "2020-02-20",
            "2020-02-20",
            "2020-02-21",
            "2020-02-21",
        ],
        "Description": [
            "CHECK 12345",
            "CHECK 1234",
            "FPL DIRECT DEBIT ELEC PYMT",
            "CHECK 9874",
            "ADP PAYROLL FEES ADP - FEES",
        ],
        "Amount": [-500, -700, -400, -600, -90],
    }
)

# List of vendors (can be loaded from file...)
vendors = ["PAYROLL CHECK", "FPL", "ADP Payroll fees"]


def matcher(description: str) -> Optional[str]:
    """Function that matches a description of a payment to a
    vendor in a list of vendors (fuzzy match).

    Args:
        description (str): The description to read

    Returns:
        str|None: The matching vendor (if we're certain enough about the match)
    """
    match, certainty = process.extractOne(
        description, vendors, scorer=fuzz.partial_ratio
    )
    if certainty >= 50:
        return match
    else:
        return None


df["VENDOR Name"] = df["Description"].apply(matcher)
df

Output:

  Posting Date                  Description  Amount       VENDOR Name
0   2020-02-20                  CHECK 12345    -500     PAYROLL CHECK
1   2020-02-20                   CHECK 1234    -700     PAYROLL CHECK
2   2020-02-20   FPL DIRECT DEBIT ELEC PYMT    -400               FPL
3   2020-02-21                   CHECK 9874    -600     PAYROLL CHECK
4   2020-02-21  ADP PAYROLL FEES ADP - FEES     -90  ADP Payroll fees

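Note: certainty is the score that indicates how good the match is. The threshold check is optional, since you could simply return the first/best match.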
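
The question also asks for regex rules that can scale with the vendor list. One possible way to combine the two ideas is an ordered rule table that is tried first, with a fuzzy matcher used only as a fallback. The sketch below is an assumption, not part of the original answer: RULES, classify, and the specific patterns are hypothetical and would need to be adapted to the real descriptions.

```python
import re
from typing import Callable, Optional

# Hypothetical rule table: (compiled pattern, vendor name). First match wins,
# so more specific rules should be listed before more general ones.
RULES = [
    (re.compile(r"^CHECK\s+\d+$", re.I), "PAYROLL CHECK"),
    (re.compile(r"\bFPL\b", re.I), "FPL"),
    (re.compile(r"\bADP\b.*\bFEES\b", re.I), "ADP Payroll fees"),
]


def classify(
    description: str,
    fallback: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Return the vendor of the first matching rule, else defer to fallback."""
    for pattern, vendor in RULES:
        if pattern.search(description):
            return vendor
    # No rule matched: use the fallback (e.g. a fuzzy matcher) if one is given.
    return fallback(description) if fallback else None
```

It could then be applied with something like df["Description"].apply(lambda d: classify(d, fallback=matcher)), where matcher is the fuzzy function defined above. Keeping the rules in a list means new vendors can be added without touching the matching logic.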
