How to compare two columns of different dataframes and create a new one

【Question title】: How to compare two columns of different dataframes and create a new one
【Posted】: 2020-02-04 23:10:35
【Question】:

Hi everyone, I'm new to Python. I have two dataframes. One contains drug descriptions, as shown below:

df1.head(5)

PID  Drug_Admin_Description
1       sodium chloride 0.9% SOLN
2       Nimodipine 30 mg oral
3       Livothirine 20 mg oral
4       Livo tab 112
5       Omega-3 Fatty Acids

The other table contains only drug names, as shown below:

df2.head(5)

Drug_Name 

Sodium chloride 0.5% SOLN
omega-3 Fatty Acids
gentamicin 40 mg/ml soln
amoxilin 123
abcd 12654

Is there a way to extract only the drugs that appear in both df1 and df2? Sample output is shown below:

new_column

Sodium chloride
omega-3

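I tried using regular expressions in Python but couldn't figure out how to apply them. Thanks in advance.

【Comments】:

Logically, how do you identify the drug name within the rest of the string?
Yes, you're right, that will be a problem. But for now anything that df1 and df2 have in common will do; I can clean the data later. Thanks.
If you want the common elements, see here: stackoverflow.com/questions/18079563/…

Tags: python regex pandas numpy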

【Solution 1】:

One option is to use get_close_matches from the difflib library.

import pandas as pd
import difflib

drug_description = ["sodium chloride 0.9% SOLN","Nimodipine 30 mg oral",
                    "Livothirine 20 mg oral", "Livo tab 112",
                    "Omega-3 Fatty Acids"]

df1 = pd.DataFrame({"Drug_Admin_Description":drug_description})


drug_name = ["Sodium chloride 0.5% SOLN", "omega-3 Fatty Acids",
            "gentamicin 40 mg/ml soln", "amoxilin 123", "abcd 12654"]

df2 = pd.DataFrame({"Drug_Name":drug_name})
# The code above recreates the DataFrames from the data you provided.

# Collect, for each description in df1, the closest drug name from df2 (if any).
match_list = []

for drug in df1["Drug_Admin_Description"]:
    # n=1 returns at most one candidate. The default cutoff is 0.6, so an
    # empty list means no name was at least 60% similar.
    match_test = difflib.get_close_matches(drug, drug_name, n=1)
    if match_test:
        match_list.append(match_test[0])  # keep the single best match

df3 = pd.DataFrame({"new_column": match_list})  # DataFrame of the matched names

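Here is a link to the get_close_matches documentation. You can pass the cutoff parameter to set the minimum similarity a candidate must reach to count as a match. https://docs.python.org/2/library/difflib.html#difflib.get_close_matches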
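As a rough illustration of the cutoff parameter, the sketch below wraps get_close_matches in a small helper. The helper name best_match and the lower-casing step are assumptions added here, not part of the answer; lower-casing is included because get_close_matches compares strings case-sensitively.

```python
import difflib

drug_name = ["Sodium chloride 0.5% SOLN", "omega-3 Fatty Acids",
             "gentamicin 40 mg/ml soln", "amoxilin 123", "abcd 12654"]

def best_match(text, choices, cutoff=0.6):
    """Hypothetical helper: closest entry of `choices` to `text`, or None."""
    # Map lower-cased names back to their original spelling
    by_lower = {choice.lower(): choice for choice in choices}
    hits = difflib.get_close_matches(text.lower(), list(by_lower),
                                     n=1, cutoff=cutoff)
    return by_lower[hits[0]] if hits else None

# Default cutoff (0.6): a near-identical name is accepted
print(best_match("sodium chloride 0.9% SOLN", drug_name))
# cutoff=1.0 accepts only exact matches (after lower-casing)
print(best_match("sodium chloride 0.9% SOLN", drug_name, cutoff=1.0))
```

With the default cutoff the first call finds the sodium chloride entry from df2, while the second returns None because the two strings differ in one character. Raising the cutoff trades fewer false matches for more missed ones.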

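【Comments】:

Thank you very much @RamWill. I'm trying to implement it on a very slow server. Thanks for your help.

【Solution 2】:

One possible solution:

To get the names from a DataFrame column, define the following function: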

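def getNames(src, colName):
    # Split each name once at the first "numeric part" and keep the text before it
    # (drop takes the axis by keyword; recent pandas rejects it as a positional argument)
    res = src.str.split(r' [\d.%]+ ?', n=1, expand=True).drop(columns=1)
    # Index by the upper-cased name so the join is case-insensitive
    res.set_index(res[0].str.upper(), inplace=True)
    res.index.name = None
    res.columns = [colName]
    return res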

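I noticed that each drug name can contain a "numeric part" (a space, then a sequence of digits, possibly including a dot or a percent character).

So this function splits each name at this pattern and keeps only the first "segment".

Then note the differences in upper/lower case: each name list must have an index holding the same names in upper case, so that the two name lists can be joined on the index alone.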
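To see what the pattern does to individual names, here is a minimal plain-Python sketch using the standard re module instead of pandas. The helper name name_key is hypothetical; the pattern itself is the one used in getNames.

```python
import re

# Same pattern as in getNames: a space, then digits, dots or percent signs,
# then an optional trailing space
NUMERIC_PART = re.compile(r" [\d.%]+ ?")

def name_key(description):
    """Hypothetical helper: (name before the numeric part, upper-case key)."""
    name = NUMERIC_PART.split(description, maxsplit=1)[0]
    return name, name.upper()

for text in ["sodium chloride 0.9% SOLN", "Livo tab 112", "Omega-3 Fatty Acids"]:
    print(name_key(text))
```

The first two names are cut before the number. "Omega-3 Fatty Acids" is left whole, because its only digit follows a hyphen rather than a space and so never matches the pattern.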

Then call this function for both source columns:

n1 = getNames(df1.Drug_Admin_Description, 'Name')
n2 = getNames(df2.Drug_Name, 'Name2')

To get the final result, run:

n1.join(n2, how='inner').drop(columns='Name2').reset_index(drop=True)

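Compared with your desired result, there is one difference: Omega-3 Fatty Acids is returned in full.

By the criterion I chose, this name contains no numeric part. The only digit (3) is an integral part of the name, and no number follows it. So I think there is nothing to "cut off" in this case.

【Comments】:

Thank you, this is useful. I'm trying to implement it on a very slow server, and the data is huge. I'm still working on it.
Another approach worth considering is fuzzy matching, using the fuzzywuzzy Python module. You can find plenty of questions and examples on StackOverflow itself.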
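fuzzywuzzy is a third-party package, so as a dependency-free sketch of the same idea, the block below scores candidates with difflib.SequenceMatcher from the standard library instead. The function name rank_candidates and the 0 to 100 scaling are illustrative assumptions; this is not fuzzywuzzy's API.

```python
from difflib import SequenceMatcher

drug_name = ["Sodium chloride 0.5% SOLN", "omega-3 Fatty Acids",
             "gentamicin 40 mg/ml soln", "amoxilin 123", "abcd 12654"]

def rank_candidates(text, choices):
    """Hypothetical helper: (score, choice) pairs, best first, scores 0..100."""
    scored = []
    for choice in choices:
        # Compare case-insensitively; ratio() is 1.0 for identical strings
        ratio = SequenceMatcher(None, text.lower(), choice.lower()).ratio()
        scored.append((round(100 * ratio), choice))
    return sorted(scored, reverse=True)

for score, name in rank_candidates("sodium chloride 0.9% SOLN", drug_name):
    print(score, name)
```

Ranking every candidate, rather than keeping only the best one, makes it easier to pick a sensible score threshold by inspecting real data before applying it to the full table.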