How to merge pandas tables by regex

【Question title】: How to merge pandas table by regex
【Posted】: 2018-09-06 10:46:20
【Question description】:

I would like to know if there is a fast way to merge two pandas tables by regex in Python.

For example, table A:

col1 col2             
1    apple_3dollars_5        
2    apple_2dollar_4
1    orange_5dollar_3
1    apple_1dollar_3

Table B:

col1 col2
good (apple|oragne)_\dollars_5
bad  .*_1dollar_.*
ok   oragne_\ddollar_\d

Output:

col1 col2              col3
1    apple_3dollars_5  good
1    orange_5dollar_3  ok
1    apple_1dollar_3   bad

This is just an example. What I want is not a merge on an exactly matching column; I want to join on a regex match. Thanks!

【Question discussion】:

Tags: python regex pandas join merge

【Solution 1】:

First, fix the regexes in the B DataFrame (the `\d` escapes and the spelling of `orange`):

In [222]: B
Out[222]:
   col1                        col2
0  good  (apple|orange)_\ddollars_5
1   bad               .*_1dollar_.*
2    ok          orange_\ddollar_\d

Now we can prepare the following variables:

In [223]: to_repl = B.col2.values.tolist()

In [224]: vals = B.col1.values.tolist()

In [225]: to_repl
Out[225]: ['(apple|orange)_\\ddollars_5', '.*_1dollar_.*', 'orange_\\ddollar_\\d']

In [226]: vals
Out[226]: ['good', 'bad', 'ok']

Finally, we can pass them to the replace function:

In [227]: A['col3'] = A['col2'].replace(to_repl, vals, regex=True)

In [228]: A
Out[228]:
   col1              col2             col3
0     1  apple_3dollars_5             good
1     2   apple_2dollar_4  apple_2dollar_4
2     1  orange_5dollar_3               ok
3     1   apple_1dollar_3              bad
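
Note that row 1 matches no pattern, so `replace` leaves its original `col2` value in `col3`, whereas the output requested in the question drops that row. The following is a minimal sketch of one way to get that result. It swaps `replace` for an explicit loop over the patterns using `re.fullmatch`, which is not part of the answer above, and the helper name `first_label` is hypothetical.

```python
import re

import pandas as pd

# Table A from the question.
A = pd.DataFrame({
    "col1": [1, 2, 1, 1],
    "col2": ["apple_3dollars_5", "apple_2dollar_4",
             "orange_5dollar_3", "apple_1dollar_3"],
})
# Table B with the corrected patterns.
B = pd.DataFrame({
    "col1": ["good", "bad", "ok"],
    "col2": [r"(apple|orange)_\ddollars_5", r".*_1dollar_.*",
             r"orange_\ddollar_\d"],
})


def first_label(value, patterns, labels):
    """Return the label of the first pattern matching the whole value, else None."""
    for pattern, label in zip(patterns, labels):
        if re.fullmatch(pattern, value):
            return label
    return None


# Unmatched rows get a missing value instead of a copy of col2.
A["col3"] = A["col2"].map(lambda v: first_label(v, B["col2"], B["col1"]))

# Keep only the rows that matched some pattern, as in the requested output.
matched = A.dropna(subset=["col3"])
print(matched)
```

Because the loop returns on the first hit, the order of the rows in B decides which label wins when several patterns match the same value.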

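【Discussion】:

Thank you very much for your answer, and sorry for the bad example. In my real case, I am trying to use a list of regexes (table A) to match a list of URLs (table B). Your answer works perfectly for my example. I just wonder what to do if the two tables have multiple columns (rather than replacing a single column), or if we want a left/inner join. Do you have any suggestions? To put it simply, something like: table1.merge(table2, how = 'inner', left_on = 'col1', right_on='col2', regex = True). Thanks a lot!
You don't need ".values" when calling .tolist().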
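
`DataFrame.merge` has no `regex` parameter, so the call sketched in the comment above cannot work as written. One way to approximate it is to cross join the two tables and keep only the pairs where the pattern matches. The sketch below assumes pandas 1.2 or later (for `how="cross"`), assumes the two tables share no column names, and requires a full-string match. The function name `regex_merge` and the renamed B columns `label` and `pattern` are hypothetical.

```python
import re

import pandas as pd


def regex_merge(left, right, left_on, right_on, how="inner"):
    """Join rows of `left` to rows of `right` whose `right_on` pattern
    fully matches the `left_on` value. Supports how='inner' and how='left'."""
    if how not in ("inner", "left"):
        raise ValueError("this sketch only supports how='inner' or how='left'")
    # Tag each left row with its position so unmatched rows can be restored.
    tagged = left.reset_index(drop=True).rename_axis("_pos").reset_index()
    # Pair every left row with every right row.
    pairs = tagged.merge(right, how="cross")
    # Keep the pairs where the regex matches the whole value.
    hit = [re.fullmatch(p, v) is not None
           for p, v in zip(pairs[right_on], pairs[left_on])]
    result = pairs.loc[hit]
    if how == "left":
        # Add back the left rows that matched nothing; right columns become NaN.
        unmatched = tagged[~tagged["_pos"].isin(result["_pos"])]
        result = pd.concat([result, unmatched]).sort_values("_pos", kind="stable")
    return result.drop(columns="_pos").reset_index(drop=True)


A = pd.DataFrame({
    "col1": [1, 2, 1, 1],
    "col2": ["apple_3dollars_5", "apple_2dollar_4",
             "orange_5dollar_3", "apple_1dollar_3"],
})
B = pd.DataFrame({
    "label": ["good", "bad", "ok"],
    "pattern": [r"(apple|orange)_\ddollars_5", r".*_1dollar_.*",
                r"orange_\ddollar_\d"],
})

inner = regex_merge(A, B, "col2", "pattern", how="inner")
left = regex_merge(A, B, "col2", "pattern", how="left")
print(inner)
print(left)
```

The cross join builds `len(left) * len(right)` rows before filtering, so this is only practical when that product fits comfortably in memory.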

【Solution 2】:

I got this idea from https://python.tutorialink.com/can-i-perform-a-left-join-merge-between-two-dataframes-using-regular-expressions-with-pandas/ and improved it a bit so that the original data can have multiple columns. Now we can do a real left join (merge) using regex!

import pandas as pd

# Data to enrich: 'field' is the column matched against the patterns.
d = {'extra_colum1': ['x', 'y', 'z', 'w'], 'field': ['ab', 'a', 'cd', 'e'], 'extra_colum2': ['x', 'y', 'z', 'w']}
df = pd.DataFrame(data=d)
# Lookup table: each destination gets a regex pattern stored in 'field'.
df_dict = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['destination'])
df_dict['field'] = '.*' + df_dict['destination'] + '.*'

[Image: the dataframe and the dict]

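import re

def merge_regex(df, df_dict, how, field):
    """Merge df with df_dict, treating df_dict[field] as regex patterns for df[field]."""
    df_dict = df_dict.drop_duplicates()
    # Positions (pattern row, data row) where the pattern matches from the start of the value.
    idx = [(i, j)
           for i, pattern in enumerate(df_dict[field])
           for j, value in enumerate(df[field])
           if re.match(pattern, value)]
    # Fall back to empty positions so the unpacking does not fail when nothing matches.
    df_dict_idx, df_idx = zip(*idx) if idx else ((), ())
    # Column 0 of df_dict is assumed to hold the value to attach (here 'destination').
    t = df_dict.iloc[list(df_dict_idx), 0].reset_index(drop=True)
    t1 = df.iloc[list(df_idx), df.columns.get_loc(field)].reset_index(drop=True)
    # Lookup table: one row per (attached value, matching field value) pair.
    df_dict_translated = pd.concat([t, t1], axis=1)
    data = pd.merge(df, df_dict_translated, how=how, on=field)
    return data.drop_duplicates()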
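
The function relies on `re.match`, which only tries to match at the start of the string. That is presumably why the setup code wraps each destination in `.*`. The short sketch below contrasts `re.match` with `re.search` and `re.fullmatch` on made-up strings, as a way to decide which one fits your patterns.

```python
import re

# re.match succeeds only if the pattern matches at position 0.
starts_with = bool(re.match("a", "ab"))    # "ab" begins with "a"
anchored = bool(re.match("a", "ba"))       # "a" is not at position 0
# Wrapping the pattern in .* lets re.match succeed anywhere in the string.
wrapped = bool(re.match(".*a.*", "ba"))
# re.search scans the whole string without needing the .* wrapper.
searched = bool(re.search("a", "ba"))
# re.fullmatch requires the pattern to consume the entire string.
whole = bool(re.fullmatch("a", "ab"))

print(starts_with, anchored, wrapped, searched, whole)
# True False True True False
```

If your patterns are meant to describe the whole value, as with the URL patterns mentioned in the earlier comment, `re.fullmatch` may be the safer choice. With `re.match`, a pattern can match a prefix and still count as a hit.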

【Discussion】: