【Question title】: One-to-many merge of two DataFrames when one column's string is contained in another's, with Python
【Posted】: 2021-11-15 12:17:08
【Question】:

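I have two DataFrames that I want to merge whenever a value in the words column of df1 contains a value from the keywords column of df2. I have been trying str.extract, but so far have had no luck getting the expected result. Example below: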

df1:

[{'id': 1, 'words': 'chellomedia', 'languages': nan},
 {'id': 2, 'words': 'Moien Welt!', 'languages': 'Luxemburgish'},
 {'id': 3, 'words': 'Ahoj světe!', 'languages': 'Czech'},
 {'id': 4, 'words': 'hello world', 'languages': nan},
 {'id': 5, 'words': '¡Hola Mundo!', 'languages': 'Spanish'},
 {'id': 6, 'words': 'hello kitty', 'languages': 'English'},
 {'id': 7, 'words': 'Ciao mondo!', 'languages': 'Italian'},
 {'id': 8, 'words': 'hola world', 'languages': nan}]

df2:

[{'code': 1, 'keywords': 'Hello'},
 {'code': 2, 'keywords': 'hola'},
 {'code': 3, 'keywords': 'world'}]

My attempt:

import re
import pandas as pd

df1['words'] = df1['words'].str.lower()
df2['keywords'] = df2['keywords'].str.lower()

pat = '|'.join([re.escape(x) for x in df2.keywords])
df1.insert(0, 'keywords', df1['words'].str.extract('(' + pat + ')', expand=False))

pd.merge(df1, df2, on='keywords', how='left')

Output:

  keywords  id         words     languages  code
0    hello   1   chellomedia           NaN   1.0
1      NaN   2   moien welt!  Luxemburgish   NaN
2      NaN   3   ahoj světe!         Czech   NaN
3    hello   4   hello world           NaN   1.0
4     hola   5  ¡hola mundo!       Spanish   2.0
5    hello   6   hello kitty       English   1.0
6      NaN   7   ciao mondo!       Italian   NaN
7     hola   8    hola world           NaN   2.0

But the desired result should look like this:

  keywords  id         words     languages  code
0    hello   1   chellomedia           NaN   1.0
1      NaN   2   moien welt!  Luxemburgish   NaN
2      NaN   3   ahoj světe!         Czech   NaN
3    hello   4   hello world           NaN   1.0
4    world   4   hello world           NaN   3.0  ---> should be generated in df
5     hola   5  ¡hola mundo!       Spanish   2.0
6    hello   6   hello kitty       English   1.0
7      NaN   7   ciao mondo!       Italian   NaN
8     hola   8    hola world           NaN   2.0
9    world   8    hola world           NaN   3.0  ---> should be generated in df

How can I produce the expected result? Thanks.

【Discussion】:

    Tags: python python-3.x regex pandas dataframe


    【Solution 1】:

    str.extract returns only the first match in each row, so you need to use findall and explode instead of extract, for example:

    df1.insert(0, 'keywords', df1['words'].str.findall('(' + pat + ')'))
    print(pd.merge(df1.explode('keywords'), df2, on='keywords', how='left')
            .sort_values('id').reset_index(drop=True))
    

    Output:

      keywords  id         words     languages  code
    0    hello   1   chellomedia           NaN   1.0
    1      NaN   2   moien welt!  Luxemburgish   NaN
    2      NaN   3   ahoj světe!         Czech   NaN
    3    hello   4   hello world           NaN   1.0
    4    world   4   hello world           NaN   3.0
    5     hola   5  ¡hola mundo!       Spanish   2.0
    6    hello   6   hello kitty       English   1.0
    7      NaN   7   ciao mondo!       Italian   NaN
    8    world   8    hola world           NaN   3.0
    9     hola   8    hola world           NaN   2.0
    

    Exactly what you need :)
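For anyone who wants to run the answer end to end, here is a minimal self-contained sketch of the findall + explode approach. It rebuilds df1 and df2 from the records in the question; the variable name result and the kind='stable' argument to sort_values are additions for illustration and are not part of the original answer. The stable sort is only there so that rows sharing an id keep the order in which the keywords were found.

```python
import re

import numpy as np
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame([
    {'id': 1, 'words': 'chellomedia', 'languages': np.nan},
    {'id': 2, 'words': 'Moien Welt!', 'languages': 'Luxemburgish'},
    {'id': 3, 'words': 'Ahoj světe!', 'languages': 'Czech'},
    {'id': 4, 'words': 'hello world', 'languages': np.nan},
    {'id': 5, 'words': '¡Hola Mundo!', 'languages': 'Spanish'},
    {'id': 6, 'words': 'hello kitty', 'languages': 'English'},
    {'id': 7, 'words': 'Ciao mondo!', 'languages': 'Italian'},
    {'id': 8, 'words': 'hola world', 'languages': np.nan},
])
df2 = pd.DataFrame([
    {'code': 1, 'keywords': 'Hello'},
    {'code': 2, 'keywords': 'hola'},
    {'code': 3, 'keywords': 'world'},
])

# Lowercase both sides so the substring match is case-insensitive.
df1['words'] = df1['words'].str.lower()
df2['keywords'] = df2['keywords'].str.lower()

# Alternation of the escaped keywords, e.g. 'hello|hola|world'.
pat = '|'.join(re.escape(x) for x in df2['keywords'])

# findall returns a list with every non-overlapping match per row
# (an empty list when nothing matches).
df1.insert(0, 'keywords', df1['words'].str.findall(pat))

# explode turns each list element into its own row (empty lists become NaN),
# which is what makes the subsequent merge one-to-many.
result = (
    df1.explode('keywords')
       .merge(df2, on='keywords', how='left')
       .sort_values('id', kind='stable')  # assumption: stable sort to keep match order
       .reset_index(drop=True)
)
print(result)
```

One caveat with this sketch: regex alternation only finds non-overlapping matches, so if one keyword is a substring of another (say hell and hello), only one of them is reported for a given position. A keyword that occurs twice in the same row also produces two rows.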
