Python: combine str.contains and merge in pandas

【Question title】: Python: combine str.contains and merge in pandas
【Posted】: 2018-03-30 13:35:55
【Question description】:

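I have two dataframes that look roughly like the ones below (the Content column in df1 actually holds the full text of an article, not just a single sentence as in my example):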

    PDF     Content
1   1234    This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!
2   1111    Johannes writes about apples and oranges and that's great.
3   8000    Content that cannot be matched to the anything in df1.    
4   3993    There is an interesting piece on bananas plus kiwis as well.
    ...

(Total: 5709 entries)

    Author        Title
1   Johannes      Apples and oranges
2   Peter         Bananas and pears and grapes
3   Hannah        Bananas plus kiwis
4   Helena        Mangos and peaches
    ...

(Total: 10228 entries)

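I want to merge the two dataframes by searching for each Title from df2 in the Content of df1. A title counts as a match if it appears anywhere in the first 2500 characters of the content. Note: it is important to keep all entries from df1, whereas from df2 I only want to keep the matching entries (i.e. a left join). Note: all Titles are unique values.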

Desired output (column order does not matter):

    Author     Title                        PDF     Content
1   Peter      Bananas and pears and grapes 1234    This article is about bananas and pears and grapes, but also mentions apples and oranges, so much fun!
2   Johannes   Apples and oranges           1111    Johannes writes about apples and oranges and that's great.
3   NaN        NaN                          8000    Content that cannot be matched to the anything in df2.    
4   Hannah     Bananas plus kiwis           3993    There is an interesting piece on bananas plus kiwis as well.
    ...

I think I need some combination of pd.merge and str.contains, but I can't figure out how!

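【Comments】:

What behavior do you want/expect if there are multiple matches?
All entries in the Title column are unique. As for the Content column, I want the Title entry to be matched with the first match found in the Content entry.
"First match found" as in...? First in the dataset (row by row), or first by position within the string?
Try a full Cartesian join and then design your own filter?
I have edited my question, see PDF 1234, which mentions both "bananas and pears and grapes" and "apples and oranges". So: first by position within the string. Though I must say it is unlikely that two titles both appear within the first 2500 characters.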

Tags: python regex pandas dataframe merge

【Solution 1】:

Warning: this solution may be slow :).
1. Get the list of titles
2. Create an index for df1 based on the order of the title list
3. Join df1 and df2 on idx

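  import itertools
  import pandas as pd

  # Lower-cased titles; a title's list position doubles as the join key.
  # df2 is assumed to have the default 0..n-1 index.
  lst = [item.lower() for item in df2.Title.tolist()]
  # Unmatched rows get fresh keys starting past the end of df2's index.
  unmatched = itertools.count(len(lst))

  def func(row):
      content = row[:2500].lower()
      for i, item in enumerate(lst):
          if item in content:
              return i
      return next(unmatched)

  df1 = df1.assign(idx=df1.Content.apply(func))

  res = pd.concat([df1.set_index('idx'), df2], axis=1)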

Output

      PDF                                            Content    Author  \
0  1111.0  Johannes writes about apples and oranges and t...  Johannes
1  1234.0  This article is about bananas and pears and gr...     Peter
2  3993.0  There is an interesting piece on bananas plus ...    Hannah
3     NaN                                                NaN    Helena
4  8000.0  Content that cannot be matched to the anything...       NaN

                          Title
0            Apples and oranges
1  Bananas and pears and grapes
2            Bananas plus kiwis
3            Mangos and peaches
4                           NaN
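Note that the loop above returns the first title in list order, and the outer concat also keeps unmatched df2 rows such as Helena. A minimal sketch of the behavior asked for in the question (earliest match by position in the string, left join on df1) might look like the following. It assumes the sample data from the question; the helper name earliest_title and its limit parameter are illustrative and not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    "PDF": [1234, 1111, 8000, 3993],
    "Content": [
        "This article is about bananas and pears and grapes, but also "
        "mentions apples and oranges, so much fun!",
        "Johannes writes about apples and oranges and that's great.",
        "Content that cannot be matched to the anything in df1.",
        "There is an interesting piece on bananas plus kiwis as well.",
    ],
})
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Hannah", "Helena"],
    "Title": ["Apples and oranges", "Bananas and pears and grapes",
              "Bananas plus kiwis", "Mangos and peaches"],
})

# Pair each original title with its lower-cased form once, up front.
titles = [(t, t.lower()) for t in df2["Title"]]

def earliest_title(content, limit=2500):
    """Return the title that starts earliest within the first `limit` characters, or None."""
    head = content[:limit].lower()
    hits = [(head.find(low), title) for title, low in titles if low in head]
    return min(hits)[1] if hits else None

matched = df1.assign(Title=df1["Content"].apply(earliest_title))
# Left join: every df1 row survives; unmatched rows get NaN for Author.
res = matched.merge(df2, on="Title", how="left")
print(res[["PDF", "Author", "Title"]])
```

Like the original loop, this scans every title for every article, so it is a sketch for clarity rather than speed.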

【Comments】:

I get the following error, even though both dataframes initially contain only non-null objects: AttributeError: 'float' object has no attribute 'lower', raised on the line lst = [item.lower() for item in df2.Title.tolist()]. Any ideas?
@NynkeLys Convert Content to str
I did, using the following command, but I still get the same error: df1.Content = df1.Content.astype('str')
@NynkeLys Convert Title to str as well
@NynkeLys For the code to run, both Title and Content must be strings. :)
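The error in this exchange is consistent with a missing value in df2.Title, since pandas stores NaN as a float. The sketch below reproduces that situation on a hypothetical three-row df2 (the "Nobody" row is invented for illustration) and shows one caveat of forcing everything to a string: NaN becomes the text "nan", which happens to occur inside "bananas". Dropping rows without a title is one way to sidestep both problems, assuming such rows are not needed.

```python
import numpy as np
import pandas as pd

# Hypothetical df2 with one missing title, to reproduce the situation.
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Nobody"],
    "Title": ["Apples and oranges", "Bananas and pears and grapes", np.nan],
})

# A missing value comes back as a float, which has no .lower() method.
try:
    [item.lower() for item in df2.Title.tolist()]
    raised = False
except AttributeError:
    raised = True

# Forcing each value through str() turns NaN into the literal text "nan",
# which then matches any content that contains those three letters.
as_text = [str(item).lower() for item in df2.Title.tolist()]
false_hit = as_text[-1] in "an article about bananas"

# Dropping rows without a title avoids both the error and the false match.
clean = df2.dropna(subset=["Title"])
lst = [item.lower() for item in clean.Title.tolist()]
```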

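【Solution 2】:

You can do a full Cartesian join / cross product and then filter. Since you can't do a hash lookup here anyway, it shouldn't be slower than the equivalent "join" statement: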

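import pandas as pd

# Give both frames the same constant key so the merge yields the full cross product.
df1['key'] = 1
df2['key'] = 1
df3 = pd.merge(df1, df2, on='key')
# Reuse 'key' as a boolean mask: True where the title occurs in the first 2500 characters.
df3['key'] = df3.apply(lambda row: row['Title'].lower() in row['Content'][:2500].lower(), axis=1)
df3 = df3.loc[df3['key'], ['PDF', 'Author', 'Title', 'Content']]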

Resulting table:

       PDF    Author                         Title  \
0   1234.0  Johannes            Apples and oranges
1   1234.0     Peter  Bananas and pears and grapes
4   1111.0  Johannes            Apples and oranges
14  3993.0    Hannah            Bananas plus kiwis

                                              Content
0   This article is about bananas and pears and gr...
1   This article is about bananas and pears and gr...
4   Johannes writes about apples and oranges and t...
14  There is an interesting piece on bananas plus ...
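The table above contains two rows for PDF 1234 and none for PDF 8000, so it is not yet the left join the question asks for. One possible way to finish the job is sketched below: keep the earliest match per article and merge back onto df1. This assumes pandas 1.2 or newer (for how="cross"), the sample data from the question, and that PDF uniquely identifies an article; the pos column is a helper introduced here.

```python
import pandas as pd

df1 = pd.DataFrame({
    "PDF": [1234, 1111, 8000, 3993],
    "Content": [
        "This article is about bananas and pears and grapes, but also "
        "mentions apples and oranges, so much fun!",
        "Johannes writes about apples and oranges and that's great.",
        "Content that cannot be matched to the anything in df1.",
        "There is an interesting piece on bananas plus kiwis as well.",
    ],
})
df2 = pd.DataFrame({
    "Author": ["Johannes", "Peter", "Hannah", "Helena"],
    "Title": ["Apples and oranges", "Bananas and pears and grapes",
              "Bananas plus kiwis", "Mangos and peaches"],
})

# Every article paired with every title (needs pandas >= 1.2).
pairs = df1.merge(df2, how="cross")

# Position of each title within the first 2500 characters (-1 when absent).
pairs["pos"] = [
    content[:2500].lower().find(title.lower())
    for content, title in zip(pairs["Content"], pairs["Title"])
]

# Keep real matches, then only the earliest one per article.
hits = pairs[pairs["pos"] >= 0].sort_values("pos").drop_duplicates("PDF")

# Left-join back so unmatched articles stay, with NaN for Author and Title.
res = df1.merge(hits[["PDF", "Author", "Title"]], on="PDF", how="left")
print(res[["PDF", "Author", "Title"]])
```

With 5709 articles and 10228 titles the cross product has roughly 58 million rows, so memory may become the limiting factor; processing df1 in chunks would be one way around that.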

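【Comments】:

Thanks! I tried this, but I get the following error: ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series. Any ideas?
Any ideas? Running your code keeps raising the error. I'm using Python 2.7, and it happens even with exactly the same dfs I created for my question.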
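A likely cause of this error, assuming the two key columns hold different constants (1 and 2) when the merge runs: the inner merge then has no matching keys and returns an empty frame, so the following apply has no rows from which to build a boolean column. The sketch below only demonstrates the empty merge on two small made-up frames; the exact exception raised afterwards may differ between pandas versions.

```python
import pandas as pd

left = pd.DataFrame({"Content": ["apples and oranges", "bananas plus kiwis"]})
right = pd.DataFrame({"Title": ["Apples and oranges", "Bananas plus kiwis"]})

# Different constants never match, so the inner merge has no rows.
empty = pd.merge(left.assign(key=1), right.assign(key=2), on="key")

# The same constant on both sides pairs every row with every row.
full = pd.merge(left.assign(key=1), right.assign(key=1), on="key")

print(len(empty), len(full))  # prints: 0 4
```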