Searching a list of strings within a column of data (answers)

【Question title】: Searching a list of strings within a column of data
【Posted】: 2020-04-24 10:15:11
【Question】:

I have a column of data that looks like this:

import pandas as pd
import numpy as np

   Items
0  Product A + Product B + Product C   
1  Product A + Product B + Product B1 + Product C1 
2

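I want to go through these items and find out whether the column contains certain specific products, namely the ones I am interested in flagging as present in the Items column: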

My_Items = ['Product B', 'Product C', 'Product C1']

I have tried the following lambda function, but when a cell contains more than one product it does not pick up the strings I am searching for:

df['My Items'] = df['Items'].apply(lambda x: 'Contains my items' if x in My_Items else '')

Does anyone know how to search for multiple strings from a list inside a lambda function?

Thanks for any help or suggestions.

Kind regards
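For reference, `x in My_Items` tests whether the whole cell string is equal to one of the list entries, which is why rows holding several products never match. A minimal sketch of a substring-based lambda follows; it only flags rows with at least one listed product, and the third row of the sample frame is an assumption added for illustration.

```python
import pandas as pd

My_Items = ['Product B', 'Product C', 'Product C1']
df = pd.DataFrame({'Items': [
    'Product A + Product B + Product C',
    'Product A + Product B + Product B1 + Product C1',
    'Product A + Product D',  # hypothetical row with no listed product
]})

# `x in My_Items` asks whether the whole cell equals a list entry;
# `item in x` asks whether a list entry occurs inside the cell.
df['My Items'] = df['Items'].apply(
    lambda x: 'Contains my items' if any(item in x for item in My_Items) else ''
)
print(df)
```

Because this is plain substring matching, 'Product C' is also found inside 'Product C1'.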

【Comments】:

What is the expected output?

Tags: python string pandas lambda

【Solution 1】:

Use Series.str.count to count the matched values, then use Series.gt to test for counts greater than 1:

mask = df.Items.str.count('|'.join(My_Items)).gt(1)

df['My Items'] = np.where(mask,'Contains 2 or more items', '')
print (df)
                                             Items                  My Items
0                Product A + Product B + Product C  Contains 2 or more items
1  Product A + Product B + Product B1 + Product C1  Contains 2 or more items

Details:

print (df.Items.str.count('|'.join(My_Items)))
0    2
1    3
Name: Items, dtype: int64
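One thing to be aware of: row 1 counts 3 because the pattern matches 'Product B' inside 'Product B1' and 'Product C' inside 'Product C1'. If only whole product names should count, a possible variant (a sketch, not part of the original answer) escapes each name and wraps the alternation in word boundaries. The third sample row is an assumption.

```python
import re
import pandas as pd

My_Items = ['Product B', 'Product C', 'Product C1']
df = pd.DataFrame({'Items': [
    'Product A + Product B + Product C',
    'Product A + Product B + Product B1 + Product C1',
    'Product A + Product D',  # hypothetical row with no listed product
]})

# Escape each name and surround the alternation with \b so that
# 'Product B' no longer matches inside 'Product B1'.
pattern = r'\b(?:' + '|'.join(re.escape(item) for item in My_Items) + r')\b'

counts = df['Items'].str.count(pattern)
mask = counts.gt(1)
print(counts)
```

With this pattern row 1 counts 2 ('Product B' and 'Product C1') rather than 3.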

【Comments】:

【Solution 2】:

IIUC, you can use str.findall and check that we get at least 2 matches:

import numpy as np

m = df.Items.str.findall('|'.join(My_Items)).str.len().ge(2)
df['My items'] = np.where(m, 'Contains at least 2 items', '')

If we check with an additional row that contains fewer than 2 of the listed products:

print(df)

                        Items  \
0                Product A + Product B + Product C      
1  Product A + Product B + Product B1 + Product C1     
2                            Product A + Product D    

                    My items  
0  Contains at least 2 items  
1  Contains at least 2 items  
2

df.Items.str.findall('|'.join(My_Items)) gives you a list of all the matches found:

df.Items.str.findall('|'.join(My_Items))

0               [Product B, Product C]
1    [Product B, Product B, Product C]
2                                   []
Name: Items, dtype: object
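The list for row 1 shows 'Product B' twice and 'Product C' instead of 'Product C1', because the regex matches prefixes of longer names. If exact product names are wanted, one alternative (a sketch that assumes products are always separated by '+') is to split each cell and keep only names that appear in the list. The helper `matched_products` and the set `wanted` are hypothetical names introduced here.

```python
import pandas as pd

My_Items = ['Product B', 'Product C', 'Product C1']
wanted = set(My_Items)  # set membership gives exact-name lookups

df = pd.DataFrame({'Items': [
    'Product A + Product B + Product C',
    'Product A + Product B + Product B1 + Product C1',
    'Product A + Product D',
]})

def matched_products(cell):
    # Split on '+' and trim whitespace to recover the exact product names.
    names = [name.strip() for name in cell.split('+')]
    return [name for name in names if name in wanted]

matches = df['Items'].apply(matched_products)
at_least_two = matches.apply(len).ge(2)
print(matches)
```

Here row 1 yields ['Product B', 'Product C1'], so the combinations can be analysed without the prefix artefacts.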

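【Comments】:

【Solution 3】:

Thank you both! The solution I was looking for turned out to be a combination of your two answers!

What I ended up doing was this mask, so I can filter: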

DF['My_Items'] = DF.Items.str.findall('|'.join(My_list)).str.len().gt(1)

And then this is the list of items, so I can now analyse the combinations:

DF['My_Items'] = DF.Items.str.findall('|'.join(My_list)).astype(str)
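Both assignments above write to the same 'My_Items' column, so running them one after the other keeps only the second result. A minimal sketch that keeps the mask and the list side by side is shown here; the column names 'Has_My_Items' and 'Matched_Items' and the sample rows are assumptions for illustration.

```python
import pandas as pd

My_list = ['Product B', 'Product C', 'Product C1']
DF = pd.DataFrame({'Items': [
    'Product A + Product B + Product C',
    'Product A + Product B + Product B1 + Product C1',
    'Product A + Product D',
]})

# Run findall once and derive both columns from the same result.
found = DF['Items'].str.findall('|'.join(My_list))

DF['Has_My_Items'] = found.str.len().gt(1)   # boolean mask for filtering
DF['Matched_Items'] = found.astype(str)      # matched items as text

filtered = DF[DF['Has_My_Items']]
print(filtered)
```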

【Comments】: