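How to match partial strings between two unequal-sized columns in Python: answers

【Question title】: python how to match partial strings between two unequal sized columns
【Posted】: 2017-08-26 04:03:33
【Question description】:

I'm very new to Python and haven't fully figured out how to use it properly, so please bear with my ignorance here.

Suppose we have a data frame like this: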

import pandas as pd

samp_data = pd.DataFrame([[1,'hello there',3],
                          [4,'im just saying hello',6],
                          [7,'but sometimes i say bye',9],
                          [2,'random words here',5]],
                         columns=["a", "b", "c"])
print(samp_data)
   a                        b  c
0  1              hello there  3
1  4     im just saying hello  6
2  7  but sometimes i say bye  9
3  2        random words here  5

And we set up a list of words we don't want:

unwanted_words = ['hello', 'random']

I want to write a function that excludes every row where column b contains any of the words in the unwanted_words list. So the output should be:

print(samp_data)
   a                        b  c
2  7  but sometimes i say bye  9

What I've tried so far includes the built-in isin() function:

data = samp_data.ix[samp_data['b'].isin(unwanted_words),:]

But this doesn't exclude the rows the way I expected. I also tried the str.contains() function:

for i,row in samp_data.iterrows():
    if unwanted_words.str.contains(row['b']).any():
        print('found matching words')

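This gives me an error.

I think I'm just overcomplicating things, and there must be some very simple way that I don't know about. Any help is greatly appreciated!

Posts I've read so far (not limited to this list, since I've already closed many windows):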

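【Question comments】:

You should tag your question with "pandas". This isn't pure Python.

Tags: python pandas

【Solution 1】:

You were actually very close to the solution. It uses the Series.str.contains method. Keep in mind that it accepts regular expressions: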

samp_data[~samp_data['b'].str.contains(r'hello|random')]

The result will be:

Out [11]:
    a                         b c
2   7   but sometimes i say bye 9
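
To see why the one-liner works, it can help to split it into two steps: building a boolean mask and then inverting it. This is only a sketch on the question's sample data; the names mask and filtered are illustrative and not part of the original answer.

```python
import pandas as pd

samp_data = pd.DataFrame(
    [[1, 'hello there', 3],
     [4, 'im just saying hello', 6],
     [7, 'but sometimes i say bye', 9],
     [2, 'random words here', 5]],
    columns=["a", "b", "c"])

# True for every row whose b value matches either alternative of the regex
mask = samp_data['b'].str.contains(r'hello|random')

# ~ inverts the boolean mask, so only the non-matching rows are kept
filtered = samp_data[~mask]

print(mask.tolist())
print(filtered)
```

Because the pattern is a regular expression, the | acts as "or" between the two words.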

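【Comments】:

Wow, thank you! I like your solution the most! One line, and pretty much what I had in mind.

【Solution 2】:

Perhaps not the most elegant, but I think it'll work for you?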

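def in_excluded(row, excluded):
    """
    (list, list) -> bool
    Return True if any string item in row contains a word from excluded.
    """
    for item in row:
        # Skip non-string items such as the integers in columns a and c
        if isinstance(item, str) and any(word in item for word in excluded):
            return True
    return False


def print_only_wanted(rows, excluded):
    """
    (list, list) -> None
    Prints each of the lists in the main list unless they contain a word
    from excluded. For a DataFrame, pass samp_data.values.tolist().
    """
    for row in rows:
        if not in_excluded(row, excluded):
            print(row)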

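【Comments】:

【Solution 3】:

You can use in to determine whether one string can be found inside another. For example, "he" in "hello" returns True. You can combine this with a list comprehension and the any function to select the rows you want: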

df_sub = samp_data.loc[samp_data['b'].apply(lambda x: not any(badword in x for badword in unwanted_words))]
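
The same idea can be tried on plain Python strings, without pandas, which makes the roles of in and any easier to see. A minimal sketch: the helper name contains_unwanted and the sentences list are made up for illustration, with the strings copied from column b of the question.

```python
unwanted_words = ['hello', 'random']
sentences = ['hello there',
             'im just saying hello',
             'but sometimes i say bye',
             'random words here']


def contains_unwanted(text, words):
    """Return True if any of the words occurs in text as a substring."""
    return any(word in text for word in words)


# Keep only the sentences that contain none of the unwanted words
kept = [s for s in sentences if not contains_unwanted(s, unwanted_words)]
print(kept)
```

Passing a generator expression to any lets it stop at the first match instead of building the whole list first.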

【Comments】:

【Solution 4】:

You can use str.contains

samp_data = samp_data[~samp_data.b.str.contains('hello|random')]

You get

    a   b                       c
2   7   but sometimes i say bye 9

If your list of unwanted words is longer, you may want to use

unwanted_words = ['hello', 'random']
samp_data = samp_data[~samp_data.b.str.contains('|'.join(unwanted_words))]
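
One thing to watch for when joining the list into a pattern: str.contains treats it as a regular expression, so words containing characters such as $ or ( would be read as regex syntax. A hedged sketch of one way to guard against that, using re.escape on each word and na=False so that missing values count as "no match". The sample frame and word list below are hypothetical and chosen only to show the effect.

```python
import re

import pandas as pd

# Hypothetical data: one value contains regex metacharacters, one is missing
df = pd.DataFrame({'b': ['hello there', 'price is $5 (approx)', 'bye', None]})
unwanted_words = ['hello', '$5 (approx)']

# Escape each word so it is matched literally, then join with | for "or"
pattern = '|'.join(re.escape(word) for word in unwanted_words)

# na=False makes missing values evaluate to False instead of NaN
mask = df['b'].str.contains(pattern, na=False)
cleaned = df[~mask]

print(mask.tolist())
print(cleaned)
```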

【Comments】:

【Solution 5】:

How about this one-liner? I'm sure some other pandas enthusiasts will have better answers than mine.

samp_data[~samp_data['b'].apply(lambda x: any(word in unwanted_words for word in x.split()))]

   a                        b  c
2  7  but sometimes i say bye  9
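
Note that this one-liner compares whole whitespace-separated words, whereas str.contains and the in operator on strings match substrings. On the question's data both give the same result, but they can differ. A small pure-Python sketch of the difference, where the texts list and both helper names are made up for illustration:

```python
unwanted_words = ['hello', 'random']
unwanted = set(unwanted_words)  # set membership checks are fast

# Hypothetical inputs: 'hellos' contains 'hello' but is a different word
texts = ['hellos everyone', 'hello there']


def has_unwanted_word(text):
    """Whole-word check: split on whitespace, then look each word up."""
    return any(word in unwanted for word in text.split())


def has_unwanted_substring(text):
    """Substring check: does any unwanted word occur anywhere in text?"""
    return any(word in text for word in unwanted_words)


whole_word = [has_unwanted_word(t) for t in texts]
substring = [has_unwanted_substring(t) for t in texts]
print(whole_word)
print(substring)
```

Which behaviour is right depends on whether partial matches such as "hellos" should count as unwanted.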

【Comments】: