Search a Pandas Column for Substrings from Another Column: Answers

【Question title】: Search Pandas Column for Substring in other Column
【Posted】: 2016-11-02 20:36:48
【Question】:

I have an example .csv file, imported as the DataFrame df, as follows:

    Ethnicity, Description
  0 French, Irish Dance Company
  1 Italian, Moroccan/Algerian
  2 Danish, Company in Netherlands
  3 Dutch, French
  4 English, EnglishFrench
  5 Irish, Irish-American

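I want to check whether df['Description'] contains any of the strings in df['Ethnicity']. This should return rows 0, 3, 4 and 5, because those Description strings contain a string from the Ethnicity column.

So far I have tried: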

df[df['Ethnicity'].str.contains('French')]['Description']

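This returns the rows matching one specific string, but I want to iterate without searching for each Ethnicity value individually. I also tried converting the columns to lists and iterating over them, but I can't find a way to return the rows, because the result is no longer a DataFrame.

Thank you in advance!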

【Comments】:

Tags: python string pandas dataframe substring

【Solution 1】:

You can use str.contains with the values of the Ethnicity column: convert them with tolist and join them with |, which is the regex "or":

print ('|'.join(df.Ethnicity.tolist()))
French|Italian|Danish|Dutch|English|Irish

mask = df.Description.str.contains('|'.join(df.Ethnicity.tolist()))
print (mask)
0     True
1    False
2    False
3     True
4     True
5     True
Name: Description, dtype: bool

#boolean-indexing
print (df[mask])
  Ethnicity          Description
0    French  Irish Dance Company
3     Dutch               French
4   English        EnglishFrench
5     Irish       Irish-American

It seems you can omit tolist():

print (df[df.Description.str.contains('|'.join(df.Ethnicity))])
  Ethnicity          Description
0    French  Irish Dance Company
3     Dutch               French
4   English        EnglishFrench
5     Irish       Irish-American
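One caveat with the joined pattern is that str.contains treats it as a regular expression by default, so a value containing characters such as "+", "(" or "." would not be matched literally. The following is a minimal sketch of a more defensive variant, assuming the sample data from the question (rebuilt by hand here) and assuming that missing values should simply not match; the names pattern and result are illustrative only.

```python
import re

import pandas as pd

# Sample data rebuilt from the question so the snippet runs on its own.
df = pd.DataFrame({
    "Ethnicity": ["French", "Italian", "Danish", "Dutch", "English", "Irish"],
    "Description": [
        "Irish Dance Company",
        "Moroccan/Algerian",
        "Company in Netherlands",
        "French",
        "EnglishFrench",
        "Irish-American",
    ],
})

# Escape each value so regex metacharacters are matched literally,
# and drop NaN / duplicate values before building the alternation.
pattern = "|".join(re.escape(value) for value in df["Ethnicity"].dropna().unique())

# na=False makes rows with a missing Description evaluate to False (assumption).
mask = df["Description"].str.contains(pattern, na=False)
result = df[mask]
print(result)
```

For the sample data none of the values need escaping, so the pattern is the same as in the answer above and the same four rows are selected.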

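【Discussion】:

Thank you very much! This worked when I implemented it. I don't have much experience with regular expressions (regex), so I will definitely read up on them.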

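【Solution 2】:

The ever-popular double apply:

# axis must be passed by keyword; the positional form any(1) is rejected by pandas 2.x
df[df.Description.apply(lambda x: df.Ethnicity.apply(lambda y: y in x)).any(axis=1)]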

  Ethnicity          Description
0    French  Irish Dance Company
3     Dutch               French
4   English        EnglishFrench
5     Irish       Irish-American
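The same literal (non-regex) substring test can also be written with a single apply and Python's built-in any, which avoids building an intermediate DataFrame of booleans. This is only a sketch under the assumption that both columns contain plain strings with no missing values; the names ethnicities and result are illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question so the snippet runs on its own.
df = pd.DataFrame({
    "Ethnicity": ["French", "Italian", "Danish", "Dutch", "English", "Irish"],
    "Description": [
        "Irish Dance Company",
        "Moroccan/Algerian",
        "Company in Netherlands",
        "French",
        "EnglishFrench",
        "Irish-American",
    ],
})

# Collect the search terms once instead of re-reading the column per row.
ethnicities = df["Ethnicity"].tolist()

# True for a row when at least one Ethnicity value is a substring of its Description.
mask = df["Description"].apply(lambda text: any(e in text for e in ethnicities))
result = df[mask]
print(result)
```

Because the `in` operator compares plain substrings, no regex escaping is needed in this variant.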

Timing

jezrael's answer is much better.
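The original timing output is not reproduced here. A rough way to compare the two approaches yourself is sketched below, assuming the sample data from the question repeated 50 times; the helper names with_regex and with_double_apply, the repeat factor and the number of runs are all arbitrary choices for illustration, and the measured times will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Sample data rebuilt from the question so the snippet runs on its own.
df = pd.DataFrame({
    "Ethnicity": ["French", "Italian", "Danish", "Dutch", "English", "Irish"],
    "Description": [
        "Irish Dance Company",
        "Moroccan/Algerian",
        "Company in Netherlands",
        "French",
        "EnglishFrench",
        "Irish-American",
    ],
})


def with_regex(frame):
    # Solution 1: one regex alternation built from the Ethnicity column.
    return frame[frame["Description"].str.contains("|".join(frame["Ethnicity"]))]


def with_double_apply(frame):
    # Solution 2: nested apply producing a rows-by-values boolean DataFrame.
    hits = frame["Description"].apply(
        lambda x: frame["Ethnicity"].apply(lambda y: y in x)
    )
    return frame[hits.any(axis=1)]


# Repeat the sample rows to get a slightly larger frame (arbitrary size).
big = pd.concat([df] * 50, ignore_index=True)

for func in (with_regex, with_double_apply):
    seconds = timeit.timeit(lambda: func(big), number=3)
    print(f"{func.__name__}: {seconds:.4f} s for 3 runs")
```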

【Discussion】:

Thanks for your answer! This worked when I implemented it.