Combine 2 string columns in pandas with different conditions in both columns: answers

【Question title】: Combine 2 string columns in pandas with different conditions in both columns
【Posted】: 2022-01-21 14:33:19
【Question】:

I have 2 columns in pandas, with data that looks like this.

code fx         category
AXD  AXDG.R     cat1
AXF  AXDG_e.FE  cat1 
333  333.R      cat1
....

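There are other categories, but I am only interested in cat1.

I want to combine everything in the code column with everything after the . in the fx column, and replace the code column with the new combination without affecting the other rows.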

code    fx         category
AXD.R   AXDG.R     cat1
AXF.FE  AXDG_e.FE  cat1
333.R   333.R      cat1
.....

Here is my code. I think I have to use regex, but I'm not sure how to combine the columns this way.

df.loc[df['category']== 'cat1', 'code'] = df[df['category'] == 'cat1']['code'].str.replace(r'[a-z](?=\.)', '', regex=True).str.replace(r'_?(?=\.)','', regex=True).str.replace(r'G(?=\.)', '', regex=True)

I'm also not sure how to select the second column. Any help would be greatly appreciated.

【Question comments】:

Tags: python pandas dataframe

【Solution 1】:

"There are other categories, but I am only interested in cat1"

You can use str.split and Series.where to add the extension only for cat1 rows:

df['code'] = (df['code'].astype(str).add("."+df['fx'].str.split(".").str[-1])
             .where(df['category'].eq("cat1"),df['code']))

print(df)

     code         fx category
0   AXD.R     AXDG.R     cat1
1  AXF.FE  AXDG_e.FE     cat1
2   333.R      333.R     cat1
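
To check that rows outside cat1 really are left alone, here is a minimal self-contained sketch of the same str.split plus Series.where idea. The sample frame is an assumption: the first three rows mirror the question, while the cat2 row is invented purely for illustration.

```python
import pandas as pd

# Hypothetical sample data: the first three rows mirror the question,
# the cat2 row is invented to show that other categories are untouched.
df = pd.DataFrame({
    "code": ["AXD", "AXF", "333", "ZZZ"],
    "fx": ["AXDG.R", "AXDG_e.FE", "333.R", "ZZZQ.X"],
    "category": ["cat1", "cat1", "cat1", "cat2"],
})

# Text after the last "." in fx, e.g. "AXDG_e.FE" -> "FE".
suffix = df["fx"].str.split(".").str[-1]

# Build "code.suffix" for every row, then keep it only where the category
# is cat1; everywhere else fall back to the original code value.
combined = df["code"].astype(str) + "." + suffix
df["code"] = combined.where(df["category"].eq("cat1"), df["code"])

print(df["code"].tolist())
```

Series.where keeps values where the condition is True and takes the replacement elsewhere, which is why the invented cat2 row keeps its original code.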

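【Comments】:

I get this error: TypeError: unsupported operand type(s) for +: 'float' and 'str'. I guess some rows might contain numbers? Is that possible?
@anarchy You probably have mixed data types in your series. Convert it to string first with df['code']=df['code'].astype(str) and try again.
Yes, I did figure it out haha, I added astype and it works. Thanks!
I updated your answer to include the string conversion.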
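
The mixed-type problem from these comments can be reproduced in a small sketch. The series below is an assumption (an object column holding one string and one integer), so the exact error text may differ from the one reported, but the cause and the astype(str) fix are the same.

```python
import pandas as pd

# Hypothetical column holding both a string and an integer (object dtype).
code = pd.Series(["AXD", 333], dtype=object)

try:
    code + ".R"            # 333 + ".R" is not defined for int and str
    raised = False
except TypeError:
    raised = True

# Converting to str first makes every element a string, so + concatenates.
fixed = code.astype(str) + ".R"
print(raised, fixed.tolist())
```

One thing to keep in mind: astype(str) also turns missing values into the literal text "nan", so it is worth checking for NaN in the column before converting.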

【Solution 2】:

Remove the word characters before the dot in fx, then append what remains to the code column.

df['code'] += df['fx'].str.replace(r'^\w+(?=\.)', '', regex=True)

    code         fx  category
0   AXD.R     AXDG.R     cat1
1  AXF.FE  AXDG_e.FE     cat1
2   333.R      333.R     cat1
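
To see what the replace step leaves behind before it is appended, here is a small sketch that applies the same pattern to the fx values from the question. Only the variable names are made up.

```python
import pandas as pd

fx = pd.Series(["AXDG.R", "AXDG_e.FE", "333.R"])

# ^\w+ matches the leading run of letters, digits and underscores;
# the lookahead (?=\.) requires a dot right after that run without
# consuming it, so the dot stays in the result.
tail = fx.str.replace(r"^\w+(?=\.)", "", regex=True)
print(tail.tolist())
```

Note that the one-liner in this answer appends to every row. Restricting it to cat1 would need a mask, as in the other answers.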

【Comments】:

【Solution 3】:

We can use split like this to get the expected result:

>>> df['code'] = df['code'] + '.' + df['fx'].str.split(pat=".", expand=True)[1]
>>> df
    code    fx          category    
0   AXD.R   AXDG.R      cat1        
1   AXF.FE  AXDG_e.FE   cat1        
2   333.R   333.R       cat1

To restrict this to cat1, as @anky nicely did, we can add a where clause:

>>> df['code'] = (df['code'] + '.' + df['fx'].str.split(pat=".", expand=True)[1]).where(df['category'].eq("cat1"), df['code'])
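
One difference from Solution 1 is worth a quick sketch: with expand=True, column 1 is the piece right after the first dot, whereas .str[-1] is the piece after the last dot. The value "A.B.C" below is made up to show this, since every fx value in the question has exactly one dot.

```python
import pandas as pd

# "A.B.C" is a made-up value with two dots; the question's data has one dot each.
fx = pd.Series(["AXDG.R", "A.B.C"])

# Column 1 of the expanded frame: the piece right after the first dot.
second_piece = fx.str.split(pat=".", expand=True)[1]

# Last element of each split list: the piece after the last dot.
last_piece = fx.str.split(pat=".").str[-1]

print(second_piece.tolist(), last_piece.tolist())
```

For data shaped like the question's, both give the same result. They only diverge when fx can contain more than one dot.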

【Comments】:

What about filtering only "cat1"?
Indeed! I updated my answer to filter on cat1.

【Solution 4】:

You can use Series.str.extract:

import numpy as np

df['code'] = df['code'].astype(str) + np.where(df['category'].eq('cat1'), df['fx'].astype(str).str.extract(r'(\..+)')[0], '')

Output:

>>> df
     code         fx category
0   AXD.R     AXDG.R     cat1
1  AXF.FE  AXDG_e.FE     cat1
2   333.R      333.R     cat1
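
A point the example data does not exercise: when an fx value has no dot, str.extract returns a missing value for that row. The sketch below uses an invented value "NODOT" to show this and one possible way to guard against it. The fillna step is a suggestion, not part of the original answer.

```python
import pandas as pd

# "NODOT" is a hypothetical fx value without a dot.
fx = pd.Series(["AXDG.R", "NODOT"])

# expand=False returns a Series for a single capture group, which is
# equivalent to taking column 0 of the default DataFrame result.
ext = fx.str.extract(r"(\..+)", expand=False)
missing = ext.isna().tolist()

# Replacing the missing match with "" keeps later string concatenation safe.
safe = ext.fillna("")
print(missing, safe.tolist())
```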

【Comments】:

【Solution 5】:

You can extract the relevant part of fx and append it to code:

df['code'] += df['fx'].str.extract(r'(\..*$)')[0]

Output:

     code         fx category
0   AXD.R     AXDG.R     cat1
1  AXF.FE  AXDG_e.FE     cat1
2   333.R      333.R     cat1

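Restricted to cat1 only:

df.loc[df['category'].eq('cat1'), 'code'] += df['fx'].str.extract(r'(\..*$)')[0]

【Comments】:

Can you explain the \..*$ part?
@anarchy It matches a literal dot \., followed by any sequence of characters .*, up to the end of the line $.
I thought only one dot was needed, what is the second one for?
An unescaped dot means "any character" in a regex.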
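
The difference between the escaped and the unescaped dot can be checked directly with Python's re module. This is a small illustrative sketch using one fx value from the question. The variable names are made up.

```python
import re

text = "AXDG_e.FE"

# Escaped dot: \. matches only a literal ".", then .* takes the rest,
# so the match starts at the dot.
escaped = re.search(r"\..*$", text).group()

# Unescaped dot: "." matches any character, so the match already
# starts at the first character of the string.
unescaped = re.search(r"..*$", text).group()

print(escaped, unescaped)
```

So the first dot in the pattern (escaped) pins the match to the real dot in the value, and the second dot (unescaped, followed by *) consumes whatever comes after it.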