Apply a function to modify multiple columns with if else logic: answers

【Question title】: Apply a function to modify multiple columns with if else logic
【Posted】: 2018-11-22 10:00:53
【Question description】:

I am trying to write a function with if-else logic that modifies two columns in my dataframe, but it is not working. Here is my function:

def get_comment_status(df):
    if df['address'] == 'NY':
        df['comment'] = 'call tomorrow'
        df['selection_status'] = 'interview scheduled'
        return df['comment'] 
        return df['selection_status']
    else:
        df['comment'] = 'Dont call'
        df['selection_status'] = 'application rejected'
        return df['comment']
        return df['selection_status']

I then execute the function as:

df[['comment', 'selection_status']] = df.apply(get_comment_status, axis = 1)

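But I am getting errors. What am I doing wrong? My guess is that the df.apply() syntax is wrong.

Error messages:

TypeError: 'str' object cannot be interpreted as an integer
KeyError: ('address', 'occurred at index 0')

Sample dataframe: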

df = pd.DataFrame({'address': ['NY', 'CA', 'NJ', 'NY', 'WS', 'OR', 'OR'],
               'name1': ['john', 'mayer', 'dylan', 'bob', 'mary', 'jake', 'rob'],
               'name2': ['mayer', 'dylan', 'mayer', 'bob', 'bob', 'tim', 'ben'],
               'comment': ['n/a', 'n/a', 'n/a', 'n/a', 'n/a', 'n/a', 'n/a'],
               'score': [90, 8, 88, 72, 34, 95, 50],
               'selection_status': ['inprogress', 'inprogress', 'inprogress', 'inprogress', 'inprogress', 'inprogress', 'inprogress']})

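I also thought about using a lambda function, but that does not work because I am trying to assign values to the 'comment' and 'selection_status' columns with '='.

Note: I checked this question, which has a similar title, but it does not solve my problem.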

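【Question discussion】:

It would be useful if you also listed the errors.
Look at your return statements: only the first one in each branch is executed. You need to return something else, basically both values at once.
Can you post your desired output?
Note that the function passed to .apply does not receive the dataframe but a single row. For your code this does not matter, but naming the variable df inside the function suggests that you are thinking about apply incorrectly, which will cause confusion later.
@9769953 - That is a very useful note. Thanks.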

Tags: python python-3.x pandas if-statement

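【Solution 1】:

You should use numpy.where, as in DyZ's solution. A major advantage of Pandas is vectorized computation. However, below I show how you would use pd.DataFrame.apply. Points to note:

With axis=1, your function is fed one row at a time, not the whole dataframe at once. So you should name the parameter accordingly.
Two return statements in a function will not work. A function stops as soon as it reaches a return.
Instead, you need to return a list of results, then unpack it with pd.Series.values.tolist.

Here is a working example.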
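To illustrate the second and third notes in isolation, here is a minimal pure-Python sketch. The function names two_returns and one_return are hypothetical and exist only for this illustration; they are not part of the question's code.

```python
def two_returns(address):
    # Hypothetical helper mirroring the question's pattern.
    if address == 'NY':
        return 'call tomorrow'
        return 'interview scheduled'  # unreachable: the function already returned
    return 'Dont call'


def one_return(address):
    # Hypothetical helper: both values travel back in a single list.
    if address == 'NY':
        return ['call tomorrow', 'interview scheduled']
    return ['Dont call', 'application rejected']


first_only = two_returns('NY')      # only the first value comes back
comment, status = one_return('NY')  # both values, unpacked from the list
```

The caller of two_returns never sees the second string, which is why the original function could not fill two columns.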

def get_comment_status(row):
    if row['address'] == 'NY':
        return ['call tomorrow', 'interview scheduled']
    else:
        return ['Dont call', 'application rejected']

df[['comment', 'selection_status']] = df.apply(get_comment_status, axis=1).values.tolist()

print(df)

  address  name1  name2        comment  score      selection_status
0      NY   john  mayer  call tomorrow     90   interview scheduled
1      CA  mayer  dylan      Dont call      8  application rejected
2      NJ  dylan  mayer      Dont call     88  application rejected
3      NY    bob    bob  call tomorrow     72   interview scheduled
4      WS   mary    bob      Dont call     34  application rejected
5      OR   jake    tim      Dont call     95  application rejected
6      OR    rob    ben      Dont call     50  application rejected
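As an alternative to .values.tolist(), and not part of the original answer, apply can expand list-like results into columns itself through its result_type='expand' argument. This is a sketch that assumes pandas 0.23 or later, where that argument is available; the small dataframe and the intermediate name expanded are made up for the illustration.

```python
import pandas as pd

# Small illustrative dataframe (a subset of the question's sample data).
df = pd.DataFrame({'address': ['NY', 'CA', 'NJ', 'NY'],
                   'comment': ['n/a'] * 4,
                   'selection_status': ['inprogress'] * 4})


def get_comment_status(row):
    # Return both values together so apply can expand them.
    if row['address'] == 'NY':
        return ['call tomorrow', 'interview scheduled']
    return ['Dont call', 'application rejected']


# result_type='expand' turns each returned list into one row of a new DataFrame.
expanded = df.apply(get_comment_status, axis=1, result_type='expand')
# Name the expanded columns explicitly so the assignment aligns by label.
expanded.columns = ['comment', 'selection_status']
df[['comment', 'selection_status']] = expanded
```

This is still a row-by-row loop, so it shares the performance cost of any apply with axis=1.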

【Discussion】:

This helped me a lot. Although I will lean toward np.where() from now on, I still want to learn different ways of doing the same thing.

【Solution 2】:

What you are trying to do is not quite in line with the Pandas philosophy. Also, apply is a very inefficient function. You should probably use NumPy where:

import numpy as np
df['comment'] = np.where(df['address'] == 'NY',
                  'call tomorrow', 'Dont call')
df['selection_status'] = np.where(df['address'] == 'NY',
                           'interview scheduled', 'application rejected')

Or boolean indexing:

df.loc[df['address'] == 'NY', ['comment', 'selection_status']] \
         = 'call tomorrow', 'interview scheduled'
df.loc[df['address'] != 'NY', ['comment', 'selection_status']] \
         = 'Dont call', 'application rejected'
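Both variants above evaluate df['address'] == 'NY' more than once. A small variation, offered here as a sketch rather than as part of the original answer, computes the boolean mask once and reuses it for both columns. The name is_ny and the reduced dataframe are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative dataframe (a subset of the question's sample data).
df = pd.DataFrame({'address': ['NY', 'CA', 'NJ', 'NY'],
                   'comment': ['n/a'] * 4,
                   'selection_status': ['inprogress'] * 4})

# Compute the condition once; it is a boolean Series aligned with df.
is_ny = df['address'] == 'NY'

# np.where picks the first value where the mask is True, the second elsewhere.
df['comment'] = np.where(is_ny, 'call tomorrow', 'Dont call')
df['selection_status'] = np.where(is_ny, 'interview scheduled',
                                  'application rejected')
```

Each column is still updated separately, but the condition lives in one place, so changing the rule later means editing a single line.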

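【Discussion】:

This is what I understand so far: if I need to return multiple columns, writing a function is of no use. I have used the df.loc approach before, but here I wanted to return both columns at the same time instead of handling them separately with np.where or df.loc. I guess that is not the right approach.
@singularity2047, Pandas is built on arrays of series (columns). Updating each series separately in a vectorized way is usually faster than updating them together via pd.DataFrame.apply (which is just a very inefficient loop).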