Loop through subsets of rows in a DataFrame: answers

【Question title】: Loop through subsets of rows in a DataFrame
【Posted】: 2019-02-15 05:20:08
【Question】:

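I am trying to loop over the rows of a DataFrame with a function that calculates the most frequent element in a Series. The function works perfectly when I pass it a Series by hand: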

import pandas as pd

# Create DataFrame
df = pd.DataFrame({'a' : [1, 2, 1, 2, 1, 2, 1, 1],
              'b' : [1, 1, 2, 1, 1, 1, 2, 2],
              'c' : [1, 2, 2, 1, 2, 2, 2, 1]})

# Create function calculating most frequent element
from collections import Counter

def freq_value(series):
    return Counter(series).most_common()[0][0]

# Test function on one row
freq_value(df.iloc[1])

# Another test
freq_value((df.iloc[1, 0], df.iloc[1, 1], df.iloc[1, 2]))

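Both tests give the result I want. However, when I try to apply this function to the DataFrame's rows in a loop and save the result in a new column, I get the error "'Series' object is not callable", 'occurred at index 0'. The line that raises the error is: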

# Loop through rows of a dataframe and write the result into new column
df['result'] = df.apply(lambda row: freq_value((row('a'), row('b'), row('c'))), axis = 1)

How exactly does row() work inside apply()? Shouldn't it pass the values of columns 'a', 'b' and 'c' to my freq_value() function?

【Question comments】:

Try row['a'] instead of row('a') in the apply call.
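
A minimal sketch of what this comment points at, using a small hypothetical Series as a stand-in for the row that apply passes in (the values are made up for illustration). Calling the Series with parentheses should raise the TypeError from the question, while square brackets look up the label:

```python
import pandas as pd

# Hypothetical stand-in for one row as seen by apply(..., axis=1)
row = pd.Series({'a': 2, 'b': 1, 'c': 2})

try:
    row('a')              # parentheses try to call the Series like a function
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

value = row['a']          # square brackets look up the label 'a'
print(value)
```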

Tags: python pandas loops apply

【Solution 1】:

df['CommonValue'] = df.apply(lambda x: x.mode()[0], axis = 1)
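
A short sketch of how x.mode()[0] behaves when a row has tied values, assuming a hypothetical Series in which 1 and 2 each appear twice. Series.mode returns all modes in sorted order, so [0] picks the smallest tied value, whereas the Counter-based freq_value from the question picks the tied value it saw first:

```python
import pandas as pd
from collections import Counter

# Hypothetical row where 1 and 2 both appear twice
tied = pd.Series([2, 1, 2, 1])

modes = tied.mode()       # Series holding every mode, sorted ascending
first_mode = modes[0]     # smallest of the tied values

# Counter keeps first-seen order among equal counts
counter_pick = Counter(tied).most_common()[0][0]

print(modes.tolist(), first_mode, counter_pick)
```

For rows without ties, as in the question's data, both approaches should agree.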

【Comments】:

【Solution 2】:

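@jpp's answer addresses how to apply your custom function, but you can also get the desired result with df.mode and axis=1. This avoids apply and still gives you a column with the most common value in each row.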

df['result'] = df.mode(axis=1)[0]  # [0] keeps the first mode if a row has ties

>>> df
   a  b  c  result
0  1  1  1       1
1  2  1  2       2
2  1  2  2       2
3  2  1  1       1
4  1  1  2       1
5  2  1  2       2
6  1  2  2       2
7  1  2  1       1
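
One caveat, sketched with a hypothetical two-column frame: df.mode(axis=1) returns a DataFrame rather than a Series, and when a row has tied values the result gains extra columns padded with NaN. The three-column data above only contains two distinct values, so no row can tie there. Selecting column 0 keeps the first mode of each row:

```python
import pandas as pd

# Hypothetical data: the first row has a tie between 1 and 2
ties = pd.DataFrame({'a': [1, 1], 'b': [2, 1]})

modes = ties.mode(axis=1)   # one column per mode; shorter rows are padded with NaN
first = modes[0]            # first (smallest) mode of each row

ties['result'] = first
print(modes)
print(ties)
```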

【Comments】:

Good point. It is also worth noting that mode has been improved and is now efficient compared with the alternatives.

【Solution 3】:

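row is not a function inside the lambda, so parentheses are not appropriate. You should use the __getitem__ method or the loc accessor to access the values. The syntactic sugar for the former is []: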

df['result'] = df.apply(lambda row: freq_value((row['a'], row['b'], row['c'])), axis=1)

Alternatively, using loc:

def freq_value_calc(row):
    return freq_value((row.loc['a'], row.loc['b'], row.loc['c']))

To understand exactly why this happens, it helps to rewrite the lambda as a named function:

def freq_value_calc(row):
    print(type(row))  # useful for debugging
    return freq_value((row['a'], row['b'], row['c']))

df['result'] = df.apply(freq_value_calc, axis=1)

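Run this and you will see that row has type <class 'pandas.core.series.Series'>, i.e. when you use axis=1, each row is a Series indexed by the column labels. To access the Series value for a given label, you can use the __getitem__ / [] syntax or loc.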
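
Since each row is already a Series, and the question shows freq_value working on df.iloc[1], one possible simplification is to pass the row straight to freq_value instead of unpacking it label by label. This is a sketch under that assumption; the cols list is a hypothetical way to restrict the calculation to chosen columns:

```python
import pandas as pd
from collections import Counter

def freq_value(series):
    # Most frequent element; ties go to the value seen first
    return Counter(series).most_common()[0][0]

df = pd.DataFrame({'a': [1, 2, 1, 2, 1, 2, 1, 1],
                   'b': [1, 1, 2, 1, 1, 1, 2, 2],
                   'c': [1, 2, 2, 1, 2, 2, 2, 1]})

# Pass every row (a Series) directly to freq_value
whole_row = df.apply(freq_value, axis=1)

# Select the columns of interest first, then apply row by row
cols = ['a', 'b', 'c']   # hypothetical subset; here it is all three columns
subset = df[cols].apply(freq_value, axis=1)

df['result'] = subset
print(df)
```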

【Comments】: