How to apply a custom function to each row of a pandas DataFrame: answers

【Question title】: how to apply custom function to each row of pandas dataframe
【Posted】: 2017-11-18 16:01:53
【Question】:

I have the following example:

import pandas as pd
import numpy as np

df = pd.DataFrame([(0,2,5), (2,4,None),(7,-5,4), (1,None,None)])

def clean(series):
    start = np.min(list(series.index[pd.isnull(series)]))
    end = len(series)
    series[start:] = series[start-1]
    return series

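My goal is to end up with a DataFrame in which every row that contains None values has them filled with the last available numeric value in that row.

For example, running this function on just the row with index 3 of the DataFrame produces the following: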

row = df.loc[3]
test = clean(row)
test

0    1.0
1    1.0
2    1.0
Name: 3, dtype: float64

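I cannot get this to work with the .apply() method, i.e. df.apply(clean,axis=1)

I should mention that this is a toy example. The custom function I will write for the real case fills values in a more dynamic way, so I am not looking for basic utilities such as .ffill or .fillna.

【Comments】:

apply does not work in place; it returns a new DataFrame. Are you saving that new DataFrame?
@Amen No, the completely filled rows are the problem.

Tags: python pandas numpy dataframe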

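【Solution 1】:

apply fails because, when a row is completely filled, your clean function has no position to start indexing from: the array of null positions for that Series is empty.

So guard the modification with a condition before altering the Series, i.e.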

def clean(series):
    # Work on a copy so the caller's row is left untouched
    series = series.copy()
    # For a completely filled row, series.index[pd.isnull(series)] is an
    # empty index, Int64Index([], dtype='int64'), and np.min raises on it,
    # so alter the series only if it contains a null value
    if pd.isnull(series).any():
        start = np.min(list(series.index[pd.isnull(series)]))
        series[start:] = series[start-1]
    return series

df.apply(clean, axis=1)

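Output:

     0    1    2
0  0.0  2.0  5.0
1  2.0  4.0  4.0
2  7.0 -5.0  4.0
3  1.0  1.0  1.0

I hope this clarifies why apply did not work. I would also suggest considering built-in functions for cleaning the data rather than writing one from scratch.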
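
For a custom row function that needs to do more than ffill, one possible sketch (not the code from this answer) works with positions through iloc instead of labels, so it does not rely on the columns being labelled 0, 1, 2. The name fill_after_first_null is hypothetical, and leaving a row unchanged when its very first value is missing is an assumption, since the question does not say what should happen in that case.

```python
import numpy as np
import pandas as pd

def fill_after_first_null(row):
    """Hypothetical helper: overwrite everything from the first null onward
    with the value just before it."""
    row = row.copy()
    null_pos = np.flatnonzero(row.isna().to_numpy())
    # Nothing to do for a fully populated row, or (assumption) when the
    # very first value is missing and there is nothing to copy from.
    if null_pos.size == 0 or null_pos[0] == 0:
        return row
    first = null_pos[0]
    row.iloc[first:] = row.iloc[first - 1]
    return row

df = pd.DataFrame([(0, 2, 5), (2, 4, None), (7, -5, 4), (1, None, None)])
result = df.apply(fill_after_first_null, axis=1)
```

Because the function returns early for rows without nulls, apply never reaches the indexing step with an empty set of positions.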

【Comments】:

【Solution 2】:

First, here is code that solves your toy problem, although it is not what you are looking for.

df.ffill(axis=1)

Next, I tested your code.

df.apply(clean,axis=1)
#...start = np.min(list(series.index[pd.isnull(series)]))...
#=>ValueError: ('zero-size array to reduction operation minimum 
#                which has no identity', 'occurred at index 0')

To see what is happening, test with a lambda function that lists the null positions in each row.

df.apply(lambda series:list(series.index[pd.isnull(series)]),axis=1)
0        []
1       [2]
2        []
3    [1, 2]
dtype: object

Rows 0 and 2 yield an empty list, and the following expression raises the same ValueError:

import numpy as np
np.min([])

In short, DataFrame.apply() works fine; it is the clean function that fails.
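
As a small illustration of that diagnosis, np.min only raises on an empty input when it has no fallback value; its initial argument provides one. The helper name first_null_position is hypothetical, and using len(row) as the fallback is an assumption about what "no null in this row" should map to.

```python
import numpy as np
import pandas as pd

def first_null_position(row):
    """Hypothetical helper: position of the first null, or len(row) if none."""
    null_pos = np.flatnonzero(pd.isnull(row).to_numpy())
    # `initial` is the value returned for an empty array, which avoids
    # the "zero-size array to reduction operation" ValueError.
    return int(np.min(null_pos, initial=len(row)))

df = pd.DataFrame([(0, 2, 5), (2, 4, None), (7, -5, 4), (1, None, None)])
starts = df.apply(first_null_position, axis=1)
```

With this fallback, a slice such as row.iloc[start:] is simply empty for a fully populated row, so no separate check is needed before slicing.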

【Comments】:

【Solution 3】:

Could you use something like fillna with backfill? If backfilling fits your case, I think it would be more efficient.

i.e.

df.fillna(method='backfill')

However, this assumes the empty cells hold np.nan?

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html
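
To check both points on the toy data, here is a small sketch. It suggests that pandas stores the None entries in these numeric columns as NaN, and that a default (column-wise) backfill does not produce the row-wise result the question asks for, whereas ffill(axis=1) does. It uses DataFrame.bfill() in place of fillna(method='backfill'), because the method argument is deprecated in recent pandas versions; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame([(0, 2, 5), (2, 4, None), (7, -5, 4), (1, None, None)])

# None in a numeric column is stored as NaN, so isna() detects it.
n_missing = int(df.isna().sum().sum())

# By default backfill runs down each column and pulls the value from the
# row below, so gaps in the last row have nothing to copy from.
backfilled = df.bfill()

# Forward fill along each row copies the last available value to the right.
forward_filled = df.ffill(axis=1)
```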

【Comments】: