Forward fill (ffill) based on group and previous row in pandas: answers

[Question title]: forward fill (ffill) based on group and previous row pandas
[Posted]: 2019-06-20 00:45:16
[Question description]:

I have a large DataFrame (400,000+ rows) that looks like this:

import numpy as np
import pandas as pd

# A plain list of rows keeps np.nan as a real missing value; wrapping the rows
# in np.array would coerce every cell, including np.nan, to a string.
data = [
    [1949, '01/01/2018', np.nan, 17,     '30/11/2017'],
    [1949, '01/01/2018', np.nan, 19,      np.nan],
    [1811, '01/01/2018',     16, np.nan, '31/11/2017'],
    [1949, '01/01/2018',     15, 21,     '01/12/2017'],
    [1949, '01/01/2018', np.nan, 20,      np.nan],
    [3212, '01/01/2018',     21, 17,     '31/11/2017'],
]
columns = ['id', 'ReceivedDate', 'PropertyType', 'MeterType', 'VisitDate']
df = pd.DataFrame(data, columns=columns)

Resulting df:

     id     ReceivedDate    PropertyType    MeterType   VisitDate
0   1949    01/01/2018       NaN              17       30/11/2017
1   1949    01/01/2018       NaN              19       NaN
2   1811    01/01/2018       16              NaN       31/11/2017
3   1949    01/01/2018       15               21       01/12/2017
4   1949    01/01/2018       NaN              20       NaN
5   3212    01/01/2018       21               17       31/11/2017

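I want to forward fill based on a groupby (id and ReceivedDate), but only when the rows are consecutive in the index (i.e. only forward fill index positions 1 and 4).

I was thinking of adding a column that says whether a row should be filled according to this criterion, but how do I check the row above?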
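
One way to check the row above is to compare the key columns with a `shift()`ed copy of themselves. The following is only a minimal sketch: it assumes the rows are already in the order that matters, uses a cut-down version of the sample data, and the name `same_as_prev` is a hypothetical one chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1949, 1949, 1811, 1949, 1949, 3212],
    'ReceivedDate': ['01/01/2018'] * 6,
    'VisitDate': ['30/11/2017', np.nan, '31/11/2017',
                  '01/12/2017', np.nan, '31/11/2017'],
})

keys = df[['id', 'ReceivedDate']]
# True where every key column equals the value in the row directly above.
# The first row has no row above it, so it compares against NaN and is False.
same_as_prev = keys.eq(keys.shift()).all(axis=1)

# Rows that are candidates for a forward fill: same keys as the row above
# and a missing VisitDate.
to_fill = same_as_prev & df['VisitDate'].isna()
print(df.index[to_fill].tolist())
```

On this sample the flag selects index positions 1 and 4, the two rows the question wants filled.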

(I plan to use the solution from this answer: pandas fill forward performance issue

df.isnull().astype(int)).groupby(level=0).cumsum().applymap(lambda x: None if x == 0 else 1)

because x = df.groupby(['id','ReceivedDate']).ffill() is slow.)

Desired df:

     id     ReceivedDate    PropertyType    MeterType   VisitDate
0   1949    01/01/2018       NaN              17       30/11/2017
1   1949    01/01/2018       NaN              19       30/11/2017
2   1811    01/01/2018       16              NaN       31/11/2017
3   1949    01/01/2018       15               21       01/12/2017
4   1949    01/01/2018       15               20       01/12/2017
5   3212    01/01/2018       21               17       31/11/2017

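[Question comments]:

df.groupby(['id', 'ReceivedDate']).ffill(limit=1)?
Sometimes I can have 2 such rows in a row, and I am trying to avoid df.groupby.ffill because it takes about 1 second per 1,000 rows (too slow).
But since you are limiting the number of forward fills, might it get faster?
Unfortunately not by enough. Tested on only 10,000 rows: ffill() = 11.2 s, ffill(limit=1) = 11.1 s.

Tags: python pandas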

[Solution 1]:

`groupby` and `ffill` with `limit=1`

df.groupby(['id', 'ReceivedDate']).ffill(limit=1)

     id ReceivedDate PropertyType MeterType   VisitDate
0  1949   01/01/2018          NaN        17  30/11/2017
1  1949   01/01/2018          NaN        19  30/11/2017
2  1811   01/01/2018           16        18  31/11/2017
3  1949   01/01/2018           15        21  01/12/2017
4  1949   01/01/2018           15        20  01/12/2017
5  3212   01/01/2018           21        17  31/11/2017

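`groupby` with `mask`ing and `shift`

Try filling the NaNs with groupby, mask and shift. Here `j` labels each run of adjacent rows that share the same id and ReceivedDate:

i = df[['id', 'ReceivedDate']]
j = i.ne(i.shift().values).any(axis=1).cumsum()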

df.mask(df.isnull().astype(int).groupby(j).cumsum().eq(1), df.groupby(j).shift())

Or,

df.where(df.isnull().astype(int).groupby(j).cumsum().ne(1), df.groupby(j).shift())

     id ReceivedDate PropertyType MeterType   VisitDate
0  1949   01/01/2018          NaN        17  30/11/2017
1  1949   01/01/2018          NaN        19  30/11/2017
2  1811   01/01/2018           16        18  31/11/2017
3  1949   01/01/2018           15        21  01/12/2017
4  1949   01/01/2018           15        20  01/12/2017
5  3212   01/01/2018           21        17  31/11/2017
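
The run label `j` used above can also be combined with a plain `ffill`, so that values never cross from one run of adjacent rows into the next. This is only a sketch and not part of the original answer: the names `run_id`, `value_cols` and `result` are chosen here for illustration, the data is the sample from the question, and no timing on a large frame has been measured.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        [1949, '01/01/2018', np.nan, 17, '30/11/2017'],
        [1949, '01/01/2018', np.nan, 19, np.nan],
        [1811, '01/01/2018', 16, np.nan, '31/11/2017'],
        [1949, '01/01/2018', 15, 21, '01/12/2017'],
        [1949, '01/01/2018', np.nan, 20, np.nan],
        [3212, '01/01/2018', 21, 17, '31/11/2017'],
    ],
    columns=['id', 'ReceivedDate', 'PropertyType', 'MeterType', 'VisitDate'],
)

keys = df[['id', 'ReceivedDate']]
# A new run starts wherever any key differs from the row directly above.
run_id = keys.ne(keys.shift()).any(axis=1).cumsum()

value_cols = ['PropertyType', 'MeterType', 'VisitDate']
result = df.copy()
# Forward fill only inside each run of adjacent rows with identical keys.
result[value_cols] = df.groupby(run_id)[value_cols].ffill()
print(result)
```

Because row 2 (id 1811) forms a run of its own, its missing MeterType is left untouched, while rows 1 and 4 pick up values from the rows directly above them.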

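[Comments]:

It forward-fills even at row 2, where there is a NaN (which I don't want). I'll update the question to show this edge case.
@AH Added an explanation for you. I think this should work, but I'm not 100% sure about performance.
So the code doesn't check whether the row above has the same id and ReceivedDate.
I did, and it gave me some ideas (it doesn't fully answer my question, but that's because I didn't ask it well enough the first time; sorry about that). Thanks.
@AH So, is the answer wrong? Can you help me understand why it doesn't work? If it's not useful, I'd rather delete it. Also, how useful is it in terms of timing?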

[Solution 2]:

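cols_to_ffill = ['PropertyType', 'VisitDate']
i = df.copy()  # assumes the default RangeIndex (0, 1, 2, ...)

newdata = pd.DataFrame(['placeholder'])  # non-empty so the loop runs at least once

while not newdata.index.empty:
    # Keys of the row directly above each row.
    RowAboveid = i.id.shift()
    RowAboveRD = i.ReceivedDate.shift()
    # Rows where every column to fill is still empty.
    rows_with_cols_to_ffill_all_empty = i.loc[:, cols_to_ffill].isnull().all(axis=1)
    # The row above must hold something to copy; otherwise the loop never ends.
    row_above_has_data = ~rows_with_cols_to_ffill_all_empty.shift(fill_value=True)
    rows_to_ffill = ((i.ReceivedDate == RowAboveRD) & (i.id == RowAboveid)
                     & rows_with_cols_to_ffill_all_empty & row_above_has_data)
    rows_used_to_fill = i[rows_to_ffill].index - 1

    # Copy the values from the row above, re-labelled to the target rows.
    newdata = i.loc[rows_used_to_fill, cols_to_ffill]
    newdata.index += 1
    i.loc[rows_to_ffill, cols_to_ffill] = newdata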

Keep looping until there are no more matches (i.e. all the columns have been forward filled).
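
For comparison, the same "fill only from adjacent rows with the same keys" rule can be written without a Python loop or a grouped fill. The sketch below is not the answerer's method: it forward fills the whole frame once and then blanks any cell whose value came from a different run of rows. The names `run_id`, `src_run` and `same_run` are hypothetical, and whether this is faster on 400,000+ rows has not been measured here. Unlike the loop, it fills each column independently rather than only rows where all chosen columns are empty.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        [1949, '01/01/2018', np.nan, 17, '30/11/2017'],
        [1949, '01/01/2018', np.nan, 19, np.nan],
        [1811, '01/01/2018', 16, np.nan, '31/11/2017'],
        [1949, '01/01/2018', 15, 21, '01/12/2017'],
        [1949, '01/01/2018', np.nan, 20, np.nan],
        [3212, '01/01/2018', 21, 17, '31/11/2017'],
    ],
    columns=['id', 'ReceivedDate', 'PropertyType', 'MeterType', 'VisitDate'],
)
value_cols = ['PropertyType', 'MeterType', 'VisitDate']

keys = df[['id', 'ReceivedDate']]
# Label each run of adjacent rows that share the same keys.
run_id = keys.ne(keys.shift()).any(axis=1).cumsum()

notna = df[value_cols].notna()
# For each cell, the run label of the nearest non-null cell at or above it.
src_run = notna.astype(int).mul(run_id, axis=0).where(notna).ffill()
# A filled value is acceptable only if it originated in the same run.
same_run = src_run.eq(run_id, axis=0)

result = df.copy()
result[value_cols] = df[value_cols].ffill().where(same_run)
print(result)
```

Cells that were already populated always pass the `same_run` check, so only the newly filled cells can be blanked again.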
