Python "Most Recent Value Backfill" Is Slow: Answers

【Question Title】: Python "Most Recent Value Backfill" Is Slow
【Posted】: 2017-06-20 14:24:34
【Question】:

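I currently have an R-based algorithm that sorts a data.table by date and then finds the most recent non-NA / non-null value. I have had some success using the following StackOverflow question to implement a backfill algorithm for some fairly large datasets: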

Computing the first non-missing value from each column in a DataFrame

I have implemented a solution in both Python and R, but my Python solution seems to run much, much slower.

library(data.table)
library(microbenchmark)

test_values <- rnorm(100000)
test_values[sample(1:length(test_values), size = 10000)] <- NA

test_values_2 <- rnorm(100000)
test_values_2[sample(1:length(test_values), size = 10000)] <- NA

test_ids <- rpois(100000, lambda = 100)
random_timestamp <- sample(x = seq(as.Date('2000-01-01'), as.Date('2017-01-01'), by = 1), size = 100000, replace = TRUE)
dt <- data.table(
    'id' = test_ids,
    'date' = random_timestamp,
    'v1' = test_values,
    'v2' = test_values_2
)


# Simple functions for backfilling
backfillFunction <- function(vector) {
    # find the vector class
    colClass <- class(vector)
    if (all(is.na(vector))) {
        # return the NA of the same class as the vector
        NA_val <- NA
        class(NA_val) <- colClass
        return(NA_val)
    } else {
        # return the first non-NA value
        return(vector[min(which(!is.na(vector)))])
    }
}

print(microbenchmark(
    dt[order(-random_timestamp), lapply(.SD, backfillFunction), by = 'id', .SDcols = c('v1', 'v2')]
))

Unit: milliseconds
                                                                                                              expr      min       lq
 dt[order(-random_timestamp), c(lapply(.SD, backfillFunction),      list(.N)), by = "id", .SDcols = c("v1", "v2")] 9.976708 12.29137
    mean   median       uq      max neval
 15.4554 14.47858 16.75997 112.9467   100

And the Python solution:

import timeit

setup_statement = """
import numpy as np
import pandas as pd
import datetime

start_date = datetime.datetime(2000, 1, 1)
end_date = datetime.datetime(2017, 1, 1)
step = datetime.timedelta(days=1)
current_date = start_date

dates = []
while current_date < end_date:
    dates.append(current_date)
    current_date += step

date_vect = np.random.choice(dates, size=100000, replace=True)
test_values = np.random.normal(size=100000)
test_values_2 = np.random.normal(size=100000)
na_loc = [np.random.randint(0, 100000, size=10000)]
na_loc_2 = [np.random.randint(0, 100000, size=10000)]
id_vector = np.random.poisson(100, size=100000)

for i in na_loc:
    test_values[i] = None

for i in na_loc_2:
    test_values_2[i] = None


DT = pd.DataFrame(
    data={
        'id': id_vector,
        'date': date_vect,
        'v1': test_values,
        'v2': test_values_2
    }
)
GT = DT.sort_values(['id', 'date'], ascending=[1, 0]).groupby('id')
"""

print(timeit.timeit('{col: GT[col].apply(lambda series: series[series.first_valid_index()] if series.first_valid_index() else None) for col in DT.columns}', number=100, setup=setup_statement)*1000/100)


66.5085821699904

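My average time in Python is 67 milliseconds, but only 15 milliseconds in R, even though the approaches look fairly similar (applying a function to each column within each group). Why is my R code so much faster than my Python code, and how can I achieve similar performance in Python?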

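【Comments】:

Profile your code.
Can you avoid calling series.first_valid_index() twice?
@Roland Noted on the profiling. I did profile the Python code, but I'm not sure how to speed up the slow parts.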
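
One way to act on the second comment is to call first_valid_index() once per group and compare the result with None explicitly. The sketch below uses a hypothetical toy frame (`demo`) and a hypothetical helper name (`first_valid_value`); neither comes from the original post. The explicit `is not None` check also matters for correctness, because a truthiness test treats a valid index label of 0 as if the value were missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 'demo' and 'first_valid_value' are illustrative names.
demo = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'v1': [5.0, np.nan, np.nan, np.nan],
})

def first_valid_value(series):
    # Look up the first non-missing label once instead of twice.
    idx = series.first_valid_index()
    # Compare with None explicitly: the label 0 is valid but falsy.
    return series.loc[idx] if idx is not None else np.nan

result = demo.groupby('id')['v1'].apply(first_valid_value)
print(result)
```

Here group 1 has its first valid value at index label 0, which is exactly the case a plain `if series.first_valid_index()` check would mishandle.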

Tags: python r pandas data.table

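【Solution 1】:

Edited to add another answer that may be clearer. Define a function that returns the first non-missing value, or NaN if all the values are missing.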

def find_first(s):
    s = s.dropna()
    if len(s) == 0:
        return np.nan
    return s.iloc[0]

GT = DT.sort_values(['id', 'date'], ascending=[True, False])
GT.groupby(['id']).agg(find_first).reset_index()

This also does the job:

GT.set_index('id').stack().groupby(level=[0,1]).first().unstack()
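
For comparison, pandas' built-in GroupBy.first() returns the first non-null entry of each column within each group, so a custom function may not be needed at all. The following is a minimal sketch on a hypothetical toy frame (`toy`), not the asker's data. Whether it is faster on the real data would need to be measured.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for DT.
toy = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'date': pd.to_datetime(['2016-01-03', '2016-01-02', '2016-01-01',
                            '2015-06-02', '2015-06-01']),
    'v1': [np.nan, 0.5, 0.7, np.nan, np.nan],
    'v2': [1.5, np.nan, 2.5, np.nan, 3.5],
})

def find_first(s):
    # Same logic as the answer's function: first non-missing value, else NaN.
    s = s.dropna()
    return s.iloc[0] if len(s) else np.nan

# Most recent date first within each id, as in the answer.
ordered = toy.sort_values(['id', 'date'], ascending=[True, False])
grouped = ordered.groupby('id')[['v1', 'v2']]

via_agg = grouped.agg(find_first)   # Python-level function call per group and column
via_first = grouped.first()         # built-in aggregation that skips missing values

print(via_first)
```

Both results agree on this toy input. The built-in version avoids calling a Python function for every group and column, which is the usual reason to prefer it.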

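Original answer

Stacking the values automatically drops the missing values and puts everything in a single column. You can then just take the first row of each group. There are a lot of steps here, but most of them are just reshaping to make the result look right.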

DT.sort_values(['id', 'date'], ascending=[True, False])\
  .set_index(['date', 'id'])\
  .stack()\
  .reset_index()\
  .groupby(['id', 'level_2'])\
  .first()\
  .set_index('date', append=True)\
  .squeeze()\
  .unstack('level_2')\
  .reset_index()\
  .rename_axis(None, axis='columns')
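
To see why the reshaping works, here is a small sketch on a hypothetical two-column frame (`wide`). It drops the missing entries with an explicit dropna() call so that the result does not depend on how a particular pandas version's stack() treats missing values by default. That explicit call is an addition for this sketch, not part of the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, already indexed by id.
wide = pd.DataFrame(
    {'v1': [np.nan, 0.5, np.nan], 'v2': [1.5, np.nan, 3.5]},
    index=pd.Index([1, 1, 2], name='id'),
)

# stack() moves the column labels into an inner index level, giving one long Series.
long = wide.stack().dropna()

# Group by (id, original column name) and take the first remaining value.
firsts = long.groupby(level=[0, 1]).first()

# unstack() turns the column names back into columns.
result = firsts.unstack()
print(result)
```

Combinations with no surviving value, such as v1 for id 2 here, come back as NaN after unstack().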

Output

     id       date        v1        v2
0     53 2015-08-29       NaN  1.700798
1     59 2000-04-25 -0.560505  0.371487
2     60 2011-01-07       NaN  0.627205
3     61 2001-03-13       NaN  0.245077
4     61 2011-01-11  0.992256       NaN
5     62 2005-04-14 -0.541771 -1.559377
6     63 2016-03-25  0.338544  0.176700
7     64 2016-07-12 -0.297969 -0.977407
8     65 2009-04-24       NaN -0.429607
9     65 2009-05-04  1.829951       NaN

Bonus: you can greatly simplify the construction of the data frame like this:

import numpy as np
import pandas as pd

# Build the full range of daily dates in one call instead of a while loop
dates = pd.date_range('2000-1-1', '2017-1-1')

date_vect = np.random.choice(dates, size=100000, replace=True)
test_values = np.random.normal(size=100000)
test_values_2 = np.random.normal(size=100000)
na_loc = np.random.randint(0, 100000, size=10000)
na_loc_2 = np.random.randint(0, 100000, size=10000)
id_vector = np.random.poisson(100, size=100000)

# Index directly with the position arrays; no Python loop is needed
test_values[na_loc] = np.nan
test_values_2[na_loc_2] = np.nan
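
One detail worth checking when comparing against the R setup: R's sample() draws without replacement by default, whereas np.random.randint can return the same position more than once, so the Python arrays may end up with slightly fewer than 10000 missing values. Below is a hedged sketch of sampling distinct positions with NumPy's Generator.choice and replace=False. The seed and the names `n` and `n_missing` are illustrative, not from the original post.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable

n, n_missing = 100000, 10000
test_values = rng.normal(size=n)

# Draw distinct positions so exactly n_missing entries become NaN.
na_loc = rng.choice(n, size=n_missing, replace=False)
test_values[na_loc] = np.nan

print(np.isnan(test_values).sum())
```

This is unlikely to explain the timing gap, but it keeps the two benchmarks working on comparable inputs.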

【Comments】:

Accepted. Thanks a lot!