
【Question title】: What is the optimal way to assign a value to a pandas DataFrame column from a column in a different row?
【Posted】: 2021-02-24 13:20:44
【Question】:

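I need to iterate over a DataFrame indexed by UNIX timestamp and, in one column, assign a value taken from another column in a different row, located at a specific index time in the future. This is what I'm currently doing: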

import pandas as pd

df = pd.DataFrame([
    [1523937600, 100.0, 0.0], 
    [1523937660, 120.0, 0.0], 
    [1523937720, 110.0, 0.0],
    [1523937780, 90.0, 0.0],
    [1523937840, 99.0, 0.0]], 
    columns=['time', 'value', 'target'])
df.set_index('time', inplace=True)

skip = 2  # mins skip-ahead
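for i in range(len(df)):
    t = df.index[i] + (60 * skip)  # timestamp `skip` minutes ahead
    try:
        # .at writes into df itself; df.iloc[i].target = ... may assign to a temporary copy
        df.at[df.index[i], 'target'] = df.at[t, 'value']
    except KeyError:
        # no row exists at the computed time
        df.at[df.index[i], 'target'] = 0.0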

Output:

            value  target
time                     
1523937600  100.0   110.0
1523937660  120.0    90.0
1523937720  110.0    99.0
1523937780   90.0     0.0
1523937840   99.0     0.0

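This works, but I am dealing with datasets containing millions of rows, and it takes a very long time. Is there a more optimal way to do this?

EDIT: Added example input/output. Note that it's important that I take the value from the row with the computed index time, rather than just looking ahead n rows, because there may be gaps in the times, or extra times in between.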
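
To illustrate the distinction made in the edit, here is a minimal sketch. The data is hypothetical: it is the question's sample with the 1523937720 row removed to create a gap, which is an assumption added for illustration and not part of the original post.

```python
import pandas as pd

# Hypothetical data: the question's sample with the 1523937720 row missing.
df = pd.DataFrame(
    {"value": [100.0, 120.0, 90.0, 99.0]},
    index=pd.Index([1523937600, 1523937660, 1523937780, 1523937840], name="time"),
)
skip = 2  # minutes to look ahead

# Positional look-ahead: the value `skip` rows later, regardless of timestamps.
by_position = df["value"].shift(-skip, fill_value=0.0).tolist()

# Time-based look-ahead: the value at index + skip minutes, 0.0 if no such row.
by_time = [df["value"].get(t + 60 * skip, 0.0) for t in df.index]

print(by_position)
print(by_time)
```

With a gap in the index the two lists differ for the first two rows, which is why a plain `shift` does not meet the requirement.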

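【Comments】:

Please provide sample input and expected output to make a minimal reproducible example, so that we can better understand your problem. See How to make good pandas examples.
@G.Anderson Added example input/output, thanks.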

Tags: python pandas

【Solution 1】:

In this case, you should keep time as both a column and the index. Hope this helps:

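import pandas as pd

df = pd.DataFrame([
    [1523937600, 100.0, 0.0],
    [1523937660, 120.0, 0.0],
    [1523937720, 110.0, 0.0],
    [1523937780, 90.0, 0.0],
    [1523937840, 99.0, 0.0]],
    columns=['time', 'value', 'target'])
df.index = df['time']  # keep 'time' as both a column and the index

skip = 2  # minutes to look ahead, as in the question
# For each time, take the value at time + skip minutes, or 0.0 if no such row exists
df['target'] = df['time'].apply(
    lambda x: df.loc[x + (skip * 60)].value if x + (skip * 60) in df.index.values else 0.0)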

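【Comments】:

On a 100k-row sample, this was 5x slower for me (~60s vs ~12s). I'm new to pandas, but looking at the code I wouldn't expect this to be faster: for each row you do two value searches in the slice to find the index (2 calls to the idx function), whereas the iterative code does a direct lookup.
The previous code was much faster for me. I think the number of rows affects it negatively. I've updated the code, but in this case you need to keep time as both a column and the index. If you try it, let me know how it performs.
I got pulled onto another project and am only now revisiting this. Your update is indeed an improvement over your previous code and brings the 100k sample down to ~15s, so a substantial saving, but it is still no faster than the iterative code I have. I appreciate the effort, though.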
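
For reference, one vectorized alternative that was not proposed in the thread is to compute all the look-ahead timestamps at once and fetch them with `Series.reindex`. This is only a sketch under stated assumptions: the helper name `lookahead_values` is hypothetical, the index is assumed to hold unique integer timestamps (`reindex` raises on duplicate labels), and it has not been benchmarked against the timings quoted above.

```python
import pandas as pd

df = pd.DataFrame(
    [[1523937600, 100.0],
     [1523937660, 120.0],
     [1523937720, 110.0],
     [1523937780, 90.0],
     [1523937840, 99.0]],
    columns=["time", "value"],
).set_index("time")

skip = 2  # minutes to look ahead


def lookahead_values(frame, skip_minutes, fill=0.0):
    """Hypothetical helper: value found at index + skip_minutes, else `fill`."""
    # Assumes a unique integer index of UNIX timestamps in seconds.
    future_times = frame.index + 60 * skip_minutes
    # reindex looks up every future timestamp in one call; missing times get `fill`.
    return frame["value"].reindex(future_times, fill_value=fill).to_numpy()


df["target"] = lookahead_values(df, skip)
print(df)
```

On the question's sample data this yields the same `target` column as the loop. Converting to a NumPy array before assignment matters, because the reindexed Series is labelled by the future timestamps and would otherwise be aligned on those labels rather than on the original rows.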