Pandas: slicing incompatible with numpy's

【Question title】: Pandas: slicing incompatible with numpy's
【Posted】: 2018-09-22 21:21:02
【Question】:

I have come across a behavior in pandas that I cannot explain to myself.

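I am working with an audio-feature database that has N+2 columns: an ID, the time t, and N audio features measured at time t. For various reasons, I also want each row to contain the features of the next T time steps. (Yes, the same data will be repeated up to T times.) So I wrote a function that creates additional feature columns holding the data from successive time steps. As you can see in the code below, I implemented it in three ways, one of which does not work. That surprises me, because it would work if the underlying data structure were a numpy array. Can anyone explain why?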

def create_datapoints_for_dnn(df, T):
    """
    Here we take the data frame with chroma features at time t and create all features at times t+1, t+2, ..., t+T-1.

    :param df: initial data frame of chroma features
    :param T: number of time steps to keep
    :return: expanded data frame of chroma features
    """
    res = df.copy()
    original_labels = df.columns.values
    n_steps = df.shape[0]  # the number of time steps in this song
    nans = np.full(n_steps, np.nan)  # a column of NaNs of the correct length (np.NaN was removed in NumPy 2.0)
    for n in range(1, T):
        new_labels = [ol + '+' + str(n) for ol in original_labels[2:]]
        for nl, ol in zip(new_labels, original_labels[2:]):
            # df.assign would use the name "nl" instead of what nl contains, so we build and unpack a dictionary
            res = res.assign(**{nl: nans})  # create a new column

            # CORRECT BUT EXTREMELY SLOW
            # for i in range(n_steps - (T - 1)):
            #     res.iloc[i, res.columns.get_loc(nl)] = df.iloc[n+i, df.columns.get_loc(ol)]

            # CORRECT AND FAST
            res.iloc[:-n, res.columns.get_loc(nl)] = df.iloc[:, df.columns.get_loc(ol)].shift(-n)

            # NOT WORKING
            # res.iloc[:-n, res.columns.get_loc(nl)] = df.iloc[n:, df.columns.get_loc(ol)]

    return res[: - (T - 1)]  # drop the last T-1 rows because time t+T-1 is not defined for them

Sample data (save as a CSV file):

songID,time,A_t,A#_t
CrossEra-0850,0.0,0.0,0.0
CrossEra-0850,0.1,0.0,0.0
CrossEra-0850,0.2,0.0,0.0
CrossEra-0850,0.3,0.31621,0.760299
CrossEra-0850,0.4,0.0,0.00107539
CrossEra-0850,0.5,0.0,0.142832
CrossEra-0850,0.6,0.8506459999999999,0.12481600000000001
CrossEra-0850,0.7,0.0,0.21206399999999997
CrossEra-0850,0.8,0.0796207,0.28227399999999997
CrossEra-0850,0.9,2.55144,0.169434
CrossEra-0850,1.0,3.4581699999999995,0.08014550000000001
CrossEra-0850,1.1,3.1061400000000003,0.030419599999999998

Code to run:

import pandas as pd
import numpy as np

T = 4  # how many successive steps we want to put in a single row
df = pd.read_csv('path_to_csv')
res = create_datapoints_for_dnn(df, T)
res.to_csv('path_to_output', index=False)

【Comments】:

Tags: python pandas numpy slice

【Solution 1】:

Use pd.DataFrame.shift and pd.concat.
The f-string requires Python 3.6 or later. On older versions, use '+{}'.format(i) instead.

cols = ['songID', 'time']
d = df.drop(columns=cols)  # feature columns only; the positional axis argument was removed in pandas 2.0
df[cols].join(
    pd.concat(
        [d.shift(-i).add_suffix(f'+{i}') for i in range(4)],
        axis=1
    )
)

           songID  time     A_t+0    A#_t+0     A_t+1    A#_t+1     A_t+2    A#_t+2     A_t+3    A#_t+3
0   CrossEra-0850   0.0  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.316210  0.760299
1   CrossEra-0850   0.1  0.000000  0.000000  0.000000  0.000000  0.316210  0.760299  0.000000  0.001075
2   CrossEra-0850   0.2  0.000000  0.000000  0.316210  0.760299  0.000000  0.001075  0.000000  0.142832
3   CrossEra-0850   0.3  0.316210  0.760299  0.000000  0.001075  0.000000  0.142832  0.850646  0.124816
4   CrossEra-0850   0.4  0.000000  0.001075  0.000000  0.142832  0.850646  0.124816  0.000000  0.212064
5   CrossEra-0850   0.5  0.000000  0.142832  0.850646  0.124816  0.000000  0.212064  0.079621  0.282274
6   CrossEra-0850   0.6  0.850646  0.124816  0.000000  0.212064  0.079621  0.282274  2.551440  0.169434
7   CrossEra-0850   0.7  0.000000  0.212064  0.079621  0.282274  2.551440  0.169434  3.458170  0.080146
8   CrossEra-0850   0.8  0.079621  0.282274  2.551440  0.169434  3.458170  0.080146  3.106140  0.030420
9   CrossEra-0850   0.9  2.551440  0.169434  3.458170  0.080146  3.106140  0.030420       NaN       NaN
10  CrossEra-0850   1.0  3.458170  0.080146  3.106140  0.030420       NaN       NaN       NaN       NaN
11  CrossEra-0850   1.1  3.106140  0.030420       NaN       NaN       NaN       NaN       NaN       NaN
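The snippet above hard-codes four steps and keeps the trailing rows that contain NaN. As a rough sketch only, the same shift-and-concat idea can be wrapped in a function that takes T and drops the last T-1 rows, the way the function in the question does. The name add_future_steps, the id_cols parameter and the toy data are made up for illustration, and the sketch assumes the frame has a unique index.

```python
import pandas as pd


def add_future_steps(df, T, id_cols=("songID", "time")):
    """Hypothetical helper: append the features of the next T-1 time steps to each row."""
    id_cols = list(id_cols)
    features = df.drop(columns=id_cols)
    # One shifted copy per look-ahead step, with the step number as a suffix.
    shifted = [features.shift(-i).add_suffix(f"+{i}") for i in range(T)]
    out = df[id_cols].join(pd.concat(shifted, axis=1))
    # The last T-1 rows have no complete look-ahead window, so drop them.
    return out.iloc[: len(out) - (T - 1)]


# Toy data, invented for this example.
demo = pd.DataFrame({
    "songID": ["s"] * 5,
    "time": [0.0, 0.1, 0.2, 0.3, 0.4],
    "A_t": [1.0, 2.0, 3.0, 4.0, 5.0],
})
expanded = add_future_steps(demo, T=3)
print(expanded)
```

Unlike the function in the question, this version renames the original feature columns with a "+0" suffix, just as the answer's snippet does.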

【Discussion】:

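Thanks for the answer, but I already have a similar solution implemented. The question is why the res.iloc[:-n, res.columns.get_loc(nl)] = df.iloc[n:, df.columns.get_loc(ol)] approach does not work.
Because pandas aligns values on their index labels when assigning, and you have not done anything to change those indices. Use res.iloc[:-n, res.columns.get_loc(nl)] = df.iloc[n:, df.columns.get_loc(ol)].values
If I understand correctly, df.iloc[n:] returns a df that keeps its original index labels, from n onward. This differs from numpy, where for example a = np.arange(10)[2:10][0] is a convoluted but perfectly valid way of writing a = 2. If numpy handled arrays the way pandas handles data frames, that line would give None, because the array np.arange(10)[2:10] would only contain elements with indices between 2 and 9 (inclusive). Writing .values returns a plain numpy array with no index, so the assignment becomes positional, which fixes the problem. Thank you. If you copy your comment into your answer, I will accept it.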
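To make the point of this exchange concrete, here is a small sketch on a made-up five-row column. It uses plain column assignment and .loc, which align on index labels. Whether assignment through .iloc aligns in the same way may differ between pandas versions, so treat that part as something to check against your own installation.

```python
import numpy as np
import pandas as pd

# Toy column standing in for one feature (values invented for illustration).
df = pd.DataFrame({"A": [10.0, 11.0, 12.0, 13.0, 14.0]})
n = 2

# numpy: a slice is renumbered from 0.
arr = df["A"].to_numpy()[n:]
first_np = arr[0]  # element that sat at position n in the full array

# pandas: a slice keeps its original labels (2, 3, 4 here).
sliced = df["A"].iloc[n:]
labels = sliced.index.tolist()

# Assigning the Series aligns on labels: rows 0 and 1 find no match and
# become NaN, rows 2..4 receive their own, unshifted values.
aligned = pd.DataFrame(index=df.index)
aligned["A+2"] = sliced

# Assigning the bare array is positional: value k goes to the k-th target row.
positional = df.copy()
positional["A+2"] = np.nan
positional.loc[positional.index[:-n], "A+2"] = sliced.to_numpy()

print(aligned["A+2"].tolist())
print(positional["A+2"].tolist())
```

The positional column matches df["A"].shift(-n), which is why the shift-based line in the question gives the intended result. Series.to_numpy() is used here in place of .values; for a plain float column the two return the same array.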