Time series data preprocessing - numpy strides trick to save memory: answers

【Question title】: Time series data preprocessing - numpy strides trick to save memory
【Posted】: 2019-02-08 11:52:31
【Question description】:

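I am preprocessing a time series dataset, changing its shape from 2 dimensions (datapoints, features) to 3 dimensions (datapoints, time_window, features).

In this context, the time window (sometimes also called look back) is the number of previous time steps/data points that serve as input variables for predicting the next time period. In other words, the time window is how much past data the machine learning algorithm takes into account for a single future prediction.

The problem with this approach (or at least with my implementation) is that it is very inefficient in terms of memory usage, because it introduces data redundancy across the windows, which makes the input data very heavy.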
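To put a rough number on that redundancy: each input row ends up in up to window_size windows, so a fully copied 3-D array takes roughly window_size times the memory of the 2-D input. The sketch below only does the arithmetic; the sizes are made-up illustration values, not figures from the original dataset.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
n_rows, n_features, window_size = 10_000, 8, 50

data = np.zeros((n_rows, n_features), dtype=np.float32)
n_windows = n_rows - window_size  # same convention as time_framer

# Bytes needed if every window is stored as its own copy.
framed_nbytes = n_windows * window_size * n_features * data.itemsize
ratio = framed_nbytes / data.nbytes

print(data.nbytes / 10 ** 6)    # size of the 2-D input in MB
print(framed_nbytes / 10 ** 6)  # size of the copied 3-D output in MB
print(ratio)                    # blow-up factor, close to window_size
```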

This is the function I have been using so far to reshape the input data into a 3-dimensional structure.

from sys import getsizeof

import numpy as np

def time_framer(data_to_frame, window_size=1):
    """It transforms a 2d dataset into 3d based on a specific size;
    original function can be found at:
    https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
    """
    n_datapoints = data_to_frame.shape[0] - window_size
    # allocate the float32 output directly instead of casting afterwards
    framed_data = np.empty(
        shape=(n_datapoints, window_size, data_to_frame.shape[1]),
        dtype=np.float32)

    # copy each window of consecutive rows into its own slot
    for index in range(n_datapoints):
        framed_data[index] = data_to_frame[index:(index + window_size)]
    print(framed_data.shape)

    # it prints the size of the output in MB
    print(framed_data.nbytes / 10 ** 6)
    print(getsizeof(framed_data) / 10 ** 6)

    # quick and dirty quality test to check if the data has been correctly reshaped
    test1=list(set(framed_data[0][1]==framed_data[1][0]))
    if test1[0] and len(test1)==1:
        print('Data is correctly framed')

    return framed_data

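I was advised to use numpy's strides trick to overcome this kind of problem and reduce the size of the reshaped data. Unfortunately, every resource I have found on this topic so far focuses on applying the trick to 2-dimensional arrays, just like this excellent tutorial. I have been struggling with my use case, which involves a 3-dimensional output. Here is the best I have come up with; however, it neither reduces the size of framed_data nor frames the data correctly, since it does not pass the quality test.

I am pretty sure my mistake lies in the strides parameter, which I do not fully understand. new_strides are the only values I managed to pass to as_strided successfully.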

from numpy.lib.stride_tricks import as_strided

def strides_trick_time_framer(data_to_frame, window_size=1):

    new_strides = (data_to_frame.strides[0],
                   data_to_frame.strides[0]*data_to_frame.shape[1] ,
                   data_to_frame.strides[0]*window_size)

    n_datapoints = data_to_frame.shape[0] - window_size
    print('striding.....')
    framed_data = as_strided(data_to_frame, 
                             shape=(n_datapoints, # .flatten() here did not change the outcome
                                    window_size,
                                    data_to_frame.shape[1]),                   
                                    strides=new_strides).astype(np.float32)
    # it prints the size of the output in MB
    print(framed_data.nbytes / 10 ** 6)
    print(getsizeof(framed_data) / 10 ** 6)

    # quick and dirty test to check if the data has been correctly reshaped        
    test1=list(set(framed_data[0][1]==framed_data[1][0]))
    if test1[0] and len(test1)==1:
        print('Data is correctly framed')

    return framed_data

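Any help would be greatly appreciated!

【Question comments】:

I edited the question because I actually cast to float32 to save space. I don't know whether that changes anything.

Tags: python numpy data-structures time-series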

【Solution 1】:

For this X:

In [734]: X = np.arange(24).reshape(8,3)
In [735]: X.strides
Out[735]: (24, 8)

This as_strided produces the same array as your time_framer:

In [736]: np.lib.stride_tricks.as_strided(X, 
            shape=(X.shape[0]-3, 3, X.shape[1]), 
            strides=(24, 24, 8))
Out[736]: 
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 3,  4,  5],
        [ 6,  7,  8],
        [ 9, 10, 11]],

       [[ 6,  7,  8],
        [ 9, 10, 11],
        [12, 13, 14]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[12, 13, 14],
        [15, 16, 17],
        [18, 19, 20]]])

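It strides the last dimension just like X, and the second-to-last as well. The first dimension advances one row at a time, so it also gets X.strides[0]. So the window size only affects the shape, not the strides.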
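One way to read those strides: element [i, j, k] of the strided array is element [i + j, k] of X, because both the window index i and the step index j move forward by one row. A small check of that mapping, using the same X as above (writeable=False is an extra safety flag added here, not part of the answer):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

X = np.arange(24).reshape(8, 3)
window_size = 3
n_windows = X.shape[0] - window_size

row_stride, col_stride = X.strides
framed = as_strided(
    X,
    shape=(n_windows, window_size, X.shape[1]),
    strides=(row_stride, row_stride, col_stride),
    writeable=False,  # overlapping windows share memory, so keep it read-only
)

# Step j of window i is row i + j of X.
for i in range(n_windows):
    for j in range(window_size):
        assert np.array_equal(framed[i, j], X[i + j])

print(framed[1, 0], framed[0, 1])  # both are row 1 of X
```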

So in your as_strided version just use:

 new_strides = (data_to_frame.strides[0],
                data_to_frame.strides[0] ,
                data_to_frame.strides[1])

A minor correction: set the default window size to 2 or larger. A value of 1 produces an index error in the test:

framed_data[0,1]==framed_data[1,0]
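Putting the two corrections together, a minimal sketch of the question's function could look like the following. Dropping the trailing .astype call, passing writeable=False, and rewriting the quality test with np.array_equal are assumptions added here, not something the answer prescribes.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def strides_trick_time_framer(data_to_frame, window_size=2):
    """Return a read-only (n_datapoints, window_size, n_features) view."""
    n_datapoints = data_to_frame.shape[0] - window_size
    row_stride, feature_stride = data_to_frame.strides
    framed_data = as_strided(
        data_to_frame,
        shape=(n_datapoints, window_size, data_to_frame.shape[1]),
        # window index and step index both advance by one row
        strides=(row_stride, row_stride, feature_stride),
        writeable=False,  # overlapping windows share memory
    )
    # same idea as the quality test in the question
    if np.array_equal(framed_data[0, 1], framed_data[1, 0]):
        print('Data is correctly framed')
    return framed_data


X = np.arange(24, dtype=np.float64).reshape(8, 3)
x1 = strides_trick_time_framer(X, 4)
print(x1.shape, x1.strides)
print(np.shares_memory(x1, X))  # True: no data was copied
```

No .astype is applied to the result, since that would force a full copy.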

Looking at getsizeof:

In [754]: sys.getsizeof(X)
Out[754]: 112
In [755]: X.nbytes
Out[755]: 192

Wait, why is the size of X smaller than its nbytes? Because it is a view (see line [734] above).

In [756]: sys.getsizeof(X.copy())
Out[756]: 304

As noted in another SO question, getsizeof has to be used with caution:

Why the size of numpy array is different?
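The difference can be reproduced without relying on exact byte counts, which vary by platform and NumPy version. In this sketch the names owner, view and copy are illustrative only:

```python
import sys
import numpy as np

owner = np.arange(24)        # owns its data buffer
view = owner.reshape(8, 3)   # view onto the same buffer
copy = view.copy()           # owns a fresh buffer

print(view.base is owner)                         # True
print(view.nbytes == copy.nbytes)                 # True: nbytes ignores ownership
print(sys.getsizeof(view) < sys.getsizeof(copy))  # True: only the owner counts its data
```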

Now the expanded copy:

In [757]: x2=time_framer(X,4)
...
In [758]: x2.strides
Out[758]: (96, 24, 8)
In [759]: x2.nbytes
Out[759]: 384
In [760]: sys.getsizeof(x2)
Out[760]: 512

And the strided version:

In [761]: x1=strides_trick_time_framer(X,4)
...
In [762]: x1.strides
Out[762]: (24, 24, 8)
In [763]: sys.getsizeof(x1)
Out[763]: 128
In [764]: x1.astype(int).strides
Out[764]: (96, 24, 8)
In [765]: sys.getsizeof(x1.astype(int))
Out[765]: 512

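The size of x1 is that of a view (128 because it is 3d). But if we try to change its dtype, it makes a copy, and the strides and size become the same as for x2.

Many operations on x1 will lose the strided size advantage: x1.ravel(), x1+1, and so on. It is mainly reductions like mean and sum that produce a real space saving.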
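A quick way to see which operations keep the view and which expand it is np.shares_memory. The following is a sketch on the same small X; the variable names are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

X = np.arange(24, dtype=np.float64).reshape(8, 3)
r, c = X.strides
x1 = as_strided(X, shape=(4, 4, 3), strides=(r, r, c), writeable=False)

print(np.shares_memory(x1, X))            # True: still a view
print(np.shares_memory(x1[:, -1, :], X))  # True: basic slicing keeps the view
print(np.shares_memory(x1.ravel(), X))    # False: ravel has to copy
print(np.shares_memory(x1 + 1, X))        # False: arithmetic allocates a full array

window_means = x1.mean(axis=1)            # reduction: the output is only (4, 3)
print(window_means.shape, window_means.nbytes)
```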

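【Comments】:

Using sys.getsizeof I do see an improvement, but as per my edit, I actually cast the dtype to float32 to save memory; as float32 the "strided" array does not get any lighter.
An as_strided array is a view of the original. That is, it uses the original data buffer. astype forces it to make a copy, and that copy is full size. Compare the strides attribute with and without astype. There is only a limited set of things you can do with an as_strided array before it makes a full copy.
I added some getsizeof tests.
So, getsizeof is not useful for a view, which is what the strides trick returns; astype creates a full copy of the view, cancelling the benefit of the strides trick; @Daniel F pointed out that nbytes is a naive ndarray.itemsize * ndarray.size that does not account for shared elements.
Yes, there is no meaningful measure of the memory saving of as_strided. As a view it needs no additional memory (beyond the array object overhead), and a copy is expanded to full size.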
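One possible way to reconcile float32 with the strides trick, following from the comments above, is to cast the small 2-D array first and stride afterwards, so the only copy made is the 2-D one. This is a sketch under that assumption; data32 and framed32 are illustrative names.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

data = np.arange(24, dtype=np.float64).reshape(8, 3)
window_size = 4
n_datapoints = data.shape[0] - window_size

# Cast the 2-D array once; this is the only copy that is made.
data32 = data.astype(np.float32)

r, c = data32.strides
framed32 = as_strided(
    data32,
    shape=(n_datapoints, window_size, data32.shape[1]),
    strides=(r, r, c),
    writeable=False,
)

print(framed32.dtype, framed32.strides)     # float32 (12, 12, 4)
print(np.shares_memory(framed32, data32))   # True: the 3-D array is still a view
```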

【Solution 2】:

You can use the striding template function window_nd that I wrote here.

Then, to stride over just the first dimension, all you need is:

framed_data = window_nd(data_to_frame, window_size, axis = 0)

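I have not found a built-in window function that works over arbitrary axes, so unless a new one has been implemented recently in scipy.signal or skimage, this is probably your best bet.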
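For readers on a newer NumPy (1.20 or later), numpy.lib.stride_tricks.sliding_window_view is a built-in alternative that postdates this answer. A sketch, assuming that version is available: it appends the window axis last, so a transpose is needed to get (datapoints, time_window, features), and it yields one more window than time_framer does (rows - window_size + 1).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

X = np.arange(24).reshape(8, 3)
window_size = 4

# Windows along axis 0; the window axis is appended last: (5, 3, 4).
windows = sliding_window_view(X, window_size, axis=0)

# Reorder to (n_windows, window_size, n_features); this is still a view.
framed = windows.transpose(0, 2, 1)

print(framed.shape)                 # (5, 4, 3)
print(np.shares_memory(framed, X))  # True
```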

EDIT: To see the memory savings you need to use the method described by @ali_m here, because the basic ndarray.nbytes is naive about shared memory.

def find_base_nbytes(obj):
    if obj.base is not None:
        return find_base_nbytes(obj.base)
    return obj.nbytes
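A short usage sketch for the helper above, comparing it with nbytes on a strided view and on a copy. The array X and the as_strided call are illustrative and use the strides from Solution 1, not window_nd.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def find_base_nbytes(obj):
    # walk up the chain of .base references to the array that owns the data
    if obj.base is not None:
        return find_base_nbytes(obj.base)
    return obj.nbytes


X = np.arange(24, dtype=np.float64).reshape(8, 3)
r, c = X.strides
framed = as_strided(X, shape=(4, 4, 3), strides=(r, r, c), writeable=False)

print(framed.nbytes)                    # 384: itemsize * size, shared rows counted repeatedly
print(find_base_nbytes(framed))         # 192: bytes of the buffer actually allocated
print(find_base_nbytes(framed.copy()))  # 384: a copy owns a full-size buffer
```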

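【Comments】:

The new array passes the quality check, but the memory size does not improve.
Hm. ndarray.nbytes appears to be a naive ndarray.itemsize * ndarray.size. It does not account for shared elements at all. If you want to determine the actual size of a strided array, look at the method here.
I only see a memory improvement when I use sys.getsizeof (the base approach shows no improvement) and only when I keep the dtype as float64. If I use float32 to save more memory, the resulting array has no base and shows no improvement in memory.
Be careful when using getsizeof with arrays: stackoverflow.com/questions/52129595/…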