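Create a new array with Timesteps and multiple features, e.g. for LSTM

【Question title】: Create a new array with Timesteps and multiple features, e.g. for LSTM
【Posted】: 2017-04-04 12:56:26
【Question description】:

Hello, I am using numpy to create a new array with timesteps and multiple features for an LSTM.

I have looked at a number of approaches using strides and reshaping, but haven't managed to find an efficient solution.

Here is a function that solves a toy problem; however, I have 30,000 samples, each with 100 features.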

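    import numpy as np

    def make_timesteps(a, timesteps):
        # For each row j, collect rows j, j+1, ..., j+timesteps-1 (wrapping around).
        array = []
        for j in np.arange(len(a)):
            unit = []
            for i in range(timesteps):
                # Shift by -i so that position j holds row j+i.
                unit.append(np.roll(a, -i, axis=0)[j])
            array.append(unit)
        return np.array(array)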

inArr = np.array([[1, 2], [3,4], [5,6]])

inArr.shape => (3, 2)

outArr = make_timesteps(inArr, 2)

outArr.shape => (3, 2, 2)

    assert(np.array_equal(outArr, 
           np.array([[[1, 2], [3, 4]], [[3, 4], [5, 6]], [[5, 6], [1, 2]]])))

=> True

Is there a more efficient way of doing this (there must be!)? Can someone please help?

【Question discussion】:

Tags: python arrays performance numpy

【Solution 1】:

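One trick is to take the last L-1 rows of the array and prepend them to the start of the array. After that, it becomes a simple case of using the very efficient NumPy strides. For anyone wondering about the cost of this trick: as we will see later in the timing tests, it is next to nothing.

With the final goal of supporting both forward and backward striding, the trick looks like this in code:

Backward striding: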

import numpy as np

def strided_axis0_backward(inArr, L = 2):
    # INPUTS :
    # inArr : Input 2D array
    # L : Number of rows (timesteps) in each subarray

    # Prepend the last L-1 rows to the start. It just helps in keeping a view output.
    # Slicing from len(inArr)-L+1 also handles L = 1 (nothing is prepended).
    a = np.vstack(( inArr[len(inArr)-L+1:], inArr ))

    # Store shape and strides info
    m,n = a.shape
    s0,s1 = a.strides

    # Length of 3D output array along its axis=0
    nd0 = m - L + 1

    # Start at the first original row and step backward (-s0) within each subarray
    strided = np.lib.stride_tricks.as_strided
    return strided(a[L-1:], shape=(nd0,L,n), strides=(s0,-s0,s1))

Forward striding:

def strided_axis0_forward(inArr, L = 2):
    # INPUTS :
    # inArr : Input 2D array
    # L : Number of rows (timesteps) in each subarray

    # Append the first L-1 rows to the end. It just helps in keeping a view output.
    a = np.vstack(( inArr , inArr[:L-1] ))

    # Store shape and strides info
    m,n = a.shape
    s0,s1 = a.strides

    # Length of 3D output array along its axis=0
    nd0 = m - L + 1

    # Start at the first row and step forward (+s0) within each subarray
    strided = np.lib.stride_tricks.as_strided
    return strided(a, shape=(nd0,L,n), strides=(s0,s0,s1))

Sample run -

In [42]: inArr
Out[42]: 
array([[1, 2],
       [3, 4],
       [5, 6]])

In [43]: strided_axis0_backward(inArr, 2)
Out[43]: 
array([[[1, 2],
        [5, 6]],

       [[3, 4],
        [1, 2]],

       [[5, 6],
        [3, 4]]])

In [44]: strided_axis0_forward(inArr, 2)
Out[44]: 
array([[[1, 2],
        [3, 4]],

       [[3, 4],
        [5, 6]],

       [[5, 6],
        [1, 2]]])
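
As a sanity check on the two orderings shown in the sample run, the same wrap-around windows can also be built with modular fancy indexing. This is a minimal sketch and not part of the original answer: `windows_by_index` is a hypothetical helper name, and unlike the strided functions it returns a copy rather than a view, so it is useful for verification rather than speed.

```python
import numpy as np

def windows_by_index(arr, L=2, forward=True):
    # Hypothetical helper (not from the original answer).
    # Row i of the result holds arr[i], arr[i+1], ... when forward=True,
    # or arr[i], arr[i-1], ... when forward=False, wrapping around the ends.
    n = len(arr)
    step = 1 if forward else -1
    idx = (np.arange(n)[:, None] + step * np.arange(L)) % n
    return arr[idx]  # shape (n, L, n_features); fancy indexing makes a copy

inArr = np.array([[1, 2], [3, 4], [5, 6]])
fwd = windows_by_index(inArr, 2, forward=True)
bwd = windows_by_index(inArr, 2, forward=False)
```

Comparing `fwd` and `bwd` against the strided outputs with `np.array_equal` is a cheap way to confirm which ordering a given function produces.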

Runtime test -

In [53]: inArr = np.random.randint(0,9,(1000,10))

In [54]: %timeit make_timesteps(inArr, 2)
    ...: %timeit strided_axis0_forward(inArr, 2)
    ...: %timeit strided_axis0_backward(inArr, 2)
    ...: 
10 loops, best of 3: 33.9 ms per loop
100000 loops, best of 3: 12.1 µs per loop
100000 loops, best of 3: 12.2 µs per loop

In [55]: %timeit make_timesteps(inArr, 10)
    ...: %timeit strided_axis0_forward(inArr, 10)
    ...: %timeit strided_axis0_backward(inArr, 10)
    ...: 
1 loops, best of 3: 152 ms per loop
100000 loops, best of 3: 12 µs per loop
100000 loops, best of 3: 12.1 µs per loop

In [56]: 152000/12.1  # Speedup figure
Out[56]: 12561.98347107438

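The timings for strided_axis0 stay the same even as we increase the length of the subarrays in the output. This shows the huge benefit of strides, which create a view instead of copying data, and of course the crazy speedup over the original loopy version.

As promised at the start, here are the timings for the stacking cost of np.vstack -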
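
The constant timings follow from `as_strided` building only a new view over the padded array, so no window data is copied. The sketch below illustrates this for the forward layout. The variable names are illustrative, and passing `writeable=False` is an addition of mine (not in the original answer) to guard against accidental writes into windows that overlap in memory.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

inArr = np.arange(12).reshape(6, 2)
L = 3

# Pad with the first L-1 rows so that windows can wrap around the end.
a = np.vstack((inArr, inArr[:L-1]))
s0, s1 = a.strides

# Read-only view: overlapping windows reuse the same underlying rows.
out = as_strided(a, shape=(len(inArr), L, a.shape[1]),
                 strides=(s0, s0, s1), writeable=False)

shares = np.shares_memory(out, a)  # the windows alias the padded array
owns = out.flags.owndata           # the view does not own its data
```

Because the windows alias each other, writing through a writeable strided view can change several windows at once; copying the result with `out.copy()` is the safe route if it must be modified.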

In [417]: inArr = np.random.randint(0,9,(1000,10))

In [418]: L = 10

In [419]: %timeit np.vstack(( inArr[-L+1:], inArr ))
100000 loops, best of 3: 5.41 µs per loop

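The timings support the idea that stacking is pretty efficient.

【Discussion】:

Thank you so much - this has been really helpful. I had looked at as_strided before but didn't understand it until your example and link! To get the same order, I appended the first row to the end and then used np.flip on axis 1. I have edited the question to show my final code.
@nickyzee I don't think I understand why you would need that flip. Is your make_timesteps correct? I wrote my code to produce the same results as make_timesteps. With your flip suggestion, my code produces different results from make_timesteps. Could you clarify?
Interesting - the flip was needed to produce the same output as the original. A visual inspection of the output confirms it too. Not sure why - numpy@latest and py3.5 - my original returns array([[[1, 2], [3, 4]], [[3, 4], [5, 6]], [[5, 6], [7, 8]], [[7, 8], [1, 2]]])
@nickyzee Updated with two versions, for forward and backward striding. Check those out!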
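
On NumPy 1.20 or later, `np.lib.stride_tricks.sliding_window_view` gives a bounds-checked way to get the same forward windows without calling `as_strided` directly. The following is a sketch under that version assumption and is not part of the original answer; `forward_windows` is a hypothetical name.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def forward_windows(arr, L=2):
    # Hypothetical helper; requires NumPy >= 1.20.
    # Pad with the first L-1 rows so the last windows wrap around.
    a = np.vstack((arr, arr[:L-1]))
    # sliding_window_view puts the window axis last: (n, n_features, L).
    w = sliding_window_view(a, L, axis=0)
    # Move the window axis to the middle: (n, L, n_features).
    return w.transpose(0, 2, 1)

inArr = np.array([[1, 2], [3, 4], [5, 6]])
out = forward_windows(inArr, 2)
```

The result is a read-only view by default, which avoids the accidental-write hazard of overlapping windows.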