How to efficiently sample time series stream data: answers

【Question title】: How to efficiently sample time series stream data
【Posted】: 2019-01-26 04:21:47
【Question description】:

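I am trying to work out how to efficiently sample one point out of every n points in a stream, where n is dynamic: it changes over time and equals (total data sent) / (capacity of the buffer).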

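I have a buffer that can hold 10,000 points, and the data source is a stream that sends one point to the buffer at a time. If 20,000 raw points have been sent in total, the buffer should hold the points with indices 2, 4, 6, 8, 10, ..., 20,000, that is, the 20,000 points scaled down into 10,000 slots. The sampled indices do not have to be exactly 2, 4, 6, but in this case the gap between two consecutive indices should average about 2.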
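To make the target concrete, here is a minimal sketch of the evenly spaced indices described above. The helper name `ideal_indices` is hypothetical and not part of the original post; it assumes 1-based indices and simply maps slot k to position k * total / capacity, rounded down.

```python
def ideal_indices(total, capacity):
    """Return the 1-based indices an evenly spaced sample would keep."""
    if total <= capacity:
        # Everything still fits, so nothing has to be dropped yet.
        return list(range(1, total + 1))
    # Slot k (1-based) takes the point at position k * total / capacity.
    return [(k * total) // capacity for k in range(1, capacity + 1)]


print(ideal_indices(20, 10))  # every second point
print(ideal_indices(25, 10))  # gaps alternate between 2 and 3
```

Recomputing this list for every new point is exactly the cost the question wants to avoid, which is why a streaming method is needed.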

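Because the total number of data points keeps changing, resampling every time a new point arrives would be very slow (each pass extracts 10,000 points out of N, and N grows over time). So I am wondering whether there is a better algorithm that reduces the computation while still keeping the sampling accurate.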

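*I tried to handle this with a probabilistic approach. It does something similar, but the result is not precise, so I don't know how to reach my goal.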

from random import randint
from time import sleep

samplingCout = 10

reservoir1 = []
reservoir2 = []
reservoir3 = []
avg = []

count = 0

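def sample(arr, data):
    # count is a single global shared by every reservoir, so it advances
    # once per call (three times per data point in the loop below).
    global count
    count += 1
    if len(arr) < samplingCout:
        # Fill phase: keep every point until the buffer is full.
        arr.append(data)
    else:
        # Accept the new point with probability 1 / (count // samplingCout + 1).
        if randint(0, int(count / samplingCout)) == int(count / samplingCout):
            index = randint(0, samplingCout - 1)  # drawn but never used
            sleep(0.001)
            # Drop the oldest point and append the new one.
            del arr[0]
            arr.append(data)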

for i in range(1, 1000):
    sample(reservoir1,i)
    sample(reservoir2,i)
    sample(reservoir3,i)

for i in range(0, samplingCout):
    avg.append(int((reservoir1[i] + reservoir2[i] + reservoir3[i])/3))

print(avg)

Thanks.

【Comments】:

Tags: python math random statistics

【Solution 1】:

It looks like you need reservoir sampling, which fairly samples a fixed-size buffer from a stream of events of unknown length.

Some simple code:

import numpy as np
import random

x = 0
N = 20000
EOS = -1

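def next_event():
    # Simulated stream: returns 0, 1, 2, ... and then EOS.
    # As written, the check runs after the increment, so the last value
    # returned is N - 2 and N - 1 is replaced by EOS.
    global x
    q = x
    x += 1

    if x == N:
        return EOS
    return q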

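def sample(count):
    # Reservoir sampling (Algorithm R) with a buffer of `count` items.
    sampled = list()

    index = 0
    q = next_event()
    while q >= 0:
        if index < count:
            # Fill phase: the first `count` events go straight in.
            sampled.append(q)
        else:
            # Event number `index` (0-based) replaces a random slot
            # with probability count / (index + 1).
            r = random.randint(0, index)
            if r < count:
                sampled[r] = q

        index += 1
        q = next_event()

    # Restore stream order before returning.
    sampled.sort()
    return sampled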

if __name__ == "__main__":

    random.seed(32345)
    res = sample(10000)
    diffs = np.array(res[1:]) - np.array(res[:-1])
    print(np.mean(diffs))

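Notes:

The code works on integers; if you have complex records, store (index, record) tuples instead.
Sort by index at the end; this should be cheap.
I computed the mean distance between samples and got 1.9997999799979997.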
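The first two notes can be sketched as follows. This is only an illustration under assumptions: the function name `reservoir_sample`, the `rng` parameter, and the dictionary records are made up for the example and do not appear in the answer's code.

```python
import random


def reservoir_sample(stream, capacity, rng=random):
    """Keep a uniform random sample of up to `capacity` (index, record) pairs."""
    sampled = []
    for index, record in enumerate(stream):
        if index < capacity:
            # Fill phase: store the first `capacity` records as they arrive.
            sampled.append((index, record))
        else:
            # randint is inclusive on both ends, so r ranges over 0..index.
            r = rng.randint(0, index)
            if r < capacity:
                sampled[r] = (index, record)
    # Sort on the stored index only, so records never need to be comparable.
    sampled.sort(key=lambda pair: pair[0])
    return sampled


# Hypothetical records: any object works because only the index is compared.
records = ({"t": i, "value": i * 0.5} for i in range(20000))
sample = reservoir_sample(records, 10000, random.Random(32345))
print(len(sample))
print(sample[0], sample[-1])
```

Storing the index next to the record is what makes the final sort possible, since slots in the reservoir are overwritten in random order.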

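【Discussion】:

@HaoyuanTang You're welcome. Note that the distances between consecutive samples follow a geometric distribution (a kind of exponential distribution, but for discrete events), with a mean of 2.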
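A rough way to check the comment above is to compare the observed gap frequencies with a geometric distribution with p = 1/2. The sketch below assumes that a sorted `random.sample` of 10,000 indices out of 20,000 is an acceptable stand-in for the reservoir's output, since both are uniform random subsets; the seed and variable names are arbitrary.

```python
import random
from collections import Counter

N, CAPACITY = 20000, 10000
rng = random.Random(12345)  # arbitrary seed, only for repeatability

# Stand-in for the reservoir's output: a uniform random subset, sorted.
kept = sorted(rng.sample(range(N), CAPACITY))
gaps = [b - a for a, b in zip(kept, kept[1:])]

mean_gap = sum(gaps) / len(gaps)
counts = Counter(gaps)

print("mean gap:", round(mean_gap, 4))
for k in range(1, 6):
    observed = counts[k] / len(gaps)
    expected = 0.5 ** k  # geometric pmf with p = 1/2: (1 - p)**(k - 1) * p
    print(k, round(observed, 4), expected)
```

The match is approximate because the sample is drawn without replacement from a finite stream. It also shows why individual gaps vary a lot even though their mean is close to 2.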