Sampling a NumPy 1D array with controlled randomness: answers

【Question title】: Sampling a NumPy 1D array with controlled randomness
【Posted】: 2021-11-19 15:26:47
【Question】:

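I have a one-dimensional NumPy array a of length l, from which I want to sample int(np.log(l)) instances, but I want the samples to be:

1. quasi-uniformly distributed, and
2. random.

By 1, I mean that I want to avoid two samples being closer to each other than int(l/int(np.log(l))).
By 2, I mean that I do not want to get the same instances as samples every time.
I should also stress that I cannot change the random seed.

One way is to split the array into int(np.log(l)) subarrays and then draw one element at random from each subarray, but I am looking for a more efficient implementation, because I need to do this on a considerable amount of data.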

import numpy as np
a = np.array([np.random.randint(1000) for _ in range(1000)])
a = np.sort(a)
l = len(a)
random_indices = np.random.randint(0, l, int(np.log(l)))
samples = a[random_indices]
samples = np.sort(samples)
samples
# array([183, 536, 644, 791, 925, 999])

Any comments, suggestions, and help are appreciated.
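The split-into-subarrays approach described above does not need a Python loop. A minimal sketch, assuming NumPy 1.17 or later for np.random.default_rng; the helper name stratified_indices is hypothetical, and the separate unseeded Generator is an assumption made so that the global random seed is left untouched:

```python
import numpy as np


def stratified_indices(l, k, rng):
    """Draw one random index from each of k contiguous blocks of range(l)."""
    # k + 1 integer block edges from 0 to l (block sizes may differ by one).
    edges = np.linspace(0, l, k + 1).astype(np.int64)
    # One uniform integer per block, all drawn in a single vectorized call.
    return rng.integers(edges[:-1], edges[1:])


rng = np.random.default_rng()  # unseeded: different samples on every run
a = np.sort(rng.integers(0, 1000, size=1000))
k = int(np.log(len(a)))
idx = stratified_indices(len(a), k, rng)
samples = a[idx]  # already ordered, because a is sorted and idx is increasing
print(idx, samples)
```

Every block contributes exactly one index, but two samples drawn from neighbouring blocks can still be closer than int(l/int(np.log(l))), so this covers the "one sample per block" idea rather than a strict minimum distance.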

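【Comments】:

Have you considered using np.random.choice? If you use it with replace=False and size=int(np.log(l)), I think you will get what you want.
Thanks @Luckk, but that does what I am already doing: it gives int(np.log(l)) uniformly distributed samples. I want to make sure there is at least one sample in each block.
A distribution cannot be both uniform and constrained. What you are asking for is mathematically impossible.
You are right, @obchardon. I have edited my question.

Tags: python numpy random sampling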

【Solution 1】:

For this problem, we can use a built-in Python function: random.sample.
So the code is:

import random

import numpy as np

a = np.arange(1000)
l = len(a)
# random.sample requires a sequence, so convert the array to a list first
random_sample = random.sample(a.tolist(), int(np.log(l)))
print(random_sample)

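Is this acceptable to you?

【Discussion】:

Thanks for your comment, @Nam. I have edited the question to make it clearer.
I see. This answer should really have been a comment, but I do not have enough reputation on this site. I will look for something else.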

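【Solution 2】:

This comment exceeded the comment length limit, so I am posting it here.

Do the values of the NumPy array matter? Are they Boolean? Complex?
You wrote:

I want to sample int(np.log(l)) instances

Does this mean that you want to select int(np.log(l)) indices from the array?
You wrote:

avoid two samples being closer to each other than int(l/int(np.log(l)))

By samples, do you mean indices or values? In other words, what is the distance between two elements of the array: the difference of their indices or the difference of their values?
You wrote:

I do not want to get the same instances as samples every time

Do you mean that you do not want to select an index twice, or that you do not want two selected indices to represent the same value?

If by distance you mean the difference of indices, and you do not want to select an index twice, then I do not understand how your code passes the closeness criterion.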

import pandas as pd
import numpy as np
np.random.seed(0)
import math
L=5000
a = np.array([np.random.randint(L) for _ in range(L)])
a = np.sort(a)
random_indices = np.random.randint(0, L, int(np.log(L)))
samples = a[random_indices]
samples = np.sort(samples)
data = pd.DataFrame(data={"index":samples})
data["compared"] = data["index"].shift(-1)
data["dist"] = data["compared"] - data["index"]
print(data)
wrong_indices = data["dist"] < int(L/int(math.log(L)))
print("Indices closer than", int(L/int(math.log(L))), ":")
print(data[wrong_indices][["index","compared"]])

The output is:

   index  compared    dist
0    255     701.0   446.0
1    701    1537.0   836.0
2   1537    1671.0   134.0
3   1671    2589.0   918.0
4   2589    3393.0   804.0
5   3393    3576.0   183.0
6   3576    4828.0  1252.0
7   4828       NaN     NaN
Indices closer than 625 :
   index  compared
0    255     701.0
2   1537    1671.0
5   3393    3576.0

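Is an iterative (also known as unstable) solution acceptable? If so, how about generating int(np.log(l)) points efficiently (using NumPy), discarding the bad points, generating the missing points again, and repeating until there are enough good points?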
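The discard-and-redraw loop suggested above might look roughly like this. It is only a sketch: the function name sample_min_gap, the max_rounds guard, and the use of np.random.default_rng are assumptions, not part of the original answer, and the demo uses a relaxed gap of l // (2 * k) so that it finishes quickly.

```python
import numpy as np


def sample_min_gap(l, k, min_gap, rng, max_rounds=10_000):
    """Iteratively draw k sorted indices from range(l) with gaps >= min_gap."""
    kept = np.empty(0, dtype=np.int64)
    for _ in range(max_rounds):
        # Generate only the points that are still missing.
        fresh = rng.integers(0, l, size=k - kept.size)
        merged = np.sort(np.concatenate([kept, fresh]))
        # Discard every point closer than min_gap to its sorted predecessor;
        # the prepended value guarantees that the first point is kept.
        gaps = np.diff(merged, prepend=merged[0] - min_gap)
        kept = merged[gaps >= min_gap]
        if kept.size == k:
            return kept
    raise RuntimeError("no valid sample found; try a smaller min_gap")


rng = np.random.default_rng()  # hypothetical choice: leaves the global seed alone
l = 1000
k = int(np.log(l))
indices = sample_min_gap(l, k, l // (2 * k), rng)
print(indices, np.diff(indices))
```

With the full gap int(l/int(np.log(l))) from the question, valid layouts are much rarer, so the loop can need many more rounds; that is what the max_rounds guard is for.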

【Discussion】: