I just thought of another way to make the numbers look less fake. It is much slower, so only use it if you don't mind that or your dataset is small. Here is an example for a dataset of size 40, but you can change the value of the COUNT variable to generate a bigger one. The code can also be adapted to other target values: just change the constants at the top.
We start the same way as in my previous answer, satisfying every requirement except MEAN and STD:
from math import floor
lr = 10e-6 #Default learning rate; the learn function below also takes lr as a parameter
COUNT = 40.0
MEAN = 35.790875
STD = 24.874763
MIN = 0.0
P25 = 16.0
P50 = 32.0
P75 = 49.0
MAX = 99.0
#Positions of the percentiles
P25_pos = floor(0.25 * COUNT) - 1
P50_pos = floor(0.5 * COUNT) - 1
P75_pos = floor(0.75 * COUNT) - 1
MAX_pos = int(COUNT -1)
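With COUNT = 40 these work out to indices 9, 19, 29 and 39; a quick check (not needed for the rest of the code):
print(P25_pos, P50_pos, P75_pos, MAX_pos) #9 19 29 39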
#Count requirement
X = [0.0] * int(COUNT)
#Min requirement
X[0] = MIN
#Max requirement
X[MAX_pos] = MAX
#Good, we already satisfied the easiest 3 requirements. Notice that these are deterministic,
#there is only one way to satisfy them
#This will satisfy the 25th percentile requirement
for i in range(1, P25_pos):
    #We could also interpolate these values from MIN to P25, even adding a bit of randomness
    X[i] = 0.0
X[P25_pos] = P25
#Actually pandas does some linear interpolation (https://stackoverflow.com/questions/39581893/pandas-find-percentile-stats-of-a-given-column)
#when calculating percentiles, but we can simulate that by letting the next value also be P25
if P25_pos + 1 != P50_pos:
    X[P25_pos + 1] = P25
#We do something extremely similar with the other percentiles
for i in range(P25_pos + 2, P50_pos):
    X[i] = P25
X[P50_pos] = P50
if P50_pos + 1 != P75_pos:
    X[P50_pos + 1] = P50
for i in range(P50_pos + 2, P75_pos):
    X[i] = P50
X[P75_pos] = P75
if P75_pos + 1 != MAX_pos:
    X[P75_pos + 1] = P75
for i in range(P75_pos + 2, MAX_pos):
    X[i] = P75
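At this point every requirement except MEAN and STD already holds. If you have pandas installed, you can verify that with a quick sanity check (the 'col' column name just mirrors the output shown further down):
import pandas as pd
#count, min, 25%, 50%, 75% and max should already match the targets here
print(pd.DataFrame({'col': X}).describe())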
But then we treat it as a (constrained) gradient descent problem: we want to minimize the difference between our MEAN and STD and the expected MEAN and STD, while keeping the quartile values in place. The values we want to learn are the values of our dataset, excluding the quartiles of course, since we already have a constraint on what those must be.
def std(X):
    #Sample standard deviation of the dataset; handy for checking progress against STD
    m = sum(X) / len(X)
    return (sum([(val - m)**2 for val in X]) / (len(X) - 1)) ** 0.5
#This function measures the difference between our STD and MEAN and the expected values
def cost(X):
    m = sum(X) / len(X)
    return (sum([(val - m)**2 for val in X]) / (len(X) - 1) - STD**2) ** 2 + (m - MEAN)**4
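Before optimizing, it can be useful to evaluate the cost on the initial dataset to get a baseline (just a sanity check):
#The cost starts out large and should shrink towards 0 during gradient descent
print(cost(X))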
#You have to install this library first (pip install autograd)
import autograd.numpy as anp # Thinly-wrapped numpy
from autograd import grad #for automatically calculating gradients of functions
#This is the derivative of the cost and it is used in the gradient descent to update the values of the dataset
grad_cost = grad(cost)
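If you want to convince yourself that autograd's gradient is right, you can spot-check one coordinate against a finite-difference estimate (an optional sketch; the eps value is an arbitrary small step):
#Compare one autograd partial derivative with a numerical approximation
eps = 1e-6
k = 1 #any index whose value we are allowed to change
Xp = list(X)
Xp[k] += eps
print(grad_cost(X)[k], (cost(Xp) - cost(X)) / eps) #the two numbers should be close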
def learn(lr, epochs):
    for j in range(0, epochs):
        #Compute the full gradient once per epoch, scaled by the learning rate
        gr = [g * lr for g in grad_cost(X)]
        #Only update the values strictly between the fixed positions, and only
        #if the update keeps them inside their bracket, so the quartiles survive
        for i in range(1, P25_pos):
            if X[i] - gr[i] >= MIN and X[i] - gr[i] <= P25:
                X[i] -= gr[i]
        for i in range(P25_pos + 2, P50_pos):
            if X[i] - gr[i] >= P25 and X[i] - gr[i] <= P50:
                X[i] -= gr[i]
        for i in range(P50_pos + 2, P75_pos):
            if X[i] - gr[i] >= P50 and X[i] - gr[i] <= P75:
                X[i] -= gr[i]
        for i in range(P75_pos + 2, MAX_pos):
            if X[i] - gr[i] >= P75 and X[i] - gr[i] <= MAX:
                X[i] -= gr[i]
        if j % 100 == 0:
            print(cost(X))
    print(cost(X))
    print(X)
You can now run gradient descent with the learn(learning_rate, epochs) function. I used learning rates between 10e-7 and 10e-4.
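For example (the exact learning rate and epoch count are up to you):
#One possible invocation; I used learning rates between 10e-7 and 10e-4
learn(10e-6, 100000)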
For this case, after learning for a while (around 100K epochs, which took about an hour), I got an STD of 24.871 (compared to the real value of 24.874) and a mean of 31.730 (compared to the real value of 35.790). These are the results I got:
col
count 40.000000
mean 31.730694
std 24.871651
min 0.000000
25% 16.000000
50% 32.000000
75% 49.000000
max 99.000000
with the following sorted column values:
[0.0, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 16.0, 16.0, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 32.0, 32.0, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 49.0, 49.0, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 99.0]
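Since none of these statistics depend on the order of the values, you can also shuffle the final list so it looks less obviously constructed (a purely cosmetic step):
import random
#count, mean, std, min, max and the percentiles are all order-independent
random.shuffle(X)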
These results can certainly be improved with more training. I will update the answer when I get better results.