I just thought of another way to make the numbers look less fake. It is much slower, so only use it if you don't mind that or your dataset is small. Here is an example for a dataset of size 40, but you can change the value of the COUNT variable to generate a bigger one. The code can also be adapted to other target values: just change the constants at the top.
We start the same way as in my previous answer, satisfying every requirement except MEAN and STD:
from math import floor
lr = 10e-6 #Default learning rate; the learn function below also takes lr as a parameter
COUNT = 40.0
MEAN = 35.790875
STD = 24.874763
MIN = 0.0
P25 = 16.0
P50 = 32.0
P75 = 49.0
MAX = 99.0
#Positions of the percentiles
P25_pos = floor(0.25 * COUNT) - 1
P50_pos = floor(0.5 * COUNT) - 1
P75_pos = floor(0.75 * COUNT) - 1
MAX_pos = int(COUNT -1)
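With COUNT = 40 these work out to indices 9, 19, 29 and 39; a quick check (not needed for the rest of the code):
print(P25_pos, P50_pos, P75_pos, MAX_pos) #9 19 29 39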
#Count requirement
X = [0.0] * int(COUNT)
#Min requirement
X[0] = MIN
#Max requirement
X[MAX_pos] = MAX
#Good, we already satisfied the easiest 3 requirements. Notice that these are deterministic,
#there is only one way to satisfy them
#This will satisfy the 25th percentile requirement
for i in range(1, P25_pos):
    #We could also interpolate these values from MIN to P25, even adding a bit of randomness
    X[i] = 0.0
X[P25_pos] = P25
#Actually pandas does some linear interpolation (https://stackoverflow.com/questions/39581893/pandas-find-percentile-stats-of-a-given-column)
#when calculating percentiles, but we can simulate that by letting the next value also be P25
if P25_pos + 1 != P50_pos:
    X[P25_pos + 1] = P25
#We do something extremely similar with the other percentiles
for i in range(P25_pos + 2, P50_pos):
    X[i] = P25
X[P50_pos] = P50
if P50_pos + 1 != P75_pos:
    X[P50_pos + 1] = P50
for i in range(P50_pos + 2, P75_pos):
    X[i] = P50
X[P75_pos] = P75
if P75_pos + 1 != MAX_pos:
    X[P75_pos + 1] = P75
for i in range(P75_pos + 2, MAX_pos):
    X[i] = P75
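At this point every requirement except MEAN and STD already holds. If you have pandas installed, you can verify that with a quick sanity check (the 'col' column name just mirrors the output shown further down):
import pandas as pd
#count, min, 25%, 50%, 75% and max should already match the targets here
print(pd.DataFrame({'col': X}).describe())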
But then we treat it as a (constrained) gradient descent problem: we want to minimize the difference between our MEAN and STD and the expected MEAN and STD, while keeping the quartile values in place. The values we want to learn are the values of our dataset, excluding the quartiles of course, since we already have a constraint on what those must be.
def std(X):
    #Sample standard deviation of the dataset; handy for checking progress against STD
    m = sum(X) / len(X)
    return (sum([(val - m)**2 for val in X]) / (len(X) - 1)) ** 0.5
#This function measures the difference between our STD and MEAN and the expected values
def cost(X):
    m = sum(X) / len(X)
    return (sum([(val - m)**2 for val in X]) / (len(X) - 1) - STD**2) ** 2 + (m - MEAN)**4
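Before optimizing, it can be useful to evaluate the cost on the initial dataset to get a baseline (just a sanity check):
#The cost starts out large and should shrink towards 0 during gradient descent
print(cost(X))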
#You have to install this library first (pip install autograd)
import autograd.numpy as anp # Thinly-wrapped numpy
from autograd import grad #for automatically calculating gradients of functions
#This is the derivative of the cost and it is used in the gradient descent to update the values of the dataset
grad_cost = grad(cost)
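If you want to convince yourself that autograd's gradient is right, you can spot-check one coordinate against a finite-difference estimate (an optional sketch; the eps value is an arbitrary small step):
#Compare one autograd partial derivative with a numerical approximation
eps = 1e-6
k = 1 #any index whose value we are allowed to change
Xp = list(X)
Xp[k] += eps
print(grad_cost(X)[k], (cost(Xp) - cost(X)) / eps) #the two numbers should be close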
def learn(lr, epochs):
    for j in range(0, epochs):
        #Compute the full gradient once per epoch, scaled by the learning rate
        gr = [g * lr for g in grad_cost(X)]
        #Only update the values strictly between the fixed positions, and only
        #if the update keeps them inside their bracket, so the quartiles survive
        for i in range(1, P25_pos):
            if X[i] - gr[i] >= MIN and X[i] - gr[i] <= P25:
                X[i] -= gr[i]
        for i in range(P25_pos + 2, P50_pos):
            if X[i] - gr[i] >= P25 and X[i] - gr[i] <= P50:
                X[i] -= gr[i]
        for i in range(P50_pos + 2, P75_pos):
            if X[i] - gr[i] >= P50 and X[i] - gr[i] <= P75:
                X[i] -= gr[i]
        for i in range(P75_pos + 2, MAX_pos):
            if X[i] - gr[i] >= P75 and X[i] - gr[i] <= MAX:
                X[i] -= gr[i]
        if j % 100 == 0:
            print(cost(X))
    print(cost(X))
    print(X)
You can now run gradient descent with the learn(learning_rate, epochs) function. I used learning rates between 10e-7 and 10e-4.
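For example (the exact learning rate and epoch count are up to you):
#One possible invocation; I used learning rates between 10e-7 and 10e-4
learn(10e-6, 100000)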
For this case, after learning for a while (around 100K epochs, which took about an hour), I got an STD of 24.871 (compared to the real value of 24.874) and a mean of 31.730 (compared to the real value of 35.790). These are the results I got:
col
count 40.000000
mean 31.730694
std 24.871651
min 0.000000
25% 16.000000
50% 32.000000
75% 49.000000
max 99.000000
with the following sorted column values:
[0.0, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 16.0, 16.0, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 32.0, 32.0, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 49.0, 49.0, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 99.0]
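Since none of these statistics depend on the order of the values, you can also shuffle the final list so it looks less obviously constructed (a purely cosmetic step):
import random
#count, mean, std, min, max and the percentiles are all order-independent
random.shuffle(X)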
These results can certainly be improved with more training. I will update the answer when I get better results.