[Posted]: 2021-06-26 00:21:51
[Question]:
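I've run into a strange problem when computing weighted means on a pandas DataFrame. I want to perform the following steps:
(1) Compute the weighted mean of all the data
(2) Compute the weighted mean of each group of data
The problem is that after step 2, the mean of the group means (weighted by the number of members in each group) differs from the weighted mean of all the data (step 1). Mathematically, the two should be equal. I even suspected a dtype issue, so I cast everything to float64, but the discrepancy remains. Below is a simple example that illustrates the problem:
My DataFrame has a data column, a weights column, and a groups column: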
import numpy as np
import pandas as pd

data = np.array([
    0.20651903, 0.52607571, 0.60558061, 0.97468593, 0.10253621, 0.23869854,
    0.82134792, 0.47035085, 0.19131938, 0.92288234
])
weights = np.array([
    4.06071562, 8.82792146, 1.14019687, 2.7500913, 0.70261312, 6.27280216,
    1.27908358, 7.80508994, 0.69771745, 4.15550846
])
groups = np.array([1, 1, 2, 2, 2, 2, 3, 3, 4, 4])
df = pd.DataFrame({"data": data, "weights": weights, "groups": groups})
print(df)
       data   weights  groups
0  0.206519  4.060716       1
1  0.526076  8.827921       1
2  0.605581  1.140197       2
3  0.974686  2.750091       2
4  0.102536  0.702613       2
5  0.238699  6.272802       2
6  0.821348  1.279084       3
7  0.470351  7.805090       3
8  0.191319  0.697717       4
9  0.922882  4.155508       4
# Define a weighted mean function to apply to each group
def my_fun(x, y):
    tmp = np.average(x, weights=y)
    return tmp

# Weighted mean of the whole population
total_mean = np.average(np.array(df["data"], dtype="float64"),
                        weights=np.array(df["weights"], dtype="float64"))

# Weighted mean of each group
group_means = df.groupby("groups").apply(lambda d: my_fun(d["data"], d["weights"]))

# Number of members in each group
counts = np.array([2, 4, 2, 2], dtype="float64")

# Total mean computed from the group means, weighted by the member count of each group
total_mean_from_group_means = np.average(np.array(group_means, dtype="float64"),
                                         weights=counts)
print(total_mean)
0.5070955626929458
print(total_mean_from_group_means)
0.5344436242465216
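As you can see, the total mean computed from the group means is not equal to the total mean computed directly. What am I doing wrong here?
Edit: fixed a typo in the code.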
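One possible explanation, offered as a sketch rather than a definitive answer: each group mean here is itself a weighted mean, so recombining the group means with member counts appears to be valid only when every row has the same weight. The identity that should hold for weighted means weights each group mean by that group's total weight instead. The snippet below checks this on the same data; the names `wx`, `group_weight`, `recombined` and `by_counts` are illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd

data = np.array([
    0.20651903, 0.52607571, 0.60558061, 0.97468593, 0.10253621, 0.23869854,
    0.82134792, 0.47035085, 0.19131938, 0.92288234
])
weights = np.array([
    4.06071562, 8.82792146, 1.14019687, 2.7500913, 0.70261312, 6.27280216,
    1.27908358, 7.80508994, 0.69771745, 4.15550846
])
groups = np.array([1, 1, 2, 2, 2, 2, 3, 3, 4, 4])
df = pd.DataFrame({"data": data, "weights": weights, "groups": groups})

# Weighted mean of all rows (step 1 of the question)
total_mean = np.average(df["data"].to_numpy(), weights=df["weights"].to_numpy())

# Per-group weighted mean, computed as sum(w * x) / sum(w) within each group
grouped = df.assign(wx=df["data"] * df["weights"]).groupby("groups")
group_weight = grouped["weights"].sum()          # total weight per group
group_means = grouped["wx"].sum() / group_weight

# Recombine the group means using each group's total weight (illustrative fix)
recombined = np.average(group_means.to_numpy(), weights=group_weight.to_numpy())

# Recombine using member counts, as in the question, for comparison
by_counts = np.average(group_means.to_numpy(), weights=grouped.size().to_numpy())

print(total_mean, recombined, by_counts)
print(np.isclose(recombined, total_mean))
```

If this reasoning is right, `recombined` should agree with `total_mean` up to floating-point error, while `by_counts` reproduces the mismatched value from the question.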
[Comments]:
Tags: numpy pandas-groupby mean