如何在熊猫数据框中创建其他行的平均值的行？答案

【问题标题】：How to create rows in a pandas dataframe that are averages of other rows?如何在熊猫数据框中创建其他行的平均值的行？
【发布时间】：2021-08-05 15:46:21
【问题描述】：

采用这样的数据框：

import pandas as pd
info = {'Year': [2010, 2010, 2010, 2010, 2015, 2015, 2015, 2015],
        'Country': ['USA', 'Mexico', 'Canada', 'China', 'USA', 'Mexico', 'Canada', 'China'],
        'AgeAvg': [40, 44, 45, 49, 45, 46, 50, 52],
        'HeightAvg': [68, 65, 67, 68, 69, 70, 64, 67]}
df = pd.DataFrame(data=info)
df

   Year Country  AgeAvg  HeightAvg
0  2010     USA      40         68
1  2010  Mexico      44         65
2  2010  Canada      45         67
3  2010   China      49         68
4  2015     USA      45         69
5  2015  Mexico      46         70
6  2015  Canada      50         64
7  2015   China      52         67

我想添加 2011、2012、2013 和 2014 年的行。这些行将遵循相同的国家/地区，并且具有变量的平滑平均值。例如，2011 年美国年龄将是 41 岁，2012 年美国年龄 42 岁，2013 年美国年龄 43 岁，2014 年美国年龄 44 岁。这样年龄将跨越 2010 年到 2015 年。我也想对所有变量（如身高这种情况），而不仅仅是年龄。有没有办法在 Python 中使用 Pandas 做到这一点？

【问题讨论】：

当您尝试将pandas interpolate rows 放入搜索引擎时发生了什么？另外，为什么不为每个国家/地区设置一个单独的 DataFrame？

标签： python pandas dataframe

【解决方案1】：

使用pd.MultiIndex.from_product 重新索引您的数据框并插入值：

mi = pd.MultiIndex.from_product([df['Country'].unique(),
                                 range(df.Year.min(), df.Year.max()+1)])

out = df.set_index(['Country', 'Year']).reindex(mi)
out = out.groupby(level=0).apply(lambda x: x.interpolate())

>>> out
             AgeAvg  HeightAvg
USA    2010    40.0       68.0
       2011    41.0       68.2
       2012    42.0       68.4
       2013    43.0       68.6
       2014    44.0       68.8
       2015    45.0       69.0
Mexico 2010    44.0       65.0
       2011    44.4       66.0
       2012    44.8       67.0
       2013    45.2       68.0
       2014    45.6       69.0
       2015    46.0       70.0
Canada 2010    45.0       67.0
       2011    46.0       66.4
       2012    47.0       65.8
       2013    48.0       65.2
       2014    49.0       64.6
       2015    50.0       64.0
China  2010    49.0       68.0
       2011    49.6       67.8
       2012    50.2       67.6
       2013    50.8       67.4
       2014    51.4       67.2
       2015    52.0       67.0

如果您更喜欢Year，您可以交换级别。

out = out.swaplevel().sort_index()

【讨论】：

生成组合的好方法

【解决方案2】：

生成所有年份的所有组合
merge() 拥有所有行
interpolate() 为每个国家/地区 (groupby())

pd.DataFrame(
    {
        "Year": range(df["Year"].min(), df["Year"].max()+1),
        "Country": [df["Country"].unique() for y in range(df["Year"].min(), df["Year"].max()+1)],
    }
).explode("Country").merge(df, on=["Year", "Country"], how="outer").groupby(
    "Country"
).apply(
    lambda d: d.interpolate()
)

	Year	Country	AgeAvg	HeightAvg
0	2010	USA	40	68
1	2010	Mexico	44	65
2	2010	Canada	45	67
3	2010	China	49	68
4	2011	USA	41	68.2
5	2011	Mexico	44.4	66
6	2011	Canada	46	66.4
7	2011	China	49.6	67.8
8	2012	USA	42	68.4
9	2012	Mexico	44.8	67
10	2012	Canada	47	65.8
11	2012	China	50.2	67.6
12	2013	USA	43	68.6
13	2013	Mexico	45.2	68
14	2013	Canada	48	65.2
15	2013	China	50.8	67.4
16	2014	USA	44	68.8
17	2014	Mexico	45.6	69
18	2014	Canada	49	64.6
19	2014	China	51.4	67.2
20	2015	USA	45	69
21	2015	Mexico	46	70
22	2015	Canada	50	64
23	2015	China	52	67

【讨论】：