Pandas: Add missing years in time series data with duplicate years

【Question title】: Pandas: Add missing years in time series data with duplicate years
【Posted】: 2017-10-04 15:07:29
【Question】:

I have a dataset like this, in which some years of data are missing.

County Year Pop
12     1999 1.1
12     2001 1.2
13     1999 1.0
13     2000 1.1

I would like something like this:

County Year Pop
12     1999 1.1
12     2000 NaN
12     2001 1.2
13     1999 1.0
13     2000 1.1
13     2001 NaN

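I tried setting the index to Year and then using reindex with another dataframe that contains only the years (the approach mentioned in "Pandas: Add data for missing months"), but it gives me an error: cannot reindex with duplicate values. I also tried df.loc, but it has the same problem. I even tried a full outer join with a blank df containing only the years, but that didn't work either.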
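To make the failure concrete, here is a minimal sketch that reproduces the error using the sample data from the question. The variable name `error_message` is only illustrative, and the exact wording of the message differs between pandas versions.

```python
import pandas as pd

# Sample data from the question: Year repeats across counties.
df = pd.DataFrame({
    "County": [12, 12, 13, 13],
    "Year": [1999, 2001, 1999, 2000],
    "Pop": [1.1, 1.2, 1.0, 1.1],
})

try:
    # Year alone is not a unique index (1999 appears twice),
    # so pandas cannot decide which row each new label maps to.
    df.set_index("Year").reindex(range(1999, 2002))
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The index has to identify each row uniquely before reindex can work, which is why Year by itself is not enough here.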

How can I solve this?

【Comments】:

Tags: python pandas time-series missing-data reindex

【Answer 1】:

Make a MultiIndex so that you don't have duplicates:

df.set_index(['County', 'Year'], inplace=True)

Then construct a full MultiIndex with all the combinations:

index = pd.MultiIndex.from_product(df.index.levels)

Then reindex:

df.reindex(index)

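The MultiIndex construction is untested and may need some tweaking (for example, if a year is entirely absent from every county), but I think you get the idea.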
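Put together, the three steps could look like the sketch below. It uses the sample data from the question and avoids `inplace=True` so the original frame stays untouched; the names `indexed`, `full_index` and `result` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "County": [12, 12, 13, 13],
    "Year": [1999, 2001, 1999, 2000],
    "Pop": [1.1, 1.2, 1.0, 1.1],
})

# Step 1: (County, Year) pairs are unique, so they make a valid index.
indexed = df.set_index(["County", "Year"])

# Step 2: every combination of the counties and years that occur in the data.
full_index = pd.MultiIndex.from_product(
    indexed.index.levels, names=indexed.index.names
)

# Step 3: reindex; combinations with no data get NaN.
result = indexed.reindex(full_index).reset_index()
print(result)
```

Because the levels only contain years that appear somewhere in the data, a year that is missing from every county will not be added by this version.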

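【Comments】:

I'm using this one!

【Answer 2】:

My working assumption is that you want to add every year between the minimum and maximum years. It could be that 2000 is missing for both counties 12 and 13.

I'll construct a pd.MultiIndex with from_product, using the unique values of the 'County' column and all integer years between the minimum and maximum of the 'Year' column.

Note: this solution fills in all missing years, even ones that do not currently appear anywhere in the data.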

mux = pd.MultiIndex.from_product([
        df.County.unique(),
        range(df.Year.min(), df.Year.max() + 1)
    ], names=['County', 'Year'])

df.set_index(['County', 'Year']).reindex(mux).reset_index()

   County  Year  Pop
0      12  1999  1.1
1      12  2000  NaN
2      12  2001  1.2
3      13  1999  1.0
4      13  2000  1.1
5      13  2001  NaN
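The case this answer guards against can be sketched with a variation of the sample data in which 2000 is missing for both counties. The frame `sparse` is made up for illustration and is not from the question.

```python
import pandas as pd

# Hypothetical variation: year 2000 is absent for every county.
sparse = pd.DataFrame({
    "County": [12, 12, 13, 13],
    "Year": [1999, 2001, 1999, 2001],
    "Pop": [1.1, 1.2, 1.0, 1.1],
})

# Build the year range from min to max rather than from the observed years.
mux = pd.MultiIndex.from_product(
    [sparse["County"].unique(),
     range(sparse["Year"].min(), sparse["Year"].max() + 1)],
    names=["County", "Year"],
)

filled = sparse.set_index(["County", "Year"]).reindex(mux).reset_index()
print(filled)
```

Here 2000 shows up for both counties with NaN in Pop, even though no row in the input mentions that year.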

【Comments】:

【Answer 3】:

You can use pivot_table:

In [11]: df.pivot_table(values="Pop", index="County", columns="Year")
Out[11]:
Year    1999  2000  2001
County
12       1.1   NaN   1.2
13       1.0   1.1   NaN

and then stack the result (the output is a Series):

In [12]: df.pivot_table(values="Pop", index="County", columns="Year").stack(dropna=False)
Out[12]:
County  Year
12      1999    1.1
        2000    NaN
        2001    1.2
13      1999    1.0
        2000    1.1
        2001    NaN
dtype: float64
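If `stack(dropna=False)` warns or behaves differently on your pandas version, the same wide-to-long step can be sketched with melt instead. This is a swapped-in alternative, not the method from the answer, and `wide` and `long_df` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "County": [12, 12, 13, 13],
    "Year": [1999, 2001, 1999, 2000],
    "Pop": [1.1, 1.2, 1.0, 1.1],
})

# Wide table: one row per county, one column per year, NaN where data is missing.
wide = df.pivot_table(values="Pop", index="County", columns="Year")

# Back to long form; melt keeps the NaN cells instead of dropping them.
long_df = (
    wide.reset_index()
        .melt(id_vars="County", var_name="Year", value_name="Pop")
)
long_df["Year"] = long_df["Year"].astype(int)
long_df = long_df.sort_values(["County", "Year"]).reset_index(drop=True)
print(long_df)
```

As with the stack version, only years that occur somewhere in the data become columns, so a year missing from every county is not added.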

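【Comments】:

Hi Andy! I don't think I've answered alongside you on a question before :-)
@piRSquared That surely can't be right!

【Answer 4】:

Or you can try some black magic :P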

min_year, max_year = df.Year.min(), df.Year.max()

df.groupby('County').apply(lambda g: g.set_index("Year").reindex(range(min_year, max_year+1))).drop("County", axis=1).reset_index()
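For readers who find the one-liner hard to follow, here is a sketch of the same idea written as an explicit loop: each county is reindexed over the full year range separately and the pieces are concatenated. The names `pieces`, `filled` and `result` are illustrative, and this is an unrolled equivalent rather than the answer's exact code.

```python
import pandas as pd

df = pd.DataFrame({
    "County": [12, 12, 13, 13],
    "Year": [1999, 2001, 1999, 2000],
    "Pop": [1.1, 1.2, 1.0, 1.1],
})

min_year, max_year = df["Year"].min(), df["Year"].max()

pieces = []
for county, group in df.groupby("County"):
    # Within one county the years are unique, so reindexing on Year works.
    filled = (
        group.set_index("Year")[["Pop"]]
             .reindex(range(min_year, max_year + 1))
             .rename_axis("Year")
    )
    # Put the county label back on every row, including the new NaN rows.
    filled.insert(0, "County", county)
    pieces.append(filled.reset_index())

result = pd.concat(pieces, ignore_index=True)[["County", "Year", "Pop"]]
print(result)
```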

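【Comments】:

【Answer 5】:

You mentioned that you tried joining with a blank df; that approach can actually work.

Setup:

import numpy as np
import pandas as pd

df = pd.DataFrame({'County': {0: 12, 1: 12, 2: 13, 3: 13},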
 'Pop': {0: 1.1, 1: 1.2, 2: 1.0, 3: 1.1},
 'Year': {0: 1999, 1: 2001, 2: 1999, 3: 2000}})

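Solution:

# Create a new blank df with all the required Years for each County.
# pd.tools.util.cartesian_product was removed from later pandas releases,
# so MultiIndex.from_product is used here to build the same grid.
df_2 = pd.MultiIndex.from_product([df.County.unique(), np.arange(1999, 2002)], names=['County', 'Year']).to_frame(index=False)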

#Left join the new dataframe to the existing dataframe to populate the Pop values.
pd.merge(df_2,df,on=['Year','County'],how='left')
Out[73]: 
   County  Year  Pop
0      12  1999  1.1
1      12  2000  NaN
2      12  2001  1.2
3      13  1999  1.0
4      13  2000  1.1
5      13  2001  NaN

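【Comments】:

Thanks a lot, I hadn't included the counties in my blank df. I see my mistake now... thanks!

【Answer 6】:

Here is a function inspired by the accepted answer, but for the case where the time variable starts and stops at different points for different group ids. The only difference from the accepted answer is that I build the MultiIndex manually.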

def fill_gaps_in_panel(df, group_col, year_col):
    """
    Fills the gaps in a panel by constructing an index
    based on the group col and the sequence of years between min-year
    and max-year for each group id.
    """
    index_group = []
    index_time = []
    for group in df[group_col].unique():
        _min = df.loc[df[group_col]==group, year_col].min()
        _max = df.loc[df[group_col]==group, year_col].max() + 1
        index_group.extend([group for t in range(_min, _max)])
        index_time.extend([t for t in range(_min, _max)])
    multi_index = pd.MultiIndex.from_arrays(
        [index_group, index_time], names=(group_col, year_col))
    # Reindex a copy so the caller's DataFrame is not modified in place.
    return df.set_index([group_col, year_col]).reindex(multi_index)
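The per-group idea can also be sketched more compactly with a groupby aggregation. The example below is self-contained and uses a made-up frame `panel` in which the two counties cover different year spans; the names `spans`, `pairs` and `filled` are illustrative as well.

```python
import pandas as pd

# Hypothetical data: county 12 spans 1999-2002, county 13 spans 2000-2001.
panel = pd.DataFrame({
    "County": [12, 12, 13, 13],
    "Year": [1999, 2002, 2000, 2001],
    "Pop": [1.1, 1.2, 1.0, 1.1],
})

# First and last observed year for each county.
spans = panel.groupby("County")["Year"].agg(["min", "max"])

# One (county, year) pair for every year inside that county's own span.
pairs = [
    (county, year)
    for county, first, last in spans.itertuples()
    for year in range(first, last + 1)
]
multi_index = pd.MultiIndex.from_tuples(pairs, names=["County", "Year"])

filled = panel.set_index(["County", "Year"]).reindex(multi_index).reset_index()
print(filled)
```

County 12 gains rows for 2000 and 2001, while county 13 is left unchanged because it has no gaps inside its own span.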

【Comments】: