Extracting a Pandas MultiIndex from a DataFrame with NaT: Answers

【Question title】: Extracting Pandas multiindex from dataframe with NaT
【Posted】: 2016-06-07 17:30:19
【Question】:

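I am using pandas to parse an Excel spreadsheet. The spreadsheet has several worksheets, each of which looks like the one shown below. Note that every value column has its own date column, and the columns have different lengths:

For whatever reason, when pandas parses the spreadsheet, the first worksheet has its first date column parsed as the index (even though the index_col argument is set to None). That is still manageable.

In the other worksheets, however, it parses the index as a MultiIndex: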

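What I ultimately want is to rebuild the DataFrames so that they all share one common date index, with NaN filled in for any date that has no value. However, I cannot seem to extract the dates from the MultiIndex to get started.

I tried calling reset_index() on levels 0 and 1 of the DataFrame, but it fails with IndexError: cannot do a non-empty take from an empty axes. I also tried unstack(), which fails with ValueError: Index contains duplicate entries, cannot reshape.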
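
For reference, here is a minimal sketch of reading the dates out of a MultiIndex that contains NaT, using MultiIndex.get_level_values. The frame is a hypothetical stand-in: the level names date1 and date2 and all values are invented for illustration, and it has not been checked against the spreadsheet described above, so it may not reproduce the errors quoted.

```python
import pandas as pd

# Hypothetical two-level date index; the first level is shorter and padded with NaT.
dates_a = pd.to_datetime(['1999-01-01', '1999-01-02', None])
dates_b = pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03'])
index = pd.MultiIndex.from_arrays([dates_a, dates_b], names=['date1', 'date2'])
df = pd.DataFrame({'Column_val2': [5, 7, 8]}, index=index)

# Read each level as a flat index, by position or by name.
level0 = df.index.get_level_values(0)
level1 = df.index.get_level_values('date2')

# Re-index a copy of the frame by a single level.
flat = df.copy()
flat.index = level1
print(flat)
```

Here level0 keeps the NaT entry, so it can be filtered afterwards with level0.notna() if only real dates are wanted.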

【Comments】:

Tags: python excel pandas dataframe multi-index

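【Answer 1】:

I think you can use read_excel with the parameters parse_cols (named usecols in current pandas versions), header and index_col. Then select each date/value pair with iloc, re-index it by its own date column, and finally concat the results: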

import pandas as pd

# usecols was called parse_cols in older pandas versions
df = pd.read_excel('f_name.xlsx', usecols=[0, 1, 3, 4, 7, 8], index_col=0, header=0)
# optional: replace NaT in the index (not necessary)
# df.index = df.index.to_series().fillna(0)
print(df)
            Column_val1 Unnamed: 1  Column_val2 Unnamed: 3  Column_val3
1999-01-01            4 2000-01-01            5 2000-01-01            5
1999-01-02            1 2000-01-02            7 2000-01-02            7
1999-01-03            2 2000-01-03            8 2000-01-03            8
1999-01-04            3 2000-01-04            3 2000-01-04            3
1999-01-05            3 2000-01-05            6 2000-01-05            6
1999-01-06            3 2000-01-06            9 2000-01-06            9
1999-01-07            4 2000-01-07            1 2000-01-07            1
1999-01-08            6 2000-01-08            5 2000-01-08            5
1999-01-09            8 2000-01-09            2 2000-01-09            2
1999-01-10            2 2000-01-10            3 2000-01-10            3
1999-01-11            4 2000-01-11           47 2000-01-11           47
1999-01-12            5 2000-01-12            2 2000-01-12            2
NaT                 NaN 2000-01-13            8 2000-01-13            8
NaT                 NaN 2000-01-14            2 2000-01-14            2
NaT                 NaN 2000-01-15           87 2000-01-15           87
NaT                 NaN 2000-01-16            6 2000-01-16            6
NaT                 NaN 2000-01-17           89 2000-01-17           89
NaT                 NaN        NaT          NaN 2000-01-18            7
NaT                 NaN        NaT          NaN 2000-01-19            8

print(df['Column_val1'])
1999-01-01     4
1999-01-02     1
1999-01-03     2
1999-01-04     3
1999-01-05     3
1999-01-06     3
1999-01-07     4
1999-01-08     6
1999-01-09     8
1999-01-10     2
1999-01-11     4
1999-01-12     5
NaT          NaN
NaT          NaN
NaT          NaN
NaT          NaN
NaT          NaN
NaT          NaN
NaT          NaN
Name: Column_val1, dtype: float64

print(df.set_index(df.iloc[:, 1])['Column_val2'])
Unnamed: 1
2000-01-01     5
2000-01-02     7
2000-01-03     8
2000-01-04     3
2000-01-05     6
2000-01-06     9
2000-01-07     1
2000-01-08     5
2000-01-09     2
2000-01-10     3
2000-01-11    47
2000-01-12     2
2000-01-13     8
2000-01-14     2
2000-01-15    87
2000-01-16     6
2000-01-17    89
NaT          NaN
NaT          NaN
Name: Column_val2, dtype: float64

print(df.set_index(df.iloc[:, 3])['Column_val3'])
Unnamed: 3
2000-01-01     5
2000-01-02     7
2000-01-03     8
2000-01-04     3
2000-01-05     6
2000-01-06     9
2000-01-07     1
2000-01-08     5
2000-01-09     2
2000-01-10     3
2000-01-11    47
2000-01-12     2
2000-01-13     8
2000-01-14     2
2000-01-15    87
2000-01-16     6
2000-01-17    89
2000-01-18     7
2000-01-19     8
Name: Column_val3, dtype: int64

df = pd.concat([df['Column_val1'], 
                df.set_index(df.iloc[:, 1])['Column_val2'], 
                df.set_index(df.iloc[:, 3])['Column_val3'] ])

# sorting the index is recommended
df = df.sort_index()
print(df)
NaT          NaN
NaT          NaN
NaT          NaN
NaT          NaN
NaT          NaN
NaT          NaN
NaT          NaN
NaT          NaN
NaT          NaN
1999-01-01     4
1999-01-02     1
1999-01-03     2
1999-01-04     3
1999-01-05     3
1999-01-06     3
1999-01-07     4
1999-01-08     6
1999-01-09     8
1999-01-10     2
1999-01-11     4
1999-01-12     5
2000-01-01     5
2000-01-01     5
2000-01-02     7
2000-01-02     7
2000-01-03     8
2000-01-03     8
2000-01-04     3
2000-01-04     3
2000-01-05     6
2000-01-05     6
2000-01-06     9
2000-01-06     9
2000-01-07     1
2000-01-07     1
2000-01-08     5
2000-01-08     5
2000-01-09     2
2000-01-09     2
2000-01-10     3
2000-01-10     3
2000-01-11    47
2000-01-11    47
2000-01-12     2
2000-01-12     2
2000-01-13     8
2000-01-13     8
2000-01-14     2
2000-01-14     2
2000-01-15    87
2000-01-15    87
2000-01-16     6
2000-01-16     6
2000-01-17    89
2000-01-17    89
2000-01-18     7
2000-01-19     8
dtype: float64

# if you need to remove rows where the index is NaT
print(df[pd.notnull(df.index)])
1999-01-01     4
1999-01-02     1
1999-01-03     2
1999-01-04     3
1999-01-05     3
1999-01-06     3
1999-01-07     4
1999-01-08     6
1999-01-09     8
1999-01-10     2
1999-01-11     4
1999-01-12     5
2000-01-01     5
2000-01-01     5
2000-01-02     7
2000-01-02     7
2000-01-03     8
2000-01-03     8
2000-01-04     3
2000-01-04     3
2000-01-05     6
2000-01-05     6
2000-01-06     9
2000-01-06     9
2000-01-07     1
2000-01-07     1
2000-01-08     5
2000-01-08     5
2000-01-09     2
2000-01-09     2
2000-01-10     3
2000-01-10     3
2000-01-11    47
2000-01-11    47
2000-01-12     2
2000-01-12     2
2000-01-13     8
2000-01-13     8
2000-01-14     2
2000-01-14     2
2000-01-15    87
2000-01-15    87
2000-01-16     6
2000-01-16     6
2000-01-17    89
2000-01-17    89
2000-01-18     7
2000-01-19     8
dtype: float64
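
The result above stacks the three Series into one long Series, so dates that occur in more than one pair appear twice. If the goal is a single DataFrame with one column per value and a shared date index, one possible variation is to drop the NaT rows first and concatenate along axis=1. This is only a sketch: s1, s2, s3 and the helper drop_nat are hypothetical stand-ins for the three re-indexed Series shown earlier, with shortened made-up data.

```python
import pandas as pd

# Hypothetical stand-ins for the three date/value pairs (shortened data).
s1 = pd.Series([4.0, 1.0],
               index=pd.to_datetime(['1999-01-01', '1999-01-02']),
               name='Column_val1')
s2 = pd.Series([5.0, 7.0, None],
               index=pd.to_datetime(['2000-01-01', '2000-01-02', None]),
               name='Column_val2')
s3 = pd.Series([5, 7, 8],
               index=pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03']),
               name='Column_val3')

def drop_nat(s):
    """Keep only the rows whose index entry is a real date (hypothetical helper)."""
    return s[s.index.notna()]

# axis=1 aligns the Series on the union of their dates;
# dates missing from a given column are filled with NaN.
aligned = pd.concat([drop_nat(s) for s in (s1, s2, s3)], axis=1).sort_index()
print(aligned)
```

Dropping NaT before concatenating matters here: with repeated NaT labels left in the index, the column-wise alignment would have duplicate labels to match up.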

【Comments】: