【问题标题】:Cleaning Excel File Using Pandas使用 Pandas 清理 Excel 文件
【发布时间】:2021-11-08 23:39:21
【问题描述】:

我有一个使用 Pandas 读取的 excel 文件,输出如下:

+------------+-----+-----+-----+
| Type       | 1   | 2   | 3   |
| Category   | A   | A   | C   |
| Dates      | NaN | NaN | NaN |
| 01/01/2021 | 12  | 12  | 9   |
| 02/01/2021 | 10  | 10  | 2   |
| 03/01/2021 | 30  | 16  | NaN |
| 04/01/2021 | 15  | 23  | 4   |
| 05/01/2021 | 14  | 20  | 5   |
+------------+-----+-----+-----+

前两行提供每个时间序列列的信息。所以对于column 1Type1CategoryA。考虑到工作表的结构,我想融合时间序列,但不太确定如何解决问题。

预期输出:

+------------+-------+----------+------+
|   Dates    | Price | Category | Type |
+------------+-------+----------+------+
| 01/01/2021 |    12 | A        |    1 |
| 02/01/2021 |    10 | A        |    1 |
| 03/01/2021 |    30 | A        |    1 |
| 04/01/2021 |    15 | A        |    1 |
| 05/01/2021 |    14 | A        |    1 |
| 01/01/2021 |    12 | B        |    2 |
| 02/01/2021 |    10 | B        |    2 |
| 03/01/2021 |    16 | B        |    2 |
| 04/01/2021 |    23 | B        |    2 |
| 05/01/2021 |    20 | B        |    2 |
| 01/01/2021 |     9 | C        |    3 |
| 02/01/2021 |     2 | C        |    3 |
| 04/01/2021 |     4 | C        |    3 |
| 05/01/2021 |     5 | C        |    3 |
+------------+-------+----------+------+

如果是Type 3Category C,因为值为NaN,我们会删除该日期。怎样才能达到预期的输出?

【问题讨论】:

    标签: python pandas dataframe melt


    【解决方案1】:

    另一种解决方案:

    df = df.T
    df.columns = df.loc["col0"]
    df = df.iloc[1:]
    
    print(
        df.melt(["Type", "Category", "Dates"])
        .drop(columns="Dates")
        .rename(columns={"col0": "Dates", "value": "Price"})
        .sort_values(by=["Type", "Category", "Dates"])
        .dropna()
        .reset_index(drop=True)
    )
    

    打印:

       Type Category       Dates Price
    0     1        A  01/01/2021    12
    1     1        A  02/01/2021    10
    2     1        A  03/01/2021    30
    3     1        A  04/01/2021    15
    4     1        A  05/01/2021    14
    5     2        B  01/01/2021    12
    6     2        B  02/01/2021    10
    7     2        B  03/01/2021    16
    8     2        B  04/01/2021    23
    9     2        B  05/01/2021    20
    10    3        C  01/01/2021     9
    11    3        C  02/01/2021     2
    12    3        C  04/01/2021     4
    13    3        C  05/01/2021     5
    

    df 已使用:

             col0 col1 col2 col3
    0        Type    1    2    3
    1    Category    A    B    C
    2       Dates  NaN  NaN  NaN
    3  01/01/2021   12   12    9
    4  02/01/2021   10   10    2
    5  03/01/2021   30   16  NaN
    6  04/01/2021   15   23    4
    7  05/01/2021   14   20    5
    

    【讨论】:

      【解决方案2】:

      假设以下输入数据帧:

               col0 col1 col2 col3
      0        Type    1    2    3
      1    Category    A    A    C
      2       Dates  NaN  NaN  NaN
      3  01/01/2021   12   12    9
      4  02/01/2021   10   10    2
      5  03/01/2021   30   16  NaN
      6  04/01/2021   15   23    4
      7  05/01/2021   14   20    5
      

      这是一个工作管道:

      (df.iloc[3:]
         .set_index('col0').rename_axis('Date') # set first column aside
         # next 3 lines to rename columns index
         .T
         .set_index(pd.MultiIndex.from_arrays(df.iloc[:2, 1:].values, names=df.iloc[:2, 0]))
         .T
         .stack(level=[0,1]) # columns to rows
         .rename('Price')    # rename last unnamed column
         .reset_index()      # all indexes back to columns
      )
      

      输出:

                Date Type Category Price
      0   01/01/2021    1        A    12
      1   01/01/2021    2        A    12
      2   01/01/2021    3        C     9
      3   02/01/2021    1        A    10
      4   02/01/2021    2        A    10
      5   02/01/2021    3        C     2
      6   03/01/2021    1        A    30
      7   03/01/2021    2        A    16
      8   04/01/2021    1        A    15
      9   04/01/2021    2        A    23
      10  04/01/2021    3        C     4
      11  05/01/2021    1        A    14
      12  05/01/2021    2        A    20
      13  05/01/2021    3        C     5
      

      【讨论】:

        猜你喜欢
        • 2017-07-03
        • 1970-01-01
        • 2014-03-27
        • 2021-12-24
        • 2022-06-10
        • 2021-04-24
        • 2021-07-01
        • 2021-08-14
        • 1970-01-01
        相关资源
        最近更新 更多