【问题标题】:Pandas - join two DataFrames by multiple index and columnsPandas - 通过多个索引和列连接两个 DataFrame
【发布时间】:2021-06-20 15:41:49
【问题描述】:

我有两个 Pandas DataFrame 对象,需要通过多个索引和列连接。

DF1 每日数据(索引为 RNK、R_ID、纬度和经度):

                                Date        FFDI
RNK R_ID latitude   longitude               
1   0   -39.20000   140.80000   1973-04-02  5.40000
    1   -39.20000   140.83786   1973-04-02  5.40000
    2   -39.20000   140.87572   1973-04-02  5.40000
    3   -39.20000   140.91359   1973-04-02  5.40000
    4   -39.20000   140.95145   1973-04-02  5.40000
    5   -39.20000   140.98930   1973-04-02  5.40000
    6   -39.20000   141.02716   1973-04-02  5.40000
    7   -39.20000   141.06502   1973-05-31  5.40000
    8   -39.20000   141.10289   1973-05-31  5.50000
    9   -39.20000   141.14075   1973-05-31  6.00000
    10  -39.20000   141.17860   1973-05-31  6.40000
    11  -39.20000   141.21646   1973-05-31  6.80000
    12  -39.20000   141.25432   1973-05-31  7.70000
    13  -39.20000   141.29219   1973-05-31  7.90000
    14  -39.20000   141.33005   1973-05-31  7.00000
    15  -39.20000   141.36790   1973-05-31  6.60000
    16  -39.20000   141.40576   1973-05-31  6.10000
    17  -39.20000   141.44362   1973-05-31  5.00000
    18  -39.20000   141.48149   1973-05-31  4.40000
    19  -39.20000   141.51935   1972-04-21  4.40000
    20  -39.20000   141.55721   1972-04-21  4.40000
    21  -39.20000   141.59506   1972-04-21  4.50000
    22  -39.20000   141.63292   1972-04-21  4.60000
    23  -39.20000   141.67079   1972-04-21  4.70000
    24  -39.20000   141.70865   1972-04-21  4.70000
    25  -39.20000   141.74651   1972-04-21  4.70000
    26  -39.20000   141.78436   1972-04-21  4.70000
    27  -39.20000   141.82222   1972-04-21  4.70000
    28  -39.20000   141.86009   1972-04-21  4.70000
    29  -39.20000   141.89795   1972-04-21  4.70000
... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ...
5   36082   -33.90000   148.90205   1972-12-24  35.70000
    36083   -33.90000   148.93991   1974-11-12  36.30000
    36084   -33.90000   148.97778   1974-11-12  35.90000
    36085   -33.90000   149.01564   1973-11-20  36.80000
    36086   -33.90000   149.05350   1973-11-20  37.00000
    36087   -33.90000   149.09135   1974-11-12  35.60000
    36088   -33.90000   149.12921   1973-01-03  35.90000
    36089   -33.90000   149.16708   1973-01-03  34.40000
    36090   -33.90000   149.20494   1973-01-03  32.90000
    36091   -33.90000   149.24280   1973-01-03  32.20000
    36092   -33.90000   149.28065   1973-01-03  32.30000
    36093   -33.90000   149.31851   1973-01-03  32.20000
    36094   -33.90000   149.35638   1973-01-03  30.20000
    36095   -33.90000   149.39424   1973-11-20  28.60000
    36096   -33.90000   149.43210   1973-11-20  28.70000
    36097   -33.90000   149.46996   1973-11-20  29.10000
    36098   -33.90000   149.50781   1973-11-20  30.10000
    36099   -33.90000   149.54568   1973-11-20  30.80000
    36100   -33.90000   149.58354   1973-01-09  30.60000
    36101   -33.90000   149.62140   1973-01-09  30.10000
    36102   -33.90000   149.65926   1973-01-09  29.50000
    36103   -33.90000   149.69711   1973-01-09  29.20000
    36104   -33.90000   149.73499   1973-01-09  29.90000
    36105   -33.90000   149.77284   1973-01-09  29.90000
    36106   -33.90000   149.81070   1973-01-09  27.60000
    36107   -33.90000   149.84856   1973-01-09  24.40000
    36108   -33.90000   149.88641   1973-01-09  23.80000
    36109   -33.90000   149.92429   1973-01-09  23.80000
    36110   -33.90000   149.96214   1973-01-09  24.10000
    36111   -33.90000   150.00000   1973-01-09  25.30000

DF2 每小时数据(索引 = R_ID):

     latitude   longitude    time                T_SFC
R_ID                
0   -39.20000   140.80000   1972-01-20 00:00:00 15.80000
0   -39.20000   140.80000   1972-01-20 01:00:00 15.90000
0   -39.20000   140.80000   1972-01-20 02:00:00 16.00000
0   -39.20000   140.80000   1972-01-20 03:00:00 16.20000
0   -39.20000   140.80000   1972-01-20 04:00:00 16.60000
0   -39.20000   140.80000   1972-01-20 05:00:00 16.60000
0   -39.20000   140.80000   1972-01-20 06:00:00 16.50000
0   -39.20000   140.80000   1972-01-20 07:00:00 16.50000
0   -39.20000   140.80000   1972-01-20 08:00:00 16.50000
0   -39.20000   140.80000   1972-01-20 09:00:00 16.40000
0   -39.20000   140.80000   1972-01-20 10:00:00 16.40000
0   -39.20000   140.80000   1972-01-20 11:00:00 16.40000
0   -39.20000   140.80000   1972-01-20 12:00:00 16.50000
0   -39.20000   140.80000   1972-01-20 13:00:00 16.60000
0   -39.20000   140.80000   1972-01-20 14:00:00 16.60000
0   -39.20000   140.80000   1972-01-20 15:00:00 16.70000
0   -39.20000   140.80000   1972-01-20 16:00:00 16.70000
0   -39.20000   140.80000   1972-01-20 17:00:00 16.60000
0   -39.20000   140.80000   1972-01-20 18:00:00 16.60000
0   -39.20000   140.80000   1972-01-20 19:00:00 16.60000
0   -39.20000   140.80000   1972-01-20 20:00:00 16.50000
0   -39.20000   140.80000   1972-01-20 21:00:00 16.50000
0   -39.20000   140.80000   1972-01-20 22:00:00 16.50000
0   -39.20000   140.80000   1972-01-20 23:00:00 16.40000
0   -39.20000   140.80000   1972-01-21 00:00:00 16.40000
0   -39.20000   140.80000   1972-01-21 01:00:00 16.30000
0   -39.20000   140.80000   1972-01-21 02:00:00 16.30000
0   -39.20000   140.80000   1972-01-21 03:00:00 16.30000
0   -39.20000   140.80000   1972-01-21 04:00:00 16.10000
0   -39.20000   140.80000   1972-01-21 05:00:00 16.00000
... ... ... ... ...
36111   -38.87551   141.14075   1974-12-30 18:00:00 14.10000
36111   -38.87551   141.14075   1974-12-30 19:00:00 14.10000
36111   -38.87551   141.14075   1974-12-30 20:00:00 14.10000
36111   -38.87551   141.14075   1974-12-30 21:00:00 14.10000
36111   -38.87551   141.14075   1974-12-30 22:00:00 14.20000
36111   -38.87551   141.14075   1974-12-30 23:00:00 14.30000
36111   -38.87551   141.14075   1974-12-31 00:00:00 14.40000
36111   -38.87551   141.14075   1974-12-31 01:00:00 14.50000
36111   -38.87551   141.14075   1974-12-31 02:00:00 14.50000
36111   -38.87551   141.14075   1974-12-31 03:00:00 14.50000
36111   -38.87551   141.14075   1974-12-31 04:00:00 14.50000
36111   -38.87551   141.14075   1974-12-31 05:00:00 14.50000
36111   -38.87551   141.14075   1974-12-31 06:00:00 14.60000
36111   -38.87551   141.14075   1974-12-31 07:00:00 14.50000
36111   -38.87551   141.14075   1974-12-31 08:00:00 14.30000
36111   -38.87551   141.14075   1974-12-31 09:00:00 14.40000
36111   -38.87551   141.14075   1974-12-31 10:00:00 14.30000
36111   -38.87551   141.14075   1974-12-31 11:00:00 14.30000
36111   -38.87551   141.14075   1974-12-31 12:00:00 14.40000
36111   -38.87551   141.14075   1974-12-31 13:00:00 14.50000
36111   -38.87551   141.14075   1974-12-31 14:00:00 14.40000
36111   -38.87551   141.14075   1974-12-31 15:00:00 14.30000
36111   -38.87551   141.14075   1974-12-31 16:00:00 14.30000
36111   -38.87551   141.14075   1974-12-31 17:00:00 14.30000
36111   -38.87551   141.14075   1974-12-31 18:00:00 14.30000
36111   -38.87551   141.14075   1974-12-31 19:00:00 14.40000
36111   -38.87551   141.14075   1974-12-31 20:00:00 14.50000
36111   -38.87551   141.14075   1974-12-31 21:00:00 14.60000
36111   -38.87551   141.14075   1974-12-31 22:00:00 14.70000
36111   -38.87551   141.14075   1974-12-31 23:00:00 14.80000

DF1 有一个 Date 列,其每日值从 1972-01-20 到 1974-12-31,而 DF2 有一个 Time 列,其每小时值从 1972-01-20T00:00:00 到 1974-12 -31T23:00:00。 DF1 按 RNK(Rank)和 FFDI 排序,DF2 按 R_ID 和时间排序。一个 R_ID 是对应于一对唯一纬度和经度的参考 ID。 DF2 将与 DF1 以相同的 R_ID 和相同的日期连接到 DF2 的时间列所属的日期。也就是说,DF1 中的每一行(天)将有来自DF2 的 24(小时)行,并且天的值相同。

输出 df 将如下所示:

                                                              time                   T_SFC
RNK  R_ID       latitude    longitude   Date        FFDI
1    0          -39.20000   140.80000   1973-04-02  5.40000   1973-04-02 00:00:00    13.8
                                                              1973-04-02 01:00:00    13.9
                                                              1973-04-02 02:00:00    13.0
                                                              1973-04-02 03:00:00    13.2
                                                              1973-04-02 04:00:00    13.6
                                                              ... ... ... ...
     1          -39.20000   140.83786   1973-04-02  5.40000   1973-04-02 00:00:00    13.8
                                                              1973-04-02 01:00:00    13.9
                                                              1973-04-02 02:00:00    13.0
                                                              1973-04-02 03:00:00    13.2
                                                              1973-04-02 04:00:00    13.6
                                                              ... ... ... ...
     2          -39.20000   140.87572   1973-04-02  5.40000   1973-04-02 00:00:00    13.8
                                                              1973-04-02 01:00:00    13.9
                                                              1973-04-02 02:00:00    13.0
                                                              1973-04-02 03:00:00    13.2
                                                              1973-04-02 04:00:00    13.6
                                                              ... ... ... ...
     ... ... ... ...
2    0          -39.20000   140.80000   1974-03-07  5.60000   1974-03-07 00:00:00    15.8
                                                              1974-03-07 01:00:00    15.9
                                                              1974-03-07 02:00:00    16.0
                                                              1974-03-07 03:00:00    16.2
                                                              1974-03-07 04:00:00    16.6
                                                              ... ... ... ...
     1          -39.20000   140.83786   1973-03-09  5.40000   1973-03-09 00:00:00    15.8
                                                              1973-03-09 01:00:00    15.9
                                                              1973-03-09 02:00:00    16.0
                                                              1973-03-09 03:00:00    15.2
                                                              1973-03-09 04:00:00    15.6
                                                              ... ... ... ...
... ... ... ...
... ... ... ...
5    36082     -33.90000    148.90205   1972-12-24  35.70000  1972-12-24 00:00:00    19.8
                                                              1972-12-24 01:00:00    19.1
                                                              1972-12-24 02:00:00    22.0
                                                              1972-12-24 03:00:00    24.2
                                                              1972-12-24 04:00:00    21.6
                                                              ... ... ... ...
     ... ... ... ...
     36111     -33.90000    150.00000   1973-01-09  25.30000  1973-01-09 00:00:00    19.8
                                                              1973-01-09 01:00:00    19.1
                                                              1973-01-09 02:00:00    22.0
                                                              1973-01-09 03:00:00    24.2
                                                              1973-01-09 04:00:00    21.6
                                                              ... ... ... ...
                                                              1973-01-09 23:00:00    19.1
4,333,440 rows x 2 columns

按照@politinsa 的回答,我试过了

# Add a new column Date and save date part of the time column to it.
df2['Date'] = df2['time'].dt.date.astype('datetime64[ns]')

df_joined = pd.merge(df1, df2, on=['REF_ID', 'Date'], how='inner')

输出的问题是 df1 的多索引未保留,输出 df 中缺少 RNK。

print(df_joined)

        time    FFDI         latitude   longitude   T_SFC       time_original
REF_ID                      
0   1973-04-02  5.40000     -39.20000   140.80000   16.40000    1973-04-02 00:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   16.00000    1973-04-02 01:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.70000    1973-04-02 02:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.40000    1973-04-02 03:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.20000    1973-04-02 04:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.10000    1973-04-02 05:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.10000    1973-04-02 06:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.10000    1973-04-02 07:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.10000    1973-04-02 08:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.10000    1973-04-02 09:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.10000    1973-04-02 10:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.20000    1973-04-02 11:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.20000    1973-04-02 12:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.20000    1973-04-02 13:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.00000    1973-04-02 14:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.10000    1973-04-02 15:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.30000    1973-04-02 16:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.40000    1973-04-02 17:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.40000    1973-04-02 18:00:00

... ... ... ...
12000 rows × 6 columns

【问题讨论】:

    标签: python pandas numpy python-xarray


    【解决方案1】:

    您可以在 DF2 中创建一个包含日期(而不是日期时间)的列,即在行 1973-04-02 01:00:00 您将有一个包含 1973-04-02 的列 Date

    然后使用经典的内部连接 ​​(pd.merge(df1, df2, on=['R_ID', 'Date'], how='inner')),它应该可以解决问题。

    【讨论】:

    • 谢谢@politinsa。我试过你的代码。我还在合并部分之前添加了df2['Date'] = df2['time'].dt.date.astype('datetime64[ns]')。但是,输出数据帧似乎有 RANK 消失。 DF2 有 MultiIndex,但输出似乎失去了这个。
    猜你喜欢
    • 2018-03-16
    • 2020-12-04
    • 2019-11-13
    • 2014-01-09
    • 2019-06-29
    • 2023-01-13
    • 2022-01-12
    • 1970-01-01
    • 2017-06-20
    相关资源
    最近更新 更多