【Question title】: Merge 2 dataframe columns into one by matching date
【Posted】: 2020-08-03 22:25:48
【Question】:

df

           Id   timestamp               data    Date
30424   30665   2020-01-04 19:40:23.827 17.5    2020-01-04
31054   31295   2020-01-05 22:26:39.860 17.0    2020-01-05
32150   32391   2020-01-06 23:00:14.607 18.0    2020-01-06
33236   33477   2020-01-07 22:52:56.757 18.0    2020-01-07
34314   34555   2020-01-08 20:45:48.927 18.0    2020-01-08
35592   35833   2020-01-09 20:56:21.320 18.0    2020-01-09
36528   36769   2020-01-10 20:41:36.323 19.5    2020-01-10
37054   37295   2020-01-11 19:35:50.553 18.5    2020-01-11
37652   37893   2020-01-12 19:28:22.823 17.0    2020-01-12
38828   39069   2020-01-13 23:48:12.533 21.5    2020-01-13
40004   40245   2020-01-14 22:50:56.873 18.5    2020-01-14

df1

    Date        data 
0   2020-01-04  NaN
1   2020-01-07  NaN
2   2020-01-08  19.0
3   2020-01-09  NaN
4   2020-01-11  NaN
5   2020-01-12  NaN
6   2020-01-16  NaN
7   2020-01-17  NaN
8   2020-01-24  18.5

I want to replace data in df with the value from df1['data'] wherever df1['data'] is not NaN for the matching Date.

Expected result:

        Id      timestamp               data    Date
30424   30665   2020-01-04 19:40:23.827 17.5    2020-01-04
31054   31295   2020-01-05 22:26:39.860 17.0    2020-01-05
32150   32391   2020-01-06 23:00:14.607 18.0    2020-01-06
33236   33477   2020-01-07 22:52:56.757 18.0    2020-01-07
34314   34555   2020-01-08 20:45:48.927 19.0    2020-01-08  # This row changed
35592   35833   2020-01-09 20:56:21.320 18.0    2020-01-09
36528   36769   2020-01-10 20:41:36.323 19.5    2020-01-10
37054   37295   2020-01-11 19:35:50.553 18.5    2020-01-11
37652   37893   2020-01-12 19:28:22.823 17.0    2020-01-12
38828   39069   2020-01-13 23:48:12.533 21.5    2020-01-13
40004   40245   2020-01-14 22:50:56.873 18.5    2020-01-14

This answer is similar to my question, but the situation is not exactly the same.

I tried:

pd.merge(df, df1, how='left', on='Date')

which returns:

       Id   timestamp               data_x  Date       data_y
0   30665   2020-01-04 19:40:23.827 17.5    2020-01-04  NaN
1   31295   2020-01-05 22:26:39.860 17.0    2020-01-05  NaN
2   32391   2020-01-06 23:00:14.607 18.0    2020-01-06  NaN
3   33477   2020-01-07 22:52:56.757 18.0    2020-01-07  NaN
4   34555   2020-01-08 20:45:48.927 18.0    2020-01-08  19.0
5   35833   2020-01-09 20:56:21.320 18.0    2020-01-09  NaN
6   36769   2020-01-10 20:41:36.323 19.5    2020-01-10  NaN
7   37295   2020-01-11 19:35:50.553 18.5    2020-01-11  NaN

Update:

I also tried:

df['data'] = df['Date'].map(df1.set_index('Date')['data']).fillna(df['Date'])

but the data column seems to be wrong:

          Id    timestamp               data            Date
30424   30665   2020-01-04 19:40:23.827 1.578096e+18    2020-01-04
31054   31295   2020-01-05 22:26:39.860 1.578182e+18    2020-01-05
32150   32391   2020-01-06 23:00:14.607 1.578269e+18    2020-01-06
33236   33477   2020-01-07 22:52:56.757 1.578355e+18    2020-01-07
34314   34555   2020-01-08 20:45:48.927 1.900000e+01    2020-01-08
35592   35833   2020-01-09 20:56:21.320 1.578528e+18    2020-01-09
36528   36769   2020-01-10 20:41:36.323 1.578614e+18    2020-01-10

【Question comments】:

    Tags: python pandas numpy dataframe merge


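    【Solution 1】:

    First use Series.map on the Date column to look up the values from df1. Dates with no match, or whose value in df1 is NaN, come back as missing values, which Series.fillna then replaces with the original data: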

    df['data'] = df['Date'].map(df1.set_index('Date')['data']).fillna(df['data'])
    print (df)
              Id                timestamp  data        Date
    30424  30665  2020-01-04 19:40:23.827  17.5  2020-01-04
    31054  31295  2020-01-05 22:26:39.860  17.0  2020-01-05
    32150  32391  2020-01-06 23:00:14.607  18.0  2020-01-06
    33236  33477  2020-01-07 22:52:56.757  18.0  2020-01-07
    34314  34555  2020-01-08 20:45:48.927  19.0  2020-01-08
    35592  35833  2020-01-09 20:56:21.320  18.0  2020-01-09
    36528  36769  2020-01-10 20:41:36.323  19.5  2020-01-10
    37054  37295  2020-01-11 19:35:50.553  18.5  2020-01-11
    37652  37893  2020-01-12 19:28:22.823  17.0  2020-01-12
    38828  39069  2020-01-13 23:48:12.533  21.5  2020-01-13
    40004  40245  2020-01-14 22:50:56.873  18.5  2020-01-14
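
For readers who want to run the approach end to end, here is a minimal, self-contained sketch. The two small frames are hypothetical stand-ins for df and df1 (only three rows each), and it assumes the Date columns are datetime64 and that each date appears at most once in df1.

```python
import pandas as pd

# Hypothetical miniature versions of df and df1 from the question.
df = pd.DataFrame(
    {
        "Id": [33477, 34555, 35833],
        "data": [18.0, 18.0, 18.0],
        "Date": pd.to_datetime(["2020-01-07", "2020-01-08", "2020-01-09"]),
    },
    index=[33236, 34314, 35592],
)
df1 = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2020-01-07", "2020-01-08", "2020-01-24"]),
        "data": [float("nan"), 19.0, 18.5],
    }
)

# Series keyed by Date; assumes the dates in df1 are unique.
lookup = df1.set_index("Date")["data"]

# Look up each df date in df1, then fall back to df's own value
# wherever the lookup produced NaN (no match, or NaN in df1).
df["data"] = df["Date"].map(lookup).fillna(df["data"])

print(df)
```

Unlike the merge attempt in the question, this keeps the original index of df and does not create data_x / data_y columns.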
    

    Details:

    print (df['Date'].map(df1.set_index('Date')['data']))
    30424     NaN
    31054     NaN
    32150     NaN
    33236     NaN
    34314    19.0
    35592     NaN
    36528     NaN
    37054     NaN
    37652     NaN
    38828     NaN
    40004     NaN
    Name: Date, dtype: float64
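
The role of set_index in this lookup can be seen on a tiny example. This is only a sketch: the two-row dates Series and the one-row df1 are made up for illustration, and the Date values are assumed to be datetime64.

```python
import pandas as pd

# Hypothetical two-row stand-in for df['Date'] and a one-row df1.
dates = pd.Series(pd.to_datetime(["2020-01-07", "2020-01-08"]), name="Date")
df1 = pd.DataFrame({"Date": pd.to_datetime(["2020-01-08"]), "data": [19.0]})

# Without set_index the lookup keys are the default index labels (0, 1, ...),
# so none of the dates can be found.
without_index = dates.map(df1["data"])

# With set_index the lookup keys are the dates themselves.
with_index = dates.map(df1.set_index("Date")["data"])

print(without_index)
print(with_index)
```

Only the second call finds 2020-01-08, because map treats the index of the Series passed to it as the lookup keys.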
    

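    【Discussion】:

    • Do you know what happened to the data column in the updated question?
    • @nilsinelabore - Sure, there is a typo: .fillna(df['Date']) needs to be .fillna(df['data'])
    • @nilsinelabore - The reason is that the datetimes were converted to their native numeric format, Unix format (like this), so the missing values were replaced by datetimes in that format
    • Why is it necessary to set Date as the index (df1.set_index('Date'))? Does this mean many functions are index-based?
    • @nilsinelabore It is needed so that values are matched the same way between the 2 DataFrames; here the Dates need to match. map uses the index of the Series passed to it as lookup keys, like the keys of a dictionary, and assigns the corresponding values. Without set_index, df1['data'] has the index 0, 1, 2, ... and because those values do not occur in the Date column, df['Date'].map(df1['data']) returns only NaN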
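
The Unix-format explanation given in the comments can be checked directly. The sketch below does not reproduce the fillna call itself; it only shows that the integer behind the first date in the question matches the 1.578096e+18 seen in the broken data column.

```python
import pandas as pd

ts = pd.Timestamp("2020-01-04")

# Timestamp.value is the number of nanoseconds since 1970-01-01.
ns_since_epoch = ts.value

print(ns_since_epoch)                # 1578096000000000000
print(f"{float(ns_since_epoch):e}")  # 1.578096e+18
```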