【Question title】: Combining DataFrames based on column labels and a DatetimeIndex
【Posted】: 2015-08-09 23:20:40
【Question description】:

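I have weather data stored in many separate files, where each column corresponds to a particular measuring instrument and each row holds the average reading for a particular day. Suppose one file looks like this: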

import numpy as np
import pandas as pd

first = pd.DataFrame(np.random.random((10,3)), 
                     pd.date_range('1950-01-01', periods=10), 
                     columns=['A','B','C'])

first
Out[21]: 
                   A         B         C
1950-01-01  0.939932  0.504543  0.091025
1950-01-02  0.121418  0.725333  0.444813
1950-01-03  0.338385  0.783398  0.116468
1950-01-04  0.847905  0.846147  0.226074
1950-01-05  0.156315  0.704804  0.524886
1950-01-06  0.412284  0.425379  0.427246
1950-01-07  0.165859  0.406347  0.114586
1950-01-08  0.392670  0.789526  0.174001
1950-01-09  0.246180  0.776304  0.019368
1950-01-10  0.142213  0.731748  0.954076

And a second file looks like this:

second = pd.DataFrame(np.random.random((10,3)), 
                      pd.date_range('1950-01-11', periods=10), 
                      columns=['A','B','D'])



second
Out[30]: 
                   A         B         D
1950-01-11  0.190767  0.905640  0.325411
1950-01-12  0.109964  0.754694  0.414402
1950-01-13  0.058164  0.305405  0.768333
1950-01-14  0.267644  0.919876  0.631083
1950-01-15  0.981333  0.454678  0.533075
1950-01-16  0.831600  0.823845  0.980366
1950-01-17  0.303585  0.091634  0.338517
1950-01-18  0.723445  0.088020  0.570779
1950-01-19  0.639665  0.954577  0.763810
1950-01-20  0.370629  0.716066  0.628383

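I want to merge the two so that all instruments (i.e. A, B, C, D, ...) appear in the same file covering every measurement period. The expected result looks like this: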

                   A         B         C         D
1950-01-01  0.939932  0.504543  0.091025
1950-01-02  0.121418  0.725333  0.444813
1950-01-03  0.338385  0.783398  0.116468
1950-01-04  0.847905  0.846147  0.226074
1950-01-05  0.156315  0.704804  0.524886
1950-01-06  0.412284  0.425379  0.427246
1950-01-07  0.165859  0.406347  0.114586
1950-01-08  0.392670  0.789526  0.174001
1950-01-09  0.246180  0.776304  0.019368
1950-01-10  0.142213  0.731748  0.954076
1950-01-11  0.190767  0.905640           0.325411
1950-01-12  0.109964  0.754694           0.414402
1950-01-13  0.058164  0.305405           0.768333
1950-01-14  0.267644  0.919876           0.631083
1950-01-15  0.981333  0.454678           0.533075
1950-01-16  0.831600  0.823845           0.980366
1950-01-17  0.303585  0.091634           0.338517
1950-01-18  0.723445  0.088020           0.570779
1950-01-19  0.639665  0.954577           0.763810
1950-01-20  0.370629  0.716066           0.628383

To get this, I have tried:

first.merge(second, how='outer', left_index=True, right_index=True)
Out[34]: 
                 A_x       B_x         C       A_y       B_y         D
1950-01-01  0.939932  0.504543  0.091025       NaN       NaN       NaN
1950-01-02  0.121418  0.725333  0.444813       NaN       NaN       NaN
1950-01-03  0.338385  0.783398  0.116468       NaN       NaN       NaN
1950-01-04  0.847905  0.846147  0.226074       NaN       NaN       NaN
1950-01-05  0.156315  0.704804  0.524886       NaN       NaN       NaN
1950-01-06  0.412284  0.425379  0.427246       NaN       NaN       NaN
1950-01-07  0.165859  0.406347  0.114586       NaN       NaN       NaN
1950-01-08  0.392670  0.789526  0.174001       NaN       NaN       NaN
1950-01-09  0.246180  0.776304  0.019368       NaN       NaN       NaN
1950-01-10  0.142213  0.731748  0.954076       NaN       NaN       NaN
1950-01-11       NaN       NaN       NaN  0.190767  0.905640  0.325411
1950-01-12       NaN       NaN       NaN  0.109964  0.754694  0.414402
1950-01-13       NaN       NaN       NaN  0.058164  0.305405  0.768333
1950-01-14       NaN       NaN       NaN  0.267644  0.919876  0.631083
1950-01-15       NaN       NaN       NaN  0.981333  0.454678  0.533075
1950-01-16       NaN       NaN       NaN  0.831600  0.823845  0.980366
1950-01-17       NaN       NaN       NaN  0.303585  0.091634  0.338517
1950-01-18       NaN       NaN       NaN  0.723445  0.088020  0.570779
1950-01-19       NaN       NaN       NaN  0.639665  0.954577  0.763810
1950-01-20       NaN       NaN       NaN  0.370629  0.716066  0.628383

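But as you can see, the columns that should have been combined are split, because there are no common row indices. I feel this functionality would be a really useful addition to pandas. Is this possible?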
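For reference, the split into A_x/A_y seems to come from how merge treats same-named columns that are not join keys (it keeps both and appends its default suffixes), rather than from the missing common index. A minimal sketch with small seeded stand-ins for the two frames (the 3-row sizes and the seed are assumptions made for illustration):

```python
import numpy as np
import pandas as pd

# Small stand-ins for the question's frames (seeded so the run is repeatable).
rng = np.random.default_rng(0)
first = pd.DataFrame(rng.random((3, 3)),
                     index=pd.date_range('1950-01-01', periods=3),
                     columns=['A', 'B', 'C'])
second = pd.DataFrame(rng.random((3, 3)),
                      index=pd.date_range('1950-01-04', periods=3),
                      columns=['A', 'B', 'D'])

# merge joins rows on the index; columns with the same name in both frames
# are kept separately and disambiguated with the default suffixes _x and _y.
merged = first.merge(second, how='outer', left_index=True, right_index=True)
merged_columns = list(merged.columns)
print(merged_columns)
```

So merge is the right tool when A in one file and A in the other are different quantities; here they are the same instrument, which is why the column-aligning approaches in the answers fit better.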

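【Comments】:

  • first.combine_first(second)? Though this could overwrite one dataframe with the other, which may or may not be a problem. Or first.append(second)? The right answer probably depends on whether there can be overlap and, if so, how you want to handle it.
  • I think combine_first is actually exactly what I wanted. It gives precedence to the calling frame, but my data doesn't have any overlap anyway. It seems that combine is the more flexible version, since it lets you pass a function to handle the conflicting cases.
  • Yes, that seems like a good characterization. With no overlap, I think append/concatenate would be the more common and straightforward approach, but I don't see anything wrong with using combine/combine_first.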
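A quick way to check the point made in these comments is to build two non-overlapping frames and compare the results. This is only a sketch with seeded 3-row stand-ins (sizes and seed are assumptions); it uses pd.concat in place of first.append(second), because DataFrame.append was removed in pandas 2.0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
first = pd.DataFrame(rng.random((3, 3)),
                     index=pd.date_range('1950-01-01', periods=3),
                     columns=['A', 'B', 'C'])
second = pd.DataFrame(rng.random((3, 3)),
                      index=pd.date_range('1950-01-04', periods=3),
                      columns=['A', 'B', 'D'])

# combine_first keeps the caller's values and fills the gaps from `second`.
via_combine_first = first.combine_first(second)

# concat stacks the rows and aligns columns by label (outer join on columns).
via_concat = pd.concat([first, second])

# With no overlapping dates, both give the same labels and the same values.
same_labels = (via_concat.index.equals(via_combine_first.index)
               and list(via_concat.columns) == list(via_combine_first.columns))
same_values = np.allclose(via_concat.to_numpy(),
                          via_combine_first.to_numpy(),
                          equal_nan=True)
print(same_labels, same_values)
```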

Tags: python pandas merge split-apply-combine


【Solution 1】:

An alternative is to use the .combine function, which changes the shape of the result to the union over both axes.

combiner = lambda x, y: np.where(pd.isnull(x), y, x)
first.combine(second, combiner)

                 A       B       C       D
1950-01-01  0.7917  0.5289  0.5680     NaN
1950-01-02  0.9256  0.0710  0.0871     NaN
1950-01-03  0.0202  0.8326  0.7782     NaN
1950-01-04  0.8700  0.9786  0.7992     NaN
1950-01-05  0.4615  0.7805  0.1183     NaN
1950-01-06  0.6399  0.1434  0.9447     NaN
1950-01-07  0.5218  0.4147  0.2646     NaN
1950-01-08  0.7742  0.4562  0.5684     NaN
1950-01-09  0.0188  0.6176  0.6121     NaN
1950-01-10  0.6169  0.9437  0.6818     NaN
1950-01-11  0.3595  0.4370     NaN  0.6976
1950-01-12  0.0602  0.6668     NaN  0.6706
1950-01-13  0.2104  0.1289     NaN  0.3154
1950-01-14  0.3637  0.5702     NaN  0.4386
1950-01-15  0.9884  0.1020     NaN  0.2089
1950-01-16  0.1613  0.6531     NaN  0.2533
1950-01-17  0.4663  0.2444     NaN  0.1590
1950-01-18  0.1104  0.6563     NaN  0.1382
1950-01-19  0.1966  0.3687     NaN  0.8210
1950-01-20  0.0971  0.8379     NaN  0.0961

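【Comments】:

  • Is this any different from combine_first?
  • Interesting, I had never seen this function before... so it replaces the missing values in the first dataframe with the corresponding values from the second dataframe?
  • Also worth noting is combine_first. For large frames its implementation looks to be much faster, and it should do the same thing as the function you are using.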
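The equivalence suggested in these comments can be checked directly. The sketch below reuses the answer's np.where combiner under the hypothetical name prefer_left and compares it with combine_first on small seeded frames (sizes and seed are assumptions); it does not measure the speed difference mentioned above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
first = pd.DataFrame(rng.random((3, 3)),
                     index=pd.date_range('1950-01-01', periods=3),
                     columns=['A', 'B', 'C'])
second = pd.DataFrame(rng.random((3, 3)),
                      index=pd.date_range('1950-01-04', periods=3),
                      columns=['A', 'B', 'D'])

def prefer_left(x, y):
    # Column-wise: take the value from x unless it is missing, then take y.
    return np.where(pd.isnull(x), y, x)

via_combine = first.combine(second, prefer_left)
via_combine_first = first.combine_first(second)

# Both results cover the union of rows and columns and hold the same values.
same_labels = (via_combine.index.equals(via_combine_first.index)
               and list(via_combine.columns) == list(via_combine_first.columns))
same_values = np.allclose(via_combine.to_numpy(),
                          via_combine_first.to_numpy(),
                          equal_nan=True)
print(same_labels, same_values)
```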
【Solution 2】:

Assuming first is df1 and second is df2, using concat seems to solve your problem.

>>> pd.concat([df1, df2])
                   A         B         C         D
1950-01-01  0.939932  0.504543  0.091025       NaN
1950-01-02  0.121418  0.725333  0.444813       NaN
1950-01-03  0.338385  0.783398  0.116468       NaN
1950-01-04  0.847905  0.846147  0.226074       NaN
1950-01-05  0.156315  0.704804  0.524886       NaN
1950-01-06  0.412284  0.425379  0.427246       NaN
1950-01-07  0.165859  0.406347  0.114586       NaN
1950-01-08  0.392670  0.789526  0.174001       NaN
1950-01-09  0.246180  0.776304  0.019368       NaN
1950-01-10  0.142213  0.731748  0.954076       NaN
1950-01-11  0.190767  0.905640       NaN  0.325411
1950-01-12  0.109964  0.754694       NaN  0.414402
1950-01-13  0.058164  0.305405       NaN  0.768333
1950-01-14  0.267644  0.919876       NaN  0.631083
1950-01-15  0.981333  0.454678       NaN  0.533075
1950-01-16  0.831600  0.823845       NaN  0.980366
1950-01-17  0.303585  0.091634       NaN  0.338517
1950-01-18  0.723445  0.088020       NaN  0.570779
1950-01-19  0.639665  0.954577       NaN  0.763810
1950-01-20  0.370629  0.716066       NaN  0.628383

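【Comments】:

  • I guess concatenation works in this simple case, but when I tried it earlier on my actual data I ran into a problem where one dataframe is not contained in the other and they have different lengths... I will have to update the question to cover that case.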
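The comment does not say exactly what went wrong. Assuming the issue was frames of different lengths whose dates partly overlap, the sketch below (sizes, dates, and seed are made up for illustration) shows that concat then leaves duplicated dates in the index, while combine_first produces one row per date; collapsing the concatenated frame with groupby(level=0).first() is one alternative way to get the same result.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Assumed scenario: 5 days in the first file, 3 days in the second,
# overlapping on 1950-01-04 and 1950-01-05.
first = pd.DataFrame(rng.random((5, 3)),
                     index=pd.date_range('1950-01-01', periods=5),
                     columns=['A', 'B', 'C'])
second = pd.DataFrame(rng.random((3, 3)),
                      index=pd.date_range('1950-01-04', periods=3),
                      columns=['A', 'B', 'D'])

# concat keeps every row, so the overlapping dates appear twice.
stacked = pd.concat([first, second])
n_duplicated = int(stacked.index.duplicated().sum())

# combine_first aligns on the index: one row per date, values from `first`
# take precedence, gaps are filled from `second`.
combined = first.combine_first(second)

# Alternative: collapse the duplicates, keeping the first non-null value
# per column for each date.
collapsed = stacked.groupby(level=0).first()

print(n_duplicated, combined.index.is_unique, len(combined))
```

Which frame should win on overlapping dates is a choice; here first takes precedence in both variants because it is the caller of combine_first and comes first in the concat list.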