【Posted】: 2015-08-09 23:20:40
【Question】:
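I have weather data stored in many separate files, where each column holds the readings of a particular measuring instrument and each row is the average reading for a particular date. Say one file looks like this:
import numpy as np
import pandas as pd

first = pd.DataFrame(np.random.random((10, 3)),
                     index=pd.date_range('1950-01-01', periods=10),
                     columns=['A', 'B', 'C'])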
first
Out[21]:
A B C
1950-01-01 0.939932 0.504543 0.091025
1950-01-02 0.121418 0.725333 0.444813
1950-01-03 0.338385 0.783398 0.116468
1950-01-04 0.847905 0.846147 0.226074
1950-01-05 0.156315 0.704804 0.524886
1950-01-06 0.412284 0.425379 0.427246
1950-01-07 0.165859 0.406347 0.114586
1950-01-08 0.392670 0.789526 0.174001
1950-01-09 0.246180 0.776304 0.019368
1950-01-10 0.142213 0.731748 0.954076
And a second file looks like this:
second = pd.DataFrame(np.random.random((10, 3)),
                      index=pd.date_range('1950-01-11', periods=10),
                      columns=['A', 'B', 'D'])
second
Out[30]:
A B D
1950-01-11 0.190767 0.905640 0.325411
1950-01-12 0.109964 0.754694 0.414402
1950-01-13 0.058164 0.305405 0.768333
1950-01-14 0.267644 0.919876 0.631083
1950-01-15 0.981333 0.454678 0.533075
1950-01-16 0.831600 0.823845 0.980366
1950-01-17 0.303585 0.091634 0.338517
1950-01-18 0.723445 0.088020 0.570779
1950-01-19 0.639665 0.954577 0.763810
1950-01-20 0.370629 0.716066 0.628383
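I want to merge the two so that all instruments (i.e. A, B, C, D, ...) appear in a single file covering every measurement period. The expected result looks like this: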
                   A         B         C         D
1950-01-01  0.939932  0.504543  0.091025       NaN
1950-01-02  0.121418  0.725333  0.444813       NaN
1950-01-03  0.338385  0.783398  0.116468       NaN
1950-01-04  0.847905  0.846147  0.226074       NaN
1950-01-05  0.156315  0.704804  0.524886       NaN
1950-01-06  0.412284  0.425379  0.427246       NaN
1950-01-07  0.165859  0.406347  0.114586       NaN
1950-01-08  0.392670  0.789526  0.174001       NaN
1950-01-09  0.246180  0.776304  0.019368       NaN
1950-01-10  0.142213  0.731748  0.954076       NaN
1950-01-11  0.190767  0.905640       NaN  0.325411
1950-01-12  0.109964  0.754694       NaN  0.414402
1950-01-13  0.058164  0.305405       NaN  0.768333
1950-01-14  0.267644  0.919876       NaN  0.631083
1950-01-15  0.981333  0.454678       NaN  0.533075
1950-01-16  0.831600  0.823845       NaN  0.980366
1950-01-17  0.303585  0.091634       NaN  0.338517
1950-01-18  0.723445  0.088020       NaN  0.570779
1950-01-19  0.639665  0.954577       NaN  0.763810
1950-01-20  0.370629  0.716066       NaN  0.628383
To get this I have tried:
first.merge(second, how='outer', left_index=True, right_index=True)
Out[34]:
A_x B_x C A_y B_y D
1950-01-01 0.939932 0.504543 0.091025 NaN NaN NaN
1950-01-02 0.121418 0.725333 0.444813 NaN NaN NaN
1950-01-03 0.338385 0.783398 0.116468 NaN NaN NaN
1950-01-04 0.847905 0.846147 0.226074 NaN NaN NaN
1950-01-05 0.156315 0.704804 0.524886 NaN NaN NaN
1950-01-06 0.412284 0.425379 0.427246 NaN NaN NaN
1950-01-07 0.165859 0.406347 0.114586 NaN NaN NaN
1950-01-08 0.392670 0.789526 0.174001 NaN NaN NaN
1950-01-09 0.246180 0.776304 0.019368 NaN NaN NaN
1950-01-10 0.142213 0.731748 0.954076 NaN NaN NaN
1950-01-11 NaN NaN NaN 0.190767 0.905640 0.325411
1950-01-12 NaN NaN NaN 0.109964 0.754694 0.414402
1950-01-13 NaN NaN NaN 0.058164 0.305405 0.768333
1950-01-14 NaN NaN NaN 0.267644 0.919876 0.631083
1950-01-15 NaN NaN NaN 0.981333 0.454678 0.533075
1950-01-16 NaN NaN NaN 0.831600 0.823845 0.980366
1950-01-17 NaN NaN NaN 0.303585 0.091634 0.338517
1950-01-18 NaN NaN NaN 0.723445 0.088020 0.570779
1950-01-19 NaN NaN NaN 0.639665 0.954577 0.763810
1950-01-20 NaN NaN NaN 0.370629 0.716066 0.628383
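But as you can see, the columns that need to be merged have been split up because there is no common row index. I feel this functionality would be a very useful addition to pandas. Is this possible?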
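A minimal sketch of why the `_x`/`_y` columns appear, under the assumption that the two frames look like the ones above (the seeded random values and the index name `date` are illustrative, not from the original data). When `merge` joins on the index only, A and B are ordinary overlapping columns rather than keys, so pandas suffixes them. Listing them as join keys alongside the date keeps a single A and a single B:

```python
import numpy as np
import pandas as pd

# Stand-ins for the two files; seeded so the example is reproducible.
rng = np.random.default_rng(0)
first = pd.DataFrame(rng.random((10, 3)),
                     index=pd.date_range('1950-01-01', periods=10),
                     columns=['A', 'B', 'C'])
second = pd.DataFrame(rng.random((10, 3)),
                      index=pd.date_range('1950-01-11', periods=10),
                      columns=['A', 'B', 'D'])

# Joining on the index only: A and B are not keys, so they get suffixed.
by_index = first.merge(second, how='outer', left_index=True, right_index=True)
print(list(by_index.columns))

# Treating the date, A and B as joint keys keeps one A and one B column.
# 'date' is a hypothetical name given to the index for this example.
by_keys = (first.rename_axis('date').reset_index()
           .merge(second.rename_axis('date').reset_index(),
                  how='outer', on=['date', 'A', 'B'])
           .set_index('date'))
print(list(by_keys.columns))
```

This works here because the two date ranges do not overlap; with overlapping dates and differing A or B readings, the key-based merge would produce two rows for the same date.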
【Comments】:
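first.combine_first(second)? Although this may overwrite one dataframe with another, which may or may not be a problem. Or first.append(second)? The right answer probably depends on whether the data can overlap and, if so, how you want to handle it.
I think combine_first is actually exactly what I want. It gives precedence to the calling frame, but my data doesn't have any overlap. It seems combine is the more flexible version, since it accepts a function to handle conflicting cases.
Yes, that seems like a good characterization. Without overlap, I think append or concatenate would be the more common and direct approach, but I don't see anything wrong with using combine or combine_first.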
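The two approaches discussed in the comments can be sketched as follows, assuming frames shaped like the ones in the question (the seeded random values are stand-ins for the real files). `pd.concat` is used for the "append" route because `DataFrame.append` was removed in pandas 2.0:

```python
import numpy as np
import pandas as pd

# Stand-ins for the two files; seeded so the example is reproducible.
rng = np.random.default_rng(0)
first = pd.DataFrame(rng.random((10, 3)),
                     index=pd.date_range('1950-01-01', periods=10),
                     columns=['A', 'B', 'C'])
second = pd.DataFrame(rng.random((10, 3)),
                      index=pd.date_range('1950-01-11', periods=10),
                      columns=['A', 'B', 'D'])

# Option 1: combine_first keeps the caller's values and fills the
# missing rows and columns from `second`.
combined = first.combine_first(second)

# Option 2: concat stacks the rows and takes the union of the columns,
# leaving NaN where an instrument has no reading.
stacked = pd.concat([first, second], sort=False)

print(combined.shape)
print(list(stacked.columns))
```

With non-overlapping dates both calls give the same 20-row frame with columns A, B, C, D. If the dates did overlap, `combine_first` would keep the value from `first`, while `concat` would keep both rows under a duplicated index label.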
Tags: python pandas merge split-apply-combine