Python, Pandas: Merging several dataframes results in duplication of rows with uneven NaN values

【Question title】: Python, Pandas: Merging several dataframes results in duplication of rows with uneven NaN values
【Posted】: 2017-06-07 21:23:07
【Question description】:

I have 4 dfs, as shown below

df1

                     _id        bs        ds        as        pf
0    2017-05-01 00:00:00  0.982218  0.906662  0.614119  0.999471
1    2017-05-01 00:05:00  0.983751  0.913266  0.585237  0.999571
2    2017-05-01 00:10:00  0.983012  0.914875  0.592698  0.999631
3    2017-05-01 00:15:00  0.981884  0.910922  0.589013  0.999536
4    2017-05-01 00:20:00  0.982611  0.913082  0.601056  0.999556
5    2017-05-01 00:25:00  0.982386  0.912358  0.598856  0.999650

df2

                    _id  avg_time_serve  
0   2017-05-01 00:00:00        0.520681            
1   2017-05-01 00:05:00        0.521580            
2   2017-05-01 00:10:00        0.517993            
3   2017-05-01 00:15:00        0.520662            
4   2017-05-01 00:20:00        0.514146            
5   2017-05-01 00:25:00        0.513723

df3

                    _id   total_distinct_ips    
0   2017-05-01 00:00:00             291094.0     
1   2017-05-01 00:05:00             287922.0     
2   2017-05-01 00:10:00             292103.0     
3   2017-05-01 00:15:00             295675.0     
4   2017-05-01 00:20:00             297813.0     
5   2017-05-01 00:25:00             302406.0

df4

                    _id  total_40x  total_50x
0   2017-05-01 00:00:00     162034          0
1   2017-05-01 00:05:00     162497          0
2   2017-05-01 00:10:00     161079          0
3   2017-05-01 00:15:00     163338          0
4   2017-05-01 00:20:00     167901          0
5   2017-05-01 00:25:00     164394          0

I am trying to combine them on the '_id' column. The '_id' column is in timestamp format.

I tried the following approaches:

**Approach 1**

from functools import reduce
import pandas as pd

dfs = [df1, df2, df3, df4]
final_df = reduce(lambda left, right: pd.merge(left, right, on='_id',
                                               how='outer'), dfs)

**Approach 2**
final_df = pd.DataFrame()

for df in dfs:
    if final_df.empty:
        final_df = df
    else:
        final_df = pd.merge(final_df, df, how='outer', on='_id')

Both approaches give the following result:

                    _id        bs        ds        as        pf  \
0   2017-05-01 00:00:00  0.982218  0.906662  0.614119  0.999471
1   2017-05-01 00:00:00       NaN       NaN       NaN       NaN
2   2017-05-01 00:05:00  0.983751  0.913266  0.585237  0.999571
3   2017-05-01 00:05:00       NaN       NaN       NaN       NaN
4   2017-05-01 00:10:00  0.983012  0.914875  0.592698  0.999631
5   2017-05-01 00:10:00       NaN       NaN       NaN       NaN

     avg_time_serve  total_distinct_ips  total_40x  total_50x
0               NaN            291094.0     162034          0
1          0.520681            291094.0     162034          0
2               NaN            287922.0     162497          0
3          0.521580            287922.0     162497          0
4               NaN            292103.0     161079          0
5          0.517993            292103.0     161079          0

**Approach 3**

I took 'df1' out of the dfs list and added a 'join'.

from functools import reduce

dfs = [df2, df3, df4]
final_df = reduce(lambda left,right: pd.merge(left, right, on='_id', 
           how='outer'), dfs)
final_df = final_df.join(df1.set_index('_id'), on='_id')

This finally gave the correct result:

                    _id  avg_time_serve  total_distinct_ips  total_40x 
0   2017-05-01 00:00:00        0.520681            291094.0     162034
1   2017-05-01 00:05:00        0.521580            287922.0     162497
2   2017-05-01 00:10:00        0.517993            292103.0     161079
3   2017-05-01 00:15:00        0.520662            295675.0     163338
4   2017-05-01 00:20:00        0.514146            297813.0     167901
5   2017-05-01 00:25:00        0.513723            302406.0     164394

     total_50x        bs        ds        as        pf
0            0  0.982218  0.906662  0.614119  0.999471
1            0  0.983751  0.913266  0.585237  0.999571
2            0  0.983012  0.914875  0.592698  0.999631
3            0  0.981884  0.910922  0.589013  0.999536
4            0  0.982611  0.913082  0.601056  0.999556
5            0  0.982386  0.912358  0.598856  0.999650

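Questions:

Shouldn't approaches #1 and #2 work for any number of dataframes being merged together?
Why do approaches 1 and 2 create duplicate '_id' values and insert NaN values?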
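
For the second question, one way to see which side each output row comes from is the `indicator` argument of `pd.merge`. The sketch below uses made-up two-row frames in which the right-hand keys are offset by 1 ms. That offset is an assumption chosen only to reproduce the symptom; it is not a confirmed cause of the problem above.

```python
import pandas as pd

# Hypothetical data: same timestamps on both sides, except that the
# right-hand keys are shifted by 1 ms (an assumption for illustration).
left = pd.DataFrame({
    "_id": pd.to_datetime(["2017-05-01 00:00:00", "2017-05-01 00:05:00"]),
    "bs": [0.982218, 0.983751],
})
right = pd.DataFrame({
    "_id": pd.to_datetime(["2017-05-01 00:00:00", "2017-05-01 00:05:00"])
    + pd.Timedelta(milliseconds=1),
    "avg_time_serve": [0.520681, 0.521580],
})

# Check the key dtypes first; differing dtypes are another thing to rule out.
print(left["_id"].dtype, right["_id"].dtype)

# indicator=True adds a "_merge" column: left_only, right_only or both.
checked = pd.merge(left, right, on="_id", how="outer", indicator=True)
print(checked)
print(checked["_merge"].value_counts())
```

If every row is reported as `left_only` or `right_only`, the keys of the two frames never compare equal, which is what produces one NaN-padded row per side for each timestamp.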

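【Question discussion】:

With your exact data I cannot reproduce the problem. No NaN rows appear with either the chained or the loop merge. What is your environment? Python/pandas versions? Also, does your actual data really look like what you posted?
I am using python 3.6.1 and pandas 0.20.2 in a conda environment. I retrieved all the data from mongoDB. I had to create my own datetime for df1 using the python datetime. I checked the date type of all the other dfs (df2-4), and each has a date type. So I converted the df1 datetime to that type as well, but approaches 1 and 2 still do not work.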
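
Since the comments point at how the '_id' values were built, a possible workaround is to coerce every key column to the same datetime type before merging. This is a minimal sketch, not a confirmed fix for the data above: the helper name `merge_on_datetime_key` and the sample frames are hypothetical, and it assumes the keys are timezone-naive.

```python
import datetime as dt
from functools import reduce

import pandas as pd


def merge_on_datetime_key(frames, key="_id"):
    """Outer-merge frames after coercing the key column to datetime64[ns].

    Hypothetical helper; assumes the key values are timezone-naive.
    """
    normalized = []
    for frame in frames:
        frame = frame.copy()  # leave the caller's frame untouched
        frame[key] = pd.to_datetime(frame[key]).astype("datetime64[ns]")
        normalized.append(frame)
    return reduce(
        lambda left, right: pd.merge(left, right, on=key, how="outer"),
        normalized,
    )


# Sample frames whose keys hold the same instants in three representations.
a = pd.DataFrame({
    "_id": pd.Series(
        [dt.datetime(2017, 5, 1, 0, 0), dt.datetime(2017, 5, 1, 0, 5)],
        dtype=object,  # Python datetime objects
    ),
    "bs": [0.982218, 0.983751],
})
b = pd.DataFrame({
    "_id": ["2017-05-01 00:00:00", "2017-05-01 00:05:00"],  # strings
    "avg_time_serve": [0.520681, 0.521580],
})
c = pd.DataFrame({
    "_id": pd.to_datetime(["2017-05-01 00:00:00", "2017-05-01 00:05:00"]),
    "total_40x": [162034, 162497],
})

merged = merge_on_datetime_key([a, b, c])
print(merged)
```

Comparing `merged["_id"].is_unique` before and after the coercion is a quick way to tell whether mismatched key types were the problem.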

Tags: python pandas

【Solution 1】:

You can also use pd.concat together with set_index

pd.concat([df1.set_index('_id'), df2.set_index('_id'), df3.set_index('_id'), df4.set_index('_id')], axis = 1).reset_index()
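
As a self-contained illustration of the one-liner above, here is a small sketch with three two-row frames. The values are taken from the first rows of the question's tables, and the list comprehension is just a shorter way of writing the repeated `set_index` calls.

```python
import pandas as pd

ids = pd.to_datetime(["2017-05-01 00:00:00", "2017-05-01 00:05:00"])

df1 = pd.DataFrame({"_id": ids, "bs": [0.982218, 0.983751]})
df2 = pd.DataFrame({"_id": ids, "avg_time_serve": [0.520681, 0.521580]})
df3 = pd.DataFrame({"_id": ids, "total_distinct_ips": [291094.0, 287922.0]})

frames = [df1, df2, df3]

# Align the frames side by side on the '_id' index, then restore '_id'
# as an ordinary column.
final_df = pd.concat(
    [frame.set_index("_id") for frame in frames], axis=1
).reset_index()
print(final_df)
```

Note that this aligns on index values just as a merge aligns on key values, so '_id' values that do not compare equal across frames would still end up on separate rows.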

【Discussion】: