pandas merge on date-like columns does not work: answers

【Question title】: pandas merge on date like columns does not work
【Posted】: 2017-12-29 17:38:06
【Question】:

First of all, the solution from "pandas merge on date column issue" does not work in my code.

I have two DataFrames built from MySQL query results, and both have a 'captureDate' column. In the MySQL table the column type is DATE. In the DataFrames the dtype is object.

df1['captureDate'] data:

0    2017-06-28
1    2017-06-28
2    2017-06-28
3    2017-06-28
4    2017-06-28
5    2017-06-28
6    2017-06-28
Name: captureDate, dtype: object

df2['captureDate'] data:

0    2017-06-28
1    2017-06-28
2    2017-06-28
3    2017-06-28
4    2017-06-28
5    2017-06-28
6    2017-06-28
Name: captureDate, dtype: object

When I compare the column in df1 with the one in df2, the result is True:

print(df1['captureDate'].equals(df2['captureDate']))

My merge code:

inner = pd.merge(df1, df2,  on='captureDate', how='inner')

However, the result is wrong: it returns 49 rows. Output of inner.info():

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49 entries, 0 to 48
Data columns (total 20 columns):
rule_id_x          49 non-null int64
monitor_sites_x    49 non-null object
rule_type_x        49 non-null int64
lower_limit_x      49 non-null int64
upper_limit_x      49 non-null int64
actual_x           49 non-null int64
predict_x          49 non-null int64
captureDate        49 non-null object
deviation_x        49 non-null float32
create_time_x      49 non-null int64
actual_y           49 non-null int64
create_time_y      49 non-null int64
deviation_y        49 non-null object
id                 49 non-null int64
lower_limit_y      49 non-null int64
monitor_sites_y    49 non-null object
predict_y          49 non-null int64
rule_id_y          49 non-null object
rule_type_y        49 non-null int64
upper_limit_y      49 non-null int64

So why does this happen, and how can I deal with it?

【Comments】:

The problem is the duplicates. The simplest solution is to remove them. Is that possible?

Tags: mysql python-2.7 pandas datetime merge

【Solution 1】:

Sample:

df1 = pd.DataFrame({'captureDate':['2017-06-22'] *3 +['2017-06-25'] * 3 +['2017-06-28'] * 2,
                   'rule_id':[40,10,20,30,70,10,60,10]})
print (df1)
  captureDate  rule_id
0  2017-06-22       40
1  2017-06-22       10
2  2017-06-22       20
3  2017-06-25       30
4  2017-06-25       70
5  2017-06-25       10
6  2017-06-28       60
7  2017-06-28       10
df2 = pd.DataFrame({'captureDate':['2017-06-22'] *3 +['2017-06-25'] * 3 +['2017-06-28'] * 2,
                   'rule_id':[1,2,3,4,5,6,7,8]})
print (df2)
  captureDate  rule_id
0  2017-06-22        1
1  2017-06-22        2
2  2017-06-22        3
3  2017-06-25        4
4  2017-06-25        5
5  2017-06-25        6
6  2017-06-28        7
7  2017-06-28        8

First convert the column to datetime with to_datetime:

df1['captureDate'] = pd.to_datetime(df1['captureDate'])
df2['captureDate']  = pd.to_datetime(df2['captureDate'])
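
As a side note on why the column shows up as dtype object: a minimal sketch, assuming the MySQL driver returns DATE values as Python datetime.date objects (common, but not confirmed by the question). The frame name raw and its contents are hypothetical.

```python
import datetime
import pandas as pd

# Hypothetical stand-in for a MySQL result: DATE values as datetime.date objects
raw = pd.DataFrame({'captureDate': [datetime.date(2017, 6, 28)] * 3})
print(raw['captureDate'].dtype)          # object
print(type(raw['captureDate'].iloc[0]))  # <class 'datetime.date'>

# to_datetime turns the object column into a real datetime column
converted = pd.to_datetime(raw['captureDate'])
print(pd.api.types.is_datetime64_any_dtype(converted))  # True
```

The conversion makes the dtype consistent, but it does not change how many rows share the same date, so on its own it does not affect the number of rows a merge returns.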

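The problem is the duplicates in both columns. For each date, merge pairs every matching row of df1 with every matching row of df2: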

print (df1['captureDate'].equals(df2['captureDate']))
True

inner = pd.merge(df1, df2,  on='captureDate', how='inner')
print (inner)
   captureDate  rule_id_x  rule_id_y
0   2017-06-22         40          1
1   2017-06-22         40          2
2   2017-06-22         40          3
3   2017-06-22         10          1
4   2017-06-22         10          2
5   2017-06-22         10          3
6   2017-06-22         20          1
7   2017-06-22         20          2
8   2017-06-22         20          3
9   2017-06-25         30          4
10  2017-06-25         30          5
11  2017-06-25         30          6
12  2017-06-25         70          4
13  2017-06-25         70          5
14  2017-06-25         70          6
15  2017-06-25         10          4
16  2017-06-25         10          5
17  2017-06-25         10          6
18  2017-06-28         60          7
19  2017-06-28         60          8
20  2017-06-28         10          7
21  2017-06-28         10          8
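
The row count of such a merge can be predicted from the per-key counts: each key contributes (rows in df1) x (rows in df2). The sketch below rebuilds the situation from the question, where all 7 rows in each frame carry the same date, so 7 x 7 = 49. The columns actual and predict only mimic names seen in the question's output; their values are made up.

```python
import pandas as pd

# Hypothetical frames: 7 rows each, all with the same captureDate
df1 = pd.DataFrame({'captureDate': ['2017-06-28'] * 7, 'actual': range(7)})
df2 = pd.DataFrame({'captureDate': ['2017-06-28'] * 7, 'predict': range(7)})

# Rows per key in each frame
c1 = df1['captureDate'].value_counts()
c2 = df2['captureDate'].value_counts()

# Keys present in only one frame give NaN after alignment and are dropped,
# which matches what an inner merge does with them
expected_rows = int((c1 * c2).dropna().sum())

inner = pd.merge(df1, df2, on='captureDate', how='inner')
print(expected_rows, len(inner))  # 49 49
```

The same arithmetic applied to the sample above gives 3*3 + 3*3 + 2*2 = 22 rows, which matches the merged output.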

Possible solutions

Use concat with set_index, then flatten the resulting MultiIndex columns with map and join:

df3 = pd.concat([df1.set_index('captureDate'), 
                 df2.set_index('captureDate')], 
                 axis=1, 
                 keys=('a', 'b'))
df3.columns = df3.columns.map('_'.join)
print (df3)
             a_rule_id  b_rule_id
captureDate                      
2017-06-22          40          1
2017-06-22          10          2
2017-06-22          20          3
2017-06-25          30          4
2017-06-25          70          5
2017-06-25          10          6
2017-06-28          60          7
2017-06-28          10          8

Or remove the duplicates with drop_duplicates, or aggregate the data in each df by captureDate, before merging:

df1 = df1.drop_duplicates('captureDate')
df2 = df2.drop_duplicates('captureDate')
print (df1)
  captureDate  rule_id
0  2017-06-22       40
3  2017-06-25       30
6  2017-06-28       60

print (df2)
  captureDate  rule_id
0  2017-06-22        1
3  2017-06-25        4
6  2017-06-28        7

inner = pd.merge(df1, df2,  on='captureDate', how='inner')
print (inner)
  captureDate  rule_id_x  rule_id_y
0  2017-06-22         40          1
1  2017-06-25         30          4
2  2017-06-28         60          7
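
The aggregation alternative is not shown above. A minimal sketch, assuming one row per date is wanted; sum is an arbitrary choice made only for illustration, and the right function (mean, max, first, ...) depends on what the columns mean.

```python
import pandas as pd

dates = ['2017-06-22'] * 3 + ['2017-06-25'] * 3 + ['2017-06-28'] * 2
df1 = pd.DataFrame({'captureDate': dates,
                    'rule_id': [40, 10, 20, 30, 70, 10, 60, 10]})
df2 = pd.DataFrame({'captureDate': dates,
                    'rule_id': [1, 2, 3, 4, 5, 6, 7, 8]})

# Collapse each frame to one row per captureDate (sum chosen for illustration)
agg1 = df1.groupby('captureDate', as_index=False)['rule_id'].sum()
agg2 = df2.groupby('captureDate', as_index=False)['rule_id'].sum()

# The key is now unique on both sides, so the merge is one-to-one
inner = pd.merge(agg1, agg2, on='captureDate', how='inner')
print(inner)
```

Unlike drop_duplicates, which keeps only the first row for each date, aggregation uses every row when computing the value for that date.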

EDIT1:

You can use cumcount to number the duplicates within each captureDate, then merge on both columns. Finally, remove the helper column new with drop:

df1 = pd.DataFrame({'captureDate':['2017-06-22']* 3 + ['2017-06-25']* 3 + ['2017-06-28'] * 2,
                   'rule_id':[40,10,20,30,70,10,60,10]})

df2 = pd.DataFrame({'captureDate':['2017-06-22'] * 3 + ['2017-06-25'] * 3,
                   'rule_id':[1,2,3,4,5,6]})


df1['new'] = df1.groupby('captureDate').cumcount()
df2['new'] = df2.groupby('captureDate').cumcount()
print (df1)
  captureDate  rule_id  new
0  2017-06-22       40    0
1  2017-06-22       10    1
2  2017-06-22       20    2
3  2017-06-25       30    0
4  2017-06-25       70    1
5  2017-06-25       10    2
6  2017-06-28       60    0
7  2017-06-28       10    1

print (df2)
  captureDate  rule_id  new
0  2017-06-22        1    0
1  2017-06-22        2    1
2  2017-06-22        3    2
3  2017-06-25        4    0
4  2017-06-25        5    1
5  2017-06-25        6    2

df3 = pd.merge(df1, df2, on=['captureDate','new']).drop('new', axis=1)
print (df3)
  captureDate  rule_id_x  rule_id_y
0  2017-06-22         40          1
1  2017-06-22         10          2
2  2017-06-22         20          3
3  2017-06-25         30          4
4  2017-06-25         70          5
5  2017-06-25         10          6
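
Note that the inner merge above drops the 2017-06-28 rows of df1, because df2 has no rows for that date. If those rows should be kept, a left merge is one option. The sketch below also passes validate='one_to_one' (available in pandas 0.21 and later), which raises an error if the (captureDate, new) pairs are not unique on either side; the variable name left is only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'captureDate': ['2017-06-22'] * 3 + ['2017-06-25'] * 3 + ['2017-06-28'] * 2,
                    'rule_id': [40, 10, 20, 30, 70, 10, 60, 10]})
df2 = pd.DataFrame({'captureDate': ['2017-06-22'] * 3 + ['2017-06-25'] * 3,
                    'rule_id': [1, 2, 3, 4, 5, 6]})

# Number the rows within each date so (captureDate, new) is a unique key
df1['new'] = df1.groupby('captureDate').cumcount()
df2['new'] = df2.groupby('captureDate').cumcount()

# how='left' keeps every row of df1; unmatched rows get NaN in rule_id_y
left = pd.merge(df1, df2, on=['captureDate', 'new'], how='left',
                validate='one_to_one').drop('new', axis=1)
print(left)
```

Because of the NaN values, rule_id_y comes back as a float column in this case.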

【Discussion】:

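Thank you very much. Well, the df3 result is what I want. But the two frames have a different number of columns, so concat cannot be used.
I have merged on two date-like columns before, and it worked. Why not this time?
So print (df1['captureDate'].equals(df2['captureDate'])) is False?
It is True. I am confused about why it cannot merge on these two identical columns. By the way, I tried converting captureDate to str; the columns are still equal to each other and the merge still fails. Weird. @jezrael
cumcount works. I will try to figure out the remaining issues. Thanks for your time and patience. Have a nice day.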