【Title】: Matches not found by pd.DataFrame.merge
【Posted】: 2019-12-01 21:40:16
【Question】:

I have three pd.DataFrames:

df1 = pd.DataFrame({'var1': {0: 2210, 1: 2210, 2: 2210, 3: 2210, 4: 2210, 5: 2210, 6: 2210, 7: 2210, 8: 2210, 9: 2210, 10: 2210, 11: 2210, 12: 2210, 13: 2210, 14: 2210, 15: 2210, 16: 2210, 17: 2210, 18: 2210, 19: 2210, 20: 2210, 21: 2210}, 'var2': {0: 1, 1: 2, 2: 1, 3: 2, 4: 1, 5: 2, 6: 1, 7: 2, 8: 1, 9: 2, 10: 1, 11: 2, 12: 1, 13: 2, 14: 1, 15: 2, 16: 1, 17: 2, 18: 1, 19: 2, 20: 1, 21: 2}, 'var3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0}, 'var4': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0}, 'var5': {0: '121160', 1: '20066', 2: ' 58621', 3: ' 201084', 4: ' 100180', 5: ' 74230', 6: ' 27789', 7: ' 66975', 8: ' 57410', 9: ' 49413', 10: ' 57112', 11: ' 19188', 12: ' 61366', 13: ' 27341', 14: ' 59859', 15: ' 173954', 16: ' 205651', 17: ' 54861', 18: ' 165809', 19: ' 60252', 20: ' 182156', 21: ' 82403'}})

df2 = pd.DataFrame({'var1': {349176: 2210, 349225: 2210, 349913: 2210, 350247: 2210, 350342: 2210, 350518: 2210}, 'var2': {349176: 2, 349225: 1, 349913: 1, 350247: 2, 350342: 1, 350518: 2}, 'var5': {349176: 58786.0, 349225: 37572.0, 349913: 103955.0, 350247: 19197.0, 350342: 14664.0, 350518: 75773.0}, 'var3': {349176: 19, 349225: 22, 349913: 56, 350247: 75, 350342: 80, 350518: 95}, 'var4': {349176: 8, 349225: 52, 349913: 42, 350247: 0, 350342: 50, 350518: 17}})

df3 = pd.DataFrame({'var1': {349175: 2210, 349224: 2210, 349912: 2210, 350246: 2210, 350341: 2210, 350517: 2210, 350521: 2210}, 'var2': {349175: 2, 349224: 1, 349912: 1, 350246: 2, 350341: 1, 350517: 2, 350521: 1}, 'var5': {349175: 19188.0, 349224: 205651.0, 349912: 59859.0, 350246: 27341.0, 350341: 165809.0, 350517: 19197.0, 350521: 61366.0}, 'var6': {349175: 19, 349224: 22, 349912: 56, 350246: 75, 350341: 80, 350517: 95, 350521: 95}, 'var7': {349175: 8, 349224: 52, 349912: 42, 350246: 0, 350341: 50, 350517: 17, 350521: 40}})

I need to stack df1 and df2 on top of each other and then left-join the result with df3 on several variables: var1, var2, var5.

So I wrote:

pd.concat([df1, df2], axis = 0, sort = False).merge(df3, how = 'left', on = ['var1', 'var2', 'var5'])

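But it does not find all the matching rows. Changing the join type to outer, we can see that there are, for example, two rows with the same values of the join keys var1, var2 and var5 (rows 11 and 28), yet they have not been joined: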

pd.concat([df1, df2], axis = 0, sort = False).merge(df3, how = 'outer', on = ['var1', 'var2', 'var5'])

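I am struggling to find the reason for this behavior. I thought the data types might differ in the join columns, but no, they are the same. I am relatively new to pandas, so maybe I am missing something obvious here? What is the reason for this (unexpected) behavior?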

【Comments】:

    Tags: pandas merge left-join


    【Answer 1】:

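    When I run your code on my machine and then check the types with df#.dtypes, the dtype of the var5 column in df1 is object, while in df2 and df3 it is float64. The concat works fine (after the concat, the dtype is object), but when I try to run the merge (outer or left), I get a ValueError: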

    ValueError: You are trying to merge on object and float64 columns. If you wish to proceed you should use pd.concat
    

    I suggest checking the types again (I know you have already checked). If they really are the same on your machine, I am not sure what is going on.
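
A column-level dtype of object does not say what is stored inside the column, so checking the types of the individual values can help here. The snippet below is only a sketch on a made-up three-row frame (`mixed` is a hypothetical name, not the asker's data): it counts the Python types found in the key column and prints the `repr` of each value so that stray whitespace becomes visible.

```python
import pandas as pd

# Hypothetical miniature of the situation: an object column that mixes
# strings (one with a leading space) and a float.
mixed = pd.DataFrame({"var5": ["121160", " 58621", 58786.0]})

# The column-level dtype only reports "object" ...
print(mixed["var5"].dtype)

# ... while counting the element types shows what is really stored.
type_counts = mixed["var5"].map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# repr() makes leading or trailing whitespace in string keys visible.
print(mixed["var5"].map(repr).tolist())
```

If the counts show both str and float in the same key column, the keys cannot all compare equal to purely numeric keys on the other side of the merge.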

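    【Comments】:

    • I checked again, and var5 is object in all the initial DataFrames as well as in the one produced by pd.concat().
    • Hmm, something may have changed when you copied the DataFrames to Stack Overflow... it definitely looks like floats in df2 and df3: 'var5': {349176: 58786.0, and 'var5': {349175: 19188.0,. Can you confirm this is the same as what you have?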
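
One possible explanation for the mismatch between what the asker sees and what the posted code produces: rebuilding a frame from a dict literal lets pandas infer the dtype again. The sketch below assumes, purely for illustration, that the original column held floats under an object dtype (`original` and `rebuilt` are hypothetical names); after a round trip through `to_dict()` the column is expected to come back as float64.

```python
import pandas as pd

# Assumption for illustration: floats stored in an object-dtype column.
original = pd.DataFrame({"var5": pd.Series([58786.0, 37572.0], dtype=object)})

# Round trip through a plain dict, as happens when a frame is pasted
# into a question as pd.DataFrame({...}).
rebuilt = pd.DataFrame(original.to_dict())

print(original["var5"].dtype)  # object
print(rebuilt["var5"].dtype)   # dtype is inferred again from the values
```

So the dtypes reported for the pasted frames need not be the dtypes of the frames the asker is actually working with.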
    【Answer 2】:
    import pandas as pd
    
    df1 = pd.DataFrame({'var1': {0: 2210, 1: 2210, 2: 2210, 3: 2210, 4: 2210, 5: 2210, 6: 2210, 7: 2210, 8: 2210, 9: 2210, 10: 2210, 11: 2210, 12: 2210, 13: 2210, 14: 2210, 15: 2210, 16: 2210, 17: 2210, 18: 2210, 19: 2210, 20: 2210, 21: 2210}, 'var2': {0: 1, 1: 2, 2: 1, 3: 2, 4: 1, 5: 2, 6: 1, 7: 2, 8: 1, 9: 2, 10: 1, 11: 2, 12: 1, 13: 2, 14: 1, 15: 2, 16: 1, 17: 2, 18: 1, 19: 2, 20: 1, 21: 2}, 'var3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0}, 'var4': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0}, 'var5': {0: '121160', 1: '20066', 2: ' 58621', 3: ' 201084', 4: ' 100180', 5: ' 74230', 6: ' 27789', 7: ' 66975', 8: ' 57410', 9: ' 49413', 10: ' 57112', 11: ' 19188', 12: ' 61366', 13: ' 27341', 14: ' 59859', 15: ' 173954', 16: ' 205651', 17: ' 54861', 18: ' 165809', 19: ' 60252', 20: ' 182156', 21: ' 82403'}})
    
    df2 = pd.DataFrame({'var1': {349176: 2210, 349225: 2210, 349913: 2210, 350247: 2210, 350342: 2210, 350518: 2210}, 'var2': {349176: 2, 349225: 1, 349913: 1, 350247: 2, 350342: 1, 350518: 2}, 'var5': {349176: 58786.0, 349225: 37572.0, 349913: 103955.0, 350247: 19197.0, 350342: 14664.0, 350518: 75773.0}, 'var3': {349176: 19, 349225: 22, 349913: 56, 350247: 75, 350342: 80, 350518: 95}, 'var4': {349176: 8, 349225: 52, 349913: 42, 350247: 0, 350342: 50, 350518: 17}})
    
    df3 = pd.DataFrame({'var1': {349175: 2210, 349224: 2210, 349912: 2210, 350246: 2210, 350341: 2210, 350517: 2210, 350521: 2210}, 'var2': {349175: 2, 349224: 1, 349912: 1, 350246: 2, 350341: 1, 350517: 2, 350521: 1}, 'var5': {349175: 19188.0, 349224: 205651.0, 349912: 59859.0, 350246: 27341.0, 350341: 165809.0, 350517: 19197.0, 350521: 61366.0}, 'var6': {349175: 19, 349224: 22, 349912: 56, 350246: 75, 350341: 80, 350517: 95, 350521: 95}, 'var7': {349175: 8, 349224: 52, 349912: 42, 350246: 0, 350341: 50, 350517: 17, 350521: 40}})
    
    pd.concat([df1, df2], axis = 0).dtypes
    

    Result:

    var1     int64
    var2     int64
    var3     int64
    var4     int64
    var5    object
    dtype: object
    

    You can see that after the concat, var5 is an object column. If you merge at this point you will not get any matches, because var5 in df3 is a float.
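
To illustrate why the concatenated column ends up as object, here is a small sketch on made-up Series (`ints`, `floats`, `strings` and `combined` are hypothetical names). It shows that concatenating ints with floats gives a float column, while concatenating anything with an object column of strings gives object, and that the values inside keep their original Python types.

```python
import pandas as pd

ints = pd.Series([1, 2])
floats = pd.Series([1.5, 2.5])
# dtype=object is set explicitly so the example does not depend on the
# default string dtype of the installed pandas version.
strings = pd.Series(["1", "2"], dtype=object)

print(pd.concat([ints, floats]).dtype)  # float64

combined = pd.concat([strings, floats], ignore_index=True)
print(combined.dtype)  # object

# The elements are not converted to a common type: strings stay strings
# and floats stay floats inside the object column.
kinds = combined.map(lambda v: type(v).__name__).tolist()
print(kinds)  # ['str', 'str', 'float', 'float']
```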

    Here is what I suggest:

    df1['var5'] = df1['var5'].astype(float)
    df2['var5'] = df2['var5'].astype(float)
    df3['var5'] = df3['var5'].astype(float)
    pd.concat([df1, df2], axis = 0).merge(df3, how = 'left', on = ['var1', 'var2', 'var5'])
    

    This produces the following DataFrame:

        var1  var2  var3  var4      var5  var6  var7
    0   2210     1     0     0  121160.0   NaN   NaN
    1   2210     2     0     0   20066.0   NaN   NaN
    2   2210     1     0     0   58621.0   NaN   NaN
    3   2210     2     0     0  201084.0   NaN   NaN
    4   2210     1     0     0  100180.0   NaN   NaN
    5   2210     2     0     0   74230.0   NaN   NaN
    6   2210     1     0     0   27789.0   NaN   NaN
    7   2210     2     0     0   66975.0   NaN   NaN
    8   2210     1     0     0   57410.0   NaN   NaN
    9   2210     2     0     0   49413.0   NaN   NaN
    10  2210     1     0     0   57112.0   NaN   NaN
    11  2210     2     0     0   19188.0  19.0   8.0
    12  2210     1     0     0   61366.0  95.0  40.0
    13  2210     2     0     0   27341.0  75.0   0.0
    14  2210     1     0     0   59859.0  56.0  42.0
    15  2210     2     0     0  173954.0   NaN   NaN
    16  2210     1     0     0  205651.0  22.0  52.0
    17  2210     2     0     0   54861.0   NaN   NaN
    18  2210     1     0     0  165809.0  80.0  50.0
    19  2210     2     0     0   60252.0   NaN   NaN
    20  2210     1     0     0  182156.0   NaN   NaN
    21  2210     2     0     0   82403.0   NaN   NaN
    22  2210     2    19     8   58786.0   NaN   NaN
    23  2210     1    22    52   37572.0   NaN   NaN
    24  2210     1    56    42  103955.0   NaN   NaN
    25  2210     2    75     0   19197.0  95.0  17.0
    26  2210     1    80    50   14664.0   NaN   NaN
    27  2210     2    95    17   75773.0   NaN   NaN
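
To check how many rows actually found a partner, the merge can be run with `indicator=True`. The following is a minimal sketch on a made-up subset of the data; `normalize_key` is a hypothetical helper (not part of pandas) that strips whitespace from string keys and parses everything as a number, which is one possible alternative to the plain `astype(float)` used above.

```python
import pandas as pd

left = pd.DataFrame({"var1": [2210, 2210, 2210],
                     "var2": [2, 1, 2],
                     "var5": [" 19188", " 61366", 19197.0]})  # mixed str/float keys
right = pd.DataFrame({"var1": [2210, 2210],
                      "var2": [2, 1],
                      "var5": [19188.0, 61366.0],
                      "var6": [19, 95]})


def normalize_key(column: pd.Series) -> pd.Series:
    """Hypothetical helper: strip strings, then parse every value as float64."""
    stripped = column.map(lambda v: v.strip() if isinstance(v, str) else v)
    # errors="coerce" turns unparseable values into NaN instead of raising.
    return pd.to_numeric(stripped, errors="coerce").astype("float64")


left["var5"] = normalize_key(left["var5"])
right["var5"] = normalize_key(right["var5"])

# indicator=True adds a "_merge" column: "both" for matched rows,
# "left_only" for rows without a partner in `right`.
merged = left.merge(right, how="left", on=["var1", "var2", "var5"], indicator=True)
print(merged)
print(merged["_merge"].value_counts())
```

In this toy input the first two rows match and the third does not, so `_merge` shows two `both` entries and one `left_only`.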
    

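    【Comments】:

    • var5 is actually categorical in nature, and its representation as a float in df3 must come from some odd behavior of pd.DataFrame(df3.to_dict()), because df3.dtypes shows it as object before it is converted to a dict. I replaced your approach with a conversion to int and it does work. Explicitly casting var5 to object in both DataFrames and joining them afterwards does not work. It seems you can join if var5 is float or int, but not object. Any idea why?
    • @jakes, since you used my approach, feel free to accept my answer :). Columns of dtype int or float will not join with an object column that holds text. When you concatenate df1 and df2, the column becomes object because it is object in df1. You can always convert ints / floats to objects, but not necessarily the other way around, so pandas converts them to object on concat. If you concat floats and ints you get a float column. As for your problem: you can never join text to numbers.
    • I see, but I am still wondering why the following does not work: pd.concat([df1, df2], axis = 0).astype({'var5': 'object'}).merge(df3.astype({'var5': 'object'}), how = 'left', on = ['var1', 'var2', 'var5'])
    • @jakes I see what you are saying now. When a float is converted to object, the decimal point is still there, so the string '19188.0' is not equal to the string '19188'. If you want to convert to object and then join, you can do .astype({'var5':'int64'}).astype({'var5':'object'})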
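
The point of the last comment can be sketched in a few lines. Casting to object does not change the values themselves, so a string key such as ' 19188' still does not compare equal to the float 19188.0. The example below uses made-up two-row frames, and `to_int_key` is a hypothetical helper (not a pandas function); it stops at int64 instead of casting back to object, which is an assumption about what is good enough for the join.

```python
import pandas as pd

# Equality between the stored values decides whether object keys match.
print("19188.0" == "19188")  # False: different strings
print(" 19188" == 19188.0)   # False: a string never equals a number
print(19188 == 19188.0)      # True: numbers compare by value

left = pd.DataFrame({"var5": [" 19188", 19197.0]}, dtype=object)
right = pd.DataFrame({"var5": [19188.0, 19197.0], "var6": [19, 95]})


def to_int_key(column: pd.Series) -> pd.Series:
    """Hypothetical helper: parse each value as a number and store it as int64."""
    # float() accepts surrounding whitespace, so " 19188" parses fine.
    return column.map(lambda v: int(float(v))).astype("int64")


left["var5"] = to_int_key(left["var5"])
right["var5"] = to_int_key(right["var5"])

merged = left.merge(right, how="left", on="var5")
print(merged)
```

Once both key columns hold the same numeric type, both rows of this toy input find their match.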