[Question title]: Pandas join: Does not recognize joining column
[Posted]: 2023-03-15 19:07:01
[Question]:

I don't know what's going on; the title is only a first-order approximation. I'm trying to join two data frames:

>>> df_sum.head()
         TUCASEID  t070101  t070102  t070103  t070104  t070105  t070199  \
0  20030100013280        0        0        0        0        0        0   
1  20030100013344        0        0        0        0        0        0   
2  20030100013352       60        0        0        0        0        0   
3  20030100013848        0        0        0        0        0        0   
4  20030100014165        0        0        0        0        0        0   

   t070201  t070299  shopping  year  
0        0        0         0  2003  
1        0        0         0  2003  
2        0        0        60  2003  
3        0        0         0  2003  
4        0        0         0  2003  
>>> emp.head()
         TUCASEID status
0  20030100013280    emp
1  20030100013344    emp
2  20030100013352    emp
4  20030100014165    emp
5  20030100014169    emp

Those are the data frames. I want to join them on the common column TUCASEID, and the two columns do share values:

>>> np.intersect1d(emp.TUCASEID, df_sum.TUCASEID)
array([20030100013280, 20030100013344, 20030100013352, ..., 20131212132462,
       20131212132469, 20131212132475])

Now...

>>> df_sum.join(emp, on='TUCASEID', how='inner')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 3829, in join
    rsuffix=rsuffix, sort=sort)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 3843, in _join_compat
    suffixes=(lsuffix, rsuffix), sort=sort)
  File "/usr/local/lib/python2.7/site-packages/pandas/tools/merge.py", line 39, in merge
    return op.get_result()
  File "/usr/local/lib/python2.7/site-packages/pandas/tools/merge.py", line 193, in get_result
    rdata.items, rsuf)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 3873, in items_overlap_with_suffix
    to_rename)
ValueError: columns overlap but no suffix specified: Index([u'TUCASEID'], dtype='object')

Hm, that's odd: the only column that appears in both data frames is the one I'm joining on. But fine, let's go along with it [1]:

>>> df_sum.join(emp, on='TUCASEID', how='inner', rsuffix='r')
Empty DataFrame
Columns: [TUCASEID, t070101, t070102, t070103, t070104, t070105, t070199, t070201, t070299, shopping, year, TUCASEIDr, status]
Index: []

The result is empty even though the intersection is huge. What is going on here?

>>> pd.__version__
'0.15.0'

[1]: I actually forced the join column's dtype to integer, since the error message says "object" there. It made no difference:

>>> emp.dtypes
TUCASEID     int64
status      object
dtype: object
>>> df_sum.dtypes
TUCASEID    int64
(...)
shopping    int64
year        int64
dtype: object

[Question comments]:

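Your index values don't match. Why not merge them: df_sum.merge(emp, on='TUCASEID', how='outer')? Or do you just want to add the 'status' column for each 'TUCASEID' row? In that case do df_sum['status'] = df['sum['TUCASEID'].map(emp.set_index('TUCASEID')
@EdChum OK, I'll look into the alternatives. Why does it matter that the index values don't match? I specified on= as the alternative column.
Not sure, but join usually joins on the index. I can reproduce the behaviour and it is odd, but the other methods I suggested should work.
@EdChum Your last command has a typo. I guess you meant df_sum['TUCASEID'].map(emp.set_index('TUCASEID')), which gives TypeError: 'DataFrame' object is not callable
Sorry, try: df_sum['status'] = df_sum.TUCASEID.map(emp.set_index('TUCASEID')['status']). By the way, df_sum.join(emp, on='TUCASEID', how='outer', rsuffix='r') works, but I don't know whether that is what you want.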
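
As a hedged illustration of the map-based suggestion in the last comment, here is a minimal sketch. The two frames are cut-down stand-ins built from the IDs shown in the question, and the name lookup is made up for this example. It assumes TUCASEID is unique in emp.

```python
import pandas as pd

# Cut-down stand-ins for the frames in the question (illustrative data only).
df_sum = pd.DataFrame({
    "TUCASEID": [20030100013280, 20030100013344, 20030100013352,
                 20030100013848, 20030100014165],
    "shopping": [0, 0, 60, 0, 0],
})
emp = pd.DataFrame(
    {"TUCASEID": [20030100013280, 20030100013344, 20030100013352,
                  20030100014165, 20030100014169],
     "status": ["emp"] * 5},
    index=[0, 1, 2, 4, 5],
)

# Build a TUCASEID -> status lookup Series (hypothetical name), then map
# each TUCASEID in df_sum through it. IDs missing from emp become NaN.
lookup = emp.set_index("TUCASEID")["status"]
df_sum["status"] = df_sum["TUCASEID"].map(lookup)
print(df_sum)
```

This keeps every row of df_sum and only adds the status column, so it behaves like a left join rather than an inner join.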

Tags: python join pandas inner-join

[Solution 1]:

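df.join generally calls pd.merge (except in a special case where it calls concat). So anything join can do, merge can do too. Although it may not be strictly true, I tend to use df.join only when joining on the index, and pd.merge when joining on columns.

So, I can reproduce the problem you describe: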

import numpy as np
import pandas as pd

df_sum = pd.DataFrame(np.arange(6*2).reshape((6,2)), 
                      index=list('ABCDEF'), columns=list('XY'))
emp =  pd.DataFrame(np.arange(6*2).reshape((6,2)), 
                    index=list('ABCDEF'), columns=list('XZ'))
print(df_sum.join(emp, on='X', rsuffix='_r', how='inner'))

# Empty DataFrame
# Columns: [X, Y, X_r, Z]
# Index: []

But pd.merge works fine, and there is no need to supply rsuffix:

print(pd.merge(df_sum, emp, on='X'))

yields

    X   Y   Z
0   0   1   1
1   2   3   3
2   4   5   5
3   6   7   7
4   8   9   9
5  10  11  11
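
Applied to frames shaped like the ones in the question, the same call might look like the sketch below. The data is a cut-down stand-in using the IDs and index labels printed in the question (note that emp has no row labelled 3), and the name merged is illustrative only.

```python
import pandas as pd

# Cut-down stand-ins for the frames in the question (illustrative data only).
df_sum = pd.DataFrame({
    "TUCASEID": [20030100013280, 20030100013344, 20030100013352,
                 20030100013848, 20030100014165],
    "shopping": [0, 0, 60, 0, 0],
})
emp = pd.DataFrame(
    {"TUCASEID": [20030100013280, 20030100013344, 20030100013352,
                  20030100014165, 20030100014169],
     "status": ["emp"] * 5},
    index=[0, 1, 2, 4, 5],
)

# Match rows on the values of the TUCASEID column in both frames.
# The index labels of df_sum and emp play no part in the match.
merged = pd.merge(df_sum, emp, on="TUCASEID", how="inner")
print(merged)
```

Only the four IDs present in both frames survive the inner merge, and TUCASEID appears once in the result, so no suffix is needed.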

Under the hood, df_sum.join calls merge this way:

    if isinstance(other, DataFrame):
        return merge(self, other, left_on=on, how=how,
                     left_index=on is None, right_index=True,
                     suffixes=(lsuffix, rsuffix), sort=sort)

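So even when you call df_sum.join(emp, on='...'), Pandas translates it into pd.merge(df_sum, emp, left_on='...', right_index=True), which matches the left column against emp's index rather than against emp's column of the same name. Called this way, the merge comes back empty: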

In [228]: pd.merge(df_sum, emp, left_on='X', left_index=False, right_index=True)
Out[228]: 
Empty DataFrame
Columns: [X, X_x, Y, X_y, Z]
Index: []

That is because left_on='X' would need to be on='X' for the merge to succeed as desired:

In [233]: pd.merge(df_sum, emp, on='X', left_index=False, right_index=True)
Out[233]: 
    X   Y   Z
A   0   1   1
B   2   3   3
C   4   5   5
D   6   7   7
E   8   9   9
F  10  11  11
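
One way to see why the join-style call comes back empty, and how df.join can still be used with on=, is sketched below. It reuses the toy frames from this answer; left_keys, right_keys and joined are names introduced here for illustration. The assumption is that on= names a column of the calling frame whose values are looked up in the other frame's index, which is what the merge call quoted above does via right_index=True.

```python
import numpy as np
import pandas as pd

df_sum = pd.DataFrame(np.arange(6 * 2).reshape((6, 2)),
                      index=list('ABCDEF'), columns=list('XY'))
emp = pd.DataFrame(np.arange(6 * 2).reshape((6, 2)),
                   index=list('ABCDEF'), columns=list('XZ'))

# Keys used on the left side: the values of column X (0, 2, 4, ...).
left_keys = set(df_sum['X'])
# Keys used on the right side with right_index=True: the index labels A..F.
right_keys = set(emp.index)
# No value is shared, so an inner join on these keys has no rows.
print(left_keys & right_keys)

# Moving X into emp's index makes the right-hand keys the X values,
# so the column-to-index lookup finds a match for every row.
joined = df_sum.join(emp.set_index('X'), on='X', how='inner')
print(joined)
```

Because X is no longer a column of the right-hand frame after set_index, the two frames share no column names and no rsuffix is required.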

[Comments]: