[Posted]: 2023-03-15 19:07:01
[Question]:
I don't know what is going on here; the title is only a first-order approximation. I am trying to join two dataframes:
>>> df_sum.head()
TUCASEID t070101 t070102 t070103 t070104 t070105 t070199 \
0 20030100013280 0 0 0 0 0 0
1 20030100013344 0 0 0 0 0 0
2 20030100013352 60 0 0 0 0 0
3 20030100013848 0 0 0 0 0 0
4 20030100014165 0 0 0 0 0 0
t070201 t070299 shopping year
0 0 0 0 2003
1 0 0 0 2003
2 0 0 60 2003
3 0 0 0 2003
4 0 0 0 2003
>>> emp.head()
TUCASEID status
0 20030100013280 emp
1 20030100013344 emp
2 20030100013352 emp
4 20030100014165 emp
5 20030100014169 emp
Those are the dataframes. I want to join them on the common column TUCASEID, whose values do intersect:
>>> np.intersect1d(emp.TUCASEID, df_sum.TUCASEID)
array([20030100013280, 20030100013344, 20030100013352, ..., 20131212132462,
20131212132469, 20131212132475])
Now...
>>> df_sum.join(emp, on='TUCASEID', how='inner')
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 3829, in join
rsuffix=rsuffix, sort=sort)
File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 3843, in _join_compat
suffixes=(lsuffix, rsuffix), sort=sort)
File "/usr/local/lib/python2.7/site-packages/pandas/tools/merge.py", line 39, in merge
return op.get_result()
File "/usr/local/lib/python2.7/site-packages/pandas/tools/merge.py", line 193, in get_result
rdata.items, rsuf)
File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 3873, in items_overlap_with_suffix
to_rename)
ValueError: columns overlap but no suffix specified: Index([u'TUCASEID'], dtype='object')
Hmm, that's odd: the only column that appears in both dataframes is the one I'm joining on. But fine, let's comply [1]:
>>> df_sum.join(emp, on='TUCASEID', how='inner', rsuffix='r')
Empty DataFrame
Columns: [TUCASEID, t070101, t070102, t070103, t070104, t070105, t070199, t070201, t070299, shopping, year, TUCASEIDr, status]
Index: []
The result is empty even though the intersection is huge. What is going on?
>>> pd.__version__
'0.15.0'
[1]: I actually forced an integer dtype on the join column, because the error message says "object" there; it made no difference:
>>> emp.dtypes
TUCASEID int64
status object
dtype: object
>>> df_sum.dtypes
TUCASEID int64
(...)
shopping int64
year int64
dtype: object
[Comments]:
-
Your index values don't match. Why not merge them instead: df_sum.merge(emp, on='TUCASEID', how='outer')? Or do you just want to add the 'status' column for each 'TUCASEID' row? In that case do df_sum['status'] = df['sum['TUCASEID'].map(emp.set_index('TUCASEID') -
@EdChum OK, I'll look into the alternatives. Why is it relevant that the index values don't match? I have specified an alternative on= column. -
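Not sure, but join normally joins on the index. I can reproduce the behaviour and it does look odd, but the other approaches I suggested should work -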
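The remark that join normally joins on the index can be illustrated with a small sketch. It uses toy frames that only mimic df_sum and emp (the values and the two-column layout are assumptions, not the asker's data) and a current pandas release rather than 0.15.0. With on='TUCASEID', the left frame's TUCASEID column is compared against the index of emp, not against its TUCASEID column, so moving the key into emp's index is one way to get matches:

```python
import pandas as pd

# Hypothetical toy frames standing in for df_sum and emp
df_sum = pd.DataFrame({
    "TUCASEID": [20030100013280, 20030100013344, 20030100013352],
    "shopping": [0, 0, 60],
})
emp = pd.DataFrame({
    "TUCASEID": [20030100013280, 20030100013352, 20030100014169],
    "status": ["emp", "emp", "emp"],
})

# join(on=...) looks up df_sum["TUCASEID"] in emp's index (0, 1, 2),
# so no row matches and the inner join comes back empty
empty = df_sum.join(emp, on="TUCASEID", how="inner", rsuffix="r")

# With TUCASEID as emp's index the lookup succeeds, and no suffix is
# needed because emp no longer carries a TUCASEID column
joined = df_sum.join(emp.set_index("TUCASEID"), on="TUCASEID", how="inner")
print(joined)
```

This also suggests why the first attempt complained about overlapping columns: emp still had TUCASEID as an ordinary column, which clashed with the one in df_sum.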
@EdChum Your last command has a typo. I guess it should be df_sum['TUCASEID'].map(emp.set_index('TUCASEID')), which gives TypeError: 'DataFrame' object is not callable -
Sorry, try: df_sum['status'] = df_sum.TUCASEID.map(emp.set_index('TUCASEID')['status']). By the way, df_sum.join(emp, on='TUCASEID', how='outer', rsuffix='r') works, but I don't know whether that is what you want
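The two alternatives suggested in the comments, merge and map, might look like this on toy data. The frames below are hypothetical stand-ins for df_sum and emp, and the sketch assumes TUCASEID is unique within emp, which map needs in order to look values up in the resulting Series:

```python
import pandas as pd

# Hypothetical toy frames standing in for df_sum and emp
df_sum = pd.DataFrame({
    "TUCASEID": [20030100013280, 20030100013344, 20030100013352],
    "shopping": [0, 0, 60],
})
emp = pd.DataFrame({
    "TUCASEID": [20030100013280, 20030100013352, 20030100014169],
    "status": ["emp", "emp", "emp"],
})

# merge matches on the column values of both frames and ignores the index
merged = df_sum.merge(emp, on="TUCASEID", how="inner")

# map looks each TUCASEID up in a Series indexed by TUCASEID;
# rows without a match get NaN instead of being dropped
status_by_id = emp.set_index("TUCASEID")["status"]
with_status = df_sum.copy()
with_status["status"] = with_status["TUCASEID"].map(status_by_id)

print(merged)
print(with_status)
```

The merge variant keeps only rows present in both frames when how='inner', while the map variant keeps every row of df_sum, so which one fits depends on whether unmatched cases should survive.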
Tags: python join pandas inner-join