【发布时间】:2018-06-06 18:36:14
【问题描述】:
我有一个包含多个表格的数据集,每个表格都以国家、年份和一些指标的形式出现。我已将所有 excel 表转换为 csv 文件,然后将它们合并到一个表中。
问题是我有一些表拒绝合并,出现如下信息TypeError: '>' not supported between instances of 'int' and 'str'
我尽我所能,但没有运气,仍然出现同样的错误!
此外,我尝试了数百个不同的文件,但仍有数十个文件面临这个问题。
对于示例文件file17.csv 和file35.csv(以防有人需要重复)。 这是我使用的代码:
# To load the first file
import pandas as pd
filename1 = 'file17.csv'
df1 = pd.read_csv(filename1, encoding='cp1252', low_memory=False)
df1.set_index(['Country', 'Year'], inplace=True)
df1.dropna(axis=0, how='all', inplace=True)
df1.head()
出来>>>
+-------------+------+--------+--------+
| | | ind500 | ind356 |
| Country | Year | | |
| Afghanistan | 1800 | 603.0 | NaN |
| | 1801 | 603.0 | NaN |
| | 1802 | 603.0 | NaN |
| | 1803 | 603.0 | NaN |
| | 1804 | 603.0 | NaN |
+-------------+------+--------+--------+
在>>>
# To load the second file
filename2 = 'file35.csv'
df2 = pd.read_csv(filename2, encoding='cp1252', low_memory=False)
df2.set_index(['Country', 'Year'], inplace=True)
df2.dropna(axis=0, how='all', inplace=True)
df2.head()
出来>>>
# To merge the two dataframes
gross_df = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')
gross_df.dropna(axis=0, how='all', inplace=True)
print (gross_df.shape)
gross_df.to_csv('merged.csv')
重要提示:
我注意到在所有成功的文件中,列名以升序出现,即ind001、ind009、ind012,因为它们是自动排序的。而有错误的文件有一个或多个位置错误的列,例如第一个表中的ind500,后跟in356,同样适用于提供的第二个示例。
请注意,两个 dataframesindiceswo 索引(Country 和 year)
错误
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\algorithms.py in safe_sort(values, labels, na_sentinel, assume_unique)
480 try:
--> 481 sorter = values.argsort()
482 ordered = values.take(sorter)
TypeError: '>' not supported between instances of 'int' and 'str'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-11-960b2698de60> in <module>()
----> 1 gross_df = pd.merge(df1, df2, left_index=True, right_index=True, how='outer', sort=False)
2 gross_df.dropna(axis=0, how='all', inplace=True)
3 print (gross_df.shape)
4 gross_df.to_csv('merged.csv')
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\reshape\merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator)
52 right_index=right_index, sort=sort, suffixes=suffixes,
53 copy=copy, indicator=indicator)
---> 54 return op.get_result()
55
56
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\reshape\merge.py in get_result(self)
567 self.left, self.right)
568
--> 569 join_index, left_indexer, right_indexer = self._get_join_info()
570
571 ldata, rdata = self.left._data, self.right._data
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\reshape\merge.py in _get_join_info(self)
720 join_index, left_indexer, right_indexer = \
721 left_ax.join(right_ax, how=self.how, return_indexers=True,
--> 722 sort=self.sort)
723 elif self.right_index and self.how == 'left':
724 join_index, left_indexer, right_indexer = \
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\indexes\base.py in join(self, other, how, level, return_indexers, sort)
2995 else:
2996 return self._join_non_unique(other, how=how,
-> 2997 return_indexers=return_indexers)
2998 elif self.is_monotonic and other.is_monotonic:
2999 try:
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\indexes\base.py in _join_non_unique(self, other, how, return_indexers)
3076 left_idx, right_idx = _get_join_indexers([self.values],
3077 [other._values], how=how,
-> 3078 sort=True)
3079
3080 left_idx = _ensure_platform_int(left_idx)
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\reshape\merge.py in _get_join_indexers(left_keys, right_keys, sort, how, **kwargs)
980
981 # get left & right join labels and num. of levels at each location
--> 982 llab, rlab, shape = map(list, zip(* map(fkeys, left_keys, right_keys)))
983
984 # get flat i8 keys from label lists
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\reshape\merge.py in _factorize_keys(lk, rk, sort)
1409 if sort:
1410 uniques = rizer.uniques.to_array()
-> 1411 llab, rlab = _sort_labels(uniques, llab, rlab)
1412
1413 # NA group
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\reshape\merge.py in _sort_labels(uniques, left, right)
1435 labels = np.concatenate([left, right])
1436
-> 1437 _, new_labels = algos.safe_sort(uniques, labels, na_sentinel=-1)
1438 new_labels = _ensure_int64(new_labels)
1439 new_left, new_right = new_labels[:l], new_labels[l:]
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\algorithms.py in safe_sort(values, labels, na_sentinel, assume_unique)
483 except TypeError:
484 # try this anyway
--> 485 ordered = sort_mixed(values)
486
487 # labels:
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\pandas\core\algorithms.py in sort_mixed(values)
469 str_pos = np.array([isinstance(x, string_types) for x in values],
470 dtype=bool)
--> 471 nums = np.sort(values[~str_pos])
472 strs = np.sort(values[str_pos])
473 return _ensure_object(np.concatenate([nums, strs]))
C:\ProgramData\Anaconda2\envs\conda_python3\lib\site-packages\numpy\core\fromnumeric.py in sort(a, axis, kind, order)
820 else:
821 a = asanyarray(a).copy(order="K")
--> 822 a.sort(axis=axis, kind=kind, order=order)
823 return a
824
TypeError: '>' not supported between instances of 'int' and 'str'
【问题讨论】:
-
请不要截图错误——复制文本并粘贴为代码块!
-
好的,完成...抱歉
-
确保合并的 DF 中的索引具有相同的 dtypes
-
@MaxU 非常感谢,我真的发现
'Year'列在第一个数据帧中存储为int64,在第二个数据帧中存储为'object'。我会将int64更改为string然后会回复您。 -
@MohammadElNesr,您能否将
df1.head()的输出发布为文本,以便我们复制和粘贴它?
标签: python pandas csv merge typeerror