【Posted】: 2013-05-31 12:16:33
【Question】:
So I'm reading in a station codes csv file from NOAA, which looks like this:
"USAF","WBAN","STATION NAME","CTRY","FIPS","STATE","CALL","LAT","LON","ELEV(.1M)","BEGIN","END"
"006852","99999","SENT","SW","SZ","","","+46817","+010350","+14200","",""
"007005","99999","CWOS 07005","","","","","-99999","-999999","-99999","20120127","20120127"
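The first two columns contain the weather station codes, and sometimes they have leading zeros. When pandas imports them without a dtype specified, they turn into integers. That's not a big deal, because I can loop through the dataframe index and replace them with something like "%06d" % i since they are always six digits, but you know... that's the lazy way.
The csv is fetched with the following code: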
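For illustration, the zero-padding workaround described above might look roughly like this. It is only a sketch: it assumes Python 3 and a recent pandas, and it reads a two-row in-memory sample (the `sample` variable is hypothetical) rather than the downloaded file.

```python
import io

import pandas as pd

# Two sample rows, trimmed to three columns for brevity (hypothetical stand-in
# for the real "Station Codes.csv").
sample = io.StringIO(
    '"USAF","WBAN","STATION NAME"\n'
    '"006852","99999","SENT"\n'
    '"007005","99999","CWOS 07005"\n'
)

# Without a dtype, the quoted codes are inferred as integers and lose their zeros.
df = pd.read_csv(sample)

# Workaround 1: re-pad each integer to six digits with printf-style formatting.
padded = ["%06d" % code for code in df["USAF"]]

# Workaround 2: the same thing, vectorized with the string accessor.
padded_vectorized = df["USAF"].astype(str).str.zfill(6).tolist()
```

Both variants only work because the USAF codes have a known fixed width; keeping the column as text at read time avoids the round trip entirely.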
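import urllib  # Python 2; in Python 3 the equivalent is urllib.request.urlopen

# Download the NOAA station list and save a local copy
file = urllib.urlopen(r"ftp://ftp.ncdc.noaa.gov/pub/data/inventories/ISH-HISTORY.CSV")
output = open('Station Codes.csv','wb')
output.write(file.read())
output.close()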
This is all well and good, but when I try to read it with:
import numpy as np
import pandas as pd
df = pd.io.parsers.read_csv("Station Codes.csv",dtype={'USAF': np.str, 'WBAN': np.str})
or
import pandas as pd
df = pd.io.parsers.read_csv("Station Codes.csv",dtype={'USAF': str, 'WBAN': str})
I get a nasty error message:
File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 401, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 216, in _read
return parser.read()
File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 633, in read
ret = self._engine.read(nrows)
File "C:\Python27\lib\site-packages\pandas-0.11.0-py2.7-win32.egg\pandas\io\parsers.py", line 957, in read
data = self._reader.read(nrows)
File "parser.pyx", line 654, in pandas._parser.TextReader.read (pandas\src\parser.c:5931)
File "parser.pyx", line 676, in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6148)
File "parser.pyx", line 752, in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6962)
File "parser.pyx", line 837, in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7898)
File "parser.pyx", line 887, in pandas._parser.TextReader._convert_tokens (pandas\src\parser.c:8483)
File "parser.pyx", line 953, in pandas._parser.TextReader._convert_with_dtype (pandas\src\parser.c:9535)
File "parser.pyx", line 1283, in pandas._parser._to_fw_string (pandas\src\parser.c:14616)
TypeError: data type not understood
It's a fairly large csv (31k rows), so maybe that has something to do with it?
【Comments】:
-
I found that using object keeps the leading zeros: dtype={'USAF': object, 'WBAN': object}, from this post: stackoverflow.com/questions/13293810/…
-
It's a bit strange that str/np.str doesn't work properly... :S I wonder whether this is a bug; it may be worth posting as an issue on GitHub.
-
Yes, I found it strange too, since I can use other numeric dtypes there.
-
This is basically the exact issue raised two months ago: github.com/pydata/pandas/issues/3209 It looks like there are no plans to fix it.
-
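I think I remember Wes talking about this. I believe he said that using numpy's (fixed-length) string objects would be very expensive in many cases... when you are just passing in regular strings (since it uses the memory of the *largest* string at every element). I'll see if I can find it.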
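The dtype=object suggestion from the comments above can be tried on a small sample. This is a sketch, not a confirmed fix for the pandas 0.11.0 setup in the question: it assumes Python 3 and a recent pandas, and the in-memory `sample` is a hypothetical stand-in for the real file. The last two lines illustrate the fixed-length point, where a NumPy byte-string array reserves the width of its longest element for every element.

```python
import io

import numpy as np
import pandas as pd

# Two sample rows, trimmed to four columns (hypothetical stand-in for the
# real "Station Codes.csv").
sample = io.StringIO(
    '"USAF","WBAN","STATION NAME","CTRY"\n'
    '"006852","99999","SENT","SW"\n'
    '"007005","99999","CWOS 07005",""\n'
)

# Reading the code columns as object keeps them as Python strings,
# so the leading zeros survive.
df = pd.read_csv(sample, dtype={"USAF": object, "WBAN": object})
usaf_codes = df["USAF"].tolist()

# Fixed-length NumPy strings: every element is allotted the size of the longest one.
fixed = np.array([b"a", b"abcdefghij"])
bytes_per_element = fixed.dtype.itemsize  # 10 bytes, even for b"a"
```

With the codes kept as strings, no re-padding step is needed afterwards.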