Pandas read_csv ignoring column dtypes when I pass skip_footer arg: answers

【Question title】: Pandas read_csv ignoring column dtypes when I pass skip_footer arg
【Posted】: 2014-09-05 19:42:10
【Question】:

When I try to import a CSV file into a DataFrame, pandas (0.13.1) ignores the dtype parameter. Is there a way to stop pandas from inferring the data types on its own?

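I am merging several CSV files, and sometimes the CUSTOMER column contains letters, in which case pandas imports it as a string. When I then try to merge two DataFrames I get an error, because I am merging on two different types. I need everything stored as strings.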

Data snippet:

|WAREHOUSE|ERROR|CUSTOMER|ORDER NO|
|---------|-----|--------|--------|
|3615     |     |03106   |253734  |
|3615     |     |03156   |290550  |
|3615     |     |03175   |262207  |
|3615     |     |03175   |262207  |
|3615     |     |03175   |262207  |
|3615     |     |03175   |262207  |
|3615     |     |03175   |262207  |
|3615     |     |03175   |262207  |
|3615     |     |03175   |262207  |

Import line:

df = pd.read_csv("SomeFile.csv", 
                 header=1,
                 skip_footer=1, 
                 usecols=[2, 3], 
                 dtype={'ORDER NO': str, 'CUSTOMER': str})

df.dtypes output:

ORDER NO    int64
CUSTOMER    int64
dtype: object

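【Comments】:

I am using dtype as the answer suggests. It does not solve the problem.
0.13.1 isn't verbose about this. I think that because of usecols you are falling back to the python parser, which silently ignores dtype. Try 0.14.0: it will a) work, IIRC, and b) warn when this happens. You could also try forcing the engine with engine='c', at which point I think it will complain (even in 0.13.1).
0.13.1 doesn't complain even with an explicit engine='c'. I updated to 0.14.1 and it still doesn't work, but you are correct that it now says why: ValueError: Falling back to the 'python' engine because the 'c' engine does not support skip_footer, but this causes 'dtype' to be ignored as it is not supported by the 'python' engine. (Note the 'converters' option provides similar functionality.)
OK, yes, the warning is better. Another option is to convert explicitly after the DataFrame is created, e.g. df['ORDER NO'] = df['ORDER NO'].astype(object).
I need to keep the leading 0s, because sometimes everything is imported as a string (e.g. if CUSTOMER contains X3615). I suppose I could do df['CUSTOMER'] = df['CUSTOMER'].apply(lambda x: ('00000' + str(x))[-5:]) unless there is a better way.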
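
The padding idea in the last comment can also be written with pandas' vectorized string methods. This is only a minimal sketch: the `customers` series is made up for illustration, and the width of 5 is an assumption taken from the sample data.

```python
import pandas as pd

# Hypothetical sample: a column that was inferred as integers,
# so the leading zeros from the file are already gone.
customers = pd.Series([3106, 3156, 3175])

# Convert to strings, then left-pad with zeros to an assumed width of 5.
padded = customers.astype(str).str.zfill(5)

print(padded.tolist())  # ['03106', '03156', '03175']
```

Unlike the slicing lambda, `str.zfill` never truncates a value that is already longer than the target width.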

Tags: python python-2.7 csv pandas

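【Answer 1】:

It looks like Ripster's answer solved the OP's problem. For me, though, and this may seem obvious to some, the problem was that the header/column names in my CSV were all uppercase, while I had written them in lowercase in my dtype={...} in code. After I switched them all to uppercase, read_csv no longer ignored my explicit types. SQL is my first language, and there the case of column names mostly doesn't matter. That's a few hours I won't get back...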
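
The case mismatch described here is easy to reproduce. The sketch below uses an in-memory CSV modelled on the question's sample data (the `csv_text`, `wanted` and `unmatched` names are illustrative). It also shows one possible way to catch unmatched dtype keys by reading only the header first.

```python
import io
import pandas as pd

csv_text = (
    "WAREHOUSE,ERROR,CUSTOMER,ORDER NO\n"
    "3615,,03106,253734\n"
    "3615,,03156,290550\n"
)

# Lowercase keys do not match the uppercase header, so no dtype is applied
# and pandas infers integers, dropping the leading zeros.
wanted = {"customer": str, "order no": str}
bad = pd.read_csv(io.StringIO(csv_text), dtype=wanted)

# Keys that match the header exactly are honoured.
good = pd.read_csv(io.StringIO(csv_text), dtype={"CUSTOMER": str, "ORDER NO": str})

# Optional safety check: read only the header and compare it with the dtype keys.
header = pd.read_csv(io.StringIO(csv_text), nrows=0).columns
unmatched = sorted(set(wanted) - set(header))

print(bad["CUSTOMER"].tolist())
print(good["CUSTOMER"].tolist())
print(unmatched)
```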

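【Comments】:

【Answer 2】:

Unfortunately, neither converters nor a newer pandas version solves the more general problem of always ensuring that read_csv does not infer a float64 dtype. With pandas 0.15.2, the following example uses a CSV of hexadecimal integers with NULL entries. It shows that using converters for what their name implies they are for interferes with the dtype specification.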

In [1]: df = pd.DataFrame(dict(a = ["0xff", "0xfe"], b = ["0xfd", None], c = [None, "0xfc"], d = [None, None]))
In [2]: df.to_csv("H:/tmp.csv", index = False)
In [3]: ef = pd.read_csv("H:/tmp.csv", dtype = {c: object for c in "abcd"}, converters = {c: lambda x: None if x == "" else int(x, 16) for c in "abcd"})
In [4]: ef.dtypes.map(lambda x: x)
Out[4]:
a      int64
b    float64
c    float64
d     object
dtype: object

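The specified dtype of object is respected only for the all-NULL column. In this case the float64 values can simply be converted back to integers, but by the pigeonhole principle not every 64-bit integer can be represented as a float64.

The best solution I have found for this more general case is to have pandas read the potentially problematic columns as strings, as already covered, and then convert only the slice of values that needs conversion (rather than mapping the conversion over the whole column, which would again trigger automatic dtype = float64 inference).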
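
The representability limit can be checked in plain Python, whose `float` is an IEEE 754 double with a 53-bit significand. This is a small illustration of the point, not part of the original answer:

```python
# 2**53 + 1 is the smallest positive integer that a float64 cannot represent exactly.
big = 2**53 + 1
round_tripped = int(float(big))

print(big)            # 9007199254740993
print(round_tripped)  # 9007199254740992
print(big == round_tripped)  # False
```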

In [5]: ff = pd.read_csv("H:/tmp.csv", dtype = {c: object for c in "bc"}, converters = {c: lambda x: None if x == "" else int(x, 16) for c in "ad"})
In [6]: ff.dtypes
Out[6]:
a     int64
b    object
c    object
d    object
dtype: object
In [7]: for c in "bc":
   .....:     ff.loc[~pd.isnull(ff[c]), c] = ff[c][~pd.isnull(ff[c])].map(lambda x: int(x, 16))
   .....:
In [8]: ff.dtypes
Out[8]:
a     int64
b    object
c    object
d    object
dtype: object
In [9]: [(ff[c][i], type(ff[c][i])) for c in ff.columns for i in ff.index]
Out[9]:
[(255, numpy.int64),
 (254, numpy.int64),
 (253L, long),
 (nan, float),
 (nan, float),
 (252L, long),
 (None, NoneType),
 (None, NoneType)]

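As far as I have been able to determine, at least up to version 0.15.2 there is no way to avoid post-processing the string values in this kind of situation.

【Comments】:

【Answer 3】:

Pandas 0.13.1 silently ignored the dtype argument because the c engine does not support skip_footer. This caused pandas to fall back to the python engine, which does not support dtype.

The solution? Use converters: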
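
The post-processing approach above can be sketched as a small helper. This is a Python 3 sketch rather than the answerer's code: `parse_hex_column` is a hypothetical name, and the in-memory CSV mirrors columns a, b and c of the example.

```python
import io
import pandas as pd

csv_text = "a,b,c\n0xff,0xfd,\n0xfe,,0xfc\n"

# Read every column as text so pandas does not infer float64 for columns with gaps.
raw = pd.read_csv(io.StringIO(csv_text), dtype=str)

def parse_hex_column(column):
    """Return an object-dtype Series of Python ints, with None where the input is missing."""
    values = [None if pd.isna(v) else int(v, 16) for v in column]
    # dtype=object stops pandas from upcasting the mixed int/None values to float64.
    return pd.Series(values, index=column.index, dtype=object)

parsed = pd.DataFrame({name: parse_hex_column(raw[name]) for name in raw.columns})

print(parsed["a"].tolist())  # [255, 254]
print(parsed["b"].tolist())  # [253, None]
print(parsed["c"].tolist())  # [None, 252]
```

Keeping the columns as object dtype trades speed for exactness: the integers stay Python ints and are never routed through float64.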

df = pd.read_csv('SomeFile.csv', 
                 header=1,
                 skip_footer=1, 
                 usecols=[2, 3], 
                 converters={'CUSTOMER': str, 'ORDER NO': str},
                 engine='python')

Output:

In [1]: df.dtypes
Out[2]:
CUSTOMER    object
ORDER NO    object
dtype: object

In [3]: type(df['CUSTOMER'][0])
Out[4]: str

In [5]: df.head()
Out[6]:
  CUSTOMER ORDER NO
0    03106   253734
1    03156   290550
2    03175   262207
3    03175   262207
4    03175   262207

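The leading 0s from the original file are preserved and all data is stored as strings.

【Comments】:

How can I use converters to achieve the same thing, but for all columns and without specifying each column name?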
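
For readers who want to try this without the original file, here is a self-contained sketch of the same call on an in-memory CSV. The extra first line and the footer row are invented so that header=1 and the footer skip have something to act on. It uses `skipfooter`, the spelling used by later pandas releases, where the older `skip_footer` is no longer accepted.

```python
import io
import pandas as pd

# Made-up file layout: one line above the header and one footer row.
csv_text = (
    "Sample export,,,\n"
    "WAREHOUSE,ERROR,CUSTOMER,ORDER NO\n"
    "3615,,03106,253734\n"
    "3615,,03156,290550\n"
    "END,,,"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    header=1,          # the second line holds the column names
    skipfooter=1,      # drop the trailing footer row (python engine only)
    usecols=[2, 3],    # keep CUSTOMER and ORDER NO
    converters={"CUSTOMER": str, "ORDER NO": str},
    engine="python",
)

print(df["CUSTOMER"].tolist())  # ['03106', '03156']
print(df["ORDER NO"].tolist())  # ['253734', '290550']
```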
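
One possible way to do this, offered as a sketch rather than an answer from the thread, is to read only the header row first and build the converters dict from it. The in-memory CSV and its extra first and last lines are made up, and `skipfooter` is the spelling used by later pandas releases.

```python
import io
import pandas as pd

csv_text = (
    "Sample export,,,\n"
    "WAREHOUSE,ERROR,CUSTOMER,ORDER NO\n"
    "3615,,03106,253734\n"
    "3615,,03156,290550\n"
    "END,,,"
)

# First pass: read no data rows, just the header, to learn the column names.
columns = pd.read_csv(io.StringIO(csv_text), header=1, nrows=0).columns

# Second pass: apply str to every column without naming any of them by hand.
df = pd.read_csv(
    io.StringIO(csv_text),
    header=1,
    skipfooter=1,
    converters={name: str for name in columns},
    engine="python",
)

print(df["WAREHOUSE"].tolist())  # ['3615', '3615']
print(df["CUSTOMER"].tolist())   # ['03106', '03156']
```

With a real file, the same path would be passed to both calls. Newer pandas releases may also accept dtype=str together with the python engine, which would make the first pass unnecessary, but that is worth checking against the version in use.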