Convert pandas column (containing floats and NaN values) from float64 to nullable int8

【Question title】: Convert pandas column (containing floats and NaN values) from float64 to nullable int8
【Posted】: 2020-07-23 02:52:17
【Question description】:

I have a large dataframe that looks a bit like this:

    a   b   c
0   2.2 6.0 0.0
1   3.3 7.0 NaN
2   4.4 NaN 3.0
3   5.5 9.0 NaN

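Columns b and c contain float values that are either non-negative whole numbers or NaN. However, they are stored as float64, which is a problem because (without going into further detail) this dataframe is the input to a pipeline that requires these columns to be integers, so I would like to store them as such. The output should look like this: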

    a   b   c
0   2.2 6   0
1   3.3 7   NaN
2   4.4 NaN 3
3   5.5 9   NaN

I read in the pandas documentation that nullable integers are only supported by the pandas dtype "Int8" (note: this is not the same as np.int8), so naturally I tried this:

df = df.astype({'b':pd.Int8Dtype(), 'c':pd.Int8Dtype()})

This works when I run it in my Jupyter notebook, but when I integrate it into a larger function, I get this error:

TypeError: cannot safely cast non-equivalent float64 to int8
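
One way to track down which rows trigger this error is to look for non-missing values that are not whole numbers or that fall outside the int8 range, since either makes the cast to Int8 lossy. The sketch below is an illustration only: the helper name `find_unsafe_for_int8` and the sample data are made up for the example and do not come from the original post.

```python
import numpy as np
import pandas as pd


def find_unsafe_for_int8(series: pd.Series) -> pd.Series:
    """Return the non-missing values that cannot be represented exactly as int8.

    Hypothetical helper for illustration; not part of pandas.
    """
    info = np.iinfo(np.int8)  # int8 covers -128..127
    present = series.dropna()  # NaN is fine for a nullable dtype, so skip it
    not_whole = present != np.floor(present)
    out_of_range = (present < info.min) | (present > info.max)
    return present[not_whole | out_of_range]


# Made-up column: 2.5 is not a whole number, 300.0 does not fit in int8.
col = pd.Series([6.0, 7.0, np.nan, 2.5, 300.0])
bad = find_unsafe_for_int8(col)
print(bad)
```

If the helper returns an empty Series, every non-missing value is a whole number within range, so the cast should not be rejected on those grounds.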

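I understand why the error occurs: x == int(x) would be False for NaN values, so the program considers the conversion unsafe, even though every value is either NaN or a natural number. So next, I tried:

df = df.astype({'b':pd.Int8Dtype(), 'c':pd.Int8Dtype()}, errors='ignore')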

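I thought this would get rid of the "unsafe cast" problem, since I am 100% sure that all the float64 values are natural numbers. However, when I use this line, all my numbers are still stored as floats! Infuriating!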
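
According to the pandas documentation for `astype`, `errors='ignore'` suppresses the exception and returns the original object, which would explain why the columns stay float64. A minimal sketch with made-up data (the value 2.5 is chosen here only to force the cast to fail):

```python
import numpy as np
import pandas as pd

# Made-up frame: 2.5 is not a whole number, so the cast to Int8 is rejected.
df = pd.DataFrame({"b": [6.0, 2.5, np.nan, 9.0]})

# With errors="ignore" the failure is swallowed and the data comes back unconverted.
result = df.astype({"b": pd.Int8Dtype()}, errors="ignore")
print(result.dtypes)
```

In other words, `errors='ignore'` hides the failure rather than forcing the conversion, so the underlying cause still has to be found.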

Does anyone have a solution?

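【Question comments】:

You cannot store a column containing NaN as an integer type. You will have to replace the NaN values or handle this upstream.

Tags: python pandas dataframe integer nan

【Solution 1】: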

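I ran into exactly the same problem, which is what led me to this page. I don't have a really good solution and am still looking for one myself, but I did find a workaround. Before getting into it, I want to respond to the comment posted on the original question: being able to assign NA or even None values to a Series of a "simple" type such as int8 is the whole point of attempting these dtype conversions. Typical operations such as isna() (and so on) can be performed on a Series of these dtypes (see pd.IntXDtype(), where 'X' stands for the number of bits). The advantage I explored in using these dtypes is the memory footprint, for example: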

In[56]: test_df = pd.Series(np.zeros(1_000_000), dtype=np.float64)

In[57]: test_df.memory_usage()
Out[57]: 8000128

In[58]: test_df = pd.Series(np.zeros(1_000_000), dtype=pd.Int8Dtype())

In[59]: test_df.memory_usage()
Out[59]: 2000128

In[60]: test_df.iloc[:500_000] = None

In[61]: test_df.memory_usage()
Out[61]: 2000128

In[62]: test_df.isna().sum()
Out[62]: 500000

That way you get the best of both worlds.
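
The 2 bytes per element seen above are consistent with how nullable integer arrays are stored: one int8 value plus one boolean mask entry per element. A small sketch that measures this without the index overhead (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

n = 1_000_000
as_float = pd.Series(np.zeros(n), dtype=np.float64)
as_nullable = pd.Series(np.zeros(n), dtype=pd.Int8Dtype())

# index=False leaves out the RangeIndex, so only the column data is counted.
float_bytes = as_float.memory_usage(index=False)
nullable_bytes = as_nullable.memory_usage(index=False)

print(float_bytes, nullable_bytes)
print(float_bytes / nullable_bytes)
```

Setting some entries to missing only flips mask entries, which is why the footprint in the session above did not change after half the values were set to None.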

Now for the workaround:

In[33]: my_df
Out[33]: 
     a    s      d
0    0 -500 -1.000
1    1 -499 -0.998
2    2 -498 -0.996
3    3 -497 -0.994
4    4 -496 -0.992

In[34]: my_df.dtypes
Out[34]: 
a      int64
s      int64
d    float64
dtype: object

In[35]: df_converted_to_int_first = my_df.astype(
   ...:     dtype={
   ...:         'a': np.int8,
   ...:         's': np.int16,
   ...:         'd': np.float16,
   ...:     },
   ...: )

In[36]: df_converted_to_int_first
Out[36]: 
     a    s         d
0    0 -500 -1.000000
1    1 -499 -0.998047
2    2 -498 -0.996094
3    3 -497 -0.994141
4    4 -496 -0.992188

In[37]: df_converted_to_int_first.dtypes
Out[37]: 
a       int8
s      int16
d    float16
dtype: object

In[38]: df_converted_to_special_int_after = df_converted_to_int_first.astype(
   ...:     dtype={
   ...:         'a': pd.Int8Dtype(),
   ...:         's': pd.Int16Dtype(),
   ...:     }
   ...: )

In[39]: df_converted_to_special_int_after.dtypes
Out[39]: 
a       Int8
s      Int16
d    float16
dtype: object

In[40]: df_converted_to_special_int_after.a.iloc[3] = None

In[41]: df_converted_to_special_int_after
Out[41]: 
       a     s         d
0      0  -500 -1.000000
1      1  -499 -0.998047
2      2  -498 -0.996094
3   <NA>  -497 -0.994141
4      4  -496 -0.992188

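In my view this is still not an acceptable solution... but as stated above, it does constitute a workaround for the problem raised in the original question.

Edit: some tests were missing, going from np.float64 to pd.Int8Dtype():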

In[67]: my_df.astype(
   ...:     dtype={
   ...:         'a': np.int8,
   ...:         's': np.int16,
   ...:         'd': np.int16,
   ...:     },
   ...: ).astype(    
   ...:     dtype={
   ...:         'a': np.int8,
   ...:         's': np.int16,
   ...:         'd': pd.Int8Dtype(),
   ...:     },
   ...: ).dtypes

Out[67]: 
a     int8
s    int16
d     Int8
dtype: object
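
Note that the chain above goes through np.int16 first, and NumPy integer dtypes have no way to represent NaN, so it does not cover columns that actually contain missing values. For that case, one possible sketch, assuming a reasonably recent pandas and values that are meant to be whole numbers, is to round away floating-point noise and cast straight to the nullable dtype. The frame mirrors the example from the question; the rounding step is an added assumption and is only appropriate if the values really are integers.

```python
import numpy as np
import pandas as pd

# Same shape as the frame in the question.
df = pd.DataFrame(
    {
        "a": [2.2, 3.3, 4.4, 5.5],
        "b": [6.0, 7.0, np.nan, 9.0],
        "c": [0.0, np.nan, 3.0, np.nan],
    }
)

int_cols = ["b", "c"]
# round() keeps NaN as NaN and removes tiny floating-point deviations.
df[int_cols] = df[int_cols].round()
# Casting float64 to the nullable Int8 dtype turns NaN into <NA>.
df = df.astype({col: pd.Int8Dtype() for col in int_cols})

print(df.dtypes)
print(df)
```

Column a is left as float64, and values outside the int8 range would still need a wider dtype such as pd.Int16Dtype().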

【Comments】: