Why does a copy get created when assigned with None? (Answers)

【Question title】: Why does a copy get created when assigned with None?
【Posted】: 2014-10-29 14:09:51
【Question】:

In[216]: foo = pd.DataFrame({'a':[1,2,3], 'b':[3,4,5]})
In[217]: bar = foo.ix[:1]
In[218]: bar
Out[218]: 
   a  b
0  1  3
1  2  4

A view is created, as expected.

In[219]: bar['a'] = 100
In[220]: bar
Out[220]: 
     a  b
0  100  3
1  100  4
In[221]: foo
Out[221]: 
     a  b
0  100  3
1  100  4
2    3  5

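If the view is modified, the original DataFrame foo is modified as well. However, if the assignment is done with None, a copy seems to be made instead. Can anyone shed some light on what is happening, and on the logic behind it?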

In[222]: bar['a'] = None
In[223]: bar
Out[223]: 
      a  b
0  None  3
1  None  4
In[224]: foo
Out[224]: 
     a  b
0  100  3
1  100  4
2    3  5

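【Question comments】:

I don't know the details of Pandas the way I do with numpy, but I'd be willing to bet that by forcing the column to change its dtype from I4 to object, you cause it to allocate a new array for that column, and you then write to that new array instead of the one shared with the original DataFrame. (I'm posting this as a comment rather than an answer because, even if I'm right, a good answer should explain exactly how this works rather than just hand-waving...)
@abarnert That is exactly what happens under the hood. Go ahead and post it as an answer.
@Jeff: OK, but I still think it would be better to give a pointer to the explanation in the docs than what a numpy user can guess about how Pandas is probably implemented...
I put up an answer. This is well warned about and documented in a number of places. If users don't read the docs, there is not much that can be done.
Thanks, Jeff and others! I did come across the "Returning a view versus a copy" section of the docs. Sorry for not going through it in detail. Will do so now :)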

Tags: python pandas dataframe

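【Answer 1】:

When you assign bar['a'] = None, you force the column to change its dtype from, e.g., I4 to object.

Doing so forces it to allocate a new object array for that column, and the write then of course goes to that new array instead of the old array that is shared with the original DataFrame.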
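
The same mechanism can be seen at the numpy level. The following is only a minimal sketch, not a description of pandas internals: the names original, view and as_object are made up for illustration, and the explicit astype(object) call stands in for whatever reallocation pandas performs when the column dtype changes.

```python
import numpy as np

original = np.array([1, 2, 3], dtype=np.int32)

# Basic slicing returns a view that shares memory with `original`.
view = original[:2]
view[:] = 100  # the write goes through to `original`

# An int32 buffer cannot store None, so holding None requires a new
# object-dtype array. astype() allocates that new array (a copy).
as_object = view.astype(object)
as_object[:] = None  # only the new array is written to

print(original)                               # first two elements are 100
print(np.shares_memory(original, view))       # the slice shares memory
print(np.shares_memory(original, as_object))  # the object array does not
```

The integer write reaches the shared buffer, while the None write lands in a freshly allocated array, which is why the original data is left untouched in the second case.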

【Comments】:

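【Answer 2】:

You are doing a form of chained assignment; see here for why this is a really bad idea.

See this question as well: here

Pandas will generally warn you that you are modifying a view (even more so in 0.15.0).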

In [49]: foo = pd.DataFrame({'a':[1,2,3], 'b':[3,4,5]})

In [51]: foo
Out[51]: 
   a  b
0  1  3
1  2  4
2  3  5

In [52]: bar = foo.ix[:1]

In [53]: bar
Out[53]: 
   a  b
0  1  3
1  2  4

In [54]: bar.dtypes
Out[54]: 
a    int64
b    int64
dtype: object

# this is an internal method (but is for illustration)
In [56]: bar._is_view
Out[56]: True

# this will warn in 0.15.0
In [57]: bar['a'] = 100
/usr/local/bin/ipython:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  #!/usr/local/bin/python

In [58]: bar._is_view
Out[58]: True

# bar is now a copied object (and will replace the existing dtypes with new ones).
In [59]: bar['a'] = None

In [60]: bar.dtypes
Out[60]: 
a    object
b     int64
dtype: object

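You should never rely on whether something is a view (even in numpy), except in certain very performance-critical situations. It is not a guaranteed construct; it depends on the memory layout of the underlying data.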
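
As a small numpy illustration of why view-ness is easy to get wrong, the sketch below selects the same two elements in two ways. The variable names are hypothetical; np.shares_memory is used only to inspect the result.

```python
import numpy as np

data = np.arange(6)

sliced = data[:2]      # basic slicing
picked = data[[0, 1]]  # integer-array ("fancy") indexing, same elements

# Both selections hold the same values...
same_values = sliced.tolist() == picked.tolist()

# ...but only the basic slice shares memory with `data`.
shares_sliced = np.shares_memory(data, sliced)
shares_picked = np.shares_memory(data, picked)

print(same_values, shares_sliced, shares_picked)
```

Two expressions that look interchangeable can differ in whether writes propagate back, so code that depends on that propagation is fragile.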

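You should very, very rarely try to make a write propagate through a view. Doing this in pandas is almost always going to cause trouble once you mix dtypes. (In numpy you can only have a view on a single dtype; I am not even sure what a view on a multi-dtyped array that changes the dtype does, or whether it is even allowed.)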
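
One way to avoid depending on view semantics is to state the intent explicitly. This is a hedged sketch rather than the answerer's own code: it reuses foo and bar from the question, follows the .loc suggestion from the warning text, and adds an explicit .copy() as an assumption about what an independent subset would look like.

```python
import pandas as pd

foo = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 4, 5]})

# To change the original, write to it directly with a single .loc call.
# Label-based slicing with .loc[:1] includes the row labelled 1.
foo.loc[:1, 'a'] = 100

# To work on an independent subset, ask for a copy explicitly.
bar = foo.loc[:1].copy()
bar['a'] = None  # affects only `bar`

print(foo)
print(bar)
```

Either way, the result no longer depends on whether the intermediate object happens to be a view.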

【Comments】: