【Question title】: Cryptic warning pops up when doing pandas assignment with loc and iloc
【Posted】: 2015-04-28 10:13:34
【Question】:

I have this statement in my code:

df.loc[i] = [df.iloc[0][0], i, np.nan]

where i is the iteration variable of the for loop that contains the statement, np is the numpy module I imported, and df is a DataFrame that looks like this:

   build_number   name  cycles
0           390  adpcm   21598
1           390    aes    5441
2           390  dfadd     463
3           390  dfdiv    1323
4           390  dfmul     167
5           390  dfsin   39589
6           390    gsm    6417
7           390   mips    4205
8           390  mpeg2    1993
9           390    sha  348417

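As you can see, the statement inserts a new row into my DataFrame df and fills the last column, cycles, with a NaN value in the newly inserted row.

However, when I do this, I get the following warning message: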

/usr/local/bin/ipython:28: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

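After reading the documentation, I still don't understand what problem or risk I am running into here. I thought that by using loc and iloc I was already following the recommendation?

Thanks.

Edit: at @EdChum's request, here is the function in which the statement above is used: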

def patch_missing_benchmarks(refined_dataframe):
  '''
  Patches up a given DataFrame, ensuring that all build_numbers have the complete
  set of benchmark names, inserting NaN values at the column where the data is
  supposed to be residing in.

  Accepts:
  --------
  * refined_dataframe
  DataFrame that was returned from the remove_early_retries() function and that
  contains no duplicates of benchmarks within a given build number and also has been
  sorted nicely to ensure that build numbers are in alphabetical order.
  However, this function can also accept the DataFrame that has not been sorted, so
  long as it has no repetition of benchmark names within a given build number.

  Returns:
  -------
  * patched_benchmark_df
  DataFrame with all Build numbers filled with the complete set of benchmark data,
  with those previously missing benchmarks now having NaN values for their data.
  '''
  patched_df_list = []
  benchmark_list = ['adpcm', 'aes', 'blowfish', 'dfadd', 'dfdiv', 'dfmul',
                    'dfsin', 'gsm', 'jpeg', 'mips', 'mpeg2', 'sha']
  benchmark_series = pd.Series(data = benchmark_list)

  for number in refined_dataframe['build_number'].drop_duplicates().values:
    # df must be a DataFrame whose data has been sorted according to build_number
    # followed by benchmark name
    df = refined_dataframe.query('build_number == %d' % number)

    # Now we compare the benchmark names present in our section of the DataFrame
    # with the Series containing the complete collection of Benchmark names and
    # get back a boolean DataFrame telling us precisely what benchmark names
    # are missing
    boolean_bench = benchmark_series.isin(df['name'])
    list_names = []
    for i in range(0, len(boolean_bench)):
      if boolean_bench[i] == False:
        name_to_insert = benchmark_series[i]
        list_names.append(name_to_insert)
      else:
        continue
    print 'These are the missing benchmarks for build number',number,':'
    print list_names

    for i in list_names:
      # create a new row with index that is benchmark name itself to avoid overwriting
      # any existing data, then insert the right values into that row, filling in the
      # space name with the right benchmark name, and missing data with NaN
      df.loc[i] = [df.iloc[0][0], i, np.nan]

      patched_for_benchmarks_df = df.sort_index(by=['build_number',
                                            'name']).reset_index(drop = True)

      patched_df_list.append(patched_for_benchmarks_df)

    # we make sure we call a dropna method at threshold 2 to drop those rows whose benchmark
    # names as well as cycles names are NaN, leaving behind the newly inserted rows with
    # benchmark names but that now have the data as NaN values
    patched_benchmark_df = pd.concat(objs = patched_df_list, ignore_index =
                                 True).sort_index(by= ['build_number',
                                'name']).dropna(thresh = 2).reset_index(drop = True)

    return patched_benchmark_df

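【Comments】:

  • I don't even know anything about pandas, but reading the documentation you linked makes me think you need to change df.iloc[0][0] to df.iloc[:, (0, 0)]
  • Even though you are using iloc, you are double subscripting, which is what raises the warning. Can you show the code that uses this line? It's a bit unclear to me; please edit it into your question.
  • I think that to make your code more readable it would be better to use df.iloc[0]['build_number']
  • Yes, that is more readable, but isn't that still double subscripting? Also, I have noticed that .iloc seems to work like .iloc[index_number][column_name OR column_number]; is that right? The first subscript says which row (by index) to fetch, and the second says which column's value to take from that row? I just want to verify my understanding.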
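To make the row-then-column reading from the last comment concrete, here is a minimal sketch that compares the chained lookup with single-step lookups. The three-row frame is made-up sample data modeled on the question's df, not the asker's real data.

```python
import pandas as pd

# Hypothetical sample data shaped like the df in the question.
df = pd.DataFrame({
    "build_number": [390, 390, 390],
    "name": ["adpcm", "aes", "dfadd"],
    "cycles": [21598, 5441, 463],
})

# Chained form: df.iloc[0] first builds a Series for row 0,
# then ["build_number"] looks up one label in that Series.
chained = df.iloc[0]["build_number"]

# Single-step forms: row and column are resolved in one indexing call.
by_position = df.iloc[0, 0]               # row position 0, column position 0
by_label = df.loc[0, "build_number"]      # row label 0, column label
scalar_access = df.at[0, "build_number"]  # scalar accessor, label based
```

All four expressions read the same cell. The single-step forms avoid creating the intermediate row Series, which is why they are usually preferred, especially on the left-hand side of an assignment.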

Tags: python pandas numpy pandas-loc


【Answer 1】:

Without seeing how you got to this point: if you just want to set the 'cycles' column, then the following will work without raising any warning:

In [344]:

for i in range(len(df)):
    df.loc[i,'cycles'] = np.nan
df
Out[344]:
   build_number   name  cycles
0           390  adpcm     NaN
1           390    aes     NaN
2           390  dfadd     NaN
3           390  dfdiv     NaN
4           390  dfmul     NaN
5           390  dfsin     NaN
6           390    gsm     NaN
7           390   mips     NaN
8           390  mpeg2     NaN
9           390    sha     NaN

If you just want to set the whole column, there is no need to loop; just do this: df['cycles'] = np.nan
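For the row-insertion case described in the question, the warning most likely arises because df is a subset produced by refined_dataframe.query(...), so pandas cannot tell whether the assignment is meant to reach the original frame. A minimal sketch, assuming an explicit copy of the subset is acceptable; the sample data and the missing_name loop variable are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the refined_dataframe in the question.
refined_dataframe = pd.DataFrame({
    "build_number": [390, 390, 391],
    "name": ["adpcm", "aes", "adpcm"],
    "cycles": [21598, 5441, 20000],
})

# .copy() makes the subset an independent DataFrame, so later
# assignments are unambiguous and never touch refined_dataframe.
df = refined_dataframe.query("build_number == 390").copy()

for missing_name in ["blowfish", "jpeg"]:
    # df.iloc[0, 0] reads the build number in a single indexing step.
    df.loc[missing_name] = [df.iloc[0, 0], missing_name, np.nan]
```

The new rows are labeled by benchmark name, as in the original code, and refined_dataframe keeps its three rows unchanged.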

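【Comments】:

  • Thanks for the suggestion, but I don't want to set the whole column to NaN. Rather, my situation is that some names are missing and I need to insert them to complete a build_number; when I insert them, I want the corresponding cycles value in the same row to be NaN. For example, referring to the df in my question above, I want to insert the extra rows 390 jpeg NaN and 390 blowfish NaN to complete the list of all names for the given build_number 390.
  • So you just want to append new rows? Could you post the desired output in your question?
  • For my desired output, see my post at stackoverflow.com/questions/28739931/…; both of the dataframes I was trying to multiply there are the desired output. I have no problem getting the desired output; I am only concerned about the warning message and whether I can and should avoid it.
  • Well, you should follow the semantics in my answer and avoid double subscripting, which is probably why you get this error
  • OK, this is really interesting: now, even when I do the double subscripting with [0][0] my own way, I no longer get any error. That's weird... but thanks for the suggestion. The form in your answer, df.loc[i,'cycles'] = np.nan, may not work for me because I can't hard-code cycles; it has to work for fmax as well. Besides that, I don't want to iterate over every index. I want to iterate over the list of missing benchmark names and insert each of them into the dataframe.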
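One way to meet both constraints from the last comment (no hard-coded value column, no per-index loop) is to reindex against every expected (build_number, name) pair. This is an alternative technique, not the approach used in the question or the answer, and the data below is made up. It assumes each pair occurs at most once, as the function's docstring states.

```python
import pandas as pd

# Hypothetical, shortened benchmark list and sample data.
benchmark_list = ["adpcm", "aes", "blowfish", "jpeg"]
refined_dataframe = pd.DataFrame({
    "build_number": [390, 390, 391, 391, 391],
    "name": ["adpcm", "aes", "adpcm", "aes", "jpeg"],
    "cycles": [21598, 5441, 21000, 5400, 1200],
})

# Every (build_number, name) pair that should exist.
full_index = pd.MultiIndex.from_product(
    [refined_dataframe["build_number"].unique(), benchmark_list],
    names=["build_number", "name"],
)

patched = (
    refined_dataframe
    .set_index(["build_number", "name"])
    .reindex(full_index)   # pairs absent from the data become NaN rows
    .reset_index()
)
```

Because the value column is never named, the same code would apply unchanged to a frame whose data column is called fmax instead of cycles.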