FutureWarning: elementwise comparison failed; returning scalar, but in the future will perform elementwise comparison

【Question title】: FutureWarning: elementwise comparison failed; returning scalar, but in the future will perform elementwise comparison
【Posted】: 2017-04-01 05:59:49
【Question】:

I am using Pandas 0.19.1 on Python 3 and I get a warning on the lines of code below. I am trying to get a list of all the row numbers where the string Peter is present in the column Unnamed: 5.

import pandas as pd

df = pd.read_excel(xls_path)
myRows = df[df['Unnamed: 5'] == 'Peter'].index.tolist()

It produces a warning:

"\Python36\lib\site-packages\pandas\core\ops.py:792: FutureWarning: elementwise 
comparison failed; returning scalar, but in the future will perform 
elementwise comparison 
result = getattr(x, name)(y)"

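What is this FutureWarning, and should I ignore it, since the code seems to work?

【Question comments】:

Tags: python python-3.x pandas numpy matplotlib

【Solution 1】:

This FutureWarning does not come from Pandas; it comes from numpy, and the bug also affects matplotlib and others. Here is how to reproduce the warning nearer to the source of the trouble: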

import numpy as np
print(np.__version__)   # Numpy version '1.12.0'
'x' in np.arange(5)       #Future warning thrown here

FutureWarning: elementwise comparison failed; returning scalar instead, but in the 
future will perform elementwise comparison
False

Another way to reproduce this bug, using the double-equals operator:

import numpy as np
np.arange(5) == np.arange(5).astype(str)    #FutureWarning thrown here

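An example of Matplotlib being affected by this FutureWarning in its quiver plot implementation: https://matplotlib.org/examples/pylab_examples/quiver_demo.html

What is going on here?

Numpy and native python disagree about what should happen when you compare a string to numpy's numeric types. Notice that the right operand is python's turf, a primitive string, and the operation in the middle is python's turf, but the left operand is numpy's turf. Should the result be a Python-style scalar or a Numpy-style ndarray of bool? Numpy says ndarray of bool; Pythonic developers disagree. Classic standoff.

Should it be an elementwise comparison, or a scalar saying whether the item exists in the array?

If your code or library uses the in or == operators to compare a python string to numpy ndarrays, the two are not compatible, so if you try it, you get a scalar back, but only for now. The warning indicates that this behavior might change in the future, so your code will puke all over the carpet if python/numpy decide to adopt the Numpy style.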
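To check which side of the standoff an installed numpy currently takes, one option is to record warnings around the comparison and look at what comes back. This is only a sketch and not part of the original answer. Older releases such as the 1.12.0 shown above return a scalar False together with the FutureWarning; the assumption here is that newer releases return a bool array without the warning.

```python
import warnings

import numpy as np

# Record every warning raised by the comparison instead of printing it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    result = np.arange(5) == 'x'

# A plain bool means the scalar behavior; an ndarray means elementwise behavior.
print(np.__version__)
print(type(result).__name__)
print([w.category.__name__ for w in caught])
```

Either way, no element of the integer array equals the string, so the result is falsy in both cases.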

Bug reports submitted:

Numpy and Python are in a standoff; for now the operation returns a scalar, but that may change in the future.

https://github.com/numpy/numpy/issues/6784

https://github.com/pandas-dev/pandas/issues/7830

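Two workarounds:

Either lock down your versions of python and numpy, ignore the warnings and expect the behavior not to change, or convert both the left and right operands of == and in to a numpy type or to a primitive python numeric type.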
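The second option, putting both operands on common turf, could look like the sketch below. The variable names are illustrative and not from the original answer, and it assumes the array holds integers and the string is a numeral.

```python
import numpy as np

arr = np.arange(5)

# Option A: move the array onto the string's turf before comparing.
as_text = arr.astype(str)
text_mask = as_text == '3'          # elementwise comparison between strings
print(text_mask)

# Option B: move the string onto the array's turf (only if it parses as a number).
needle = int('3')
number_mask = arr == needle         # elementwise comparison between integers
print(number_mask)
print(needle in arr)
```

Both masks mark the same position, and neither comparison mixes a string with a numeric array.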

Suppress the warning globally:

import warnings
import numpy as np
warnings.simplefilter(action='ignore', category=FutureWarning)
print('x' in np.arange(5))   #returns False, without Warning

Suppress the warning on a line-by-line basis:

import warnings
import numpy as np

with warnings.catch_warnings():
    warnings.simplefilter(action='ignore', category=FutureWarning)
    print('x' in np.arange(2))   #returns False, warning is suppressed

print('x' in np.arange(10))   #returns False, Throws FutureWarning

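Just suppress the warning by name, then put a loud comment next to it that mentions the current versions of python and numpy, says this code is brittle and requires these versions, and links to this page. Kick the can down the road.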
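Both snippets above silence every FutureWarning. A narrower sketch uses the message argument of warnings.filterwarnings, a regular expression matched against the start of the warning text, so that only this particular warning is ignored. The emit helper is a hypothetical stand-in that raises the warnings directly, which keeps the example independent of the installed numpy version.

```python
import warnings


def emit(message):
    """Hypothetical stand-in: raise a FutureWarning with the given text."""
    warnings.warn(message, FutureWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    # Ignore only FutureWarnings whose message starts with this text.
    warnings.filterwarnings(
        'ignore',
        message='elementwise comparison failed',
        category=FutureWarning,
    )
    emit('elementwise comparison failed; returning scalar instead')
    emit('some other deprecation notice')

# Only the unrelated warning was recorded.
print([str(w.message) for w in caught])
```

Other FutureWarnings still get through, so unrelated deprecations are not hidden along with this one.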

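TLDR: pandas are the Jedi; numpy are the Hutts; python is the Galactic Empire.

【Comments】:

Ugh. So if I have some quantity thing (which may or may not be a numpy type; I don't know) and I want to test thing == 'some string' and get a simple bool result, what do I do? np.atleast_1d(thing)[0] == 'some string'? But that is not robust against some joker putting 'some string' in the first element of an array. I guess I have to test the type of thing first and only do the == test if it is a string (or not a numpy object).
Actually, this FutureWarning is also raised whenever you try to compare a numpy.ndarray with an empty list. For example, running np.array([1, 2]) == [] raises the warning as well.
I would find it helpful to see an example of doing this: or babysit your left and right operands to be from a common turf
The level of quality information on this issue is amazing.
So I would like to get rid of the warning for this code: df.loc[df.cName == '', 'cName'] = '10004'. In other words, what is the pandas/numpy equivalent of python's '' (empty string)?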

【Solution 2】:

I got the same error when I tried to set index_col while reading a file into a Pandas DataFrame:

df = pd.read_csv('my_file.tsv', sep='\t', header=0, index_col=['0'])  ## or same with the following
df = pd.read_csv('my_file.tsv', sep='\t', header=0, index_col=[0])

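I had never encountered this error before. I am still trying to figure out the reason behind it (using @Eric Leschinski's explanation and others).

In any case, until I work out the reason, the following approach solves the problem for now: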

df = pd.read_csv('my_file.tsv', sep='\t', header=0)  ## not setting the index_col
df.set_index(['0'], inplace=True)

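I will update this as soon as I figure out the reason for this behavior.

【Comments】:

I have the same problem with @987654325@. It seems to me that pandas needs a fix.
Thanks! Saved me a lot of work, I guess. pd.__version__: 0.22.0; np.__version__: 1.15.4
Same issue here; apparently there is some numpy call inside read_csv when the index_col parameter is used. I tested two setups with different results: 1. numpy 1.19.2, Pandas 1.1.2: FutureWarning: elementwise comparison failed... 2. numpy 1.19.2, Pandas 1.1.3: TypeError: ufunc 'isnan' not supported...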

【Solution 3】:

In my experience, the same warning message was caused by a TypeError.

TypeError: invalid type comparison

So you may want to check the data type of Unnamed: 5

for x in df['Unnamed: 5']:
  print(type(x))  # are they 'str' ?
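For a long column, a tally of the element types is easier to read than one printed line per row. The following is a sketch on made-up data; only the column name Unnamed: 5 comes from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type column standing in for the spreadsheet data.
df = pd.DataFrame({'Unnamed: 5': ['Peter', np.nan, 3, 'Joe']})

# The column dtype alone only says "object"; count the Python type of each cell.
print(df['Unnamed: 5'].dtype)
type_counts = df['Unnamed: 5'].map(lambda value: type(value).__name__).value_counts()
print(type_counts)
```

Any entry other than str in the tally points at cells that will not compare cleanly against a string.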

Here is how I can reproduce the warning message:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 2), columns=['num1', 'num2'])
df['num3'] = 3
df.loc[df['num3'] == '3', 'num3'] = 4  # TypeError and the Warning
df.loc[df['num3'] == 3, 'num3'] = 4  # No Error

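Hope it helps.

【Comments】:

Your code has a lot of unnecessary moving parts for illustrating the warning. Pandas is gracing you with that additional TypeError, but that is damage control from Pandas; the source warning is a disagreement between Numpy and Python, and it occurs when df['num3'] == '3' is evaluated.
df.loc[df['num3'] == 3, 'num3'] = 4 # No Error. This part helped me. Thanks

【Solution 4】:

I can't beat Eric Leschinski's awesomely detailed answer, but here is a quick workaround for the original question that I don't think has been mentioned yet: put the string in a list and use .isin instead of ==

For example: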

import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Peter", "Joe"], "Number": [1, 2]})

# Raises warning using == to compare different types:
df.loc[df["Number"] == "2", "Number"]

# No warning using .isin:
df.loc[df["Number"].isin(["2"]), "Number"]

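【Comments】:

I wonder whether I could do the same thing with this syntax -> if "-" in dfN['Drate'].unique()

【Solution 5】:

A quick workaround for this is to use numpy.core.defchararray. I faced the same warning message and was able to resolve it using that module.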

import numpy.core.defchararray as npd
resultdataset = npd.equal(dataset1, dataset2)

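【Comments】:

【Solution 6】:

Eric's answer helpfully explains that the trouble comes from comparing a Pandas Series (containing a NumPy array) to a Python string. Unfortunately, both of his workarounds just suppress the warning.

To write code that does not cause the warning in the first place, explicitly compare your string to each element of the Series and get a separate bool for each one. For example, you could use map and an anonymous function.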

myRows = df[df['Unnamed: 5'].map( lambda x: x == 'Peter' )].index.tolist()

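【Comments】:

【Solution 7】:

If your arrays aren't too big or you don't have too many of them, you might be able to get away with forcing the left-hand side of == to be a string: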

myRows = df[str(df['Unnamed: 5']) == 'Peter'].index.tolist()

But this is about 1.5 times slower if df['Unnamed: 5'] is a string, 25-30 times slower if df['Unnamed: 5'] is a small numpy array (length = 10), and 150-160 times slower if it is a numpy array of length 100 (times averaged over 500 trials).

import time
from numpy import linspace, zeros, mean

a = linspace(0, 5, 10)
b = linspace(0, 50, 100)
n = 500
string1 = 'Peter'
string2 = 'blargh'
times_a = zeros(n)
times_str_a = zeros(n)
times_s = zeros(n)
times_str_s = zeros(n)
times_b = zeros(n)
times_str_b = zeros(n)
for i in range(n):
    t0 = time.time()
    tmp1 = a == string1
    t1 = time.time()
    tmp2 = str(a) == string1
    t2 = time.time()
    tmp3 = string2 == string1
    t3 = time.time()
    tmp4 = str(string2) == string1
    t4 = time.time()
    tmp5 = b == string1
    t5 = time.time()
    tmp6 = str(b) == string1
    t6 = time.time()
    times_a[i] = t1 - t0
    times_str_a[i] = t2 - t1
    times_s[i] = t3 - t2
    times_str_s[i] = t4 - t3
    times_b[i] = t5 - t4
    times_str_b[i] = t6 - t5
print('Small array:')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_a), mean(times_str_a)))
print('Ratio of time with/without string conversion: {}'.format(mean(times_str_a)/mean(times_a)))

print('\nBig array')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_b), mean(times_str_b)))
print(mean(times_str_b)/mean(times_b))

print('\nString')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_s), mean(times_str_s)))
print('Ratio of time with/without string conversion: {}'.format(mean(times_str_s)/mean(times_s)))

Results:

Small array:
Time to compare without str conversion: 6.58464431763e-06 s. With str conversion: 0.000173756599426 s
Ratio of time with/without string conversion: 26.3881526541

Big array
Time to compare without str conversion: 5.44309616089e-06 s. With str conversion: 0.000870866775513 s
159.99474375821288

String
Time to compare without str conversion: 5.89370727539e-07 s. With str conversion: 8.30173492432e-07 s
Ratio of time with/without string conversion: 1.40857605178
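One caveat when using this for row selection: str() applied to a whole Series or array produces a single string (its printed representation), so str(...) == 'Peter' is one bool rather than a per-row mask. If a mask is what is needed, a possible alternative is astype(str), which converts each element. This is a sketch on made-up data, not something benchmarked above.

```python
import pandas as pd

# Hypothetical column mixing strings and numbers.
df = pd.DataFrame({'Unnamed: 5': ['Peter', 7, 'Joe', 'Peter']})

# str() on the whole Series gives one string, so this is a single bool.
whole = str(df['Unnamed: 5']) == 'Peter'
print(whole)

# astype(str) converts element by element, so this is a per-row mask.
mask = df['Unnamed: 5'].astype(str) == 'Peter'
my_rows = df[mask].index.tolist()
print(my_rows)
```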

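【Comments】:

Putting str on the left of == was a good solution for me; it barely affects performance on 1.5 million rows, and the data will not get any bigger than that.

【Solution 8】:

In my case, the warning came from perfectly ordinary boolean indexing, because the series contained only np.nan. Demonstration (pandas 1.0.3):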

>>> import pandas as pd
>>> import numpy as np
>>> pd.Series([np.nan, 'Hi']) == 'Hi'
0    False
1     True
>>> pd.Series([np.nan, np.nan]) == 'Hi'
~/anaconda3/envs/ms3/lib/python3.7/site-packages/pandas/core/ops/array_ops.py:255: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  res_values = method(rvalues)
0    False
1    False

I think that with pandas 1.0 they really want you to use the new 'string' data type, which allows pd.NA values:

>>> pd.Series([pd.NA, pd.NA]) == 'Hi'
0    False
1    False
>>> pd.Series([np.nan, np.nan], dtype='string') == 'Hi'
0    <NA>
1    <NA>
>>> (pd.Series([np.nan, np.nan], dtype='string') == 'Hi').fillna(False)
0    False
1    False

I am not a fan of them tinkering with everyday functionality such as boolean indexing.
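Another possibility that keeps plain True/False results is to cast the all-NaN series to object before comparing. This is a sketch, assuming that pandas compares object columns element by element in Python, so numpy never has to compare a float array with a string.

```python
import numpy as np
import pandas as pd

# An all-NaN series is float64 by default; cast it to object before comparing.
all_nan = pd.Series([np.nan, np.nan])
result = all_nan.astype(object) == 'Hi'
print(result)
```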

【Comments】:

【Solution 9】:

I got this warning because I thought my column contained empty strings, but when I checked, it contained np.nan!

if df['column'] == '':

Changing my column to empty strings helped :)
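A sketch of both checks on a made-up column: isna() finds the missing cells directly, and fillna('') is one way to turn them into empty strings before comparing against ''.

```python
import numpy as np
import pandas as pd

# Hypothetical column holding an empty string, a missing value and a real value.
df = pd.DataFrame({'column': ['', np.nan, 'x']})

# NaN is not an empty string, so the two tests flag different rows.
is_missing = df['column'].isna()
is_empty = df['column'] == ''
print(is_missing.tolist())
print(is_empty.tolist())

# After filling, missing cells count as empty strings too.
filled = df['column'].fillna('')
is_empty_after_fill = filled == ''
print(is_empty_after_fill.tolist())
```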

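【Comments】:

【Solution 10】:

I compared a few of the possible ways of doing this, including pandas, several numpy methods, and a list comprehension.

First, let's start with a baseline: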

>>> import numpy as np
>>> import operator
>>> import pandas as pd

>>> x = [1, 2, 1, 2]
>>> %time count = np.sum(np.equal(1, x))
>>> print("Count {} using numpy equal with ints".format(count))
CPU times: user 52 µs, sys: 0 ns, total: 52 µs
Wall time: 56 µs
Count 2 using numpy equal with ints

So our baseline is that the count should be the correct value, 2, and that it should take about 50 µs.

Now we try the naive method:

>>> x = ['s', 'b', 's', 'b']
>>> %time count = np.sum(np.equal('s', x))
>>> print("Count {} using numpy equal".format(count))
CPU times: user 145 µs, sys: 24 µs, total: 169 µs
Wall time: 158 µs
Count NotImplemented using numpy equal
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  """Entry point for launching an IPython kernel.

Here we get the wrong answer (NotImplemented != 2), it takes a long time, and it throws the warning.

So we try another naive method:

>>> %time count = np.sum(x == 's')
>>> print("Count {} using ==".format(count))
CPU times: user 46 µs, sys: 1 µs, total: 47 µs
Wall time: 50.1 µs
Count 0 using ==

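Again, the wrong answer (0 != 2). This one is even more insidious because there is no warning at all (0 can be passed around just like 2).

Now, let's try a list comprehension: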

>>> %time count = np.sum([operator.eq(_x, 's') for _x in x])
>>> print("Count {} using list comprehension".format(count))
CPU times: user 55 µs, sys: 1 µs, total: 56 µs
Wall time: 60.3 µs
Count 2 using list comprehension

We get the right answer here, and it is pretty fast!

Another possibility, pandas:

>>> y = pd.Series(x)
>>> %time count = np.sum(y == 's')
>>> print("Count {} using pandas ==".format(count))
CPU times: user 453 µs, sys: 31 µs, total: 484 µs
Wall time: 463 µs
Count 2 using pandas ==

Slow, but correct!

And finally, the option I am going to use: casting the numpy array to the object type:

>>> x = np.array(['s', 'b', 's', 'b']).astype(object)
>>> %time count = np.sum(np.equal('s', x))
>>> print("Count {} using numpy equal".format(count))
CPU times: user 50 µs, sys: 1 µs, total: 51 µs
Wall time: 55.1 µs
Count 2 using numpy equal

Fast and correct!
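Two related variants that were not timed above, given only as a sketch. The second naive attempt returned 0 because x was still a Python list, so x == 's' compared the whole list with the string; converting the list to an array first makes the comparison elementwise. The last lines apply the object cast to a membership test on a numeric array.

```python
import numpy as np

x = ['s', 'b', 's', 'b']

# Convert the list to a string array first; == then compares element by element.
count = int(np.count_nonzero(np.asarray(x) == 's'))
print(count)

# With an object array, the membership test falls back to Python equality per element.
numbers = np.arange(5).astype(object)
found = 'x' in numbers
print(found)
```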

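【Comments】:

So IIUC, to fix 'x' in np.arange(5), you suggest simply doing 'x' in np.arange(5).astype(object) (or similarly: 'x' == np.arange(5).astype(object)). Right? IMHO this is the most elegant workaround shown here, so I am confused by the lack of upvotes. Maybe edit your answer to start with the bottom line and then move on to the nice performance analysis?
Thanks @Oren, I'll try that and see where it gets me.

【Solution 11】:

I had this code, which was causing the error: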

for t in dfObj['time']:
  if type(t) == str:
    the_date = dateutil.parser.parse(t)
    loc_dt_int = int(the_date.timestamp())
    dfObj.loc[t == dfObj.time, 'time'] = loc_dt_int

I changed it to this:

for t in dfObj['time']:
  try:
    the_date = dateutil.parser.parse(t)
    loc_dt_int = int(the_date.timestamp())
    dfObj.loc[t == dfObj.time, 'time'] = loc_dt_int
  except Exception as e:
    print(e)
    continue

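This avoids the type comparison, which is what was throwing the warning, as explained above. Because of the dfObj.loc call inside the for loop, all I had to do was catch the exception; maybe there is a way to tell it not to check the rows it has already changed.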
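One way to sidestep both the comparison and the repeated dfObj.loc lookups is to convert each cell independently with map. This is a sketch on made-up data; to_epoch is a hypothetical helper, and it assumes every string in the column is something dateutil can parse.

```python
import pandas as pd
from dateutil import parser

# Hypothetical column mixing a timestamp string with an already-converted integer.
dfObj = pd.DataFrame({'time': ['2020-01-01T00:00:00+00:00', 1577923200]})


def to_epoch(value):
    """Convert a date string to epoch seconds; leave other values untouched."""
    if isinstance(value, str):
        return int(parser.parse(value).timestamp())
    return value


# Each cell is converted on its own, so nothing is compared against the column.
dfObj['time'] = dfObj['time'].map(to_epoch)
print(dfObj['time'].tolist())
```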

【Comments】: