Filtering a Pandas DataFrame using a condition on column values that are numpy arrays: answers

【Question title】: Filtering Pandas DataFrame using a condition on column values that are numpy arrays
【Posted】: 2021-01-05 07:27:45
【Question】:

I have a Pandas DataFrame named "dt" with two columns named "A" and "B". The values in column 'B' are numpy arrays, like this:

index   A   B
0       a   [1,2,3]
1       b   [2,3,4]
2       c   [3,4,5]

where:

type (dt["B"][0])

returns: numpy.ndarray

I want to filter this DataFrame to get another DataFrame that keeps only the rows whose numpy array in 'B' contains a specific element.

I tried:

dt [element in dt["B"]]

For example:

dt [2 in dt["B"]]

should return:

index   A   B
0       a   [1,2,3]
1       b   [2,3,4]

But this raises an error, namely "KeyError: True".

If the values in column 'B' were strings, I could do the same kind of filtering without an error:

dt [dt["B"]==value]

So I would like to know why my code does not work and what "KeyError: True" means.

The full error is:

KeyError                                  Traceback (most recent call last)
~/Applications/Conda/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2645             try:
-> 2646                 return self._engine.get_loc(key)
   2647             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: True

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-151-aa9ea046a48f> in <module>
----> 1 quotes_of_base["BTC" in quotes_of_base["quote"]]

~/Applications/Conda/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2798             if self.columns.nlevels > 1:
   2799                 return self._getitem_multilevel(key)
-> 2800             indexer = self.columns.get_loc(key)
   2801             if is_integer(indexer):
   2802                 indexer = [indexer]

~/Applications/Conda/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2646                 return self._engine.get_loc(key)
   2647             except KeyError:
-> 2648                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2649         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2650         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: True

【Comments on the question】:

Use isin instead of in, as follows: dt[dt["B"].isin(element)]
Please provide a minimal reproducible example.

Tags: python pandas dataframe conditional-statements filtering

【Solution 1】:

Suppose you have something like:

      A         B
  0  10   [11, 0]
  1  20  [11, 10]
  2  30  [11, 10]
  3  40   [10, 0]
  4  50   [11, 0]
  5  60   [10, 0]

and you want to keep only the rows whose array contains 10:

      A         B
  1  20  [11, 10]
  2  30  [11, 10]
  3  40   [10, 0]
  5  60   [10, 0]

You can use .apply:

  import pandas as pd

  # create the dataframe
  df = pd.DataFrame({
      'A': [10, 20, 30, 40, 50, 60],
      'B': [[11, 0], [11, 10], [11, 10], [10, 0], [11, 0], [10, 0]],
  })

  # apply the membership test to column 'B';
  # results is a boolean Series marking the rows whose list contains 10
  results = df.B.apply(lambda a: 10 in a)

  # filter the dataframe with the boolean Series
  df_filtered = df[results]
  print(df_filtered)

Then you get:

      A         B
  1  20  [11, 10]
  2  30  [11, 10]
  3  40   [10, 0]
  5  60   [10, 0]

You can find more details at: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
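The answer above stores plain Python lists in 'B', while the question stores numpy arrays. Here is a minimal sketch of the same .apply idea on array-valued cells. The frame `dt` is rebuilt from the question's table, so its exact construction is an assumption.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's frame: column "B" holds numpy arrays.
dt = pd.DataFrame({
    "A": ["a", "b", "c"],
    "B": [np.array([1, 2, 3]), np.array([2, 3, 4]), np.array([3, 4, 5])],
})

# One boolean per row: does this row's array contain the element?
mask = dt["B"].apply(lambda arr: 2 in arr)

# Boolean indexing keeps only the rows where the mask is True.
filtered = dt[mask]
print(filtered)
```

The `in` operator works here because it is applied to each array separately inside the lambda, not to the whole column.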

【Comments】:

【Solution 2】:

I combined the commenters' answers. Note that when I read in the data, the lists came through as strings, so you may need the str(2) part.

df[df.apply(lambda x: str(2) in x['B'], axis=1)]

   A        B
0  a  [1,2,3]
1  b  [2,3,4]
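One caveat with the string version: `str(2) in x['B']` is a substring test, so "2" also matches inside "12". The sketch below assumes the cells hold text such as "[1,2,3]" (the sample data is made up for illustration). It parses each cell with `ast.literal_eval` before testing membership. This is one possible workaround, not part of the original answer.

```python
import ast

import pandas as pd

# Assumption: column "B" holds text such as "[1,2,3]", not real lists.
df = pd.DataFrame({
    "A": ["a", "b", "c"],
    "B": ["[1,2,3]", "[2,3,4]", "[3,4,12]"],
})

# Substring test: "2" is also found inside "12", so row "c" is kept by mistake.
substring_mask = df["B"].apply(lambda s: "2" in s)

# Parse each string into a Python list first, then test real membership.
parsed = df["B"].apply(ast.literal_eval)
element_mask = parsed.apply(lambda values: 2 in values)

print(df[substring_mask])
print(df[element_mask])
```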

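【Comments】:

【Solution 3】:

Remember that boolean indexing on a DataFrame expects a list of True/False values, so if push comes to shove you can still build that list elsewhere (list comprehension or for loop) and pass it to the df, as in dt[constructed_true_false_list]. Just make sure it has one entry for every row of your df.

Without a concrete example it is hard to propose a solution, but you could try something like this: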

[True if np.any(my_np_array == element) else False for my_np_array in dt["B"].values]
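As for why the original `dt[2 in dt["B"]]` raises "KeyError: True": the sketch below, which reuses an assumed reconstruction of the question's frame, suggests that `in` on a Series checks the index labels and returns one plain bool. The DataFrame is then asked for a column named `True`. A mask with one boolean per row avoids this.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's frame.
dt = pd.DataFrame({
    "A": ["a", "b", "c"],
    "B": [np.array([1, 2, 3]), np.array([2, 3, 4]), np.array([3, 4, 5])],
})

# `in` on a Series tests the index labels (0, 1, 2), not the stored values,
# and yields a single plain bool for the whole column.
single_flag = 2 in dt["B"]
print(single_flag)

# dt[single_flag] is therefore dt[True]: a lookup of a column named True.
try:
    dt[single_flag]
except KeyError as exc:
    print("KeyError:", exc)

# A mask with one boolean per row selects rows instead.
element = 2
mask = [bool(np.any(arr == element)) for arr in dt["B"]]
print(dt[mask])
```

By contrast, `dt["B"] == value` on a column of strings already produces one boolean per row, which is why that form works in the question.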

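【Comments】:

I have added an example and the full error. Could you take another look? It seems to me there should be a simpler solution. Thanks a lot!
dt[[np.any(np.array(array)==2) for array in dt["B"]]] works fine.
I converted your ndarray to a plain array again because ndarray suggests multi-dimensional shenanigans. Since I could not see the raw data, I added that as a precaution; you can leave it out.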