Python Pandas: Finding the index of a pandas Series with an array value

【Question title】: Python Pandas: Finding the index of a pandas series with an array value
【Posted】: 2018-02-21 04:41:19
【Question description】:

I have a pandas Series in which each value is an array, like this:

             'Node'
    ..        ....
    97     [355.0, 296.0]
    98      [53.0, 177.0]
    99      [294.0, 14.0]
    100     [330.0, 15.0]
    101    [100.0, 160.0]
    102     [10.0, 220.0]
    103    [330.0, 290.0]

I want to find the index of every row that contains the value 330.0, i.e. 100 and 103.

What I have tried so far:

vals = [item for item in df.Node if item[0] == 330.0]

This gives me [array([ 330., 15.]), array([ 330., 290.])]

Then:

for val in vals:
    id = pd.Index(df.Node).get_loc(val)

This raises TypeError: '[ 330. 15.]' is an invalid key

How can I fix this and get the row index of these values?

Edit: here is a sample dataframe with far fewer rows.

0     [139.0, 105.0]
1     [290.0, 200.0]
2     [257.0, 243.0]
3       [235.0, 7.0]
4      [12.0, 115.0]
5     [168.0, 135.0]
6     [105.0, 258.0]
7      [339.0, 64.0]
8       [6.0, 148.0]
9      [33.0, 286.0]
10      [62.0, 26.0]
11    [307.0, 185.0]
12     [34.0, 269.0]
13     [206.0, 60.0]
14    [327.0, 127.0]
15    [127.0, 202.0]
16     [297.0, 48.0]
17    [131.0, 151.0]
18      [326.0, 1.0]
19     [304.0, 35.0]
20     [329.0, 23.0]
21    [314.0, 287.0]
22      [1.0, 233.0]
23    [260.0, 280.0]
24     [313.0, 56.0]
25     [294.0, 33.0]
26    [243.0, 256.0]
27    [151.0, 174.0]
28    [271.0, 295.0]
29    [141.0, 184.0]
30    [105.0, 157.0]
31    [288.0, 269.0]
32    [118.0, 210.0]
33     [38.0, 194.0]
34     [49.0, 154.0]
35     [40.0, 204.0]
36     [317.0, 27.0]
37     [359.0, 33.0]
38     [56.0, 184.0]
39     [359.0, 39.0]
40     [48.0, 170.0]
41     [314.0, 51.0]
42    [175.0, 184.0]
43     [28.0, 200.0]
44     [35.0, 169.0]
45     [330.0, 15.0]
46    [100.0, 160.0]
47     [10.0, 220.0]
48    [330.0, 290.0]
Name: Node, dtype: object

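【Comments】:

Could you provide a well-formatted example of the dataframe so we can test some code? It looks like your problem is caused by the list format, though.
The reason you get the error is that a list is not hashable in Python.
@MattR I have now updated the question with a dataframe that has fewer rows.
I strongly suggest considering @Alexander's answer. Getting the data into the right format for data analysis is important. Keeping multiple values in a single cell is not the best way to store data. First convert to a proper DataFrame with two columns, then do a normal boolean selection.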
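
The hashability point raised in the comments is easy to check. The sketch below uses a helper named is_hashable, which is made up for this illustration. It shows that neither a list nor a NumPy array can be hashed, which is presumably why passing one as a lookup key to get_loc is rejected:

```python
import numpy as np

def is_hashable(obj):
    """Return True if hash(obj) succeeds (hypothetical helper for illustration)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(is_hashable([330.0, 15.0]))            # lists are mutable, so not hashable
print(is_hashable(np.array([330.0, 15.0])))  # ndarrays are not hashable either
print(is_hashable((330.0, 15.0)))            # a tuple of floats is hashable
```

This is why the answers below test membership inside each cell instead of looking the whole array up as a key.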

Tags: python pandas

【Solution 1】:

One more :)

df.index[df['Node'].apply(lambda x: 330.0 in x )].tolist()

You get

[100, 103]

This also seems to be the fastest:

%timeit df.index[df['Node'].apply(lambda x: 330.0 in x )].tolist()
1000 loops, best of 3: 262 µs per loop

%timeit df[df.Node.apply(lambda x: True if 330.0 in x else False)].index 
1000 loops, best of 3: 704 µs per loop

%timeit df.loc[(df['x'] == 330) | (df['y'] == 330), 'Node']
1000 loops, best of 3: 1.3 ms per loop
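
These figures come from IPython's %timeit on the small example and will vary with machine and pandas version. A minimal sketch for rerunning the comparison outside IPython with the standard timeit module, assuming the seven-row sample frame (the function names via_apply, via_filter and via_columns are made up here):

```python
import timeit
import pandas as pd

df = pd.DataFrame({'Node': [
    [355., 296.], [53., 177.], [294., 14.], [330., 15.],
    [100., 160.], [10., 220.], [330., 290.]]}, index=range(97, 104))
# The third variant needs the 'x' and 'y' columns from Solution 2
df[['x', 'y']] = pd.DataFrame(df['Node'].tolist(), index=df.index)

def via_apply():
    return df.index[df['Node'].apply(lambda x: 330.0 in x)].tolist()

def via_filter():
    return df[df.Node.apply(lambda x: True if 330.0 in x else False)].index.tolist()

def via_columns():
    return df.loc[(df['x'] == 330) | (df['y'] == 330), 'Node'].index.tolist()

# Time each variant; absolute numbers are only meaningful relative to each other
for fn in (via_apply, via_filter, via_columns):
    seconds = timeit.timeit(fn, number=200)
    print(f"{fn.__name__}: {seconds / 200 * 1e6:.0f} us per call")
```

On a frame this small the results mostly reflect fixed overhead, so it is worth repeating the measurement on data of realistic size before picking an approach.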

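【Discussion】:

Thanks, this looks like the solution I wanted!
Just for future readers: if Node ever contains a value that is not a list, the user will get a type error: TypeError: argument of type 'int' is not iterable. This is basically the opposite of the OP's original problem. The solution currently only works for a column of lists.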
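
One way to sidestep the TypeError described in the comment above is to test whether each cell is list-like before using `in`. The following is a sketch, not part of the original answer: contains_value and the mixed sample frame are made up for illustration, while pd.api.types.is_list_like is an existing pandas helper.

```python
import pandas as pd

def contains_value(cell, target=330.0):
    """True if a list-like cell holds `target`, or a scalar cell equals it."""
    if pd.api.types.is_list_like(cell):
        return target in cell
    return cell == target

# Hypothetical mixed column: lists plus a bare int and a bare float
mixed = pd.DataFrame({'Node': [[330., 15.], 7, [10., 220.], 330.0]},
                     index=[100, 101, 102, 103])

matches = mixed.index[mixed['Node'].apply(contains_value)].tolist()
print(matches)  # [100, 103]
```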

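【Solution 2】:

A key question is why each cell of the column holds a list in the first place. Such a column is stored with the object dtype, which is the least efficient choice. You should probably split the lists into two separate columns (np.float64, given your sample data) and then check the values.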

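>>> import pandas as pd
>>> df = pd.DataFrame({'Node': [
...     [355., 296.],
...     [53., 177.],
...     [294., 14.],
...     [330., 15.],
...     [100., 160.],
...     [10., 220.],
...     [330., 290.]]}, index=range(97, 104))
>>> # Split each two-element list into separate float columns 'x' and 'y'
>>> df[['x', 'y']] = df.Node.apply(pd.Series)
>>> df.loc[(df['x'] == 330) | (df['y'] == 330), 'Node']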
100     [330.0, 15.0]
103    [330.0, 290.0]
Name: Node, dtype: object
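
If only the index labels are needed, the same idea can be written without adding columns to df. This is a sketch that assumes every cell holds exactly two numbers; the names coords and mask are made up here:

```python
import pandas as pd

df = pd.DataFrame({'Node': [
    [355., 296.], [53., 177.], [294., 14.], [330., 15.],
    [100., 160.], [10., 220.], [330., 290.]]}, index=range(97, 104))

# Build a separate two-column float frame that shares df's index
coords = pd.DataFrame(df['Node'].tolist(), index=df.index, columns=['x', 'y'])

# True for every row in which either column equals 330
mask = (coords == 330.0).any(axis=1)
print(df.index[mask].tolist())  # [100, 103]
```

Building the frame from df['Node'].tolist() avoids constructing one pd.Series per row, which is what apply(pd.Series) does.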

【Discussion】:

【Solution 3】:

You can get what you want with:

df[df.Node.apply(lambda x: True if 330.0 in x else False)].index

Full example:

>>> import pandas as pd 
>>> df = pd.DataFrame({'Node': [
...     [355., 296.], 
...     [53., 177.], 
...     [294., 14.], 
...     [330., 15.], 
...     [100., 160.],
...     [10., 220.],
...     [330., 290.]]}, index=range(97, 104))
>>> df
               Node
97   [355.0, 296.0]
98    [53.0, 177.0]
99    [294.0, 14.0]
100   [330.0, 15.0]
101  [100.0, 160.0]
102   [10.0, 220.0]
103  [330.0, 290.0]
>>> df[df.Node.apply(lambda x: True if 330.0 in x else False)]
               Node
100   [330.0, 15.0]
103  [330.0, 290.0]
>>> df[df.Node.apply(lambda x: True if 330.0 in x else False)].index 
Int64Index([100, 103], dtype='int64')
>>> 
>>> df[df.Node.apply(lambda x: True if 330.0 in x else False)].index.tolist()  
[100, 103]
>>>

【Discussion】:

【Solution 4】:

How about this:

import pandas as pd

df = pd.DataFrame()
df['Node'] = [[1, 2], [1, 3], [330.0, 5]]

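# Series.items() yields (index label, value) pairs, so idx is the real
# index label rather than the 0-based position that enumerate() would give
for idx, value in df['Node'].items():
    if 330.0 in value:
        print(idx)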

【Discussion】:

【Solution 5】:

Avoid loops in pandas. Use .loc:

An example:

df.loc[df['Node'] == 330.0].index.tolist()

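This gives you a list of the index labels where 'Node' equals 330. You may need to adapt it a little. See this SO answer for how to use lambda expressions with pandas to help you handle lists.

Edit:

I left a comment stating that the accepted answer fails unless the whole Node column contains list values. A dirty workaround is to convert the values to strings and use contains. You could try the following: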

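df.loc[df['Node'].astype(str).str.contains('330.0', regex=False)].index.tolist()

This turns each list into a string, so you can then check whether it contains the string 330.0. Passing regex=False makes the dot match literally instead of acting as a regex wildcard. Note that this is still a substring test, so a value such as 1330.0 would also match.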
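
A non-string alternative that also tolerates scalar cells is Series.explode (available in pandas 0.25 and later), which gives every list element its own row. This is a sketch on a made-up frame, not part of the original answer; the value 1330.0 is included to show where a substring test on '330.0' goes wrong:

```python
import pandas as pd

# Hypothetical column mixing lists with a bare scalar
df = pd.DataFrame({'Node': [[330., 15.], [1330.0, 2.0], 330.0, [10., 220.]]},
                  index=[100, 101, 102, 103])

# explode() repeats the index label for each list element;
# scalar cells pass through unchanged
flat = df['Node'].explode()
matches = flat.index[flat == 330.0].unique().tolist()
print(matches)  # [100, 102]

# For comparison, the string-based test also picks up row 101
as_text = df['Node'].astype(str).str.contains('330.0', regex=False)
print(df.index[as_text].tolist())  # [100, 101, 102]
```

The comparison here is an exact numeric equality, so partial matches inside larger numbers are not an issue.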

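【Discussion】:

Thanks for your answer, but I get an error saying ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Thanks, that link helped!
Glad to help. I'll update my answer to be similar to the one you already accepted. That one is the best answer.
@PVasish, I updated my answer in case you run into the problem of Node containing anything other than lists. I have hit this before, so here is a simple workaround.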