Selecting rows by members of a list contained in a cell: answers

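【Question title】: Selecting rows by members of a list contained in a cell
【Posted】: 2014-07-19 06:45:11
【Question】:

In my dataframe, I have a column that contains lists of items. I want to select only those rows whose list contains all or some of a given set of items. Matching at least one list would be great.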

import pandas as pd
df = pd.DataFrame([[2,[2,3,8]]], columns=['a','b'])
df

I tried the following:

 df[df['b'] == [2,3,8]]
 df[[2,3,8] in df['b']] # and etc.

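I feel blindfolded here...

To FooBar:

I am doing a science field analysis. Each list contains codes of different science fields, and each row represents a case in which those fields co-occur. I could store the list members in separate columns, but the number of co-occurring fields varies from row to row. That is why I thought keeping a list in a cell would be acceptable.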

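【Comments】:

This setup (lists stored in a single column) is non-standard and has many drawbacks. Are you aware of the pros and cons of this decision, and did you set up your database this way on purpose? If not, please share more about your end goal or the kind of data you are trying to organize, and we may be able to suggest a better database strategy.
(You can tell how awkward the setup is from the complexity of the answers; this should normally be a piece of cake.)
@FooBar You are right - many elements in one cell is one of the antipatterns described in the book SQL Antipatterns.
@Aidis I added an answer that converts your data into a more suitable format. I hope that having two keys to distinguish the standard data from your science fields is satisfactory.

Tags: python pandas selection dataframe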

【Solution 1】:

I think you can do the following:

idx = []

S = [2,3,8]

for i, line in df.iterrows():
    if set(S).issubset(line['b']):
        idx.append(i)

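Now you can select just the rows you are interested in:

# .ix was removed in pandas 1.0; .loc selects by index label
df_subset = df.loc[idx]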

【Comments】:

@Aidis You can do this more simply: df[ df['b'].apply(lambda x:set(x).issubset([2,3,8])) ] or, with x and [2,3,8] swapped, df[ df['b'].apply(lambda x:set([2,3,8]).issubset(x)) ]
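
The two one-liners in the comment above answer different questions, so a small sketch may help. It uses made-up data (not the asker's) and a hypothetical name `wanted` for the codes being searched for.

```python
import pandas as pd

# Hypothetical sample data: one list of codes per cell in column 'b'.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [[2, 3, 8], [12, 13, 18], [2, 3], [1, 2, 3, 8, 10], [8, 3, 2]],
})
wanted = {2, 3, 8}

# Rows whose list contains every wanted code (extra codes are allowed).
contains_all = df[df["b"].apply(lambda cell: wanted.issubset(cell))]

# Rows whose list contains nothing outside the wanted codes.
only_wanted = df[df["b"].apply(lambda cell: set(cell).issubset(wanted))]

print(contains_all["a"].tolist())
print(only_wanted["a"].tolist())
```

The first mask is probably the one the question asks for; the second is stricter about extras but more lenient about missing codes, which is why row 3 (`[2, 3]`) passes it.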

【Solution 2】:

There is no problem comparing tuples:

import pandas as pd

# The output shown below is from the original 2014 post; newer pandas
# versions may treat a tuple as array-like (especially in eq()).

data = [
    [1, (2,3,8)],
    [2, (12,13,18)],
    [3, (2,3,8)],
    [4, (1,2,3,8,10)],
    [5, (8,3,2)],
]

#----------------------------------------------

df_tuple = pd.DataFrame( data, columns=['a','b'])

print('\n DataFrame with tuples \n')
print(df_tuple)

print('\n tuple == : \n')

# boolean mask, then the rows selected by that mask
print(df_tuple['b'] == (2,3,8))
print(df_tuple[ df_tuple['b'] == (2,3,8) ])

print('\n tuple eq() : \n')

print(df_tuple['b'].eq((2,3,8)))
print(df_tuple[ df_tuple['b'].eq((2,3,8)) ])

#----------------------------------------------

Result

 DataFrame with tuples 

   a                 b
0  1         (2, 3, 8)
1  2      (12, 13, 18)
2  3         (2, 3, 8)
3  4  (1, 2, 3, 8, 10)
4  5         (8, 3, 2)

 tuple == : 

0     True
1    False
2     True
3    False
4    False
Name: b, dtype: bool
   a          b
0  1  (2, 3, 8)
2  3  (2, 3, 8)

 tuple eq() : 

0     True
1    False
2     True
3    False
4    False
Name: b, dtype: bool
   a          b
0  1  (2, 3, 8)
2  3  (2, 3, 8)

But there is a problem comparing lists, and I don't know why.
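
A plausible explanation (an assumption here, not something this answer states) is that pandas treats a list on the right-hand side of == as array-like and tries to line its elements up with the rows, instead of comparing it with each cell as a whole. A minimal sketch that sidesteps this by doing the comparison per cell in plain Python, on made-up data:

```python
import pandas as pd

# Hypothetical sample data with a list in each cell of 'b'.
df_list = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [[2, 3, 8], [12, 13, 18], [8, 3, 2]],
})

# Compare each cell with the target list one at a time.
# Plain list equality is order-sensitive.
exact = df_list["b"].apply(lambda cell: cell == [2, 3, 8])

# Order-insensitive variant: compare the members as sets.
same_members = df_list["b"].apply(lambda cell: set(cell) == {2, 3, 8})

print(df_list[exact])
print(df_list[same_members])
```

The set-based variant also ignores duplicates, so it only fits when a code appearing twice in a cell carries no meaning.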

But you need the rows that contain all or some of the items from the list [2,3,8], so I would use apply() with a function of my own.

import pandas as pd

#----------------------------------------------

data = [
    [1, [2,3,8]],
    [2, [12,13,18]],
    [3, [2,3,8]],
    [4, [1,2,3,8,10]],
    [5, [8,3,2]],
]

#----------------------------------------------

df_list = pd.DataFrame( data, columns=['a','b'])

print('\n DataFrame with lists \n')
print(df_list)

print('\n test: \n')

# test if any element from data list is in [2,3,8]
def test(data):
    return any( x in [2,3,8] for x in data )

# boolean mask, then the rows selected by that mask
print(df_list['b'].apply(test))
print(df_list[ df_list['b'].apply(test) ])

#----------------------------------------------

Result

 DataFrame with lists 

   a                 b
0  1         [2, 3, 8]
1  2      [12, 13, 18]
2  3         [2, 3, 8]
3  4  [1, 2, 3, 8, 10]
4  5         [8, 3, 2]

 test: 

0     True
1    False
2     True
3     True
4     True
Name: b, dtype: bool
   a                 b
0  1         [2, 3, 8]
2  3         [2, 3, 8]
3  4  [1, 2, 3, 8, 10]
4  5         [8, 3, 2]

A more useful version, with a second argument:

test_any returns True if any element of the data list is in the expected list

def test_any(data, expected):
    return any( x in expected for x in data )

print(df_list['b'].apply(lambda x:test_any(x,[2,3,8]) ))
print(df_list[ df_list['b'].apply(lambda x:test_any(x,[2,3,8]) ) ])

test_all returns True if all elements of the data list are in the expected list

def test_all(data, expected):
    return all( x in expected for x in data )

print(df_list['b'].apply(lambda x:test_all(x,[2,3,8]) ))
print(df_list[ df_list['b'].apply(lambda x:test_all(x,[2,3,8]) ) ])

You can swap 'x' and [2,3,8].

Get True if any element of the expected list is in the data list:

print(df_list[ df_list['b'].apply(lambda x:test_any([2,3,8], x) ) ])

Get True if all elements of the expected list are in the data list:

print(df_list[ df_list['b'].apply(lambda x:test_all([2,3,8], x) ) ])
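
The "expected list in data list" direction can also be written with set operations, which avoids rescanning a list for every element. This is only a sketch: the helper names `has_any` and `has_all` and the sample data are made up for illustration and are not part of the answer.

```python
import pandas as pd

# Hypothetical sample data; row 3 contains only one of the expected codes.
df_list = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [[2, 3, 8], [12, 13, 18], [2, 12], [1, 2, 3, 8, 10], [8, 3, 2]],
})

def has_any(cell, expected):
    """True if the cell shares at least one element with expected."""
    return not set(expected).isdisjoint(cell)

def has_all(cell, expected):
    """True if every element of expected appears in the cell."""
    return set(expected).issubset(cell)

expected = [2, 3, 8]
any_mask = df_list["b"].apply(lambda cell: has_any(cell, expected))
all_mask = df_list["b"].apply(lambda cell: has_all(cell, expected))

print(df_list[any_mask])
print(df_list[all_mask])
```

Row 3 (`[2, 12]`) passes the "any" test but not the "all" test, which is the difference between the two filters.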

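【Comments】:

【Solution 3】:

OK, I would do the following to get your dataframe into a "better" format. I allow any number of "scientific attributes", as you called them, and label them "additional".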

D = df
df = pd.concat([D['a'], pd.DataFrame(D['b'].tolist(), index=D.index)], axis=1, keys=['standard', 'additional'])
In[103]: df
Out[103]: 
   standard  additional      
          a           0  1  2
0         2           2  3  8

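Now we search only the "additional" part for the keys you gave us. (The any used here is NumPy's, imported with import numpy as np; the built-in any takes no axis argument.)

In[133]: np.any(df['additional'] == 3, axis=1) & np.any(df['additional'] == 8, axis=1)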
Out[133]: array([ True], dtype=bool)

Now I just hack in a second, fake row to check that I really do "not select" the rows that don't meet the criteria:

df2 = pd.concat([df, df])  # DataFrame.append was removed in pandas 2.0
df2.iloc[1] += 1
np.any(df2['additional'] == 3, axis=1) & np.any(df2['additional'] == 8, axis=1)
Out[132]: array([ True, False], dtype=bool)

Credit: I learned this lovely use of concat() from HYRY here.
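
If the goal is simply to get away from lists in cells, a long layout (one row per code) is another option besides the wide one above. The sketch below is not part of this answer: it assumes pandas 0.25 or later (for DataFrame.explode) and uses made-up data and hypothetical names (`wanted`, `long_df`, `hits`).

```python
import pandas as pd

# Hypothetical sample data with a varying number of codes per row.
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [[2, 3, 8], [12, 13, 18], [3, 9]],
})
wanted = [3, 8]

# One row per (original row, code); the original index label is repeated.
long_df = df.explode("b")

# Count how many distinct wanted codes each original row contains.
hits = long_df[long_df["b"].isin(wanted)].groupby(level=0)["b"].nunique()

# Rows without any wanted code drop out of the groupby, so restore them as 0.
hits = hits.reindex(df.index, fill_value=0)

has_all = hits == len(set(wanted))
has_any = hits > 0

print(df[has_all])
print(df[has_any])
```

Because the number of rows per case can vary freely, this layout does not need one column per possible co-occurring field.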

【Comments】: