Finding the intersection/difference between Python lists: answers

【Question title】: Finding intersection/difference between Python lists
【Posted】: 2013-02-08 22:42:18
【Question】:

I have two Python lists:

a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]

b = ['the', 'when', 'send', 'we', 'us']

I need to filter out of a every element whose first item also appears in b. In this case, I should get:

c = [('why', 4), ('throw', 9), ('you', 1)]

What would be the most efficient way to do this?

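【Comments on the question】:

Why not use the intersection method? It works, but you can make it work better ;)
Why is this question tagged numpy? Do you need a numpy solution?

Tags: python list numpy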

【Solution 1】:

A list comprehension will work.

a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]
b = ['the', 'when', 'send', 'we', 'us']
filtered = [i for i in a if i[0] not in b]

>>> print(filtered)
[('why', 4), ('throw', 9), ('you', 1)]

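【Discussion】:

This is a very elegant approach that keeps the lists as lists rather than treating them as dictionaries... thanks for your help.
If you use the in operator, you should convert b to a set. That changes the lookup time from linear to constant (on average), which makes a huge difference when b is a long list. So: c = set(b), then filtered = [i for i in a if not i[0] in c]. Note that b becomes c in the last line. Even on this short 5-item list it gave me a 25% speedup. With a longer list (100 items in b), it was 90% faster.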
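
The set suggestion in the comment above can be sketched roughly as follows. The name `b_set` and the use of `timeit` are illustrative choices, not part of the original comment, and the timings will differ from machine to machine, so no numbers are claimed here.

```python
import timeit

a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]
b = ['the', 'when', 'send', 'we', 'us']

# Build the set once, outside the comprehension, so each membership test
# is an average O(1) hash lookup instead of an O(len(b)) scan of the list.
b_set = set(b)

filtered_list = [i for i in a if i[0] not in b]      # scans the list b
filtered_set = [i for i in a if i[0] not in b_set]   # hashes into the set

# Rough timing comparison; absolute numbers depend on the machine.
t_list = timeit.timeit(lambda: [i for i in a if i[0] not in b], number=10000)
t_set = timeit.timeit(lambda: [i for i in a if i[0] not in b_set], number=10000)

print(filtered_set)
print("list lookup: %.4fs, set lookup: %.4fs" % (t_list, t_set))
```

Both comprehensions return the same result; only the cost of each `in` test changes, so the gap should widen as b grows.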

【Solution 2】:

A list comprehension should work:

c = [item for item in a if item[0] not in b]

Or with a dict comprehension:

d = dict(a)
# unpack both key and value (use d.iteritems() on Python 2)
c = {key: value for key, value in d.items() if key not in b}

【Discussion】:

Did you mean {key: value for key, value in d.iteritems() if key not in b}?
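
For reference, a minimal Python 3 sketch of the dict route that also converts the result back to a list of tuples, as the question asks for. It assumes Python 3.7+ (where dicts keep insertion order); `b_set` and `kept` are illustrative names. One caveat worth stating: `dict(a)` keeps only the last value for any repeated first element, so this route only matches the list comprehension when the words in a are unique.

```python
a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]
b = ['the', 'when', 'send', 'we', 'us']

b_set = set(b)   # illustrative name; gives fast membership tests
d = dict(a)      # duplicate first elements in a would collapse here

# Unpack both key and value from items(); iteritems() exists only on Python 2.
kept = {key: value for key, value in d.items() if key not in b_set}

c = list(kept.items())  # back to a list of (word, count) tuples
print(c)  # [('why', 4), ('throw', 9), ('you', 1)]
```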

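【Solution 3】:

in is fine, but you should at least use a set for b. If you have numpy, you could of course also try np.in1d, but you would have to test whether it is actually faster.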

# shameless copy of the answer above, but using a set...
b_set = set(b)
filtered = [i for i in a if i[0] not in b_set]

# with numpy (note: if you create the array like this, you must state the
# maximum string length up front, here 10); otherwise just use an object
# array, which is slower (likely not worth it) but safe.
import numpy as np

# 'U10' = unicode strings of up to 10 characters (use 'S10' for byte strings)
a_arr = np.array(a, dtype=[('key', 'U10'), ('val', int)])
b_arr = np.asarray(b)

mask = ~np.in1d(a_arr['key'], b_arr)  # newer NumPy spells this np.isin
filtered = a_arr[mask]

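Sets also have methods such as difference, which are probably not useful here but can be in general.

【Discussion】:

+1 for numpy. I had not seen your answer before posting mine. in1d is 2x faster than the list comprehension on larger data sets.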
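
As a small illustration of the `difference` method mentioned above (a sketch only; `keys_to_keep` is a made-up name): the set arithmetic works on the words alone, so the tuples still have to be filtered afterwards to recover the counts and the original order.

```python
a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]
b = ['the', 'when', 'send', 'we', 'us']

# set.difference accepts any iterable, so b does not need converting first.
keys_to_keep = {key for key, _ in a}.difference(b)

# Sets are unordered, so filter the original list to keep a's ordering.
c = [pair for pair in a if pair[0] in keys_to_keep]
print(c)  # [('why', 4), ('throw', 9), ('you', 1)]
```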

【Solution 4】:

Since this is tagged numpy, here is a numpy solution using numpy.in1d, benchmarked against the list comprehension:

In [1]: a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]

In [2]: b = ['the', 'when', 'send', 'we', 'us']

In [3]: a_ar = np.array(a, dtype=[('string','|S5'), ('number',float)])

In [4]: b_ar = np.array(b)

In [5]: %timeit filtered = [i for i in a if not i[0] in b]
1000000 loops, best of 3: 778 ns per loop

In [6]: %timeit filtered = a_ar[~np.in1d(a_ar['string'], b_ar)]
10000 loops, best of 3: 31.4 us per loop

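So for 5 records, the list comprehension is faster (778 ns vs. 31.4 us).

But for a large data set, the numpy solution is about twice as fast as the list comprehension: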

In [7]: a = a * 1000

In [8]: a_ar = np.array(a, dtype=[('string','|S5'), ('number',float)])

In [9]: %timeit filtered = [i for i in a if not i[0] in b]
1000 loops, best of 3: 647 us per loop

In [10]: %timeit filtered = a_ar[~np.in1d(a_ar['string'], b_ar)]
1000 loops, best of 3: 302 us per loop

【Discussion】:

【Solution 5】:

Try this:

a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]

b = ['the', 'when', 'send', 'we', 'us']

c=[]

for x in a:
    if x[0] not in b:
        c.append(x)
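print(c)

Demo: http://ideone.com/zW7mzY

【Discussion】:

Backwards: the OP wants c to contain the things that are *not* in b
This looks like the "C++ way", not the "Python way" ;)
@tohecz C++ doesn't support the in operator.
@Arpit No, but in essence this uses a loop to manipulate a container, which is not how Python is meant to be written.
I still vote for intersection! :]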
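
The commenter does not spell out what the intersection approach would look like. One possible reading, offered only as a guess at the intent, is to intersect b with the words in a and then drop the tuples whose word is in that overlap; `common` is an illustrative name.

```python
a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]
b = ['the', 'when', 'send', 'we', 'us']

# Words that appear both in b and as the first element of a tuple in a.
common = set(b).intersection(key for key, _ in a)

# Keep only the tuples whose word is not in the overlap.
c = [pair for pair in a if pair[0] not in common]
print(c)  # [('why', 4), ('throw', 9), ('you', 1)]
```

The intersection step is extra work compared with testing against `set(b)` directly, so this is mainly useful when the overlapping words are needed for something else as well.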

【Solution 6】:

A simple approach

a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]
b = ['the', 'when', 'send', 'we', 'us']
c = []  # a list to store the required tuples
# compare the first element of each tuple in a against the elements of b
for i in a:
    if i[0] not in b:
        c.append(i)
print(c)

【Discussion】:

【Solution 7】:

Using filter:

c = filter(lambda (x, y): False if x in b else True, a)

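【Discussion】:

-1: if you are writing False if ... else True or True if ... else False, you are doing it wrong
Wrong as in some kind of "Python style" mistake, or wrong for another reason?
X in Y is already a boolean expression in Python
@RahulBanerjee False if ... else True is needlessly complicated and hard to read; just write lambda (x, y): x not in b. Also, this is a syntax error in Python 3: you have to write lambda x: x[0] not in b, because the argument-unpacking form you used is no longer part of the language.
Part of the problem here is that filter(lambda: ... is inherently hard to read (compared with a filtering comprehension). Presumably you prefer your notation because it contains an if.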
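
Putting the comments above together, a Python 3 version of the filter approach might look like the sketch below, shown next to the equivalent comprehension for comparison. The names `b_set`, `c_filter` and `c_comp` are illustrative. In Python 3, `filter` returns a lazy iterator, so it is wrapped in `list()` here.

```python
a = [('when', 3), ('why', 4), ('throw', 9), ('send', 15), ('you', 1)]
b = ['the', 'when', 'send', 'we', 'us']

b_set = set(b)

# Python 3: lambdas cannot unpack tuple parameters, so index into the pair.
# "not in" already yields a bool, so no "False if ... else True" is needed.
c_filter = list(filter(lambda pair: pair[0] not in b_set, a))

# The equivalent comprehension, which the commenters find easier to read.
c_comp = [pair for pair in a if pair[0] not in b_set]

print(c_filter)  # [('why', 4), ('throw', 9), ('you', 1)]
```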