Drop duplicates from Structured Numpy Array Python3.x: answers

【Question title】: Drop duplicates from Structured Numpy Array Python3.x
【Posted】: 2017-09-24 13:08:52
【Question】:

Take the following array:

import numpy as np

arr_dupes = np.array(
    [
      ('2017-09-13T11:05:00.000000',  1.32685,  1.32704,  1.32682,  1.32686,  1.32684,  1.32702,  1.32679,  1.32683, 246),
      ('2017-09-13T11:05:00.000000',  1.32685,  1.32704,  1.32682,  1.32686,  1.32684,  1.32702,  1.32679,  1.32683, 246),
      ('2017-09-13T11:05:00.000000',  1.32685,  1.32704,  1.32682,  1.32686,  1.32684,  1.32702,  1.32679,  1.32683, 222),
      ('2017-09-13T11:04:00.000000',  1.32683,  1.32686,  1.32682,  1.32685,  1.32682,  1.32684,  1.3268 ,  1.32684,  97),
      ('2017-09-13T11:03:00.000000',  1.32664,  1.32684,  1.32663,  1.32683,  1.32664,  1.32683,  1.32661,  1.32682, 268),
      ('2017-09-13T11:02:00.000000',  1.3268 ,  1.32692,  1.3266 ,  1.32664,  1.32678,  1.32689,  1.32658,  1.32664, 299),
      ('2017-09-13T11:02:00.000000',  1.3268 ,  1.32692,  1.3266 ,  1.32664,  1.32678,  1.32689,  1.32658,  1.32664, 299),
      ('2017-09-13T11:01:00.000000',  1.32648,  1.32682,  1.32648,  1.3268 ,  1.32647,  1.32682,  1.32647,  1.32678, 322),
      ('2017-09-13T11:00:00.000000',  1.32647,  1.32649,  1.32628,  1.32648,  1.32644,  1.32651,  1.32626,  1.32647, 285)],
      dtype=[('date', '<M8[us]'), ('askopen', '<f8'), ('askhigh', '<f8'), ('asklow', '<f8'), ('askclose', '<f8'),
             ('bidopen', '<f8'), ('bidhigh', '<f8'), ('bidlow', '<f8'), ('bidclose', '<f8'), ('volume', '<i8')]
)

What is the fastest way to remove duplicates, using the date as the index and keeping the last value?

The Pandas DataFrame equivalent is

In [5]: df = pd.DataFrame(arr_dupes, index=arr_dupes['date'])
In [6]: df
Out[6]:
                                   date  askopen  askhigh   asklow  askclose  bidopen  bidhigh   bidlow  bidclose  volume
2017-09-13 11:05:00 2017-09-13 11:05:00  1.32685  1.32704  1.32682   1.32686  1.32684  1.32702  1.32679   1.32683     246
2017-09-13 11:05:00 2017-09-13 11:05:00  1.32685  1.32704  1.32682   1.32686  1.32684  1.32702  1.32679   1.32683     246
2017-09-13 11:05:00 2017-09-13 11:05:00  1.32685  1.32704  1.32682   1.32686  1.32684  1.32702  1.32679   1.32683     222
2017-09-13 11:04:00 2017-09-13 11:04:00  1.32683  1.32686  1.32682   1.32685  1.32682  1.32684  1.32680   1.32684      97
2017-09-13 11:03:00 2017-09-13 11:03:00  1.32664  1.32684  1.32663   1.32683  1.32664  1.32683  1.32661   1.32682     268
2017-09-13 11:02:00 2017-09-13 11:02:00  1.32680  1.32692  1.32660   1.32664  1.32678  1.32689  1.32658   1.32664     299
2017-09-13 11:02:00 2017-09-13 11:02:00  1.32680  1.32692  1.32660   1.32664  1.32678  1.32689  1.32658   1.32664     299
2017-09-13 11:01:00 2017-09-13 11:01:00  1.32648  1.32682  1.32648   1.32680  1.32647  1.32682  1.32647   1.32678     322
2017-09-13 11:00:00 2017-09-13 11:00:00  1.32647  1.32649  1.32628   1.32648  1.32644  1.32651  1.32626   1.32647     285

In [7]: df.reset_index().drop_duplicates(subset='date', keep='last').set_index('date')
Out[7]:
                                  index  askopen  askhigh   asklow  askclose  bidopen  bidhigh   bidlow  bidclose  volume
date
2017-09-13 11:05:00 2017-09-13 11:05:00  1.32685  1.32704  1.32682   1.32686  1.32684  1.32702  1.32679   1.32683     222
2017-09-13 11:04:00 2017-09-13 11:04:00  1.32683  1.32686  1.32682   1.32685  1.32682  1.32684  1.32680   1.32684      97
2017-09-13 11:03:00 2017-09-13 11:03:00  1.32664  1.32684  1.32663   1.32683  1.32664  1.32683  1.32661   1.32682     268
2017-09-13 11:02:00 2017-09-13 11:02:00  1.32680  1.32692  1.32660   1.32664  1.32678  1.32689  1.32658   1.32664     299
2017-09-13 11:01:00 2017-09-13 11:01:00  1.32648  1.32682  1.32648   1.32680  1.32647  1.32682  1.32647   1.32678     322
2017-09-13 11:00:00 2017-09-13 11:00:00  1.32647  1.32649  1.32628   1.32648  1.32644  1.32651  1.32626   1.32647     285

numpy.unique seems to compare the entire tuple, so it still returns rows whose dates are duplicated.
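A minimal sketch of that behavior, using a small hypothetical array (the field names `key` and `volume` and the values are illustrative, not the question's data). It contrasts calling `np.unique` on whole records with calling it on a single field:

```python
import numpy as np

# Hypothetical stand-in for arr_dupes: 'key' plays the role of 'date'.
rows = np.array(
    [(5, 246), (5, 246), (5, 222), (4, 97)],
    dtype=[('key', '<i8'), ('volume', '<i8')],
)

# On a structured array, np.unique compares every field of each record,
# so (5, 246) and (5, 222) both survive even though they share a key.
whole_records = np.unique(rows)

# Passing one field compares only that field.
keys_only = np.unique(rows['key'])

print(len(whole_records), len(keys_only))
```

Three records remain in the first case and two keys in the second, which is why deduplicating on the date alone needs the column to be passed explicitly.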

The final output should look like this.

array([
      ('2017-09-13T11:05:00.000000',  1.32685,  1.32704,  1.32682,  1.32686,  1.32684,  1.32702,  1.32679,  1.32683, 222),
      ('2017-09-13T11:04:00.000000',  1.32683,  1.32686,  1.32682,  1.32685,  1.32682,  1.32684,  1.3268 ,  1.32684,  97),
      ('2017-09-13T11:03:00.000000',  1.32664,  1.32684,  1.32663,  1.32683,  1.32664,  1.32683,  1.32661,  1.32682, 268),
      ('2017-09-13T11:02:00.000000',  1.3268 ,  1.32692,  1.3266 ,  1.32664,  1.32678,  1.32689,  1.32658,  1.32664, 299),
      ('2017-09-13T11:01:00.000000',  1.32648,  1.32682,  1.32648,  1.3268 ,  1.32647,  1.32682,  1.32647,  1.32678, 322),
      ('2017-09-13T11:00:00.000000',  1.32647,  1.32649,  1.32628,  1.32648,  1.32644,  1.32651,  1.32626,  1.32647, 285)],
      dtype=[('date', '<M8[us]'), ('askopen', '<f8'), ('askhigh', '<f8'), ('asklow', '<f8'), ('askclose', '<f8'),
             ('bidopen', '<f8'), ('bidhigh', '<f8'), ('bidlow', '<f8'), ('bidclose', '<f8'), ('volume', '<i8')]
)

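Thanks

【Comments】:

If it's keep=last, your output would be different from what you've shown...
@COLDSPEED Are you sure? I've added the Pandas version
@James Why can't you use pandas?
@ChaosPredictor Pandas is great, but it adds a lot of overhead. In this case, speed matters

Tags: python-3.x numpy data-structures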

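【Solution 1】:

The solution to your problem doesn't seem to need to mimic pandas' drop_duplicates() function, but I'll provide one that mimics it and one that doesn't.

If you need the same behavior as pandas drop_duplicates(), you can use the following code: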

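# initialization of arr_dupes here

# actual algorithm

# np.unique reports the first occurrence of each value, so reverse the
# 'date' column to make that the last occurrence in the original order
helper1, helper2 = np.unique(arr_dupes['date'][::-1], return_index=True)

# helper2 indexes the reversed array; the final [::-1] puts dates back in descending order
result = arr_dupes[::-1][helper2][::-1]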

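Once arr_dupes is initialized, you pass only the 'date' column to numpy.unique(). Because you are interested in the last of the non-unique elements in the array, you have to reverse the order of the array passed to unique() with [::-1]. That way unique() discards every non-unique element except the last one. unique() then returns the sorted unique elements (helper1) as its first return value and, as its second, the indices of those elements in the array it was given, which here is the reversed array (helper2). Finally, a new array is created by picking the elements listed in helper2 from the reversed arr_dupes and reversing the result once more. Note that unique() sorts its output, so the result always comes out in descending date order; it matches the pandas output here because arr_dupes is already sorted that way.

This solution is about 9.898 times faster than the pandas version.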
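To make the index bookkeeping concrete, here is a minimal sketch on a small hypothetical array (the field names `key` and `val` and the helper `drop_dupes_keep_last` are illustrative and not from the question or the answer). It differs from the code above in one respect, which is an assumption about what you may want: it converts the indices back to positions in the original array and sorts them, so rows keep their original order even when the input is not sorted by the key.

```python
import numpy as np

def drop_dupes_keep_last(arr, field):
    """Keep the last row for each distinct value of arr[field], in original row order."""
    keys = arr[field]
    # np.unique reports the first occurrence of each value,
    # so search the reversed column to find the last occurrence.
    _, first_in_reversed = np.unique(keys[::-1], return_index=True)
    # Translate positions in the reversed column back to the original array.
    last_in_original = len(keys) - 1 - first_in_reversed
    # Sorting the positions restores the original row order.
    return arr[np.sort(last_in_original)]

# Hypothetical data: key 7 appears after key 2, so the input is not sorted.
demo = np.array(
    [(5, 246), (5, 246), (5, 222), (4, 97), (2, 299), (2, 299), (7, 1)],
    dtype=[('key', '<i8'), ('val', '<i8')],
)
result = drop_dupes_keep_last(demo, 'key')
print(result)
```

With the question's array, `drop_dupes_keep_last(arr_dupes, 'date')` should select the same rows as the two-line version, since that array is already in descending date order.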

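Now let me explain what I meant at the start of this answer. It looks to me like your array is sorted by the 'date' column. If that is true, we can assume that duplicates are grouped together. If they are grouped together, we only need to keep the rows whose next row has a different 'date' value. For example, take the following array rows: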

...
  ('2017-09-13T11:05:00.000000',  1.32685,  1.32704,  1.32682,  1.32686,  1.32684,  1.32702,  1.32679,  1.32683, 246),
  ('2017-09-13T11:05:00.000000',  1.32685,  1.32704,  1.32682,  1.32686,  1.32684,  1.32702,  1.32679,  1.32683, 246),
  ('2017-09-13T11:05:00.000000',  1.32685,  1.32704,  1.32682,  1.32686,  1.32684,  1.32702,  1.32679,  1.32683, 222),
  ('2017-09-13T11:04:00.000000',  1.32683,  1.32686,  1.32682,  1.32685,  1.32682,  1.32684,  1.3268 ,  1.32684,  97),
...

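The third row's 'date' differs from the fourth row's, so we need to keep it. No further checks are needed. The first row's 'date' is the same as the second row's, so we don't need that row. The same goes for the second row. In code it looks like this: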

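# initialization of arr_dupes here

# actual algorithm

# True where a row's date differs from the next row's date;
# the appended True keeps the final row, which has no successor
result = arr_dupes[np.concatenate((arr_dupes['date'][:-1] != arr_dupes['date'][1:], np.array([True])))]

First, each element of the 'date' column is compared with the next element. This creates a boolean array. If an index in this boolean array holds True, the arr_dupes element with that index is kept; otherwise it is dropped. Then concatenate() simply appends a final True to this boolean array, because the last element always needs to stay in the result array.

This solution is about 17 times faster than the pandas version.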
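The same comparison can be sketched as a small function to show where the grouping assumption matters. The helper name `keep_last_of_each_run`, the field names, and the data are hypothetical; the empty-input guard is an addition, since the appended True would otherwise make the mask one element longer than an empty array.

```python
import numpy as np

def keep_last_of_each_run(arr, field):
    """Keep the last row of every run of consecutive equal values in arr[field]."""
    keys = arr[field]
    if len(keys) == 0:
        return arr
    # True where a row's key differs from the key of the row after it.
    differs_from_next = keys[:-1] != keys[1:]
    # The final row has no successor, so it is always kept.
    keep = np.concatenate((differs_from_next, [True]))
    return arr[keep]

dtype = [('key', '<i8'), ('val', '<i8')]

# Duplicates are adjacent: one row per key survives.
grouped = np.array([(5, 246), (5, 246), (5, 222), (4, 97), (2, 299), (2, 299)], dtype=dtype)
result = keep_last_of_each_run(grouped, 'key')

# Duplicates are not adjacent: both rows with key 5 survive.
scattered = np.array([(5, 1), (4, 2), (5, 3)], dtype=dtype)
leftover = keep_last_of_each_run(scattered, 'key')

print(result)
print(leftover)
```

The second call shows the limitation: when equal keys are not next to each other, this approach removes nothing, so it is only a replacement for the unique()-based version when the array is sorted or otherwise grouped by the key.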

【Comments】:

I just gave your answer a +1 - thanks for sharing. I'll test it and get back to you soon.