Algorithm for removing duplicate elements from an array without auxiliary storage: answers

【Question title】: Algorithm for removing duplicate elements from an array without auxiliary storage
【Posted】: 2014-03-04 00:09:08
【Question】:

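I am working on the well-known interview question of removing duplicate elements from an array without using auxiliary storage while preserving the order.

I have read a number of posts, including "Algorithm: efficient way to remove duplicate integers from an array" and "Removing Duplicates from an Array using C".

They are either implemented in C (without explanation), or the Java code they provide fails on consecutive duplicates such as [1,1,1,3,3].

I am not very confident with C; my background is Java. So I implemented the code myself, as follows:

1. Use two loops: the outer loop traverses the array and the inner loop checks for duplicates, replacing each one it finds with null.
2. Then go over the array in which duplicates have been replaced by null and remove the null elements, moving the next non-null element into each gap.
3. The total running time I see now is O(n^2)+O(n) ~ O(n^2). From the posts above I understand this is the best we can do if sorting and auxiliary storage are not allowed. My code is below; I am looking for ways to optimize it further (if that is possible) or for better/simpler logic.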

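import java.util.Arrays;

public class RemoveDup {
    public static void main(String[] args) {
        Integer[] arr2 = {3, 45, 1, 2, 3, 3, 3, 3, 2, 1, 45, 2, 10};
        Integer[] res = removeDup(arr2);
        System.out.println(Arrays.toString(res));
    }

    private static Integer[] removeDup(Integer[] data) {
        int size = data.length;
        if (size == 0) {
            return data; // nothing to do; also avoids copyOf padding with null
        }
        int count = 1; // data[0] is always kept
        // Pass 1: null out every later occurrence of each value.
        for (int i = 0; i < size; i++) {
            Integer temp = data[i];
            for (int j = i + 1; j < size && temp != null; j++) {
                // equals(), not ==: == compares Integer references, not values
                if (temp.equals(data[j])) {
                    data[j] = null;
                }
            }
        }
        // Pass 2: compact the surviving (non-null) elements to the front.
        for (int i = 1; i < size; i++) {
            Integer current = data[i];
            if (current != null) {
                data[count++] = current;
            }
        }
        return Arrays.copyOf(data, count);
    }
}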

Edit 1: the code below, reformatted from @keshlam's answer, throws an ArrayIndexOutOfBoundsException:

private static int removeDupes(int[] array) {
    System.out.println("method called");
    if (array.length < 2)
        return array.length;

    int outsize = 1; // first is always kept

    for (int consider = 1; consider < array.length; ++consider) {
        for (int compare = 0; compare < outsize; ++compare) {
            if (array[consider] != array[compare])
                array[outsize++] = array[consider]; // already present; advance to next compare
            else
                break;
            // if we get here, we know it's new so append it to output
            //array[outsize++]=array[consider]; // could test first, not worth it.
        }
    }
    System.out.println(Arrays.toString(array));
    // length is last written position plus 1
    return outsize;
}

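【Question comments】:

What is your question?
@keshlam: point 3 above
I think that under these constraints you cannot do better than O(N^2). So the code looks fine.
I don't think Arrays.copyOf() is doing what you want.
There is a solution I like better, which needs only one pass through the outer loop, but I'm not sure it is any better in big-O terms.

Tags: java arrays algorithm duplicate-removal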

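【Solution 1】:

OK, here is my answer, which should be O(N*N) in the worst case. (With a smaller constant, since even in the worst case I'm testing N against, on average, 1/2 N; but this is computer science rather than software engineering, and a mere 2x speedup isn't significant. Thanks to @Alexandru for pointing that out.)

1) Split cursors (input and output advance separately).

2) Each new value only has to be compared against the values already kept, and the comparison can stop as soon as a match is found. (The hint keyword was "incremental".)

3) The first element does not need to be tested.

4) I'm taking advantage of a labelled continue, where I could instead have set a flag before breaking and then tested the flag. It comes out to the same thing; this is just a bit more elegant.

4.5) I could have tested whether outsize==consider and skipped the copy when that is true. But the test would take about as many cycles as the possibly unnecessary copy, and in most cases the two are not equal, so it is easier to just let a possibly redundant copy happen.

5) I'm not re-copying the data in the key function; I've factored the copy-for-printing operation out into a separate function to make it clear that removeDupes really does run entirely within the target array, plus a few automatic variables on the stack. I'm also not spending time zeroing out the leftover elements at the end of the array; that may be wasted work (as it is in this example), although I don't think it would actually change the formal complexity.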

import java.util.Arrays;

public class RemoveDupes {

  private static int removeDupes(final int[] array) {
    if(array.length < 2)
      return array.length;

    int outsize=1; // first is always kept

    outerloop: for (int consider = 1; consider < array.length; ++consider) {

      for(int compare=0;compare<outsize;++compare)
        if(array[consider]==array[compare])
          continue outerloop; // already present; advance to the next element to consider

      // if we get here, we know it's new so append it to output
      array[outsize++]=array[consider]; // could test first, not worth it. 
    }

    return outsize; // length is last written position plus 1
  }

  private static void printRemoveDupes(int[] array) {
    int newlength=removeDupes(array);
    System.out.println(Arrays.toString(Arrays.copyOfRange(array, 0, newlength)));
  }

  public static void main(final String[] args) {
    printRemoveDupes(new int[] { 3, 45, 1, 2, 3, 3, 3, 3, 2, 1, 45, 2, 10 });
    printRemoveDupes(new int[] { 2, 2, 3, 3 });
    printRemoveDupes(new int[] { 1, 1, 1, 1, 1, 1, 1, 1 });
  }
}
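
For comparison, point 4.5 above mentions a variant that skips the copy when the input and output cursors coincide. A minimal sketch of that variant might look like the following; the class and method names (RemoveDupesGuarded, removeDupesGuarded) are made up for illustration and are not part of the original answer.

```java
import java.util.Arrays;

final class RemoveDupesGuarded {

    // Hypothetical variant of removeDupes: same algorithm, but the write is
    // skipped while the output cursor has not yet fallen behind the input cursor.
    static int removeDupesGuarded(int[] array) {
        if (array.length < 2) {
            return array.length;
        }
        int outsize = 1; // first element is always kept

        outerloop:
        for (int consider = 1; consider < array.length; ++consider) {
            for (int compare = 0; compare < outsize; ++compare) {
                if (array[consider] == array[compare]) {
                    continue outerloop; // already kept; move on to the next input element
                }
            }
            if (outsize != consider) {
                array[outsize] = array[consider]; // copy only when the value actually moves
            }
            ++outsize;
        }
        return outsize;
    }

    public static void main(String[] args) {
        int[] data = {3, 45, 1, 2, 3, 3, 3, 3, 2, 1, 45, 2, 10};
        int n = removeDupesGuarded(data);
        System.out.println(Arrays.toString(Arrays.copyOf(data, n)));
    }
}
```

The result is the same as with the unguarded copy; the extra test only saves writes on the prefix of the array that contains no duplicates yet, which is the trade-off point 4.5 describes.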

Added later: since people have expressed confusion about point 4 of my explanation, here is the loop rewritten without the labelled continue:

for (int consider = 1; consider < array.length; ++consider) {
  boolean matchFound = false;

  for (int compare = 0; compare < outsize; ++compare) {
    if (array[consider] == array[compare]) {
      matchFound = true;
      break;
    }
  }

  if (!matchFound) // only add it to the output if not found
    array[outsize++] = array[consider];
}

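Hope that helps. Labelled continue is a rarely used Java feature, so it is not surprising that some people have not seen it before. It is useful, but it does make code harder to read; I probably would not use it in anything much more complicated than this simple algorithm.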
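
To make the remark about the constant factor concrete, here is a small sketch that instruments the same loop structure with a comparison counter. The class and method names (CompareCount, countComparisons) are hypothetical and exist only for this illustration. When all N values are distinct, every element is compared against everything kept so far, so the count is 1 + 2 + ... + (N-1) = N(N-1)/2; when all values are equal, each element matches on its first comparison and the count is N-1.

```java
final class CompareCount {

    // Same loops as removeDupes (flag version), plus a counter.
    // Returns the number of element comparisons performed; the array is
    // compacted in place as a side effect, exactly as in the original.
    static long countComparisons(int[] array) {
        if (array.length < 2) {
            return 0;
        }
        long comparisons = 0;
        int outsize = 1; // first element is always kept

        for (int consider = 1; consider < array.length; ++consider) {
            boolean matchFound = false;
            for (int compare = 0; compare < outsize; ++compare) {
                ++comparisons;
                if (array[consider] == array[compare]) {
                    matchFound = true;
                    break;
                }
            }
            if (!matchFound) {
                array[outsize++] = array[consider];
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        int[] distinct = new int[1000];
        for (int i = 0; i < distinct.length; i++) {
            distinct[i] = i; // worst case: no duplicates at all
        }
        System.out.println(countComparisons(distinct)); // N(N-1)/2 for N = 1000
    }
}
```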

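【Comments】:

No. Even if not every value is tested against all the others, it can still be O(n²), just with a smaller constant. You have 1 test, then 2 tests, then 3, and so on up to N-1: check the result of that sum on WolframAlpha.
....aaah. It averages out to N * N/2, which is indeed O(N*N) once the 1/2 is dropped. My bad; OK, I'll fix the description.
@keshlam Yes, then it only makes one pass.
Thanks again, AB, for the sanity check and for keeping me honest. It shows that I'm primarily an engineer rather than a scientist.
Unformatted comments really are not suited to discussing extended blocks of code. If you need help understanding why this doesn't work, you would be better off starting a new question. But please re-read what I have already said: you need to be able to tell whether the loop exited because the test found a match or because it did not, and use that to control whether the assignment happens. That means you need a flag variable that starts out false and becomes true only when the break happens (or vice versa), and an if statement that tests the flag and controls the assignment.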

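【Solution 2】:

Here is a version that uses no extra memory (apart from the array it returns) and does not sort.

I believe this is slightly worse than O(n*log n).

Edit: I was wrong. This is slightly better than O(n^3).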

import java.util.Arrays;

public class Dupes {

    private static int[] removeDupes(final int[] array) {
        int end = array.length - 1;
        for (int i = 0; i <= end; i++) {
            for (int j = i + 1; j <= end; j++) {
                if (array[i] == array[j]) {
                    for (int k = j; k < end; k++) {
                        array[k] = array[k + 1];
                    }
                    end--;
                    j--;
                }
            }
        }

        return Arrays.copyOf(array, end + 1);
    }

    public static void main(final String[] args) {
        System.out.println(Arrays.toString(removeDupes(new int[] { 3, 45, 1, 2, 3, 3, 3, 3, 2, 1, 45, 2, 10 })));
        System.out.println(Arrays.toString(removeDupes(new int[] { 2, 2, 3, 3 })));
        System.out.println(Arrays.toString(removeDupes(new int[] { 1, 1, 1, 1, 1, 1, 1, 1 })));
    }
}
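
As a rough way to probe the complexity estimate above, the sketch below runs the same three loops with a counter on the innermost shift. The names MoveCount and countMoves are assumptions made for this illustration only. One informal observation: every shift permanently removes one element, so there can be at most n-1 shifts of fewer than n moves each, which suggests the shifting work is bounded by roughly n^2 in total rather than n^3. Treat that as a sanity check, not a proof.

```java
final class MoveCount {

    // Same loops as Dupes.removeDupes, but returns how many element moves
    // the innermost (shifting) loop performed instead of the result array.
    static long countMoves(int[] array) {
        long moves = 0;
        int end = array.length - 1;
        for (int i = 0; i <= end; i++) {
            for (int j = i + 1; j <= end; j++) {
                if (array[i] == array[j]) {
                    for (int k = j; k < end; k++) {
                        array[k] = array[k + 1]; // shift the tail left by one
                        ++moves;
                    }
                    end--;
                    j--; // re-check the element that was shifted into position j
                }
            }
        }
        return moves;
    }

    public static void main(String[] args) {
        // All-equal input: every element after the first is a duplicate.
        int[] ones = {1, 1, 1, 1, 1, 1, 1, 1};
        System.out.println(countMoves(ones));
    }
}
```

For an all-equal array of length n the shifts cost (n-2) + (n-3) + ... + 0 moves, that is (n-1)(n-2)/2, while an array with no duplicates performs no moves at all.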

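Here is a modified version that does not shift all the elements that follow the dupe. Instead, it simply replaces the dupe with the last non-matching element. This obviously cannot guarantee the order.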

private static int[] removeDupes(final int[] array) {
    int end = array.length - 1;
    for (int i = 0; i <= end; i++) {
        for (int j = i + 1; j <= end; j++) {
            if (array[i] == array[j]) {
                while (end >= j && array[j] == array[end]) {
                    end--;
                }
                if (end > j) {
                    array[j] = array[end];
                    end--;
                }
            }
        }
    }

    return Arrays.copyOf(array, end + 1);
}

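【Comments】:

This is what I was thinking of.
This is actually O(n^3) (worst case)
Is it just my impression, or are the two outer loops already O(n^2)? You don't halve in the second for, because that would mean the for is O(log n). As @alfasin said, 3 fors would be O(n^3)
The second loop is on average half the size of the first, because it iterates over the remainder of the first loop. The third loop is on average half the size of the second, but it also shrinks the list.
But that is still O(n^3).. since O(n/2)~O(n)?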

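【Solution 3】:

Here is a worst-case O(n^2) version that returns a pointer to the first non-unique element, so everything before it is unique. In Java, indices can be used in place of the C++ iterators.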

#include <algorithm>
#include <vector>

std::vector<int>::iterator unique(std::vector<int>& aVector){
    auto end = aVector.end();   // one past the last candidate element
    auto start = aVector.begin();
    while(start != end){
        auto num = *start; // the element to check against
        auto temp = ++start; // start gets incremented here
        while (temp != end){
            if (*temp == num){
                --end;                     // step back onto the last candidate
                std::iter_swap(temp, end); // swap the values, not the iterators
            }
            else
                ++temp; // only advance in the else branch, so a swapped-in element is still checked
        }
    }
    return end;
}

Equivalent Java code (the return value is an int, the index of the first non-unique element):

static int unique(int[] anArray) {
    int end = anArray.length; // exclusive: one past the last candidate element
    int start = 0;
    while (start != end) {
        int num = anArray[start]; // the element to check against
        int temp = ++start; // start gets incremented here
        while (temp != end) {
            if (anArray[temp] == num) {
                --end; // step back onto the last candidate
                // swap the values at indices temp and end
                int tmp = anArray[temp];
                anArray[temp] = anArray[end];
                anArray[end] = tmp;
            } else {
                temp++; // only advance in the else branch, so a swapped-in element is still checked
            }
        }
    }
    return end;
}

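The slight difference between this algorithm and yours lies in your point 2. Instead of replacing the current element with null, you swap it with the last possibly-unique element, which is the last element of the array on the first swap, the second-to-last on the second swap, and so on.

You may also want to look at the std::unique implementation in C++, whose complexity is linear in one less than the distance between first and last: it compares each pair of elements and possibly performs assignments on some of them. As @keshlam pointed out, however, it is meant for sorted arrays only. The return value is the same as in my algorithm. Here is the code taken directly from the standard library: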

template<class _FwdIt, class _Pr> inline
    _FwdIt _Unique(_FwdIt _First, _FwdIt _Last, _Pr _Pred)
    {   // remove each satisfying _Pred with previous
    if (_First != _Last)
        for (_FwdIt _Firstb; (_Firstb = _First), ++_First != _Last; )
            if (_Pred(*_Firstb, *_First))
                {   // copy down
                for (; ++_First != _Last; )
                    if (!_Pred(*_Firstb, *_First))
                        *++_Firstb = _Move(*_First);
                return (++_Firstb);
                }
    return (_Last);
    }
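
For readers who do not know C++, the adjacent-only behaviour of the code above can be approximated in Java roughly as follows. This is a sketch, not a translation of the library code; AdjacentUnique and uniqueAdjacent are made-up names, and the method returns the new logical length instead of an iterator.

```java
final class AdjacentUnique {

    // Hypothetical helper: collapses each run of equal neighbours into a
    // single element, in place and in one pass, and returns the new length.
    // Duplicates that are not adjacent are NOT removed.
    static int uniqueAdjacent(int[] a) {
        if (a.length == 0) {
            return 0;
        }
        int last = 0; // index of the last element kept so far
        for (int i = 1; i < a.length; i++) {
            if (a[i] != a[last]) {
                a[++last] = a[i]; // new value: append it after the kept prefix
            }
        }
        return last + 1;
    }

    public static void main(String[] args) {
        int[] sorted = {1, 1, 1, 3, 3};
        int n = uniqueAdjacent(sorted);
        System.out.println(java.util.Arrays.toString(java.util.Arrays.copyOf(sorted, n)));
    }
}
```

Because only neighbours are compared, an input such as {3, 1, 3} comes back unchanged, which is why this approach removes all duplicates only when the data is sorted first.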

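【Comments】:

Sorry, my knowledge of C++ is zero, so I'm having a hard time following. At least I can see that you use pointers (*), which Java does not allow. So this is hard... or if you could add a comment above each line saying what it does, that would help me understand
Note that uniq only removes adjacent duplicates. If you want to remove all duplicates you need to sort first, which puts the complete problem as originally stated back in N-squared space. (Alexandru knows this; I'm just pointing it out for anyone who might get over-excited about the fact that uniq by itself is linear.)
@keshlam You're right, I completely forgot about that. That's 1-1 :)
@keshlam: I tried it and it worked with int[] arr2 = { 3, 45, 1, 2, 3, 3, 3, 3, 2, 1, 45, 2, 10 }; the duplicates here are not consecutive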

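【Solution 4】:

To bring in a bit of perspective, here is a solution in Haskell. It uses lists instead of arrays and returns the result in reverse order, which can be fixed by applying reverse at the end.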

import Data.List (foldl')

removeDup :: (Eq a) => [a] -> [a]
removeDup = foldl' (\acc x-> if x `elem` acc then acc else x:acc) []

【Comments】: