Finding duplicates in a List ignoring a field

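【Question title】: Finding duplicates in a List ignoring a field
【Posted】: 2015-03-11 15:26:14
【Question】:

I have a List of Persons and I want to find duplicate entries, taking all fields into account except id. That rules out the equals() method (and, as a consequence, List.contains()), because they take id into account.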

public class Person {
    private String firstname, lastname;
    private int age;
    private long id;
}

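Modifying the equals() and hashCode() methods to ignore the id field is not an option, because other parts of the code rely on them.

What is the most efficient way to do this in Java if I want to ignore the id field?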

【Question comments】:

Create a custom Comparator<Person> and use a TreeSet<Person>.
Maybe use a Set with a custom Comparator.
Possible duplicate of "Finding duplicate values in arraylist".

Tags: java list duplicate-removal

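【Solution 1】:

Build a Comparator<Person> that implements your natural-key ordering, then use binary-search-based deduplication. TreeSet gives you this out of the box.

Note that Comparator<T>.compare(a, b) must fulfil the usual antisymmetry, transitivity, consistency and reflexivity requirements, or the binary-search ordering will fail. You should also make it null-aware (for example, if the firstname field of one, the other, or both persons is null).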
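The antisymmetry requirement can be checked mechanically on sample data. The following is a minimal sketch, assuming Java 8 or later; `ComparatorContractCheck`, `isAntisymmetric`, `BROKEN` and `NATURAL` are hypothetical names introduced here for illustration. It contrasts a comparator that returns -1 for every unequal pair with String's natural ordering.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ComparatorContractCheck {

    // Hypothetical helper: true if sgn(compare(a, b)) == -sgn(compare(b, a))
    // holds for every pair drawn from the samples.
    static <T> boolean isAntisymmetric(Comparator<T> cmp, List<T> samples) {
        for (T a : samples) {
            for (T b : samples) {
                int forward = Integer.signum(cmp.compare(a, b));
                int backward = Integer.signum(cmp.compare(b, a));
                if (forward != -backward) {
                    return false;
                }
            }
        }
        return true;
    }

    // Broken on purpose: answers -1 for every "not equal" pair,
    // so swapping the arguments does not flip the sign.
    static final Comparator<String> BROKEN = (a, b) -> a.equals(b) ? 0 : -1;

    // Well-behaved: delegates to String's natural ordering.
    static final Comparator<String> NATURAL = Comparator.naturalOrder();

    public static void main(String[] args) {
        List<String> samples = Arrays.asList("Ann", "Bob", "Cy");
        System.out.println(isAntisymmetric(BROKEN, samples));  // false
        System.out.println(isAntisymmetric(NATURAL, samples)); // true
    }
}
```

A check like this only samples pairs, so passing it does not prove that the contract holds in general; it is mainly useful for catching an obviously broken comparator before handing it to a TreeSet.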

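A simple natural-key comparator for your Person class looks like this (it is a static member class because you haven't shown whether you have accessors for each field).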

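import java.util.Comparator;

public class Person {
    // Orders persons by firstname, lastname and age; id is deliberately ignored.
    public static class NkComparator implements Comparator<Person>
    {
        @Override
        public int compare(Person p1, Person p2)
        {
            // null persons are rejected; null name fields are handled below
            if (p1 == null || p2 == null) throw new NullPointerException();
            if (p1 == p2) return 0;
            int i = nullSafeCompareTo(p1.firstname, p2.firstname);
            if (i != 0) return i;
            i = nullSafeCompareTo(p1.lastname, p2.lastname);
            if (i != 0) return i;
            // Integer.compare avoids the overflow that subtraction can cause
            return Integer.compare(p1.age, p2.age);
        }
        // null sorts before any non-null string; two nulls compare as equal
        private static int nullSafeCompareTo(String s1, String s2)
        {
            return (s1 == null)
                    ? (s2 == null) ? 0 : -1
                    : (s2 == null) ? 1 : s1.compareTo(s2);
        }
    }
    private String firstname, lastname;
    private int age;
    private long id;
}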

You can then use it to generate a unique list. The add method returns true if and only if the element was not already present in the set:

List<Person> newList = new ArrayList<Person>();
TreeSet<Person> nkIndex = new TreeSet<Person>(new Person.NkComparator());
for (Person p : originalList)
    if (nkIndex.add(p)) newList.add(p); // to generate a unique list

Or swap the last line for this one to output the duplicates instead:

    if (!nkIndex.add(p)) newList.add(p); // to generate a list of duplicates

Whatever you do, don't call remove on the original list while enumerating it; that is why these methods add the elements you want to a new list.
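Both variants can be produced in a single pass over the list. The following is a minimal sketch, assuming Java 8 or later; the nested `Person` stand-in, its constructor, and the names `NaturalKeySplit`, `NULL_SAFE`, `NK` and `split` are assumptions made for illustration. The comparator is built with `Comparator.comparing` and `Comparator.nullsFirst` rather than written by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

final class NaturalKeySplit {

    // Stand-in for the question's Person; the constructor is an assumption.
    static final class Person {
        final String firstname, lastname;
        final int age;
        final long id;

        Person(String firstname, String lastname, int age, long id) {
            this.firstname = firstname;
            this.lastname = lastname;
            this.age = age;
            this.id = id;
        }
    }

    // Null strings sort before non-null ones; two nulls compare as equal.
    static final Comparator<String> NULL_SAFE =
            Comparator.nullsFirst(Comparator.<String>naturalOrder());

    // Natural-key ordering: every field except id.
    static final Comparator<Person> NK =
            Comparator.comparing((Person p) -> p.firstname, NULL_SAFE)
                      .thenComparing((Person p) -> p.lastname, NULL_SAFE)
                      .thenComparingInt(p -> p.age);

    // Appends the first occurrence of each natural key to 'unique' and every
    // later occurrence to 'duplicates'. The input list is not modified.
    static void split(List<Person> original, List<Person> unique, List<Person> duplicates) {
        TreeSet<Person> nkIndex = new TreeSet<>(NK);
        for (Person p : original) {
            if (nkIndex.add(p)) {
                unique.add(p);
            } else {
                duplicates.add(p);
            }
        }
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
                new Person("Ann", "Lee", 30, 1L),
                new Person("Ann", "Lee", 30, 2L),  // same natural key as id 1
                new Person("Bob", null, 41, 3L));
        List<Person> unique = new ArrayList<>();
        List<Person> duplicates = new ArrayList<>();
        split(people, unique, duplicates);
        System.out.println(unique.size() + " unique, " + duplicates.size() + " duplicate");
    }
}
```

As in the hand-written comparator, a null Person in the list still causes a NullPointerException here; only null name fields are tolerated.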

If you are only interested in the unique list and want to use as few lines as possible:

TreeSet<Person> set = new TreeSet<Person>(new Person.NkComparator());
set.addAll(originalList);
List<Person> newList = new ArrayList<Person>(set);

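【Comments】:

compare throws an NPE if p1 is null but p2 is not.
@pbabcdefp. From the docs: "Unlike Comparable, a comparator may optionally permit comparison of null arguments, while maintaining the requirements for an equivalence relation." I read that as saying that throwing an NPE from compare is optional if you consider null to have a natural sort order. My edited answer does show how to apply this to the input arguments, though.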

【Solution 2】:

As @LuiggiMendoza suggested in the comments:

You can create a custom Comparator class that compares two Person objects for equality while ignoring their IDs.

class PersonComparator implements Comparator<Person> {

    // wraps the compareTo method to compare two Strings but also accounts for NPE
    int compareStrings(String a, String b) {
        if(a == b) {           // both strings are the same string or are null
          return 0;
        } else if(a == null) { // first string is null, result is negative
            return -1;
        } else if(b == null){  // second string is null, result is positive
            return 1;
        } else {               // no strings are null, return the result of compareTo
            return a.compareTo(b);
        }
    }

    @Override
    public int compare(Person p1, Person p2) {

        // comparisons on Person objects themselves
        if(p1 == p2) {                 // Person 1 and Person 2 are the same Person object
            return 0;
        }
        if(p1 == null && p2 != null) { // Person 1 is null and Person 2 is not, result is negative
            return -1;
        }
        if(p1 != null && p2 == null) { // Person 1 is not null and Person 2 is, result is positive
            return 1;
        }

        int result = 0;

        // comparisons on the attributes of the Persons objects
        result = compareStrings(p1.firstname, p2.firstname);
        if(result != 0) {   // Persons differ in first names, we can return the result
            return result;
        }
        result = compareStrings(p1.lastname, p2.lastname);
        if(result != 0) {  // Persons differ in last names, we can return the result
            return result;
        }

        return Integer.compare(p1.age, p2.age); // if both first name and last names are equal, the comparison difference is in their age
    }
}

Now you can use a TreeSet with this custom Comparator, for example to write a simple method that eliminates duplicate values.

List<Person> getListWithoutDups(List<Person> list) {
    List<Person> newList = new ArrayList<Person>();
    TreeSet<Person> set = new TreeSet<Person>(new PersonComparator()); // use custom Comparator here

    // foreach Person in the list
    for(Person person : list) {
        // if the person isn't already in the set (meaning it's not a duplicate)
        // add it to the set and the new list
        if(!set.contains(person)) {
            set.add(person);
            newList.add(person);
        }
        // otherwise it's a duplicate so we don't do anything
    }

    return newList;
}

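The contains operation in TreeSet, as the documentation says, "provides guaranteed log(n) time cost".

The approach I suggest above takes O(n*log(n)) time, since we perform a contains operation for each list element, but it also uses O(n) space to create the new list and the TreeSet.

If your list is very large (so space matters a lot) but processing speed is not an issue, then instead of adding each non-duplicate to a new list you can remove each duplicate you find: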

List<Person> getListWithoutDups(List<Person> list) {
    TreeSet<Person> set = new TreeSet<Person>(new PersonComparator()); // use custom Comparator here
    Person person;
    // for every Person in the list
    for(int i = 0; i < list.size(); i++) {
        person = list.get(i);
        // if the person is already in the set (meaning it is a duplicate)
        // remove it from the list
        if(set.contains(person)) {
            list.remove(i);
            i--; // make sure to accommodate for the list shifting after removal
        }
        // otherwise add it to the set of non-duplicates
        else {
            set.add(person);
        }
    }
    return list;
}

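Each remove operation on the list takes O(n) time (because the list shifts every time an element is removed) and each contains operation takes O(log(n)) time, so each iteration costs O(n + log(n)) and this approach is O(n^2) in time in the worst case.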

However, the extra space used is roughly halved (it is still O(n)), since we only create the TreeSet and not a second list.
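The same in-place removal can be written without manual index bookkeeping by using `List.removeIf`. This is a minimal sketch, assuming Java 8 or later; `InPlaceDedup` and `removeDuplicates` are hypothetical names, and a case-insensitive String comparator stands in for the Person comparator to keep the example short. It relies on the predicate being called once per element in list order, which is an assumption about the list implementation (it holds for ArrayList in OpenJDK, which also compacts the list in one pass instead of shifting on every removal).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

final class InPlaceDedup {

    // Removes from 'list' every element that compares equal (per cmp)
    // to an element that appeared earlier in the list.
    static <T> void removeDuplicates(List<T> list, Comparator<? super T> cmp) {
        TreeSet<T> seen = new TreeSet<>(cmp);
        // add() returns false when an equal element was already seen,
        // so the predicate is true exactly for the later duplicates
        list.removeIf(item -> !seen.add(item));
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(Arrays.asList("ann", "Bob", "ANN", "bob", "Cy"));
        removeDuplicates(names, String.CASE_INSENSITIVE_ORDER);
        System.out.println(names); // [ann, Bob, Cy]
    }
}
```

The list passed in must support removal, so a fixed-size list such as the one returned directly by `Arrays.asList` would not work here.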

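【Comments】:

Your comparator isn't antisymmetric; are you sure it works with TreeSet? It isn't null-safe either.
You're right. My intention wasn't to cram the code full of error checks, just to show the main logic. I've added checks for invalid/null data.
Your comparator still won't work. Your getListWithoutDups method is broken too.
I saw your answer; it's a cleaner and more efficient approach, +1. I don't seem to follow what my errors are, though. Could you give an example?
In getListWithoutDups, suppose you remove the 0th element. All the elements shift up. Now you increment i and skip the "new" 0th element. Be very careful when iterating over a collection while mutating it. As for the comparator, it needs to be antisymmetric: it can't just return -1 for "not equal", it has to be -ve or +ve, and the sign has to flip if you swap the input arguments p1 and p2.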

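【Solution 3】:

I would advise against using a Comparator for this. Writing a legal compare() method based on the other fields is quite hard.

I think a better solution is to create a class PersonWithoutId like this: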

public class PersonWithoutId {
  private String firstname, lastname;
  private int age;
  // no id field
  public PersonWithoutId(Person original) { /* copy fields from Person */ }
  @Override public boolean equals(Object other) { /* compare these 3 fields */ }
  @Override public int hashCode() { /* hash these 3 fields */ }
}

Then, given a List<Person> called people, you can do this:

Set<PersonWithoutId> set = new HashSet<>();
for (Iterator<Person> i = people.iterator(); i.hasNext();) 
    if (!set.add(new PersonWithoutId(i.next())))
        i.remove();
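The elided method bodies can be filled in along the following lines. This is a minimal sketch, not the answerer's actual implementation: the nested `Person` stand-in with its constructor and the names `WrapperKeyDedup` and `removeDuplicates` are assumptions made so the example is self-contained, and `Objects.equals` / `Objects.hash` are used to cope with null names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Objects;
import java.util.Set;

final class WrapperKeyDedup {

    // Stand-in for the question's Person; the constructor is an assumption.
    static final class Person {
        final String firstname, lastname;
        final int age;
        final long id;

        Person(String firstname, String lastname, int age, long id) {
            this.firstname = firstname;
            this.lastname = lastname;
            this.age = age;
            this.id = id;
        }
    }

    // Value object holding every Person field except id.
    static final class PersonWithoutId {
        private final String firstname, lastname;
        private final int age;

        PersonWithoutId(Person original) {
            this.firstname = original.firstname;
            this.lastname = original.lastname;
            this.age = original.age;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof PersonWithoutId)) return false;
            PersonWithoutId other = (PersonWithoutId) o;
            return age == other.age
                    && Objects.equals(firstname, other.firstname)
                    && Objects.equals(lastname, other.lastname);
        }

        @Override
        public int hashCode() {
            return Objects.hash(firstname, lastname, age);
        }
    }

    // Removes later duplicates from 'people' in place via Iterator.remove().
    static void removeDuplicates(List<Person> people) {
        Set<PersonWithoutId> seen = new HashSet<>();
        for (Iterator<Person> i = people.iterator(); i.hasNext();) {
            if (!seen.add(new PersonWithoutId(i.next()))) {
                i.remove();
            }
        }
    }

    public static void main(String[] args) {
        List<Person> people = new ArrayList<>(Arrays.asList(
                new Person("Ann", "Lee", 30, 1L),
                new Person("Ann", "Lee", 30, 2L),  // duplicate of id 1 when id is ignored
                new Person("Bob", "Ray", 41, 3L)));
        removeDuplicates(people);
        System.out.println(people.size()); // 2
    }
}
```

Because equals() and hashCode() must agree, both methods use exactly the same three fields.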

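Edit

As others have pointed out in the comments, this solution isn't ideal because it creates a lot of objects for the garbage collector to deal with. But it is a lot faster than the solutions that use a Comparator and a TreeSet. Keeping the Set ordered takes time and has nothing to do with the original problem. I tested this on Lists of 1,000,000 Person instances, constructed using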

new Person(
    "" + rand.nextInt(500),  // firstname 
    "" + rand.nextInt(500),  // lastname
    rand.nextInt(100),       // age
    rand.nextLong())         // id

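and found this solution to be roughly twice as fast as the one using a TreeSet. (Admittedly, I used System.nanoTime() rather than proper benchmarking.)

So how do you do this efficiently without creating lots of unnecessary objects? Java doesn't make it easy. One way is to write two new methods in Person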

boolean equalsIgnoringId(Person other) { ... }

int hashCodeIgnoringId() { ... }

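and then write a custom implementation of Set<Person> in which you basically cut and paste the code of HashSet, except that you replace equals() and hashCode() with equalsIgnoringId() and hashCodeIgnoringId().

In my humble opinion, the fact that you can create a TreeSet that uses a Comparator, but cannot create a HashSet that uses custom versions of equals/hashCode, is a serious flaw in the language.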
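A smaller alternative to copying HashSet is to bucket persons by hashCodeIgnoringId() and let equalsIgnoringId() decide within a bucket. This is a minimal sketch of that swapped-in technique, not a full Set<Person> implementation; the nested `Person` stand-in, the method bodies, and the names `IgnoringIdIndex`, `add` and `unique` are assumptions. It avoids a wrapper object per Person, but it still allocates boxed Integer keys and one small list per distinct hash, so it is not allocation-free.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

final class IgnoringIdIndex {

    // Stand-in for the question's Person with the two proposed methods added.
    static final class Person {
        final String firstname, lastname;
        final int age;
        final long id;

        Person(String firstname, String lastname, int age, long id) {
            this.firstname = firstname;
            this.lastname = lastname;
            this.age = age;
            this.id = id;
        }

        boolean equalsIgnoringId(Person other) {
            return other != null
                    && age == other.age
                    && Objects.equals(firstname, other.firstname)
                    && Objects.equals(lastname, other.lastname);
        }

        int hashCodeIgnoringId() {
            return Objects.hash(firstname, lastname, age);
        }
    }

    // Persons grouped by hashCodeIgnoringId(); collisions share a bucket.
    private final Map<Integer, List<Person>> buckets = new HashMap<>();

    // Mirrors Set.add: true if no person equal-ignoring-id was present yet.
    boolean add(Person p) {
        List<Person> bucket =
                buckets.computeIfAbsent(p.hashCodeIgnoringId(), k -> new ArrayList<>());
        for (Person q : bucket) {
            if (q.equalsIgnoringId(p)) {
                return false;
            }
        }
        bucket.add(p);
        return true;
    }

    // Returns the first occurrence of each person, ignoring id.
    static List<Person> unique(List<Person> people) {
        IgnoringIdIndex index = new IgnoringIdIndex();
        List<Person> result = new ArrayList<>();
        for (Person p : people) {
            if (index.add(p)) {
                result.add(p);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
                new Person("Ann", "Lee", 30, 1L),
                new Person("Ann", "Lee", 30, 2L),
                new Person("Ann", "Lee", 31, 3L));
        System.out.println(unique(people).size()); // 2
    }
}
```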

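【Comments】:

If you write a custom comparator that the List doesn't know about, the order of the list won't change. Creating a derived class just for this purpose isn't a good idea: this is about alternative equality and ordering (effectively natural-key ordering rather than primary-key ordering), so a comparator is the perfect tool. Creating a new type and bogging down the GC with all the type conversions isn't a good idea.
I've removed the line about changing the order. Technically you can do it with a TreeSet. But the idea that a Comparator is the perfect tool is completely wrong: the problem is about equality, not ordering. I agree that my solution is flawed, but there is no satisfactory solution.
I like your new argument, but when I tried to performance test your solution, the results depended on the percentage of duplicates, the list size, steady state vs. start-up, and even the JVM. I got results similar to yours with your rand ranges, but I saw both speed-ups and slow-downs depending on the exact conditions. tl;dr: performance testing is hard
@AndyBrown I know my testing method was very lazy - thanks for the link. I'll read it, since I know nothing about performance testing.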

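【Solution 4】:

You can use a Java HashMap with <K,V> pairs: Map<K,V> map = new HashMap<K,V>(). You will also need some form of Comparator implementation. If you check with the containsKey or containsValue methods and find that you already have something (i.e. you are trying to add a duplicate), keep it in the original list. Otherwise, pop it. That way you end up with a list of the elements that are duplicated in the original list. A TreeSet would be another option, but I haven't used it, so I can't offer advice on it.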
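One way to make the HashMap idea concrete is sketched below. A HashMap relies on the equals() and hashCode() of its keys rather than on a Comparator, so this sketch uses a list of the non-id fields as the key instead; that choice, the nested `Person` stand-in, and the names `HashMapDuplicates`, `key` and `duplicates` are all assumptions made for illustration (Java 8 or later for `putIfAbsent`).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HashMapDuplicates {

    // Stand-in for the question's Person; the constructor is an assumption.
    static final class Person {
        final String firstname, lastname;
        final int age;
        final long id;

        Person(String firstname, String lastname, int age, long id) {
            this.firstname = firstname;
            this.lastname = lastname;
            this.age = age;
            this.id = id;
        }
    }

    // Natural key: every field except id. List defines equals/hashCode
    // element by element, and Arrays.asList tolerates null elements.
    static List<Object> key(Person p) {
        return Arrays.<Object>asList(p.firstname, p.lastname, p.age);
    }

    // Returns the persons whose natural key already occurred earlier in the list.
    static List<Person> duplicates(List<Person> people) {
        Map<List<Object>, Person> firstSeen = new HashMap<>();
        List<Person> dups = new ArrayList<>();
        for (Person p : people) {
            // putIfAbsent returns the existing value, or null if the key was new
            if (firstSeen.putIfAbsent(key(p), p) != null) {
                dups.add(p);
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
                new Person("Ann", "Lee", 30, 1L),
                new Person("Bob", null, 41, 2L),
                new Person("Ann", "Lee", 30, 3L));  // repeats the key of id 1
        System.out.println(duplicates(people).size()); // 1
    }
}
```

The map's values keep the first occurrence for each key, so the original of every duplicate can be looked up afterwards if it is needed.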

【Comments】: