java - Hashset contain duplicates? (Answers)

【Question title】: java - Hashset contain duplicates?
【Posted】: 2018-10-10 02:50:59
【Question description】:

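I am trying to extract all the text from a PDF and store it in a HashSet. As far as I know, a HashSet does not contain duplicates, so it should ignore duplicates as I extract the words. However, when I print the contents of the set, I notice duplicate whitespace entries in it.

I want to insert the values from the set into a table in MySQL, but the table has a primary key constraint, which is giving me trouble. Is there a way to completely remove every kind of duplicate from the set?

My code for extracting the text: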

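import java.io.File;
import java.io.IOException;
import java.util.HashSet;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// The class declaration was missing from the post; the name here is a placeholder.
public class PdfWordExtractor {

    public static void main(String[] args) throws Exception {
        String path = "D:/PDF/searchable.pdf";
        HashSet<String> uniqueWords = new HashSet<>();
        try (PDDocument document = PDDocument.load(new File(path))) {
            if (!document.isEncrypted()) {
                PDFTextStripper tStripper = new PDFTextStripper();
                String pdfFileInText = tStripper.getText(document);
                String[] lines = pdfFileInText.split("\\r?\\n");
                for (String line : lines) {
                    // Splitting on a single space yields empty strings for consecutive spaces.
                    String[] words = line.split(" ");
                    for (String word : words) {
                        uniqueWords.add(word);
                    }
                }
                System.out.println(uniqueWords);
            }
        } catch (IOException e) {
            System.err.println("Exception while trying to read pdf document - " + e);
        }
        Object[] words = uniqueWords.toArray();
        System.out.println(words[1].toString());

        // MysqlAccess is the asker's own helper class (not shown in the post).
        MysqlAccess connection = new MysqlAccess();

        // The loop starts at index 1, so words[0] is skipped.
        for (int i = 1; i <= words.length - 1; i++) {
            connection.readDataBase(path, words[i].toString());
        }
        System.out.println("Completed");
    }
}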

This is my hash set:

[, highlight, of, Even, copy, file,, or, ., ,, 1, reader,, different, D, F, ll, link, ea, This, ed, document, V, P, ability, regardless, g, d, text., e, b, a, n, o, web, l, footnote., should, Most, IDRH, selection, text-searchable, positioning, u, s, what, r, PDF., happens, er, y, x, to, body, single, ca, te, together, ti, th, would, when, be, Text-Searchable, document,, text, isn't, such, kinds, sh, co, ld, font,, example, ch, this, attempt, have, t,, Notice,, contained, from, re, text.1, page,, style, page., able, if, is, You, standard, PDF, your, as, readers, you, the, in, main, an, iz]

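If the entries are unique, why does it throw "Duplicate entry for key PRIMARY" when I try to insert them into the primary key column?

Any suggestions would be greatly appreciated.

【Comments】: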

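Evidently they are not the same strings.
Your input probably also contains things like commas, spaces, and tabs.
Can reader, be a word? You don't seem to handle punctuation.
Your database may have a different notion of uniqueness. For example, it may treat foo and FOO as the same value. The error message should tell you exactly which value failed.
You can use a case-insensitive set: new TreeSet<>(String.CASE_INSENSITIVE_ORDER)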
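
To see which entries the database might be treating as equal, one option is to group the set's contents under a normalized key and print every group with more than one member. The sketch below assumes the table compares values case-insensitively and ignores surrounding whitespace; that is a guess about the MySQL collation, not something the post confirms. The class and method names (CollisionFinder, normalize, findCollisions) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class CollisionFinder {

    // Assumed approximation of the database's comparison rule:
    // surrounding whitespace is ignored and case does not matter.
    static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    // Groups the words by their normalized form and keeps only the
    // groups in which two or more distinct Java strings collide.
    static Map<String, List<String>> findCollisions(Set<String> words) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String w : words) {
            groups.computeIfAbsent(normalize(w), k -> new ArrayList<>()).add(w);
        }
        groups.values().removeIf(group -> group.size() < 2);
        return groups;
    }

    public static void main(String[] args) {
        // A few entries taken from the printed set in the question.
        Set<String> words = new HashSet<>(Arrays.asList(
                "This", "this", "Text-Searchable", "text-searchable", "PDF", "document"));
        findCollisions(words).forEach((key, group) ->
                System.out.println("\"" + key + "\" <- " + group));
    }
}
```

With the sample above, the pairs This/this and Text-Searchable/text-searchable are reported, which matches entries that really do appear together in the printed set.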

Tags: java

【Solution 1】:

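A HashSet does not allow any duplicates to be added to it.

Here is the documentation of the add(E e) method of the HashSet class:

public boolean add(E e)

Adds the specified element to this set if it is not already present. More formally, adds the specified element e to this set if this set contains no element e2 such that (e==null ? e2==null : e.equals(e2)). If this set already contains the element, the call leaves the set unchanged and returns false.

In your case, when you call the split method on pdfFileInText, you get an array that contains strings made of a single space as well as strings made of several spaces, so your HashSet ends up holding both single-space and multi-space strings. Somewhere along the way to the database, however, those strings are being trimmed, and that produces the duplicate entries.

To elaborate, have a look at the following code snippet: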

import java.util.HashSet;

public class TestHashSetUniqueness {

    public static void main(String[] args) {
        HashSet<String> hashSet = new HashSet<String>();
        String oneSpace = " ";
        String twoSpaces = "  ";

        hashSet.add(oneSpace);
        hashSet.add(twoSpaces);

        // The size is 2 here: oneSpace and twoSpaces are different
        // strings according to String.equals.
        System.out.println("HashSet size without trim() : " + hashSet.size());

        hashSet.clear();
        hashSet.add(oneSpace.trim());
        hashSet.add(twoSpaces.trim());

        // After trimming, both strings become the empty string, so the
        // set holds only one element and the duplicate is avoided.
        System.out.println("HashSet size with trim() : " + hashSet.size());
    }
}

So, to solve your problem, call trim() on each string before adding it to the HashSet.
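
Applied to the extraction loop, that advice could look roughly like the sketch below. It goes a little beyond the answer: it splits on any run of whitespace (the regex \s+) rather than on a single space, trims each token, and skips empty tokens so that whitespace-only entries never reach the set. The class and method names (WordNormalizer, uniqueWords) are hypothetical, and a LinkedHashSet is used only so the printed order is predictable.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class WordNormalizer {

    // Collects the distinct, non-empty, trimmed tokens of the given text.
    static Set<String> uniqueWords(String text) {
        Set<String> result = new LinkedHashSet<>();
        // \s+ matches one or more spaces, tabs, or line breaks.
        for (String token : text.split("\\s+")) {
            String word = token.trim();
            // Leading whitespace in the text produces one empty token; skip it.
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "This  is a\ttext-searchable PDF.\nThis is a PDF.";
        System.out.println(uniqueWords(text));
        // prints [This, is, a, text-searchable, PDF.]
    }
}
```

This does not strip punctuation or unify case, so "PDF." and "PDF" would still count as different words.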

I hope this answers your question.

【Discussion】:

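Does it ignore duplicates that differ only in case, such as Fish and fish? My SQL won't let me insert if both of these elements are in the hash set.
No, it does not. For a HashSet<String>, the uniqueness criterion is the result of String::equals.
To filter out duplicates regardless of case, you need to use a TreeSet with the CASE_INSENSITIVE_ORDER comparator. See the following link for more information: stackoverflow.com/questions/24558456/…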
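
A minimal sketch of that suggestion, using only java.util classes. The class and method names (CaseInsensitiveSetDemo, caseInsensitive) are made up for the example. Note that a TreeSet keeps whichever spelling was added first and silently drops later ones that compare equal.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class CaseInsensitiveSetDemo {

    // Builds a sorted set that treats strings differing only in case as equal.
    static Set<String> caseInsensitive(Iterable<String> words) {
        Set<String> set = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        for (String w : words) {
            set.add(w); // returns false and changes nothing for "fish" after "Fish"
        }
        return set;
    }

    public static void main(String[] args) {
        Set<String> set = caseInsensitive(Arrays.asList("Fish", "fish", "FISH", "cat"));
        System.out.println(set);                  // prints [cat, Fish]
        System.out.println(set.contains("fIsH")); // prints true
    }
}
```

Whether this matches the database's idea of equality depends on the column's collation, so it is worth checking the value named in the error message.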