Java - Printing unicode from text file doesn't output corresponding UTF-8 character (answers)

【Question title】: Java - Printing unicode from text file doesn't output corresponding UTF-8 character
【Posted】: 2017-07-08 17:30:08
【Question】:

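I have a text file containing a lot of unicode values and am trying to print the corresponding UTF-8 characters to the console, but all it prints is the hex strings. If I copy any of the values and paste them into a System.out call, it works fine, but not when they are read from the text file.

Below is the code I use to read the file, which contains lines of values such as \u00C0, \u00C1, \u00C2, \u00C3. These values are printed to the console as-is instead of the characters I want.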

private void printFileContents() throws IOException {
    Path encoding = Paths.get("unicode.txt");
    try (Stream<String> stream = Files.lines(encoding)) {

        stream.forEach(v -> { System.out.println(v); });

    } catch (IOException e) {
        e.printStackTrace();
    }
}

This is the method I used to parse the HTML that contained the unicode values in the first place.

private void parseGermanEncoding() {

    try 
    {
        File encoding = new File("encoding.html");

        Document document = Jsoup.parse(encoding, "UTF-8", "http://example.com/");

        Element table = document.getElementsByClass("codetable").first();

        Path f = Paths.get("unicode.txt");

        try (BufferedWriter wr = new BufferedWriter(new FileWriter(f.toFile()))) 
        {
            for (Element row : table.select("tr"))
            {
                Elements tds = row.select("td");

                String unicode = tds.get(0).text();

                if (unicode.startsWith("U+"))
                {
                    unicode = unicode.substring(2);
                }

                wr.write("\\u" + unicode);
                wr.newLine();   

            }   
            wr.flush();
            wr.close();
        }

    } catch (IOException e) 
    {
        e.printStackTrace();
    }
}

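【Comments】:

Did you literally write \u00C2 and so on into your file? Please show us part of the text file.
The text file looks like this. '\u00C0 \u00C1 \u00C2 \u00C3 \u00C4 \u00C5 \u00C6 \u00C7 \u00C8 \u00C9 \u00CA \u00CB \u00CC \u00CD \u00CE \u00CF \u00D0 \u00D1 \u00D2 \u00D3 \u00D4'
Sorry, that didn't print properly. Basically, each of these values is on its own line.
Added more to the original post.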

Tags: java file parsing utf-8 path

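【Solution 1】:

You need to convert the string from a unicode-encoded string to a UTF-8 encoded string. You can do this in two steps: 1. convert the string to a byte array with myString.getBytes("UTF-8"), and 2. get the UTF-8 encoded string with new String(byteArray, "UTF-8"). The code block needs to be wrapped in try/catch for UnsupportedEncodingException.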
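As a minimal sketch of these two steps, here is what the round trip does to one line of the file. The class name Utf8RoundTrip and the helper roundTrip are made up for illustration, and StandardCharsets.UTF_8 is swapped in for the "UTF-8" charset name so that no try/catch is required. A line read from the file holds six literal characters (a backslash, the letter u and four hex digits), and encoding then decoding it hands back the same six characters, which matches the follow-up comment that the output did not change.

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    // Encodes the string to UTF-8 bytes, then decodes those bytes back into a String.
    public static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // One line as read from the text file: backslash, u, 0, 0, C, 0
        String fromFile = "\\u00C0";
        String result = roundTrip(fromFile);

        // The round trip returns an equal string; the escape text is not interpreted.
        System.out.println(result.equals(fromFile));
        System.out.println(result.length());
    }
}
```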

【Comments】:

Still doesn't work. My method now looks like this. Path encoding = Paths.get("unicode.txt"); System.out.println("\u00D9 \u00FC \u00C2 \u00C7 Acme, Inc."); try (Stream<String> stream = Files.lines(encoding)) { stream.forEach(v -> { try { byte[] bytes = v.getBytes("UTF-8"); String str = new String(bytes, "UTF-8"); System.out.println(str); } catch (UnsupportedEncodingException e) { e.printStackTrace(); } })
The code doesn't print well in comments. I included another System.out call in there, and that first System.out prints the correct characters I want.
In the original code from your post, can you try using stream.forEach(System.out::println); ?
Actually, that's what I had originally, and the result was the same.
OK, then you might want to look at the answers to this post. stackoverflow.com/questions/11145681/…

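【Solution 2】:

Thanks to OTM's comments above, I was able to find a working solution for this. You take the unicode string, parse its digits as a hexadecimal number with Integer.parseInt(), and finally cast the result to char to get the actual character. This solution is based on the post OTM provided - How to convert a string with Unicode encoding to a string of letters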

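private void printFileContents() throws IOException {
    Path encoding = Paths.get("unicode.txt");

    try (Stream<String> stream = Files.lines(encoding)) {
        stream.forEach(v -> 
        {
            // Drop the leading backslash-u, if present, so only the hex digits
            // remain; parseInt would otherwise throw NumberFormatException
            String hex = v.startsWith("\\u") ? v.substring(2) : v;

            // Parse the hex digits as a base-16 integer
            int parse = Integer.parseInt(hex.trim(), 16);

            // Cast the numeric value to char to get the actual character
            String output = String.valueOf((char) parse);

            System.out.println(output);
        });

    } catch (IOException e) {
        e.printStackTrace();
    }
}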
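The same parse-and-cast idea can be extended to lines that mix several escapes with ordinary text, as in the sample shown in the question comments. The following is only a sketch under assumptions: the class UnicodeEscapeDecoder and its decode method are hypothetical names, not part of the original answer, and the regular expression assumes every escape is a backslash, the letter u and exactly four hex digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeEscapeDecoder {
    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Replaces every matched escape in the text with the character it names;
    // everything else is copied through unchanged.
    public static String decode(String text) {
        Matcher m = ESCAPE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Parse the four hex digits and cast the value to a char.
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps a decoded '$' or backslash from being
            // treated as special replacement syntax.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes make this the same literal text the file holds.
        System.out.println(decode("\\u00C0 \\u00C1 Acme, Inc."));
    }
}
```

Inside printFileContents, this would be called as stream.forEach(v -> System.out.println(UnicodeEscapeDecoder.decode(v))). Whether the decoded characters display correctly still depends on the encoding the console uses.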

【Comments】: