Text content of a file stored in a String is not converted from Unicode to ISO_8859_1 (answers)

【Question title】: Text content of file store in String not converting unicode to ISO_8859_1
【Posted】: 2021-09-15 11:54:15
【Question】:

I am trying to convert Unicode to ISO_8859_1. This is straightforward when the Unicode text is declared in a Java String variable, for example:

String myString = "\u00E9checs";
byte[] bytesOfString = myString.getBytes();
String encoded_String = new String(bytesOfString, StandardCharsets.ISO_8859_1);
System.out.println(encoded_String);

Output:

échecs

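So far so good, but when I try to convert the same text stored in a file, it is not converted and is simply printed as is. Here is the code that reads from the file and performs the conversion.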

    String path = "st.txt"; //where st.txt contains only one line i.e. \u00E9checs
    FileInputStream inputStream = null;
    Scanner sc = null;
    try {
        inputStream = new FileInputStream(path);
        sc = new Scanner(inputStream);
        while (sc.hasNextLine()) {
            byte[] bytesOfString = sc.nextLine().getBytes();   
            String encoded_String = new String(bytesOfString, StandardCharsets.ISO_8859_1);
            System.out.println(encoded_String); 
        
        }

        if (sc.ioException() != null) {
            throw sc.ioException();
        }
    } finally {
        if (inputStream != null) {
            inputStream.close();
        }
        if (sc != null) {
            sc.close();
        }
    }

Output:

\u00E9checs

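Note: this is test code, which is why the file contains a single line. I need to apply the same process to a large file, which is why I use the Scanner class to keep memory usage low.

Can anyone guide me on how to get the same result for text read from the file as I get when the Unicode is declared directly in a Java String variable?

Thanks in advance; I look forward to your reply.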

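【Comments】:

Does the line in the file contain the text échecs or \u00E9checs? Is the first character é or a backslash? If it contains échecs in ISO-8859-1, I cannot reproduce your problem.
\u00E9 is the notation Java uses for a Unicode character. Step through your first example with a debugger and look at what myString.getBytes() returns. Those are the bytes you need to put into a UTF-8 encoded file to get échecs.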
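
To illustrate the distinction the comments draw: in Java source the compiler replaces \u00E9 with é before the program runs, whereas the same characters read from a file remain a backslash followed by u00E9. The following is only a minimal sketch, assuming the file really does contain the backslash form. The class name EscapeDemo and the helper unescape are hypothetical, and the helper handles only escapes with exactly four hex digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeDemo {
    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern UNICODE_ESCAPE =
            Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Replaces each literal escape sequence in the text with the character it names.
    static String unescape(String text) {
        Matcher m = UNICODE_ESCAPE.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());                    // copy text before the escape
            out.append((char) Integer.parseInt(m.group(1), 16));  // append the decoded character
            last = m.end();
        }
        out.append(text, last, text.length());                    // copy the remainder
        return out.toString();
    }

    public static void main(String[] args) {
        String compiled = "\u00E9checs";   // escape resolved by the compiler: 6 chars
        String fromFile = "\\u00E9checs";  // what nextLine() returns for the backslash form: 11 chars
        System.out.println(compiled.length());                    // 6
        System.out.println(fromFile.length());                    // 11
        System.out.println(unescape(fromFile).equals(compiled));  // true
    }
}
```

If the file instead contains the single character é, no unescaping is needed and only the charset used for reading matters.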

Tags: java unicode iso-8859-1

【Solution 1】:

This is the problem:

      byte[] bytesOfString = sc.nextLine().getBytes();
      String encoded_String = new String(bytesOfString, StandardCharsets.ISO_8859_1);

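So:

The file contains some 8859-1 bytes
The Scanner, created without a charset, reads them assuming the platform default charset and turns them into Unicode text
You then convert that Unicode text into bytes with getBytes(), which again uses the default charset (UTF-8 on many systems)
Finally you convert those bytes back to Unicode while pretending they are 8859-1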
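
The steps above can be reproduced without a file. This is a rough sketch, assuming the platform default charset is UTF-8 (it is passed in explicitly here so the result does not depend on the machine); MisdecodeDemo and its method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MisdecodeDemo {
    // Steps 2 to 4 of the list, with the "default" charset made explicit.
    static String questionPipeline(byte[] fileBytes, Charset defaultCharset) {
        String line = new String(fileBytes, defaultCharset);        // step 2: what the Scanner does
        byte[] reencoded = line.getBytes(defaultCharset);            // step 3: getBytes()
        return new String(reencoded, StandardCharsets.ISO_8859_1);   // step 4: decode as 8859-1
    }

    // Decoding the file bytes once, with the charset they were written in.
    static String decodeDirectly(byte[] fileBytes) {
        return new String(fileBytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String expected = "\u00E9checs";
        byte[] fileBytes = expected.getBytes(StandardCharsets.ISO_8859_1); // step 1: E9 63 68 65 63 73

        // The lone byte E9 is not valid UTF-8, so the round trip cannot restore the text.
        System.out.println(questionPipeline(fileBytes, StandardCharsets.UTF_8).equals(expected)); // false
        System.out.println(decodeDirectly(fileBytes).equals(expected));                            // true
    }
}
```

When the default charset happens to be a single-byte one such as ISO-8859-1, the same pipeline round-trips cleanly, which may be why the problem does not show up on every machine.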

You should use a Scanner that expects 8859-1 input:

  new Scanner(inputStream, StandardCharsets.ISO_8859_1);

Then nextLine will do the right conversion; no further juggling in code is needed.
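
For completeness, here is how that might look when reading line by line, as the question requires for large files. It is only a sketch: Latin1ScannerDemo and forEachLine are hypothetical names, the temporary file stands in for st.txt, and the Scanner(InputStream, Charset) constructor requires Java 10 or later (older versions take the charset name as a String, e.g. "ISO-8859-1").

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;
import java.util.function.Consumer;

class Latin1ScannerDemo {
    // Streams the input one line at a time, decoding the bytes as ISO-8859-1.
    // The caller owns the stream and is responsible for closing it.
    static void forEachLine(InputStream in, Consumer<String> action) {
        Scanner sc = new Scanner(in, StandardCharsets.ISO_8859_1);
        while (sc.hasNextLine()) {
            action.accept(sc.nextLine()); // already correctly decoded, no getBytes() needed
        }
        if (sc.ioException() != null) {
            throw new UncheckedIOException(sc.ioException());
        }
    }

    public static void main(String[] args) throws IOException {
        // Temporary stand-in for st.txt, holding the 8859-1 bytes of the sample word.
        Path path = Files.createTempFile("st", ".txt");
        Files.write(path, new byte[] {(byte) 0xE9, 'c', 'h', 'e', 'c', 's'});
        try (InputStream in = Files.newInputStream(path)) {
            forEachLine(in, line -> System.out.println(line.equals("\u00E9checs"))); // true
        } finally {
            Files.delete(path);
        }
    }
}
```

Whether é then shows up correctly on the console is a separate matter, since it depends on the encoding System.out uses.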
