Answers: Character encoding converting windows-1252 input file to utf-8 output file

【Question title】: Character encoding converting windows-1252 input file to utf-8 output file
【Posted】: 2019-12-09 20:35:30
【Question description】:

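I am dealing with HTML documents that were converted (programmatically) from Word to HTML using Word's save options. The resulting HTML text file is windows-1252 encoded. (Yes, I have read a lot about bytes and Unicode code points; I know that code points above 128 can take 2, 3, up to 6 bytes, etc.) I added quite a few non-printable characters to my Word document template and wrote code to evaluate each CHARACTER (by its decimal equivalent). For sure, I know I don't want to allow decimal #160, which is how MS Word translates a non-breaking space into HTML. I expect that in the near future people will put more of these "illegal" constructs into the templates, and I will need to trap them and deal with them, because they cause funny rendering in the browser. This is a dump to the Eclipse console, where I put all document lines into a map: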

 DataObj.paragraphMap  : {1=, 2=Introduction and Learning Objective, 3=? ©®™§¶…‘’“”????, 4=, 5=, 6=, 
   7=This is paragraph 1 no formula, 8=,

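I replace decimal #160 with #32 (a regular space) and then write the characters to a new file using UTF-8 encoding. Is my thinking sound? Can I use this technique to replace a particular character, or to decide not to write it back, based on its decimal equivalent? I wanted to avoid Strings, since I may process multiple documents and don't want to run out of memory, so I am doing it in files: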

 public static void convert1252toUFT8(String fileName) throws IOException {   
    File f = new File(fileName);
    Reader r = new BufferedReader(new InputStreamReader(new FileInputStream(f), "windows-1252"));
    OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(fileName + "x"), StandardCharsets.UTF_8); 
    List<Character> charsList = new ArrayList<>(); 
    int count = 0;

    try {
        int intch;
        while ((intch = r.read()) != -1) {   //reads a single character and returns integer equivalent
            int ch = (char)intch;
            //System.out.println("intch=" + intch + " ch=" + ch + " isValidCodePoint()=" + Character.isValidCodePoint(ch) 
            //+ " isDefined()=" + Character.isDefined(ch) + " charCount()=" + Character.charCount(ch) + " char=" 
            //+ (char)intch);

            if (Character.isValidCodePoint(ch)) {
                if (intch == 160 ) {
                    intch = 32;
                }
                charsList.add((char)intch);
                count++;
            } else {
                System.out.println("unexpected character found but not dealt with.");
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        System.out.println("Chars read in=" + count + " Chars read out=" + charsList.size());
        for(Character item : charsList) {
            writer.write((char)item);
        }
        writer.close();
        r.close();
        charsList = null;

        //check that #160 was replaced
        //File f2 = new File(fileName + "x");
        //Reader r2 = new BufferedReader(new InputStreamReader(new FileInputStream(f2), "UTF-8")); 
        //int intch2;
        //while ((intch2 = r2.read()) != -1) { //reads a single character and returns integer equivalent 
        //int ch2 = (char)intch2; 
        //System.out.println("intch2=" + intch2 + " ch2=" + ch2 + " isValidCodePoint()=" +
        //Character.isValidCodePoint(ch2) + " char=" + (char)intch2); 
        //}

    }   
}

【Question comments】:

Tags: java file ms-word character-encoding character

【Solution 1】:

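First of all, there is nothing wrong with an HTML page being in an encoding other than UTF-8. In fact, the document very likely contains a line like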

<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

in its header, which renders the document invalid once you change the file's character encoding without adapting this header line.
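To illustrate what adapting the header line could look like, here is a minimal sketch. The class name `MetaCharsetFix`, the method `declareUtf8`, and the regular expression are hypothetical and not part of the original answer; the sketch also assumes the declaration fits on a single line and that a regex is good enough for Word's generated markup, which a real HTML parser would handle more robustly.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetFix {
    // Matches the charset name inside the first <meta ...> tag that declares one.
    // Group 1 captures everything up to (and including) "charset=" and an optional quote.
    private static final Pattern META_CHARSET = Pattern.compile(
            "(<meta\\b[^>]*?charset\\s*=\\s*[\"']?)[\\w.:-]+",
            Pattern.CASE_INSENSITIVE);

    // Returns the markup with the declared charset changed to utf-8;
    // markup without a matching <meta> tag is returned unchanged.
    static String declareUtf8(String html) {
        Matcher m = META_CHARSET.matcher(html);
        return m.replaceFirst("$1utf-8");
    }

    public static void main(String[] args) {
        String head = "<meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=windows-1252\">";
        System.out.println(declareUtf8(head));
    }
}
```

Such a replacement would have to run on the decoded text before it is written back out, so that the declaration and the actual output encoding agree.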

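Further, there is no reason to replace code point #160 in a document, as it is Unicode's standard non breaking space character. That is why &#160; is a valid alternative to &nbsp;, and if the document's charset supports this code point, using it directly is valid as well.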
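A short check of that claim, as a sketch: the class `NbspDemo` and its helper methods are made up for illustration. It decodes the single windows-1252 byte 0xA0 and prints the UTF-8 bytes of the resulting code point.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NbspDemo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decodes one windows-1252 byte and returns the resulting UTF-16 code unit.
    static int decodeSingleByte(int b) {
        return new String(new byte[] { (byte) b }, CP1252).charAt(0);
    }

    // Returns the UTF-8 encoding of a code point as space-separated hex bytes.
    static String utf8Hex(int codePoint) {
        StringBuilder sb = new StringBuilder();
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
        for (byte x : bytes) {
            sb.append(String.format("%02X ", x & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        int decoded = decodeSingleByte(0xA0);
        System.out.println("decoded code point: " + decoded);    // 160, i.e. U+00A0
        System.out.println("UTF-8 bytes: " + utf8Hex(decoded));  // C2 A0
    }
}
```

The byte value and the code point happen to coincide here; in UTF-8 the same character simply takes two bytes, so it survives the conversion intact.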

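Your attempt to avoid strings is a typical case of premature optimization. The lack of actual measurement leads to solutions like ArrayList<Character>, which consumes twice¹ the memory of a String.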

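If you want to copy or convert a file, you should not keep the entire file in memory. Just write each piece of data back before reading the next one, but for the sake of efficiency use a buffer rather than reading and writing a single character at a time. Further, you should use the try-with-resources statement to manage the input and output resources.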

import java.io.*;
import java.nio.charset.Charset;
import java.nio.file.*;

public static void convert1252toUFT8(String fileName) throws IOException {
    Path in = Paths.get(fileName), out = Paths.get(fileName+"x");
    int readCount = 0, writeCount = 0;
    try(BufferedReader br = Files.newBufferedReader(in, Charset.forName("windows-1252"));
        BufferedWriter bw = Files.newBufferedWriter(out, // default UTF-8
            StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {

        char[] buffer = new char[1000];
        do {
            int count = br.read(buffer);
            if(count < 0) break;
            readCount += count;

            // if you really want to replace non breaking spaces:
            for(int ix = 0; ix < count; ix++) {
                if(buffer[ix] == 160) buffer[ix] = ' ';
            }

            bw.write(buffer, 0, count);
            writeCount += count;
        } while(true);
    } finally {
        System.out.println("Chars read in="+readCount+" Chars written out="+writeCount);
    }
}

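There is no point in testing the validity of the characters, as the decoder does not produce invalid code points. The decoder behind the readers used here is configured by default to throw an exception on invalid bytes. The other options are to replace invalid input with a replacement character (like �) or to skip it, but the decoder will never produce invalid characters.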
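The three behaviours correspond to the `CodingErrorAction` constants `REPORT`, `REPLACE` and `IGNORE`. The following sketch compares them; the class `DecoderActionsDemo` and the method `tryDecode` are hypothetical names. Note the assumption: it decodes UTF-8 rather than windows-1252, purely because an invalid UTF-8 byte (0xFF) is easy to construct with certainty.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class DecoderActionsDemo {
    // Decodes the input with the given error action. On failure, returns the
    // exception's simple class name in angle brackets instead of throwing.
    static String tryDecode(byte[] input, CodingErrorAction action) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        try {
            return dec.decode(ByteBuffer.wrap(input)).toString();
        } catch (CharacterCodingException e) {
            return "<" + e.getClass().getSimpleName() + ">";
        }
    }

    public static void main(String[] args) {
        byte[] bad = { 'a', (byte) 0xFF, 'b' };  // 0xFF never occurs in valid UTF-8
        System.out.println(tryDecode(bad, CodingErrorAction.REPORT));   // decoding fails
        System.out.println(tryDecode(bad, CodingErrorAction.REPLACE));  // a, U+FFFD, b
        System.out.println(tryDecode(bad, CodingErrorAction.IGNORE));   // ab
    }
}
```

In none of the three cases does the decoded text contain an invalid code point, which is why a per-character `Character.isValidCodePoint` check adds nothing.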

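The amount of memory needed during the operation is determined by the buffer size, although the code above uses a reader and a writer that each have a buffer of their own. Still, the total amount of memory used for the operation is independent of the file size.

A solution that uses only the buffer you specified explicitly would look like this: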

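import java.io.*;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import static java.nio.file.StandardOpenOption.*;

// Note: Channels.newWriter(WritableByteChannel, Charset) requires Java 10 or later.
public static void convert1252toUFT8(String fileName) throws IOException {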
    Path in = Paths.get(fileName), out = Paths.get(fileName+"x");
    int readCount = 0, writeCount = 0;
    try(Reader br = Channels.newReader(Files.newByteChannel(in), "windows-1252");
        Writer bw = Channels.newWriter(
            Files.newByteChannel(out, WRITE, CREATE, TRUNCATE_EXISTING),
            StandardCharsets.UTF_8)) {

        char[] buffer = new char[1000];
        do {
            int count = br.read(buffer);
            if(count < 0) break;
            readCount += count;

            // if you really want to replace non breaking spaces:
            for(int ix = 0; ix < count; ix++) {
                if(buffer[ix] == 160) buffer[ix] = ' ';
            }

            bw.write(buffer, 0, count);
            writeCount += count;
        } while(true);
    } finally {
        System.out.println("Chars read in="+readCount+" Chars written out="+writeCount);
    }
}

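This is also the starting point for implementing a different handling of invalid input. For example, to just remove all invalid input bytes, you only have to change the beginning of the method to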

public static void convert1252toUFT8(String fileName) throws IOException {
    Path in = Paths.get(fileName), out = Paths.get(fileName+"x");
    int readCount = 0, writeCount = 0;
    CharsetDecoder dec = Charset.forName("windows-1252")
            .newDecoder().onUnmappableCharacter(CodingErrorAction.IGNORE);
    try(Reader br = Channels.newReader(Files.newByteChannel(in), dec, -1);
…

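Note that for a successful conversion, the number of characters read and the number written are the same, but only for the input encoding Windows-1252 is the number of characters identical to the number of bytes, i.e. the file size (when the entire file is valid).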
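As a small illustration of that difference, the sketch below encodes one short string both ways. The class `CharVsByteCount`, the method `byteCount`, and the sample text are assumptions chosen for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharVsByteCount {
    // Returns how many bytes the text occupies in the given encoding.
    static int byteCount(String text, Charset cs) {
        return text.getBytes(cs).length;
    }

    public static void main(String[] args) {
        // "café €5": seven chars, two of them outside ASCII (é and the euro sign)
        String text = "caf\u00E9 \u20AC5";
        System.out.println("chars:              " + text.length());                                   // 7
        System.out.println("windows-1252 bytes: " + byteCount(text, Charset.forName("windows-1252"))); // 7
        System.out.println("UTF-8 bytes:        " + byteCount(text, StandardCharsets.UTF_8));          // 10
    }
}
```

So the "x" output file can legitimately be larger than the input even though the character counts printed by the conversion method match.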

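This conversion code example is included only for completeness. As said at the beginning, converting an HTML page without adapting the header might invalidate the file, and the conversion isn't even necessary.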

¹ depending on the implementation, even four times

【Comments】:

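Thanks - there is a lot of "noise" about this topic. I tried your improved solution and it works great! I was hoping that nulling the Character ArrayList would make it available for gc, as opposed to taking up more immutable String memory. Your advice on buffers etc. was most helpful - and yes, I did change the charset to utf-8 in the "new" file.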
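Setting the variable to null at the end of an operation is unnecessary, as the list is eligible for gc anyway. This also applies to String objects; being immutable does not prevent gc. But during the operation, the memory consumption of ArrayList<Character> is much higher, as you have a list of references to Character objects rather than a wrapper around a char[] array (and it is even worse if the JRE doesn't reuse most Character instances). As said in the answer, not having the entire file in memory at once also helps to reduce the memory consumed.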