UTF-16LE encoding and xerces2 Java

【Question title】: UTF-16LE encoding and xerces2 Java
【Posted】: 2020-01-12 04:14:49
【Question description】:

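I went through a few posts, such as "FileReader reads the file as a character stream" and "can be treated as whitespace if the document is handed as a stream of characters", where the answers say that the input source was actually a character stream rather than a byte stream.

However, the solution suggested in the first post does not seem to work for UTF-16LE. Even though I use this code: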

    try (final InputStream is = Files.newInputStream(filename.toPath(), StandardOpenOption.READ)) {
      DOMParser parser = new org.apache.xerces.parsers.DOMParser();
      parser.parse(new InputSource(is));
      return parser.getDocument();
    } catch (final SAXParseException saxEx) {
      LOG.debug("Unable to open [{}] as InputSource.", absolutePath, saxEx);
    }

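I still get org.xml.sax.SAXParseException: Content is not allowed in prolog.

I looked at Files.newInputStream, and it does use a ChannelInputStream, which hands over bytes, not characters. I also tried setting the encoding on the InputSource object, with no luck. I also checked that there are no extra characters (apart from the BOM) before the <?xml part.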
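
The check for stray bytes before <?xml can be sketched as a small byte-level helper. This is a minimal sketch, not code from the thread: the class name BomSniffer and its methods are hypothetical, and it only recognizes the UTF-8 and UTF-16 byte order marks (UTF-32 is ignored).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper (not from the thread): inspects the leading bytes of a file.
final class BomSniffer {

    /** Returns the encoding implied by a leading BOM, or null if none is recognized. */
    static String detect(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        return null;
    }

    /** Formats at most limit leading bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes, int limit) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < Math.min(limit, bytes.length); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: BomSniffer <file>");
            return;
        }
        byte[] all = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("first bytes: " + hex(all, 16));
        System.out.println("BOM implies: " + detect(all));
    }
}
```

It is worth running this against the exact file that is handed to the parser at runtime, which is not necessarily the file opened in the editor.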

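I should also mention that this code works for UTF-8.

// Edit: I also tried DocumentBuilderFactory.newInstance().newDocumentBuilder().parse() and XmlInputStreamReader.next(), with the same result.

// Edit 2: Tried using a buffered reader. Same result: Unexpected character '뿯' (code 49135 / 0xbfef) in prolog; expected '<'

Thanks in advance.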

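【Question discussion】:

What if you strip the BOM at the start (skip the first two bytes)? ... { is.read(); is.read();
Then I would no longer be able to read UTF-8 without a BOM, or ISO-8859-1. :(
The encoding is either given in <?xml encoding=...?> or defaults to UTF-8. I have heard that in rare cases a BOM causes problems like this, but I don't remember the specifics.
I can't even get that far. I would like to read the declaration and attribute you are referring to, but see my second edit: parsing stops before that.
I double-checked. The file starts with the BOM 0xFF 0xFE. Maybe I need to wrap it in a BOMRemovingInputStream...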
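
The BOM-removing wrapper mentioned in the last comment could look roughly like the sketch below. It is only an illustration: BomSkipper and skipUtf16LeBom are hypothetical names, not an existing library class. Unlike an unconditional is.read(); is.read();, it consumes the first two bytes only when they are FF FE, so BOM-less UTF-8 or ISO-8859-1 input passes through unchanged.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper (not from the thread): drops a leading UTF-16LE BOM if present.
final class BomSkipper {

    /** Returns a stream positioned after a leading FF FE; any other input is left untouched. */
    static InputStream skipUtf16LeBom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 2);
        int b0 = pin.read();
        if (b0 == -1) {
            return pin; // empty stream
        }
        int b1 = pin.read();
        if (b0 == 0xFF && b1 == 0xFE) {
            return pin; // BOM consumed
        }
        // Not a BOM: push the bytes back in reverse order so they are read again in order.
        if (b1 != -1) {
            pin.unread(b1);
        }
        pin.unread(b0);
        return pin;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00};
        InputStream in = skipUtf16LeBom(new ByteArrayInputStream(withBom));
        System.out.printf("next byte: 0x%02X%n", in.read());
    }
}
```

Once the BOM is gone the parser no longer has it as an encoding hint, so this only makes sense if the encoding is known or set some other way.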

Tags: java xml utf-16 xerces byte-order-mark

【Solution 1】:

To gather some more information:

// Assumes java.util.regex.Matcher and java.util.regex.Pattern are imported.
byte[] bytes = Files.readAllBytes(filename.toPath());
String xml = new String(bytes, StandardCharsets.UTF_16LE);
if (xml.startsWith("\uFEFF")) {
    LOG.info("Has BOM and is evidently UTF_16LE");
    xml = xml.substring(1);
}
if (!xml.contains("<?xml")) {
    LOG.info("Has no XML declaration");
}
// Extract the declared encoding; XML defaults to UTF-8 when none is declared.
Matcher m = Pattern.compile("<\\?xml[^>]*encoding=[\"']([^\"']+)[\"']").matcher(xml);
String declaredEncoding = m.find() ? m.group(1) : "UTF-8";
LOG.info("Declared as " + declaredEncoding);

// Re-encode the text in the declared encoding so that bytes and declaration agree.
try (final InputStream is = new ByteArrayInputStream(xml.getBytes(declaredEncoding))) {
  DOMParser parser = new org.apache.xerces.parsers.DOMParser();
  parser.parse(new InputSource(is));
  return parser.getDocument();
} catch (final SAXParseException saxEx) {
  LOG.debug("Unable to open [{}] as InputSource.", absolutePath, saxEx);
}
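
To separate "the parser cannot handle UTF-16LE" from "this particular file is damaged", one option is to parse a known-good UTF-16LE document built in memory. The sketch below is an assumption-laden stand-in: it uses the JDK's built-in DocumentBuilderFactory instead of the standalone org.apache.xerces.parsers.DOMParser so that it has no dependencies, and the class name Utf16LeParseCheck is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sanity check (not from the thread): parses a BOM-prefixed UTF-16LE byte stream.
final class Utf16LeParseCheck {

    /** Encodes xml as UTF-16LE and prepends the FF FE byte order mark. */
    static byte[] utf16LeWithBom(String xml) {
        byte[] body = xml.getBytes(StandardCharsets.UTF_16LE);
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFF;
        out[1] = (byte) 0xFE;
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }

    /** Hands raw bytes to the parser and leaves encoding detection to it. */
    static Document parse(byte[] bytes) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new ByteArrayInputStream(bytes)));
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = utf16LeWithBom(
                "<?xml version=\"1.0\" encoding=\"UTF-16\"?><root>ok</root>");
        Document doc = parse(bytes);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

If a clean in-memory document parses while the file on disk does not, the bytes of the file itself are the next thing to inspect.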

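【Discussion】:

I looked at the file with xxd, and I know that it starts with \uFFFE.
The bytes FF FE in UTF-16LE are actually the char \uFEFF, a.k.a. the BOM (a somewhat odd Unicode number).
Hey, by the way: since the BOM in UTF-8 bytes is EF BB BF, that would more or less explain the 0xbfef in your edit 2.
Both IntelliJ and file report this as a UTF-16LE file. UTF16LE starts with \uFFFE. \uFFFE is not odd; according to Wikipedia it is a "noncharacter". The second edit accidentally used a UTF-8 interpretation; when I switched back to UTF16LE, nothing changed :(
(Deep breath)... MAVEN resource filtering. I had been looking at the source file the whole time, but when looking at the target file you can see some extra bytes at the beginning of the file. Sad to see this :(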
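
One way a text-based copy step can produce extra leading bytes of this kind is sketched below. The thread does not say which encoding the Maven filtering used or which bytes it added, so the UTF-8 round trip here is purely an assumption, and FilterCorruptionDemo is a hypothetical name. The sketch decodes a UTF-16LE file start (FF FE 3C 00) as UTF-8, writes it back as UTF-8, and then reads the first two resulting bytes as one little-endian UTF-16 code unit.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration (not from the thread): a UTF-8 round trip applied to UTF-16LE bytes.
final class FilterCorruptionDemo {

    /** Decodes bytes as UTF-8 and re-encodes them as UTF-8, as a text-based copy step might. */
    static byte[] roundTripAsUtf8(byte[] original) {
        return new String(original, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    /** Reads the first two bytes as one little-endian UTF-16 code unit. */
    static int firstUtf16LeUnit(byte[] bytes) {
        return (bytes[0] & 0xFF) | ((bytes[1] & 0xFF) << 8);
    }

    public static void main(String[] args) {
        byte[] original = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00}; // UTF-16LE BOM followed by '<'
        byte[] damaged = roundTripAsUtf8(original);
        System.out.printf("length %d -> %d, first unit 0x%04x%n",
                original.length, damaged.length, firstUtf16LeUnit(damaged));
    }
}
```

Neither FF nor FE is valid UTF-8, so each becomes U+FFFD, which is written back as EF BF BD. The four bytes grow to eight, and the first code unit a UTF-16LE reader then sees is 0xbfef, the same value reported in the second edit. Under that assumption, excluding the XML files from filtering would leave their bytes unchanged.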