Parsing error while parsing a document using an org.w3c.dom.Document object: (Unicode: 0x1a) was found in the element content of the document

【Question title】: Parsing error while parsing a document using an org.w3c.dom.Document object: (Unicode: 0x1a) was found in the element content of the document
【Posted】: 2014-05-30 19:48:49
【Question】:

I get this error message: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 14515; An invalid XML character (Unicode: 0x1a) was found in the element content of the document.

The XML file content that triggers the error:

 <Product>
          <Description>672577000 3M 4540 DISPOSABLE COVERALL → XL</Description>
 </Product>

I get this error when parsing the document with an org.w3c.dom.Document object; it is caused by the → in the input file. How can I resolve this?

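【Comments】:

Yes, I start with that, but I still get this error.
"An invalid XML character (Unicode: 0x1a)" means the document contains a character that XML does not allow. Check that there are no double-byte characters or anything similar.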

Tags: java xml

【Solution 1】:

I resolved this with the code below:
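import java.util.regex.Matcher;
import java.util.regex.Pattern;

String removedUnicodeChar  = "DISPOSABLE COVERALL → XXL</Description></Order> ↔ ↕ ↑ ↓ → ABC";
// Match control characters and U+FFFD. Inside [...] a '|' is a literal pipe, so it is omitted here.
// Note: \p{Cntrl} also matches tab, LF and CR, which XML 1.0 allows.
Pattern pattern = Pattern.compile("[\\p{Cntrl}\\uFFFD]");
Matcher m = pattern.matcher(removedUnicodeChar);
if (m.find()) {
    System.out.println("Control Characters found");
    removedUnicodeChar = m.replaceAll("");
}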

【Comments】:

【Solution 2】:

Not every character is allowed in an XML file. This page lists which characters are allowed, discouraged, or disallowed:

http://en.wikipedia.org/wiki/Valid_characters_in_XML

Your character (→) is not allowed.
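
To make the rule concrete, here is a minimal sketch of a check against the XML 1.0 Char production (tab, LF, CR, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF). The class and method names are made up for illustration. The arrow shown in the question presumably stands in for the raw 0x1a character reported by the parser; a genuine U+2192 arrow would pass this check.

```java
final class XmlCharCheck {
    private XmlCharCheck() {}

    /** Returns true if the code point matches the XML 1.0 Char production. */
    static boolean isAllowedInXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the index of the first disallowed character, or -1 if there is none. */
    static int firstInvalidIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isAllowedInXml10(cp)) {
                return i;
            }
            i += Character.charCount(cp); // step over surrogate pairs as one code point
        }
        return -1;
    }
}
```

Calling firstInvalidIndex on the raw file content gives a position that can be compared with the columnNumber in the SAXParseException.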

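【Comments】:

So how can I remove this type of character from the file programmatically?
You should first consider what this character means. Is it something you came up with yourself? Then it is easy to come up with something XML accepts. Is it something outside your control (perhaps generated by another system)? Then it is not so easy. Either way, you should find a character or character sequence that lies outside the data domain and use it as a plain substitute for your evil character (→).
As for how to remove it: it is just a simple preprocessing pass over the file that removes or replaces every instance of the character. You can look at a simple tutorial on how to do that.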
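
The preprocessing pass described above could look roughly like the sketch below. It assumes the XML has already been read into a String with the correct charset; XmlSanitizer, stripInvalidXml10 and parseSanitized are hypothetical names, not part of any library. Only the standard JAXP classes (DocumentBuilderFactory, DocumentBuilder, InputSource) are real APIs.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class XmlSanitizer {
    private XmlSanitizer() {}

    /** Replaces every character outside the XML 1.0 Char ranges with the given replacement. */
    static String stripInvalidXml10(String text, String replacement) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean allowed = cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (allowed) {
                out.appendCodePoint(cp);
            } else {
                out.append(replacement);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    /** Sanitizes the raw XML text (dropping invalid characters) and parses it into a DOM Document. */
    static Document parseSanitized(String rawXml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            String clean = stripInvalidXml10(rawXml, "");
            return builder.parse(new InputSource(new StringReader(clean)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse sanitized XML", e);
        }
    }
}
```

Passing a non-empty replacement (for example "?") instead of "" follows the suggestion to substitute a stand-in from outside the data domain rather than silently dropping the character.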