Error parsing RSS -> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Premature end of file

【Question title】: Error Parsing RSS -> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Premature end of file
【Posted】: 2018-02-04 04:22:47
【Question】:

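I have a method that parses RSS from different URLs, and it works well:

For example: https://www.clarin.com/rss/lo-ultimo/

But for one URL (https://www.cio.com/category/mobile/index.rss), and for every RSS feed on that site, the console shows the following error when I run the code, and the parser does not work:

org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Premature end of file.

I am parsing the RSS feed with this method (part of the code):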

        try {
            DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();

            URL url = new URL("https://www.cio.com/category/mobile/index.rss");
            URLConnection urlConnection = url.openConnection();
            InputStream inputStream = urlConnection.getInputStream();

            Document doc = dBuilder.parse(inputStream);

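The error occurs on the last line -> Document doc = dBuilder.parse(inputStream);

In that code I parse the RSS from the URL. The strange thing is that when I parse the RSS directly from a file (index.rss), I get no error and parsing works fine. I do that with: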

    File fXmlFile = new File("index.rss");
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();

    // Parse from the local file instead of:
    // Document doc = dBuilder.parse(inputStream);
    Document doc = dBuilder.parse(fXmlFile);
    doc.getDocumentElement().normalize();

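Notes:

This is a Maven webapp project.

It is deployed on a Tomcat 9.0 server.

The method runs when I press a button on the web home page.

I mention this because when I tried it in a plain Java project, the parser also worked fine with the inputStream.

I would really appreciate any help with this problem. Thanks!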

【Discussion】:

Tags: java xml parsing inputstream domparser

【Solution 1】:

I ran the following code and it worked fine, with no errors.

     public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException {

        DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();

        URL url = new URL("https://www.cio.com/category/mobile/index.rss");
        URLConnection urlConnection = url.openConnection();
        InputStream inputStream = urlConnection.getInputStream();

        Document doc = dBuilder.parse(inputStream);
        Element root = doc.getDocumentElement();
        NodeList children = root.getChildNodes();

        for (int i = 0; i < children.getLength(); i++) {
             System.out.println(children.item(i));
        }

        inputStream.close();

     }

I then added the following and tried parsing an empty file:

    File fXmlFile = new File("EmptyFile.xml");
    inputStream = new FileInputStream(fXmlFile);
    doc = dBuilder.parse(inputStream);
    System.out.println(doc.getDocumentElement());

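When the file was empty (or contained only the XML processing instruction), I got the same error you did. Once I added a root element, the error went away. To me this seems to show that the error occurs when the inputStream (or whatever it is streaming from) is effectively empty. That theory also seems to be supported by: org.xml.sax.SAXParseException: Premature end of file for *VALID* XML. So if you are still getting this error, I suggest putting a breakpoint on URL url... and stepping through from there to check whether the connection is being made correctly. Hope that helps.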
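If stepping through with a debugger is awkward inside Tomcat, one way to test the "empty stream" theory is to read the response into memory first and look at it before handing it to the parser. The following is only a rough sketch: the class name FeedFetchCheck and its methods are made up for illustration, and it assumes the feed is served over http(s) so the connection can be cast to HttpURLConnection. It does not explain why the body would be empty, it only makes that case visible.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper for diagnosis; not part of the original question or answer.
final class FeedFetchCheck {

    // Reads the whole stream into memory so it can be inspected before parsing.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    // Rejects an empty body with a descriptive message instead of letting the
    // parser report "Premature end of file" at line 1, column 1.
    static Document parseBody(int status, byte[] body)
            throws IOException, ParserConfigurationException, SAXException {
        if (body.length == 0) {
            throw new IOException("Empty response body (HTTP status " + status + ")");
        }
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(body));
    }

    // Assumption: 'address' is an http(s) URL, so the cast below is safe.
    static Document fetch(String address)
            throws IOException, ParserConfigurationException, SAXException {
        HttpURLConnection connection =
                (HttpURLConnection) new URL(address).openConnection();
        try {
            int status = connection.getResponseCode();
            // Log what the server actually answered before touching the body.
            System.out.println("HTTP status: " + status
                    + ", Content-Type: " + connection.getContentType());
            try (InputStream in = connection.getInputStream()) {
                return parseBody(status, readAll(in));
            }
        } finally {
            connection.disconnect();
        }
    }
}
```

Calling FeedFetchCheck.fetch("https://www.cio.com/category/mobile/index.rss") from the webapp and from the plain Java project, then comparing the logged status and content type, should show whether the two environments really receive the same response.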

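【Comments】:

I read your answer and then created a simple Java project to test my code in a static main method, and the code worked fine. But in my original question I forgot to mention that this is a "Maven webapp project" deployed on Tomcat Server 9.0, and the method runs when I click a button on the web page. I don't know whether that is related to the problem, but it does not work inside the web project.
OK. Well, that's clearly where the problem lies... It's early evening here, beer time, and Monday is a public holiday. But if I get the chance tomorrow, I'll take a look. If not, I'll look as soon as I can :-) Good luck in the meantime :-)
Haha, OK Mike. In the meantime I'll try to find a solution, and if I do I'll post it. Thanks a lot!
The first thing I would do is prove that the problem really lies with the input stream and not somewhere else. There is a class called PushbackInputStream (see tutorials.jenkov.com/java-io/pushbackinputstream.html). You could use it to check whether there is anything in the stream to read, and only access the stream if there is. That should let you manage/catch the error. If our theory is right and the inputstream is empty, the next step is to find a way to handle that case. That will get your code working again and buy you time to figure out why the stream sometimes fails :-)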
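The PushbackInputStream idea from the last comment could look roughly like this. It is a minimal sketch, not a tested fix for the Tomcat case: the class name NonEmptyXmlParser and the method parseIfPresent are invented for illustration, and returning null for an empty stream is just one possible convention.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper; the name and the null-on-empty convention are assumptions.
final class NonEmptyXmlParser {

    // Returns the parsed document, or null when the stream contains no bytes at all.
    static Document parseIfPresent(InputStream in)
            throws IOException, ParserConfigurationException, SAXException {
        // A one-byte pushback buffer is enough to peek at the first byte.
        PushbackInputStream peekable = new PushbackInputStream(in, 1);
        int first = peekable.read();
        if (first == -1) {
            // End of stream on the very first read: nothing to parse.
            return null;
        }
        // Put the byte back so the parser sees the stream from the start.
        peekable.unread(first);
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(peekable);
    }
}
```

In the question's code this would replace dBuilder.parse(inputStream) with NonEmptyXmlParser.parseIfPresent(inputStream), followed by a null check. Note that it only detects a stream with zero bytes; a body that contains only whitespace or an error page would still reach the parser and fail there.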