Parsing almost well-formed XML fragments: how to skip over multiple XML headers

【Question title】: parsing almost well formed XML fragments: how to skip over multiple XML headers
【Posted】: 2012-08-09 15:09:12
【Question】:

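I need to write a tool that handles the following XML fragment, which is not well formed because it contains XML declarations in the middle of the stream.

The company has been using files like this for a long time, so changing the format is not an option.

There is no source code available for the existing parsing, and the platform of choice for new tooling is .NET 4 or newer, preferably with C#.

This is what the fragments look like: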

<Header>
  <Version>1</Version>
</Header>
<Entry><?xml version="1.0"?><Detail>...snip...</Detail></Entry>
<Entry><?xml version="1.0"?><Detail>...snip...</Detail></Entry>
<Entry><?xml version="1.0"?><Detail>...snip...</Detail></Entry>
<Entry><?xml version="1.0"?><Detail>...snip...</Detail></Entry>

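Using XmlReader with XmlReaderSettings.ConformanceLevel set to ConformanceLevel.Fragment, I can read the complete <Header> element just fine. Even the start of the <Entry> element is fine, but while reading the <Detail> information the XmlReader throws an XmlException, because it runs into the <?xml...?> XML declaration, which it does not expect at that position.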
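For anyone who wants to reproduce the behaviour described above, here is a minimal sketch. The class name FragmentRepro and the helper TryParse are made up for illustration, and the sample string is a shortened stand-in for the real files.

```csharp
using System;
using System.IO;
using System.Xml;

static class FragmentRepro
{
    // Reads the whole input in fragment mode.
    // Returns the XmlException message, or null if the input parsed cleanly.
    public static string TryParse(string xml)
    {
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        try
        {
            using (var reader = XmlReader.Create(new StringReader(xml), settings))
            {
                while (reader.Read()) { }
            }
            return null;
        }
        catch (XmlException ex)
        {
            return ex.Message;
        }
    }

    static void Main()
    {
        // Shortened stand-in for the real data: several root elements,
        // with an XML declaration embedded inside <Entry>.
        string sample =
            "<Header><Version>1</Version></Header>\n" +
            "<Entry><?xml version=\"1.0\"?><Detail>x</Detail></Entry>";
        Console.WriteLine(TryParse(sample) ?? "parsed without errors");
    }
}
```

Fragment mode accepts the multiple root elements; it is only the embedded declaration that makes the reader give up.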

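What options do I have to skip over those XML declarations, other than heavy string manipulation?

Since the fragments can easily exceed 100 megabytes, I would rather not load everything into memory at once. But if that is what it takes, I am open to it.

An example of the exception I get: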

System.Xml.XmlException: Unexpected XML declaration.
The XML declaration must be the first node in the document, and no white space characters are allowed to appear before it.
Line ##, position ##.

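【Comments】:

Have you tried using the classes in the System.Xml.Linq (msdn.microsoft.com/de-de/library/bb299195) namespace?
Not yet; which ones are best to start with for parsing fragments? How heavy is LINQ's memory consumption? These files can easily reach 100 megabytes.

Tags: c# xml .net-4.0 xml-parsing xmlreader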

【Solution 1】:

I am adding this as an answer because that preserves the syntax highlighting.

    // Requires: using System; using System.IO; using System.Text;
    private void ProcessFile(string inputFileName, string outputFileName)
    {
        const string xmlDeclarationStart = "<?xml";
        const string xmlDeclarationFinish = "?>";

        // Strict UTF-8 decoding: throw on invalid bytes instead of silently replacing them.
        using (StreamReader reader = new StreamReader(inputFileName, new UTF8Encoding(false, true)))
        using (StreamWriter writer = new StreamWriter(outputFileName, false, Encoding.UTF8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Loop so that several declarations on the same line are all stripped.
                int startPosition;
                while ((startPosition = line.IndexOf(xmlDeclarationStart, StringComparison.Ordinal)) != -1)
                {
                    int endPosition = line.IndexOf(xmlDeclarationFinish, startPosition, StringComparison.Ordinal);
                    if (endPosition == -1)
                    {
                        throw new NotImplementedException(string.Format("Implementation assumption is wrong. {0} .. {1} spans multiple lines (or input file is severely malformed)", xmlDeclarationStart, xmlDeclarationFinish));
                    }
                    // the code completely strips the <?xml ... ?> part
                    // an alternative would be to make this a new XML element containing
                    // the information inside the <?xml ... ?> part as attributes
                    // just like Daren Thomas suggested
                    line = line.Substring(0, startPosition)
                         + line.Substring(endPosition + xmlDeclarationFinish.Length);
                }
                writer.WriteLine(line);
            }
        }
    }
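If writing a cleaned copy to disk is undesirable, the same line-based stripping could in principle be done on the fly. The following is only a sketch of that different technique, not the code from the answer: DeclarationStrippingReader, Strip, StreamingDemo and CountEntries are hypothetical names, and it keeps the answer's assumption that a declaration never spans more than one line.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical TextReader wrapper: hands the XmlReader one cleaned line at a time,
// so only the current line is held in memory.
sealed class DeclarationStrippingReader : TextReader
{
    private readonly TextReader inner;
    private string current = string.Empty;
    private int position;

    public DeclarationStrippingReader(TextReader inner) { this.inner = inner; }

    // Removes every <?xml ... ?> from a single line.
    public static string Strip(string line)
    {
        int start;
        while ((start = line.IndexOf("<?xml", StringComparison.Ordinal)) != -1)
        {
            int end = line.IndexOf("?>", start, StringComparison.Ordinal);
            if (end == -1)
                throw new InvalidDataException("XML declaration spans multiple lines.");
            line = line.Substring(0, start) + line.Substring(end + 2);
        }
        return line;
    }

    // Loads the next cleaned line once the current one is used up; false at end of input.
    private bool Fill()
    {
        while (position >= current.Length)
        {
            string line = inner.ReadLine();
            if (line == null) return false;
            current = Strip(line) + "\n"; // line endings are normalized to \n
            position = 0;
        }
        return true;
    }

    public override int Peek() { return Fill() ? current[position] : -1; }

    public override int Read() { return Fill() ? current[position++] : -1; }

    public override int Read(char[] buffer, int index, int count)
    {
        if (count == 0 || !Fill()) return 0;
        int n = Math.Min(count, current.Length - position);
        current.CopyTo(position, buffer, index, n);
        position += n;
        return n;
    }
}

static class StreamingDemo
{
    // Counts <Entry> elements without materializing the whole document.
    public static int CountEntries(TextReader source)
    {
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        int count = 0;
        using (var reader = XmlReader.Create(new DeclarationStrippingReader(source), settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Entry") count++;
            }
        }
        return count;
    }

    static void Main(string[] args)
    {
        // args[0] is assumed to be the path of a fragment file.
        using (var file = new StreamReader(args[0]))
        {
            Console.WriteLine(CountEntries(file));
        }
    }
}
```

Because lines are passed through one by one, the line numbers reported in any later XmlException should still match the original file; column positions after a stripped declaration will not.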

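【Comments】:

【Solution 2】:

If you are not sure that the declaration always stays the same, replace <?xml with <XmlDeclaration and ?> with /> and use the regular parser ;)

Also, have you tried passing the file through an XML tidy style program?

You could also use an SGML library to preprocess the data and output proper XML.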
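The replacement idea from the first paragraph of this answer could look roughly like the sketch below. The class DeclarationRewriter, the method ToElement and the regular expression are assumptions made for illustration; the pattern requires whitespace after <?xml so that other processing instructions such as <?xml-stylesheet ...?> are left alone, and it assumes the declaration sits on a single line.

```csharp
using System;
using System.Text.RegularExpressions;

static class DeclarationRewriter
{
    // Captures the pseudo-attributes (version, encoding, ...) between "<?xml" and "?>".
    private static readonly Regex Declaration =
        new Regex(@"<\?xml\s+(.*?)\s*\?>", RegexOptions.Compiled);

    // Turns each embedded declaration into an empty element that carries
    // the same information as attributes.
    public static string ToElement(string line)
    {
        return Declaration.Replace(line, "<XmlDeclaration $1/>");
    }

    static void Main()
    {
        Console.WriteLine(ToElement("<Entry><?xml version=\"1.0\"?><Detail>x</Detail></Entry>"));
    }
}
```

One caveat: the XML specification reserves names that start with "xml" in any letter case, so a differently named element may be the safer choice even if the parser in use accepts XmlDeclaration.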

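【Comments】:

Thanks for the answer. I know I can do a RegEx, and will do so if I cannot find a better alternative. XMLTidy in TextPad chokes on the file because it is too large. Any pointers to such an SGML library would be much appreciated.
I cannot accept both answers, and I added my own code as a separate answer so that the syntax highlighting is preserved. Your answer is not accepted because it is further from what I already had. Sorry (: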

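【Solution 3】:

I don't think the built-in classes will help; you will probably have to do some preprocessing and remove the extra headers. If your sample is accurate, a simple badXml.Replace("<?xml version=\"1.0\"?>", "") will do.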

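【Comments】:

Thanks. This is similar to what I was using before, but I am not sure the XML declaration always stays the same. Good to see we are thinking along the same lines.
I cannot accept both answers, and I added my own code as a separate answer so that the syntax highlighting is preserved. Your answer is accepted because it is closest to what I already had.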