Word Interop - Can you tell if a byte[] array of a Word Document is HTML?

【Question title】: Word Interop - Can you tell if a byte[] array of a Word Document is HTML?
【Posted】: 2016-10-15 02:52:05
【Question】:

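I am working with a code base that, in short, displays documents in a web-based viewer with a thumbnail for each page. The loading strategy and the page count calculation are separated by document type, and each document is converted to a common format for presentation.

The problem I am dealing with concerns the initial page count calculation for certain Word documents. These documents are stored in a third-party database, which holds the document's binary stream and its extension (always "doc"). To calculate the page count we use Microsoft Office Interop, as follows: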

    public int GetPageCount(byte[] file)
    {
        var filePath = Path.GetTempFileName();
        File.WriteAllBytes(filePath, file);

        return this.GetPageCount(filePath);
    }

    public int GetPageCount(string filePath)
    {
        try
        {
            this.OpenDocument(filePath);
            const WdStatistic statistic = Microsoft.Office.Interop.Word.WdStatistic.wdStatisticPages;
            var pages = Document.ComputeStatistics(statistic, Type.Missing);

            return pages;
        }
        finally
        {
            //Closes handles, removes temp files, implementation omitted for brevity
            this.DisposeDocument();
            this.DisposeApplication();
        }
    }

    private void OpenDocument(string filePath)
    {
        // Create a new Microsoft Word application object
        this.Word = new Application();
        this.Word.Visible = false;
        this.Word.ScreenUpdating = false;

        object refFilePath = filePath;

        // Note: this format value is declared but never passed to Documents.Open below.
        object html = WdOpenFormat.wdOpenFormatWebPages;

        this.Document = this.Word.Documents.Open(ref refFilePath, ref this.missing, ref this.missing, ref this.missing, ref this.missing, ref this.missing, ref this.missing, ref this.missing, ref this.missing, ref this.missing, ref this.missing, ref this.missing, ref this.missing, ref this.missing, ref this.missing, ref this.missing);

        if (Document == null)
        {
            throw new Exception(string.Format("Could not open Word document ({0})", filePath));
        }
    }

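Most of the documents handled by this code are normal Word documents. However, some of them are actually HTML documents saved as Word documents, and unfortunately this code, using wdStatisticPages, incorrectly reports that those documents have only 1 page. I am not sure whether something is missing from the existing code that would let the Interop library determine the page count of the HTML correctly; that would seem to be the simplest option.

As an alternative, I have considered whether it is possible to determine if the byte array can be parsed as HTML. We have a rendering strategy for .html files, but it is not used because the "doc" strategy is inferred from the database. Converting the binary of an HTML document to a string gives us the raw HTML, and I wondered whether something clever like a regular expression or a third-party library would be viable. Either would be acceptable, but I would like to know whether .NET has something elegant that does this better. If something native to .NET is available, I would prefer not to introduce a dependency or rely on a regex. Something like: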

    public bool IsHtml(byte[] file)
    {
        var fileString = Encoding.UTF8.GetString(file); 
        //Validate the fileString; how do we determine that the GetString() method correctly parsed and is not garbage?
        //return answer
    }

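I should point out that another option is to have the vendor of the third-party database correct their data, for example by storing "html" as the extension. But I am curious whether handling the difference in code is actually possible and clean enough to be worth considering. I have done some research and searching on StackOverflow but could not find anything relevant to this query.

Thanks for any help and ideas. Please ask if you need more information or details.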

【Question comments】:

Tags: c# html ms-word office-interop

【Answer 1】:

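In theory, you should be able to use one of the overloads of XDocument.Load() to try to load the file into an XML object, since HTML is XML, assuming it is valid HTML.

In fact, most of the XML classes could be used to attempt this, especially if you already have the string; you just have to assume that invalid XML means it is actually a Word doc.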
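The idea in this answer could be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names (`MarkupSniffer`, `LooksLikeXhtml`) are hypothetical, the check for an `html` root element is an added assumption, and DTD processing is switched off so that a DOCTYPE declaration does not trigger an external download.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class MarkupSniffer
{
    // Hypothetical helper: returns true only when the bytes are well-formed XML
    // whose root element is <html>. Anything the XML reader rejects counts as "not HTML".
    public static bool LooksLikeXhtml(byte[] file)
    {
        if (file == null || file.Length == 0)
        {
            return false;
        }

        // Ignore any DOCTYPE and never resolve external resources.
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Ignore,
            XmlResolver = null
        };

        try
        {
            // Loading from a stream lets the reader detect the encoding and skip a BOM.
            using (var stream = new MemoryStream(file))
            using (var reader = XmlReader.Create(stream, settings))
            {
                var root = XDocument.Load(reader).Root;
                return root != null &&
                       string.Equals(root.Name.LocalName, "html", StringComparison.OrdinalIgnoreCase);
            }
        }
        catch (XmlException)
        {
            // Not well-formed XML (for example binary Word data or loosely written HTML).
            return false;
        }
    }
}
```

A `true` result is a strong hint that the payload is XHTML-style markup. A `false` result proves much less: ordinary HTML with unclosed tags or undeclared entities such as `&nbsp;` is rejected by an XML reader as well, so this check alone cannot separate "Word" from "HTML".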

Edit: Crap, I just realized that the newer Word formats are XML too, so this may not work... but I believe you could apply a similar idea with HtmlAgilityPack to solve this.
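Regarding the edit above, a different technique that is not proposed anywhere in this thread is to look at the leading bytes of the array. Legacy binary .doc files are OLE compound files that start with a fixed 8-byte signature, and .docx packages are ZIP archives (containing XML parts) that start with "PK". The sketch below uses hypothetical names (`ContainerSniffer`, `IsBinaryWordContainer`) and assumes the stored bytes are the unmodified file contents.

```csharp
using System;
using System.Linq;

public static class ContainerSniffer
{
    // Signature of an OLE compound file, the container used by legacy binary .doc files.
    private static readonly byte[] OleSignature = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // ZIP local file header signature ("PK\x03\x04"); .docx packages are ZIP archives.
    private static readonly byte[] ZipSignature = { 0x50, 0x4B, 0x03, 0x04 };

    // Hypothetical helper: true when the payload starts like a binary Word container.
    public static bool IsBinaryWordContainer(byte[] file)
    {
        return StartsWith(file, OleSignature) || StartsWith(file, ZipSignature);
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        return data != null
            && data.Length >= prefix.Length
            && data.Take(prefix.Length).SequenceEqual(prefix);
    }
}
```

A payload that fails this check is not necessarily HTML (it could be RTF or plain text, for instance), so it would make sense to treat the result as a cheap first filter before handing the remaining candidates to an HTML parser.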

Also see this thread for some ideas on various third-party and .NET tricks that may be useful -> What is the best way to parse html in C#?

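【Comments】:

Use an HtmlParser rather than XML; simple errors such as a tag that is not closed will break your XML parser.
There is no guarantee that valid HTML is valid XML. In fact, most valid HTML is probably not valid XML.
Thanks for your input; XML validation was an initial consideration, but it doesn't make much practical sense. It looks like this is a job for a third-party library or dependency, or for fixing the root cause (the data!) so that it is correct :)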