How to access OpenXML content by page number? Answers

[Question title]: How to access OpenXML content by page number?
[Posted]: 2016-10-12 07:29:57
[Question description]:

Using OpenXML, can I read document content by page number?

wordDocument.MainDocumentPart.Document.Body gives the content of the whole document.

    using System;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Wordprocessing;

    public void OpenWordprocessingDocumentReadonly()
    {
        string filepath = @"C:\...\test.docx";
        // Open a WordprocessingDocument based on a filepath (read-only).
        using (WordprocessingDocument wordDocument =
            WordprocessingDocument.Open(filepath, false))
        {
            // Assign a reference to the existing document body.
            Body body = wordDocument.MainDocumentPart.Document.Body;
            int pageCount = 0;
            // The extended properties part, or its Pages element, may be absent,
            // so read it with null-conditional access.
            string pagesText =
                wordDocument.ExtendedFilePropertiesPart?.Properties?.Pages?.Text;
            if (pagesText != null)
            {
                pageCount = Convert.ToInt32(pagesText);
            }
            for (int i = 1; i <= pageCount; i++)
            {
                // Read the content by page number
            }
        }
    }

MSDN Reference

Update 1:

It looks like a page break is stored like this:

<w:p w:rsidR="003328B0" w:rsidRDefault="003328B0">
        <w:r>
            <w:br w:type="page" />
        </w:r>
    </w:p>

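So now I need to split the XML on the check above and take the InnerText of each piece, which would give me the text page by page.

The question now is: how do I split the XML using that check?

Update 2:

A page break element is only written when you insert a page break explicitly. If text simply flows from one page onto the next, no page break XML element is written, so I am back to the same challenge of how to identify page breaks.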

[Question discussion]:

Read this: stackoverflow.com/questions/14479698/…
@PaulZahra I can't find any such element (lastRenderedPageBreak) in the XML.

Tags: c# xml openxml docx openxml-sdk

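[Solution 1]:

You cannot reference OOXML content by page number at the OOXML data level alone.

Hard page breaks are not the problem; hard page breaks can be counted.
Soft page breaks are the problem. They are computed by line-breaking and pagination algorithms that are implementation dependent; they are not intrinsic to the OOXML data. There is nothing to count.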

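What about w:lastRenderedPageBreak, which records the position of a soft page break at the time the document was last rendered? No, w:lastRenderedPageBreak does not help in general either, because:

By definition, the w:lastRenderedPageBreak position is stale once the content has been changed since it was last opened by a program that paginates it.
In MS Word's implementation, w:lastRenderedPageBreak is known to be unreliable in various circumstances.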

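If you are willing to accept a dependence on Word Automation, with all of its inherent licensing and server operation limitations, then you have a chance of determining page boundaries, page numbers, page counts, and so on.

Otherwise, the only real answer is to move beyond page-based referencing frameworks that depend on proprietary, implementation-specific pagination algorithms.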
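
To make the distinction above concrete, here is a minimal sketch that counts the two kinds of markers in the main document XML (word/document.xml inside the .docx package). It uses only System.Xml.Linq rather than the Open XML SDK; the class and method names are made up for illustration. It only looks at w:br with w:type="page" and at w:lastRenderedPageBreak, and ignores other ways a page can start (for example section breaks or the w:pageBreakBefore paragraph property), so treat the numbers as rough indicators, not as a page count.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the Open XML SDK.
public static class PageBreakCounter
{
    // WordprocessingML main namespace (the "w:" prefix).
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Counts explicit (hard) page breaks: <w:br w:type="page"/>.
    // A <w:br/> without w:type="page" is a line break and is skipped.
    public static int CountHardBreaks(XDocument doc)
    {
        return doc.Descendants(W + "br")
                  .Count(br => (string)br.Attribute(W + "type") == "page");
    }

    // Counts soft-break hints left by whichever program last paginated
    // the document. These may be stale, incomplete, or absent.
    public static int CountRenderedBreakHints(XDocument doc)
    {
        return doc.Descendants(W + "lastRenderedPageBreak").Count();
    }
}
```

A document that was generated programmatically and never opened in a paginating application would be expected to report zero hints here, which is consistent with the comment under the question about not finding lastRenderedPageBreak in the XML.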

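[Discussion]:

Thanks for the details. That is also what I concluded from my research. But can I use Word Automation from a web-based interface? I mean, my Word documents are stored as binary in my database, and I would use licensed Word Automation to get the page content.
How about using Add-In Express? add-in-express.com/creating-addins-blog/2013/08/07/…
I don't recommend using Word Automation on a server because of the inherent licensing and server operation limitations stated by Microsoft, but if it works for your situation, great.
The technique discussed in the Add-in Express post you cite requires Word Automation.
Fair enough, you get the bounty :)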

[Solution 2]:

This is how I ended up doing it.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Wordprocessing;

    public Dictionary<int, string> OpenWordprocessingDocumentReadonly()
    {
        string filepath = @"C:\...\test.docx";
        // Maps a 1-based page number to the text collected for that page.
        Dictionary<int, string> pagewiseContent = new Dictionary<int, string>();
        // Open a WordprocessingDocument based on a filepath (read-only).
        using (WordprocessingDocument wordDocument =
            WordprocessingDocument.Open(filepath, false))
        {
            // Assign a reference to the existing document body.
            Body body = wordDocument.MainDocumentPart.Document.Body;
            int i = 1;
            StringBuilder pageContentBuilder = new StringBuilder();
            foreach (var element in body.ChildElements)
            {
                // Keep the element's own text; the original version dropped
                // the text of any paragraph that contained the break.
                pageContentBuilder.Append(element.InnerText);

                // Detect an explicit page break via the typed API instead of
                // matching the serialized XML string.
                bool hasPageBreak = element.Descendants<Break>()
                    .Any(b => b.Type != null && b.Type.Value == BreakValues.Page);
                if (hasPageBreak)
                {
                    // Simplification: the break is treated as ending this element.
                    pagewiseContent.Add(i, pageContentBuilder.ToString());
                    i++;
                    pageContentBuilder.Clear();
                }
            }
            // Flush whatever follows the last explicit page break.
            if (pageContentBuilder.Length > 0)
            {
                pagewiseContent.Add(i, pageContentBuilder.ToString());
            }
        }
        return pagewiseContent;
    }

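Drawback: this does not work in every case. It only works when the document has explicit page breaks. If text simply runs on from page 1 to page 2, there is no marker that tells you that you are now on page 2.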
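
One way to soften that drawback, at the cost of relying on markers that Solution 1 describes as unreliable, is to also split on w:lastRenderedPageBreak when it happens to be present. The sketch below is an illustration only: it works on the raw document XML with System.Xml.Linq, the class and method names are made up, and it assumes the last renderer's hints are still valid. If a producer writes both a hard break and a rendered-break hint for the same page boundary, this sketch will emit an empty page for it.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper, not part of the Open XML SDK.
public static class PageSplitter
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Walks all elements in document order, collecting <w:t> text and
    // starting a new page at every hard break or rendered-break hint.
    public static List<string> SplitIntoPages(XDocument doc)
    {
        var pages = new List<string>();
        var current = new StringBuilder();
        foreach (XElement node in doc.Descendants())
        {
            if (node.Name == W + "t")
            {
                current.Append(node.Value);
            }
            else if (node.Name == W + "lastRenderedPageBreak" ||
                     (node.Name == W + "br" &&
                      (string)node.Attribute(W + "type") == "page"))
            {
                pages.Add(current.ToString());
                current.Clear();
            }
        }
        // Text after the last marker belongs to the final page.
        pages.Add(current.ToString());
        return pages;
    }
}
```

Because the split happens inside runs rather than per body element, text before and after a break in the same paragraph ends up on different pages. The result is still only as good as the markers stored in the file.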

[Discussion]:

Thanks for your answer! How can I copy the page content into a new document?

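[Solution 3]:

Unfortunately, as answered in "Why only some page numbers stored in XML of docx file?", a docx file does not contain reliable page number information. The XML files have no page numbers until Microsoft Word opens the document and renders it dynamically. That holds even if you read the Open XML documentation, such as https://docs.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.pagenumber?view=openxml-2.8.1.

You can unzip a few docx files and search for "page" or "pg"; then you will see for yourself. I did this with several different kinds of docx files, and they all told me the same thing. Glad if this helps.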
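
If you would rather not unzip by hand, the same search can be sketched in a few lines with System.IO.Compression and System.Xml.Linq. This is only an illustration; the class and method names are made up, and it simply lists element names containing "page" or "pg" per XML part so you can inspect what is (and is not) stored.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the Open XML SDK.
public static class DocxPageScan
{
    // Returns "entryPath: elementName" for each distinct element whose
    // local name contains "page" or "pg", in every .xml part of the package.
    public static List<string> FindPageRelatedElements(Stream docxStream)
    {
        var hits = new List<string>();
        using (var zip = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            foreach (ZipArchiveEntry entry in zip.Entries)
            {
                if (!entry.FullName.EndsWith(".xml", StringComparison.OrdinalIgnoreCase))
                    continue;
                using (Stream s = entry.Open())
                {
                    XDocument doc = XDocument.Load(s);
                    IEnumerable<string> names = doc.Descendants()
                        .Select(e => e.Name.LocalName)
                        .Where(n => n.IndexOf("page", StringComparison.OrdinalIgnoreCase) >= 0
                                 || n.IndexOf("pg", StringComparison.OrdinalIgnoreCase) >= 0)
                        .Distinct();
                    hits.AddRange(names.Select(n => entry.FullName + ": " + n));
                }
            }
        }
        return hits;
    }
}
```

Typical hits are layout settings and properties (page size, margins, a stored page count), not a per-paragraph page number, which is the point this answer makes.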

[Discussion]:

[Solution 4]:

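// The generic type arguments are missing from this snippet as posted;
// Paragraph and LastRenderedPageBreak are assumed here.
List<Paragraph> Allparagraphs = wp.MainDocumentPart.Document.Body.OfType<Paragraph>().ToList();

// Paragraphs that contain exactly one rendered page break marker.
List<Paragraph> PageParagraphs = Allparagraphs.Where(x => x.Descendants<LastRenderedPageBreak>().Count() == 1).ToList();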

[Discussion]:

Please add an explanation of the code and of how it solves the problem.

[Solution 5]:

Rename the docx to zip and open the docProps\app.xml file:

 <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties" xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
  <Template>Normal</Template>
  <TotalTime>0</TotalTime>
  <Pages>1</Pages>
  <Words>141</Words>
  <Characters>809</Characters>
  <Application>Microsoft Office Word</Application>
  <DocSecurity>0</DocSecurity>
  <Lines>6</Lines>
  <Paragraphs>1</Paragraphs>
  <ScaleCrop>false</ScaleCrop>
  <HeadingPairs>
    <vt:vector size="2" baseType="variant">
      <vt:variant>
        <vt:lpstr>Название</vt:lpstr>
      </vt:variant>
      <vt:variant>
        <vt:i4>1</vt:i4>
      </vt:variant>
    </vt:vector>
  </HeadingPairs>
  <TitlesOfParts>
    <vt:vector size="1" baseType="lpstr">
      <vt:lpstr/>
    </vt:vector>
  </TitlesOfParts>
  <Company/>
  <LinksUpToDate>false</LinksUpToDate>
  <CharactersWithSpaces>949</CharactersWithSpaces>
  <SharedDoc>false</SharedDoc>
  <HyperlinksChanged>false</HyperlinksChanged>
  <AppVersion>14.0000</AppVersion>
</Properties>

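The OpenXML library reads wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text from the <Pages>1</Pages> property. This property is only written by the winword application. If the Word document has been changed since then, wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text is no longer accurate. If the Word document was created programmatically, wordDocument.ExtendedFilePropertiesPart is often null.

[Discussion]: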
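
Since both the part and the <Pages> element can be missing, any code that reads this value should treat it as optional. Below is a minimal sketch that reads it straight from the package with System.IO.Compression and System.Xml.Linq. The class and method names are made up, and the sketch assumes the extended properties live at the conventional docProps/app.xml path shown above instead of resolving the part through the package relationships.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;

// Hypothetical helper, not part of the Open XML SDK.
public static class AppXmlPages
{
    // Namespace of the extended-properties part, as seen in app.xml above.
    static readonly XNamespace Ep =
        "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";

    // Returns the stored page count, or null when docProps/app.xml or
    // its <Pages> element is missing or not a number.
    public static int? ReadStoredPageCount(Stream docxStream)
    {
        using (var zip = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("docProps/app.xml");
            if (entry == null)
                return null;
            using (Stream s = entry.Open())
            {
                XElement root = XDocument.Load(s).Root;
                XElement pages = root == null ? null : root.Element(Ep + "Pages");
                int n;
                if (pages != null && int.TryParse(pages.Value, out n))
                    return n;
                return null;
            }
        }
    }
}
```

Even when a number comes back, it is whatever the last application stored there, so it says nothing about where each page begins and may not match the current content.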