XHTML to Structured XML with XSLT 1.0

【Question title】: XHTML to Structured XML with XSLT 1.0
【Posted】: 2017-06-05 22:36:31
【Question】:

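I have an XHTML document from a basic ePub output that I'm trying to convert into a structured XML document. Its format is generally nothing too crazy, and looks like this: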

<?xml version="1.0" encoding="utf-8"?>
<html>
<body>
  <h1>Topic 1</h1>
  <p>1.0.1</p>
  <p>1.0.2</p>

  <h2>Subtopic 1.1</h2>
  <p>1.1.1</p>
  <p>1.1.2</p>

  <h2>Subtopic 1.2</h2>
  <p>1.2.1</p>
  <p>1.2.2</p>

  <h1>Topic 2</h1>
  <p>2.0.1</p>
  <p>2.0.2</p>

  <h2>Subtopic 2.1</h2>
  <p>2.1.1</p>
  <p>2.1.2</p>

  <h2>Subtopic 2.2</h2>
  <p>2.2.1</p>
  <p>2.2.2</p>
</body>
</html>

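Ideally, I'd like to convert it into a structure based on the h1, h2, ... tags. Everything after the first h1 and before the second should go into its own container, and everything from the second h1 to the end of the document should go into its own container. Likewise, the content between h2 elements should go into a container of its own, nested inside the enclosing h1 container. The output should look like this: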

<Root>
   <Topic>
      <Title>Topic 1</Title>
      <Paragraph>1.0.1</Paragraph>
      <Paragraph>1.0.2</Paragraph>
      <Topic>
         <Title>Subtopic 1.1</Title>
         <Paragraph>1.1.1</Paragraph>
         <Paragraph>1.1.2</Paragraph>
      </Topic>
      <Topic>
         <Title>Subtopic 1.2</Title>
         <Paragraph>1.2.1</Paragraph>
         <Paragraph>1.2.2</Paragraph>
      </Topic>
   </Topic>
   <Topic>
      <Title>Topic 2</Title>
      <Paragraph>2.0.1</Paragraph>
      <Paragraph>2.0.2</Paragraph>
      <Topic>
         <Title>Subtopic 2.1</Title>
         <Paragraph>2.1.1</Paragraph>
         <Paragraph>2.1.2</Paragraph>
      </Topic>
      <Topic>
         <Title>Subtopic 2.2</Title>
         <Paragraph>2.2.1</Paragraph>
         <Paragraph>2.2.2</Paragraph>
      </Topic>
   </Topic>
</Root>

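Although this example contains only p tags, the content may also contain div and other elements, so don't count on it being a single kind of node. The solution needs to be generic enough not to care what sits between the heading tags.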

I'm familiar with Muenchian grouping, but this case is a bit too complicated for me. I tried using a key like this:

<xsl:key name="kHeaders1" match="*[not(self::h1)]" use="generate-id(preceding-sibling::h1[1])"/>

<xsl:template match="h1">
  <Topic>
    <Title><xsl:apply-templates /></Title>
    <xsl:apply-templates select="key('kHeaders1', generate-id())" />
  </Topic>
</xsl:template>

<xsl:template match="html">
  <Root>
     <xsl:apply-templates select="body/h1" />
  </Root>
</xsl:template>

<xsl:template match="p">
   <Paragraph><xsl:apply-templates /></Paragraph>
</xsl:template>

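This works well for the first level, but repeating the process with h2 is where I get stuck. At the h2 level, the key for any node should be the nearest preceding h1 or h2 sibling. It seems this could be combined into a single key whose value is the id of the last h* element before the node, with the h* elements themselves left out of the grouping (so that they don't recurse). I would imagine something like this: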

<xsl:key name="kHeaders" match="*[not(self::h1 or self::h2)]" use="generate-id(preceding-sibling::*[self::h1 or self::h2][1])"/>

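However, this leaves the h2 elements out of the list, and they need to be present in the group of the preceding h1. If I relax the match to include h1/h2 elements (and make the h1 template match h2 as well), then each h2 re-lists the h1 and so on (somewhat expected).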

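An ideal solution would be one that can be extended to h3, h4, and so on without much effort. It does not need to include scripting to handle generic h* elements; a simple note on how to add further levels would be enough.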

Does anyone here have any suggestions?

【Comments】:

Tags: xml xslt xpath xslt-1.0 xslt-grouping

【Solution 1】:

The following stylesheet (most of the basic code is copied from this answer) also works when more heading levels are involved:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:key name="next-headings" match="h6"
          use="generate-id(preceding-sibling::*[self::h1 or self::h2 or
                                               self::h3 or self::h4 or
                                               self::h5][1])" />

    <xsl:key name="next-headings" match="h5"
          use="generate-id(preceding-sibling::*[self::h1 or self::h2 or
                                               self::h3 or self::h4][1])" />
    <xsl:key name="next-headings" match="h4"
          use="generate-id(preceding-sibling::*[self::h1 or self::h2 or
                                               self::h3][1])" />
    <xsl:key name="next-headings" match="h3"
          use="generate-id(preceding-sibling::*[self::h1 or self::h2][1])" />

    <xsl:key name="next-headings" match="h2"
          use="generate-id(preceding-sibling::h1[1])" />

    <xsl:key name="immediate-nodes"
          match="node()[not(self::h1 | self::h2 | self::h3 | self::h4 |
                           self::h5 | self::h6)]"
          use="generate-id(preceding-sibling::*[self::h1 or self::h2 or
                                               self::h3 or self::h4 or
                                               self::h5 or self::h6][1])" />

    <xsl:template match="/">
        <Root>
            <xsl:apply-templates select="html/body/h1"/>
        </Root>
    </xsl:template>

    <xsl:template match="p">
        <Paragraph>
            <xsl:value-of select="."/>
        </Paragraph>
    </xsl:template>

    <xsl:template match="h1 | h2 | h3 | h4 | h5 | h6">
        <Topic>
            <Title>
                <xsl:value-of select="."/>
            </Title>
            <xsl:apply-templates select="key('immediate-nodes', generate-id())"/>
            <xsl:apply-templates select="key('next-headings', generate-id())"/>
        </Topic>
    </xsl:template>

</xsl:stylesheet>
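
The sample input in the question has no namespace, so the unprefixed names above match it. Real ePub XHTML often declares `xmlns="http://www.w3.org/1999/xhtml"` on the root element, and in XSLT 1.0 an unprefixed name in a pattern or XPath expression never matches a namespaced element. The following is a minimal sketch, assuming the input is in that namespace and limited to h1/h2 for brevity; the `x` prefix is an arbitrary choice made for this illustration, not part of the original answer.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:x="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="x">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Assumption: the input declares the XHTML namespace; "x" is bound to it here. -->

  <!-- h2 headings, indexed by the id of the nearest preceding h1 -->
  <xsl:key name="next-headings" match="x:h2"
           use="generate-id(preceding-sibling::x:h1[1])"/>

  <!-- Non-heading nodes, indexed by the id of the nearest preceding heading -->
  <xsl:key name="immediate-nodes"
           match="node()[not(self::x:h1 or self::x:h2)]"
           use="generate-id(preceding-sibling::*[self::x:h1 or self::x:h2][1])"/>

  <xsl:template match="/">
    <Root>
      <xsl:apply-templates select="x:html/x:body/x:h1"/>
    </Root>
  </xsl:template>

  <xsl:template match="x:p">
    <Paragraph><xsl:value-of select="."/></Paragraph>
  </xsl:template>

  <xsl:template match="x:h1 | x:h2">
    <Topic>
      <Title><xsl:value-of select="."/></Title>
      <!-- Content directly under this heading, then its sub-headings -->
      <xsl:apply-templates select="key('immediate-nodes', generate-id())"/>
      <xsl:apply-templates select="key('next-headings', generate-id())"/>
    </Topic>
  </xsl:template>
</xsl:stylesheet>
```

`exclude-result-prefixes="x"` keeps the XHTML namespace declaration out of the generated `Root` element. Adding h3 to h6 follows the same pattern as the full stylesheet above, with each name prefixed.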

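【Comments】:

Perfect! I didn't come across that answer when I searched, and it seems to have most of what I need. Thanks for adapting it and saving my sanity. I also didn't know I could "add" to a key; I had only ever defined each key once. So I learned something as well!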
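
On "adding" to a key: XSLT 1.0 allows several `xsl:key` declarations with the same name, and a lookup with `key()` returns the nodes contributed by all of them. A small sketch of that behaviour, using a key name (`sub`) and a text report invented for this illustration:

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Two declarations share the (hypothetical) name "sub"; their entries are merged. -->
  <xsl:key name="sub" match="h2"
           use="generate-id(preceding-sibling::h1[1])"/>
  <xsl:key name="sub" match="h3"
           use="generate-id(preceding-sibling::*[self::h1 or self::h2][1])"/>

  <xsl:template match="/">
    <!-- For every h1 and h2, count the sub-headings filed under it in the merged key. -->
    <xsl:for-each select="//h1 | //h2">
      <xsl:value-of select="concat(., ': ',
                                   count(key('sub', generate-id())),
                                   ' direct sub-heading(s)&#10;')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Because each declaration has its own `match` and `use`, every heading level can compute its parent differently while the heading template still performs a single lookup.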

【Solution 2】:

This will do it:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <Root>
      <xsl:apply-templates select="//h1"/>
    </Root>
  </xsl:template>

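  <!-- Note: this name test matches any element whose name starts with "h"
       (hr, header, ...), so it assumes only h1..h6 occur among the body's children. -->
  <xsl:template match="*[starts-with(local-name(), 'h')]">
    <xsl:variable name="lvl" select="number(substring-after(local-name(), 'h'))"/>
    <Topic>
      <Title>
        <xsl:value-of select="."/>
      </Title>
      <!-- Non-heading siblings whose nearest preceding heading is this one;
           generate-id() compares node identity rather than text content. -->
      <xsl:apply-templates select="following-sibling::*[not(starts-with(local-name(), 'h'))
                           and generate-id(preceding-sibling::*[starts-with(local-name(), 'h')][1]) = generate-id(current())]"/>
      <!-- Headings one level deeper whose nearest preceding heading of this level is this one. -->
      <xsl:apply-templates select="following-sibling::*[local-name() = concat('h', $lvl + 1)
                           and generate-id(preceding-sibling::*[local-name() = concat('h', $lvl)][1]) = generate-id(current())]"/>
    </Topic>
  </xsl:template>

  <!-- Everything else becomes a Paragraph holding the element's full text. -->
  <xsl:template match="*">
    <Paragraph>
      <xsl:value-of select="."/>
    </Paragraph>
  </xsl:template>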
</xsl:stylesheet>

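【Comments】:

I tested this one as well, and it does the trick without keys. It is also a bit more condensed. It's not very reassuring to learn it could be done this simply, but it is still a perfectly valid answer!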