Regex to find a subelement in XML: answers

【Question title】: Regex to find subelement in XML
【Posted】: 2019-06-13 14:27:16
【Question】:

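I am using the regex search feature in Notepad++ to look for matches across hundreds of files.

My goal is to find a parent/child combination in each file. I don't much care what exactly gets selected (the parent and the child, or just the child). I just want to know whether a parent contains a specific child.

I want to find a parent <description> element that also has a child <description> element.

Example that it should find (because one of the children is a <description>):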

<description>
    <otherstuff>
    </otherstuff>
    <something>
    </something>
    <description>
    </description>
    <otherstuff>
    </otherstuff>
</description>

Example that it should not find:

<description>
    <otherstuff>
    </otherstuff>
    <something>
    </something>
    <notadescription>
    </notadescription>
    <otherstuff>
    </otherstuff>
</description>

Each element may have other children and nested descendants. Both cases may also appear in the same document.

If I search for this:

<description>(.*)<description>(.*)</description>

It selects too much: it runs on into another top-level element, when I only want the second part to match a child.
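A minimal Python sketch of this over-matching, assuming Python's `re` module with `re.DOTALL` as a stand-in for Notepad++'s ". matches newline" option. The sample text and the variable names are illustrative only and do not come from the original files.

```python
import re

# The pattern from the question; re.DOTALL lets "." cross line breaks.
QUESTION_PATTERN = re.compile(
    r"<description>(.*)<description>(.*)</description>", re.DOTALL
)

# A <description> with no nested <description> (hypothetical sample).
flat = """<description>
    <notadescription>
    </notadescription>
</description>"""

# Two such elements side by side in one document.
siblings = flat + "\n" + flat

single_match = QUESTION_PATTERN.search(flat)
sibling_match = QUESTION_PATTERN.search(siblings)

# One element alone is not matched, but two siblings are matched as if the
# second one were a child of the first.
print(single_match is None)
print(sibling_match.group(0) == siblings)
```

Under these assumptions the pattern treats the second top-level element as if it were nested in the first, which is the false positive described above.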

【Comments】:

Tags: regex xml notepad++

【Solution 1】:

You say you are using Notepad++, so here is a way to do it:

Ctrl+F
Find what: <description>(?:(?!</description).)*<description>(?:(?!<description>).)*</description>
Check "Match case"
Check "Wrap around"
Check "Regular expression"
Check ". matches newline"

Explanation:

<description>               # opening tag
(?:(?!</description).)*     # tempered greedy token: any characters, as long as no closing tag starts here
<description>               # opening tag
(?:(?!<description>).)*     # tempered greedy token: any characters, as long as no opening tag starts here
</description>              # closing tag
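The pattern above can be tried outside Notepad++ as well. The following is a minimal sketch in Python, assuming `re.DOTALL` behaves like the ". matches newline" option for this purpose. The helper name `has_nested_description` and the two sample strings are hypothetical and only serve the illustration.

```python
import re

# Pattern from the answer; re.DOTALL stands in for ". matches newline".
PATTERN = re.compile(
    r"<description>(?:(?!</description).)*"
    r"<description>(?:(?!<description>).)*</description>",
    re.DOTALL,
)

# Sample with a <description> child (should be found).
nested = """<description>
    <otherstuff>
    </otherstuff>
    <description>
    </description>
</description>"""

# Sample without a <description> child (should not be found).
flat = """<description>
    <notadescription>
    </notadescription>
</description>"""


def has_nested_description(text: str) -> bool:
    """Return True if the pattern finds a parent/child pair in text."""
    return PATTERN.search(text) is not None


print(has_nested_description(nested))              # True
print(has_nested_description(flat))                # False
print(has_nested_description(flat + "\n" + flat))  # False
print(has_nested_description(flat + "\n" + nested))  # True
```

The first tempered token cannot step over a closing tag, so two sibling elements are not mistaken for a parent and a child. Like any regex approach to XML, this is a text-level check and may misfire on comments, CDATA sections, or tags with attributes.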

【Comments】:

【Solution 2】:

You should not use (.*) because it is greedy. Here is an example of why you should not use it in your case:

<description>
    <otherstuff>
    </otherstuff>
    <description>
        <description>hello</description>
    </description>
</description>

Suppose we use <description>(.*)<description>(.*)</description> here. The greedy (.*) runs on to the last closing tag, so the match extends through:

    <description>
        <description>hello</description>
    </description>
</description>

So if you only want to match the content of the second description, you should use (.*?), which is called non-greedy (lazy). Using <description>(.*)<description>(.*?)</description> will match:

<description>
    <description>hello</description> # end of match
# the outer </description> tags are not included, because (.*?) stops at the first closing tag

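So you have to use (.*?), which stops matching as soon as it finds the first closing tag, whereas (.*) is greedy and looks for the longest possible match.

So if you use <description>(.*)<description>(.*?)</description> you will be fine, because in your case it only matches the content inside the child description.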
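The difference between greedy and lazy quantifiers can be seen in isolation with a short Python sketch. The one-line sample string is hypothetical and simpler than the XML in the question; it only illustrates how (.*) and (.*?) choose their end point.

```python
import re

# Two sibling elements on one line (illustrative sample).
text = "<description>a</description><description>b</description>"

# Greedy: (.*) grabs as much as possible, then backs off to the LAST closing tag.
greedy = re.search(r"<description>(.*)</description>", text)

# Lazy: (.*?) grabs as little as possible and stops at the FIRST closing tag.
lazy = re.search(r"<description>(.*?)</description>", text)

print(greedy.group(1))  # a</description><description>b
print(lazy.group(1))    # a
```

Note that making only the last group lazy changes where the match ends, not where the leading greedy group reaches, which fits the comment below that the suggested pattern can still select too much.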

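【Comments】:

This still seems to select too much. If there are multiple occurrences, it selects everything in between as well as the last occurrence.

【Solution 3】:

I guess we could design an expression that excludes <notadescription>, for example: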

<description>(?!<notadescription>)[\s\S]*<\/description>

If we want to capture the description element, we may need a capturing group:

(<description>(?!<notadescription>)[\s\S]*<\/description>)

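【Comments】:

In Notepad++, this one is a bit too greedy: it selects from the first occurrence to the last, and it also selects elements that should not match.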
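The behaviour reported in this comment can be reproduced with a small Python sketch, assuming Python's `re` module treats the Solution 3 pattern the same way for this input. The sample strings and variable names are illustrative only.

```python
import re

# Pattern from Solution 3. The lookahead only inspects the text directly
# after the opening tag, and [\s\S]* is greedy.
PATTERN = re.compile(r"<description>(?!<notadescription>)[\s\S]*<\/description>")

# An element that should NOT be found: it has no <description> child.
flat = """<description>
    <notadescription>
    </notadescription>
</description>"""

two_blocks = flat + "\n" + flat

match_flat = PATTERN.search(flat)
match_two = PATTERN.search(two_blocks)

# The opening tag is followed by a newline, not by <notadescription>,
# so the lookahead passes and the element is matched anyway.
print(match_flat is not None)
# With two elements, the greedy match spans from the first to the last tag.
print(match_two.group(0) == two_blocks)
```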