Regex: How can I select all the contents between two headings?

【Question title】: Regex: How can I select all the contents between two headings?
【Posted】: 2021-01-31 14:02:41
【Question description】:

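I want to select the content between any two headings.

I have created this regex, but it doesn't quite select what I need. Currently it selects the heading and the paragraph, but it doesn't select the last heading.

Current regex: /^<h.*?(?:>)(.*?)(?=<\h)/gms

Given string: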

<h2>What is lorem impsum</h2>
Stack overflow is a great community for developers to seek help and connect the beautiful experience.

<h3>What is quora?</h3>
Quoora is good but doesn\'t provide any benefits to the person who\'s helping others economically. 
But its a nice place to be at.
another paragraph betwen these headings

<h3>Who is Kent C Dodds</h3>
One of the best guy to learn react with. He also has helped a lot of 
people with his kindness and his contents on the internet.

Expected result:

[
    'Stack overflow is a great community for developers to seek help and connect the beautiful 
    experience.',

    'Quoora is good but doesn't provide any benefits to the person who's helping others economically. 
    But it\'s a nice place to be at.
    another paragraph betwen these headings',

   'One of the best guy to learn react with. He also has helped a lot of 
    people with his kindness and his contents on the internet.'

]
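
To see what the current regex actually captures, it can be run against a shortened version of the sample. The sketch below is only a diagnostic; the names `sample`, `current` and `captured` are made up for illustration. In JavaScript without the `u` flag, `\h` is just a literal `h`, so the lookahead demands that another `<h` follows, and the section after the last heading can never match. Also, `.*?(?:>)` stops at the first `>`, so group 1 begins inside the heading itself.

```javascript
// Shortened stand-in for the sample text from the question.
const sample = `<h2>First</h2>
alpha

<h3>Second</h3>
beta

<h3>Third</h3>
gamma`;

// The regex from the question, unchanged.
const current = /^<h.*?(?:>)(.*?)(?=<\h)/gms;

// Collect group 1 of every match.
const captured = Array.from(sample.matchAll(current), m => m[1]);

console.log(captured.length); // 2: nothing is captured after the last heading
console.log(captured);        // each entry still starts with the heading text and its closing tag
```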

【Comments】:

Avoid using regex to parse HTML.
The last item in the expected result is not between heading tags.
@anubhava Why is that? Could you elaborate?
Please see: stackoverflow.com/questions/590747/…

Tags: javascript reactjs regex

【Solution 1】:

If you want to get the matches without capturing:

/(?<=<\/h\d+>\s*)\S.*?(?=\s*<h\d|$)/gs

See proof

const text = `<h2>What is lorem impsum</h2>
Stack overflow is a great community for developers to seek help and connect the beautiful experience.

<h3>What is quora?</h3>
Quoora is good but doesn\'t provide any benefits to the person who\'s helping others economically. 
But its a nice place to be at.
another paragraph betwen these headings

<h3>Who is Kent C Dodds</h3>
One of the best guy to learn react with. He also has helped a lot of 
people with his kindness and his contents on the internet.`;
const regex = /(?<=<\/h\d+>\s*)\S.*?(?=\s*<h\d|$)/gs;
console.log(text.match(regex));

If you need a more efficient regex, use capturing:

const text = `<h2>What is lorem impsum</h2>
Stack overflow is a great community for developers to seek help and connect the beautiful experience.

<h3>What is quora?</h3>
Quoora is good but doesn\'t provide any benefits to the person who\'s helping others economically. 
But its a nice place to be at.
another paragraph betwen these headings

<h3>Who is Kent C Dodds</h3>
One of the best guy to learn react with. He also has helped a lot of 
people with his kindness and his contents on the internet.`;
const regex = /<\/h\d+>\s*([^<]*(?:<(?!h\d)[^<]*)*?)\s*(?:<h\d|$)/g;
console.log(Array.from(text.matchAll(regex), x => x[1].trim()));

Explanation of the second regex:

--------------------------------------------------------------------------------
  <                        '<'
--------------------------------------------------------------------------------
  \/                       '/'
--------------------------------------------------------------------------------
  h                        'h'
--------------------------------------------------------------------------------
  \d+                      digits (0-9) (1 or more times (matching
                           the most amount possible))
--------------------------------------------------------------------------------
  >                        '>'
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [^<]*                    any character except: '<' (0 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
--------------------------------------------------------------------------------
      <                        '<'
--------------------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
        h                        'h'
--------------------------------------------------------------------------------
        \d                       digits (0-9)
--------------------------------------------------------------------------------
      )                        end of look-ahead
--------------------------------------------------------------------------------
      [^<]*                    any character except: '<' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )*?                      end of grouping
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    <h                       '<h'
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    $                        the end of the string (in JavaScript,
                             without the m flag)
--------------------------------------------------------------------------------
  )                        end of grouping

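【Comments】:

Thanks. Could you explain how your second regex selects the text?
@ReyYoung I added an explanation. The second regex matches a closing h tag, then any run of text that is not <, plus any < that is not followed by h and a digit, up to the next opening h tag. This is the most efficient approach if the text is large. .*? looks shorter, but it is slower.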
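
As a rough illustration of the description above, the capture-based regex can be wrapped in a small helper and compared against the lookbehind version. This is only a sketch: `sectionsAfterHeadings`, `sample`, `viaCapture` and `viaLookbehind` are hypothetical names, and the sample includes a non-heading tag (`<b>`) to show that the `<(?!h\d)` branch lets such tags through.

```javascript
// Capture-based regex from the answer, wrapped in a helper (hypothetical name).
function sectionsAfterHeadings(html) {
  const regex = /<\/h\d+>\s*([^<]*(?:<(?!h\d)[^<]*)*?)\s*(?:<h\d|$)/g;
  // Group 1 holds the text after each closing heading tag.
  return Array.from(html.matchAll(regex), m => m[1].trim());
}

const sample = `<h2>First</h2>
alpha <b>bold</b> text

<h3>Second</h3>
beta`;

// Same input through both regexes from the answer.
const viaCapture = sectionsAfterHeadings(sample);
const viaLookbehind = sample.match(/(?<=<\/h\d+>\s*)\S.*?(?=\s*<h\d|$)/gs);

console.log(viaCapture);    // [ 'alpha <b>bold</b> text', 'beta' ]
console.log(viaLookbehind); // same two entries
```

The capture version also avoids lookbehind, which was only added to JavaScript in ES2018 and may be missing in older engines.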

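【Solution 2】:

Regex: /(<.>)/gm

This will select all your heading tags and the content between them. Use a boolean: if it is true, discard the selection.

If it is false, you get what you need.

【Comments】:

I don't want to select the heading tags.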

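【Solution 3】:

If you'd rather stay away from regex for parsing HTML, you can use nextSibling. Note that there are different kinds of nodes. Here I grab all nodes, including text nodes, because I think that's what you want. It can be adjusted to look only for element nodes, though.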

const op = []

const [h1, h2] = document.querySelectorAll("h1,h2")

let next = h1.nextSibling

while (next && next !== h2) {
  op.push(next.textContent)
  next = next.nextSibling
}

console.log(op)

<h1>start</h1>

The quick brown fox jumps over the lazy dog

<p> some paragraph as well </p>

<div> something <strong> nested <code>works</code> too </strong> :) </div>

<h2>next</h2>

more content we are not interested in...
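
The sibling walk above could also be packaged as a reusable function. The following is a minimal sketch, not part of the original answer: `collectBetween` is a hypothetical name, and the plain objects below are stand-ins for DOM nodes so the example runs outside a browser. It relies only on the `nextSibling` and `textContent` properties that real DOM nodes expose.

```javascript
// Hypothetical helper: collect the text of every sibling between two nodes.
// Works with anything exposing `nextSibling` and `textContent`,
// which includes real DOM nodes in a browser.
function collectBetween(start, end) {
  const parts = [];
  for (let node = start.nextSibling; node && node !== end; node = node.nextSibling) {
    parts.push(node.textContent);
  }
  return parts;
}

// Stand-in nodes (assumption: they mimic a heading, a text node, a paragraph, a heading).
const h2 = { textContent: "next", nextSibling: null };
const p = { textContent: " some paragraph as well ", nextSibling: h2 };
const text = { textContent: "The quick brown fox", nextSibling: p };
const h1 = { textContent: "start", nextSibling: text };

const between = collectBetween(h1, h2);
console.log(between); // [ 'The quick brown fox', ' some paragraph as well ' ]
```

In a browser, the same function could be called with two heading elements returned by `document.querySelectorAll`.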

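【Comments】:

This is nice too, but would it be faster than regex if we have a large amount of content?
I don't know whether it's faster, but I don't think it's a particularly slow solution. The only slow lookup is the search for the heading tags via querySelectorAll, which might be O(n^2)?! After that it just iterates over an array in O(n). I don't know exactly what textContent does, but I know it is much faster than innerHTML. As for the regex, I don't know exactly what happens under the hood.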

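【Solution 4】:

If you can select the headings themselves (rather than trying to select the text between them) and remove them from the whole string, keeping only the content between them, the complexity goes down. You can select just the headings with this expression:

/(<h.*(?:>))/gm

You can find it here (the regex only selects the headings; the removal has to be handled in code).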
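
The removal step mentioned above is left to code; one possible way to do it is to split the string on the heading lines. This is a sketch under assumptions: `stripHeadings` is a hypothetical name, each heading is assumed to sit alone on a single line, and no other tag starting with `<h` (such as `<hr>`) is assumed to appear. A non-capturing form of the expression is used because `split` would otherwise put the captured headings back into the result.

```javascript
// Hypothetical helper: drop heading lines and keep the text between them.
// Assumes every heading occupies exactly one line of its own.
function stripHeadings(html) {
  return html
    .split(/<h.*>/)                      // "." does not cross newlines, so this removes one heading line at a time
    .map(part => part.trim())            // tidy the whitespace around each section
    .filter(part => part.length > 0);    // drop the empty piece before the first heading
}

const sample = `<h2>First</h2>
alpha

<h3>Second</h3>
beta`;

const sections = stripHeadings(sample);
console.log(sections); // [ 'alpha', 'beta' ]
```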

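【Comments】:

You can use this expression in the code given in the previous answer.

【Solution 5】:

There are some incredible answers here, especially the DOM one, but if you need to pass in a string, you might consider mine as well.

Just pass in the string and it returns the desired array: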

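function GetContentBetweenHtags(HtmlString){
  // Content between a closing heading tag and the next opening heading tag
  const Regex = /<\/h\w>(.*?)<h\w>/msg
  // Content after the last closing heading tag, up to the end of the string
  const AfterTagRegex = /<\/h\w>([\s\w\.]*)$/
  const EndMatch = HtmlString.match(AfterTagRegex)
  let result, resultArr = []
  while((result = Regex.exec(HtmlString)) != null){
    resultArr.push(result[1].trim())
  }
  // match() returns null when nothing matches, so test for null before reading the group
  if(EndMatch !== null){
    resultArr.push(EndMatch[1].trim())
  }
  return resultArr
}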

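【Comments】:

This is good, but it doesn't select the last one... that is, it only selects between two heading tags. If there is a heading tag, its content should be selected even when no other heading follows. Your regex doesn't produce my expected result. Could you take a look?
You can easily cover that case by changing a few lines. Anyway, I've edited the answer.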
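
For the case discussed in these comments, where the last heading has no heading after it, one alternative is a single regex whose lookahead accepts either the next opening heading tag or the end of the string. This is a sketch, not the answer's edited code: `contentAfterEachHeading` is a hypothetical name, and it assumes heading tags carry no attributes.

```javascript
// Hypothetical variant: one regex that also covers the section after the last heading.
function contentAfterEachHeading(html) {
  // (.*?) runs lazily from a closing heading tag until the next opening
  // heading tag or, failing that, the end of the string.
  const regex = /<\/h[1-6]>(.*?)(?=<h[1-6]>|$)/gs;
  return Array.from(html.matchAll(regex), m => m[1].trim());
}

const sample = `<h2>First</h2>
alpha

<h3>Second</h3>
beta

<h3>Third</h3>
gamma`;

const parts = contentAfterEachHeading(sample);
console.log(parts); // [ 'alpha', 'beta', 'gamma' ]
```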