【Question title】: Regex: How can I select all the contents between two headings?
【Posted】: 2021-01-31 14:02:41
【Question】:

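I want to select the content between any two headings.

I wrote the regex below, but it doesn't select what I need. At the moment it selects the headings together with the paragraphs, and it does not select the content after the last heading.

Current regex: /^<h.*?(?:>)(.*?)(?=<\h)/gms

Given string: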

<h2>What is lorem impsum</h2>
Stack overflow is a great community for developers to seek help and connect the beautiful experience.

<h3>What is quora?</h3>
Quoora is good but doesn\'t provide any benefits to the person who\'s helping others economically. 
But its a nice place to be at.
another paragraph betwen these headings

<h3>Who is Kent C Dodds</h3>
One of the best guy to learn react with. He also has helped a lot of 
people with his kindness and his contents on the internet.

Expected result:

[
    'Stack overflow is a great community for developers to seek help and connect the beautiful 
    experience.',

    'Quoora is good but doesn't provide any benefits to the person who's helping others economically. 
    But it\'s a nice place to be at.
    another paragraph betwen these headings',

   'One of the best guy to learn react with. He also has helped a lot of 
    people with his kindness and his contents on the internet.'

]

【Comments】:

  • Avoid using regex to parse HTML.
  • The last item in your expected result is not between heading tags.
  • @anubhava Why is that? Can you elaborate?
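
As a rough illustration of the behaviour described in the question, the sketch below runs the current regex on a shortened input. The `sample` string is made up for this example. In JavaScript without the `u` flag, `\h` is treated as a literal `h`, so the lookahead `(?=<\h)` requires another opening heading after the captured text. The last section has none, and the capture group starts right after the first `>`, so it also swallows the heading text.

```javascript
const currentRegex = /^<h.*?(?:>)(.*?)(?=<\h)/gms;

// Shortened, made-up input with two headings.
const sample = `<h2>A</h2>
first

<h3>B</h3>
second`;

const captured = Array.from(sample.matchAll(currentRegex), (m) => m[1]);
console.log(captured);
// Only one match is found: the capture contains the heading text and its
// closing tag, and the section after the last heading is missing.
```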

Tags: javascript reactjs regex


【Solution 1】:

If you want to get the matches without capturing:

/(?<=<\/h\d+>\s*)\S.*?(?=\s*<h\d|$)/gs

proof

const text = `<h2>What is lorem impsum</h2>
Stack overflow is a great community for developers to seek help and connect the beautiful experience.

<h3>What is quora?</h3>
Quoora is good but doesn\'t provide any benefits to the person who\'s helping others economically. 
But its a nice place to be at.
another paragraph betwen these headings

<h3>Who is Kent C Dodds</h3>
One of the best guy to learn react with. He also has helped a lot of 
people with his kindness and his contents on the internet.`;
const regex = /(?<=<\/h\d+>\s*)\S.*?(?=\s*<h\d|$)/gs;
console.log(text.match(regex));

If you need a more efficient regex, use capturing:

const text = `<h2>What is lorem impsum</h2>
Stack overflow is a great community for developers to seek help and connect the beautiful experience.

<h3>What is quora?</h3>
Quoora is good but doesn\'t provide any benefits to the person who\'s helping others economically. 
But its a nice place to be at.
another paragraph betwen these headings

<h3>Who is Kent C Dodds</h3>
One of the best guy to learn react with. He also has helped a lot of 
people with his kindness and his contents on the internet.`;
const regex = /<\/h\d+>\s*([^<]*(?:<(?!h\d)[^<]*)*?)\s*(?:<h\d|$)/g;
console.log(Array.from(text.matchAll(regex), x => x[1].trim()));

Explanation of the second regex:

--------------------------------------------------------------------------------
  <                        '<'
--------------------------------------------------------------------------------
  \/                       '/'
--------------------------------------------------------------------------------
  h                        'h'
--------------------------------------------------------------------------------
  \d+                      digits (0-9) (1 or more times (matching
                           the most amount possible))
--------------------------------------------------------------------------------
  >                        '>'
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [^<]*                    any character except: '<' (0 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
--------------------------------------------------------------------------------
      <                        '<'
--------------------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
        h                        'h'
--------------------------------------------------------------------------------
        \d                       digits (0-9)
--------------------------------------------------------------------------------
      )                        end of look-ahead
--------------------------------------------------------------------------------
      [^<]*                    any character except: '<' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )*?                      end of grouping
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    <h                       '<h'
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    $                        before an optional \n, and the end of
                             the string
--------------------------------------------------------------------------------
  )                        end of grouping
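
To make the role of the `<(?!h\d)[^<]*` part concrete, here is a minimal sketch that runs the second regex on an input containing an inline tag between headings. The `sectionsBetweenHeadings` helper name and the `sample` string are assumptions made for this illustration and are not part of the original answer.

```javascript
// Hypothetical helper wrapping the capturing regex from the answer.
function sectionsBetweenHeadings(html) {
  const regex = /<\/h\d+>\s*([^<]*(?:<(?!h\d)[^<]*)*?)\s*(?:<h\d|$)/g;
  // Group 1 holds the text between a closing heading and the next heading (or the end).
  return Array.from(html.matchAll(regex), (m) => m[1].trim());
}

// Made-up input: the <b> tag is not a heading, so it stays inside the section.
const sample = `<h2>One</h2>
Plain text with <b>bold</b> inside.

<h3>Two</h3>
Last section, no heading after it.`;

console.log(sectionsBetweenHeadings(sample));
```

A `<` that is not followed by `h` and a digit is consumed as ordinary content, which is why the `<b>` element does not end the section early.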

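【Comments】:

  • Thanks. Could you explain in detail how your second regex selects the text?
  • @ReyYoung I added an explanation. The second regex matches the closing h tag, then matches any chunks of text that contain no <, or a < that is not followed by an opening h tag, up to the next opening h tag. If the text is large, this is the most efficient approach. .*? looks shorter, but it is slower.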
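【Solution 2】:

Regex: /(<.>)/gm

This will select all your heading tags and the content between them. Use a boolean: if a piece is a match, discard it.

If it is not a match, it is the content you need.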

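【Comments】:

  • I don't want to select the heading tags.
【Solution 3】:

If you would rather stay away from regex for parsing HTML, you can use nextSibling. Note that there are different kinds of nodes. Here I grab all nodes, including text nodes, because I think that is what you want. This can be adjusted to look only for element nodes, though.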

const op = []

const [h1, h2] = document.querySelectorAll("h1,h2")

let next = h1.nextSibling

while (next && next !== h2) {
  op.push(next.textContent)
  next = next.nextSibling
}

console.log(op)
<h1>start</h1>

The quick brown fox jumps over the lazy dog

<p> some paragraph as well </p>

<div> something <strong> nested <code>works</code> too </strong> :) </div>

<h2>next</h2>

more content we are not interested in...
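
The same sibling walk can be extended to collect one text block per heading instead of only the span between the first two. The following is a minimal sketch under that assumption. The `groupTextByHeading` name is hypothetical, and it relies only on the standard `nodeType`, `nodeName`, `nextSibling` and `textContent` properties.

```javascript
// Hypothetical helper: walk sibling nodes and start a new group at each heading.
function groupTextByHeading(firstNode) {
  const groups = [];
  let current = null; // stays null until the first heading is seen
  for (let node = firstNode; node; node = node.nextSibling) {
    // nodeType 1 is an element node; headings are H1 to H6.
    const isHeading = node.nodeType === 1 && /^h[1-6]$/i.test(node.nodeName);
    if (isHeading) {
      current = [];
      groups.push(current);
    } else if (current) {
      // Text nodes and non-heading elements both contribute their text.
      current.push(node.textContent);
    }
  }
  return groups.map((parts) => parts.join("").trim());
}

// In a browser this could be called as:
// groupTextByHeading(document.body.firstChild)
```

Content that appears before the first heading is skipped, and the last group runs to the end of the sibling list, which covers the case where no heading follows.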

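【Comments】:

  • This is nice too, but if we have a lot of content, will it be faster than the regex?
  • I don't know whether it is faster, but I don't think this is a particularly slow solution. The only slow lookup is searching for the heading tags via querySelectorAll, which might be O(n^2) ?!.. After that it just iterates over the siblings in O(n). I don't know exactly what textContent does, but I know it is much faster than innerHTML. As for the regex, I don't know what exactly happens under the hood.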
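【Solution 4】:

If you can select the headings themselves (rather than trying to select the text between them) and remove them from the whole string, leaving only the content between them, the complexity goes down. You can select just the headings with the expression:

/(<h.*(?:>))/gm

You can find it here (the regex only selects the headings; the removal has to be handled in code).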
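
One possible way to handle the removal step in code is `String.prototype.split`. This is a minimal sketch, not the answer's own code: the `splitOnHeadings` name and the `sample` string are hypothetical, and the heading pattern is a stricter, non-capturing variant of the expression above (a capturing group would make `split` include the headings in its output).

```javascript
// Hypothetical helper: cut the string at every heading and keep the rest.
function splitOnHeadings(html) {
  // Matches a whole heading element such as <h2 class="x">Title</h2>.
  const heading = /<h[1-6]\b[^>]*>[\s\S]*?<\/h[1-6]>/;
  return html
    .split(heading)
    .map((part) => part.trim())
    .filter((part) => part.length > 0); // drop the empty piece before the first heading
}

// Made-up input with three headings.
const sample = `<h2>A</h2>
first

<h3>B</h3>
second
line

<h3>C</h3>
third`;

console.log(splitOnHeadings(sample));
```

Because the string is split on the headings rather than matched between them, the content after the last heading is returned without any special handling.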

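【Comments】:

  • You can use this expression in the code given in the previous answer.
【Solution 5】:

There are some incredible answers here, especially the DOM answer, but if you need to pass in a string, you could also consider mine.

Just pass in the string and it returns the desired array.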

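function GetContentBetweenHtags(HtmlString){
  // Content between a closing heading tag and the next opening heading tag
  const Regex = /<\/h\w>(.*?)<h\w>/msg
  // Trailing content (whitespace, word characters and dots only) after the last closing heading tag
  const AfterTagRegex = /<\/h\w>([\s\w\.]*)$/
  const EndMatch = HtmlString.match(AfterTagRegex)
  let result, resultArr = []
  while((result = Regex.exec(HtmlString)) != null){
    resultArr.push(result[1].trim())
  }
  // match() returns null when nothing matches, so test for null rather than length
  if(EndMatch !== null){
    resultArr.push(EndMatch[1].trim())
  }
  return resultArr
}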

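【Comments】:

  • This is good, but it doesn't select the last one... i.e. it only selects between two heading tags. If there is a heading tag, the content after it should be selected even when no other heading follows. Your regex doesn't produce my expected result. Could you take a look?
  • You can easily cover that case by changing a few lines. Anyway, I have edited the answer.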