Regex replace text outside script tag: answers

【Question title】: Regex replace text outside script tag
【Posted】: 2017-08-29 08:15:20
【Question description】:

I have this HTML:

This is simple html text <script language="javascript">simple simple text text</script> text

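I need to match words only outside the script tag. That is, if I want to match "simple" and "text", I should get results only from "This is simple html text" and from the trailing "text": 1 match for "simple" and 2 matches for "text". Can anyone help me with this? I'm using PHP.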

I found a similar answer for matching text outside tags:

(text|simple)(?![^<]*>|[^<>]*</)

Regex replace text outside html tags

But I couldn't get it to work for one specific tag (script):

(text|simple)(?!(^<script*>)|[^<>]*</)
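To illustrate the limitation, here is a minimal sketch of what the generic lookahead pattern does. The sample string and the `~` delimiters are assumptions made for this example. The pattern leaves alone any word that is followed by a closing tag, so a word inside `<b>…</b>` is protected just like a word inside the script:

```php
<?php
// Generic "outside any tag" pattern from the linked answer, wrapped in ~ delimiters.
$pattern = '~(text|simple)(?![^<]*>|[^<>]*</)~';

// Hypothetical sample: one word inside <b>, two inside <script>, two outside any element.
$html = 'simple <b>text</b> <script>simple text</script> text';

// Only words that are not followed by a closing tag get replaced.
$result = preg_replace($pattern, '***replaced***', $html);

echo $result, PHP_EOL;
```

Here the `text` inside `<b>` stays untouched even though it is outside the script, which is why this pattern cannot single out one tag.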

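P.S.: This question is not a duplicate (strip_tags, remove javascript), because I'm not trying to strip tags or to select the content inside the script tag. I'm trying to replace content outside the "script" tag.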

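【Question discussion】:

Do you strictly need the full match, or would a capture group be fine?
When you want to parse html with confidence, use an html parser rather than regex. SO says this over and over. IIRC there is even a notice that the SO software pops up saying "don't use regex to parse html".
@mickmackusa, but parsers stop parsing when they hit malformed html. I don't think this question is a duplicate, because I'm not trying to strip tags; I'm trying to replace content outside the "script" tag.
Duplicate link retracted; it is merely related.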

Tags: php html regex preg-replace

【Solution 1】:

My pattern uses (*SKIP)(*FAIL) to disqualify matched script tags and their contents.

text and simple will be matched at every qualifying occurrence.

Regex pattern: ~<script.*?/script>(*SKIP)(*FAIL)|text|simple~

Pattern / Replacement Demo Link

Code: (Demo)

$strings=['This has no replacements',
    'This simple text has no script tag',
    'This simple text ends with a script tag <script language="javascript">simple simple text text</script>',
    'This is simple html text is split by a script tag <script language="javascript">simple simple text text</script> text',
    '<script language="javascript">simple simple text text</script> this text starts with a script tag'
];

$strings=preg_replace('~<script.*?/script>(*SKIP)(*FAIL)|text|simple~','***replaced***',$strings);

var_export($strings);

Output:

array (
  0 => 'This has no replacements',
  1 => 'This ***replaced*** ***replaced*** has no script tag',
  2 => 'This ***replaced*** ***replaced*** ends with a script tag <script language="javascript">simple simple text text</script>',
  3 => 'This is ***replaced*** html ***replaced*** is split by a script tag <script language="javascript">simple simple text text</script> ***replaced***',
  4 => '<script language="javascript">simple simple text text</script> this ***replaced*** starts with a script tag',
)
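The question also asks for match counts, so here is a minimal sketch that reuses the same skip/fail idea with `preg_match_all`. The multi-line sample string is hypothetical, and the `s` flag (so that `.` also crosses newlines inside the script) is an assumption added here, not part of the pattern above:

```php
<?php
// Hypothetical input: the question's sample, with the script body on its own lines.
$html = "This is simple html text <script language=\"javascript\">\n"
      . "simple simple text text\n"
      . "</script> text";

// Same skip/fail idea as above; the `s` flag is an addition so that
// `.*?` can span the newlines inside the script element.
$pattern = '~<script.*?/script>(*SKIP)(*FAIL)|text|simple~s';

// Collect every qualifying match, then count occurrences per word.
preg_match_all($pattern, $html, $matches);
$counts = array_count_values($matches[0]);

print_r($counts);
```

This should report one match for simple and two for text, the counts the question asks for. Without the `s` flag the script alternative cannot span the line breaks, so the words inside the script would be matched too.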

【Discussion】:

【Solution 2】:

If you are sure script will be present, simply match

(.*?)<script.*</script>(.*)

The text outside the tag will be in submatches 1 and 2. If script is optional, use (.*?)(<script.*</script>)?(.*).
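A minimal sketch of how this split could be used for the replacement, assuming the script is present. The `~` delimiters, the `s` flag, and the extra capture group around the script element (added so the string can be reassembled) are assumptions of this example, so the trailing text lands in group 3 here rather than group 2:

```php
<?php
$html = 'This is simple html text <script language="javascript">simple simple text text</script> text';

// Answer's pattern with one change: the script element is captured as group 2
// so the pieces can be glued back together after the replacement.
$pattern = '~(.*?)(<script.*</script>)(.*)~s';

// Helper (hypothetical name) that replaces the target words in one piece of text.
$replaceWords = function (string $part): string {
    return preg_replace('~text|simple~', '***replaced***', $part);
};

$result = $html; // fall back to the input when there is no script element
if (preg_match($pattern, $html, $m)) {
    // Replace only in the text before and after the script; keep the script as-is.
    $result = $replaceWords($m[1]) . $m[2] . $replaceWords($m[3]);
}

echo $result, PHP_EOL;
```

Because `.*` is greedy, the middle group runs from the first `<script` to the last `</script>`, so text sitting between two script elements would be left untouched by this approach.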

【Discussion】:

【Solution 3】:

Here is another solution

([\w\s]*)(?:<script.*?\/script>)(.*)$

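Here is a demo at https://regex101.com/r/1Lthi8/1

【Discussion】:

I'm trying to replace strings outside the tag.
Yes, that is captured in group 1, as regex101 highlights This is simple html text
Match 2 is inside the tag, and the last word "text" is not selected. In the end, this tries to ignore all tags, not the specific "script" tag.
Ha.. I see the problem... I missed the second piece of text. I've updated my answer and the regex demo. Let me know if you still have issues/questions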

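【Solution 4】:

FYI, as far as tags go, it is impossible to ignore a single tag without parsing all tags.

You can skip/fail past html tags and invisible content.
This will find the words you are looking for.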

https://regex101.com/r/7ZGlvW/1

Formatted

    <
    (?:
         (?:
              (?:
                                                 # Invisible content; end tag req'd
                   (                             # (1 start)
                        script
                     |  style
                     |  object
                     |  embed
                     |  applet
                     |  noframes
                     |  noscript
                     |  noembed 
                   )                             # (1 end)
                   (?:
                        \s+ 
                        (?>
                             " [\S\s]*? "
                          |  ' [\S\s]*? '
                          |  (?:
                                  (?! /> )
                                  [^>] 
                             )?
                        )+
                   )?
                   \s* >
              )

              [\S\s]*? </ \1 \s* 
              (?= > )
         )

      |  (?: /? [\w:]+ \s* /? )
      |  (?:
              [\w:]+ 
              \s+ 
              (?:
                   " [\S\s]*? " 
                |  ' [\S\s]*? ' 
                |  [^>]? 
              )+
              \s* /?
         )
      |  \? [\S\s]*? \?
      |  (?:
              !
              (?:
                   (?: DOCTYPE [\S\s]*? )
                |  (?: \[CDATA\[ [\S\s]*? \]\] )
                |  (?: -- [\S\s]*? -- )
                |  (?: ATTLIST [\S\s]*? )
                |  (?: ENTITY [\S\s]*? )
                |  (?: ELEMENT [\S\s]*? )
              )
         )
    )
    >
    (*SKIP)
    (?!)
 |  
    (?: text | simple )

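Or, a faster way is to match both the tags and the text you are looking for.

Matching the tags moves past them.

If you are doing a replacement, use a callback to decide what to replace.
Group 1 is a TAG or Invisible Content run.
Group 3 is the word you want to replace.

So, in the callback, if group 1 matched, just return group 1.
If group 3 matched, replace it with whatever you want.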
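The callback logic described above could look roughly like the following sketch. The pattern here is a much simplified stand-in written for this example, not the linked regex: it uses two named groups (`skip` and `word`, both hypothetical names) instead of groups 1 and 3, and it treats only script elements as invisible content:

```php
<?php
// Simplified stand-in pattern (an assumption, not the linked regex):
//   skip : a whole <script>...</script> element, or any single tag
//   word : one of the words to replace
$pattern = '~(?<skip><script\b[^>]*>.*?</script\s*>|<[^>]*>)|(?<word>text|simple)~si';

// Hypothetical sample with a target word inside an attribute and inside a script.
$html = '<p class="simple">This is simple html text</p> '
      . '<script>var text = "simple";</script> text';

$result = preg_replace_callback($pattern, function (array $m): string {
    // A non-empty word group means a target word outside tags: replace it.
    // Otherwise a tag or script run matched: return it unchanged to move past it.
    return (isset($m['word']) && $m['word'] !== '') ? '***replaced***' : $m[0];
}, $html);

echo $result, PHP_EOL;
```

Since the tags are consumed as ordinary matches, the regex engine moves past them without backtracking control verbs, and the `simple` inside the class attribute is left alone along with the script body.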

Regex

https://regex101.com/r/7ZGlvW/2

This regex is similar to the way SAX and DOM parsers parse tags.
I have posted it on SO hundreds of times.

Here is an example of how to strip all html tags:

https://regex101.com/r/oCVkZv/1

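【Discussion】:

This regex works fine but uses a lot of memory, which causes errors: Firefox: The connection was reset. Chrome: (net::ERR_CONNECTION_RESET): The connection was reset. IE: Internet Explorer cannot display the webpage
@PauloACosta - I see you have accepted the skip/fail answer I originally posted. However, as I said, it is impossible to ignore a single tag without parsing all tags. And using skip/fail in my regex will be slower. The MEMORY problem you are getting does not come from the regex. Otherwise, for speed, I say do not use skip/fail; instead use my later regex, which matches both the tags and the text you need. You made the wrong choice of answer. Too bad...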