【Question title】: RegEx to grab <script> tag
【Posted】: 2021-02-19 14:16:10
【Question description】:

I am trying to target the entire script tag whose content contains "@type": "NewsArticle".

Something like:

<script type="application\/ld\+json">[^\{]*?{(.*?)\}[^\}]*?<\/script>

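The regex above lets me target the topmost script tag. What I actually need is the NewsArticle JSON, which is the second block in this case. Some pages have 4+ application/ld+json tags, but "@type": "NewsArticle" is always present on every page. So I am looking for a regex that targets that specific script.

Thanks for your help.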


<script type="application/ld+json">
{
    "@context": "http://schema.org",
    "@type": "Organization",
    "@id": "https://www.givemesport.com/#gms",
    "name": "GiveMeSport",
    "url": "https://www.givemesport.com",
    "logo": {
        "@type": "ImageObject",
        "url": "https://gmsrp.cachefly.net/v4/images/logo-gms-black.png"
    },
    "sameAs":[
        "https://www.facebook.com/GiveMeSport",
        "https://www.instagram.com/givemesport",
        "https://twitter.com/GiveMeSport",
        "https://www.youtube.com/user/GiveMeSport"
    ]
}
</script>
    <script type="application/ld+json">
    {
    "@context": "http://schema.org",
    "@type": "NewsArticle",
    "mainEntityOfPage": "https://www.givemesport.com/1612447-man-uniteds-scott-mctominay-delighted-fans-with-reaction-after-third-goal-vs-rb-leipzig",
    "url": "https://www.givemesport.com/1612447-man-uniteds-scott-mctominay-delighted-fans-with-reaction-after-third-goal-vs-rb-leipzig",
    "headline": "Man United's Scott McTominay delighted fans with reaction after third goal vs RB Leipzig",
    "datePublished": "2020-10-30T21:52:48.3510000Z",
    "dateModified": "2020-10-30T21:52:48.3510000Z",
    "description": "Man United's Scott McTominay delighted fans with reaction after third goal vs RB Leipzig",
    "articleSection": "Football",
    "keywords": ["Football","Manchester United","Marcus Rashford","RB Leipzig","Scott McTominay","UEFA Champions"],
    "creator": ["Scott Wilson"],
    "thumbnailUrl": "https://gmsrp.cachefly.net/images/20/10/30/03a426c8204af5c8d02282afaeed6189/144.jpg",
    "author": {
    "@type": "Person",
    "name": "Scott Wilson",
    "sameAs": "https://www.givemesport.com/scott-wilson-1"
    },
    "publisher": {
    "@id": "https://www.givemesport.com/#gms"
    },
    "image": {
    "@type": "ImageObject",
    "url": "https://gmsrp.cachefly.net/images/20/10/30/03a426c8204af5c8d02282afaeed6189/960.jpg",
    "height": 620,
    "width": 960
    }
    }
</script>

【Question comments】:

Tags: javascript json regex


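【Solution 1】:

Sorry to hear you do not want to follow best practices; parsing HTML with regex is fraught with problems. However, if you want a quick and dirty workaround, use: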

<script type="application\/ld\+json">((?:(?!<\/?script)[\w\W])*?"@type":\s*"NewsArticle"[\w\W]*?)<\/script>

proof
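
As a minimal sketch of how the pattern might be used from JavaScript: the snippet below applies it to a shortened stand-in for the page in the question and parses the captured group with `JSON.parse`. The `html` sample and the helper name `extractNewsArticle` are assumptions made for illustration; they are not part of the original answer.

```javascript
// Pattern from the answer above, written as a JavaScript regex literal.
const newsArticleScript =
  /<script type="application\/ld\+json">((?:(?!<\/?script)[\w\W])*?"@type":\s*"NewsArticle"[\w\W]*?)<\/script>/;

// Shortened stand-in for the page in the question (assumed sample data).
const html = `
<script type="application/ld+json">
{ "@context": "http://schema.org", "@type": "Organization", "name": "GiveMeSport" }
</script>
    <script type="application/ld+json">
    { "@context": "http://schema.org", "@type": "NewsArticle", "articleSection": "Football" }
</script>`;

// Hypothetical helper: returns the parsed NewsArticle object,
// or null when no matching script block is found.
function extractNewsArticle(source) {
  const match = newsArticleScript.exec(source);
  return match ? JSON.parse(match[1]) : null;
}

const article = extractNewsArticle(html);
console.log(article["@type"], article.articleSection); // NewsArticle Football
```

Group 1 holds only the body of the script tag, so it can be handed to `JSON.parse` directly; the surrounding whitespace is valid JSON padding. If the block contained malformed JSON, `JSON.parse` would throw, so real code would likely wrap the call in `try`/`catch`.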

Explanation

--------------------------------------------------------------------------------
  <script                  '<script type="application'
  type="application
--------------------------------------------------------------------------------
  \/                       '/'
--------------------------------------------------------------------------------
  ld                       'ld'
--------------------------------------------------------------------------------
  \+                       '+'
--------------------------------------------------------------------------------
  json">                   'json">'
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
--------------------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
        <                        '<'
--------------------------------------------------------------------------------
        \/?                      '/' (optional (matching the most
                                 amount possible))
--------------------------------------------------------------------------------
        script                   'script'
--------------------------------------------------------------------------------
      )                        end of look-ahead
--------------------------------------------------------------------------------
      [\w\W]                   any character of: word characters (a-
                               z, A-Z, 0-9, _), non-word characters
                               (all but a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
    )*?                      end of grouping
--------------------------------------------------------------------------------
    "@type":                 '"@type":'
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    "NewsArticle"            '"NewsArticle"'
--------------------------------------------------------------------------------
    [\w\W]*?                 any character of: word characters (a-z,
                             A-Z, 0-9, _), non-word characters (all
                             but a-z, A-Z, 0-9, _) (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  <                        '<'
--------------------------------------------------------------------------------
  \/                       '/'
--------------------------------------------------------------------------------
  script>                  'script>'
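
To illustrate why the `(?!<\/?script)` look-ahead in the breakdown above matters, the sketch below compares the pattern with a simplified variant that drops it. The two-block `sample` string and the variable names are assumptions made up for this comparison.

```javascript
// Two minimal ld+json blocks (assumed sample data).
const sample = [
  '<script type="application/ld+json">{"@type": "Organization"}</script>',
  '<script type="application/ld+json">{"@type": "NewsArticle"}</script>',
].join("\n");

// Without the look-ahead: the lazy [\w\W]*? is still allowed to run past
// the first </script>, so the match starts at the first block.
const naive =
  /<script type="application\/ld\+json">([\w\W]*?"@type":\s*"NewsArticle"[\w\W]*?)<\/script>/;

// With the look-ahead: the text before "@type" may not contain
// "<script" or "</script", so the match stays inside a single block.
const tempered =
  /<script type="application\/ld\+json">((?:(?!<\/?script)[\w\W])*?"@type":\s*"NewsArticle"[\w\W]*?)<\/script>/;

const naiveBody = sample.match(naive)[1];
const temperedBody = sample.match(tempered)[1];

console.log(naiveBody.includes("Organization"));    // true
console.log(temperedBody.includes("Organization")); // false
console.log(temperedBody);                          // {"@type": "NewsArticle"}
```

Laziness alone only makes the engine prefer the shortest match from a given starting point; it does not stop the match from spanning several script tags when the first tag lacks the NewsArticle marker.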

【Comments】:
