【Question title】: Extracting tags with multivalued attributes
【Posted】: 2016-10-06 18:36:20
【Question】:

I am trying the following code:

import re
from bs4 import BeautifulSoup
htmlsource1 = """<div class="small-12 columns ">
                    <h5 class="clsname1 large-text seq2">text1</h5>
                    <h5 class="clsname1 small-text seq1">text2</h5>
                    <h5 class="clsname1 seq1 small-text clsname2">text3</h5>
                 </div>"""
soup = BeautifulSoup(htmlsource1, "html.parser")
interesting_h5s = soup.find_all('h5', class_=re.compile('^(?=.*\bsmall-text\b)(?=.*\bseq1\b).*$'))
for h5 in interesting_h5s:
    print h5

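My goal is to extract the h5 tags whose class attribute contains both "small-text" and "seq1" (in any order). For some reason this does not work, even though the regex tested positive at http://pythex.org.

For the regex, I adapted the answer given in "Regex to match string containing two names in any order".

Thanks for any suggestions.

【Comments】:

Tags: regex python-2.7 bs4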


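【Solution 1】:

Foreword

You really should use an HTML parsing tool, but you appear to have creative control over your HTML, so the possible edge cases will be limited.

Description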

<h5(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sclass=['"](?=[^"]*\bsmall-text\b)(?=[^"]*\bseq1\b)([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>(.*?)</h5>

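This regular expression will do the following:

  • Extract the h5 tags whose class contains both "small-text" and "seq1" (in any order)
  • Avoid some difficult edge cases, such as an attribute value that itself contains class="..."

Example

Live demo

https://regex101.com/r/fR0mT7/2

Sample text

Note the difficult edge cases in the last two h5 tags.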

<div class="small-12 columns ">
   <h5 class="clsname1 large-text seq2">text1</h5>
   <h5 class="clsname1 small-text seq1">text2</h5>
   <h5 class="clsname1 seq1 small-text clsname2">text3</h5>
   <h5 onmouseover=' class="small-text seq1" ; ' class="clsname1 large-text seq2">text4</h5>
   <h5 onmouseover=' class="small-text seq1" ; ' class="clsname1 small-text seq1">text5</h5>
   </div>

Sample matches

  • Capture group 0 gets the entire h5 tag
  • Capture group 1 gets the entire value of the class attribute
  • Capture group 2 gets the inner text of the h5 tag
[0][0] = <h5 class="clsname1 small-text seq1">text2</h5>
[0][1] = clsname1 small-text seq1
[0][2] = text2

[1][0] = <h5 class="clsname1 seq1 small-text clsname2">text3</h5>
[1][1] = clsname1 seq1 small-text clsname2
[1][2] = text3

[2][0] = <h5 onmouseover=' class="small-text seq1" ; ' class="clsname1 small-text seq1">text5</h5>
[2][1] = clsname1 small-text seq1
[2][2] = text5
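
The matches above can be reproduced with Python's standard `re` module. The following is a minimal sketch, not part of the original answer: the names `H5_PATTERN`, `SAMPLE`, and `matches` are made up for illustration, and the pattern is written as a raw triple-quoted string so that the backslashes and both kinds of quote reach the regex engine unchanged.

```python
import re

# The answer's pattern, copied verbatim into a raw string.
H5_PATTERN = re.compile(
    r"""<h5(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sclass=['"](?=[^"]*\bsmall-text\b)(?=[^"]*\bseq1\b)([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>(.*?)</h5>"""
)

# The sample text from the answer, including the two edge-case tags.
SAMPLE = """<div class="small-12 columns ">
   <h5 class="clsname1 large-text seq2">text1</h5>
   <h5 class="clsname1 small-text seq1">text2</h5>
   <h5 class="clsname1 seq1 small-text clsname2">text3</h5>
   <h5 onmouseover=' class="small-text seq1" ; ' class="clsname1 large-text seq2">text4</h5>
   <h5 onmouseover=' class="small-text seq1" ; ' class="clsname1 small-text seq1">text5</h5>
   </div>"""

# Collect (whole tag, class value, inner text) for every match.
matches = [(m.group(0), m.group(1), m.group(2)) for m in H5_PATTERN.finditer(SAMPLE)]

for whole_tag, class_value, inner_text in matches:
    print(inner_text, "|", class_value)
```

text4 is skipped because the `class="small-text seq1"` it contains sits inside the quoted `onmouseover` value, which the pattern steps over as a single unit.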

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  <h5                      '<h5'
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
    class=                   'class='
----------------------------------------------------------------------
    ['"]                     any character of: ''', '"'
----------------------------------------------------------------------
    (?=                      look ahead to see if there is:
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      \b                       the boundary between a word char (\w)
                               and something that is not a word char
----------------------------------------------------------------------
      small-text               'small-text'
----------------------------------------------------------------------
      \b                       the boundary between a word char (\w)
                               and something that is not a word char
----------------------------------------------------------------------
    )                        end of look-ahead
----------------------------------------------------------------------
    (?=                      look ahead to see if there is:
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      \b                       the boundary between a word char (\w)
                               and something that is not a word char
----------------------------------------------------------------------
      seq1                     'seq1'
----------------------------------------------------------------------
      \b                       the boundary between a word char (\w)
                               and something that is not a word char
----------------------------------------------------------------------
    )                        end of look-ahead
----------------------------------------------------------------------
    (                        group and capture to \1:
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )                        end of \1
----------------------------------------------------------------------
    ['"]?                    any character of: ''', '"' (optional
                             (matching the most amount possible))
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ='                       '=\''
----------------------------------------------------------------------
    [^']*                    any character except: ''' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ="                       '="'
----------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    =                        '='
----------------------------------------------------------------------
    [^'"\s]*                 any character except: ''', '"',
                             whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  "                        '"'
----------------------------------------------------------------------
  \s?                      whitespace (\n, \r, \t, \f, and " ")
                           (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
  \/?                      '/' (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
  >                        '>'
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  </h5>                    '</h5>'
----------------------------------------------------------------------
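
The two consecutive look-aheads in the table are what make the match independent of class order. Here is a small sketch that isolates that idea on plain class strings, using only the standard library. `BOTH_CLASSES`, `NOT_RAW`, and `has_both` are illustrative names, not something from the thread. It also shows one possible pitfall when moving a pattern from an online tester into Python source: without the `r` prefix, Python turns `\b` into a backspace character before the regex engine ever sees it.

```python
import re

# Raw string, so \b reaches the regex engine as a word boundary.
BOTH_CLASSES = re.compile(r"^(?=.*\bsmall-text\b)(?=.*\bseq1\b)")

# Same text without the r prefix: each \b becomes the backspace character \x08.
NOT_RAW = re.compile("^(?=.*\bsmall-text\b)(?=.*\bseq1\b)")


def has_both(class_value):
    """Return True when both class names occur somewhere in class_value."""
    return BOTH_CLASSES.search(class_value) is not None


print(has_both("clsname1 small-text seq1"))           # True
print(has_both("clsname1 seq1 small-text clsname2"))  # True
print(has_both("clsname1 large-text seq2"))           # False
# Caveat: "-" is not a word character, so \b also accepts longer hyphenated names.
print(has_both("small-text-xl seq1"))                 # True
print(NOT_RAW.search("clsname1 small-text seq1"))     # None
```

The hyphen caveat means `\b` is only an approximation of "whole class name"; splitting the value on whitespace and comparing sets is stricter.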

【Comments】:

    【Solution 2】:

    Following the article "Disable special "class" attribute handling", the problem was solved by adding the following lines of code:

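    from bs4 import BeautifulSoup
    from bs4.builder import HTMLParserTreeBuilder
    
    # Stop treating "class" as a multi-valued attribute so it stays a single string.
    bb = HTMLParserTreeBuilder()
    bb.cdata_list_attributes["*"].remove("class")
    
    # bs holds the HTML source string (htmlsource1 in the question).
    soup = BeautifulSoup(bs, "html.parser", builder=bb)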
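
For comparison, here is a sketch of two other ways to get the same tags with Beautiful Soup. Neither comes from the answers in this thread. Option 1 uses a CSS selector, which requires every listed class regardless of order. Option 2 assumes a bs4 release recent enough to accept the `multi_valued_attributes` constructor argument, which leaves `class` as one string without modifying a tree builder; it pairs that with a raw-string version of the question's regex. The variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

HTML = """<div class="small-12 columns ">
    <h5 class="clsname1 large-text seq2">text1</h5>
    <h5 class="clsname1 small-text seq1">text2</h5>
    <h5 class="clsname1 seq1 small-text clsname2">text3</h5>
</div>"""

# Option 1: CSS selector; each ".name" must be present, in any order.
soup = BeautifulSoup(HTML, "html.parser")
by_selector = [h5.get_text() for h5 in soup.select("h5.small-text.seq1")]

# Option 2: keep class as a single string, then match it with a regex.
# Assumption: the installed bs4 supports multi_valued_attributes.
plain_soup = BeautifulSoup(HTML, "html.parser", multi_valued_attributes=None)
both = re.compile(r"^(?=.*\bsmall-text\b)(?=.*\bseq1\b)")
by_regex = [h5.get_text() for h5 in plain_soup.find_all("h5", class_=both)]

print(by_selector)
print(by_regex)
```

The selector form is usually the simpler choice when only class membership matters; the regex form is closer to what the question attempted.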
    

    【Comments】:
