【Question Title】: BeautifulSoup: finding nested tag
【Posted】: 2020-12-04 13:02:52
【Question】:

I'm stuck on this:

<span>Alpha<span class="class_xyz">Beta</span></span>

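I'm trying to grab only the text of the first span, "Alpha", without the text of the second, nested span, "Beta". How would you do it?

I'm trying to write a function that finds all span tags that have no class attribute, but something isn't working...

Thanks.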

【Comments】:

    Tags: python beautifulsoup nested tags screen-scraping


    【Answer 1】:

    Here is another way to get the text of every span tag that has no class attribute:

    from bs4 import BeautifulSoup
    
    html = """
    <body>
    <p>Some random text</p>
    <span>Alpha<span class="class_xyz">Beta</span></span>
    <span>Gamma<span class="class_abc">Delta</span></span>
    <span>Epsilon<span class="class_lmn">Zeta</span></span>
    </body>
    """
    
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(html, "html.parser")
    
    # Remove every span that carries a class attribute (the nested ones)
    for tag in soup.select("span[class]"):
        tag.decompose()
    
    # The remaining spans now hold only their own text
    out = [span.text.strip() for span in soup.select("span")]
    
    print(out)
    

    Output:

    ['Alpha', 'Gamma', 'Epsilon']
    

    Or, if you want the whole span tags:

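    from bs4 import BeautifulSoup
    
    html = """
    <body>
    <p>Some random text</p>
    <span>Alpha<span class="class_xyz">Beta</span></span>
    <span>Gamma<span class="class_abc">Delta</span></span>
    <span>Epsilon<span class="class_lmn">Zeta</span></span>
    </body>
    """
    
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(html, "html.parser")
    
    # Remove every span that carries a class attribute (the nested ones)
    for tag in soup.select("span[class]"):
        tag.decompose()
    
    # What is left are the outer span tags, with the nested spans stripped out
    out = soup.select("span")
    
    print(out)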
    

    Output:

    [<span>Alpha</span>, <span>Gamma</span>, <span>Epsilon</span>]
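Both snippets above modify the parsed tree by calling `decompose()`. If the tree needs to stay intact, one possible alternative is to collect only the text nodes that are direct children of each class-less span. The following is a minimal sketch under that assumption; the function name `classless_span_texts` is hypothetical and not part of the original answer.

```python
from bs4 import BeautifulSoup, NavigableString


def classless_span_texts(html):
    """Return the direct text of every <span> that has no class attribute."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for span in soup.find_all("span"):
        if span.has_attr("class"):
            continue  # skip spans that carry a class attribute
        # Join only the text nodes that are direct children of this span,
        # so text inside nested tags is left out.
        own_text = "".join(
            child for child in span.children if isinstance(child, NavigableString)
        )
        texts.append(own_text.strip())
    return texts


html = """
<body>
<p>Some random text</p>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma<span class="class_abc">Delta</span></span>
<span>Epsilon<span class="class_lmn">Zeta</span></span>
</body>
"""

print(classless_span_texts(html))  # ['Alpha', 'Gamma', 'Epsilon']
```

Because nothing is removed, the nested spans remain available in the soup for any later lookups.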
    

    【Comments】:

      【Answer 2】:

      One way to handle it:

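      from bs4 import BeautifulSoup as bs
      
      txt = """<doc>
      <span>Alpha<span class="class_xyz">Beta</span></span>
      </doc>"""
      
      # The 'lxml' parser requires the lxml package to be installed
      soup = bs(txt, 'lxml')
      
      # select_one returns only the first span that has a class attribute
      target = soup.select_one('span[class]')
      target.decompose()
      
      # In an interactive session this displays the result; in a script, wrap it in print()
      soup.text.strip()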
      

      Output:

      'Alpha'
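Note that `select_one` returns a single element, so the snippet above removes only the first classed span, and it returns `None` when nothing matches, in which case `target.decompose()` raises an `AttributeError`. A possible variation, sketched here with the `:not([class])` CSS selector and with an extra `<span>Gamma</span>` line added purely for illustration, reads the first direct text node of each class-less span without changing the tree:

```python
from bs4 import BeautifulSoup

# Sample markup; the second span is an illustrative addition, not from the answer
txt = """<doc>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma</span>
</doc>"""

soup = BeautifulSoup(txt, "html.parser")

first_texts = []
for span in soup.select("span:not([class])"):
    # string=True with recursive=False matches only direct text children
    node = span.find(string=True, recursive=False)
    if node is not None:
        first_texts.append(node.strip())

print(first_texts)  # ['Alpha', 'Gamma']
```

This variation uses `html.parser` from the standard library, so it does not depend on lxml being installed.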
      

      【Comments】:
