【Question title】: Python: How to scrape the string 7.7872 in <span class='pos'><span class='arr_ud arrow_u5'> </span>&nbsp;7.7872</span>
【Posted】: 2021-09-20 18:41:33
【Question】:

I am trying to scrape the following line and extract the value 7.7872. How can I make this work?

<span class='pos'><span class='arr_ud arrow_u5'> </span>&nbsp;7.7872</span>

I tried the following code, but it leaves some whitespace strings that I can't get rid of:

for a in soupUSD.find_all("span", attrs={"class":"pos"})[0]:
    print(a)

This gives the following result:

<span class='arr_ud arrow_u5'> </span>&nbsp;7.7872

Is there a way to get only the text 7.7872?

【Comments】:

    Tags: python python-3.x web-scraping beautifulsoup


    【Solution 1】:
    from bs4 import BeautifulSoup
    
    spam = "<span class='pos'><span class='arr_ud arrow_u5'> </span>&nbsp;7.7872</span>"
    soup = BeautifulSoup(spam, 'html.parser')
    span = soup.find('span', {'class':'pos'})
    print(' '.join(span.stripped_strings))
    

    Output

    7.7872
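The snippet above relies on `stripped_strings`, which yields each text node under the tag with surrounding whitespace removed and skips nodes that are empty after stripping. As a minimal sketch, assuming the same markup and the built-in `html.parser` backend, the generator can be turned into a list and the value converted to a number (the names `parts` and `value` are only illustrative):

```python
from bs4 import BeautifulSoup

html = "<span class='pos'><span class='arr_ud arrow_u5'> </span>&nbsp;7.7872</span>"
span = BeautifulSoup(html, "html.parser").find("span", class_="pos")

# stripped_strings is a generator, so materialize it to inspect the pieces.
# The inner span holds only a space, so it contributes nothing.
parts = list(span.stripped_strings)
print(parts)

# Assumption: the span always holds exactly one numeric text node.
value = float(parts[0])
print(value)
```

If the span could contain more than one text node, joining the parts as in the answer is the safer choice.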
    

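    【Comments】:

    • I solved it with this line instead (yours works too): for a in soupUSD.find_all("span", attrs={"class":"pos"})[0].stripped_strings:
    • Well, that yields a generator: <generator object Tag.stripped_strings at 0x7f25ea928c00>, but if it makes you happy... Converted to, e.g., a list, it gives ['7.7872'].
    【Solution 2】: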

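    Because there is another tag at the same level as the target string, the .string attribute will not return the string in this case. Instead, you can iterate over the tag's contents, pick out the strings (instances of NavigableString), and convert them to str.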

    from bs4 import BeautifulSoup, NavigableString
    
    spam = "<span class='pos'><span class='arr_ud arrow_u5'> </span>&nbsp;7.7872</span>"
    soup = BeautifulSoup(spam, 'lxml')
    span = soup.find('span', class_='pos')
    
    nr = ''.join([str(string).strip() for string in span.contents if isinstance(string, NavigableString)])
    
    print(nr)
    # 7.7872
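To see why `.string` does not help here, it can be useful to inspect the outer span directly. The following is a small sketch under the assumption that the markup is exactly the one from the question and that the built-in `html.parser` backend is used; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup, NavigableString

html = "<span class='pos'><span class='arr_ud arrow_u5'> </span>&nbsp;7.7872</span>"
outer = BeautifulSoup(html, "html.parser").find("span", class_="pos")

# .string is only set when a tag has exactly one child; here there are two.
print(outer.string)

# The direct children: one Tag (the arrow span) and one NavigableString.
for child in outer.contents:
    print(type(child).__name__, repr(child))

# Keep only the text that sits directly inside the outer span.
direct_text = "".join(
    str(child).strip()
    for child in outer.contents
    if isinstance(child, NavigableString)
)
print(direct_text)
```

When text from nested tags is also acceptable, `outer.get_text(strip=True)` is a shorter alternative, since it strips every text node and drops the empty ones.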
    

    【Comments】:

      【Solution 3】:

      Using a core Python library (ElementTree)

      import xml.etree.ElementTree as ET
      
      
      dtd = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
                  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
                  <!ENTITY nbsp ' '>
                  ]>'''
      
      html = '''<span class='pos'><span class='arr_ud arrow_u5'> </span>&nbsp;7.7872</span>'''
      root = ET.fromstring(dtd + html)
      print(list(root)[0].tail)
      

      Output

       7.7872
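In ElementTree, the text that follows an element's closing tag is stored in that element's `.tail`, which is why the answer reads the tail of the inner span. The variant below is only a sketch: it assumes the fragment is well-formed XML and that rewriting `&nbsp;` as the numeric reference `&#160;` is acceptable, which avoids declaring the entity in a DTD. Real-world HTML is often not well-formed, in which case an HTML parser is the safer tool.

```python
import xml.etree.ElementTree as ET

html = "<span class='pos'><span class='arr_ud arrow_u5'> </span>&nbsp;7.7872</span>"

# Assumption: a plain string substitution is good enough for this input.
# &#160; is the numeric character reference for a non-breaking space.
xml_text = html.replace("&nbsp;", "&#160;")
root = ET.fromstring(xml_text)

inner = root[0]    # the nested arrow span
tail = inner.tail  # text between </span> and the outer closing tag
print(repr(tail))

# str.strip() also removes the non-breaking space.
value = tail.strip()
print(value)
```

Stripping the tail removes the leading space that appears in the output above.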
      

      【Comments】:
