【Question Title】: Extract data with BeautifulSoup
【Posted】: 2016-03-09 09:41:22
【Question Description】:

I need to extract "Ended 7 seconds ago" from a file:

<div class="featured__columns">             
                            <div class="featured__column"><i style="color:rgb(149,213,230);" class="fa fa-clock-o"></i> <span title="Today, 11:49am">Ended 7 seconds ago</span></div>
                            <div class="featured__column featured__column--width-fill text-right"><span title="March 7, 2016, 10:50am">2 days ago</span> by <a style="color:rgb(149,213,230);" href="/user/Eclipsy">Eclipsy</a></div><a href="/user/Eclipsy" class="global__image-outer-wrap global__image-outer-wrap--avatar-small">
                                <div class="global__image-inner-wrap" style="background-image:url(https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/dc/dc5b8424bd5d17e13dcfe613689921dfc29f4574_medium.jpg);"></div>
                            </a>
                        </div>

I tried:

#!/usr/bin/python3
from bs4 import BeautifulSoup

with open("./source.html") as source_html:
    # Pass the parser explicitly to avoid the "no parser was explicitly specified" warning
    soup = BeautifulSoup(source_html.read(), "html.parser")
    spans = soup.find_all("span")
    print(spans[0].string)

This works fine, but I think my approach is clumsy. Is there a better way to extract the data?

【Question Discussion】:

    Tags: python python-3.x beautifulsoup


    【Solution 1】:

    The span you want is inside the first featured__column div:

    from bs4 import BeautifulSoup
    
    html ="""<div class="featured__columns">
                                <div class="featured__column"><i style="color:rgb(149,213,230);" class="fa fa-clock-o"></i> <span title="Today, 11:49am">Ended 7 seconds ago</span></div>
                                <div class="featured__column featured__column--width-fill text-right"><span title="March 7, 2016, 10:50am">2 days ago</span> by <a style="color:rgb(149,213,230);" href="/user/Eclipsy">Eclipsy</a></div><a href="/user/Eclipsy" class="global__image-outer-wrap global__image-outer-wrap--avatar-small">
                                    <div class="global__image-inner-wrap" style="background-image:url(https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/dc/dc5b8424bd5d17e13dcfe613689921dfc29f4574_medium.jpg);"></div>
                                </a>
                            </div>"""
    
    
    print(BeautifulSoup(html, "html.parser").select("div.featured__column span")[0].text)
    # Ended 7 seconds ago
    

    If you want the first, or the nth, span, you can use nth-of-type in the select:

    In [53]: BeautifulSoup(html).select("div.featured__column span")
    Out[53]: 
    [<span title="Today, 11:49am">Ended 7 seconds ago</span>,
     <span title="March 7, 2016, 10:50am">2 days ago</span>]
    
    In [54]: BeautifulSoup(html).select("div.featured__column span:nth-of-type(1)")
    Out[54]: [<span title="Today, 11:49am">Ended 7 seconds ago</span>]
    
    In [55]: BeautifulSoup(html).select("div.featured__column span:nth-of-type(2)")
    Out[55]: [<span title="March 7, 2016, 10:50am">2 days ago</span>]

    In [56]: BeautifulSoup(html).select("div.featured__column span:nth-of-type(2)")[0].text
    Out[56]: u'2 days ago'
    
    In [57]: BeautifulSoup(html).select("div.featured__column span:nth-of-type(1)")[0].text
    Out[57]: u'Ended 7 seconds ago'
    

    We can also use the i tag with the fa fa-clock-o class and grab its adjacent sibling span:

    In [70]: BeautifulSoup(html).select("i.fa.fa-clock-o + span")
    Out[70]: [<span title="Today, 11:49am">Ended 7 seconds ago</span>]
    
    In [71]: BeautifulSoup(html).select("i.fa.fa-clock-o + span")[0].text
    Out[71]: u'Ended 7 seconds ago'
    

    Finally, to replicate your own logic exactly and get the first span regardless of class and so on, you can simplify it to either of:

    BeautifulSoup(html).select("span:nth-of-type(1)")[0].text
    BeautifulSoup(html).find("span").text
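    As a side note, BeautifulSoup also provides select_one, which returns the first CSS match directly (or None if nothing matches), so the [0] indexing can be dropped. A minimal sketch, using just the target span trimmed from the question's HTML:

```python
from bs4 import BeautifulSoup

# Minimal snippet containing only the target span from the question's HTML
html = '<div class="featured__column"><span title="Today, 11:49am">Ended 7 seconds ago</span></div>'

soup = BeautifulSoup(html, "html.parser")
# select_one returns the first element matching the selector, or None
print(soup.select_one("div.featured__column span").text)
```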
    

    【Discussion】:

      【Solution 2】:

      You can try:

      # `class` is a reserved word in Python, so BeautifulSoup uses the `class_` keyword
      f_c = soup.find_all('div', class_='featured__columns')[0]
      print(f_c.find('div', class_='featured__column').span.get_text())
      

      Similarly, if there are multiple div tags with the featured__columns class, you can loop over them and extract your data from each.
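      The loop described above can be sketched as follows; the HTML here is a hypothetical example with two featured__columns blocks, trimmed down for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical input: two featured__columns blocks (contents assumed for illustration)
html = """
<div class="featured__columns">
  <div class="featured__column"><span title="Today, 11:49am">Ended 7 seconds ago</span></div>
</div>
<div class="featured__columns">
  <div class="featured__column"><span title="Yesterday">Ended 1 day ago</span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
results = []
for columns in soup.find_all("div", class_="featured__columns"):
    # Each block holds one featured__column div whose span carries the text we want
    span = columns.find("div", class_="featured__column").span
    results.append(span.get_text())

print(results)  # ['Ended 7 seconds ago', 'Ended 1 day ago']
```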

      【Discussion】:
