【Question title】: BeautifulSoup scraping .text attribute problem
【Posted】: 2021-04-01 23:26:47
【Question】:

I have the following code to scrape the page https://www.hotukdeals.com:

from bs4 import BeautifulSoup
import requests

url="https://www.hotukdeals.com/hot"
r = requests.get(url)
soup = BeautifulSoup(r.text,"html.parser")
deals = soup.find_all("article")
for deal in deals:
    priceElement = deal.find("span",{"class":"thread-price"})
    try:
        print(priceElement,priceElement.text)
    except AttributeError:
        pass

For some reason this works for a certain number of loop iterations, printing each deal's price, and then stops working.

Program output:

<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£9.09</span> £9.09
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£39.95</span> £39.95
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£424.98</span> £424.98
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£8.10</span> £8.10
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£14.59</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£2.50</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl text--color-greyShade">£20</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£19</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£29</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl text--color-greyShade">£49.97</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">FREE</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£2.49</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£1.99</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£54.99</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£12.85</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£1.99</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£21.03</span>
<span class="thread-price text--b cept-tp size--all-l size--fromW3-xl">£5.29</span>

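As the output shows, after the first four lines the .text attribute is empty, even though the elements clearly contain text.

Does anyone know why this happens? Any ideas or solutions?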
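
One way to narrow a problem like this down is to check, for a given parser, how many price elements are found and how many of them actually yield text. The following is only a minimal sketch: the helper name `count_price_text` and the inline `sample` markup are hypothetical stand-ins, not the real page.

```python
from bs4 import BeautifulSoup


def count_price_text(html, parser):
    """Return (elements found, elements with non-empty text) for one parser."""
    soup = BeautifulSoup(html, parser)
    spans = soup.select("span.thread-price")
    # get_text(strip=True) returns "" when the element has no visible text.
    with_text = sum(1 for span in spans if span.get_text(strip=True))
    return len(spans), with_text


# Hypothetical sample markup, much simpler than the live site.
sample = """
<article><span class="thread-price text--b">£9.09</span></article>
<article><span class="thread-price text--b">FREE</span></article>
<article><span class="other">no price here</span></article>
"""

found, with_text = count_price_text(sample, "html.parser")
print(found, with_text)
```

Running the same helper on `r.text` with each installed parser would show whether the two numbers diverge for one parser but not for another.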

【Question comments】:

    Tags: python html web-scraping beautifulsoup


    【Answer 1】:

    BeautifulSoup needs the html5lib parser to parse this site correctly, for example:

    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.hotukdeals.com/"
    
    soup = BeautifulSoup(requests.get(url).content, "html5lib")  # <-- use html5lib
    
    for price in soup.select(".thread-price"):
        print(price.text)
    

    Prints:

    £149.99
    £7
    £21.03
    £31.79
    £359.10
    £19.99
    £60
    £0.60
    £168
    £4.99
    £20
    £119
    Free P&P
    Free
    £5
    £89.99
    FREE
    £10.96
    £1.79
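
html5lib is a separate package (`pip install html5lib`), and BeautifulSoup raises `bs4.FeatureNotFound` when the named parser is not installed. As a rough sketch, a script could pick the first available parser before building the soup. The helper `pick_parser` and the preference order below are assumptions made for illustration, not part of the answer above.

```python
import importlib.util

# Assumed preference order for this sketch: most lenient parser first.
PREFERRED_PARSERS = ("html5lib", "lxml", "html.parser")


def pick_parser(preferred=PREFERRED_PARSERS):
    """Return the first parser name in `preferred` whose package is importable."""
    for name in preferred:
        # "html.parser" ships with Python itself, so it is always available.
        if name == "html.parser" or importlib.util.find_spec(name) is not None:
            return name
    raise RuntimeError("no usable parser found")


parser = pick_parser()
print(parser)
```

The returned name can be passed as the second argument, for example `BeautifulSoup(requests.get(url).content, pick_parser())`. Note that a fallback to "html.parser" may reproduce the empty `.text` behaviour described in the question.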
    

    【Comments】:
