【Question title】:BeautifulSoup unable to find more than 24 classes with find_all
【Posted】:2019-01-17 10:44:20
【Question】:

I'm trying to scrape data from a page where all of the items are stored like this:

<div class="box browsingitem canBuy 123"> </div> <div class="box browsingitem canBuy 264"> </div>

There are hundreds of them, but when I try to add them to an array, only 24 are saved.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import re
import lxml

my_url = 'https://www.alza.co.uk/tablets/18852388.htm'


uClient = uReq(my_url)

page_html = uClient.read()

uClient.close()

page_soup = soup(page_html, "lxml")

classname = "box browsingitem"
containers = page_soup.find_all("div", {"class":re.compile(classname)})

#len(containers) will be equal to 24

for container in containers:    
    title_container = container.find_all("a",{"class":"name browsinglink"})
    product_name = title_container[0].text  
    print("product_name: " + product_name)

Is there a problem with re.compile? How else can I search for classes?

Thanks for your help.

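【Comments】:

  • Is it possible that these hundreds of items are loaded as you scroll?
  • If all of these items carry the class name box browsingitem, why not just use page_soup.find_all('div', 'box browsingitem')? That should retrieve every item with that class that is loaded into the DOM.
  • @taras It is, but it loads 24 even when there are, for example, only 18 items... really weird.
  • @Steven For some reason that doesn't work; it loads 0.
  • Can you provide the link you are trying to scrape? @布莱斯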
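The exchange above about find_all('div', 'box browsingitem') returning 0 seems to come down to how BeautifulSoup matches the multi-valued class attribute. Here is a minimal sketch on toy markup modelled on the question's snippet (the third div is a hypothetical non-product box added for contrast, and "html.parser" is used only to avoid the lxml dependency):

```python
import re
from bs4 import BeautifulSoup

# Toy markup based on the question; the third div is a made-up non-product box.
html = """
<div class="box browsingitem canBuy 123"></div>
<div class="box browsingitem canBuy 264"></div>
<div class="box other"></div>
"""
page_soup = BeautifulSoup(html, "html.parser")

# A string containing a space is compared against the whole class attribute,
# so it only matches when the attribute is exactly "box browsingitem".
exact = page_soup.find_all("div", "box browsingitem")

# A compiled regex is searched against each class, then the full attribute string.
by_regex = page_soup.find_all("div", {"class": re.compile("box browsingitem")})

# A single class name matches any element that carries that class.
by_single = page_soup.find_all("div", class_="browsingitem")

# A CSS selector requires both classes, in any order.
by_css = page_soup.select("div.box.browsingitem")

print(len(exact), len(by_regex), len(by_single), len(by_css))
```

On this toy markup the exact-string search finds nothing, while the regex, single-class, and CSS-selector searches each find both product divs. So the re.compile call in the question is not what limits the result to 24.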

Tags: python html web-scraping beautifulsoup html-parsing


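【Solution 1】:

In this case, only 24 items are loaded into the DOM when you visit the page. Two options come to mind: 1) use a headless browser to click the "load more" button and load more items into the DOM, or 2) create a simple pagination scheme and loop through the pages.

Here is an example of the second option: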

import requests
from bs4 import BeautifulSoup as soup

for page in range(0, 10):
    print("Trying page # {}".format(page))
    # The first page has no suffix; later pages append -p<N> to the URL.
    if page == 0:
        my_url = 'https://www.alza.co.uk/tablets/18852388.html'
    else:
        my_url = 'https://www.alza.co.uk/tablets/18852388-p{}.html'.format(page)

    page_html = requests.get(my_url)
    page_soup = soup(page_html.content, "lxml")
    items = page_soup.find_all('div', {"class": "browsingitem"})
    print("Found a total of {}".format(len(items)))
    for item in items:
        # Search inside the current item, not the whole page.
        title = item.find('a', 'browsinglink')

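You can see that the URL has the pagination information built in, so you only need to determine how many pages to scrape in order to save all of that information. Here is the output: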

Trying page # 0
Found a total of 24
Trying page # 1
Found a total of 24
Trying page # 2
Found a total of 24
Trying page # 3
Found a total of 24
Trying page # 4
Found a total of 24
Trying page # 5
Found a total of 24
Trying page # 6
Found a total of 24
Trying page # 7
Found a total of 24
Trying page # 8
Found a total of 17
Trying page # 9
Found a total of 0
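Since the last page in that output returns 0 items, one way to avoid guessing the page count might be to keep requesting pages until an empty one comes back. The following is only a sketch under assumptions: the URL scheme is copied from the example above, the helper names (page_url, scrape_until_empty, fetch_names) are hypothetical, and the selector assumes each product link is an a.browsinglink inside a div.browsingitem.

```python
BASE = 'https://www.alza.co.uk/tablets/18852388'


def page_url(page):
    """Build the URL for a page index, following the scheme in the answer."""
    return BASE + '.html' if page == 0 else '{}-p{}.html'.format(BASE, page)


def scrape_until_empty(fetch, max_pages=100):
    """Collect results page by page and stop at the first empty page.

    `fetch` is any callable that takes a URL and returns a list of names;
    `max_pages` is a safety cap in case a page never comes back empty.
    """
    collected = []
    for page in range(max_pages):
        names = fetch(page_url(page))
        if not names:
            break
        collected.extend(names)
    return collected


def fetch_names(url):
    """Download one listing page and return the product names found on it."""
    # Imported here so the loop logic above can be tried without these packages.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    page_soup = BeautifulSoup(response.content, "lxml")
    # Assumed structure: one a.browsinglink per div.browsingitem.
    links = page_soup.select("div.browsingitem a.browsinglink")
    return [link.get_text(strip=True) for link in links]
```

Calling scrape_until_empty(fetch_names) would then walk the pages until one has no items. Passing the fetcher in as an argument also makes it easy to swap in a stub while testing, so the site is not hit repeatedly.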

【Comments】:
