【Question Title】: Web scraping through pagination with BeautifulSoup
【Posted】: 2018-09-21 23:58:23
【Question】:

I am scraping data from Bodybuilding.com for a course project, and my goal is to scrape member information. I successfully scraped the information for the 20 members on the first page. The problem appears when I move to the second page: the highlighted section below shows that indices 21 through 40 repeat the information from indices 1 through 20, and I don't know why.

I thought line 28 (in bold) would update the variable and the information it stores, but it does not seem to change. Does this have something to do with the site's structure?

Any help would be appreciated, thanks.

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.common.exceptions import NoSuchElementException
import time
import json

data = {}

browser = webdriver.Chrome()
url = "https://bodyspace.bodybuilding.com/member-search"
browser.get(url)

html = browser.page_source
soup = BeautifulSoup(html, "html.parser")

# Going through pagination
pages_remaining = True
counter = 1
index = 0

while pages_remaining:

    if counter == 60:
        pages_remaining = False

    # FETCH AGE, HEIGHT, WEIGHT, & FITNESS GOAL

    metrics = soup.findAll("div", {"class": "bbcHeadMetrics"})  # <-- the bold line referenced above

    for x in range(0, len(metrics)):
        metrics_children = metrics[index].findChildren()

        details = soup.findAll("div", {"class": "bbcDetails"})
        individual_details = details[index].findChildren()

        if len(individual_details) > 16:
            print ("index: " + str(counter) + " / Age: " + individual_details[2].text + " / Height: " + individual_details[4].text + " / Weight: " + individual_details[7].text + " / Gender: " + individual_details[12].text + " / Goal: " + individual_details[18].text)
        else:
            print ("index: " + str(counter) + " / Age: " + individual_details[2].text + " / Height: " + individual_details[4].text + " / Weight: " + individual_details[7].text + " / Gender: " + individual_details[12].text + " / Goal: " + individual_details[15].text)

        index = index + 1
        counter = counter + 1

    try:
        # Go to page 2
        next_link = browser.find_element_by_xpath('//*[@title="Go to page 2"]')
        next_link.click()
        index = 0
        time.sleep(30)
    except NoSuchElementException:
        pages_remaining = False

【Question Comments】:

  • I answered the question below.

标签: python selenium selenium-webdriver web-scraping beautifulsoup


【Solution 1】:

You need to update the `html` and `soup` variables after navigating to the next page.

try:
    # Go to page 2
    next_link = browser.find_element_by_xpath('//*[@title="Go to page 2"]')
    next_link.click()
    index = 0

    # wait for the new page to load before grabbing the new source
    time.sleep(30)

    # update html and soup
    html = browser.page_source
    soup = BeautifulSoup(html, "html.parser")

except NoSuchElementException:
    pages_remaining = False

I believe you have to do this because the URL does not change and the HTML is generated dynamically with JavaScript.
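The underlying point is that BeautifulSoup parses a static string snapshot, not the browser's live DOM, so an old `soup` object never reflects a page change. A minimal sketch of that behavior (with made-up two-page markup standing in for the real site):

```python
from bs4 import BeautifulSoup

# Hypothetical markup for two successive result pages of the member search.
page1_html = '<div class="bbcDetails">Age: 25</div>'
page2_html = '<div class="bbcDetails">Age: 31</div>'

# Parse the first "page". This soup is a frozen snapshot of page1_html.
soup = BeautifulSoup(page1_html, "html.parser")
first = soup.find("div", {"class": "bbcDetails"}).text

# Simulate the pagination click: browser.page_source would now be page2_html,
# but the old soup object is unaffected until we explicitly re-parse.
soup = BeautifulSoup(page2_html, "html.parser")
second = soup.find("div", {"class": "bbcDetails"}).text

print(first)   # Age: 25
print(second)  # Age: 31
```

This is why the accepted fix re-runs `BeautifulSoup(browser.page_source, ...)` inside the loop: without the re-parse, every iteration reads the same page-1 snapshot.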

【Comments】:

  • I think this solution works for moving to the next page; it's just that the same page's results keep coming back.