【Posted】:2018-09-21 23:58:23
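【Question】:
I am scraping data from Bodybuilding.com for a course project, and my goal is to collect member information. I can successfully scrape the information for the 20 members on the first page. The problem appears when I move on to the second page: as the highlighted part below shows, indices 21 to 40 repeat the information from indices 1 to 20, and I don't know why.
I thought line 28 (the marked `metrics = soup.findAll(...)` line) would update the variable and the information stored in it, but nothing seems to change. Is this related to the structure of the website?
Any help would be appreciated. Thank you.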
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.common.exceptions import NoSuchElementException
import time
import json

data = {}
browser = webdriver.Chrome()
url = "https://bodyspace.bodybuilding.com/member-search"
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")

# Going through pagination
pages_remaining = True
counter = 1
index = 0
while pages_remaining:
    if counter == 60:
        pages_remaining = False
    # FETCH AGE, HEIGHT, WEIGHT, & FITNESS GOAL
    metrics = soup.findAll("div", {"class": "bbcHeadMetrics"})  # <-- "line 28" referred to in the question
    for x in range(0, len(metrics)):
        metrics_children = metrics[index].findChildren()
        details = soup.findAll("div", {"class": "bbcDetails"})
        individual_details = details[index].findChildren()
        if len(individual_details) > 16:
            print("index: " + str(counter) + " / Age: " + individual_details[2].text + " / Height: " + individual_details[4].text + " / Weight: " + individual_details[7].text + " / Gender: " + individual_details[12].text + " / Goal: " + individual_details[18].text)
        else:
            print("index: " + str(counter) + " / Age: " + individual_details[2].text + " / Height: " + individual_details[4].text + " / Weight: " + individual_details[7].text + " / Gender: " + individual_details[12].text + " / Goal: " + individual_details[15].text)
        index = index + 1
        counter = counter + 1
    try:
        # Go to page 2
        next_link = browser.find_element_by_xpath('//*[@title="Go to page 2"]')
        next_link.click()
        index = 0
        time.sleep(30)
    except NoSuchElementException:
        rows_remaining = False
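One likely cause, sketched here under assumptions: in the script above, `html` and `soup` are built once, before the `while` loop, so every later `soup.findAll(...)` call searches the same snapshot of page 1, even after the browser has navigated. The example below reproduces that effect using only the standard library, so it runs without a browser. `FakeBrowser`, `DivTextCollector`, and `parse_details` are hypothetical stand-ins (for `webdriver.Chrome()` and for BeautifulSoup parsing), and the two HTML strings are made-up sample pages, not real site output.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collect the text of each <div> with a given class (stand-in for soup.findAll)."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.texts = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("class") == self.css_class:
            self._inside = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "div":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.texts[-1] += data


def parse_details(html):
    """Parse an HTML string and return the text of its bbcDetails divs."""
    collector = DivTextCollector("bbcDetails")
    collector.feed(html)
    return collector.texts


class FakeBrowser:
    """Hypothetical stand-in for a webdriver: page_source changes after navigation."""

    def __init__(self, pages):
        self._pages = pages
        self._current = 0

    @property
    def page_source(self):
        return self._pages[self._current]

    def click_next(self):
        self._current += 1


# Made-up sample pages, two members each.
pages = [
    '<div class="bbcDetails">member 1</div><div class="bbcDetails">member 2</div>',
    '<div class="bbcDetails">member 3</div><div class="bbcDetails">member 4</div>',
]

# Parsing once, before navigating: the result is a snapshot of page 1.
browser = FakeBrowser(pages)
stale = parse_details(browser.page_source)
browser.click_next()
stale_after_click = stale  # nothing was re-parsed, so this is still page 1

# Re-reading page_source and re-parsing after every navigation.
browser = FakeBrowser(pages)
fresh = []
for page_number in range(len(pages)):
    fresh.extend(parse_details(browser.page_source))
    if page_number < len(pages) - 1:
        browser.click_next()

print(stale_after_click)
print(fresh)
```

In the real script, the equivalent change would presumably be to move `html = browser.page_source` and `soup = BeautifulSoup(html, "html.parser")` inside the `while` loop, so that they run again after each click and wait.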
【Comments】:
- I answered the question below.
Tags: python selenium selenium-webdriver web-scraping beautifulsoup