【Posted at】: 2018-01-28 05:39:29
【Problem description】:
I am trying to loop through multiple pages to scrape data with Python and BeautifulSoup. My script works for a single page, but when I try to iterate over multiple pages it only returns the data scraped from the last page. I suspect there may be a problem with how I loop, or with how I store/append the player_data list.
Here is what I have so far; any help is greatly appreciated.
#! python3
# downloadRecruits.py - Downloads espn college basketball recruiting database info

import requests, os, bs4, csv
import pandas as pd

# Starting url (class of 2007)
base_url = 'http://www.espn.com/college-sports/basketball/recruiting/databaseresults/_/class/2007/page/'

# Number of pages to scrape (Not inclusive, so number + 1)
pages = map(str, range(1,3))

# url for starting page
url = base_url + pages[0]

for n in pages:
    # Create url
    url = base_url + n

    # Parse data using BS
    print('Downloading page %s...' % url)
    res = requests.get(url)
    res.raise_for_status()

    # Creating bs object
    soup = bs4.BeautifulSoup(res.text, "html.parser")
    table = soup.find('table')

    # Get the data
    data_rows = soup.findAll('tr')[1:]
    player_data = []
    for tr in data_rows:
        tdata = []
        for td in tr:
            tdata.append(td.getText())
            if td.div and td.div['class'][0] == 'school-logo':
                tdata.append(td.div.a['href'])
        player_data.append(tdata)

print(player_data)
【Comments】:
- Add 4 spaces of indentation before print(player_data)
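The comment above points at a symptom, but the underlying issue in the posted script is that player_data is re-initialized to an empty list inside the page loop, so each iteration discards the previous page's rows. A minimal sketch of the fix (with a stand-in for the parsed table rows, since it does not hit the network):

```python
# Sketch: initialize player_data ONCE, before the page loop, so rows from
# every page accumulate instead of being overwritten each iteration.
# 'data_rows' here is a hypothetical stand-in for the <tr> rows that the
# question parses with BeautifulSoup.
player_data = []                       # moved outside the loop
for n in map(str, range(1, 3)):        # pages '1' and '2'
    data_rows = [['row-from-page-' + n]]   # stand-in for parsed rows
    for tdata in data_rows:
        player_data.append(tdata)
print(player_data)                     # rows from every page, not just the last
```

The same pattern applies to the original script: keep the per-row tdata list inside the loop, but hoist only the player_data initialization above `for n in pages:`.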
Tags: python web-scraping beautifulsoup