【Question Title】: python beautifulsoup next page
【Posted】: 2020-01-16 14:17:33
【Question Description】:

Here is my current code for scraping specific player data from the site:

import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import pandas as pd
from pandas import ExcelWriter
import lxml
import xlsxwriter

page = requests.get('https://www.futbin.com/players?page=1')
soup = BeautifulSoup(page.content, 'lxml')
pool = soup.find(id='repTb')

pnames = pool.find_all(class_='player_name_players_table')
pprice = pool.find_all(class_='ps4_color font-weight-bold')
prating = pool.select('span[class*="form rating ut20"]')


all_player_names = [name.getText() for name in pnames]
all_prices = [price.getText() for price in pprice]
all_pratings = [rating.getText() for rating in prating]

fut_data = pd.DataFrame(
    {
        'Player': all_player_names,
        'Rating': all_pratings,
        'Price': all_prices,
     })

writer = pd.ExcelWriter('file.xlsx', engine='xlsxwriter')

fut_data.to_excel(writer,'Futbin')
writer.save()

print(fut_data)

This works fine for the first page, but I need to go through 609 pages in total and pull the data from all of them.

Could you help me rework this code so it does that? I'm still a beginner and learning through this project.

【Question Comments】:

  • Put the code inside a loop from 1 to 600 and rebuild the URL from the loop index. And don't forget to change the output file name.

Tags: python pandas web-scraping beautifulsoup python-requests


【Solution 1】:

You can loop over all 609 pages, parse each one, and finally save the collected data to file.xlsx:

import requests
from bs4 import BeautifulSoup
import pandas as pd

all_player_names = []
all_pratings = []
all_prices = []

for i in range(1, 610):
    page = requests.get('https://www.futbin.com/players?page={}'.format(i))
    soup = BeautifulSoup(page.content, 'lxml')
    pool = soup.find(id='repTb')

    pnames = pool.find_all(class_='player_name_players_table')
    pprice = pool.find_all(class_='ps4_color font-weight-bold')
    prating = pool.select('span[class*="form rating ut20"]')

    all_player_names.extend([name.getText() for name in pnames])
    all_prices.extend([price.getText() for price in pprice])
    all_pratings.extend([rating.getText() for rating in prating])

fut_data = pd.DataFrame({'Player': all_player_names,
                         'Rating': all_pratings,
                         'Price': all_prices})

writer = pd.ExcelWriter('file.xlsx', engine='xlsxwriter')
fut_data.to_excel(writer, sheet_name='Futbin')
writer.save()
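One caveat: `pd.DataFrame` raises `ValueError` when the three lists end up with different lengths, which can happen if any page has a player missing a price or rating. A small guard (the `check_lengths` helper below is hypothetical, not part of the answer) makes the offending page easy to spot:

```python
def check_lengths(names, ratings, prices):
    """True when every player has exactly one rating and one price."""
    return len(names) == len(ratings) == len(prices)

# Illustrative per-page values:
page_names = ['Player A', 'Player B']
page_ratings = ['94', '93']
page_prices = ['1.2M', '980K']

if not check_lengths(page_names, page_ratings, page_prices):
    raise ValueError('column lists out of sync on this page')
```

Calling this inside the loop, right after the three `extend` calls, tells you the page number where the columns drifted apart instead of failing at the very end.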

【Discussion】:

  • Thanks for the detailed help! However, I now get an error: line 14, in pnames = pool.find_all(class_='player_name_players_table') AttributeError: 'NoneType' object has no attribute 'find_all'
  • Oh.. it seems my URL got blocked, which is why I got that error: "Oops, something went wrong - 403. Your client currently isn't allowed to access this page. Please try again in a few minutes."
  • Do I need to put a sleep timer somewhere? Am I hitting the site too fast?
  • I'm not sure, but you could try it
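The 403 in the comments above usually means the site is rejecting the default `requests` User-Agent or throttling rapid requests. A minimal sketch of a polite fetch helper (the `HEADERS` value, the `fetch` name, and the `getter` parameter are illustrative assumptions, not from the original answer) that sets a browser-like User-Agent and backs off between retries:

```python
import time
import requests

# A browser-like User-Agent; many sites return 403 to requests' default one.
HEADERS = {'User-Agent': 'Mozilla/5.0'}

def fetch(url, getter=requests.get, retries=3, delay=5):
    """Fetch a URL, sleeping and retrying when the response is not 200."""
    resp = None
    for attempt in range(retries):
        resp = getter(url, headers=HEADERS)
        if resp.status_code == 200:
            return resp
        # linear back-off: wait longer after each failed attempt
        time.sleep(delay * (attempt + 1))
    resp.raise_for_status()  # give up: surface the last HTTP error
```

Inside the loop you would then call `fetch('https://www.futbin.com/players?page={}'.format(i))` in place of the bare `requests.get(...)`.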