Looping Scraped Data Through Different Pages of a Website Using Beautiful Soup: Answers

【Question title】: Looping Scraped Data Through Different Pages of a Website Using Beautiful Soup
【Posted】: 2018-12-22 23:21:24
【Question】:

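Below is a web scraper that successfully extracts roster information from a team's website and exports it to a CSV file. As you can see, every team's site follows a similar URL pattern.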

http://m.redsox.mlb.com/roster/
http://m.yankees.mlb.com/roster/

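I'm trying to create a loop that goes through each team's website, scrapes the roster information for every player, and writes it to a CSV file. At the start of my code I create a dictionary of team names and format each name into the URL of the page to request. This strategy works, but the output only contains the last page listed in the dictionary. Does anyone know how to change this code so that it loops through all of the pages in the team_list dictionary? Thanks in advance!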

import requests
import csv
from bs4 import BeautifulSoup

team_list={'yankees','redsox'}

for team in team_list:
    page = requests.get('http://m.{}.mlb.com/roster/'.format(team))
    soup = BeautifulSoup(page.text, 'html.parser')

    soup.find(class_='nav-tabset-container').decompose()
    soup.find(class_='column secondary span-5 right').decompose()

    roster = soup.find(class_='layout layout-roster')
    names = [n.contents[0] for n in roster.find_all('a')]
    ids = [n['href'].split('/')[2] for n in roster.find_all('a')]
    number = [n.contents[0] for n in roster.find_all('td', index='0')]
    handedness = [n.contents[0] for n in roster.find_all('td', index='3')]
    height = [n.contents[0] for n in roster.find_all('td', index='4')]
    weight = [n.contents[0] for n in roster.find_all('td', index='5')]
    DOB = [n.contents[0] for n in roster.find_all('td', index='6')]
    team = [soup.find('meta',property='og:site_name')['content']] * len(names)

    with open('MLB_Active_Roster.csv', 'w', newline='') as fp:
        f = csv.writer(fp)
        f.writerow(['Name','ID','Number','Hand','Height','Weight','DOB','Team'])
        f.writerows(zip(names, ids, number, handedness, height, weight, DOB, team))

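【Comments】:

Hmm, I'm not very familiar with the csv module. Would it work for you if I tried to solve your problem with the pandas module instead?
I actually solved this myself. I needed to change the 'w' in my open() line to 'a'. Now I just need to figure out how to make the header appear only once. Thanks for the offer, @Fozoro! I appreciate the help!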
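
For the remaining header problem mentioned in the last comment, one option is to keep the append mode and write the header only when the file does not exist yet or is still empty. This is a minimal sketch that assumes the rows have already been scraped; `append_rows` and `HEADER` are hypothetical names that do not appear in the original code.

```python
import csv
import os

# Column names taken from the question's writerow() call
HEADER = ['Name', 'ID', 'Number', 'Hand', 'Height', 'Weight', 'DOB', 'Team']

def append_rows(path, rows, header=HEADER):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    # Decide before opening: open(..., 'a') creates the file if it is missing
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='') as fp:
        writer = csv.writer(fp)
        if needs_header:
            writer.writerow(header)
        writer.writerows(rows)
```

Inside the scraping loop this would be called once per team, for example `append_rows('MLB_Active_Roster.csv', zip(names, ids, number, handedness, height, weight, DOB, team))`. Because the file is opened in append mode, running the script a second time adds the same rows again, so the old file should be deleted first if a fresh export is wanted.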

Tags: python loops web-scraping beautifulsoup

【Solution 1】:

I believe replacing your dictionary with a list should solve the problem:

import requests
import csv
import pandas as pd

from bs4 import BeautifulSoup

team_list=['yankees','redsox']
output = []

for team in team_list:
    page = requests.get('http://m.{}.mlb.com/roster/'.format(team))
    soup = BeautifulSoup(page.text, 'html.parser')

    soup.find(class_='nav-tabset-container').decompose()
    soup.find(class_='column secondary span-5 right').decompose()

    roster = soup.find(class_='layout layout-roster')
    names = [n.contents[0] for n in roster.find_all('a')]
    ids = [n['href'].split('/')[2] for n in roster.find_all('a')]
    number = [n.contents[0] for n in roster.find_all('td', index='0')]
    handedness = [n.contents[0] for n in roster.find_all('td', index='3')]
    height = [n.contents[0] for n in roster.find_all('td', index='4')]
    weight = [n.contents[0] for n in roster.find_all('td', index='5')]
    DOB = [n.contents[0] for n in roster.find_all('td', index='6')]
    team = [soup.find('meta',property='og:site_name')['content']] * len(names)

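    # Add one row per player; appending the column lists themselves would give one row per team
    output.extend(zip(names, ids, number, handedness, height, weight, DOB, team))

# Write all teams to a single CSV once, after the loop has finished
pd.DataFrame(data=output, columns=['Name','ID','Number','Hand','Height','Weight','DOB','Team']).to_csv('csvfilename.csv', index=False)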

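【Discussion】:

Thanks for the reply, Masoud! Unfortunately, that didn't work. It still only scraped one team. For whatever reason, something isn't looping.
You are scraping all of the pages. As you have already pointed out, the problem is that each one overwrites the previous one. I've edited the code snippet; I suggest using a pandas DataFrame for this.
Thanks for your help, Masoud!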
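
The overwriting described in this discussion can be reproduced without any scraping. The sketch below uses placeholder rows and a temporary file (the function name and the data are made up for illustration) to show that reopening a file with 'w' on every iteration truncates it, while 'a' keeps the earlier rows.

```python
import csv
import os
import tempfile

def rows_after_loop(mode):
    """Write two placeholder teams in a loop, reopening the file with `mode`
    on every iteration, and return the rows left in the file afterwards."""
    teams = [
        [['Player A1'], ['Player A2']],  # placeholder rows for a first team
        [['Player B1']],                 # placeholder rows for a second team
    ]
    fd, path = tempfile.mkstemp(suffix='.csv')
    os.close(fd)
    try:
        for rows in teams:
            # 'w' truncates the file each time; 'a' adds to the end
            with open(path, mode, newline='') as fp:
                csv.writer(fp).writerows(rows)
        with open(path, newline='') as fp:
            return list(csv.reader(fp))
    finally:
        os.remove(path)

print(rows_after_loop('w'))  # [['Player B1']]
print(rows_after_loop('a'))  # [['Player A1'], ['Player A2'], ['Player B1']]
```

This is why the loop in the question appeared to visit only the last team: every page was requested and parsed, but each pass replaced the file written by the previous one.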