【Question】: For Loop to Pass a Variable through a URL in Python
【Posted】: 2015-05-11 23:59:46
【Description】:

I'm new to Python and trying to teach myself by doing some simple web scraping to pull football stats.

I've successfully pulled the data one page at a time, but I can't figure out how to add a loop to my code so it scrapes multiple pages at once (or multiple positions/years/conferences, for that matter).

I've searched this site and others quite a bit, but I can't seem to get it right.

Here's my code:

import csv
import requests
from BeautifulSoup import BeautifulSoup

url = 'http://www.nfl.com/stats/categorystats?seasonType=REG&d-447263-n=1&d-447263-o=2&d-447263-p=1&d-447263-s=PASSING_YARDS&tabSeq=0&season=2014&Submit=Go&experience=&archive=false&statisticCategory=PASSING&conference=null&qualified=false'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('table', attrs={'class': 'data-table1'})

list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace('&#39', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

#for line in list_of_rows: print ', '.join(line)

outfile = open("./2014.csv", "wb")
writer = csv.writer(outfile)
writer.writerow(["Rk", "Player", "Team", "Pos", "Comp", "Att", "Pct", "Att/G", "Yds", "Avg", "Yds/G", "TD", "Int", "1st", "1st%", "Lng", "20+", "40+", "Sck", "Rate"])
writer.writerows(list_of_rows)

outfile.close()

Here's my attempt at adding a variable into the URL and building the loop:

import csv
import requests
from BeautifulSoup import BeautifulSoup

pagelist = ["1", "2", "3"]

x = 0
while (x < 500):
    url = "http://www.nfl.com/stats/categorystats?seasonType=REG&d-447263-n=1&d-447263-o=2&d-447263-p="+str(x)).read(),'html'+"&d-447263-s=RUSHING_ATTEMPTS_PER_GAME_AVG&tabSeq=0&season=2014&Submit=Go&experience=&archive=false&statisticCategory=RUSHING&conference=null&qualified=false"

    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)
    table = soup.find('table', attrs={'class': 'data-table1'})
    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):
            text = cell.text.replace('&#39', '')
            list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

    #for line in list_of_rows: print ', '.join(line)


    outfile = open("./2014.csv", "wb")
    writer = csv.writer(outfile)
    writer.writerow(["Rk", "Player", "Team", "Pos", "Att", "Att/G", "Yds", "Avg", "Yds/G", "TD", "Long", "1st", "1st%", "20+", "40+", "FUM"])
    writer.writerows(list_of_rows)
    x = x + 0
    outfile.close()

Thanks very much in advance.

Here's my revised code, which seems to wipe out each page as it writes to the csv file.

import csv
import requests
from BeautifulSoup import BeautifulSoup

url_template = 'http://www.nfl.com/stats/categorystats?tabSeq=0&season=2014&seasonType=REG&experience=&Submit=Go&archive=false&d-447263-p=%s&conference=null&statisticCategory=PASSING&qualified=false'

for p in ['1','2','3']:
    url = url_template % p
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)
    table = soup.find('table', attrs={'class': 'data-table1'})

    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):
            text = cell.text.replace('&#39', '')
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)

    #for line in list_of_rows: print ', '.join(line)

    outfile = open("./2014Passing.csv", "wb")
    writer = csv.writer(outfile)
    writer.writerow(["Rk", "Player", "Team", "Pos", "Comp", "Att", "Pct", "Att/G", "Yds", "Avg", "Yds/G", "TD", "Int", "1st", "1st%", "Lng", "20+", "40+", "Sck", "Rate"])
    writer.writerows(list_of_rows)

outfile.close()

【Discussion】:

    Tags: python python-2.7 web-scraping beautifulsoup python-requests


    【Solution 1】:

    Assuming you only want to change the page number, you can do it like this, using string formatting:

    url_template = 'http://www.nfl.com/stats/categorystats?seasonType=REG&d-447263-n=1&d-447263-o=2&d-447263-p=%s&d-447263-s=PASSING_YARDS&tabSeq=0&season=2014&Submit=Go&experience=&archive=false&statisticCategory=PASSING&conference=null&qualified=false'
    for page in [1, 2, 3]:
        url = url_template % page
        response = requests.get(url)
        # Rest of the processing code can go here
        outfile = open("./2014.csv", "ab")
        writer = csv.writer(outfile)
        writer.writerow(...)
        writer.writerows(list_of_rows)
        outfile.close()
    

    Note that you should open the file in append mode ("ab") instead of write mode ("wb"), because the latter overwrites the existing contents, as you've experienced. With append mode, new content is written to the end of the file.
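    The difference between the two modes is easy to demonstrate in isolation. A minimal sketch (Python 3 text-mode csv and a temp file rather than the question's Python 2 "wb"/"ab" and ./2014.csv):

    ```python
    import csv
    import os
    import tempfile

    path = os.path.join(tempfile.mkdtemp(), "demo.csv")

    # "w" truncates on every open, so each page wipes the previous one.
    for page in [1, 2, 3]:
        with open(path, "w", newline="") as f:
            csv.writer(f).writerow(["page", page])

    with open(path) as f:
        survivors = f.read().strip().splitlines()
    # survivors holds only the last page's row

    # "a" keeps existing contents and writes at the end instead.
    for page in [1, 2, 3]:
        with open(path, "a", newline="") as f:
            csv.writer(f).writerow(["page", page])

    with open(path) as f:
        appended = f.read().strip().splitlines()
    # appended now holds the leftover row plus all three appended rows
    ```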

    This is beyond the scope of the question and more of a friendly code-improvement suggestion, but the script would be easier to read and maintain if you split it into smaller functions that each do one thing, e.g. fetching data from the site, writing it to csv, and so on.
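    That refactoring might look something like the sketch below. It uses bs4 and Python 3 rather than the question's BeautifulSoup 3 and Python 2, and the function names and trimmed URL parameters are illustrative, not part of the original code:

    ```python
    import csv
    import requests
    from bs4 import BeautifulSoup  # bs4 is the maintained successor to BeautifulSoup 3

    URL_TEMPLATE = (
        'http://www.nfl.com/stats/categorystats?seasonType=REG'
        '&d-447263-p=%s&tabSeq=0&season=2014&Submit=Go'
        '&statisticCategory=PASSING&qualified=false'
    )

    def fetch_page(page):
        """Download one results page and return its HTML."""
        response = requests.get(URL_TEMPLATE % page)
        response.raise_for_status()
        return response.content

    def parse_rows(html):
        """Extract the stats table into a list of row lists."""
        soup = BeautifulSoup(html, 'html.parser')
        table = soup.find('table', attrs={'class': 'data-table1'})
        rows = []
        for tr in table.find_all('tr'):
            cells = [td.get_text(strip=True) for td in tr.find_all('td')]
            if cells:  # header rows contain <th>, not <td>, and come back empty
                rows.append(cells)
        return rows

    def write_csv(path, header, rows):
        """Write the header once, then every collected row."""
        with open(path, 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(rows)

    def main():
        all_rows = []
        for page in [1, 2, 3]:  # page range is illustrative
            all_rows.extend(parse_rows(fetch_page(page)))
        write_csv('2014.csv', ['Rk', 'Player', 'Team'], all_rows)
    ```

    Collecting every page's rows first and writing the file once also sidesteps the write-versus-append question entirely and keeps the header from being repeated per page.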

    【Comments】:

    • Thanks so much for your help, Jomel - the code works when I print to screen, but when I try to save to csv it looks like each page gets overwritten in the file, so I only end up with the last page. Is there a way to append the data from each page to the first page's data without overwriting it?
    • @JasonC See my revised answer. You just need to open the file in append mode so the contents aren't overwritten on each pass.