How to scrape all the Test match details in cricinfo

【Question title】: How to scrape all the test match details in cricinfo
【Posted】: 2018-11-28 12:10:24
【Question description】:

I am trying to scrape all the Test match details from cricinfo with bs4, but instead of returning the match details the script fails with HTTP Error 504: Gateway Timeout.

I need to scrape the details of 2000 Test matches. This is my code:

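import os
import time
import unicodedata
import urllib.request as req
from urllib.parse import urljoin

from bs4 import BeautifulSoup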

BASE_URL = 'http://www.espncricinfo.com'

if not os.path.exists('./espncricinfo-fc'):
    os.mkdir('./espncricinfo-fc')

for i in range(0, 2000):
    
    soupy = BeautifulSoup(req.urlopen('http://search.espncricinfo.com/ci/content/match/search.html?search=test;all=1;page=' + str(i)).read())

    time.sleep(1)
    for new_host in soupy.findAll('a', {'class' : 'srchPlyrNmTxt'}):
        try:
            new_host = new_host['href']
        except:
            continue
        odiurl =BASE_URL + urljoin(BASE_URL,new_host)
        new_host = unicodedata.normalize('NFKD', new_host).encode('ascii','ignore')
        print(new_host)
        html = req.urlopen(odiurl).read()
        if html:
            with open('espncricinfo-fc/{0!s}'.format(str.split(new_host, "/")[4]), "wb") as f:
                f.write(html)
                print(html)
        else:
            print("no html")

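【Question discussion】:

Tags: python-3.x beautifulsoup

【Solution 1】:

This usually happens when requests are sent too quickly. The server may be down, or its firewall may be blocking your connection. Try increasing your sleep() or adding a random sleep.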

import random

.....
for i in range(0, 2000):
    soupy = BeautifulSoup(....)

    time.sleep(random.randint(2,6))
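
A random sleep lowers the request rate, but a single 504 will still abort the loop. One possible complement is to retry a failed request after a growing delay. The following is only a sketch under assumptions: `fetch_with_retry`, its parameters and the delay values are hypothetical and do not come from the answer above, and it assumes the server signals overload with an HTTP 5xx status.

```python
import random
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, opener=urllib.request.urlopen, max_attempts=4,
                     base_delay=2.0, sleep=time.sleep):
    """Return the response body for url, retrying on HTTP 5xx errors.

    Hypothetical helper; the names and defaults are illustrative only.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Re-raise client errors (4xx) and the error from the last attempt.
            if err.code < 500 or attempt == max_attempts:
                raise
            # Double the wait after each failure and add up to 1 s of jitter.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 1)
            sleep(delay)
```

Inside the loop, `fetch_with_retry(url)` could stand in for the direct `urlopen(url).read()` call. The `opener` and `sleep` parameters exist only so the helper can be exercised without network access.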

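【Discussion】:

【Solution 2】:

Not sure why, but it seems to work for me.

I made a few changes to the loop that goes through the links. I wasn't sure how you want the output to look when it is written to a file, so I left that part alone. But like I said, it seems to work fine on my end.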

import os
import time
import urllib.request as req

import bs4
import requests

BASE_URL = 'http://www.espncricinfo.com'

if not os.path.exists('C:/espncricinfo-fc'):
    os.mkdir('C:/espncricinfo-fc')

for i in range(0, 2000):

    # Request search-results page i.
    url = 'http://search.espncricinfo.com/ci/content/match/search.html?search=test;all=1;page=%s' % i
    html = requests.get(url)

    print('Checking page %s of 2000' % (i + 1))

    soupy = bs4.BeautifulSoup(html.text, 'html.parser')

    time.sleep(1)
    for new_host in soupy.findAll('a', {'class': 'srchPlyrNmTxt'}):
        try:
            new_host = new_host['href']
        except KeyError:
            # Skip anchors that have no href attribute.
            continue
        odiurl = BASE_URL + new_host
        new_host = odiurl  # keep the URL as a str so it can be split below
        print(new_host)
        html = req.urlopen(odiurl).read()

        if html:
            # Name the file after the URL segments that follow the host.
            with open('C:/espncricinfo-fc/{0!s}'.format('_'.join(new_host.split("/")[4:])), "wb") as f:
                f.write(html)
                #print(html)
        else:
            print("no html")
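
The question's code builds the match URL as `BASE_URL + urljoin(BASE_URL, new_host)`, which repeats the host, because `urljoin` already returns an absolute URL. A minimal sketch of resolving the link and deriving a file name with `urllib.parse` follows. The helpers `match_url` and `match_filename` and the example path are hypothetical, and the file name keeps every path segment, so it differs slightly from the `[4:]` slice used in this answer.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = 'http://www.espncricinfo.com'


def match_url(href, base=BASE_URL):
    """Resolve a result link (relative or absolute) against the site root."""
    return urljoin(base, href)


def match_filename(url):
    """Build a flat file name from the URL path, for saving the page."""
    path = urlparse(url).path.strip('/')
    # Fall back to a fixed name when the URL has no path.
    return path.replace('/', '_') or 'index.html'


# Hypothetical link path, used only for illustration.
url = match_url('/ci/engine/match/12345.html')
print(url)
print(match_filename(url))
```

Because `urljoin` leaves an already absolute href unchanged, the same call works whether the search page returns relative or absolute links.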

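【Discussion】:

When I try to save the html to a file, it shows: descriptor 'split' requires a 'str' object but received a 'bytes'
Looking at your code, I see you are splitting the new_host variable. Are you using the code I posted above, or your original code? My code stores new_host as a str, so again, it works fine for me. But what output are you looking for? Are you trying to save each page's HTML source as a separate html file?
Ah, got it. Give me a minute and I'll edit my answer to include that. Do you want the html saved into a single file?
Yes, I want to save all the html files in one folder
Right. But you are looping over 2000 pages, so I'm asking: you want 2000 files in that one folder, correct? Just want to make sure.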
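
The `split` error reported in this discussion most likely comes from the question's code, where `new_host` is encoded to bytes before `str.split` is called on it. A small sketch of the cause and one possible fix, assuming a hypothetical link path:

```python
import unicodedata

href = '/ci/engine/match/12345.html'  # hypothetical link path

# normalize(...).encode(...) returns bytes, as in the question's code.
as_bytes = unicodedata.normalize('NFKD', href).encode('ascii', 'ignore')

try:
    str.split(as_bytes, '/')
    split_failed = False
except TypeError:
    # str.split is a str method, so it rejects a bytes argument.
    split_failed = True

# Decoding back to str (or never encoding) lets the split succeed.
parts = as_bytes.decode('ascii').split('/')
name = parts[4]
print(split_failed, name)
```

Keeping `new_host` as a str throughout, as the second answer does, avoids the problem without any decoding step.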