[Posted]: 2018-11-28 12:10:24
[Question]:
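I am trying to scrape the details of all Test matches, but the request fails with HTTP Error 504: Gateway Timeout. I am using bs4 to fetch the Test match details from Cricinfo, but nothing is displayed.
I need to scrape the details of 2000 Test matches. Here is my code: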
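import os
import time
import unicodedata
import urllib.request as req
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'http://www.espncricinfo.com'
SEARCH_URL = 'http://search.espncricinfo.com/ci/content/match/search.html?search=test;all=1;page='

# Create the output directory on the first run.
if not os.path.exists('./espncricinfo-fc'):
    os.mkdir('./espncricinfo-fc')

for i in range(0, 2000):
    # Fetch one page of search results (urllib2 does not exist in Python 3).
    soupy = BeautifulSoup(req.urlopen(SEARCH_URL + str(i)).read(), 'html.parser')
    time.sleep(1)
    for link in soupy.find_all('a', {'class': 'srchPlyrNmTxt'}):
        new_host = link.get('href')
        if not new_host:
            continue
        # urljoin already returns an absolute URL, so BASE_URL is not prepended again.
        odiurl = urljoin(BASE_URL, new_host)
        # Drop non-ASCII characters but keep a str, so that split() below still works.
        new_host = unicodedata.normalize('NFKD', new_host).encode('ascii', 'ignore').decode('ascii')
        print(new_host)
        html = req.urlopen(odiurl).read()
        if html:
            # The file is named after the fifth path component of the link.
            with open('espncricinfo-fc/{0!s}'.format(new_host.split('/')[4]), 'wb') as f:
                f.write(html)
            print(html)
        else:
            print("no html")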
[Discussion]:
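A 504 Gateway Timeout is returned by the server side, so it can show up intermittently during a long crawl even when the request itself is fine. One common way to cope is to retry the request a few times with a growing delay. The following is only a minimal sketch under that assumption, not a confirmed fix for this site. The helper name `fetch_with_retry` and its parameters (`retries`, `backoff`, `opener`, `sleep`) are hypothetical and do not appear in the question's code.

```python
import socket
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, retries=4, backoff=2.0, timeout=30,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Return the response body, retrying on 5xx errors and network failures.

    Hypothetical helper: `opener` and `sleep` are injectable only so the
    function can be exercised without real network access or real delays.
    """
    for attempt in range(retries):
        try:
            with opener(url, timeout=timeout) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # 4xx responses will not improve on retry; only retry 5xx errors.
            if err.code < 500 or attempt == retries - 1:
                raise
        except (urllib.error.URLError, socket.timeout):
            # Connection problems and timeouts are retried as well.
            if attempt == retries - 1:
                raise
        # With the default backoff, wait 2s, 4s, 8s, ... between attempts.
        sleep(backoff * 2 ** attempt)
```

In the loop from the question, both `urlopen(...).read()` calls could then be replaced by `fetch_with_retry(...)`. If the last attempt also fails, the exception is re-raised, so a page that keeps timing out still needs to be caught and skipped by the caller.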