[Posted]: 2018-06-19 13:15:53
[Question]:
I am trying to extract some information about mtg cards from a webpage with the program below, but I repeatedly retrieve the information of the given initial page (InitUrl). The crawler is unable to proceed any further. I have started to believe that I am not using the correct URLs, or that some restriction in urllib has escaped my attention. Here is the code I have been struggling with for weeks now:
import re
from math import ceil
from urllib.request import urlopen as uReq, Request
from bs4 import BeautifulSoup as soup

InitUrl = "https://mtgsingles.gr/search?q=dragon"
NumOfCrawledPages = 0
URL_Next = ""
NumOfPages = 4  # depth of pages to be retrieved

query = InitUrl.split("?")[1]

for i in range(0, NumOfPages):
    if i == 0:
        Url = InitUrl
    else:
        Url = URL_Next
    print(Url)

    UClient = uReq(Url)  # downloading the url
    page_html = UClient.read()
    UClient.close()

    page_soup = soup(page_html, "html.parser")
    cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})

    for card in cards:
        card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")

        if len(card.div.contents) > 3:
            cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
        else:
            cardP_T = "Does not exist"

        cardType = card.contents[3].text
        print(card_name + "\n" + cardP_T + "\n" + cardType + "\n")

    try:
        URL_Next = InitUrl + "&page=" + str(i + 2)
        print("The next URL is: " + URL_Next + "\n")
    except IndexError:
        print("Crawling process completed! No more information to retrieve!")
    else:
        NumOfCrawledPages += 1
        Url = URL_Next
    finally:
        print("Moving to page : " + str(NumOfCrawledPages + 1) + "\n")
[Comments]:
- Take a look at this on how try/except/else/finally works (a minimal sketch of that control flow follows below this thread): stackoverflow.com/a/31626974/8240959
- I did, but I don't see anything wrong with the code as far as the try-except-else-finally statement is concerned.
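For reference, a minimal standalone illustration of the control flow the linked answer describes (not the asker's code): the else block runs only when the try body raises nothing, and finally runs in every case.

def fetch(simulate_error):
    try:
        if simulate_error:
            raise IndexError("no more pages")
        result = "page fetched"
    except IndexError as exc:
        print("except:", exc)          # runs only when that exception is raised
    else:
        print("else:", result)         # runs only if the try body raised nothing
    finally:
        print("finally: always runs")  # runs in both cases

fetch(simulate_error=False)
fetch(simulate_error=True)

Note that in the question's loop, URL_Next is built by plain string concatenation, which cannot raise IndexError, so the except branch there is effectively dead code.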
Tags: python beautifulsoup web-crawler urllib