【Posted】: 2017-09-03 10:01:00
【Problem description】:
I am trying to crawl a web page and extract URLs down to level 3 at most. My code is as follows:
import lxml.html
import urllib.request
from bs4 import BeautifulSoup

stopLevel = 3
rootUrls = ['http://ps.ucdavis.edu/']

foundUrls = {}
for rootUrl in rootUrls:
    foundUrls.update({rootUrl: {'Level': 0, 'Parent': 'N/A'}})

def getProtocolAndDomainName(url):
    # split the url at '://' and get back a list
    protocolAndOther = url.split('://')
    protocol = protocolAndOther[0]
    domainName = protocolAndOther[1].split('/')[0]
    # this only returns 'https://xxxxx.com'
    return protocol + '://' + domainName
def crawl(urls, stopLevel=5, level=1):
    nextUrls = []
    if level <= stopLevel:
        for url in urls:
            # need to handle urls (e.g., https) that cannot be read
            try:
                openedUrl = urllib.request.urlopen(url).read()
                soup = BeautifulSoup(openedUrl, 'html.parser')
            except:
                print('cannot read for: ' + url)
            for a in soup.find_all('a', href=True):
                href = a['href']
                if href is not None:
                    # for the case where the link is a relative path
                    if '://' not in href:
                        href = getProtocolAndDomainName(url) + href
                    # check whether the url has already been visited
                    if href not in foundUrls:
                        foundUrls.update({href: {'Level': level,
                                                 'Parent': url}})
                        nextUrls.append(href)
        # recursive call
        crawl(nextUrls, stopLevel, level + 1)

crawl(rootUrls, stopLevel)
print(foundUrls)
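As an aside on the relative-path handling above: prepending `getProtocolAndDomainName(url)` only works for hrefs that start with `/`; a directory-relative href such as `other.html` would be glued directly onto the domain. The standard-library `urllib.parse.urljoin` resolves both cases correctly (a minimal sketch; the path segments in the base URL are made up for illustration):

```python
from urllib.parse import urljoin

# illustrative base URL (the /dir/page.html path is hypothetical)
base = 'http://ps.ucdavis.edu/dir/page.html'

print(urljoin(base, '/about'))       # → http://ps.ucdavis.edu/about
print(urljoin(base, 'other.html'))   # → http://ps.ucdavis.edu/dir/other.html
# absolute links pass through unchanged
print(urljoin(base, 'https://example.com/x'))  # → https://example.com/x
```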
After running the code, it shows the error UnboundLocalError: local variable 'soup' referenced before assignment. I understand this happens because BeautifulSoup fails to parse openedUrl, so the local variable soup is never defined, which in turn makes the loop fail. My first idea was to declare soup as global by putting global soup right under def crawl(urls, stopLevel=5, level=1):, but I was told this does not solve the problem at all. My second idea was to use if...continue to keep the loop running when BeautifulSoup fails to parse, but the problem I now face is that whether I test if soup == '' or if soup == None, it still does not work. I would like to know what value BeautifulSoup returns when it fails. Can anyone help? Or does anyone have another solution? Many thanks.
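For reference, BeautifulSoup does not return a sentinel value on failure here: the exception is raised by urllib.request.urlopen, and execution then falls past the except block with soup never assigned, hence the UnboundLocalError. One fix is to continue inside the except clause so the rest of the loop body only runs when the fetch succeeded. A minimal sketch of that pattern (the crawl_level helper is hypothetical, and the BeautifulSoup parsing step is left as a comment so the sketch stays self-contained):

```python
import urllib.request

def crawl_level(urls):
    """Fetch each URL, skipping any that cannot be read."""
    pages = []
    for url in urls:
        try:
            html = urllib.request.urlopen(url).read()
        except Exception:
            # skip this URL instead of falling through to code that
            # would reference a variable that was never assigned
            print('cannot read for: ' + url)
            continue
        pages.append((url, html))  # parse html with BeautifulSoup here
    return pages
```

A malformed URL such as 'not-a-url' makes urlopen raise, the loop simply moves on to the next URL, and no unassigned variable is ever referenced.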
【Discussion】:
Tags: python-3.x beautifulsoup web-crawler