【Posted】:2015-06-16 23:49:37
【Question】:
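I'm trying to write a small script that takes a given website together with a keyword, follows all links a certain number of times (only links on the website's own domain), and finally searches every link it found for the keyword and returns any successful matches. The ultimate goal is that if you remember a website where you saw something, and you know a good keyword that the page contained, this program might be able to help find a link to the lost page. Now for my bug: while looping through all these pages, extracting their URLs, and building a list of them, it seems to somehow end up redundantly checking and removing the same links over and over. I did add a safeguard against this, but it doesn't seem to work as expected. I suspect some URLs are mistakenly being duplicated into the list and end up being checked countless times.
Here is my full code (sorry about the length); the problem area seems to be the for loops at the very end: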
import bs4, requests, sys

def getDomain(url):
    if "www" in url:
        domain = url[url.find('.')+1:url.rfind('.')]
    elif "http" in url:
        domain = url[url.find("//")+2:url.rfind('.')]
    else:
        domain = url[:url.rfind(".")]
    return domain

def findHref(html):
    '''Will find the link in a given BeautifulSoup match object.'''
    link_start = html.find('href="')+6
    link_end = html.find('"', link_start)
    return html[link_start:link_end]

def pageExists(url):
    '''Returns true if url returns a 200 response and doesn't redirect to a dns search.
    url must be a requests.get() object.'''
    response = requests.get(url)
    try:
        response.raise_for_status()
        if response.text.find("dnsrsearch") >= 0:
            print response.text.find("dnsrsearch")
            print "Website does not exist"
            return False
    except Exception as e:
        print "Bad response:",e
        return False
    return True

def extractURLs(url):
    '''Returns list of urls in url that belong to same domain.'''
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text)
    matches = soup.find_all('a')
    urls = []
    for index, link in enumerate(matches):
        match_url = findHref(str(link).lower())
        if "." in match_url:
            if not domain in match_url:
                print "Removing",match_url
            else:
                urls.append(match_url)
        else:
            urls.append(url + match_url)
    return urls

def searchURL(url):
    '''Search url for keyword.'''
    pass

print "Enter homepage:(no http://)"
homepage = "http://" + raw_input("> ")
homepage_response = requests.get(homepage)
if not pageExists(homepage):
    sys.exit()
domain = getDomain(homepage)
print "Enter keyword:"
#keyword = raw_input("> ")
print "Enter maximum branches:"
max_branches = int(raw_input("> "))
links = [homepage]
for n in range(max_branches):
    for link in links:
        results = extractURLs(link)
        for result in results:
            if result not in links:
                links.append(result)
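
One likely factor in the loop above: `for link in links` iterates over a list that the loop body keeps appending to, so the inner loop also walks every newly added link, `max_branches` does not bound the depth, and each outer pass fetches every stored link again. Below is a minimal sketch of a level-by-level alternative, assuming a `seen` set for deduplication. The names `crawl`, `get_links`, `FAKE_SITE`, and `fake_links` are hypothetical and exist only for illustration; `fake_links` stands in for `extractURLs` so the sketch runs without network access.

```python
def crawl(start, get_links, max_depth):
    """Breadth-first walk that fetches each URL at most once.

    get_links is any callable mapping a URL to the list of URLs found on it
    (an assumption standing in for extractURLs in the question).
    """
    seen = {start}          # every URL discovered so far
    frontier = [start]      # URLs discovered on the previous level only
    for _ in range(max_depth):
        next_frontier = []
        for url in frontier:
            for found in get_links(url):
                if found not in seen:
                    seen.add(found)
                    next_frontier.append(found)
        # Only the newly found URLs are fetched on the next level.
        frontier = next_frontier
    return seen


# Hypothetical in-memory "site" so the sketch needs no network access.
FAKE_SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/a"],
    "http://example.com/c": ["http://example.com/d"],
}


def fake_links(url):
    """Return the outgoing links recorded for url, or an empty list."""
    return FAKE_SITE.get(url, [])


if __name__ == "__main__":
    for found_url in sorted(crawl("http://example.com/", fake_links, 2)):
        print(found_url)
```

Because the list being iterated (`frontier`) is never the list being appended to (`next_frontier`), each level ends after a fixed number of pages, and the set makes the membership test cheap compared with `result not in links` on a growing list.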
Partial output (about 0.000000000001%):
Removing /store/apps/details?id=com.handmark.sportcaster
Removing /store/apps/details?id=com.handmark.sportcaster
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.mobisystems.office
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.joelapenna.foursquared
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.dashlabs.dash.android
Removing /store/apps/details?id=com.eweware.heard
Removing /store/apps/details?id=com.eweware.heard
Removing /store/apps/details?id=com.eweware.heard
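
The output above suggests that site-relative links such as `/store/apps/details?id=...` land in the "Removing" branch because they contain a dot but not the domain string, while hrefs without a dot are glued onto the current page URL by plain concatenation, which may produce ever-longer URLs that never match an existing list entry. A minimal sketch of resolving hrefs with the standard library instead, assuming that "same site" means "same host" (an assumption, not something stated in the question). `resolve_same_site` is a hypothetical helper name.

```python
try:
    from urllib.parse import urljoin, urlparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse      # Python 2


def resolve_same_site(page_url, href):
    """Return href as an absolute URL if it stays on page_url's host, else None."""
    # urljoin handles absolute URLs, root-relative paths and relative paths.
    absolute = urljoin(page_url, href)
    # Drop any #fragment so the same page is not stored under several names.
    absolute = absolute.split("#", 1)[0]
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https"):
        return None  # e.g. mailto: or javascript: links
    if parts.netloc.lower() != urlparse(page_url).netloc.lower():
        return None  # link points to a different host
    return absolute


if __name__ == "__main__":
    page = "http://example.com/apps/"
    print(resolve_same_site(page, "/store/apps/details?id=com.eweware.heard"))
    print(resolve_same_site(page, "http://other.org/x"))
```

Normalizing every href this way before the duplicate check means two spellings of the same page compare equal, which is what a `not in` test on raw strings cannot guarantee.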
【Comments】:
Tags: python loops for-loop python-requests infinite-loop