【Posted】: 2016-04-19 12:57:32
【Question】:
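I am looking for code that collects all internal links [absolute and relative] from a website by iterating over every internal link it finds.
So far I have managed to write the following, but I cannot get the logic of the program right.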
import requests, csv, time
from lxml import html
from collections import OrderedDict

links = []
domain = 'bunchball.com'
base_link = 'http://www.bunchball.com/'
unique_list = []

def get_links(base_link):
    # Fetch the page and pull every href out of its <a> tags
    r = requests.get(base_link)
    source = html.fromstring(r.content)
    link = source.xpath('//a/@href')
    for each in link:
        each = str(each)
        if domain in each:
            # Absolute link that mentions the domain
            links.append(each)
        elif each.startswith('/'):
            # Relative link: prefix it with the current page URL
            links.append(base_link+each)
            unique_list.append(each)
        else:
            pass

get_links(base_link)

#while
for each1 in list(OrderedDict.fromkeys(links)):
    get_links(each1)
    while each1 not in unique_list:
        unique_list.append(each1)
        get_links(each1)
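One way the missing logic could be structured is a breadth-first crawl with a "seen" set, resolving each href against the page it was found on. This is only a sketch under assumptions: the names is_internal, crawl, fetch_hrefs, fetch_hrefs_http and max_pages are hypothetical and not part of the code above, "internal" is assumed to mean an http(s) URL whose host is the domain or one of its subdomains, and the page fetcher is passed in as a function so the traversal can be tried without network access.

```python
try:
    from urllib.parse import urljoin, urlparse, urldefrag  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse, urldefrag      # Python 2.7
from collections import deque


def is_internal(url, domain):
    """True if url is http(s) and its host is domain or a subdomain of it."""
    parts = urlparse(url)
    host = (parts.hostname or '').lower()
    return parts.scheme in ('http', 'https') and (
        host == domain or host.endswith('.' + domain))


def crawl(start_url, domain, fetch_hrefs, max_pages=500):
    """Breadth-first crawl; returns internal URLs in discovery order.

    fetch_hrefs(url) must return the raw href strings found on that page.
    max_pages caps how many pages are fetched (an assumed safety limit).
    """
    seen = {start_url}
    order = [start_url]
    queue = deque([start_url])
    fetched = 0
    while queue and fetched < max_pages:
        page = queue.popleft()
        fetched += 1
        for href in fetch_hrefs(page):
            # Resolve against the page the link was found on, drop #fragment
            absolute = urldefrag(urljoin(page, href))[0]
            if is_internal(absolute, domain) and absolute not in seen:
                seen.add(absolute)
                order.append(absolute)
                queue.append(absolute)
    return order


def fetch_hrefs_http(url):
    """Fetcher built on the same requests + lxml calls as the question."""
    # Imported here so crawl() can be exercised without these packages
    import requests
    from lxml import html
    r = requests.get(url, timeout=10)
    return html.fromstring(r.content).xpath('//a/@href')
```

With the variables from the question, the call would look like crawl(base_link, domain, fetch_hrefs_http). Each URL enters the queue at most once, so the loop ends when no new internal links turn up or when the max_pages cap is reached.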
【Discussion】:
Tags: python python-2.7 python-requests