Getting all links of a website: answers

【Question title】: Getting all links of a website
【Posted】: 2017-05-30 03:03:05
【Question】:

Hello, I want to build a mini crawler without using Scrapy.

I came up with something like this:

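import requests
from bs4 import BeautifulSoup

# url is the site's start page (its value is not shown in the question)
response = requests.get(url)
homepage_link_list = []
soup = BeautifulSoup(response.content, 'lxml')
# Collect every href found on the homepage
for link in soup.find_all("a"):
    if link.get("href"):
        homepage_link_list.append(link.get("href"))


link_list = []
# Visit each homepage link and collect the hrefs on those pages (one level only)
for item in homepage_link_list:
    response = requests.get(item)
    soup = BeautifulSoup(response.content, 'lxml')
    for link in soup.find_all("a"):
        if link.get("href"):
            link_list.append(link.get("href"))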

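The problem is that it only collects the links found on the pages linked from the homepage. How can I make it collect all links from every page of the website?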

【Comments】:

Video tutorial: How to Build a Web Crawler

Tags: python web-scraping beautifulsoup python-requests

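【Solution 1】:

You need a recursive call flow. I have written class-based code below. The key points are:

This implementation is depth-first.
It keeps track of URLs that have already been scraped so that they are not scraped again.
It ignores targets (fragments) within a page. E.g. for http://example.com#item1, item1 is ignored.
If https://example.com has already been scraped, http://example.com is ignored.
It drops trailing slashes. E.g. if http://example.com has already been scraped, http://example.com/ is ignored.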

''' Scraper.
'''

import re
from urllib.parse import urljoin, urlsplit, SplitResult
import requests
from bs4 import BeautifulSoup


class RecursiveScraper:
    ''' Scrape URLs in a recursive manner.
    '''
    def __init__(self, url):
        ''' Constructor to initialize domain name and main URL.
        '''
        self.domain = urlsplit(url).netloc
        self.mainurl = url
        self.urls = set()

    def preprocess_url(self, referrer, url):
        ''' Clean and filter URLs before scraping.
        '''
        if not url:
            return None

        fields = urlsplit(urljoin(referrer, url))._asdict() # convert to absolute URLs and split
        fields['path'] = re.sub(r'/$', '', fields['path']) # remove trailing /
        fields['fragment'] = '' # remove targets within a page
        fields = SplitResult(**fields)
        if fields.netloc == self.domain:
            # Scrape pages of current domain only
            if fields.scheme == 'http':
                httpurl = cleanurl = fields.geturl()
                httpsurl = httpurl.replace('http:', 'https:', 1)
            else:
                httpsurl = cleanurl = fields.geturl()
                httpurl = httpsurl.replace('https:', 'http:', 1)
            if httpurl not in self.urls and httpsurl not in self.urls:
                # Return URL only if it's not already in list
                return cleanurl

        return None

    def scrape(self, url=None):
        ''' Scrape the URL and its outward links in a depth-first order.
            If URL argument is None, start from main page.
        '''
        if url is None:
            url = self.mainurl

        print("Scraping {:s} ...".format(url))
        self.urls.add(url)
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'lxml')
        for link in soup.find_all("a"):
            childurl = self.preprocess_url(url, link.get("href"))
            if childurl:
                self.scrape(childurl)


if __name__ == '__main__':
    rscraper = RecursiveScraper("http://bbc.com")
    rscraper.scrape()
    print(rscraper.urls)
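
The URL clean-up rules listed in the key points can be tried in isolation, without any network access. The following is only a minimal sketch using the standard library; the helper names normalize and scheme_insensitive_key are made up for this illustration and are not part of the class above.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize(referrer, href):
    """Resolve href against referrer, then drop the fragment and one trailing slash."""
    parts = urlsplit(urljoin(referrer, href))
    path = parts.path[:-1] if parts.path.endswith("/") else parts.path
    # The empty string in the last position discards the fragment (#item1 etc.)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


def scheme_insensitive_key(url):
    """Hypothetical key that treats the http and https forms of a URL as one page."""
    parts = urlsplit(url)
    return (parts.netloc, parts.path, parts.query)


base = "http://example.com/news/"
print(normalize(base, "#item1"))          # fragment only: resolves to the page itself
print(normalize(base, "/about/"))         # absolute path with a trailing slash
print(normalize(base, "story.html#top"))  # relative path with a fragment
```

Storing scheme_insensitive_key(url) in the visited set would be one alternative to checking both the http and https spellings, as preprocess_url() does.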

【Comments】:

It's a pity this has not been selected as the answer.
Under the else: block of preprocess_url(), httpurl = httpurl.replace('https:', 'http:', 1) should be httpurl = httpsurl.replace('https:', 'http:', 1).

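【Solution 2】:

Perhaps the links you are trying to scrape are not actually links. They may be images. Sorry for writing this as an answer; I don't have enough reputation to comment.

【Comments】:

Yes, I know, but I want to get all the links within links, similar to what Screaming Frog does.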

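【Solution 3】:

Your code does not get all the links of the website because it is not recursive. You fetch the homepage links and then traverse the links found in the content of those pages. However, you do not traverse the links you obtain from the content of the pages you just visited. My suggestion is to look at some tree traversal algorithms and develop a (recursive) traversal scheme based on one of them. The nodes of the tree represent links, and the root node is the link you pass in at the start.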
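
As a rough illustration of the traversal idea, here is a minimal sketch that walks the link graph breadth-first with an explicit queue instead of recursion, which also avoids Python's recursion limit on large sites. The names crawl, get_links, max_pages and FAKE_SITE are assumptions made for this example; FAKE_SITE is an in-memory stand-in for real pages so the sketch runs offline. In practice get_links would fetch the page and return its href values, for instance with requests and BeautifulSoup as in the question.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlsplit


def crawl(start_url, get_links, max_pages=100):
    """Visit same-domain pages reachable from start_url in breadth-first order.

    get_links(url) is a caller-supplied function returning the href values
    found on that page.
    """
    domain = urlsplit(start_url).netloc
    seen = {start_url}          # URLs already queued, so none is visited twice
    queue = deque([start_url])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for href in get_links(url):
            # Make the link absolute and drop any #fragment
            child = urldefrag(urljoin(url, href)).url
            if urlsplit(child).netloc == domain and child not in seen:
                seen.add(child)
                queue.append(child)
    return order


# Hypothetical in-memory "site" standing in for real HTTP requests.
FAKE_SITE = {
    "http://example.com/": ["/a", "/b", "http://other.example/x"],
    "http://example.com/a": ["/c", "/#top"],
    "http://example.com/b": ["/a"],
    "http://example.com/c": [],
}

pages = crawl("http://example.com/", lambda url: FAKE_SITE.get(url, []))
print(pages)
```

Replacing queue.popleft() with queue.pop() turns the same loop into a depth-first traversal.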

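【Comments】:

Do you have a sample?
Yes, but unfortunately not right now (I don't have my laptop with me). But why do you need a sample? Your own code can be made recursive with a simple change. I suggest you walk through a tree traversal once; then you won't need a sample.
How do I convert it into a tree traversal? Sorry, I'm not familiar with it.
Actually, no. I hope you can help me.
Before you start developing scripts, you should first learn some computer science fundamentals, such as data structures and algorithms. This will help you a lot in the long run. I suggest you get that head start before you begin writing scripts and programs.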