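Get all URLs from a website using Python: answers

【Question title】: Get all urls from a website using python
【Posted】: 2014-06-21 13:43:31
【Question】:

I am learning to build web crawlers and am currently working on getting all the URLs from a site. I have been experimenting and no longer have the exact code I started with, but I can already collect all the links on a page. My problem is the recursion: I need to repeat the same step for every link that is found, and I suspect the recursion is doing exactly what my code tells it to do, which is not what I want. My code is below.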

#!/usr/bin/python
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

def getAllUrl(url):
    page = urllib2.urlopen( url ).read()
    urlList = []
    try:
        soup = BeautifulSoup(page)
        soup.prettify()
        for anchor in soup.findAll('a', href=True):
            if not 'http://' in anchor['href']:
                if urlparse.urljoin('http://bobthemac.com', anchor['href']) not in urlList:
                    urlList.append(urlparse.urljoin('http://bobthemac.com', anchor['href']))
            else:
                if anchor['href'] not in urlList:
                    urlList.append(anchor['href'])

        length = len(urlList)

        for url in urlList:
            getAllUrl(url)

        return urlList
    except urllib2.HTTPError, e:
        print e

if __name__ == "__main__":
    urls = getAllUrl('http://bobthemac.com')
    for x in urls:
        print x

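What I want is to get every URL on the site. With the current setup the program runs until it runs out of memory; all I want is the list of URLs for one site. Does anyone know how to do this? I think I have the right idea and only need some small changes to the code.

EDIT

For those who are interested, below is my working code that gets all the URLs for the site; someone may find it useful. It is not the best code and needs some work, but with some work it could be quite good.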

#!/usr/bin/python
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

def getAllUrl(url):
    urlList = []
    try:
        page = urllib2.urlopen( url ).read()
        soup = BeautifulSoup(page)
        soup.prettify()
        for anchor in soup.findAll('a', href=True):
            if not 'http://' in anchor['href']:
                if urlparse.urljoin('http://bobthemac.com', anchor['href']) not in urlList:
                    urlList.append(urlparse.urljoin('http://bobthemac.com', anchor['href']))
            else:
                if anchor['href'] not in urlList:
                    urlList.append(anchor['href'])

        return urlList

    except urllib2.HTTPError, e:
        urlList.append( e )

if __name__ == "__main__":
    urls = getAllUrl('http://bobthemac.com')

    fullList = []

    for x in urls:
        listUrls = list
        listUrls = getAllUrl(x)
        try:
            for i in listUrls:
                if not i in fullList:
                    fullList.append(i)
        except TypeError, e:
            print 'Woops wrong content passed'

    for i in fullList:
        print i
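
The link-collection step in the code above can also be sketched in Python 3 with only the standard library (the thread itself uses Python 2 with urllib2 and BeautifulSoup, so this is a port under that assumption, not the asker's code). `urljoin` leaves absolute URLs unchanged, so the `'http://' in href` test is not needed, and comparing hostnames keeps only links that stay on the site. The names `LinkCollector` and `same_site_links` are made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(base_url, html):
    """Return absolute, de-duplicated links in html that stay on base_url's host."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(base_url).netloc
    links = []
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" so /about and /about#team match.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parts.netloc == host and absolute not in links:
            links.append(absolute)
    return links


sample = (
    '<a href="/about">About</a> '
    '<a href="http://example.org/">Elsewhere</a> '
    '<a href="/about#team">Team</a>'
)
print(same_site_links("http://bobthemac.com", sample))
```

Passing the URL of the page being parsed as `base_url`, rather than a hard-coded site root, lets relative links on nested pages resolve correctly.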

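【Comments】:

It looks like your function doesn't return anything.
Yes, it is a work in progress. The 'print urlList' is where the return was; I was just trying something. Edited to show where the return goes.
I hate it when people downvote for no reason.
You create a recursion and never break out of it; I think that is why your program never ends until it runs out of memory.
I know, I mentioned that in my post. I want to find out whether the method I am using is correct and how to break out of it.

Tags: python beautifulsoup urllib2 web-crawler

【Solution 1】:

I think this works: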

#!/usr/bin/python
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

def getAllUrl(url):
    try:
        page = urllib2.urlopen( url ).read()
    except:
        return []
    urlList = []
    try:
        soup = BeautifulSoup(page)
        soup.prettify()
        for anchor in soup.findAll('a', href=True):
            if not 'http://' in anchor['href']:
                if urlparse.urljoin(url, anchor['href']) not in urlList:
                    urlList.append(urlparse.urljoin(url, anchor['href']))
            else:
                if anchor['href'] not in urlList:
                    urlList.append(anchor['href'])

        length = len(urlList)

        return urlList
    except urllib2.HTTPError, e:
        print e

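def listAllUrl(urls):
    # Iterate over a copy: removing from a list while looping over it skips items.
    for x in list(urls):
        print x
        urls.remove(x)
        urls_tmp = getAllUrl(x)
        for y in urls_tmp:
            urls.append(y)


if __name__ == "__main__":
    urls = ['http://bobthemac.com']
    # urls.count is a method, so `urls.count > 0` is always true in Python 2;
    # test the length instead, and do not re-fetch the seed on every pass.
    # Note: with no record of visited URLs this still loops on pages that link to each other.
    while len(urls) > 0:
        listAllUrl(urls)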

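【Discussion】:

This does the same as what I was doing: it finds the links on the page and then repeats them over and over.
I edited the code to add a urls.remove(x) line after print(x), so that memory does not keep growing even as the crawl keeps going. You can see the difference by adding print len(urls) and running once with the urls.remove(x) line commented out and once with it in place.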
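
One way to sketch the idea in this discussion (processed URLs leave the list, nothing is fetched twice) is a queue plus a set of URLs already seen. This is a Python 3 illustration, not the answer's code; `crawl`, `get_links`, and the in-memory `FAKE_SITE` are hypothetical names, and `FAKE_SITE` stands in for real HTTP requests so the example runs offline.

```python
from collections import deque


def crawl(start_url, get_links):
    """Breadth-first walk; get_links(url) must return the URLs found on that page."""
    seen = {start_url}          # every URL ever queued, so nothing is queued twice
    queue = deque([start_url])  # URLs still waiting to be processed
    order = []
    while queue:
        url = queue.popleft()   # a processed URL leaves the queue for good
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical three-page site whose pages link back to each other.
FAKE_SITE = {
    "http://bobthemac.com": ["http://bobthemac.com/a", "http://bobthemac.com/b"],
    "http://bobthemac.com/a": ["http://bobthemac.com", "http://bobthemac.com/b"],
    "http://bobthemac.com/b": ["http://bobthemac.com/a"],
}

for page in crawl("http://bobthemac.com", lambda url: FAKE_SITE.get(url, [])):
    print(page)
```

The loop ends once the queue is empty, which happens even though the pages link to each other, because the `seen` set refuses to queue a URL a second time.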

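【Solution 2】:

In your function getAllUrl, you call getAllUrl again inside the for loop, which makes it recursive.

Elements are never removed from urlList once they are added, so urlList is never empty and the recursion never stops.

That is why your program never ends until it runs out of memory.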
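
If the recursive shape is kept, a common way to give it a stopping condition is to share one set of visited URLs across all calls and, optionally, cap the depth. The sketch below is a Python 3 illustration under those assumptions; `crawl_recursive`, `get_links`, `max_depth`, and `DEMO_SITE` are hypothetical names, and the in-memory site replaces real page fetches.

```python
def crawl_recursive(url, get_links, visited=None, max_depth=5):
    """Depth-first walk that stops at already visited URLs or when depth runs out."""
    if visited is None:
        visited = set()          # created once, then shared by every nested call
    if url in visited or max_depth < 0:
        return visited           # base case: nothing new to do here
    visited.add(url)
    for link in get_links(url):
        crawl_recursive(link, get_links, visited, max_depth - 1)
    return visited


# Hypothetical site with a cycle: root -> a -> b -> root.
DEMO_SITE = {
    "http://bobthemac.com": ["http://bobthemac.com/a"],
    "http://bobthemac.com/a": ["http://bobthemac.com/b"],
    "http://bobthemac.com/b": ["http://bobthemac.com"],
}


def demo_links(url):
    return DEMO_SITE.get(url, [])


print(sorted(crawl_recursive("http://bobthemac.com", demo_links)))
```

The depth cap also keeps the call stack shallow, which matters because Python limits recursion depth by default.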

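【Discussion】:

I know, I understand this. I may not have explained what I need: I am looking at doing some recursion.
I cannot explain it in a few words, but I wrote a library for similar work (crawling links recursively). Here is a link: github.com/zhaoqifa/scod/blob/master/lib/utils.py. Start with the crawl_links function.
Thanks, I finally got it sorted. Pointing me at your function helped, though probably not in the way you expected. A few small tweaks got me the right code, and it works well. I will post it above; it could probably be faster.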