Python BeautifulSoup - Grab internal links from page

【Question title】: Python BeautifulSoup - Grab internal links from page
【Posted】: 2012-05-04 16:41:35
【Question】:

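I have a basic loop that finds the links on a page I retrieved with urllib2.urlopen, but I only want to follow the internal links on that page.

Any ideas how to make my loop below pick up only links on the same domain?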

for tag in soupan.findAll('a', attrs={'href': re.compile("^http://")}):
    webpage = urllib2.urlopen(tag['href']).read()
    print 'Deep crawl ----> ' + str(tag['href'])
    try:
        code-to-look-for-some-data...

    except Exception, e:
        print e

【Comments】:

Tags: python web-crawler beautifulsoup

【Solution 1】:

>>> import urllib
>>> print urllib.splithost.__doc__
splithost('//host[:port]/path') --> 'host[:port]', '/path'.

A URL belongs to the same host if its host matches the host being crawled, or if it has no host at all (as with relative paths).

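import re
import traceback
import urllib
import urllib2

# theHostToCrawl: lower-case host name of the site being crawled, defined elsewhere
for tag in soupan.findAll('a', attrs={'href': re.compile("^http://")}):
    href = tag['href']
    protocol, url = urllib.splittype(href)  # 'http://www.xxx.de/3/4/5' => ('http', '//www.xxx.de/3/4/5')
    host, path = urllib.splithost(url)      # '//www.xxx.de/3/4/5' => ('www.xxx.de', '/3/4/5')
    # splithost() returns None as the host when the URL has no '//host' part
    # (relative paths), so check that a host exists before calling lower()
    if host and host.lower() != theHostToCrawl:
        continue  # link points to a different host

    webpage = urllib2.urlopen(href).read()

    print 'Deep crawl ----> ' + str(href)
    try:
        code-to-look-for-some-data...

    except:
        traceback.print_exc()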
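
For readers on Python 3, where urllib2 no longer exists and the split helpers are deprecated in urllib.parse, the same host rule might be sketched with urlsplit. This is only an illustration: the function name is_same_host is hypothetical, and www.xxx.de is the sample host from the comments above.

```python
from urllib.parse import urlsplit


def is_same_host(href, host_to_crawl):
    """Return True if href has no host (relative link) or names host_to_crawl."""
    # netloc is '' for relative paths such as '/folder/file.htm'
    host = urlsplit(href).netloc.lower()
    # Caveat: links like 'mailto:...' also have an empty netloc and would
    # need to be filtered separately before crawling.
    return host == '' or host == host_to_crawl.lower()


same = is_same_host('http://www.xxx.de/3/4/5', 'www.xxx.de')
relative = is_same_host('/folder/file.htm', 'www.xxx.de')
other = is_same_host('http://other.example/page', 'www.xxx.de')
```

Note that netloc, like splithost, keeps any port, so 'www.xxx.de:8080' would not compare equal to 'www.xxx.de'.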

Because you filter with

'href': re.compile("^http://")

relative paths are never matched, for example

<a href="/folder/file.htm"></a>

Maybe don't use re at all?
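
One way to drop the regex, sketched here in Python 3 under some assumptions: collect every href (for instance from all 'a' tags that have one), resolve it against the page URL with urljoin so relative paths become absolute, and then compare hosts. The function name internal_links and the sample URLs other than www.xxx.de are made up for illustration.

```python
from urllib.parse import urljoin, urlsplit


def internal_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep only those on the same host."""
    page_host = urlsplit(page_url).netloc.lower()
    result = []
    for href in hrefs:
        # Relative paths become absolute; absolute URLs are left unchanged
        absolute = urljoin(page_url, href)
        parts = urlsplit(absolute)
        # Skip non-web schemes (mailto:, javascript:, ...) and other hosts
        if parts.scheme in ('http', 'https') and parts.netloc.lower() == page_host:
            result.append(absolute)
    return result


links = internal_links(
    'http://www.xxx.de/3/4/5',
    ['/folder/file.htm', 'http://www.xxx.de/a',
     'http://other.example/b', 'mailto:me@xxx.de'],
)
```

Because the returned URLs are already absolute, they can be passed straight to whatever function fetches the next page.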

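【Comments】:

Not sure I understand how to implement this in my loop, but I see the logic :) Do you know how to implement it in the loop?
You say not to use re at all, but you could come up with a regex that matches both http://whatever and (no http://).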
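
The regex idea from the last comment could look roughly like the following Python 3 sketch. It is an assumption about what the commenter meant, not their code: the helper name internal_href_pattern is hypothetical, and it also accepts https://, which the original pattern did not. The compiled pattern could be used as the 'href' filter in place of re.compile("^http://").

```python
import re


def internal_href_pattern(host):
    """Build a regex for links on `host` or links without any scheme."""
    pattern = (
        # Absolute link to the crawled host, followed by a delimiter or the end
        r'^(?:https?://' + re.escape(host) + r'(?=[/:?#]|$)'
        # Or: no 'scheme:' prefix and no protocol-relative '//' prefix
        r'|(?![a-z][a-z0-9+.-]*:|//))'
    )
    return re.compile(pattern, re.IGNORECASE)


internal = internal_href_pattern('www.xxx.de')
matches_absolute = bool(internal.search('http://www.xxx.de/3/4/5'))
matches_relative = bool(internal.search('/folder/file.htm'))
matches_other = bool(internal.search('http://other.example/x'))
```

A pattern like this only selects the links; relative ones still have to be joined with the page URL before they can be opened.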

【Solution 2】:

A suggestion for your crawler: use mechanize together with BeautifulSoup; it will simplify your task considerably.

【Comments】: