【Question Title】How can I get the full link from beautifulsoup instead of only the internal link
【Posted】2015-06-10 02:09:57
【Question Description】

I'm new to Python. I'm building a crawler for the company I work for. While crawling its website, there is an internal link that isn't in the format the site usually uses. How can I get the whole link instead of just the relative path? In case I'm not being clear, here is the code I wrote:

import urllib2
from bs4 import BeautifulSoup

web_page_string = []

def get_first_page(seed):
    response = urllib2.urlopen(seed)
    web_page = response.read()
    soup = BeautifulSoup(web_page, 'html.parser')  # name a parser explicitly
    for link in soup.find_all('a'):
        print link.get('href')
    print soup


print get_first_page('http://www.fashionroom.com.br')
print web_page_string

【Question Comments】

  • What do you mean by the whole link?
  • print seed + '/' + link.get('href')?
  • In the case above I want to get http://www.fashionroom.com.br/indexnew.html. Instead, I just got indexnew.html

Tags: python web-scraping beautifulsoup web-crawler


【Solution 1】

Thanks everyone for the answers. I tried adding an if to the script. If anyone spots a potential problem with it that I'll run into in the future, please let me know:

import urllib2
from bs4 import BeautifulSoup

web_page_string = []

def get_first_page(seed):
    response = urllib2.urlopen(seed)
    web_page = response.read()
    soup = BeautifulSoup(web_page, 'html.parser')  # name a parser explicitly
    final_page_string = soup.get_text()
    for link in soup.find_all('a'):
        href = link.get('href')
        if href is None:
            continue  # <a> tags without an href would crash the slice below
        if href[0:4] == 'http':
            print href
        else:
            print seed + '/' + href
    print final_page_string


print get_first_page('http://www.fashionroom.com.br')
print web_page_string
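A note on the string-concatenation approach above: it can produce double slashes (when the href already starts with `/`) and it doesn't handle root-relative or protocol-relative links. The standard library's `urljoin` resolves all of these cases against the seed URL. A minimal sketch (in Python 3 it lives in `urllib.parse`; in Python 2, which the `urllib2` code above uses, the equivalent import is `from urlparse import urljoin`):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

seed = 'http://www.fashionroom.com.br'

# Relative path: resolved against the seed.
print(urljoin(seed, 'indexnew.html'))
# -> http://www.fashionroom.com.br/indexnew.html

# Root-relative path: no double slash is produced.
print(urljoin(seed, '/css/style.css'))
# -> http://www.fashionroom.com.br/css/style.css

# Absolute URL: returned unchanged.
print(urljoin(seed, 'http://example.com/page'))
# -> http://example.com/page
```

Using `urljoin(seed, href)` in the loop would replace both branches of the `if` above with a single call.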

【Discussion】
