【Question Title】How can I get the full link from beautifulsoup instead of only the internal link
【Posted】2015-06-10 02:09:57
【Question Description】

I'm new to Python. I'm building a crawler for the company I work for. While crawling its website, there is an internal link that isn't in the format the site usually uses. How can I get the whole link instead of just the relative path? In case I'm not being clear, here is the code I wrote:

import urllib2
from bs4 import BeautifulSoup

web_page_string = []

def get_first_page(seed):
    response = urllib2.urlopen(seed)
    web_page = response.read()
    soup = BeautifulSoup(web_page, 'html.parser')  # name a parser explicitly
    for link in soup.find_all('a'):
        print link.get('href')
    print soup


print get_first_page('http://www.fashionroom.com.br')
print web_page_string

【Question Comments】

  • What do you mean by the whole link?
  • print seed + '/' + link.get('href')?
  • In the case above I want to get http://www.fashionroom.com.br/indexnew.html. Instead, I just got indexnew.html

Tags: python web-scraping beautifulsoup web-crawler


【Solution 1】

Thanks everyone for the answers. I tried adding an if to the script. If anyone spots a potential problem with it that I'll run into in the future, please let me know:

import urllib2
from bs4 import BeautifulSoup

web_page_string = []

def get_first_page(seed):
    response = urllib2.urlopen(seed)
    web_page = response.read()
    soup = BeautifulSoup(web_page, 'html.parser')  # name a parser explicitly
    final_page_string = soup.get_text()
    for link in soup.find_all('a'):
        href = link.get('href')
        if href is None:
            continue  # <a> tags without an href would crash the slice below
        if href[0:4] == 'http':
            print href
        else:
            print seed + '/' + href
    print final_page_string


print get_first_page('http://www.fashionroom.com.br')
print web_page_string
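A note on the string-concatenation approach above: it can produce double slashes (when the href already starts with `/`) and it doesn't handle root-relative or protocol-relative links. The standard library's `urljoin` resolves all of these cases against the seed URL. A minimal sketch (in Python 3 it lives in `urllib.parse`; in Python 2, which the `urllib2` code above uses, the equivalent import is `from urlparse import urljoin`):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

seed = 'http://www.fashionroom.com.br'

# Relative path: resolved against the seed.
print(urljoin(seed, 'indexnew.html'))
# -> http://www.fashionroom.com.br/indexnew.html

# Root-relative path: no double slash is produced.
print(urljoin(seed, '/css/style.css'))
# -> http://www.fashionroom.com.br/css/style.css

# Absolute URL: returned unchanged.
print(urljoin(seed, 'http://example.com/page'))
# -> http://example.com/page
```

Using `urljoin(seed, href)` in the loop would replace both branches of the `if` above with a single call.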

【Discussion】
