How to use Beautiful Soup to return the destination from HTML anchor tags (answers)

【Question title】: How to use Beautiful soup to return destination from HTML anchor tags
【Posted】: 2014-10-09 18:22:11
【Question description】:

I am using Python 2 and Beautiful Soup to parse HTML retrieved with the requests module:

import requests
from bs4 import BeautifulSoup

site = requests.get("http://www.stackoverflow.com/")
HTML = site.text
links = BeautifulSoup(HTML, 'html.parser').find_all('a')  # name the parser explicitly

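This returns a list containing output similar to <a href="hereorthere.com">Navigate</a>

The content of each anchor tag's href attribute can take several forms: for example, it can be a javascript call on the page, a relative address of a page on the same domain (/next/one/file.php), or a specific web address (http://www.stackoverflow.com/).

Is it possible with BeautifulSoup to return the web addresses of both relative and specific addresses in one list, excluding all javascript calls and the like, leaving only navigable links?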

【Comments】:

Is this what you are looking for?: stackoverflow.com/questions/9057809/…

Tags: python beautifulsoup

【Solution 1】:

From the BS docs:

One common task is extracting all the URLs found within a page’s <a> tags:

for link in soup.find_all('a'):
    print(link.get('href'))
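
One detail worth noting about the loop above: link.get('href') returns None for an anchor that has no href attribute. A minimal sketch of this, using a made-up HTML snippet rather than a fetched page (the snippet and variable names are assumptions for illustration only):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: two links and one named anchor without href.
html = (
    '<a href="/next/one/file.php">Next</a>'
    '<a name="top">Anchor without href</a>'
    '<a href="http://www.stackoverflow.com/">Home</a>'
)
soup = BeautifulSoup(html, 'html.parser')

# .get() yields None when the attribute is missing.
all_values = [link.get('href') for link in soup.find_all('a')]

# Passing href=True restricts the search to anchors that have an href.
with_href = [link['href'] for link in soup.find_all('a', href=True)]

print(all_values)
print(with_href)
```

Filtering with href=True up front avoids having to check for None later.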

【Discussion】:

【Solution 2】:

You can filter out the href="javascript:whatever()" cases like this:

hrefs = []
for link in soup.find_all('a'):
    # has_attr() is the bs4 replacement for the deprecated has_key()
    if link.has_attr('href') and not link['href'].lower().startswith('javascript:'):
        hrefs.append(link['href'])
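
To also get relative and absolute addresses into one list, one possible approach is to resolve every href against the page URL with the standard library's urljoin and keep only http/https results. This is a sketch under assumptions, not part of the answers above: the helper name navigable_links, the sample href list, and the choice to skip fragment-only links such as "#top" are all made up for illustration.

```python
try:
    from urllib.parse import urljoin, urlparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse  # Python 2


def navigable_links(hrefs, base_url):
    """Return absolute http(s) URLs for the given href values (hypothetical helper)."""
    result = []
    for href in hrefs:
        if not href:
            continue  # skip None (no href attribute) and empty strings
        href = href.strip()
        if href.startswith('#'):
            continue  # assumption: same-page fragments are not wanted
        # Relative addresses are resolved against the page URL;
        # absolute ones (including javascript: and mailto:) pass through unchanged.
        absolute = urljoin(base_url, href)
        # Keeping only http/https drops javascript:, mailto: and similar schemes.
        if urlparse(absolute).scheme in ('http', 'https'):
            result.append(absolute)
    return result


# Hypothetical href values of the kinds described in the question.
sample = [
    '/next/one/file.php',
    'http://www.stackoverflow.com/',
    'javascript:void(0)',
    '#top',
    None,
    'mailto:someone@example.com',
]
links = navigable_links(sample, 'http://www.stackoverflow.com/')
print(links)
```

With Beautiful Soup, the input list could be built as [a.get('href') for a in soup.find_all('a')], and base_url would be the address that was passed to requests.get.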

【Discussion】: