Retrieve links from a web page using Python and BeautifulSoup [closed]: answers

【Question title】: retrieve links from web page using python and BeautifulSoup [closed]
【Posted】: 2010-11-08 00:01:44
【Question description】:

How can I retrieve the links of a web page with Python and copy the URL address of each link?

【Question comments】:

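Here is an updated code snippet that does exactly what you are asking for in 30 lines. github.com/mujeebishaque/extract-urls
I tried this link and got output like this/info-service/downloads/#unserekataloge' . Is it not possible to get the full, accessible link rather than just part of the sub-link? I want to get the links to all the PDF files available on the website @MujeebIshaque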

Tags: python web-scraping hyperlink beautifulsoup

【Solution 1】:

Here is a short snippet using the SoupStrainer class in BeautifulSoup:

import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])

The BeautifulSoup documentation is actually quite good and covers a number of typical scenarios:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

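Edit: Note that I used the SoupStrainer class because it is a bit more efficient (in both memory and speed) if you know in advance what you are parsing.

【Discussion】:

+1, using the soup strainer is a great idea because it lets you avoid a lot of unnecessary parsing when all you are after are the links.
Heads up: /usr/local/lib/python2.7/site-packages/bs4/__init__.py:128: UserWarning: The "parseOnlyThese" argument to the BeautifulSoup constructor has been renamed to "parse_only."
There is no has_attr in BeautifulSoup 3.2.1. Instead there is something called has_key, and it works.
It should be "from bs4 import BeautifulSoup" (not "from BeautifulSoup import BeautifulSoup"); this needs to be corrected.
Updated code for python3 and the latest bs4 - gist.github.com/PandaWhoCodes/7762fac08c4ed005cec82204d7abd61b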

【Solution 2】:

For completeness' sake, here is the BeautifulSoup 4 version, which also makes use of the encoding supplied by the server:

from bs4 import BeautifulSoup
import urllib.request

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib.request.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))

for link in soup.find_all('a', href=True):
    print(link['href'])

or the Python 2 version:

from bs4 import BeautifulSoup
import urllib2

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().getparam('charset'))

for link in soup.find_all('a', href=True):
    print link['href']

and a version using the requests library, which as written will work in both Python 2 and 3:

from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("http://www.gpsbasecamp.com/national-parks")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)

for link in soup.find_all('a', href=True):
    print(link['href'])

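The soup.find_all('a', href=True) call finds all <a> elements that have an href attribute; elements without the attribute are skipped.

BeautifulSoup 3 stopped development in March 2012; new projects really should always use BeautifulSoup 4.

Note that you should leave decoding the HTML from bytes to BeautifulSoup. You can inform BeautifulSoup of the character set found in the HTTP response headers to assist in decoding, but this can be wrong and can conflict with <meta> header information found in the HTML itself. That is why the code above uses the BeautifulSoup internal class method EncodingDetector.find_declared_encoding() to make sure that such embedded encoding hints win over a misconfigured server.

With requests, the response.encoding attribute defaults to Latin-1 if the response has a text/* mimetype, even if no character set was returned. This is consistent with the HTTP RFCs but painful when combined with HTML parsing, so you should ignore that attribute when no charset is set in the Content-Type header.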
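
The precedence described above can be sketched with the standard library alone, so it runs without a network request. This is only an illustration: the helper names charset_from_content_type and choose_encoding are hypothetical and are not part of requests or BeautifulSoup, and the declared encoding is passed in as a plain string instead of being detected from the document.

```python
def charset_from_content_type(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    # Parameters follow the media type, separated by semicolons.
    for part in content_type.split(';')[1:]:
        name, _, value = part.strip().partition('=')
        if name.lower() == 'charset' and value:
            return value.strip('"\'').lower()
    return None


def choose_encoding(content_type, declared_in_html):
    """Prefer the encoding declared inside the HTML over the HTTP header."""
    return declared_in_html or charset_from_content_type(content_type)


# The header says Latin-1, the document itself says UTF-8: the document wins.
print(choose_encoding('text/html; charset=ISO-8859-1', 'utf-8'))
# No charset anywhere: return None and let the parser detect the encoding.
print(choose_encoding('text/html', None))
```

Returning None in the last case mirrors the advice above: when the header carries no charset, do not fall back to a default and leave detection to the parser.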

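【Discussion】:

Is there something like StrainedSoup for bs4? (I don't need it right now, just wondering; if there is, you might want to add it.)
@AnttiHaapala: SoupStrainer, you mean? It didn't go anywhere; it is still part of the project.
Is there a reason this code doesn't pass "features=" to the BeautifulSoup constructor? BeautifulSoup gives me a warning about using a default parser.
@MikeB: when I wrote this answer, BeautifulSoup didn't yet raise a warning if you didn't.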

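【Solution 3】:

A page can contain many duplicate links, as well as both external and internal links. To tell the two apart and collect only unique links, use sets: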

# Python 3.
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

url = "http://www.espncricinfo.com/"
resp = urllib.request.urlopen(url)
# Get server encoding per recommendation of Martijn Pieters.
soup = BeautifulSoup(resp, 'html.parser', from_encoding=resp.info().get_param('charset'))
external_links = set()
internal_links = set()
for line in soup.find_all('a'):
    link = line.get('href')
    if not link:
        continue
    if link.startswith('http'):
        external_links.add(link)
    else:
        internal_links.add(link)

# Depending on usage, full internal links may be preferred.
full_internal_links = {
    urllib.parse.urljoin(url, internal_link) 
    for internal_link in internal_links
}

# Print all unique external and full internal links.
for link in external_links.union(full_internal_links):
    print(link)
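
One thing to be aware of: testing with startswith('http') files an absolute URL that points back to the same site under external links. A possible refinement, sketched below under the assumption that "internal" means "same host as the page", compares hosts instead. The function name split_links and the example.com URLs are made up for illustration.

```python
from urllib.parse import urljoin, urlparse


def split_links(page_url, hrefs):
    """Return (internal, external) sets of absolute URLs."""
    page_host = urlparse(page_url).netloc.lower()
    internal, external = set(), set()
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolves relative hrefs
        parts = urlparse(absolute)
        if parts.scheme not in ('http', 'https'):
            continue  # skip mailto:, javascript: and similar
        if parts.netloc.lower() == page_host:
            internal.add(absolute)
        else:
            external.add(absolute)
    return internal, external


internal, external = split_links(
    'http://www.example.com/news/',
    ['/about', 'story.html', 'http://www.example.com/about',
     'https://other.example.org/', 'mailto:someone@example.com'],
)
print(sorted(internal))
print(sorted(external))
```

Whether subdomains should count as internal depends on the site, so treat the exact host comparison as one choice among several.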

【Discussion】:

【Solution 4】:

The following code retrieves all the links available in a web page using urllib2 and BeautifulSoup4:

import urllib2
from bs4 import BeautifulSoup

url = urllib2.urlopen("http://www.espncricinfo.com/").read()
soup = BeautifulSoup(url)

for line in soup.find_all('a'):
    print(line.get('href'))

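【Discussion】:

【Solution 5】:

Links can appear in a variety of attributes, so you can pass a list of those attributes to select.

For example, with the src and href attributes (here I use the starts-with operator ^ to specify that the value of either attribute starts with http). You can tailor this as required.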

from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://stackoverflow.com/')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="http"], [src^="http"]') ]
print(links)

Attribute = value selectors

[attr^=value]

Represents elements with an attribute name of attr whose value is prefixed (preceded) by value.
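
To see what the starts-with selector keeps and what it drops, here is a small offline sketch. It uses an inline HTML string (made up for this example) instead of a live request, and html.parser so that lxml does not need to be installed.

```python
from bs4 import BeautifulSoup

html = """
<a href="http://example.com/page">absolute link</a>
<a href="/relative/page">relative link</a>
<img src="https://example.com/logo.png">
<script src="/static/app.js"></script>
"""

soup = BeautifulSoup(html, 'html.parser')

# Either attribute may carry the URL; both must start with "http" to match.
matches = soup.select('[href^="http"], [src^="http"]')
urls = [tag.get('href') or tag.get('src') for tag in matches]
print(urls)
```

The relative href and the relative src are skipped because their values do not begin with http; resolving them would need urljoin against the page URL.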

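【Discussion】:

【Solution 6】:

BeautifulSoup's own parser can be slow. It might be more feasible to use lxml, which is capable of parsing directly from a URL (with some limitations mentioned below).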

import lxml.html

doc = lxml.html.parse(url)

links = doc.xpath('//a[@href]')

for link in links:
    print link.attrib['href']

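The code above returns the links as they are, and in most cases they will be relative links or links that are absolute from the site root. Since my use case was to extract only a certain type of link, below is a version that converts the links to full URLs and optionally accepts a glob pattern such as *.mp3. It will not handle single and double dots in relative paths, but so far I have not needed that. If you need to parse URL fragments containing ../ or ./, then urlparse.urljoin might come in handy.

NOTE: Direct lxml URL parsing does not handle loading from https and does not follow redirects, so the version below uses urllib2 + lxml.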

#!/usr/bin/env python
import sys
import urllib2
import urlparse
import lxml.html
import fnmatch

try:
    import urltools as urltools
except ImportError:
    sys.stderr.write('To normalize URLs run: `pip install urltools --user`')
    urltools = None


def get_host(url):
    p = urlparse.urlparse(url)
    return "{}://{}".format(p.scheme, p.netloc)


if __name__ == '__main__':
    url = sys.argv[1]
    host = get_host(url)
    glob_patt = len(sys.argv) > 2 and sys.argv[2] or '*'

    doc = lxml.html.parse(urllib2.urlopen(url))
    links = doc.xpath('//a[@href]')

    for link in links:
        href = link.attrib['href']

        if fnmatch.fnmatch(href, glob_patt):

            if not href.startswith(('http://', 'https://', 'ftp://')):

                if href.startswith('/'):
                    href = host + href
                else:
                    parent_url = url.rsplit('/', 1)[0]
                    href = urlparse.urljoin(parent_url, href)

                    if urltools:
                        href = urltools.normalize(href)

            print href

The usage is as follows:

getlinks.py http://stackoverflow.com/a/37758066/191246
getlinks.py http://stackoverflow.com/a/37758066/191246 "*users*"
getlinks.py http://fakedomain.mu/somepage.html "*.mp3"
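
For the ../ and ./ case mentioned above, a minimal Python 3 sketch (urlparse.urljoin lives in urllib.parse there) could look like the following. The function name absolute_links and the page path are invented for the example; the glob filter is applied to the raw href, as in the script above.

```python
from fnmatch import fnmatch
from urllib.parse import urljoin


def absolute_links(page_url, hrefs, pattern='*'):
    """Keep hrefs matching the glob pattern and resolve them against the page URL."""
    # urljoin handles root-relative paths as well as ./ and ../ segments.
    return [urljoin(page_url, href) for href in hrefs if fnmatch(href, pattern)]


page = 'http://fakedomain.mu/music/album/somepage.html'
hrefs = ['track1.mp3', './track2.mp3', '../cover.jpg', '/top.mp3']

mp3s = absolute_links(page, hrefs, '*.mp3')
for link in mp3s:
    print(link)
```

Because urljoin resolves against the full page URL, there is no need to split off the host or the parent directory by hand.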

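【Discussion】:

lxml can only handle valid input, so how can it replace BeautifulSoup?
@alexis: I think lxml.html is a bit more lenient than lxml.etree. If your input is not well-formed, you can explicitly set the BeautifulSoup parser: lxml.de/elementsoup.html. And if you do go with BeautifulSoup, then BS3 is a better choice.

【Solution 7】:

I got the answer by @Blairg23 working after the following correction (covering the scenario where it failed to work correctly):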

for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = urlparse.urljoin(url, link['href'])  # the urlparse module needs to be imported
            wget.download(full_path)

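For Python 3:

urllib.parse.urljoin has to be used instead to obtain the full URL.

【Discussion】:

【Solution 8】:

Here is an example that uses @ars's accepted answer together with the BeautifulSoup4, requests, and wget modules to handle the downloads.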

import requests
import wget
import os

from bs4 import BeautifulSoup, SoupStrainer

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
file_type = '.tar.gz'

response = requests.get(url)

for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = url + link['href']
            wget.download(full_path)

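【Discussion】:

【Solution 9】:

To find all the links, this example uses the urllib2 module together with the re module. One of the most powerful functions in the re module is re.findall(). While re.search() finds the first match for a pattern, re.findall() finds all the matches and returns them as a list of strings, with each string representing one match.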

import urllib2

import re
#connect to a URL
website = urllib2.urlopen(url)

#read html code
html = website.read()

#use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)

print links
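
A caveat when trying the pattern above: re.findall() returns plain strings only when the pattern has at most one group. With two groups, as here, each item is a tuple. The sketch below runs on a made-up HTML string and shows a variant with a non-capturing group, which is one way (not the answer's own) to get a flat list of URLs.

```python
import re

html = '<a href="http://example.com/a">A</a> <img src="https://example.com/b.png">'

# Two capturing groups: each result is a (url, scheme) tuple.
with_groups = re.findall(r'"((http|ftp)s?://.*?)"', html)
print(with_groups)

# (?:...) groups without capturing, so only the URL is returned.
urls = re.findall(r'"((?:http|ftp)s?://.*?)"', html)
print(urls)
```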

【Discussion】:

【Solution 10】:

This script does what you are looking for, but it also resolves the relative links to absolute links.

import urllib
import lxml.html
import urlparse

def get_dom(url):
    connection = urllib.urlopen(url)
    return lxml.html.fromstring(connection.read())

def get_links(url):
    return resolve_links((link for link in get_dom(url).xpath('//a/@href')))

def guess_root(links):
    for link in links:
        if link.startswith('http'):
            parsed_link = urlparse.urlparse(link)
            scheme = parsed_link.scheme + '://'
            netloc = parsed_link.netloc
            return scheme + netloc

def resolve_links(links):
    root = guess_root(links)
    for link in links:
        if not link.startswith('http'):
            link = urlparse.urljoin(root, link)
        yield link  

for link in get_links('http://www.google.com'):
    print link
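
One pitfall worth knowing when adapting this script: get_links() wraps the XPath result in a generator expression, and a generator can only be walked once. Any links that guess_root() reads while looking for a root are therefore no longer available to the loop in resolve_links(). The Python 3 sketch below reproduces the effect on a small list and shows one alternative, resolving against the page URL itself; first_absolute and the example.com URLs are hypothetical.

```python
from urllib.parse import urljoin


def first_absolute(links):
    """Return the first link starting with 'http', consuming the iterable up to it."""
    for link in links:
        if link.startswith('http'):
            return link
    return None


hrefs = ['/a', 'http://example.com/b', 'c.html']

gen = (h for h in hrefs)
root_hint = first_absolute(gen)
leftover = list(gen)  # only what the first pass did not consume
print(root_hint, leftover)


def resolve_links(page_url, links):
    """Resolve every link against the URL of the page it was found on."""
    return [urljoin(page_url, link) for link in links]


resolved = resolve_links('http://example.com/dir/page.html', hrefs)
print(resolved)
```

Using the page URL as the base also avoids guessing the root from the first absolute link, which may point to a different site.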

【Discussion】:

This does not do what it is meant to do; if resolve_links() does not have a root, it never returns any URLs.

【Solution 11】:

import urllib2
from bs4 import BeautifulSoup
a=urllib2.urlopen('http://dir.yahoo.com')
code=a.read()
soup=BeautifulSoup(code)
links=soup.findAll("a")
#To get href part alone
print links[0].attrs['href']

【Discussion】:

【Solution 12】:

BeautifulSoup now uses lxml. Requests, lxml, and list comprehensions make a killer combo.

import requests
import lxml.html

dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content)

[x for x in dom.xpath('//a/@href') if '//' in x and 'nytimes.com' not in x]

In the list comprehension, the "if '//' in x and 'url.com' not in x" condition is a simple way to scrub the URL list of the site's 'internal' navigation URLs and the like.
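
A small offline check of that filter, on a made-up list of hrefs. It also shows why the membership test has to be spelled out for each operand: the tempting shorthand '//' and 'nytimes.com' not in x is parsed as '//' and ('nytimes.com' not in x), and since '//' is always truthy it only checks the second part.

```python
links = [
    '/section/world',
    'https://www.nytimes.com/section/world',
    'https://example.org/story',
    '//cdn.example.net/lib.js',
]

# Both tests applied to x: keeps only off-site URLs that contain '//'.
filtered = [x for x in links if '//' in x and 'nytimes.com' not in x]
print(filtered)

# Shorthand: '//' is just a truthy string here, so relative links slip through.
shorthand = [x for x in links if '//' and 'nytimes.com' not in x]
print(shorthand)
```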

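【Discussion】:

If it is a repost, why doesn't the original post include: 1. requests 2. a list comprehension 3. logic to scrub site-internal and junk links? Try comparing the results of the two posts; my list comprehension does a surprisingly good job of scrubbing the junk links.
The OP did not ask for those features, and the part he did ask for had already been posted and solved using exactly the same method as in your post. However, I will remove the downvote, since the list comprehension does add value for people who want those features and you do mention them explicitly in the body of the post. Also, you could use the rep :)

【Solution 13】:

Others have recommended BeautifulSoup, but it is much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It is much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It also has a compatibility API for BeautifulSoup if you don't want to learn the lxml API.

Ian Bicking agrees.

There is no reason to use BeautifulSoup anymore, unless you are on Google App Engine or something else where anything that is not pure Python is not allowed.

lxml.html also supports CSS3 selectors, so this sort of thing is trivial.

An example with lxml and xpath would look like this: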

import urllib
import lxml.html
connection = urllib.urlopen('http://www.nytimes.com')

dom =  lxml.html.fromstring(connection.read())

for link in dom.xpath('//a/@href'): # select the url in href for all a tags(links)
    print link

【Discussion】:

BeautifulSoup 4 will use lxml as the default parser if it is installed.

【Solution 14】:

Why not use regular expressions:

import urllib2
import re
url = "http://www.somewhere.com"
page = urllib2.urlopen(url)
page = page.read()
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)
for link in links:
    print('href: %s, HTML text: %s' % (link[0], link[1]))
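
For readers who want to see what each piece of that pattern does, here is the same expression written with re.VERBOSE so every part can carry a comment. It is run on a made-up HTML string; the name LINK_RE is only for this sketch.

```python
import re

LINK_RE = re.compile(r"""
    <a          # literal start of an anchor tag
    .*?         # any characters, as few as possible (other attributes)
    \s*         # optional whitespace
    href="      # literal href= followed by an opening double quote
    (.*?)       # group 1: the URL, up to the next double quote
    "           # closing double quote
    .*?         # any remaining attributes
    >           # end of the opening tag
    (.*?)       # group 2: the link text (inner HTML)
    </a>        # closing tag
""", re.VERBOSE)

html = '<p><a class="x" href="http://example.com/a">First</a> <a href="/b">Second</a></p>'
links = LINK_RE.findall(html)
for href, text in links:
    print('href: %s, HTML text: %s' % (href, text))

# The pattern insists on double quotes, so this perfectly valid link is missed.
missed = LINK_RE.findall("<a href='/c'>Third</a>")
print(missed)
```

The last line illustrates the usual objection to regex-based HTML handling: small variations in markup silently produce no match.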

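【Discussion】:

I would love to be able to understand this; where can I efficiently find out what (r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page) means? Thanks!
Really a bad idea. Broken HTML is everywhere.
Why not to use regular expressions to parse html: stackoverflow.com/questions/1732348/…
@user1063287, the web is full of regex tutorials. It is well worth your time to read a couple. While REs can get really convoluted, the one you are asking about is pretty basic.

【Solution 15】:

Just for getting the links, without B.soup and without regex: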

import urllib2
url="http://www.somewhere.com"
page=urllib2.urlopen(url)
data=page.read().split("</a>")
tag="<a href=\""
endtag="\">"
for item in data:
    if "<a href" in item:
        try:
            ind = item.index(tag)
            item=item[ind+len(tag):]
            end=item.index(endtag)
        except: pass
        else:
            print item[:end]

For more complex operations, BSoup is of course still preferred.
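
The split-based approach above expects each tag to be written exactly as <a href="...">. If other attributes or line breaks may sit between <a and href, a different technique that still avoids both BeautifulSoup and regex is the standard library's html.parser (Python 3). The sketch below is an illustration on a made-up HTML string; the class name LinkCollector is hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


html = """
<a rel="nofollow"
   href="/episode/1">one</a>
<a name="top">no link here</a>
<a href="/about">about</a>
<a href="http://example.com/Episode-2">two</a>
"""

collector = LinkCollector()
collector.feed(html)
print(collector.links)

# Keeping only some links is then an ordinary list comprehension.
episodes = [link for link in collector.links if 'episode' in link.lower()]
print(episodes)
```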

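【Discussion】:

What if, for example, there is something between <a and href? Say rel="nofollow" or onclick="...", or even just a newline? stackoverflow.com/questions/1732348/…
Is there a way to filter out only some links with this? Say I only want the links that have "Episode" in them?

【Solution 16】: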

import urllib2
import BeautifulSoup

request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
for a in soup.findAll('a'):
  if 'national-park' in a.get('href', ''):  # .get() avoids a KeyError on <a> tags without href
    print 'found a url with national-park in the link'

【Discussion】:

This solved a problem I had with my code. Thank you!