Python BeautifulSoup: Extract specific URLs

【Question title】: Python BeautifulSoup Extract specific URLs
【Posted】: 2013-02-25 03:13:03
【Question】:

Is it possible to get only specific URLs?

For example, given:

<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>

The output should contain only the URLs from http://www.iwashere.com/

That is, the output URLs should be:

http://www.iwashere.com/washere.html
http://www.iwashere.com/wasnot.html

I did it with string logic. Is there a way to do this directly with BeautifulSoup?

【Comments】:

Tags: python python-2.7 web-scraping beautifulsoup

【Solution 1】:

You can match on several aspects of a tag, including using a regular expression for an attribute value:

import re
# Raw string avoids invalid-escape warnings; ^ anchors the match at the start of href
soup.find_all('a', href=re.compile(r'^http://www\.iwashere\.com/'))

which matches (for example):

[<a href="http://www.iwashere.com/washere.html">next</a>, <a href="http://www.iwashere.com/wasnot.html">next</a>]

That is, any <a> tag with an href attribute whose value starts with the string http://www.iwashere.com/.

You can loop over the results and pick out just the href attribute:

>>> for elem in soup.find_all('a', href=re.compile(r'^http://www\.iwashere\.com/')):
...     print(elem['href'])
... 
http://www.iwashere.com/washere.html
http://www.iwashere.com/wasnot.html
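
For readers who want to run the approach end to end, here is a minimal self-contained sketch using the markup from the question. It assumes Python's bundled "html.parser" backend (any installed parser should behave the same here) and builds the pattern with re.escape() instead of escaping the dots by hand. The leading ^ matters because BeautifulSoup applies a compiled pattern with search(), so an unanchored pattern would also match the prefix in the middle of a value.

```python
import re
from bs4 import BeautifulSoup

# Sample markup taken from the question.
html = """
<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>
"""

# "html.parser" ships with Python; this choice is an assumption, not a requirement.
soup = BeautifulSoup(html, "html.parser")

# re.escape() escapes the dots; ^ anchors the match at the start of the href value.
prefix = re.compile("^" + re.escape("http://www.iwashere.com/"))

# Collect only the href values of the matching <a> tags.
urls = [a["href"] for a in soup.find_all("a", href=prefix)]
print(urls)
```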

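To match all relative paths instead, use a negative lookahead assertion to test that the value does not start with a scheme (e.g. http: or mailto:) or a double slash (//hostname/path); any value that passes this test must be a relative path: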

soup.find_all('a', href=re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))'))
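
To see what this pattern accepts and rejects without involving BeautifulSoup at all, the following sketch runs it against a handful of href strings. The sample values are hypothetical and were chosen only to exercise each branch of the lookahead.

```python
import re

# The pattern from the answer: reject values that start with a scheme
# (such as "http:" or "mailto:") or with a double slash.
relative = re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))')

# Hypothetical href values for illustration.
samples = [
    "washere.html",                          # plain relative path
    "/docs/index.html",                      # path relative to the host root
    "../up.html",                            # parent-directory path
    "http://www.iwashere.com/washere.html",  # has a scheme
    "mailto:someone@example.com",            # has a scheme
    "//www.iwashere.com/washere.html",       # scheme-relative, starts with //
]

# Keep only the values the pattern matches.
relative_only = [href for href in samples if relative.search(href)]
print(relative_only)
```

Note that the pattern also matches an empty string and fragment-only values such as "#top", since neither starts with a scheme or a double slash.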

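【Comments】:

Works well. For anyone unfamiliar with the libraries: you need from bs4 import BeautifulSoup and import re.
I have another question. If the links are in the form http://www.iwashere.com/xyz...abc.html, we can extract them perfectly. But what if the links are local, say [<a href="washere.html">next</a>, <a href="wwasnot.html">next</a>]? How do I extract the underlying link? When viewing the HTML page, the links point to the correct location. Is there a way to extract such links?
@searcoding: You need to match anything that does not start with a scheme or a double slash; any href value that does not start with one of those is a relative URL. Use href=re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))') (this is a negative lookahead that tests for a scheme or a double slash; anything that has either does not match).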
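
If the goal is to turn those relative href values back into full URLs, one possible follow-up step (not part of the answer above) is to resolve them with the standard library's urljoin. In this sketch the base_url value is an assumption standing in for the address the page was fetched from.

```python
import re
from urllib.parse import urljoin  # Python 3; on Python 2 this lives in the urlparse module
from bs4 import BeautifulSoup

# Markup modelled on the comment above, plus one absolute link for contrast.
html = """
<a href="washere.html">next</a>
<a href="wwasnot.html">next</a>
<a href="http://www.heelo.com/hello.html">next</a>
"""

# Assumption: the page was downloaded from this address.
base_url = "http://www.iwashere.com/"

soup = BeautifulSoup(html, "html.parser")

# Same negative lookahead as in the answer: no scheme, no leading double slash.
relative = re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))')

# Resolve each relative href against the assumed base URL.
absolute = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=relative)]
print(absolute)
```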

【Solution 2】:

If you are using BeautifulSoup 4.0.0 or later, you can use a CSS attribute selector, where ^= means "starts with":

soup.select('a[href^="http://www.iwashere.com/"]')
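
A minimal runnable sketch of the selector approach, again using the markup from the question and assuming the bundled "html.parser" backend. select() returns a list of tags, so the href values still have to be read from each one.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question.
html = """
<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>
"""

soup = BeautifulSoup(html, "html.parser")

# [href^="..."] selects elements whose href attribute starts with the given string.
links = soup.select('a[href^="http://www.iwashere.com/"]')

# Pull the attribute value out of each matched tag.
urls = [a["href"] for a in links]
print(urls)
```

Compared with the regular-expression version, the selector needs no escaping of the dots, because the prefix is compared as a literal string.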

【Comments】:

【Solution 3】:

You can solve this with partial matching in gazpacho:

Input:

html = """\
<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>
"""

Code:

from gazpacho import Soup

soup = Soup(html)
links = soup.find('a', {'href': "http://www.iwashere.com/"}, partial=True)
[link.attrs['href'] for link in links]

Which outputs:

# ['http://www.iwashere.com/washere.html', 'http://www.iwashere.com/wasnot.html']

【Comments】: