Getting only the raw link using bs4 and requests

【Question title】: Getting only the raw link using bs4 and requests
【Posted】: 2018-07-11 02:36:06
【Question】:

All I want is the raw link, which I can then use to download the image. But I keep getting extra characters along with the link.

from bs4 import BeautifulSoup
import requests

def getPages():
    x = 0
    url = 'https://readheroacademia.net/manga/boku-no-hero-academia-chapter-137/'
    req = requests.get(url)
    webpage = req.content
    soup = BeautifulSoup(webpage, 'html.parser')
    pages = soup.findAll('div', attrs={'class': 'acp_content'})
    for p in pages:
        y = p.findAll('img')
        print(y)
getPages()

What I end up with is this:

[<img src="https://2.bp.blogspot.com/-p72DilhF-_s/WRSF41vu50I/AAAAAAAAlsk/6BTxzQAzPkwteMgEHch2JFH0JKKpbKrZACHM/s16000/0137-001.png"/>]

What I would like to get is this:

https://2.bp.blogspot.com/-p72DilhF-_s/WRSF41vu50I/AAAAAAAAlsk/6BTxzQAzPkwteMgEHch2JFH0JKKpbKrZACHM/s16000/0137-001.png

【Comments】:

Tags: python string web-scraping beautifulsoup python-requests

【Solution 1】:

If you only want the src, you can do this:

for p in pages:
    y = [tag["src"] for tag in p.findAll("img")]
    print(y)

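This takes the URL from each img tag's src attribute instead of returning the whole tag.

Also, if you are using bs4 (BeautifulSoup4), use find_all instead of findAll. findAll is the BeautifulSoup 3 name; bs4 still accepts it as a legacy alias, but find_all is the preferred spelling.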
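As a minimal sketch of the approach above, the following runs the same extraction with find_all against an inline HTML string rather than the live page, so it works offline. The markup, the example.com URLs, and the helper name image_links are made up for illustration; only the acp_content class comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the URLs are made up.
SAMPLE_HTML = """
<div class="acp_content">
  <img src="https://example.com/pages/0137-001.png"/>
  <img src="https://example.com/pages/0137-002.png"/>
</div>
"""


def image_links(markup):
    """Return the src of every <img> inside a div with class acp_content."""
    soup = BeautifulSoup(markup, "html.parser")
    links = []
    for div in soup.find_all("div", class_="acp_content"):
        for img in div.find_all("img"):
            # Indexing the tag like a dict returns the attribute value as a string.
            links.append(img["src"])
    return links


for link in image_links(SAMPLE_HTML):
    print(link)
```

Each printed line is a plain string, so it can be passed straight to requests.get to download the image.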

【Comments】:

【Solution 2】:

I think this will work:

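>>> from bs4 import BeautifulSoup
>>> data = """<img src="https://2.bp.blogspot.com/-p72DilhF-_s/WRSF41vu50I/AAAAAAAAlsk/6BTxzQAzPkwteMgEHch2JFH0JKKpbKrZACHM/s16000/0137-001.png"/>"""
>>> # The "lxml" parser needs the lxml package installed; "html.parser" also works here.
>>> soup = BeautifulSoup(data, "lxml")
>>> for i in soup.find_all("img"):
...     link = i.get("src")  # returns the attribute value, or None if it is absent
...     print(link)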
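Solution 1 indexes the tag (tag["src"]) while this answer calls .get("src"). The sketch below shows how the two differ when an img tag has no src attribute. The markup and URL are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second <img> deliberately has no src attribute.
markup = '<img src="https://example.com/a.png"/><img alt="placeholder"/>'
soup = BeautifulSoup(markup, "html.parser")
first, second = soup.find_all("img")

print(first["src"])       # the attribute value as a plain string
print(second.get("src"))  # None, because the attribute is missing

try:
    second["src"]
except KeyError:
    # Indexing raises KeyError when the attribute does not exist.
    print("img has no src attribute")

# Passing src=True keeps only the tags that actually carry a src attribute.
links = [img["src"] for img in soup.find_all("img", src=True)]
print(links)
```

If some images on a page might lack a src, .get or the src=True filter avoids an exception partway through the loop.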

【Comments】:

【Solution 3】:

Another approach is to use XPath. I would suggest lxml here, since Beautiful Soup has no XPath support. It is actually a very simple solution:

from lxml import html
import requests

page = requests.get('https://readheroacademia.net/manga/boku-no-hero-academia-chapter-137/')
tree = html.fromstring(page.content)
# This will create a list of img src attributes beneath the `<div id="acp_content" class="acp_content">` tag:
srcs = tree.xpath('//div[@id="acp_content"]//img/@src')
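To try the XPath expression without hitting the network, here is a small sketch that applies it to an inline document. The markup and example.com URLs are made up; only the acp_content id and the XPath expression come from the answer. It assumes the lxml package is installed.

```python
from lxml import html

# Hypothetical stand-in for page.content; the URLs are made up.
SAMPLE_PAGE = """
<html><body>
  <div id="acp_content" class="acp_content">
    <p><img src="https://example.com/pages/0137-001.png"></p>
    <img src="https://example.com/pages/0137-002.png">
  </div>
  <div id="sidebar"><img src="https://example.com/ads/banner.png"></div>
</body></html>
"""

tree = html.fromstring(SAMPLE_PAGE)

# "//img" matches img elements at any depth below the div, and "/@src"
# selects the attribute value itself rather than the element.
srcs = tree.xpath('//div[@id="acp_content"]//img/@src')

for src in srcs:
    print(src)
```

The image in the sidebar div is skipped because the expression only descends from the div whose id is acp_content.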

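【Comments】:

Obligatory warning: the built-in lxml is vulnerable to several attacks via maliciously constructed payloads: docs.python.org/3/library/xml.html#xml-vulnerabilities
@BaileyParker Thanks for the warning. That is always something to keep in mind. But does it matter here?
Unless you completely trust everyone who could tamper with the content of that page, yes. And even if you think you do, why take the risk?
@BaileyParker I see. However, using any software or library on arbitrary content can carry risk, and lxml is not necessarily the worst offender. Or, if XPath is needed, what would you suggest?
Quoting the lxml FAQ, "Is lxml vulnerable to XML bombs?": "This has nothing to do with lxml itself, only with the parser of libxml2. Since libxml2 version 2.7, the parser imposes hard security limits on input documents to prevent DoS attacks with forged input data. Since lxml 2.2.1, you can disable these limits with the huge_tree parser option if you need to parse really large, trusted documents. All lxml versions will leave these restrictions enabled by default." It looks like using a current version of lxml should be fine for a one-off job like this.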