Cannot filter beautifulsoup results for images: answers

【Question title】: Cannot filter beautifulsoup results for images
【Posted】: 2017-04-25 06:39:57
【Question description】:

I am trying to get the URLs of the images on a web page with this code:

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('URL')
for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('img')):
        if "visibility:hidden" not in link:
                print "IMAGE PATH: "+link['src']

I want to filter out links to invisible images, for example:

img style="position:absolute;z-index:-3334;top:0px;left:0px;visibility:hidden;" src="https://.....

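But I cannot filter on the "link" variable. The if condition always passes.

What type is the link variable? A string? Can I convert it to a string? How should I do that? Thanks.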

Edit: Thanks, Mr. Liang. I tried the constructor you suggested, BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('img')), but it fails for me:

Traceback (most recent call last):
  File "getLinksFromPage3.py", line 10, in <module>
    for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('img')):
  File "/usr/lib/python2.7/dist-packages/BeautifulSoup.py", line 1522, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'parse_only'

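【Question discussion】:

In the if statement: have you tried link['style']?
Please share the URL.
The BeautifulSoup documentation (crummy.com/software/BeautifulSoup/bs4/doc) has a clear example showing that each captured item behaves like a dictionary, so I think using link['style'] should solve the problem.
Yes, I tried this code: if link['style'] and "hidden" not in link['style']: print "found hidden" else: print "IMAGE PATH: "+link['src']. The line if link['style'] and "hidden" not in link['style']: fails with: File "/usr/lib/python2.7/dist-packages/BeautifulSoup.py", line 613, in __getitem__ return self._getAttrMap()[key] KeyError: 'style'. I guess it fails on the first img tag that has no style attribute...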
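
The behaviour described in the comments above can be reproduced on a small in-memory document. The following is a minimal sketch, assuming Python 3 with the bs4 package installed; the HTML string and the file names in it are made up for illustration. It shows that each item is a Tag rather than a string, that the `in` operator on a Tag looks at the tag's children instead of its markup, and that indexing a missing attribute raises KeyError while `.get()` does not.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup: one hidden image, one image without a style.
html = (
    '<img style="top:0px;visibility:hidden;" src="hidden.png">'
    '<img src="plain.png">'
)
soup = BeautifulSoup(html, "html.parser")
hidden_img, plain_img = soup.find_all("img")

# Each result is a Tag object, not a str.
is_tag = isinstance(hidden_img, Tag)

# `in` on a Tag tests membership in its child nodes, so the style text
# is never found and `not in` is always true.
found_in_tag = "visibility:hidden" in hidden_img

# Converting the Tag to a string gives its markup, which does contain the text.
found_in_markup = "visibility:hidden" in str(hidden_img)

# Indexing a missing attribute raises KeyError ...
try:
    plain_img["style"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# ... while .get() returns None (or a default) instead.
missing_style = plain_img.get("style")
```

Testing the style attribute, rather than the whole tag or its string form, keeps the check from matching text that merely appears elsewhere in the markup.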

Tags: python beautifulsoup

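【Solution 1】:

Use tag.attrs to get the attributes of the tag first, then filter on the style attribute. The following code works.

Also, you should specify which parser suits this case explicitly; that gives more consistent results.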

for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('img')):
    if 'style' in link.attrs:
        if "visibility:hidden" not in link['style']:
            print link['src']
    else:
        print link['src']
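
The substring test above matches only the exact text visibility:hidden, so a style written as "visibility: hidden" or "VISIBILITY:hidden" would slip through. One way to make the check more tolerant is to split the style string into declarations first. This is a minimal sketch using only the standard library; parse_inline_style and is_hidden are hypothetical helper names, not part of BeautifulSoup, and the parser deliberately ignores edge cases such as semicolons inside url(...) values or "!important".

```python
def parse_inline_style(style):
    """Split an inline CSS style string into a {property: value} dict."""
    declarations = {}
    for declaration in (style or "").split(";"):
        if ":" not in declaration:
            continue  # skip empty fragments, e.g. after a trailing ';'
        prop, value = declaration.split(":", 1)
        declarations[prop.strip().lower()] = value.strip().lower()
    return declarations


def is_hidden(style):
    """Return True if the inline style sets visibility to hidden."""
    return parse_inline_style(style).get("visibility") == "hidden"
```

With a Tag in hand, the helper could be called as is_hidden(link.get('style')), which also covers images that have no style attribute at all.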

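【Discussion】:

Thanks for your help. I will test it as soon as I can. What do you mean about the parser type? Is html.parser not the right one to use? What do you recommend?
BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('img')): fails for me with TypeError: __init__() got an unexpected keyword argument 'parse_only'. Which version are you using?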

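【Solution 2】:

Thanks, Mr. Liang. I also had to switch to bs4, because the old BeautifulSoup module installed on my system does not accept the parse_only argument.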

import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('URL')

for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('img')):
    if 'style' in link.attrs:
        if "visibility:hidden" not in link['style']:
            print link['src']
    else:
        print link['src']
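
The code above is written for Python 2. For reference, here is a hedged Python 3 sketch of the same filter, assuming the bs4 package is installed. BeautifulSoup 3 spells the keyword parseOnlyThese, while bs4 spells it parse_only, which is why the switch was needed. The function name visible_image_urls and the example.com URLs are made up for illustration, and the whitespace stripping is an addition that the original answer does not have.

```python
from bs4 import BeautifulSoup, SoupStrainer


def visible_image_urls(html):
    """Return the src of every <img> whose inline style is not visibility:hidden."""
    # Parse only <img> tags; the rest of the document is skipped.
    soup = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer("img"))
    urls = []
    # find_all is used so that only img tags are visited.
    for img in soup.find_all("img"):
        # .get() avoids a KeyError when the attribute is missing.
        style = img.get("style", "").replace(" ", "").lower()
        src = img.get("src")
        if src and "visibility:hidden" not in style:
            urls.append(src)
    return urls


# Hypothetical sample page used in place of a real HTTP response.
sample_html = (
    "<html><body>"
    '<img style="position:absolute;visibility:hidden;" src="https://example.com/hidden.png">'
    '<img src="https://example.com/a.png">'
    '<img style="visibility: visible" src="https://example.com/b.png">'
    "</body></html>"
)
visible = visible_image_urls(sample_html)
```

The response body returned by http.request('URL') in the code above could be passed to visible_image_urls in place of sample_html.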

【Discussion】: