python beautifulsoup iframe document html extract: answers

【Question title】: python beautifulsoup iframe document html extract
【Posted】: 2023-03-25 21:47:01
【Question description】:

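I am trying to learn some Beautiful Soup and to pull some HTML data out of a few iFrames, but so far I have not been very successful.

Parsing the iFrame element itself does not seem to be a problem for BS4, but I cannot get the embedded content out of it, whatever I do.

For example, consider the iFrame below (this is what I see in Chrome developer tools):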

<iframe frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"
src="http://www.engineeringmaterials.com/boron/728x90.html" width="728" height="90">
#document <html>....</html></iframe>

Here, <html>...</html> is the content I am interested in extracting.

However, when I use the following BS4 code:

iFrames=[] # quick bs4 example
for iframe in soup("iframe"):
    iFrames.append(soup.iframe.extract())

I get:

<iframe frameborder="0" marginwidth="0" marginheight="0" scrolling="NO" src="http://www.engineeringmaterials.com/boron/728x90.html" width="728" height="90">

In other words, I get the iFrame without the document <html>...</html> inside it.

I also tried something along these lines:

iFrames=[] # quick bs4 example
iframexx = soup.find_all('iframe')
for iframe in iframexx:
    print(iframe.find_all('html'))

.. but that does not seem to work either.

So my question is: how do I reliably extract these document objects <html>...</html> from the iFrame elements?

【Question discussion】:

Tags: python html iframe beautifulsoup

【Solution 1】:

Browsers load the iframe content in a separate request. You will have to do the same:

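import urllib2  # Python 2 module; Python 3 renamed it to urllib.request
from bs4 import BeautifulSoup

for iframe in iframexx:
    response = urllib2.urlopen(iframe.attrs['src'])  # separate request for the iframe document
    iframe_soup = BeautifulSoup(response, 'html.parser')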

Remember: BeautifulSoup is not a browser; it will not fetch images, CSS, or JavaScript resources for you either.
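
For Python 3, where urllib2 no longer exists and its functionality lives in urllib.request, the same idea might be sketched as below. This is a minimal illustration and not part of the original answer: the names fetch_html and iframe_soups, the fetch parameter, and the example.com URL are hypothetical. It assumes every iframe to be followed has a src attribute, that a relative src should be resolved against the page URL, and that stray whitespace around the src value can be stripped.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a URL and return the raw response body (bytes)."""
    with urlopen(url) as response:
        return response.read()


def iframe_soups(page_html, page_url, fetch=fetch_html):
    """Yield (url, soup) for each iframe in page_html that has a src.

    fetch is injectable (a hypothetical convenience) so the function
    can be tried without network access.
    """
    soup = BeautifulSoup(page_html, "html.parser")
    for iframe in soup.find_all("iframe"):
        src = iframe.get("src", "").strip()  # tolerate stray whitespace
        if not src:
            continue  # no src, so there is no separate document to request
        url = urljoin(page_url, src)  # resolve a relative src against the page
        yield url, BeautifulSoup(fetch(url), "html.parser")


# Offline demonstration with a stand-in fetcher instead of a real request.
def fake_fetch(url):
    return "<html><body><b>ad</b></body></html>"


page = '<p>hi</p><iframe src="boron/728x90.html "></iframe>'
for url, frame in iframe_soups(page, "http://example.com/", fetch=fake_fetch):
    print(url, frame.b.get_text())
```

With the default fetch, each iframe costs one extra HTTP request, which mirrors what the browser does. Content that a page injects into an iframe with JavaScript would still not be visible this way.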

【Discussion】: