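Parsing tables with img tags in Python with BeautifulSoup: answers

【Question title】: Parsing tables with img tags in Python with BeautifulSoup
【Posted】: 2013-09-19 05:26:28
【Question description】:

I am using BeautifulSoup to parse an HTML page. I need to process the first table on the page. The table contains several rows, each row contains some "td" tags, and one of those "td" tags contains an "img" tag. I want to get all the information in that table. However, when I print the table, I do not get any data related to the "img" tag.

I am using soup.findAll("table") to get all the tables and then picking the first one to process. The HTML looks like this: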

<table id="abc">
  <tr class="listitem-even">
    <td class="listitem-even">
      <table border = "0"> <tr> <td class="gridcell">
               <img id="img_id" title="img_title" src="img_src" alt="img_alt" /> </td> </tr>
      </table>
    </td>
    <td class="listitem-even">
      <span>some_other_information</span>
    </td>
  </tr>
</table>

How can I get all the data in the table, including the "img" tag? Thanks,

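【Comments】:

soup.find('table') will give you the first table; there is no need to find them all if you only need the first one.
You can use .find() and .find_all() on any BeautifulSoup element; table.find('img') will give you the image as well. What information exactly do you want to extract?
Thanks for the tips. Can I use td.find('img') or something similar? I would like to know what the src is and which 'td' it is associated with.
I actually need to read the title of the 'img' tag, and then decide based on that title whether the corresponding 'td' is of value to me.
soup.findall('img') gives me all the images, but table = soup.find('table') followed by table.findall('img') gives me "None". Any ideas?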
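
The suggestions in these comments can be sketched as follows. This is a minimal example, assuming bs4 with the built-in "html.parser" and an inline copy of the question's markup (with the tags closed); the variable names are illustrative only. Note that the search method is spelled find_all (or the legacy findAll). Lowercase findall is not a bs4 search method, which is likely why the last comment ended up with None.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question (tags closed for this example).
HTML = """
<table id="abc">
  <tr class="listitem-even">
    <td class="listitem-even">
      <table border="0"><tr><td class="gridcell">
        <img id="img_id" title="img_title" src="img_src" alt="img_alt" />
      </td></tr></table>
    </td>
    <td class="listitem-even">
      <span>some_other_information</span>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

table = soup.find("table")      # first table in the document
img = table.find("img")         # searches all descendants, nested tables included
cell = img.find_parent("td")    # nearest enclosing td (the inner "gridcell" cell)

print(table["id"])               # id of the outer table
print(img["src"], img["title"])  # attributes are read like dictionary keys
print(cell["class"])             # class is returned as a list of values
```

Because find and find_all search all descendants by default, the image is reachable from the outer table even though it sits inside a nested one.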

Tags: python html-parsing beautifulsoup

【Solution 1】:

You have a nested table, so you need to check where you are in the tree before parsing the tr/td/img tags.

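from bs4 import BeautifulSoup

# Read the page; test.html holds the HTML from the question.
with open('test.html', 'rb') as f:
    html = f.read()

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, 'html.parser')

tables = soup.find_all('table')

for table in tables:
    # Only process tables that are nested inside another table.
    if table.find_parent("table") is not None:
        for tr in table.find_all('tr'):
            # Search the cells of this row, not of the whole table.
            for td in tr.find_all('td'):
                for img in td.find_all('img'):
                    print(img['id'])
                    print(img['src'])
                    print(img['title'])
                    print(img['alt'])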

Based on your example, it returns the following:

img_id
img_src
img_title
img_alt
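
The asker also wanted to decide, from the image's title, whether the rest of the row is worth keeping. A possible sketch of that step follows. It assumes "html.parser" (which does not insert tbody elements, so each tr is a direct child of its table), and the helper name cells_for_title is hypothetical rather than part of bs4.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question (tags closed for this example).
HTML = """
<table id="abc">
  <tr class="listitem-even">
    <td class="listitem-even">
      <table border="0"><tr><td class="gridcell">
        <img id="img_id" title="img_title" src="img_src" alt="img_alt" />
      </td></tr></table>
    </td>
    <td class="listitem-even">
      <span>some_other_information</span>
    </td>
  </tr>
</table>
"""


def cells_for_title(table, wanted_title):
    """Hypothetical helper: text of the non-image cells in rows whose img has wanted_title."""
    results = []
    # recursive=False restricts the search to the outer table's own rows.
    for tr in table.find_all("tr", recursive=False):
        img = tr.find("img")
        if img is None or img.get("title") != wanted_title:
            continue  # row has no image, or its title is not the one we want
        for td in tr.find_all("td", recursive=False):
            if td.find("img") is None:
                results.append(td.get_text(strip=True))
    return results


soup = BeautifulSoup(HTML, "html.parser")
outer = soup.find("table")

print(cells_for_title(outer, "img_title"))  # → ['some_other_information']
print(cells_for_title(outer, "other"))      # → []
```

Using recursive=False keeps the outer rows and cells separate from those of the nested table, so each image is matched with the outer row it belongs to.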

【Comments】: