How to define findAll for nested HTML tags using BeautifulSoup: answers

【Question title】: How to define findAll for html nested tags using beautifulsoup
【Posted】: 2011-02-07 19:05:55
【Question】:

Given

<a href="www.example.com/"></a>

<table class="theclass">
<tr><td>
<a href="www.example.com/two">two</a>
</td></tr>
<tr><td>
<a href ="www.example.com/three">three</a>
<span>blabla</span>
</td></tr>
</table>

How can I scrape only what is inside the table with class="theclass"? I tried using

soup = util.mysoupopen(theexample) 
infoText = soup.findAll("table", {"class": "theclass"})

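but I don't know how to narrow the find statement any further. The other thing I tried was converting the result of findAll() to an array and then looking for a pattern in where the needle shows up, but I couldn't find a consistent pattern. Thanks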

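【Comments on the question】:

What do you want to scrape? You asked "How can I scrape only what is inside the table with class="theclass"?" Do you mean the links?

Tags: python html beautifulsoup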

【Answer 1】:

If I understand your question correctly, this Python code should work. It iterates over every table with class="theclass" and then finds the links inside it.

>>> foo = """<a href="www.example.com/"></a>
... <table class="theclass">
... <tr><td>
... <a href="www.example.com/two">two</a>
... </td></tr>
... <tr><td>
... <a href ="www.example.com/three">three</a>
... <span>blabla</span>
... </td></tr>
... </table>
... """
>>> import BeautifulSoup as bs
>>> soup = bs.BeautifulSoup(foo)
>>> for table in soup.findAll('table', {'class':'theclass'} ):
...     links=table.findAll('a')
... 
>>> print links
[<a href="www.example.com/two">two</a>, <a href="www.example.com/three">three</a>]
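
The transcript above uses the Python 2 era BeautifulSoup 3 API. As a rough modern equivalent, assuming the `bs4` package (Beautiful Soup 4) on Python 3 with the built-in `html.parser`, the same idea might be sketched as follows. The helper name `links_in_tables` is made up for illustration, the markup is a tidied copy of the question's, and the sketch gathers links from every matching table rather than keeping only the last table's result.

```python
from bs4 import BeautifulSoup

# Tidied copy of the markup from the question (assumption: well-formed tags).
HTML = """<a href="www.example.com/"></a>
<table class="theclass">
<tr><td>
<a href="www.example.com/two">two</a>
</td></tr>
<tr><td>
<a href="www.example.com/three">three</a>
<span>blabla</span>
</td></tr>
</table>
"""


def links_in_tables(html, css_class):
    """Return the href of every <a> nested in a <table> with the given class."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    # Outer search selects the tables; inner search is scoped to each table.
    for table in soup.find_all("table", class_=css_class):
        for link in table.find_all("a"):
            hrefs.append(link.get("href"))
    return hrefs


if __name__ == "__main__":
    print(links_in_tables(HTML, "theclass"))
```

The link that sits outside the table is skipped because the inner `find_all` is called on the table element, not on the whole document.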

【Comments】:

【Answer 2】:

infoText is a list. You should iterate over it.

>>> for info in infoText:
...     print info.tr.td.a
...
<a href="www.example.com/two">two</a>

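That gives you access to each <table> element. If you expect only one table element with the class "theclass" in the document, soup.find("table", {"class": "theclass"}) will give you that table directly.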
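
A small sketch of the single-table case, assuming the `bs4` package on Python 3 rather than the BeautifulSoup 3 API used in this thread. `find` returns the first match or `None`, so the hypothetical helper `first_link_text` checks for a missing table before drilling down.

```python
from bs4 import BeautifulSoup

# Shortened sample markup (assumption: one table, one link).
HTML = """<table class="theclass">
<tr><td><a href="www.example.com/two">two</a></td></tr>
</table>"""


def first_link_text(html, css_class):
    """Return the text of the first <a> in the first matching table, or None."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": css_class})
    if table is None:
        # No table with that class in the document.
        return None
    link = table.find("a")
    return link.get_text() if link is not None else None


if __name__ == "__main__":
    print(first_link_text(HTML, "theclass"))
```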

【Comments】:

I got this error and I don't know why it happens.

Traceback (most recent call last):
  File "test.py", line 10, in <module>
    print info.tr.td.a
  File "/nfs/home/j/d/jdiaz/cs171/BeautifulSoup.py", line 402, in __getattr__
    raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr)
AttributeError: 'NavigableString' object has no attribute 'tr'
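
One possible cause, offered only as a guess since the commenter's test.py is not shown: the loop may have run over a single tag (for example the result of `find`) instead of the list returned by `findAll`. Iterating over a tag yields its direct children, and the whitespace between tags comes back as `NavigableString` objects, which have no `.tr`. The sketch below assumes the `bs4` package on Python 3 with `html.parser`; `child_tag_names` is a hypothetical helper that shows how the text nodes can be skipped.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Shortened sample markup (assumption: newlines between the tags).
HTML = """<table class="theclass">
<tr><td><a href="www.example.com/two">two</a></td></tr>
</table>"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", class_="theclass")

# Iterating over a Tag walks its direct children, text nodes included.
children = list(table)
# The first child here is the newline that follows the opening <table> tag.
first_is_string = isinstance(children[0], NavigableString)


def child_tag_names(tag):
    """Names of the direct child elements, ignoring text nodes."""
    return [child.name for child in tag if isinstance(child, Tag)]


if __name__ == "__main__":
    print(first_is_string, child_tag_names(table))
```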