Beautifulsoup: Looping over elements to get text

【Question title】: Beautifulsoup: Looping over elements to get text
【Posted】: 2015-10-08 10:47:20
【Question】:

I am learning BeautifulSoup and have a web page whose body looks like this:

html:

<div>
 <table>
 <tr>
  <td>
   <div>
     <a name='abc'>....</a>
   </div>
  </td>
 </tr>
</table>
</div>
<a name='pqr'>...</a> 
<div>text1</div>
<div>text2</div>
<div>text3</div>
 <a name='mno'>...</a> 

<div>
 <table>
 <tr>
  <td>
   <div>
     <a name='xyz'>....</a>
   </div>
  </td>
 </tr>
</table>
</div>

Expected result:

<a name='pqr'>...</a> 
<div>text1</div>
<div>text2</div>
<div>text3</div> 
<a name='mno'>...</a>

That is, I want to get everything up to the point where the 'a name='xyz'' tag is reached.

【Discussion】:

Tags: python web-scraping beautifulsoup html-parsing

【Solution 1】:

You can make a function that matches every div element that has the pqr link as a preceding sibling and the mno link as a following sibling:

def desired_divs(elm):
    if elm and elm.name == "div" and \
            elm.find_previous_sibling("a", {"name": "pqr"}) and \
            elm.find_next_sibling("a", {"name": "mno"}):
        return elm

for div in soup.find_all(desired_divs):
    print(div.text)

Prints:

text1
text2
text3
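
For anyone who wants to run this approach end to end, here is a minimal self-contained sketch. It assumes the question's markup is held in a string (the `HTML` variable is introduced here only for illustration) and parsed with Python's built-in `html.parser`. Returning an explicit boolean rather than the element is a stylistic choice, not something `find_all()` requires.

```python
from bs4 import BeautifulSoup

# Sample markup condensed from the question (hypothetical variable name).
HTML = """
<div><table><tr><td><div><a name='abc'>....</a></div></td></tr></table></div>
<a name='pqr'>...</a>
<div>text1</div>
<div>text2</div>
<div>text3</div>
<a name='mno'>...</a>
<div><table><tr><td><div><a name='xyz'>....</a></div></td></tr></table></div>
"""


def desired_divs(elm):
    # Match only <div> tags that have the 'pqr' anchor somewhere among their
    # previous siblings and the 'mno' anchor somewhere among their next siblings.
    return (
        elm.name == "div"
        and elm.find_previous_sibling("a", {"name": "pqr"}) is not None
        and elm.find_next_sibling("a", {"name": "mno"}) is not None
    )


soup = BeautifulSoup(HTML, "html.parser")
texts = [div.get_text() for div in soup.find_all(desired_divs)]
print(texts)
```

The nested div elements around the abc and xyz anchors are not matched, because they do not have both anchors as siblings.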

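Alternatively, you can find the starting a element, then iterate over all of its following siblings, collecting the text of the div elements along the way and stopping when you reach the ending a element: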

beginning = soup.find("a", {"name": "pqr"})
for elm in beginning.find_next_siblings():
    if elm.name == "a" and elm.get("name") == "mno":
        break

    print(elm.text)
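
The sibling walk can also be wrapped in a small helper so the anchor names become parameters. This is only a sketch: the function name `sibling_div_texts` and the `HTML` sample string are assumptions made for illustration, and the helper returns an empty list when the starting anchor is missing instead of raising an error.

```python
from bs4 import BeautifulSoup

# Sample markup condensed from the question (hypothetical variable name).
HTML = """
<div><table><tr><td><div><a name='abc'>....</a></div></td></tr></table></div>
<a name='pqr'>...</a>
<div>text1</div>
<div>text2</div>
<div>text3</div>
<a name='mno'>...</a>
<div><table><tr><td><div><a name='xyz'>....</a></div></td></tr></table></div>
"""


def sibling_div_texts(soup, start_name, end_name):
    """Collect the text of <div> siblings found between two named <a> anchors."""
    start = soup.find("a", {"name": start_name})
    if start is None:
        return []
    texts = []
    # With no arguments, find_next_siblings() yields only tags, in document order.
    for elm in start.find_next_siblings():
        if elm.name == "a" and elm.get("name") == end_name:
            break
        if elm.name == "div":
            texts.append(elm.get_text(strip=True))
    return texts


soup = BeautifulSoup(HTML, "html.parser")
between_pqr_mno = sibling_div_texts(soup, "pqr", "mno")
between_abc_xyz = sibling_div_texts(soup, "abc", "xyz")
print(between_pqr_mno)
print(between_abc_xyz)
```

Note that this only works when the two anchors are siblings. In the sample markup the abc and xyz anchors are nested inside their own div and table elements, so they have no tag siblings and the second call returns an empty list.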

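【Discussion】:

I tried passing ('a',{'name':'abc'}) as an argument, but it only returns the text written between the 'a' tags.
@anonymous Why did you do that? The desired_divs function is meant to be passed to find() or find_all(), as in the example code provided.
Oh yes, what a silly mistake. But the problem is that I think your function would give the text between all 'div' tags. The web page is very large and the body is almost entirely repeated, including the tags named "mno" and "pqr". Is there a way to parse from "abc" through to "xyz", since those are the ones that keep changing?
@anonymous Okay, thanks. Do you really need to check the names of the a links, or could you just get all the divs between a elements?..
Any logic is fine as long as I get the text between specific a elements. I was actually thinking of using a loop or a condition with findAllNext(), but after ahref=find('a',{'name':'abc'}) it gives me the entire content. I can't seem to stop at the tag.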

【Solution 2】:

I tried this and it works well:

aref = soup.find('a', {"name": "abc"})

for i in aref.findAllNext():
    if i.attrs == {'name': 'xyz'}:
        break
    else:
        print(i.text)
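
One side effect of printing `.text` for every following tag is that a container tag prints the text of everything nested inside it, so the same text can appear more than once. A possible variation, sketched below, walks the document in order and collects only the text nodes between the two anchors. The helper name `text_between` and the `HTML` sample string are hypothetical, and skipping the starting anchor's own text is a choice made for this sketch.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup condensed from the question (hypothetical variable name).
HTML = """
<div><table><tr><td><div><a name='abc'>....</a></div></td></tr></table></div>
<a name='pqr'>...</a>
<div>text1</div>
<div>text2</div>
<div>text3</div>
<a name='mno'>...</a>
<div><table><tr><td><div><a name='xyz'>....</a></div></td></tr></table></div>
"""


def text_between(soup, start_name, end_name):
    """Collect non-empty text nodes found after one named <a> and before another."""
    start = soup.find("a", {"name": start_name})
    if start is None:
        return []
    # The first matching end anchor after the start (None if there is none,
    # in which case the walk continues to the end of the document).
    end = start.find_next("a", {"name": end_name})
    collected = []
    for node in start.next_elements:
        if node is end:
            break
        # Keep only text nodes, so nested containers do not repeat their text.
        if not isinstance(node, NavigableString):
            continue
        # Skip text that belongs to the starting anchor itself.
        if any(parent is start for parent in node.parents):
            continue
        text = node.strip()
        if text:
            collected.append(text)
    return collected


soup = BeautifulSoup(HTML, "html.parser")
result = text_between(soup, "abc", "xyz")
print(result)
```

Because the walk follows document order rather than sibling links, it works even though the abc and xyz anchors are nested inside div and table elements. The "..." entries in the result are the text of the pqr and mno anchors.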

【Discussion】: