beautifulsoup - extracting link within a div: answer

【Question title】: beautifulsoup - extracting link within a div
【Posted】: 2013-07-19 18:27:52
【Question description】:

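I have a soup with the following content:

Many divs; the ones I am interested in have the class "foo".

Each div contains many links and other content. I am interested in the second link (the second <a> </a>); it is always the second one. I want to grab the link (in the href attribute) and the text between the second link's <a> </a> tags.

For example: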

<div class ="foo">
     <a href ="http://example.com"> </a>
     <a href ="http://example2.com"> Title here </a>
</div>

<div class ="foo">
     <a href ="http://example3.com"> </a>
     <a href ="http://example4.com"> Title 2 here </a>
</div>

Here is what I want:

Title here => http://example2.com

Title 2 here => http://example4.com

I tried writing some code:

soup.findAll("div", { "class" : "foo" })

But this returns a list containing all the divs and their contents, and I don't know how to go further.

Thanks :)

【Question comments】:

Tags: python beautifulsoup screen-scraping

【Solution 1】:

Iterate over the divs and find the a tags inside each one.

from bs4 import BeautifulSoup

example = '''
<div class ="foo">
     <a href ="http://example.com"> </a>
     <a href ="http://example2.com"> Title here </a>
</div>

<div class ="foo">
     <a href ="http://example3.com"> </a>
     <a href ="http://example4.com"> Title 2 here </a>
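</div>
'''

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(example, 'html.parser')
for div in soup.find_all('div', {'class': 'foo'}):
    a = div.find_all('a')[1]  # the second link in each div
    print(a.text.strip(), '=>', a.attrs['href'])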
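
If some div might contain fewer than two links, indexing with [1] raises an IndexError. Below is a minimal sketch of a more defensive variant, assuming Python 3 and bs4 with the built-in html.parser. The function name second_links and the sample markup are made up for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup


def second_links(html, css_class="foo"):
    """Return (text, href) pairs for the second <a> of each matching div."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for div in soup.find_all("div", class_=css_class):
        links = div.find_all("a")
        if len(links) < 2:
            continue  # no second link in this div, so skip it
        second = links[1]
        # .get() returns None instead of raising if href is missing
        pairs.append((second.get_text(strip=True), second.get("href")))
    return pairs


# Hypothetical sample: the second div has only one link and is skipped.
sample = '''
<div class="foo">
     <a href="http://example.com"> </a>
     <a href="http://example2.com"> Title here </a>
</div>
<div class="foo">
     <a href="http://example3.com"> </a>
</div>
'''

for title, href in second_links(sample):
    print(title, "=>", href)  # Title here => http://example2.com
```

Returning a list of pairs rather than printing inside the loop makes it easier to reuse the result, for example to write it to a file.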

【Comments】: