BS4 + Python3: unable to crawl tree: 'NavigableString' object has no attribute 'has_attr'

【Question title】: BS4 + Python3: unable to crawl tree: 'NavigableString' object has no attribute 'has_attr'
【Posted】: 2014-03-24 16:44:50
【Question】:

I am new to Python (I only know PowerShell), and I am trying to learn web crawling with BS4 and Python 3.

Here is a simple exercise I am practicing on:

<h1 class="entry-title">
<a href="test1.html">test1</a></h1>
<h1 class="entry-title">
<a href="test2.html" rel="bookmark">test2</a></h1>

What I want is to get the details (href and .string) only for the links that have a "rel" attribute.

Here is my code:

for h1_Tag in soup.find_all(("h1", { "class" : "entry-title" })):
    for a_Tag in h1_Tag.contents:
        if a_Tag.has_attr('rel'):
           print (a_Tag)

But I get: AttributeError: 'NavigableString' object has no attribute 'has_attr'

What am I doing wrong? Any help is appreciated.

Thanks!

【Comments】:

Tags: python python-3.x beautifulsoup

【Solution 1】:

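You are looping over everything in .contents, and that includes NavigableString objects, i.e. text nodes such as the newline after the opening <h1> tag. Text nodes are not tags, so they have no has_attr method.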
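
To make the cause concrete, here is a minimal sketch that lists the node types found in .contents and then guards the has_attr call with an isinstance check. The helper names child_type_names and links_with_rel are made up for illustration, the HTML is the sample from the question, and the built-in "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Sample markup copied from the question.
html = """<h1 class="entry-title">
<a href="test1.html">test1</a></h1>
<h1 class="entry-title">
<a href="test2.html" rel="bookmark">test2</a></h1>"""

soup = BeautifulSoup(html, "html.parser")


def child_type_names(tag):
    """Return the class name of every direct child of `tag` (hypothetical helper)."""
    return [type(child).__name__ for child in tag.contents]


def links_with_rel(soup):
    """Walk .contents as in the question, but skip anything that is not a Tag."""
    found = []
    for h1 in soup.find_all("h1", class_="entry-title"):
        for child in h1.contents:
            # NavigableString (text) nodes have no has_attr, so test the type first.
            if isinstance(child, Tag) and child.has_attr("rel"):
                found.append(child)
    return found


first_h1 = soup.find("h1", class_="entry-title")
print(child_type_names(first_h1))  # the newline text node shows up before the <a> tag
print(links_with_rel(soup))
```

This keeps the original loop structure, but searching for the tags directly is usually the shorter route.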

If you want to find all elements that have a rel attribute, search for them instead:

# Pass the tag name and the attribute filter as two separate arguments.
for h1_Tag in soup.find_all("h1", {"class": "entry-title"}):
    for a_Tag in h1_Tag.find_all('a', rel=True):
        print(a_Tag)

The rel=True keyword argument limits the search to elements that have that attribute; <a> tags without a rel attribute are skipped.
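
Since the question asks for the href and the .string of each matching link, a self-contained sketch along these lines may help. The function name rel_link_details is hypothetical, the HTML is the sample from the question, and "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """<h1 class="entry-title">
<a href="test1.html">test1</a></h1>
<h1 class="entry-title">
<a href="test2.html" rel="bookmark">test2</a></h1>"""

soup = BeautifulSoup(html, "html.parser")


def rel_link_details(soup):
    """Collect (href, text) pairs for <a> tags that carry a rel attribute."""
    details = []
    for h1 in soup.find_all("h1", class_="entry-title"):
        # rel=True matches any <a> that has a rel attribute, whatever its value.
        for a in h1.find_all("a", rel=True):
            details.append((a["href"], str(a.string)))
    return details


for href, text in rel_link_details(soup):
    print(href, text)
```

Wrapping a.string in str() is a design choice here: it turns the NavigableString into a plain Python string so the result does not keep a reference to the parse tree.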

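【Comments】:

【Solution 2】:

Another approach is to use SoupStrainer, which lets you parse only the parts of the document that match predefined criteria. I used Python 2.7 and BeautifulSoup 4.3.2, but the logic is the same.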

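from bs4 import BeautifulSoup as bsoup, SoupStrainer as strain

# Keep only the elements that have a rel attribute while parsing.
only_rel = strain(rel=True)
with open("test.html") as ofile:
    soup = bsoup(ofile, "html.parser", parse_only=only_rel)

# print() with a single argument works on both Python 2.7 and Python 3.
print(soup)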

Result:

<a href="test2.html" rel="bookmark">test2</a>
[Finished in 0.2s]
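
For Python 3, which is what the question uses, the same idea can be sketched without a file on disk. This is only an illustration under some assumptions: the HTML is the sample from the question held in a string, "html.parser" is the parser, and the strainer is narrowed to <a> tags by passing the tag name as well.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Sample markup copied from the question.
html = """<h1 class="entry-title">
<a href="test1.html">test1</a></h1>
<h1 class="entry-title">
<a href="test2.html" rel="bookmark">test2</a></h1>"""

# Parse only <a> tags that have a rel attribute; everything else is discarded.
only_rel_links = SoupStrainer("a", rel=True)
soup = BeautifulSoup(html, "html.parser", parse_only=only_rel_links)

for a in soup.find_all("a"):
    print(a["href"], a.string)
```

Note that the strained soup no longer contains the <h1> elements, so this approach fits when only the links themselves are needed.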

Let us know if this helps.

【Comments】: