'NoneType' error when trying to set attributes for the first tag from a parsed PDF document in Python with BeautifulSoup4

【Question title】: 'NoneType' error when trying to set attributes for the first tag from a parsed PDF document in Python with BeautifulSoup4
【Posted】: 2019-08-12 19:40:18
【Question】:

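I'm writing a Python script that converts a large number of PDFs to HTML with pdfminer.six and then uploads them to an e-shop. So far the main text blocks are parsed just fine, but along the way I had to replace all the spans with divs (and strip their attributes), for obvious reasons, so a document's structure now looks like this: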

<div> #first main block
    <div>
        Product desc heading
    </div>
    <div>
        Product desc text
    </div>
    #etc etc
</div>

<div> #second main block
    <div>
        Product specs heading
    </div>
    <div>
        Product specs text
    </div>
    #etc etc
</div>

The problem is navigating among identical divs. If I try to find the first div and add some attributes to it, as the docs suggest:

firstdiv = soup.find('div')
firstdiv['class'] = 'main_productinfo'

the result is entirely predictable - IDLE prints the following error:

File "C:\Users\blabla\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\element.py", line 1036, in __setitem__
    self.attrs[key] = value
TypeError: 'NoneType' object does not support item assignment

I assume this is because the find() method does not return a definite result (it may or may not find something).
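A note on the traceback: the error is raised inside Tag.__setitem__, on the line self.attrs[key] = value, which suggests the tag itself was found and that its attrs is None. The question does not show the code that strips the span attributes, so the sketch below rests on an assumption: that the clean-up step assigned None to tag.attrs instead of emptying the dictionary. The sample markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><span class='old'>Product desc heading</span></div>", "html.parser"
)

# Hypothetical clean-up step: turn the span into a div and drop its attributes.
inner = soup.find("span")
inner.name = "div"
inner.attrs = None  # assumption: the attributes were "stripped" this way

try:
    inner["class"] = "main_productinfo"
except TypeError as exc:
    # Tag.__setitem__ runs self.attrs[key] = value, and attrs is None here.
    error_message = str(exc)

# Keeping attrs a dictionary (an empty one) makes the tag assignable again.
inner.attrs = {}
inner["class"] = "main_productinfo"
print(inner)
```

If this is what happened, clearing the dictionary in the clean-up step (tag.attrs = {} or tag.attrs.clear()) should avoid the error without any other changes.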

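I want to filter out the first block in each file, then parse the table (found in the specs block below it) into HTML and join the two in each upload file. How can I add attributes to the first tag without converting the soup to a string over and over (which makes it very, very ugly, because the freshly prettified soup is converted without any whitespace) and replacing parts of the string in str(soup)? I'm quite new to Python and nothing simple comes to mind.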
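A minimal sketch of changing the first tag in place, so that no string replacement on str(soup) is needed. The helper name tag_first_div and the sample markup are assumptions for illustration, not part of the original script. Because find() returns None when nothing matches, the sketch checks for that before assigning.

```python
from bs4 import BeautifulSoup


def tag_first_div(html, css_class="main_productinfo"):
    """Parse html, set a class on the first <div>, and return the soup."""
    soup = BeautifulSoup(html, "html.parser")
    first_div = soup.find("div")  # None if the document has no <div>
    if first_div is not None:
        # A list matches how BeautifulSoup stores multi-valued attributes.
        first_div["class"] = [css_class]
    return soup


sample = "<div><div>Product desc heading</div></div><div>Specs</div>"
soup = tag_first_div(sample)
print(soup.prettify())  # the tree is serialized once, at the very end
```

The tree is edited through the Tag object and only turned into text at the end, so the formatting produced by prettify() is applied to the final result rather than lost between steps.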

Update: I'm using Python 3.7.2 on Win 7 64.

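【Question comments】:

Is it possible that there is no div?
@QHarr The thing is, there is a div and it prints out just fine, but the value assignment doesn't work, treating the found div as if it weren't there.
I did the same as @chittown in the deleted answer and the assignment worked for me too. So I guess something else is at play with your actual data.
@QHarr Maybe it's because it iterates over multiple files in a directory in a for loop, not just one file?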

Tags: python html css parsing beautifulsoup

【Solution 1】:

I don't get that error:

import bs4

html = '''<div> #first main block
    <div>
        Product desc heading
    </div>
    <div>
        Product desc text
    </div>
    #etc etc
</div>

<div> #second main block
    <div>
        Product specs heading
    </div>
    <div>
        Product specs text
    </div>
    #etc etc
</div>'''

soup = bs4.BeautifulSoup(html, 'html.parser')

firstdiv = soup.find('div')

Output:

print(firstdiv)
<div> #first main block
    <div>
        Product desc heading
    </div>
    <div>
        Product desc text
    </div>
    #etc etc
</div>

Then:

firstdiv['class'] = 'main_productinfo'
print(firstdiv)

<div class="main_productinfo"> #first main block
    <div>
        Product desc heading
    </div>
    <div>
        Product desc text
    </div>
    #etc etc
</div>

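【Comments】:

Well, that's the point - firstdiv prints just fine, but the assignment doesn't work for me.
How is the HTML source stored? In other words, where is the part of your code that creates soup? I see it used above, but not where you declare it.
It's stored in a string variable that holds the raw parsing result from pdfminer. That variable is then fed into soup.
Are you using 'html.parser'? Or 'lxml'?
I'm using html.parser.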