How to fix bs4 select error: 'TypeError: __init__() keywords must be strings'

[Question title]: How to fix bs4 select error: 'TypeError: __init__() keywords must be strings'
[Posted]: 2019-06-01 08:33:37
[Question description]:

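I'm writing a script that sends a POST request and gets XML back. I need to parse that XML to find out whether the POST request was accepted. I've been using bs4 to parse it, and it worked fine until about a week ago, when I started getting an error I hadn't seen before: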

TypeError: __init__() keywords must be strings

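I use bs4's select function in other parts of the same file without hitting this error, and I couldn't find anything about it online. At first I thought it was a version problem, but I tried both Python 3.7 and 3.6 and got the same error.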

This is the code that produces the error:

res = requests.post(url, data = body, headers = headers)
logging.debug('Res HTTP status is {}'.format(res.status_code))

try:
    res.raise_for_status()
    resSoup = BeautifulSoup(res.text, 'xml')
    # get the resultcode from the resultcode tag
    resCode = resSoup.select_one('ResultCode').text

The full error message:

Traceback (most recent call last):
  File "EbarInt.py", line 292, in <module>
    resCode = resSoup.select_one('ResultCode').text
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\bs4\element.py", line 1345, in select_one
    value = self.select(selector, namespaces, 1, **kwargs)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\bs4\element.py", line 1377, in select
    return soupsieve.select(selector, self, namespaces, limit, **kwargs)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\soupsieve\__init__.py", line 108, in select
    return compile(select, namespaces, flags).select(tag, limit)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\soupsieve\__init__.py", line 50, in compile
    namespaces = ct.Namespaces(**(namespaces))
TypeError: __init__() keywords must be strings

When I check the type of res.text, I get class 'str', as expected.

When I log res.text, I get:

<?xml version="1.0" encoding="utf-8"?><soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing" xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd"><soap:Header><wsa:Action>Trackem.Web.Services/CreateOrUpdateTaskResponse</wsa:Action><wsa:MessageID>urn:uuid:3ecae312-d416-40a5-a6a3-9607ebf28d7a</wsa:MessageID><wsa:RelatesTo>urn:uuid:6ab7e354-6499-4e37-9d6e-61219bac11f6</wsa:RelatesTo><wsa:To>http://schemas.xmlsoap.org/ws/2004/08/addressing/role/anonymous</wsa:To><wsse:Security><wsu:Timestamp wsu:Id="Timestamp-6b84a16f-327b-42db-987f-7f1ea52ef802"><wsu:Created>2019-01-06T10:33:08Z</wsu:Created><wsu:Expires>2019-01-06T10:38:08Z</wsu:Expires></wsu:Timestamp></wsse:Security></soap:Header><soap:Body><CreateOrUpdateTaskResponse xmlns="Trackem.Web.Services"><CreateOrUpdateTaskResult><ResultCode>OK</ResultCode><ResultCodeAsInt>0</ResultCodeAsInt><TaskNumber>18000146</TaskNumber></CreateOrUpdateTaskResult></CreateOrUpdateTaskResponse></soap:Body></soap:Envelope>

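[Comments]:

Side note: you really want to leave Unicode decoding to the XML parser, so use res.content, not res.text. That is not the cause of your current problem, however.
Actually, the text attribute uses the encoding that requests guessed for the response, while the content attribute returns bytes instead of a string.
Yes, but XML contains its own encoding information, and the XML parser handles that. HTTP servers easily get this wrong.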
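
The point made in these comments can be illustrated with a small sketch. It uses the standard library's xml.etree.ElementTree rather than bs4, purely so that it runs without extra packages, and the sample document is a made-up stand-in for a server response. With requests, the bytes would correspond to res.content.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body: Latin-1 encoded bytes whose XML declaration
# names the encoding. This stands in for res.content.
raw = '<?xml version="1.0" encoding="iso-8859-1"?><Name>caf\u00e9</Name>'.encode('iso-8859-1')

# Given bytes, the parser reads the declaration and decodes the text itself.
root = ET.fromstring(raw)
parsed_text = root.text

# Decoding up front with a wrongly guessed charset damages the text before
# the parser ever sees it; the accented character becomes U+FFFD.
mis_decoded = raw.decode('utf-8', errors='replace')
```

Here parsed_text keeps the accented character, while mis_decoded contains a replacement character where it used to be.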

Tags: python xml python-3.x beautifulsoup

[Solution 1]:

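Update: BeautifulSoup 4.7.1 has been released, which fixes the default namespace issue. See the release notes. You probably want to upgrade for the performance fixes alone.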
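
To decide whether an upgrade is needed, a version check along these lines might help. This is only a sketch: the helper names are hypothetical, and it assumes, based on this answer alone, that 4.7.0 is the only release with the default namespace problem.

```python
import re


def version_tuple(version):
    """Return (major, minor, patch) from a version string such as '4.7.1'."""
    match = re.match(r'(\d+)\.(\d+)(?:\.(\d+))?', version)
    if match is None:
        raise ValueError('unrecognised version string: {!r}'.format(version))
    # A missing patch component (e.g. '4.7') is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def has_default_namespace_bug(version):
    """Assumption from the answer: only BeautifulSoup 4.7.0 is affected."""
    return version_tuple(version) == (4, 7, 0)
```

The installed version string can be read from bs4.__version__ and passed to has_default_namespace_bug().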

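Original answer:

You must have upgraded to BeautifulSoup 4.7, which replaced the simple and limited internal CSS parser with the soupsieve project, a far more complete CSS implementation.

It is that project that has a problem with the default namespace attached to one of the elements in your response: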

<CreateOrUpdateTaskResponse xmlns="Trackem.Web.Services">

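The XML parser used to build the BeautifulSoup object tree correctly communicates that as a None -> 'Trackem.Web.Services' mapping in the namespace dictionary, but the soupsieve code requires every namespace to have a prefix name (xmlns:prefix), with the default namespace marked by an empty string rather than None, which leads to this error. I have reported this as issue #68 to the soupsieve project.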
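
The failure can be reproduced without bs4 at all, since it comes from unpacking a dictionary with a None key into keyword arguments. The sketch below uses a hypothetical stand-in class, not soupsieve's real Namespaces class, and the normalize_namespaces helper is likewise only an illustration of the empty-string convention described above.

```python
class Namespaces(dict):
    """Hypothetical stand-in for a mapping built from keyword arguments."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)


# Mapping as the XML parser reports it: the default namespace is keyed by None.
parsed = {
    None: 'Trackem.Web.Services',
    'soap': 'http://schemas.xmlsoap.org/soap/envelope/',
}

# Unpacking with ** requires every key to be a string, so None fails.
try:
    Namespaces(**parsed)
    error_message = None
except TypeError as exc:
    error_message = str(exc)


def normalize_namespaces(namespaces):
    """Replace a None prefix with '' so all keys are strings."""
    return {('' if prefix is None else prefix): uri
            for prefix, uri in namespaces.items()}


normalized = Namespaces(**normalize_namespaces(parsed))
```

After normalization the default namespace is stored under the empty-string key and the constructor call succeeds.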

You don't need select_one here at all, because you are not using any CSS syntax beyond the element name. Use soup.find() instead:

resCode = resSoup.find('ResultCode').text

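[Comments]:

Does soup.find have the same error? How do I downgrade to the previous version? I can't change the namespace prefix.
@youngmarx: No, .find() doesn't take CSS selectors and won't use the new soupsieve project. You can downgrade with pip install -I "beautifulsoup4<4.7.0", but you really don't need to here. The authors of BeautifulSoup and soupsieve are aware of the issue and are already working out how best to fix it, but your specific use case isn't affected anyway.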