JSONDecodeError when looping through SEC indices

【Question Title】: JSONDecodeError when looping through SEC indices
【Posted】: 2021-07-01 10:39:49
【Question】:

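Something strange keeps happening when I try to scrape SEC filings from the web. My Python 3 scraping code loops through a list of CIKs (a company's unique filing ID). This is where the code breaks (early in the script):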

from bs4 import BeautifulSoup
import requests

base_url = 'https://www.sec.gov/Archives/edgar/data/'

for cik_number in ciks['public_ciks']:

    url = f'{base_url}{cik_number}/index.json'
    response = requests.get(url)

    # Parse the response
    soup = BeautifulSoup(response.content, 'lxml')

    # **This is where the error occurs**
    decoded_content = response.json()

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

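When I ran this a few days ago, it worked fine. Today it not only keeps throwing the error, but does so at different points in the loop: sometimes on the first URL, sometimes on the 5th, the 8th, and so on, with no consistency. When I isolate these URLs and run the request on a single one, the error never occurs, which makes the problem even stranger. Can anyone help? Thanks!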

【Question Discussion】:

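Some of the documents you are trying to retrieve may be failing, in which case you may be receiving an error message formatted as HTML rather than the JSON you expect. You should print response.text before trying to convert it to JSON.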
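To make that suggestion concrete, here is a minimal sketch that inspects each reply before decoding it and retries after a pause. The helper names `parse_json_or_none` and `fetch_json_with_retry`, the `get` callable, and the retry count and delay are all hypothetical. The retry only helps if the failures are transient (for example, throttling by the server), which nothing in this thread confirms.

```python
import json
import time


def parse_json_or_none(response):
    """Return the decoded JSON body, or None if the reply is not usable.

    `response` is any object with `status_code`, `headers`, and `text`
    attributes, such as a `requests.Response` (hypothetical helper).
    """
    if response.status_code != 200:
        # Show what the server actually sent instead of failing on .json().
        content_type = response.headers.get("Content-Type", "")
        print(f"Unexpected reply: HTTP {response.status_code}, {content_type!r}")
        print(response.text[:200])
        return None
    try:
        return json.loads(response.text)
    except json.JSONDecodeError as exc:
        # A 200 reply can still carry HTML or an empty body.
        print(f"Body is not valid JSON: {exc}")
        print(response.text[:200])
        return None


def fetch_json_with_retry(get, url, attempts=3, delay=1.0):
    """Call `get(url)` up to `attempts` times, sleeping `delay` seconds between tries.

    The attempt count and delay are illustrative assumptions, not tuned values.
    """
    for attempt in range(attempts):
        data = parse_json_or_none(get(url))
        if data is not None:
            return data
        if attempt < attempts - 1:
            time.sleep(delay)
    return None
```

Inside the loop from the question, this could be called as `fetch_json_with_retry(requests.get, url)`. A `None` result then means every attempt failed, and the printed snippets show what came back instead of JSON.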

Tags: python json web-scraping beautifulsoup

【Solution 1】:

Instead of

 decoded_content = response.json()

import the json module and use:

import json

# response.text is already a str, so no encoding argument is needed
decoded_content = json.loads(response.text)

【Discussion】:

【Solution 2】:

Try passing strict=False:

decoded_content = json.loads(response.text, strict=False)

【Discussion】:
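For context on what strict=False changes, the short sketch below uses only the standard json module. It shows that the flag lets raw control characters through inside strings, and that an HTML or empty body still fails with the same "Expecting value" message as in the question. The sample strings and the helper `decode_error` are made up for illustration.

```python
import json

# The Python escape "\t" puts a raw tab character inside the JSON string value.
with_control_char = '{"name": "a\tb"}'

# Default (strict) decoding rejects raw control characters inside strings.
try:
    json.loads(with_control_char)
    strict_accepts = True
except json.JSONDecodeError:
    strict_accepts = False

# strict=False accepts them.
lenient = json.loads(with_control_char, strict=False)


def decode_error(text, **kwargs):
    """Return the decoder's error message for `text`, or None if it parses."""
    try:
        json.loads(text, **kwargs)
    except json.JSONDecodeError as exc:
        return str(exc)
    return None


# A body that is not JSON at all fails the same way regardless of strict.
html_error = decode_error("<html>Request blocked</html>", strict=False)
empty_error = decode_error("", strict=False)

print(strict_accepts, lenient, html_error, empty_error, sep="\n")
```

So if the server is returning an error page or an empty body, strict=False on its own would not be expected to make the exception go away.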