Unable to GET entire page with Python request

【Question title】: Unable to GET entire page with Python request
【Posted】: 2023-11-12 10:04:01
【Question】:

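I am trying to fetch a long JSON response (about 75 MB) from a web page, but I only receive roughly the first 25 MB.

I have tried urllib2 and python-requests, but neither works. I have also tried reading the response in separate parts and streaming the data, but that does not work either.

A sample of the data can be found here: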

http://waterservices.usgs.gov/nwis/iv/?site=14377100&format=json&parameterCd=00060&period=P260W

My code is as follows:

import requests

r = requests.get("http://waterservices.usgs.gov/nwis/iv/?site=14377100&format=json&parameterCd=00060&period=P260W")

usgs_data = r.json() # script breaks here

# Save Longitude and Latitude of river
latitude = usgs_data["value"]["timeSeries"][0]["sourceInfo"]["geoLocation"]["geogLocation"]["latitude"]
longitude = usgs_data["value"]["timeSeries"][0]["sourceInfo"]["geoLocation"]["geogLocation"]["longitude"]

# dictionary of all past river flows in cubic feet per second
river_history = usgs_data['value']['timeSeries'][0]['values'][0]['value']

It breaks with:

ValueError: Expecting object: line 1 column 13466329 (char 13466328)

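This happens when the script tries to decode the JSON (i.e. usgs_data = r.json()).

The cause is that the full response has not been received, so the body is not a valid JSON object.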
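
To illustrate why a cut-off body fails to decode, here is a minimal sketch that uses only the standard json module. The helper name try_parse and the sample payload are made up for illustration; they are not part of the original script or of the USGS response format.

```python
import json


def try_parse(body):
    """Return the decoded JSON, or None if the body is not valid JSON."""
    try:
        return json.loads(body)
    except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
        print("could not decode %d characters: %s" % (len(body), exc))
        return None


# Hypothetical payload, only loosely shaped like the real response.
complete = json.dumps({"value": {"timeSeries": [{"values": [1, 2, 3]}]}})
truncated = complete[: len(complete) // 2]  # simulate a cut-off download

print(try_parse(complete) is not None)
print(try_parse(truncated) is None)
```

A JSON object cut off anywhere before its closing brace cannot be parsed, which is consistent with the ValueError shown above.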

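【Comments】:

Interesting, it works for me; r.json() does not throw an error.
@alecxe It seems to work for me occasionally and fail at other times. I suppose that supports the idea that something is wrong with their server.

Tags: python http python-requests urllib2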

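【Solution 1】:

The problem appears to be that the server cannot serve more than about 13 MB of data at a time.

I tried the URL with several HTTP clients, including curl and wget, and all of them bombed out at about 13 MB. I also tried enabling gzip compression (as you should too), but the result was still truncated at 13 MB after decompression.

You are requesting too much data, because period=P260W specifies 260 weeks. If you set period=P52W instead, you should find that you can retrieve a valid JSON response.

To reduce the amount of data transferred, set the Accept-Encoding header like this: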

import requests
url = 'http://waterservices.usgs.gov/nwis/iv/'
params = {'site': 11527000, 'format': 'json', 'parameterCd': '00060', 'period': 'P52W'}
# Pass params so the query string is actually sent with the request
r = requests.get(url, params=params, headers={'Accept-Encoding': 'gzip,deflate'})

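【Discussion】:

Actually, requests sets the Accept-Encoding: gzip, deflate header by default, so you shouldn't need to do that.
Unfortunately I need 260 weeks of data for this project, so I'm a bit stuck there. Is there anything I can do to make the server push more? It does seem to work occasionally.
@Ben: As a workaround, I would suggest making multiple requests over shorter intervals using the startDT and endDT parameters and then combining the results. However, even an interval of just 1 day sometimes results in a truncated response, perhaps because requests are handled by different servers. Give the idea a try, and if you still have problems I think you may need to contact the web service provider.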
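
The workaround in the last comment could be sketched roughly as follows. This is only an illustration under several assumptions: that startDT and endDT accept ISO dates (YYYY-MM-DD) and are both inclusive, that the readings sit at the same path used in the question, and that a truncated body surfaces as a ValueError from r.json(). The helper names split_date_range, fetch_json and fetch_history are hypothetical.

```python
from datetime import date, timedelta


def split_date_range(start, end, days_per_chunk):
    """Yield (chunk_start, chunk_end) pairs covering start..end inclusive."""
    step = timedelta(days=days_per_chunk)
    one_day = timedelta(days=1)
    chunk_start = start
    while chunk_start <= end:
        chunk_end = min(chunk_start + step - one_day, end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end + one_day


def fetch_json(url, params):
    """Default fetcher: one GET via requests, decoded as JSON."""
    import requests  # imported lazily so the helpers above work without it
    r = requests.get(url, params=params, timeout=60)
    r.raise_for_status()
    return r.json()  # raises ValueError if the body is not valid JSON


def fetch_history(url, base_params, start, end, days_per_chunk=30,
                  attempts=3, fetch=fetch_json):
    """Collect the readings for start..end with one request per chunk."""
    history = []
    for chunk_start, chunk_end in split_date_range(start, end, days_per_chunk):
        # Assumption: the service takes ISO dates in startDT / endDT.
        params = dict(base_params,
                      startDT=chunk_start.isoformat(),
                      endDT=chunk_end.isoformat())
        for attempt in range(attempts):
            try:
                data = fetch(url, params)
                break
            except ValueError:
                # A truncated body is not valid JSON; retry this chunk only.
                if attempt == attempts - 1:
                    raise
        # Same path to the readings as in the question's script.
        history.extend(data['value']['timeSeries'][0]['values'][0]['value'])
    return history
```

To use it, call fetch_history with the service URL, a parameter dict holding site, format and parameterCd, and two datetime.date objects for the range. The period parameter would presumably have to be left out when startDT and endDT are supplied. Retrying only the failed chunk keeps the cost of an occasional truncated response small, but it cannot help if the server truncates consistently.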