【Posted at】: 2021-04-13 15:09:34
【Problem Description】:
To give some context: I have a converter that requests JSON data in two steps. The first step requests the full set of data points; the second requests the details of each data point. I want to track progress because I expect to request large amounts of data from this point on. That's why I use tqdm, but it slows my program down by at least a factor of 8.
import requests
import json
import os
import time
from datetime import timedelta
from datetime import datetime
from datetime import date
import pandas as pd
import shutil
import zipfile
import smtplib, ssl
from progress.bar import Bar
from tqdm import tqdm
from time import sleep
The code is as follows:
def fetch_data_points(url: str):
    limit_request = 100
    # Placeholder for limit: please do not remove = 1000000000 -JJ
    folder_path_reset("api_request_jsons", "csv", "Geographic_information")
    total_start_time = start_time_measure()
    start_time = start_time_measure(
        'Starting Phase 1: First request from API: Data Points')
    for i in tqdm(range(limit_request)):
        response = requests.get(url, params={"limit": limit_request})
    API_status_report(response)
    end_time_measure(total_start_time, "Request completed: ")
    end_time_measure(total_start_time, "End of Phase 1, completed in: ")
    return response.json()
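One thing worth separating out is tqdm's own cost versus the loop body's cost: in the snippet above, the tqdm version issues the GET request on every one of the 100 iterations, which may account for most of the extra time by itself. tqdm's per-iteration overhead on a no-op loop is typically microseconds. A minimal sketch to measure it (the helper name `tqdm_overhead` is invented for illustration):

```python
import time
from tqdm import tqdm

def tqdm_overhead(n: int = 100) -> float:
    """Extra wall-clock seconds tqdm adds to an n-iteration no-op loop."""
    # Time the bare loop first.
    start = time.perf_counter()
    for _ in range(n):
        pass
    bare = time.perf_counter() - start

    # Time the same loop wrapped in a default tqdm bar (writes to stderr).
    start = time.perf_counter()
    for _ in tqdm(range(n)):
        pass
    return (time.perf_counter() - start) - bare
```

If this reports only milliseconds of overhead for 100 iterations, the 21-second run is dominated by the requests themselves rather than by the bar.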
Note the times here:
This is the console output with tqdm:
Starting Phase 1: First request from API: Data Points
100%|██████████| 100/100 [00:21<00:00, 4.69it/s]Successfull connection!
Request completed: 0:00:21.359000
End of Phase 1, completed in: 0:00:21.359000
Saving points
Exported_data\api_request_jsons\Fetch_points\Points.json saved
Point saved: 0:00:00.016000
Data saved. Total time of program run: 0:00:00.016000
Starting Phase 2: Second request from API: 100 requested
9%|▉ | 9/100 [02:12<22:17, 14.69s/it]
Here is the console output without tqdm:
Starting Phase 1: First request from API: Data Points
Successfull connection!
Request completed: 0:00:00.297000
End of Phase 1, completed in: 0:00:00.297000
Saving points
Exported_data\api_request_jsons\Fetch_points\Points.json saved
Point saved: 0:00:00.015000
Data saved. Total time of program run: 0:00:00.015000
Starting Phase 2: Second request from API: 100 requested
10%|█ | 10/100 [01:54<16:52, 11.25s/it]
As you can see here, the program slows down dramatically with tqdm. A request for 100 points normally takes 0.297 seconds, but with tqdm it takes 0:00:21.359000. I expected maybe a 2x slowdown, but this is far worse. Can anyone give me advice on reducing this slowdown as much as possible?
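If the bar is kept, its console writes can at least be throttled with tqdm's `mininterval` and `miniters` parameters, which cap how often the bar redraws. A small sketch (the loop body is a stand-in for the real request, and the bar output is captured in a buffer purely for demonstration):

```python
import io
from tqdm import tqdm

def run_with_throttled_bar(n: int) -> int:
    """Redraw the bar at most once per second and only every 10 iterations,
    so slow console I/O is hit far less often."""
    done = 0
    sink = io.StringIO()  # demo only: capture bar output instead of printing
    for _ in tqdm(range(n), miniters=10, mininterval=1.0, file=sink):
        done += 1  # stand-in for requests.get(...)
    return done
```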
Edit: OK, I've decided to drop the tqdm measurement for the first function; I just couldn't get it right. It needed too much tweaking, and I noticed the numbers were clearly inconsistent when I changed the amount of data requested.
So I tried it on the second function; the relevant code is this:
To explain it: it requests the details of each data point and puts them into an array for later use:
def fetch_details_of_data_points(url: str):
    input_json = fetch_data_points(url)
    fetch_points_save(input_json)
    all_location_points_details = []
    amount_of_objects = len(input_json)
    total_start_time = start_time_measure()
    start_time = start_time_measure(f'Starting Phase 2: Second request from API: {str(amount_of_objects)} requested')
    for i in tqdm(range(amount_of_objects), miniters=1):
        for obj in input_json:
            all_location_points_details.append(fetch_details(obj.get("detail")))

def fetch_details(url: str):
    response = requests.get(url)
    # Makes a request call to get the detail data
    # save_file(folder_path, GipodId, text2)
    # any other processes
    return response.json()
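Note that as written, `fetch_details_of_data_points` runs a full pass over `input_json` inside the `tqdm(range(amount_of_objects))` loop, so every detail URL gets fetched `amount_of_objects` times. Iterating tqdm over the list directly gives one request and one bar tick per point. A sketch with an injectable `fetch` callable (the name `fetch_all_details` is hypothetical; in the question's code `fetch` would be `fetch_details`, a `requests.get` wrapper):

```python
from tqdm import tqdm

def fetch_all_details(input_json, fetch):
    """One call per data point; the bar advances once per completed call."""
    details = []
    for obj in tqdm(input_json, miniters=1):
        details.append(fetch(obj.get("detail")))
    return details
```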
But here I get this error:
Message=('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Source=C:\Users\xxxxxx\GIPOD_REQUEST_CONVERSION.py
StackTrace:
File "C:\Users\QF6207\xxxxxx\GIPOD_REQUEST_CONVERSION.py", line 195, in fetch_details
response = requests.get(url)
File "C:\Users\xxxxxx\GIPOD_REQUEST_CONVERSION.py", line 361, in fetch_details_of_data_points
all_location_points_details.append(fetch_details(obj.get("detail")))
File "C:\Users\xxxxxx\GIPOD_REQUEST_CONVERSION.py", line 446, in <module> (Current frame)
fetch_details_of_data_points(api_response_url)
As far as I can tell, the request for a single data point apparently takes too long, which causes the disconnect. Notably, I know from experience that a request for one data point takes about 0.25 seconds. So in theory the progress bar should update and count up once every 0.25 seconds.
Now, if this could be solved by using the get call's response time as the bar's update step, that would help make the progress bar more accurate.
So how do I do that?
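Driving the bar by elapsed time rather than item count can be done with a manual tqdm and `update(elapsed)`: set the total to an estimated duration (0.25 s per request, the figure mentioned above) and advance the bar by each call's measured time. A sketch under those assumptions (`fetch_with_time_bar` and its parameters are invented for illustration):

```python
import time
from tqdm import tqdm

def fetch_with_time_bar(urls, fetch, est_seconds_per_request=0.25):
    """Bar total is the estimated run time in seconds; each update()
    advances by the real elapsed time of one fetch call."""
    results = []
    with tqdm(total=len(urls) * est_seconds_per_request, unit="s") as bar:
        for url in urls:
            t0 = time.perf_counter()
            results.append(fetch(url))
            bar.update(time.perf_counter() - t0)
    return results
```

When the 0.25 s estimate is off, only the displayed percentage drifts; the elapsed times shown stay accurate.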
Edit: I've found a solution to my problem that actually adds very little delay. After reading through the documentation, I found a creative way to update the bar manually once each function completes.
with tqdm(total=limit) as firstrequest:
    all_location_points_details = fetch_details_of_data_points(url, limit)
    firstrequest.update(limit)

with tqdm(total=amount_of_objects) as second_request:
    for obj in input_json:
        all_location_points_details.append(fetch_details(obj.get("detail")))
        second_request.update(1)
【Question Comments】:
-
Console I/O is inherently slow, and tqdm does a lot of it, so I suspect this is just a trade-off you'll have to make. Reducing the update frequency is your only real option.
-
I'm thinking of dropping the tqdm measurement for the first phase of my program, because I just noticed that for the first request the numbers vary a lot time-wise. I may still do it for the second function (that is, the fetch request for each data point), but I'm running into connection errors there. So I'll post an update.