[Posted]: 2020-05-01 21:47:25
[Question]:
When downloading files with a multithreaded queue, memory usage keeps growing. The response data seems to be held in memory and never released, which looks strange. Does anyone know why?
Here is my source code:
import threading
from queue import Queue

import requests


class MultiThreadsModule():
    def __init__(self, thread_num):
        self.print_lock = threading.Lock()
        self.compress_queue = Queue()
        self.thread_num = thread_num
        self.all_nodes = None
        self.func_main = None

    def thread_loop(self):
        # Start the worker threads; daemon threads exit with the main thread.
        self.thread_list = []
        for _ in range(self.thread_num):
            t = threading.Thread(target=self.process_queue)
            t.daemon = True
            t.start()
            self.thread_list.append(t)

    def node_queue(self, nodes):
        for node in nodes:
            self.compress_queue.put(node)
        # Block until every queued node has been marked done.
        self.compress_queue.join()

    def process_queue(self):
        while True:
            node = self.compress_queue.get()
            self.func_main(node)
            self.compress_queue.task_done()

    def run(self):
        self.node_queue(self.all_nodes)


def download_file_(url):
    r = requests.get(url, stream=True, timeout=600)
    return r.text


if __name__ == '__main__':
    mtm = MultiThreadsModule(20)
    mtm.all_nodes = ["https://www.sec.gov/Archives/edgar/data/913951/000095013399003276/0000950133-99-003276-d2.pdf"] * 1000
    mtm.func_main = download_file_
    mtm.thread_loop()
    mtm.run()
My memory usage keeps growing with the size of the PDFs being downloaded. When I kill the download script, memory returns to normal.
Here is my memory usage history:
(base) jay@ubuntu:~$ free
total used free shared buff/cache available
Mem: 16030900 7501440 6342032 207336 2187428 8000996
Swap: 2097148 66828 2030320
(base) jay@ubuntu:~$ free
total used free shared buff/cache available
Mem: 16030900 7499512 6344080 207124 2187308 8003148
Swap: 2097148 66828 2030320
(base) jay@ubuntu:~$ free
total used free shared buff/cache available
Mem: 16030900 7484624 6304240 202692 2242036 8022128
Swap: 2097148 66828 2030320
(base) jay@ubuntu:~$ free
total used free shared buff/cache available
Mem: 16030900 7482960 6305724 202692 2242216 8023788
Swap: 2097148 66828 2030320
(base) jay@ubuntu:~$ free
total used free shared buff/cache available
Mem: 16030900 7559828 6210116 216200 2260956 7933424
Swap: 2097148 66828 2030320
(base) jay@ubuntu:~$ free
total used free shared buff/cache available
Mem: 16030900 7559700 6204536 217868 2266664 7931840
Swap: 2097148 66828 2030320
(base) jay@ubuntu:~$ free
total used free shared buff/cache available
Mem: 16030900 7637356 6127720 212544 2265824 7859580
(base) jay@ubuntu:~$ free
total used free shared buff/cache available
Mem: 16030900 7871248 5816500 262944 2343152 7575232
Swap: 2097148 66828 2030320
(base) jay@ubuntu:~$ free
total used free shared buff/cache available
Mem: 16030900 8412848 5193552 252832 2424500 7042864
Swap: 2097148 66828 2030320
The strangest part is that if I switch to a file link from another site, everything works fine and memory never grows like above. The other link: sse.com.cn/disclosure/listedinfo/announcement/c/2019-12-27/….
(base) jay@ubuntu:~$ curl -I https://www.sec.gov/Archives/edgar/data/913951/000095013399003276/0000950133-99-003276-d2.pdf
HTTP/1.1 200 OK
Date: Thursday, 16-Jan-20 12:07:34 CST
Keep-Alive: timeout=58
Content-Length: 0
HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Type: application/pdf
ETag: "9a2245050c11f5db74ed734eb31b31a0"
Last-Modified: Mon, 02 Oct 2017 21:50:29 GMT
Server: AmazonS3
x-amz-id-2: BwWoaQxPxWSoKT3cJz2fpFLf9j53sdO20m4IedR9I5ZJNBHIFyH4AuqiN9HRx45sSdw/NmhkAjs=
x-amz-meta-mode: 33188
x-amz-replication-status: REPLICA
x-amz-request-id: 33A22AD1A6F87DE0
x-amz-version-id: lVtscFRHVvquEIo8.Q7sUnmAO1nQkKm7
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Content-Length: 1741924
Date: Thu, 16 Jan 2020 04:07:35 GMT
Connection: keep-alive
Strict-Transport-Security: max-age=31536000 ; includeSubDomains ; preload
(base) jay@ubuntu:~$ curl -I http://www.sse.com.cn/disclosure/listedinfo/announcement/c/2019-12-27/603818_2018_nA.pdf
HTTP/1.1 200 OK
Content-Length: 3899122
Accept-Ranges: bytes
Age: 588
Content-Type: application/pdf
Date: Thu, 16 Jan 2020 04:08:09 GMT
Etag: "WAa9c77cbc49bb90b3"
Keep-Alive: timeout=58
Last-Modified: Thu, 26 Dec 2019 09:35:22 GMT
Server: Apache
X-Wa-Info: [V2.S11101.A12708.P79382.N26848.RN0.U4201449325].[OT/pdf.OG/documents]
Eventually memory is exhausted and the system can no longer allocate memory.
[Discussion]:
-
First, if you intend to download the same file in segments, you need to specify the byte range you want as a header; right now you are just downloading the same file from the beginning every time. Run
curl -I http://file to check whether "Accept-Ranges" is supported. Second, you are only downloading the data without iterating over the fetched data / saving it. Take a look at this -
@Xosrov Thanks for the reply; the identical file link is only for testing.
-
@Xosrov The strangest part is that if I switch to a file link from another site, everything works fine and memory never grows like in the problem above. The other link: sse.com.cn/disclosure/listedinfo/announcement/c/2019-12-27/…. I've racked my brain and still don't know why. Any other suggestions?
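For the Range-header approach mentioned in the comments (the SEC server does answer the curl -I probe with Accept-Ranges: bytes and Content-Length: 1741924), the file size could be split into per-thread byte ranges like this. byte_ranges is a hypothetical helper, not part of any library:

```python
def byte_ranges(total_size, n_parts):
    """Split total_size bytes into n_parts contiguous (start, end)
    ranges, inclusive on both ends, matching HTTP Range semantics."""
    part = total_size // n_parts
    ranges = []
    start = 0
    for i in range(n_parts):
        # The last range absorbs the remainder of the division.
        end = total_size - 1 if i == n_parts - 1 else start + part - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# Each (start, end) pair then becomes a request header such as
#   headers = {"Range": f"bytes={start}-{end}"}
#   requests.get(url, headers=headers, stream=True, timeout=600)
# and the segments are written to their offsets in the target file.
```

This only makes sense when the goal really is one file fetched in parallel segments; for many distinct files, streaming each one whole (as above) is simpler.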
Tags: python multithreading python-requests queue web-crawler