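First of all, despite using the multiprocessing module, you are using multithreading here, because .dummy uses threads rather than processes.
I initially assumed OP would be fine with multithreading, because the example showed no sign of heavy CPU-bound work. Since we now know OP may indeed want multiprocessing, I am also providing a multiprocessing solution.
OP's example needs a rework of the synchronization around the whole proxy handling. I simplified the example by mocking the request part and dropping the soup part, since it is not important to the problem.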
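Multiprocessing
This solution uses multiprocessing.Value as a shared counter to index the proxy list. If a worker times out, it increments the shared index. The shared counter and the proxy list are registered once at (worker) process startup with the help of Pool's initializer parameter.
It is important to use a lock for any non-atomic operation on non-static shared resources. multiprocessing.Value comes with a multiprocessing.RLock attached by default, which we can use.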
import time
import random
import logging
from multiprocessing import Pool, Value, get_logger, log_to_stderr
def request_get(link, proxies, timeout):
    """Dummy request.get()"""
    res = random.choices(["Result", "Timeout"], [0.5, 0.5])
    if res[0] == "Result":
        time.sleep(random.uniform(0, timeout))
        return f"{res[0]} from {link}"
    else:
        time.sleep(timeout)
        raise TimeoutError

def parse_product_info(link):
    global proxy_list, proxy_index

    while True:
        with proxy_index.get_lock():
            idx = proxy_index.value

        try:
            proxy = {'https': proxy_list[idx]}
        except IndexError:
            # get_logger().info(f"No proxies left.")
            return

        try:
            # get_logger().info(f"attempt using: {proxy}")
            res = request_get(link, proxies=proxy, timeout=5)
        except TimeoutError:
            # get_logger().info(f"timeout with: {proxy}")
            with proxy_index.get_lock():
                # check with lock held if index is still the same
                if idx == proxy_index.value:
                    proxy_index.value += 1
                    # get_logger().info(f"incremented index: {proxy_index.value}")
        else:
            # get_logger().info(f"processing: {res}")
            return

def _init_globals(proxy_list, proxy_index):
    globals().update(
        {'proxy_list': proxy_list, 'proxy_index': proxy_index}
    )
Main:
if __name__ == '__main__':

    log_to_stderr(logging.INFO)

    links = [
        'https://www.amazon.com/dp/B00OI0RGGO',
        'https://www.amazon.com/dp/B00TPKOPWA',
        'https://www.amazon.com/dp/B00TH42HWE',
        'https://www.amazon.com/dp/B00TPKNREM',
    ]

    proxies = [
        '103.110.37.244:36022', '180.254.218.229:8080', '110.74.197.207:50632',
        '1.20.101.95:49001', '200.10.193.90:8080', '173.164.26.117:3128',
        '103.228.118.66:43002', '178.128.231.201:3128', '1.2.169.54:55312',
        '181.52.85.249:31487', '97.64.135.4:8080', '190.96.214.123:53251',
        '52.144.107.142:31923', '45.5.224.145:52035', '89.218.22.178:8080',
        '192.241.143.186:80', '113.53.29.218:38310', '36.78.131.182:39243'
    ]
    proxies = [f"http://{proxy}" for proxy in proxies]
    proxy_index = Value('i', 0)

    with Pool(
            processes=3,
            initializer=_init_globals,
            initargs=(proxies, proxy_index)
    ) as pool:

        pool.map(parse_product_info, links)
Example output:
[INFO/MainProcess] allocating a new mmap of length 4096
[INFO/ForkPoolWorker-1] child process calling self.run()
...
[INFO/ForkPoolWorker-1] attempt using: {'https': 'http://103.110.37.244:36022'}
[INFO/ForkPoolWorker-2] attempt using: {'https': 'http://103.110.37.244:36022'}
[INFO/ForkPoolWorker-3] attempt using: {'https': 'http://103.110.37.244:36022'}
[INFO/ForkPoolWorker-2] processing: Result from https://www.amazon.com/dp/B00TPKOPWA
[INFO/ForkPoolWorker-2] attempt using: {'https': 'http://103.110.37.244:36022'}
[INFO/ForkPoolWorker-3] timeout with: {'https': 'http://103.110.37.244:36022'}
[INFO/ForkPoolWorker-3] incremented index: 1
[INFO/ForkPoolWorker-3] attempt using: {'https': 'http://180.254.218.229:8080'}
[INFO/ForkPoolWorker-1] timeout with: {'https': 'http://103.110.37.244:36022'}
[INFO/ForkPoolWorker-1] attempt using: {'https': 'http://180.254.218.229:8080'}
[INFO/ForkPoolWorker-3] processing: Result from https://www.amazon.com/dp/B00TH42HWE
[INFO/ForkPoolWorker-2] processing: Result from https://www.amazon.com/dp/B00TPKNREM
[INFO/ForkPoolWorker-1] processing: Result from https://www.amazon.com/dp/B00OI0RGGO
[INFO/ForkPoolWorker-3] process shutting down
[INFO/ForkPoolWorker-2] process shutting down
...
Process finished with exit code 0
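The step that matters most in parse_product_info() is the check with the lock held before incrementing: several workers can time out on the same proxy, but the index should move forward only once. Below is a minimal sketch that isolates this step. The names advance_if_unchanged and simulate_timeouts are made up for illustration, and the sketch drives the Value from threads in a single process just to keep it short; the get_lock() call is the same one the pool workers use.

```python
import threading
from multiprocessing import Value


def advance_if_unchanged(counter, seen_idx):
    """Increment counter only if it still equals the index the caller used.

    Returns True if this call performed the increment.
    """
    with counter.get_lock():
        if counter.value == seen_idx:
            counter.value += 1
            return True
        return False


def simulate_timeouts(n_workers=5):
    """All workers 'time out' on proxy 0 and report it at about the same time."""
    counter = Value('i', 0)
    outcomes = []  # list.append is thread-safe in CPython

    def worker():
        outcomes.append(advance_if_unchanged(counter, seen_idx=0))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value, outcomes
```

Without the comparison, every worker that timed out on proxy 0 would bump the index and perfectly usable proxies would be skipped.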
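Multithreading
The proposal below synchronizes proxy handling with the help of a threading.Lock (also available wrapped as multiprocessing.dummy.Lock), which is possible because multiprocessing.dummy only uses threads.
Note that multiprocessing.Lock (the one not from .dummy) is by comparison a heavy (relatively slow) IPC lock, which will affect overall performance depending on how frequently you synchronize.
Edit:
The multithreading solution has been refactored from an earlier draft to adopt the design of the multiprocessing solution above. parse_product_info() is now almost identical for multithreading and multiprocessing.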
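If you want to convince yourself that multiprocessing.dummy really stays inside one process, a small check along these lines should do. The helper names worker_identity and run_in_dummy_pool are only illustrative and not part of the solution.

```python
import os
import threading
from multiprocessing.dummy import Pool, Lock


def worker_identity(_):
    """Report which process and which thread ran this task."""
    return os.getpid(), threading.current_thread().name


def run_in_dummy_pool(n_tasks=6):
    """Map a few tasks over a multiprocessing.dummy pool of three workers."""
    with Pool(processes=3) as pool:
        return pool.map(worker_identity, range(n_tasks))
```

Every returned pid equals os.getpid() of the caller, and in CPython multiprocessing.dummy.Lock is simply threading.Lock re-exported, so no IPC is involved.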
import time
import random
import logging
from itertools import repeat
from multiprocessing.dummy import Pool, Lock
get_logger = logging.getLogger
def request_get(link, proxies, timeout):
    ...  # same as in multiprocessing solution above

def parse_product_info(link):
    global proxies, proxy_index

    while True:
        with proxy_lock:
            idx_proxy = proxy_index

        try:
            proxy = {'https': proxies[idx_proxy]}
        except IndexError:
            # get_logger().info(f"No proxies left.")
            return

        try:
            # get_logger().info(f"attempt using: {proxy}")
            res = request_get(link, proxies=proxy, timeout=5)
        except TimeoutError:
            # get_logger().info(f"timeout with: {proxy}")
            with proxy_lock:
                # check with lock held if index is still the same
                if idx_proxy == proxy_index:
                    proxy_index += 1
                    # get_logger().info(f"incremented index:{proxy_index}")
        else:
            # get_logger().info(f"processing: {res}")
            return

def init_logging(level=logging.INFO):
    fmt = '[%(asctime)s %(threadName)s] --- %(message)s'
    logging.basicConfig(format=fmt, level=level)
    return logging.getLogger()
Main:
if __name__ == '__main__':

    init_logging()

    links = ...  # same as in multiprocessing solution above
    proxies = ...  # same as in multiprocessing solution above
    proxy_index = 0
    proxy_lock = Lock()

    with Pool(processes=3) as pool:
        pool.map(parse_product_info, links)
Example output:
[2019-12-18 01:40:25,799 Thread-1] --- attempt using: {'https': 'http://103.110.37.244:36022'}
[2019-12-18 01:40:25,799 Thread-2] --- attempt using: {'https': 'http://103.110.37.244:36022'}
[2019-12-18 01:40:25,799 Thread-3] --- attempt using: {'https': 'http://103.110.37.244:36022'}
[2019-12-18 01:40:26,164 Thread-1] --- processing: Result from https://www.amazon.com/dp/B00OI0RGGO
[2019-12-18 01:40:26,164 Thread-1] --- attempt using: {'https': 'http://103.110.37.244:36022'}
[2019-12-18 01:40:29,568 Thread-1] --- processing: Result from https://www.amazon.com/dp/B00TPKNREM
[2019-12-18 01:40:30,800 Thread-2] --- timeout with: {'https': 'http://103.110.37.244:36022'}
[2019-12-18 01:40:30,800 Thread-2] --- incremented index: 1
[2019-12-18 01:40:30,800 Thread-2] --- attempt using: {'https': 'http://180.254.218.229:8080'}
[2019-12-18 01:40:30,800 Thread-3] --- timeout with: {'https': 'http://103.110.37.244:36022'}
[2019-12-18 01:40:30,801 Thread-3] --- attempt using: {'https': 'http://180.254.218.229:8080'}
[2019-12-18 01:40:32,941 Thread-3] --- processing: Result from https://www.amazon.com/dp/B00TH42HWE
[2019-12-18 01:40:34,677 Thread-2] --- processing: Result from https://www.amazon.com/dp/B00TPKOPWA
Process finished with exit code 0
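In reply to OP's latest comment:
If you want, you can swap in a new proxy list in the IndexError exception handler block once all proxies have been used. In the code you replace return with:
with proxy_lock:
    proxies = new_proxies
    proxy_index = 0
continue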
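One thing to watch with the swap: several threads can run out of proxies at about the same time, and each of them would then reset the index. A possible way around that is to re-check under the lock whether the list is still exhausted before swapping. The sketch below wraps the list, the index and the lock in a small helper; ProxyRotation and its method names are hypothetical and only meant to show the idea, not a drop-in replacement for the globals used above.

```python
import threading


class ProxyRotation:
    """Hypothetical helper: a proxy list plus a shared index behind one lock."""

    def __init__(self, proxies):
        self._lock = threading.Lock()
        self._proxies = list(proxies)
        self._index = 0

    def current(self):
        """Return (index, proxy), or (index, None) if the list is used up."""
        with self._lock:
            idx = self._index
            proxy = self._proxies[idx] if idx < len(self._proxies) else None
            return idx, proxy

    def advance_if_unchanged(self, seen_idx):
        """Move to the next proxy unless another thread already did."""
        with self._lock:
            if seen_idx == self._index:
                self._index += 1

    def refill_if_exhausted(self, new_proxies):
        """Swap in a fresh list only if the current one is still used up.

        Returns True if this call performed the swap.
        """
        with self._lock:
            if self._index >= len(self._proxies):
                self._proxies = list(new_proxies)
                self._index = 0
                return True
            return False
```

A worker that gets None back from current() would call refill_if_exhausted() and then continue its loop; only the first caller actually swaps, the others just carry on with the fresh list.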