Multithreading for faster downloading: answers

【Question title】: Multithreading for faster downloading
【Posted】: 2012-05-17 08:27:23
【Question】:

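How can I download multiple links simultaneously? My script below works, but it downloads only one link at a time and is extremely slow. I can't figure out how to incorporate multithreading into my script.

Python script: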

from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re

print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
  url = link.get('href')
  name = urlparse.urlparse(url).path.split('/')[-1]
  dirname = urlparse.urlparse(url).path.split('.')[-1]
  f = urllib2.urlopen(url)
  s = f.read()
  if (os.path.isdir(dirname) == 0): 
    os.mkdir(dirname)
  soup = BeautifulSoup(s)
  articleTag = soup.html.body.article
  converted = str(articleTag)
  full_path = os.path.join(dirname, name)
  open(full_path, 'w').write(converted)
  print(name)

The HTML file, named links.html:

<a href="http://www.youversion.com/bible/gen.1.nmv-fas">http://www.youversion.com/bible/gen.1.nmv-fas</a>

<a href="http://www.youversion.com/bible/gen.2.nmv-fas">http://www.youversion.com/bible/gen.2.nmv-fas</a>

<a href="http://www.youversion.com/bible/gen.3.nmv-fas">http://www.youversion.com/bible/gen.3.nmv-fas</a>

<a href="http://www.youversion.com/bible/gen.4.nmv-fas">http://www.youversion.com/bible/gen.4.nmv-fas</a>

【Comments】:

You haven't tried anything yet, so you don't actually have a problem we can help solve.
open(full_path, 'wb').write(converted) !!! You want to download binary files.

Tags: python beautifulsoup lxml urllib2 urllib

【Solution 1】:

I use multiprocessing for this kind of parallel work; for some reason I like it better than threading.

from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re
import multiprocessing


print ("downloading and parsing Bibles...")
def download_stuff(link):
  url = link.get('href')
  name = urlparse.urlparse(url).path.split('/')[-1]
  dirname = urlparse.urlparse(url).path.split('.')[-1]
  f = urllib2.urlopen(url)
  s = f.read()
  if (os.path.isdir(dirname) == 0): 
    os.mkdir(dirname)
  soup = BeautifulSoup(s)
  articleTag = soup.html.body.article
  converted = str(articleTag)
  full_path = os.path.join(dirname, name)
  open(full_path, 'w').write(converted)
  print(name)

root = html.parse(open('links.html'))
links = root.findall('//a')
pool = multiprocessing.Pool(processes=5) #use 5 processes to download the data
output = pool.map(download_stuff,links)  #output is a list of [None,None,...] since download_stuff doesn't return anything

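【Comments】:

I get this error: AssertionError: invalid Element proxy at 163319020
@Blainer: That's a little strange (although I don't know how lxml.html works, so maybe it isn't...). You could try passing the url instead of the link. It's possible that lxml.html keeps some kind of proxy/handle to a position in the file, and that multiprocessing does not pickle/unpickle it properly. If link.get returns a string, that should play a little nicer...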
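The suggestion in the comment above (hand the pool plain URL strings rather than lxml link elements) could be sketched roughly as follows. This is only an illustration under several assumptions: it is written for Python 3, so it uses urllib.request and urllib.parse instead of the urllib2 and urlparse modules from the answer; it collects the links with the standard-library HTMLParser instead of lxml; and it saves the raw page bytes, leaving out the BeautifulSoup step. The names HrefCollector, extract_urls, target_path, download_one and download_all are made up for this example.

```python
import multiprocessing
import os
from html.parser import HTMLParser
from urllib.parse import urlparse
from urllib.request import urlopen


class HrefCollector(HTMLParser):
    """Collects the href of every <a> tag as a plain string."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)


def extract_urls(html_text):
    """Return the link targets as strings, which pickle without trouble."""
    collector = HrefCollector()
    collector.feed(html_text)
    return collector.urls


def target_path(url):
    """Same naming rule as the question: the directory is the text after the
    last dot in the URL path, the file name is the last path segment."""
    path = urlparse(url).path
    return os.path.join(path.split(".")[-1], path.split("/")[-1])


def download_one(url):
    """Worker run in a child process; its only argument is a string."""
    full_path = target_path(url)
    # exist_ok avoids a race when two workers create the same directory
    os.makedirs(os.path.dirname(full_path), exist_ok=True)
    with urlopen(url) as response, open(full_path, "wb") as out:
        out.write(response.read())
    return full_path


def download_all(urls, processes=5):
    """Map the worker over the URL strings with a pool of processes."""
    with multiprocessing.Pool(processes=processes) as pool:
        return pool.map(download_one, urls)
```

To use it, read links.html into a string, pass it to extract_urls, and give the resulting list to download_all. On platforms that start worker processes by re-importing the script, that call belongs under an `if __name__ == "__main__":` guard.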

【Solution 2】:

As of 2017 there are some other options as well, such as asyncio and ThreadPoolExecutor.

Here is an example using ThreadPoolExecutor (part of Python's concurrent.futures module):

from concurrent.futures import ThreadPoolExecutor

def download(url, filename):
    # ... your download function ...
    pass

# url and filename stand in for your own values
with ThreadPoolExecutor(max_workers=12) as executor:
    future = executor.submit(download, url, filename)
    print(future.result())

The submit() function adds the task to a queue. (The queue management is done for you.)
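To make the queueing behaviour of submit() concrete, here is a rough sketch that submits one task per URL and collects the results as they finish. It assumes Python 3; download_all and its download parameter are hypothetical names, and download stands for whatever function you write to fetch a single URL.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, download, max_workers=12):
    """Submit download(url) for every URL; return (results, errors) dicts."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # submit() returns immediately; the executor queues the work
        future_to_url = {executor.submit(download, url): url for url in urls}
        # as_completed() yields each future as soon as it has finished
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one failed URL should not stop the rest
                errors[url] = exc
    return results, errors
```

Calling future.result() only after every task has been submitted is what lets the downloads overlap; calling it right after each submit() would wait for that one task before starting the next.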

Python 3.5 through 3.7:
if max_workers is None or not given, it will default to the number of processors on the machine, multiplied by 5. (From Python 3.8 on, the default is min(32, os.cpu_count() + 4).)

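You can set max_workers to a few times the number of actual CPU cores. Run some tests, keeping context-switching overhead in mind, to see how high you can go before the gains level off.

【Comments】:

【Solution 3】: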
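One way to run that kind of test is sketched below. It is an assumption-laden stand-in rather than a real benchmark: simulated_download just sleeps instead of touching the network, and both it and time_batch are names invented for this example. Swap in your own download function to measure real behaviour.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def simulated_download(delay):
    """Stand-in for a network request: sleeps instead of doing real I/O."""
    time.sleep(delay)
    return delay


def time_batch(delays, max_workers):
    """Run every simulated download on a pool of the given size and
    return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        list(executor.map(simulated_download, delays))
    return time.perf_counter() - start


if __name__ == "__main__":
    delays = [0.05] * 16  # sixteen pretend downloads of 50 ms each
    for workers in (1, 2, 4, 8, 16):
        print(workers, round(time_batch(delays, workers), 3))
```

With real downloads the curve usually flattens at some point, because the server, the network or the parsing step becomes the bottleneck rather than the number of threads.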

To me this looks like a producer-consumer problem; see Wikipedia.

You could use

import os
import urllib2
import urlparse
import Queue, threading

import lxml.html as html
from BeautifulSoup import BeautifulSoup

# create a Queue.Queue here
queue = Queue.Queue()

print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
  url = link.get('href')
  queue.put(url) # produce

def worker():
  while True:
    try:
      url = queue.get_nowait() # consume
    except Queue.Empty:
      return # nothing left to download
    name = urlparse.urlparse(url).path.split('/')[-1]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    f = urllib2.urlopen(url)
    s = f.read()
    if (os.path.isdir(dirname) == 0):
      os.mkdir(dirname)
    soup = BeautifulSoup(s)
    articleTag = soup.html.body.article
    converted = str(articleTag)
    full_path = os.path.join(dirname, name)
    open(full_path, 'wb').write(converted)
    print(name)

t = threading.Thread(target=worker) # start 1 thread
t.start()
t.join() # wait for the thread; otherwise the script can exit before it finishes
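The producer-consumer structure above only pays off once several consumers share the queue. A minimal sketch of that, assuming Python 3 (where the module is named queue rather than Queue), might look like this. run_workers and its handle parameter are hypothetical names; handle stands for the per-URL download-and-save function.

```python
import queue
import threading


def run_workers(urls, handle, num_workers=4):
    """Feed urls through a queue to num_workers threads.

    Returns {url: result of handle(url)}; if handle raises, the exception
    object is stored as the result for that URL instead.
    """
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()
    stop = object()  # sentinel that tells a worker to exit

    def worker():
        while True:
            url = tasks.get()  # consume (blocks until an item arrives)
            if url is stop:
                return
            try:
                outcome = handle(url)
            except Exception as exc:  # keep the worker alive on a bad item
                outcome = exc
            with lock:
                results[url] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in urls:  # produce
        tasks.put(url)
    for _ in threads:  # one sentinel per worker, queued after the real work
        tasks.put(stop)
    for t in threads:
        t.join()
    return results
```

The sentinels are queued after all the URLs, so each worker keeps pulling work until the queue is drained and then exits, and the join() calls keep the main thread waiting until that has happened.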

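【Comments】:

I get this error: NameError: name 'queue' is not defined. If I capitalize it to "Queue", I get this error instead: AttributeError: 'module' object has no attribute 'put'
Do you really want a function named thread when you are importing a module with the same name?
Correct. This is only a structural suggestion, not a solution, since he knows multithreading.
@Blainer, the first line should probably be from Queue import Queue (for Python 2.x) and from threading import Thread, and then the last line should be Thread.start_new(thread,()) ... I think. (I don't really use threading, so I'm not sure.)
@user1320237, we are here to solve problems. Hand-waving and theoretical solutions don't help.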