Python: downloading multiple files from links on pages (answers)

【Question title】: Python download multiple files from links on pages
【Posted】: 2018-02-26 02:11:33
【Question description】:

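I am trying to download all the PGNs from this site.

I think I have to use urlopen to open each URL and then use urlretrieve to download each PGN, reaching it through the download button near the bottom of each game page. Do I have to create a new BeautifulSoup object for each game? I am also unsure how urlretrieve works.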

import urllib
from urllib.request import urlopen, urlretrieve, quote
from bs4 import BeautifulSoup

url = 'http://www.chessgames.com/perl/chesscollection?cid=1014492'
u = urlopen(url)
html = u.read().decode('utf-8')

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a'):
    urlopen('http://chessgames.com'+link.get('href'))

【Question discussion】:

Tags: python python-3.x beautifulsoup urllib

【Solution 1】:

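The accepted answer is great, but the task is embarrassingly parallel; there is no need to retrieve these subpages and files one at a time. This answer shows how to speed things up.

The first step is to use requests.Session() when sending multiple requests to a single host. Quoting Advanced Usage: Session Objects from the requests documentation:

The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3's connection pooling. So if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase (see HTTP persistent connection).

Next, asynchronous code, multiprocessing, or multithreading can be used to parallelize the workload. Each has tradeoffs that bear on the task at hand, and the best choice is likely to be determined by benchmarking and profiling. This page offers great examples of all three.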

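For the purposes of this post, I'll show multithreading. The GIL shouldn't be much of a bottleneck here because the tasks are mostly IO-bound: each thread spends most of its time waiting on an in-flight request for a response. While one thread is blocked on IO, it can yield to another thread that is parsing HTML or doing other CPU-bound work.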
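To make the IO-bound argument concrete, here is a minimal sketch that needs no network access. The function fake_fetch and the page paths are hypothetical stand-ins: time.sleep plays the role of waiting on a response, and like blocking IO it releases the GIL so the other threads can proceed.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(page):
    """Hypothetical stand-in for a network request; sleeping mimics waiting on IO."""
    time.sleep(0.1)
    return f"fetched {page}"


# Hypothetical page paths, used only for illustration.
pages = [f"/game/{i}" for i in range(8)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    # map() returns results in input order; wrapping it in list() consumes them,
    # so an exception raised inside a worker would be re-raised here.
    results = list(pool.map(fake_fetch, pages))
elapsed = time.perf_counter() - start

print(results[0])
print(f"{len(results)} tasks finished in {elapsed:.2f}s")
```

With eight workers the eight waits overlap, so the total should land much closer to one sleep than to eight; with max_workers=1 the waits would run back to back.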

Here's the code:

import os
import re
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

def download_pgn(task):
    session, host, page, destination_path = task
    response = session.get(host + page)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "lxml")
    game_url = host + soup.find("a", text="download").get("href")
    filename = re.search(r"\w+\.pgn", game_url).group()
    path = os.path.join(destination_path, filename)
    response = session.get(game_url, stream=True)
    response.raise_for_status()

    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)

if __name__ == "__main__":
    host = "http://www.chessgames.com"
    url_to_scrape = host + "/perl/chesscollection?cid=1014492"
    destination_path = "pgns"
    max_workers = 8

    if not os.path.exists(destination_path):
        os.makedirs(destination_path)

    with requests.Session() as session:
        response = session.get(url_to_scrape)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")
        pages = soup.find_all("a", href=re.compile(r".*chessgame\?.*"))
        tasks = [
            (session, host, page.get("href"), destination_path) 
            for page in pages
        ]

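        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # Consume the iterator so that exceptions raised in workers surface here
            # instead of being silently dropped.
            list(pool.map(download_pgn, tasks))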

I used response.iter_content here, which is unnecessary for such tiny text files, but it generalizes the code so that larger files are handled in a memory-friendly way.
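The chunked-write pattern can be tried without a network connection. In this sketch, save_chunks is a hypothetical helper, and a plain list of byte strings stands in for what response.iter_content(chunk_size=1024) would yield; the file name and PGN text are made up for illustration.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the number of bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks, as the download code above does
                written += f.write(chunk)
    return written


# Stand-in for response.iter_content(chunk_size=1024); the content is invented.
fake_chunks = [b'[Event "Example"]\n', b"", b"1. e4 e5 *\n"]

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "example.pgn")
    total = save_chunks(fake_chunks, target)
    with open(target, "rb") as f:
        saved = f.read()

print(total, "bytes written")
```

Only one chunk is held in memory at a time, which is what keeps the approach cheap for large files.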

Results of a rough benchmark (the first request is the bottleneck):

max workers	session?	seconds
1	no	126
1	yes	111
8	no	24
8	yes	22
32	yes	16

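【Discussion】:

【Solution 2】:

There is no short answer to your question. I'll show you a complete solution and comment on the code.

First, import the necessary modules: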

from bs4 import BeautifulSoup
import requests
import re

Next, fetch the index page and create a BeautifulSoup object:

req = requests.get("http://www.chessgames.com/perl/chesscollection?cid=1014492")
soup = BeautifulSoup(req.text, "lxml")

I strongly recommend using the lxml parser rather than the common html.parser. After that, prepare the list of links to the games:

pages = soup.find_all('a', href=re.compile(r'.*chessgame\?.*'))
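As a quick check of what that pattern keeps, here is a small sketch using only the re module. The sample hrefs are assumptions made up for illustration (apart from the collection URL from the question). Beautiful Soup applies a compiled pattern with search(), so the shorter pattern chessgame\? should select the same links as the version wrapped in .* on both sides.

```python
import re

# Same filter as above, without the redundant leading and trailing ".*".
game_link = re.compile(r"chessgame\?")

# Sample href values; all but the collection link are hypothetical.
hrefs = [
    "/perl/chessgame?gid=1000001",
    "/perl/chesscollection?cid=1014492",
    "/perl/chessplayer?pid=1",
    None,  # an <a> tag without an href attribute
]

matches = [h for h in hrefs if h and game_link.search(h)]
print(matches)  # ['/perl/chessgame?gid=1000001']
```

The escaped question mark matters: without the backslash, ? would make the final "e" optional instead of matching a literal "?".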

You can do this by searching for links that contain the word "chessgame". Now, prepare a function that will download the files for you:

def download_file(url):
    path = url.split('/')[-1].split('?')[0]
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        with open(path, 'wb') as f:
            for chunk in r:
                f.write(chunk)

The final bit of magic is to repeat all the previous steps for each game page, preparing the links for the file downloader:

host = 'http://www.chessgames.com'
for page in pages:
    url = host + page.get('href')
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "lxml")
    file_link = soup.find('a',text=re.compile('.*download.*'))
    file_url = host + file_link.get('href')
    download_file(file_url)

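(First search for the link whose text contains "download", then build the full URL by concatenating the host name and the path, and finally download the file.)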
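The URL-building and file-naming steps can also be done with the standard library instead of string concatenation and split. This is only a sketch of an alternative, not what the code above does; the href value is a hypothetical example of what a download link might look like.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

host = "http://www.chessgames.com"
# Hypothetical href of a download link, for illustration only.
href = "/pgn/example_game.pgn?gid=1000001"

# urljoin resolves the relative link against the host.
file_url = urljoin(host, href)

# urlsplit separates the path from the query string, so the file name
# does not pick up the "?gid=..." part.
filename = posixpath.basename(urlsplit(file_url).path)

print(file_url)  # http://www.chessgames.com/pgn/example_game.pgn?gid=1000001
print(filename)  # example_game.pgn
```

urljoin also copes with links that are already absolute, which plain concatenation would turn into a malformed URL.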

I hope you can use this code without any corrections!

【Discussion】:

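By the way, why should I use requests instead of urllib? Can you do the same thing with urllib?
Of course, you can use urllib as you did in the question. But using requests is good practice. You can find more information here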
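For readers wondering how urlretrieve from the question behaves, here is a minimal sketch that runs offline. It assumes a file:// URL pointing at a temporary file as a stand-in for a real download link; the file names and PGN text are invented for the example.

```python
import pathlib
import tempfile
from urllib.request import urlretrieve

with tempfile.TemporaryDirectory() as tmp:
    # A local file served through a file:// URL stands in for a remote PGN.
    source = pathlib.Path(tmp) / "source.pgn"
    source.write_bytes(b"1. e4 e5 *\n")
    target = pathlib.Path(tmp) / "copy.pgn"

    # urlretrieve(url, filename) fetches the URL, writes the body to filename,
    # and returns a (filename, headers) tuple.
    saved_path, headers = urlretrieve(source.as_uri(), str(target))
    saved_bytes = target.read_bytes()

print(saved_bytes)
```

With an http:// URL the call looks the same, so the urllib route from the question can work; requests mainly adds conveniences such as sessions and raise_for_status().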
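What do these two lines of code do? req = requests.get(url) soup = BeautifulSoup(req.text, "lxml") I understand that the first line makes a request, but I don't quite understand what that means. What does the .text method do in the BS constructor? I know it gets all the child strings, but again I am not sure what that means in this context.
req = requests.get(url) gives you an object of type requests.models.Response. It is a class that holds the HTTP response itself along with attributes and methods for working with it. One of those attributes is text, which gives you the raw HTML of the page that was received. You pass this HTML as an argument to the BeautifulSoup constructor (soup = BeautifulSoup(req.text, "lxml")) to get a BeautifulSoup object, a class that lets you search for particular tags and more.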
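To illustrate the last point without making a request, this sketch feeds a hand-written HTML string to BeautifulSoup in place of req.text. The markup and link targets are assumptions made up for the example, and html.parser is used only so that the snippet does not depend on lxml being installed.

```python
import re

from bs4 import BeautifulSoup

# Hand-written HTML standing in for req.text; the links are hypothetical.
html = """
<html><body>
  <a href="/perl/chessgame?gid=1000001">Example game</a>
  <a href="/pgn/example_game.pgn?gid=1000001">download PGN</a>
</body></html>
"""

# BeautifulSoup takes the HTML as a plain string and builds a searchable tree.
soup = BeautifulSoup(html, "html.parser")

# Find the first <a> whose text contains "download" and read its href attribute.
link = soup.find("a", string=re.compile("download"))
href = link.get("href")

all_hrefs = [a.get("href") for a in soup.find_all("a")]
print(href)
print(all_hrefs)
```

So req.text is just a string; all of the searching is done by the BeautifulSoup object built from it.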