【Question Title】: Python Multi Threading using Requests and BeautifulSoup
【Posted】: 2026-01-09 05:10:01
【Question】:

I'm writing a web scraper. I could have just used Scrapy, but I decided to write it from scratch so I could practice.

I've built a scraper that runs successfully using requests and BeautifulSoup. It walks through about 135 pages with 12 items each, grabs the links, and then fetches the information from each link target. At the end it writes everything to a CSV file. It only scrapes strings; it doesn't download images or anything like that... for now.

The problem? It's slow. Scraping everything from a single page's contents takes about 5 seconds, so 135 pages come to roughly 11 minutes.

So my question is: how can I implement threading in my code so it fetches the data faster?

Here's the code:

import requests
from bs4 import BeautifulSoup
import re
import csv


def get_actor_dict_from_html(url, html):
    soup = BeautifulSoup(html, "html.parser")

    #There must be a better way to handle this, but let's assign a NULL value to all upcoming variables.
    profileName = profileImage = profileHeight = profileWeight = 'NULL'

    #Let's get the name and image..
    profileName = soup.find('h1').get_text().strip()
    profileImage = "http://images.host.com/actors/" + re.findall(r'\d+', url)[0] + "/actor-large.jpg"

    #Now the rest of the stuff..
    try:
        profileHeight = soup.find('a', {"title": "Height"}).get_text()
    except AttributeError:  # find() returned None: the page has no height entry
        pass
    try:
        profileWeight = soup.find('a', {"title": "Weight"}).get_text()
    except AttributeError:
        pass

    return {
        'Name': profileName,
        'ImageUrl': profileImage,
        'Height': profileHeight,
        'Weight': profileWeight,
        }


def lotta_downloads():
    output = open("/tmp/export.csv", 'w', newline='')
    wr = csv.DictWriter(output, ['Name','ImageUrl','Height','Weight'], delimiter=',')
    wr.writeheader()

    for i in range(135):
        url = "http://www.host.com/actors/all-actors/name/{}/".format(i)
        response = requests.get(url)
        html = response.content
        soup = BeautifulSoup(html, "html.parser")
        links = soup.find_all("div", { "class" : "card-image" })

        for a in links:
            for url in a.find_all('a'):
                url = "http://www.host.com" + url['href']
                print(url)
                response = requests.get(url)
                html = response.content
                actor_dict = get_actor_dict_from_html(url, html)
                wr.writerow(actor_dict)
    output.close()
    print('All Done!')

if __name__ == "__main__":
    lotta_downloads()
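Since the scraper spends almost all of its time waiting on the network, the standard library's `concurrent.futures.ThreadPoolExecutor` is one straightforward answer to the threading question. The sketch below is illustrative, not the asker's exact code: `card_links`, `fetch`, and `scrape_all` are hypothetical names, while the URL pattern and CSS class come from the question.

```python
# Hypothetical thread-pool rewrite; card_links/fetch/scrape_all are
# illustrative names, the URL pattern is taken from the question above.
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

LISTING_URL = "http://www.host.com/actors/all-actors/name/{}/"

def card_links(html, base="http://www.host.com"):
    """Extract detail-page URLs from one listing page."""
    soup = BeautifulSoup(html, "html.parser")
    return [base + a["href"]
            for div in soup.find_all("div", {"class": "card-image"})
            for a in div.find_all("a")]

def fetch(url):
    return requests.get(url, timeout=10).content

def scrape_all(fetch=fetch, pages=135, workers=20):
    """Fetch all listing pages, then all detail pages, in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        listings = pool.map(fetch, (LISTING_URL.format(i) for i in range(pages)))
        urls = [u for html in listings for u in card_links(html)]
        # Each (url, html) pair can then be fed to get_actor_dict_from_html
        return list(zip(urls, pool.map(fetch, urls)))
```

Because each worker is blocked on I/O rather than computing, the GIL is not a bottleneck here; `workers=20` is only a guess and should be tuned against what the server tolerates.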

Thanks!

【Question Comments】:

  • Usually you're better off not reinventing the wheel and using a framework like Scrapy instead.

标签: python multithreading web-scraping beautifulsoup python-requests


【Solution 1】:

Why don't you try the gevent library?

gevent's monkey patching turns blocking functions into non-blocking ones.

The wait time across that many requests is probably what makes it so slow.

So I think making the requests non-blocking will speed up your program.

Example on Python 2.7.10:

import csv
import gevent
from gevent import monkey; monkey.patch_all()  # patch before importing requests
import requests
from bs4 import BeautifulSoup

actor_dict_list = []

def worker(url):
    content = requests.get(url).content
    soup = BeautifulSoup(content, "html.parser")
    links = soup.find_all('div', {'class': 'card-image'})

    for a in links:
        for link in a.find_all('a'):
            url = "http://www.host.com" + link['href']
            response = requests.get(url)  # You can also use gevent's spawn function on this line
            html = response.content
            # Collect rows here and write them after joinall(),
            # to prevent a race condition on the CSV writer
            actor_dict_list.append(get_actor_dict_from_html(url, html))

output = open("/tmp/export.csv", "w", newline='')
wr = csv.DictWriter(output, ['Name', 'ImageUrl', 'Height', 'Weight'], delimiter=',')
wr.writeheader()

urls = ["http://www.host.com/actors/all-actors/name/{}/".format(i) for i in range(135)]
jobs = [gevent.spawn(worker, url) for url in urls]
gevent.joinall(jobs)
for actor_dict in actor_dict_list:
    wr.writerow(actor_dict)
output.close()
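If you also spawn per detail URL as the comment in the code suggests, it is worth capping concurrency so the site isn't hit with hundreds of sockets at once. A minimal sketch, assuming gevent is installed; `fetch` and `fetch_all` are hypothetical helper names:

```python
# Bounded concurrency with gevent.pool.Pool; fetch/fetch_all are hypothetical
# names. monkey.patch_all() must run before requests is imported.
from gevent import monkey; monkey.patch_all()
from gevent.pool import Pool
import requests

def fetch(url):
    return requests.get(url, timeout=10).content

def fetch_all(urls, size=20, fetch=fetch):
    # Pool(size) caps how many greenlets run at once, unlike a bare
    # gevent.spawn per URL; map() returns results in input order.
    pool = Pool(size)
    return pool.map(fetch, urls)
```

The same `Pool` can serve both the listing pages and the detail pages, since greenlets are cheap to create and the pool only limits how many are active.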

gevent documentation: doc

P.S.

If you're on Ubuntu, you have to install python-gevent:

sudo apt-get install python-gevent

【Discussion】:

  • I can't figure out where in my code I can put this. Does it have to be a function inside a function? (lotta_downloads)?
  • Oh, sorry. My mistake. Look at the code again. I've fixed it.