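【Question title】: Starting a new thread for each page?
【Posted】: 2019-12-18 16:54:55
【Question】:

I am trying to start a new thread for each page, but as written, each new thread only starts after the previous thread/function has finished. Can anyone help me run them independently? Example: Thread 1: open page 1. Thread 2: open page 2.

And do this for X number of pages. I am a Python beginner, so please excuse my messy code.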

import random
import string
import threading
from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait


# driver.find_element_by_css_selector("a[onclick*='if (!window.__cfRLUnblockHandlers) return false; bail()']")


def randomStringDigits(stringLength=6):
    """Generate a random string of letters and digits """
    lettersAndDigits = string.ascii_letters + string.digits
    return ''.join(random.choice(lettersAndDigits) for i in range(stringLength))


def startscrape(url):
    driver = webdriver.Chrome(executable_path='chromedriver.exe')
    driver.get("urlhere")
    cookies_list = driver.get_cookies()
    cookies_dict = {}  # create dictionary
    usrelem = driver.find_element_by_name("login")
    usrelem.send_keys("user")
    pwdelem = driver.find_element_by_name("password")
    pwdelem.send_keys("pass")
    pwdelem.send_keys(Keys.RETURN)
    sleep(1)
    driver.get(url)
    wait = WebDriverWait(driver, 10)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    xx = soup.find("input",
                   {"class": "input input--number js-numberBoxTextInput input input--numberNarrow js-pageJumpPage"})
    driver.get(page)
    wait = WebDriverWait(driver, 10)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    xxx = soup.findAll("a", {"class": "js-lbImage"})
    # find all thumbs
    for link in xxx:
        xxx = soup.find("a", {"href": link.get('href')})
        dlfullimg = driver.find_element_by_xpath("//a[@href='" + xxx.get('href') + "']")
        wait = WebDriverWait(driver, 10)
        dlfullimg.click()
        thumbs = soup.findAll("div", {"class": "lg-thumb-item"})
        dlfullimg = driver.find_element_by_id('lg-download').click()
        close = driver.find_element_by_xpath("//span[@class='lg-close lg-icon']").click()
        sleep(1)
    assert "No results found." not in driver.page_source


url = input("Main URL: ")
driver = webdriver.Chrome(executable_path='chromedriver.exe')
driver.get("urlhere")
cookies_list = driver.get_cookies()
cookies_dict = {}  # create dictionary
usrelem = driver.find_element_by_name("login")
usrelem.send_keys("user")
pwdelem = driver.find_element_by_name("password")
pwdelem.send_keys("pass")
pwdelem.send_keys(Keys.RETURN)
sleep(1)
driver.get(url)
wait = WebDriverWait(driver, 10)
soup = BeautifulSoup(driver.page_source, 'html.parser')
# Find page number with soup.find
xx = soup.find("input",
               {"class": "input input--number js-numberBoxTextInput input input--numberNarrow js-pageJumpPage"})
driver.close()

threads = []
for i in range(int(xx.get('max'))):
    page = url + "page-" + str(i + 1)
    t = threading.Thread(target=startscrape(url), args=[])
    threads.append(t)
for t in threads:
    t.start()
for t in threads:
    t.join()

【Comments】:

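  • Good start! A few things to keep in mind: you want to limit the number of threads you create, or this could become a mess. Integrate a queue, push all the URLs to it, and pop them as needed, because Selenium takes a lot of resources. Also, why do you have a separate thread to start the program (the last two lines)? Finally, `target` must be the function itself and `args` expects a tuple, so add a comma after page: target=startthread, args=(page,)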
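  • Thanks for your reply. I just realized that I should not be starting a thread in the last two lines. I will look into args.
  • Np. Also consider moving "driver = webdriver.Chrome()" inside the startthread function so that every thread starts on its own driver; otherwise they will all open their URLs in the same driver.
  • Variable and function names should generally follow the lower_case_with_underscores style. Consistency is the top priority, and I see at least three different naming conventions in your code. In any case, can you be more specific?
  • The threading documentation is a good place to start, isn't it?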
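
The point about `target` and `args` raised in the comments can be sketched without Selenium. In the minimal example below, `fake_scrape` is a hypothetical stand-in for `startscrape`, and the base URL is made up for illustration. Writing `target=fake_scrape(page)` would call the function right away in the main thread and hand its return value (`None`) to `Thread`, which is likely why the pages in the question run one after another. Passing the function object and an `args` tuple lets each thread make the call itself.

```python
import threading
import time

results = []
results_lock = threading.Lock()


def fake_scrape(page):
    """Hypothetical stand-in for startscrape: pretend to load one page."""
    time.sleep(0.2)  # simulates waiting on the browser
    with results_lock:
        results.append(page)


base_url = "https://example.com/gallery/"  # assumed URL, for illustration only
pages = [base_url + "page-" + str(i + 1) for i in range(5)]

start = time.perf_counter()
# Pass the function itself and its arguments separately; the trailing comma
# makes args a one-element tuple.
threads = [threading.Thread(target=fake_scrape, args=(page,)) for page in pages]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

print(f"scraped {len(results)} pages in {elapsed:.2f}s")
```

Because the five simulated page loads overlap, the elapsed time should be close to one sleep interval rather than five. With real Selenium drivers, each thread would also need its own `webdriver.Chrome()` instance, as suggested in the comments.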

Tags: python multithreading


【Solution 1】:

You can use concurrent.futures to handle the heavy lifting for you.

Here is some pseudocode:

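import concurrent.futures
from selenium import webdriver


def process_url(url):
    # Each call creates its own browser, so worker threads never share a driver.
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # process page
    finally:
        driver.quit()  # must be called; a bare `driver.close` does nothing


# Find number of pages here
url = input("Main URL: ")
driver = webdriver.Chrome()
try:
    driver.get(url)
    # urls = find list of urls
    urls = []  # placeholder: fill with the page URLs found above
finally:
    driver.quit()

threads_count = 10
with concurrent.futures.ThreadPoolExecutor(threads_count) as executor:
    # Consume the iterator so exceptions raised in workers surface here.
    list(executor.map(process_url, urls))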
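
As a rough illustration of how the pool limits concurrency and collects results, here is a minimal sketch that runs without a browser. `process_page` is a hypothetical stand-in for `process_url`, and the URLs are assumed values, not taken from the question. It uses `submit` with `as_completed` instead of `map` so that one failing page does not hide the results of the others.

```python
import concurrent.futures
import time


def process_page(url):
    """Hypothetical stand-in for process_url: returns a fake result."""
    time.sleep(0.1)  # simulates page load time
    return url, len(url)


# Assumed page URLs, for illustration only.
urls = [f"https://example.com/gallery/page-{n}" for n in range(1, 9)]

results = {}
# max_workers caps how many pages are processed at once; with Selenium this
# would also cap the number of browsers open at the same time.
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    future_to_url = {executor.submit(process_page, u): u for u in urls}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            _, size = future.result()
            results[url] = size
        except Exception as exc:  # a failed page should not stop the others
            results[url] = exc

for url in urls:
    print(url, results[url])
```

A small `max_workers` value is a reasonable starting point here, since every real worker would hold a full Chrome instance.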

【Comments】:
