Posted: 2019-12-18 16:54:55
Problem description:
I'm trying to start a new thread for each page, but each new thread only starts after the previous thread/function has finished. Can anyone help me run them independently? Example: thread 1 opens page 1, thread 2 opens page 2,
and so on for X pages. I'm a Python beginner, so please excuse my messy code.
import random
import string
import threading
from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait

# driver.find_element_by_css_selector("a[onclick*='if (!window.__cfRLUnblockHandlers) return false; bail()']")

def randomStringDigits(stringLength=6):
    """Generate a random string of letters and digits """
    lettersAndDigits = string.ascii_letters + string.digits
    return ''.join(random.choice(lettersAndDigits) for i in range(stringLength))

def startscrape(url):
    driver = webdriver.Chrome(executable_path='chromedriver.exe')
    driver.get("urlhere")
    cookies_list = driver.get_cookies()
    cookies_dict = {}  # create dictionary
    usrelem = driver.find_element_by_name("login")
    usrelem.send_keys("user")
    pwdelem = driver.find_element_by_name("password")
    pwdelem.send_keys("pass")
    pwdelem.send_keys(Keys.RETURN)
    sleep(1)
    driver.get(url)
    wait = WebDriverWait(driver, 10)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    xx = soup.find("input",
                   {"class": "input input--number js-numberBoxTextInput input input--numberNarrow js-pageJumpPage"})
    driver.get(page)
    wait = WebDriverWait(driver, 10)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    xxx = soup.findAll("a", {"class": "js-lbImage"})
    # find all thumbs
    for link in xxx:
        xxx = soup.find("a", {"href": link.get('href')})
        dlfullimg = driver.find_element_by_xpath("//a[@href='" + xxx.get('href') + "']")
        wait = WebDriverWait(driver, 10)
        dlfullimg.click()
        thumbs = soup.findAll("div", {"class": "lg-thumb-item"})
        dlfullimg = driver.find_element_by_id('lg-download').click()
        close = driver.find_element_by_xpath("//span[@class='lg-close lg-icon']").click()
        sleep(1)
    assert "No results found." not in driver.page_source

url = input("Main URL: ")
driver = webdriver.Chrome(executable_path='chromedriver.exe')
driver.get("urlhere")
cookies_list = driver.get_cookies()
cookies_dict = {}  # create dictionary
usrelem = driver.find_element_by_name("login")
usrelem.send_keys("user")
pwdelem = driver.find_element_by_name("password")
pwdelem.send_keys("pass")
pwdelem.send_keys(Keys.RETURN)
sleep(1)
driver.get(url)
wait = WebDriverWait(driver, 10)
soup = BeautifulSoup(driver.page_source, 'html.parser')
# Find page number with soup.find
xx = soup.find("input",
               {"class": "input input--number js-numberBoxTextInput input input--numberNarrow js-pageJumpPage"})
driver.close()
threads = []
for i in range(int(xx.get('max'))):
    page = url + "page-" + str(i + 1)
    t = threading.Thread(target=startscrape(url), args=[])
    threads.append(t)
for t in threads:
    t.start()
for t in threads:
    t.join()
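Note that `threading.Thread(target=startscrape(url), args=[])` *calls* `startscrape(url)` right there in the loop and hands `Thread` its return value (`None`), which is why each page only runs after the previous one finishes. A minimal sketch of the fix, using a placeholder work function in place of the Selenium scraper (the base URL and page count here are made up for illustration):

```python
import threading

def startscrape(page_url, results, lock):
    """Placeholder for the real scraper: just records which page it handled."""
    with lock:
        results.append(page_url)

url = "http://example.com/threads/"  # hypothetical base URL
num_pages = 5                        # would come from int(xx.get('max'))

results = []
lock = threading.Lock()
threads = []
for i in range(num_pages):
    page = url + "page-" + str(i + 1)
    # Pass the callable itself plus an args tuple -- do NOT call it here.
    t = threading.Thread(target=startscrape, args=(page, results, lock))
    threads.append(t)

for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the joins, all five pages have been handled by five independently running threads, in whatever order the scheduler chose.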
Comments:
- Good start! A few things to keep in mind: you want to limit the number of threads you create or this can get messy — integrate a queue, push all the URLs onto it, and pop them as needed. Selenium is resource-hungry, which is another reason to limit threads. Also, 'target' takes an args tuple, which is why you have a separate thread starting the program (last two lines) — you should add a comma after page: target=startthread, args=(page,)
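The queue idea from the comment above can be sketched as follows — a fixed pool of worker threads popping URLs off a `queue.Queue`, with a hypothetical `scrape_page` function standing in for the Selenium work:

```python
import queue
import threading

NUM_WORKERS = 3  # cap on simultaneous threads (and thus Selenium drivers)

def scrape_page(page_url, results, lock):
    """Stand-in for the real Selenium scraping of one page."""
    with lock:
        results.append(page_url)

def worker(q, results, lock):
    while True:
        page_url = q.get()
        if page_url is None:      # sentinel: no more work for this worker
            q.task_done()
            break
        scrape_page(page_url, results, lock)
        q.task_done()

q = queue.Queue()
results = []
lock = threading.Lock()

workers = [threading.Thread(target=worker, args=(q, results, lock))
           for _ in range(NUM_WORKERS)]
for w in workers:
    w.start()

for i in range(10):   # push all page URLs
    q.put("page-" + str(i + 1))
for _ in workers:     # one sentinel per worker so each one exits
    q.put(None)

q.join()              # block until every queued item is marked done
for w in workers:
    w.join()
```

This way only `NUM_WORKERS` pages are ever in flight at once, no matter how many URLs are queued.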
- Thanks for the reply — I just realized I shouldn't be starting the threads in those last two lines. I'll look into args.
- Np, also consider moving "driver = webdriver.Chrome()" inside the startthread function so each thread starts its own driver — otherwise they will all open their URLs in the same driver.
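The "one driver per thread" point can be illustrated structurally — here `FakeDriver` is an invented stand-in for `webdriver.Chrome`, since the shape of the fix (create the driver inside the function, quit it in `finally`) is what matters:

```python
import threading

class FakeDriver:
    """Stand-in for webdriver.Chrome() -- one independent browser per thread."""
    def __init__(self):
        self.visited = []
    def get(self, url):
        self.visited.append(url)
    def quit(self):
        pass

drivers_seen = []
lock = threading.Lock()

def startscrape(page_url):
    driver = FakeDriver()  # created INSIDE the function: each thread gets its own
    try:
        driver.get(page_url)
        with lock:
            drivers_seen.append(driver)
    finally:
        driver.quit()      # always release the browser, even on error

threads = [threading.Thread(target=startscrape, args=("page-" + str(i + 1),))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(set(id(d) for d in drivers_seen)))  # 4 distinct driver objects
```

With the real `webdriver.Chrome`, the same structure means four separate browser processes, each navigating its own page.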
- Variable and function names should generally follow the lower_case_with_underscores style. Consistency is the top priority, and I can see at least 3 different naming conventions in your code. Anyway, can you be more specific?
- The threading docs are a decent place to start, aren't they?
Tags: python multithreading