Scraping a dynamic web page with Selenium (answers)

【Question title】: Scraping dynamic web page with selenium
【Posted】: 2020-12-24 21:39:10
【Question】:

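I am trying to get the links to the posts on this page, but they are apparently generated only when each post's image is clicked. I am using Selenium and beautifulsoup4 on Python 3.8. Any idea how to collect the links while Selenium moves on to the following pages?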

URL: https://www.goplaceit.com/cl/mapa?id_modalidad=1&tipo_propiedad=1%2C2&selectedTool=list#12/-33.45/-70.66667

Clicking an image opens a new tab with a shortened URL of this form: https://www.goplaceit.com/propiedad/6198212

which then redirects to a URL of this form:

https://www.goplaceit.com/cl/propiedad/venta/departamento/santiago/6198212-depto-con-1d-1b-y-terraza-a-pasos-del-metro-toesca-bodega
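
Both URL forms above carry the same numeric listing id (6198212), so one can be mapped to the other without a browser. The following is a minimal sketch, assuming the id is always the leading number of the last path segment; that pattern is inferred only from the two example URLs, and the helper names `property_id` and `short_url` are hypothetical.

```python
import re
from typing import Optional

# Assumed pattern, inferred only from the two example URLs in the question:
# the last path segment is either "<digits>" or "<digits>-<slug>".
_ID_PATTERN = re.compile(r"/(\d+)(?:-[^/]*)?/?$")


def property_id(url: str) -> Optional[str]:
    """Return the numeric listing id from a short or long listing URL, or None."""
    # Drop the query string and fragment so only the path is inspected.
    path = url.split("?", 1)[0].split("#", 1)[0]
    match = _ID_PATTERN.search(path)
    return match.group(1) if match else None


def short_url(url: str) -> Optional[str]:
    """Build the shortened listing URL from any URL that carries a listing id."""
    listing_id = property_id(url)
    if listing_id is None:
        return None
    return f"https://www.goplaceit.com/propiedad/{listing_id}"
```

Keying collected links on the id rather than on the full URL would also make it easy to skip listings that appear on more than one results page.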

My code:

from bs4 import BeautifulSoup
from selenium import webdriver
import time
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import winsound
from timeit import default_timer as timer
from selenium.webdriver.common.keys import Keys
start = timer()

PROXY = "PROXY" # IP:PORT or HOST:PORT
path_to_extension = r"extension"
options = Options()
#options.add_argument("--incognito")
options.add_argument('load-extension=' + path_to_extension)
#options.add_argument('--disable-java')
options.headless = False
prefs = {"profile.default_content_setting_values.notifications" : 2}
prefs2 = {"profile.managed_default_content_settings.images": 2}
prefs.update(prefs2)
prefs3 = {"profile.default_content_settings.cookies": 2}
prefs.update(prefs3)
options.add_experimental_option("prefs",prefs)
options.add_argument("--start-maximized")
options.add_argument('--proxy-server=%s' % PROXY)
driver = webdriver.Chrome('chromedriver.exe', options=options)
driver.get('https://www.goplaceit.com/cl/')
WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="root"]/nav/div/div[2]/div[1]/button'))).click()
correo = driver.find_element(By.XPATH, '//*[@id="email"]')
correo.send_keys("Mail")
contraseña = driver.find_element(By.XPATH, '//*[@id="password"]')
contraseña.send_keys("password")
contraseña.send_keys(Keys.ENTER)
time.sleep(7)


# Locate the search box on the landing page and type the search terms.
elem = driver.find_element(By.XPATH, '//*[@id="gpi-main-landing-search-input"]/div/input')
elem.click()
elem.send_keys("keywords")
# Pick the first autocomplete suggestion, then submit the search.
WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="gpi-main-landing-search-input"]/div/div[1]/ul/li[1]/a/div/div[1]'))).click()
elem.send_keys(Keys.ENTER)
WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="root"]/div/div/div[1]/div[2]/div/div[1]/div/div[1]/button'))).click()
WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="custom-checkbox"]'))).click()

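page_number = 0
max_page_number = 30
while page_number<=max_page_number:
    # Click the "next page" button once it becomes clickable.
    WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, '//button[contains(text(),"paginator-btn-right")]'))).click()
    page_number += 1  # advance the counter so the loop terminates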

【Question comments】:

Tags: python-3.x selenium-webdriver web-scraping

【Solution 1】:

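You can get the URLs by clicking each image, saving the URL of the tab it opens, switching back to the first tab, and repeating this for every image: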

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# `driver` is the Chrome session already created in the question's code.
driver.get("https://www.goplaceit.com/cl/mapa?id_modalidad=1&tipo_propiedad=1%2C2&selectedTool=list#8/-33.958/-71.206")
images = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='sc-iyvyFf ljSqTz']//img")))
urls = []
window_before = driver.window_handles[0]  # the tab showing the results list
for i, image in enumerate(images):
    handles_before = driver.window_handles
    image.click()  # opens the listing in a new tab
    # Wait for the new tab explicitly; implicitly_wait() only sets the
    # timeout for element lookups and does not pause the script.
    WebDriverWait(driver, 10).until(EC.new_window_is_opened(handles_before))
    window_after = driver.window_handles[i+1]
    driver.switch_to.window(window_after)
    urls.append(driver.current_url)
    driver.switch_to.window(window_before)
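
The loop above picks the new tab by position (`window_handles[i+1]`), which assumes that handles are listed in opening order and that no other tab was open beforehand. A small sketch of an alternative is to compare the handle lists taken before and after the click; `new_handles` is a hypothetical helper, not part of Selenium, and the usage shown in the comments assumes the `driver` and `image` objects from the answer.

```python
from typing import Iterable, List


def new_handles(before: Iterable[str], after: Iterable[str]) -> List[str]:
    """Return the handles present in `after` but not in `before`, in `after` order."""
    known = set(before)
    return [handle for handle in after if handle not in known]


# Hypothetical usage inside the answer's loop:
#   before = driver.window_handles
#   image.click()
#   WebDriverWait(driver, 10).until(EC.new_window_is_opened(before))
#   (opened,) = new_handles(before, driver.window_handles)
#   driver.switch_to.window(opened)
```

Unpacking into `(opened,)` fails loudly if the click opened zero tabs or more than one, which is usually preferable to silently reading the URL of the wrong tab.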

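【Comments】:

Thanks, but it doesn't work. I changed the dynamic XPath to a fixed one: images = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="gpi-property-list-container"]//img'))), but it only gives me the links to the images; what I need is the link of the page that opens when each image is clicked.
Oh, sorry, that wasn't clear! I have just edited the answer for your case; you will now get the URLs of the pages opened by clicking the images.
I found the same thing by searching, but I appreciate your answer!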
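
The question also asks how to keep collecting links while Selenium moves from one results page to the next, which the answer's single-page loop does not cover. The sketch below shows one possible shape for that outer loop. It takes the Selenium-specific steps as two hypothetical callables (`read_links` and `go_to_next_page`, neither of which appears in the thread) so that the paging and de-duplication logic can be checked without a browser.

```python
from typing import Callable, Iterable, List


def collect_across_pages(
    read_links: Callable[[], Iterable[str]],
    go_to_next_page: Callable[[], bool],
    max_pages: int,
) -> List[str]:
    """Collect unique links from up to `max_pages` pages, in first-seen order.

    `read_links` returns the links found on the current page, for example by
    running the click-and-switch loop from the answer. `go_to_next_page`
    clicks the paginator and returns False when there is no further page.
    """
    collected: List[str] = []
    seen = set()
    for page in range(max_pages):
        for link in read_links():
            if link not in seen:
                seen.add(link)
                collected.append(link)
        # Do not click "next" after the final page we intend to read.
        if page == max_pages - 1 or not go_to_next_page():
            break
    return collected
```

In the real script, `go_to_next_page` would wrap the paginator click from the question and return False when the wait for the button times out. The list of image elements would also have to be located again on every page, since elements found on an earlier page may no longer be valid.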