【Question title】: Scraping a dynamic web page with Selenium
【Posted】: 2020-12-24 21:39:10
【Question】:

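I'm trying to get the links to the posts on this page, but they are apparently only generated when each post's image is clicked. I'm using Selenium and beautifulsoup4 with Python 3.8. Any idea how to collect the links while Selenium moves on to the next page?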

URL: https://www.goplaceit.com/cl/mapa?id_modalidad=1&tipo_propiedad=1%2C2&selectedTool=list#12/-33.45/-70.66667

Clicking an image opens a new tab with a shortened URL of this form: https://www.goplaceit.com/propiedad/6198212

which then redirects me to a URL of this form:

https://www.goplaceit.com/cl/propiedad/venta/departamento/santiago/6198212-depto-con-1d-1b-y-terraza-a-pasos-del-metro-toesca-bodega
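Both URL forms above carry the same numeric ID (6198212), so once either one has been captured, the ID can be pulled out and the short form rebuilt. The following is a minimal sketch of that idea, not something from the original post: the helper names `property_id` and `short_url` are hypothetical, and it assumes every listing follows the two URL shapes shown here.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Assumption: the ID is the first all-digit run that starts a path segment
# somewhere after "/propiedad/", as in the two example URLs from the question.
_ID_PATTERN = re.compile(r"/propiedad/(?:[^/]+/)*?(\d+)(?=[-/]|$)")


def property_id(url: str) -> Optional[str]:
    """Return the numeric listing ID from a short or long listing URL, or None."""
    match = _ID_PATTERN.search(urlparse(url).path)
    return match.group(1) if match else None


def short_url(listing_id: str) -> str:
    """Rebuild the short form seen in the new tab (assumed pattern)."""
    return f"https://www.goplaceit.com/propiedad/{listing_id}"
```

Keying collected links on this ID also makes it easy to skip listings that appear on more than one results page.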

My code:

from bs4 import BeautifulSoup
from selenium import webdriver
import time
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import winsound
from timeit import default_timer as timer
from selenium.webdriver.common.keys import Keys
start = timer()

PROXY = "PROXY" # IP:PORT or HOST:PORT
path_to_extension = r"extension"
options = Options()
#options.add_argument("--incognito")
options.add_argument('load-extension=' + path_to_extension)
#options.add_argument('--disable-java')
options.headless = False
prefs = {"profile.default_content_setting_values.notifications" : 2}
prefs2 = {"profile.managed_default_content_settings.images": 2}
prefs.update(prefs2)
prefs3 = {"profile.default_content_settings.cookies": 2}
prefs.update(prefs3)
options.add_experimental_option("prefs",prefs)
options.add_argument("--start-maximized")
options.add_argument('--proxy-server=%s' % PROXY)
driver = webdriver.Chrome('chromedriver.exe', options=options)
driver.get('https://www.goplaceit.com/cl/')
WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="root"]/nav/div/div[2]/div[1]/button'))).click()
correo = driver.find_element(By.XPATH, '//*[@id="email"]')
correo.send_keys("Mail")
contraseña = driver.find_element(By.XPATH, '//*[@id="password"]')
contraseña.send_keys("password")
contraseña.send_keys(Keys.ENTER)
time.sleep(7)


# Locate the landing-page search box, type the keywords and pick the first suggestion
elem = driver.find_element(By.XPATH, '//*[@id="gpi-main-landing-search-input"]/div/input')
elem.click()
elem.send_keys("keywords")
WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="gpi-main-landing-search-input"]/div/div[1]/ul/li[1]/a/div/div[1]'))).click()
elem.send_keys(Keys.ENTER)
WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="root"]/div/div/div[1]/div[2]/div/div[1]/div/div[1]/button'))).click()
WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="custom-checkbox"]'))).click()

page_number = 0
max_page_number = 30
while page_number<=max_page_number:
    WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, '//button[contains(text(),"paginator-btn-right")]'))).click()
    page_number += 1  # advance the counter so the loop terminates

【Question comments】:

    Tags: python-3.x selenium-webdriver web-scraping


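    【Solution 1】:

    You can get the URLs easily by clicking an image, saving the URL of the page it opens, switching back to the listing page, and repeating this for every image: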

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium import webdriver
    
    # `driver` is assumed to be the logged-in Chrome session created in the question.
    driver.get("https://www.goplaceit.com/cl/mapa?id_modalidad=1&tipo_propiedad=1%2C2&selectedTool=list#8/-33.958/-71.206")
    images = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='sc-iyvyFf ljSqTz']//img")))
    urls = []
    main_window = driver.current_window_handle
    for image in images:
        handles_before = driver.window_handles
        image.click()
        # Wait explicitly for the new tab; implicitly_wait() only sets the element-lookup timeout
        WebDriverWait(driver, 10).until(EC.new_window_is_opened(handles_before))
        new_window = (set(driver.window_handles) - set(handles_before)).pop()
        driver.switch_to.window(new_window)
        urls.append(driver.current_url)
        # Close the listing tab and return to the results page
        driver.close()
        driver.switch_to.window(main_window)
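The question also asks how to keep collecting links as Selenium moves through the result pages. One way to structure that, sketched here under assumptions rather than taken from the thread, is a small driver-agnostic loop that takes two callbacks. Both `collect_page` and `go_to_next_page` are hypothetical names: the first would wrap the click-and-read loop above and return the URLs found on the current page, and the second would click the paginator button and return False once there is no next page (for example when the wait times out).

```python
from typing import Callable, Iterable, List


def collect_across_pages(
    collect_page: Callable[[], Iterable[str]],
    go_to_next_page: Callable[[], bool],
    max_pages: int = 30,
) -> List[str]:
    """Collect URLs page by page, keeping first-seen order and dropping duplicates."""
    seen = set()
    urls: List[str] = []
    for _ in range(max_pages):
        # Gather the links for the page that is currently displayed
        for url in collect_page():
            if url not in seen:
                seen.add(url)
                urls.append(url)
        # Stop early when the paginator reports that no further page exists
        if not go_to_next_page():
            break
    return urls
```

Keeping the loop free of Selenium calls means the page limit and de-duplication can be checked with plain functions before wiring in the real browser steps.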
    

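    【Comments】:

    • Thanks, but it doesn't work. I changed the dynamic XPath to a fixed one: images = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="gpi-property-list-container"]//img'))), but it only gives me the links to the images. What I need is the link of the page that opens when each image is clicked.
    • Oh, sorry, that wasn't very clear! I've just edited the answer for your case; you will now get the URLs of the pages opened by clicking the images.
    • I found the same approach by searching, but I appreciate your answer!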