Can't get all xpath elements from dynamic webpage

【Question title】: Can't get all xpath elements from dynamic webpage
【Posted】: 2021-06-28 03:25:49
【Question】:

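First time asking here. I hope someone can help me, this is driving me crazy!

I'm trying to scrape a used-car web page from my country. The data loads as you scroll down, so the first part of the code scrolls down to load the whole page.
I'm trying to get the link of every car published there, which is why I use find_elements_by_xpath in the try-except part.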

The problem is that with every load (scroll down) the cars come in packs of 11, so the same 11 xpaths repeat on every scroll down;

meaning the xpaths from

"//*[@id='w1']/div[1]/div/div[1]/a"

to

"//*[@id='w11']/div[1]/div/div[1]/a"

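For illustration, the eleven repeating XPaths described above could be built in a loop. This is only a sketch: the range 1 to 11 comes from the description, and the commented-out Selenium calls assume the same driver object and Selenium 3 API used in the code of this question.

```python
# Build the XPath for each of the 11 repeating ids, w1 through w11.
xpaths = [f"//*[@id='w{i}']/div[1]/div/div[1]/a" for i in range(1, 12)]

for xpath in xpaths:
    print(xpath)

# With a live driver (assumption: Selenium 3, as in this question), each XPath
# could then be passed to find_elements_by_xpath, which returns a list:
# for xpath in xpaths:
#     for element in driver.find_elements_by_xpath(xpath):
#         links.append(element.get_attribute('href'))
```
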
All libraries are called at the start of the code, don't worry.

from selenium import webdriver
from bs4 import BeautifulSoup
import time

links = []

url = ('https://buy.olxautos.cl/buscar?VehiculoEsSearch%5Btipo_valor%5D=1&VehiculoEsSearch%5Bprecio_range%5D=3990000%3B15190000')
driver = webdriver.Chrome('')
driver.get(url)
time.sleep(5)

SCROLL_PAUSE_TIME = 3

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

try:
    zelda = driver.find_elements_by_xpath("//*[@id='w1']/div[1]/div/div[1]/a").get_attribute('href')
    links.append(zelda)
except:
    pass

print(links)

So the expected output of this code should be something like this:

['link_car_1', 'link_car_12', 'link_car_23', '...']

But when I run this code it returns an empty list, whereas when I run it with find_element_by_xpath it returns the first link. What am I doing wrong? I just can't figure it out!

Thanks!

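【Comments】:

"All libraries are called at the start of the code, don't worry." - The point is to include everything needed to run the provided code and reproduce the problem from the question.
@QHarr Added. Should I have posted it as an update, or is editing like I did fine?
The way you did it is perfect. +

Tags: python python-3.x selenium-webdriver web-scraping xpath

【Solution 1】:

You only get one link because the XPath is not the same for all links. You can use bs4 to extract the links from the driver's page source, as shown below.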

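from selenium import webdriver
from bs4 import BeautifulSoup
import lxml  # parser backend used by BeautifulSoup below
import time

links = []

url = ('https://buy.olxautos.cl/buscar?VehiculoEsSearch%5Btipo_valor%5D=1&VehiculoEsSearch%5Bprecio_range%5D=3990000%3B15190000')
# Path is a placeholder: set it to the location of your chromedriver executable
driver = webdriver.Chrome(executable_path = Path)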
driver.get(url)
time.sleep(5)

SCROLL_PAUSE_TIME = 3

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # Scroll down to bottom

    page_source_ = driver.page_source
    
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")

    #use BeautifulSoup to extract links
    sup = BeautifulSoup(page_source_, 'lxml')
    sub_ = sup.findAll('div', {'class': 'owl-item active'})
    
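    for link_ in sub_:
        link = link_.find('a', href= True)
        # Skip items without a link, and links already collected on an earlier pass
        if link is None or link['href'] in links:
            continue
        #href = 'https://buy.olxautos.cl' + link['href'] #if needed (adding prefix)
        links.append(link['href'])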
    
    if new_height == last_height:
        break
    last_height = new_height
    
print('>> Total length of list : ', len(links))
print('\n',links)
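
As for why the code in the question printed an empty list: find_elements_by_xpath returns a Python list, and a list has no get_attribute method, so the call raises AttributeError, which the bare except then silently swallows. The sketch below reproduces this without a browser. FakeElement and find_elements are hypothetical stand-ins written for this illustration, not Selenium classes or functions.

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement (not the real class)."""

    def __init__(self, href):
        self._href = href

    def get_attribute(self, name):
        # Only 'href' is modelled in this sketch
        return self._href if name == "href" else None


def find_elements(hrefs):
    """Hypothetical stand-in for find_elements_by_xpath: always returns a list."""
    return [FakeElement(h) for h in hrefs]


elements = find_elements(["link_car_1", "link_car_12", "link_car_23"])

# What the question's code does: call get_attribute on the list itself.
links_broken = []
error_name = None
try:
    links_broken.append(elements.get_attribute("href"))
except Exception as exc:
    # A bare "except: pass" would hide this error completely
    error_name = type(exc).__name__

# Calling get_attribute on each element of the list works.
links_fixed = [element.get_attribute("href") for element in elements]

print(error_name)
print(links_broken)
print(links_fixed)
```

With a real driver, the same per-element loop would apply to the list returned by find_elements_by_xpath. Catching a specific exception, or printing the caught one, instead of using a bare except would have surfaced the error right away.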

【Comments】:

Thank you so much! I was too tilted yesterday to see any other way. This is the perfect answer.