【Question title】: Web scrape with Selenium webdriver, the first iteration got skipped
【Posted】: 2018-10-29 07:04:58
【Question】:

I am trying to scrape the Copart website with Selenium. The data is displayed in table rows, with a header row at the top. I use this code to first get the HTML of the whole page.

from bs4 import BeautifulSoup as soup
import requests
from selenium import webdriver

filename = "coparttest.csv"
f = open(filename, "w", encoding="utf-8")
headers = "lotnumber,makeyear,makebrand,model,location,sale_date,odometer,doc_type,damage,est_retail_value,current_bid,photos\n"
f.write(headers)

chrome_driver = "/Users/nguyenquanghung/Desktop/webscrape/silenium/chromedriver"
driver = webdriver.Chrome(chrome_driver)

url = "https://www.copart.com/vehicleFinderSearch/?displayStr=BMW,%5B2014%20TO%202019%5D&from=%2FvehicleFinder%2F%3Fintcmp%3Dweb_homepage_hero_vehiclefinder_en&searchStr=%7B%22MISC%22:%5B%22%23MakeCode:BMW%20OR%20%23MakeDesc:BMW%22,%22%23VehicleTypeCode:VEHTYPE_V%22,%22%23LotYear:%5B2014%20TO%202019%5D%22%5D,%22sortByZip%22:false,%22buyerEnteredZip%22:null%7D"
driver.get(url)

page = driver.execute_script("return document.documentElement.outerHTML")
page_soup = soup(page, "html.parser")
rows = page_soup.findAll("tr",{"role":"row"})

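Then I run a for loop to get all the data I need, including the photos for each row, which only appear when the zoom button is clicked. So I use driver.find_element_by_xpath(...).click() to click the corresponding button and open the photo carousel, then fetch the HTML again with driver.execute_script("return document.documentElement.outerHTML") to finally get the photos. Note that I also skip the first row because it is the header. The code works well, except that the first row gets no photos, the first photos are attached to the second row, and so on. It looks like the inner for loop skips its first iteration. Here is the rest of the code: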

for index, row in enumerate(rows[1:]):
    lotnumber = row.find("div",{"class":""}).a.text
    makeyear = row.find("span",{"data-uname":"lotsearchLotcenturyyear"}).text
    makebrand = row.find("span",{"data-uname":"lotsearchLotmake"}).text
    model = row.find("span",{"data-uname":"lotsearchLotmodel"}).text
    location = row.find("span",{"data-uname":"lotsearchLotyardname"}).text
    sale_date = row.find("span",{"data-uname":"lotsearchLotauctiondate"}).text
    odometer = row.find("span",{"data-uname":"lotsearchLotodometerreading"}).text.replace(",","")
    doc_type = row.find("span",{"data-uname":"lotsearchSaletitletype"}).text
    damage = row.find("span",{"data-uname":"lotsearchLotdamagedescription"}).text
    est_retail_value = row.find("span",{"data-uname":"lotsearchLotestimatedretailvalue"}).text.replace(",","")

    bid = row.findAll("ul",{"class":"list-unstyled"})[0]
    bid_span = bid.li.ul.li.findAll("span")
    current_bid = bid_span[1].text.replace(",","")

    #Get photo
    #zoom photo
    zoom_button = str(index + 1)
    driver.find_element_by_xpath('//*[@id="serverSideDataTable"]/tbody/tr[' + zoom_button + ']/td[2]/div[1]/span').click()
    photo_html = driver.execute_script("return document.documentElement.outerHTML")
    photo_soup = soup(photo_html, "html.parser")
    # print("photo_soup ---> ",photo_soup)
    photos_list = photo_soup.findAll("img",{"class":"zoomImg"})
    photos = [index]
    for photo in photos_list:
        src = photo["src"]
        photos.append(src)
        print("print photo ---> ",index, src)
    photos = str(photos).replace(","," |")
    #close photo
    driver.find_element_by_xpath('//*[@id="lotImage"]/div/div/div[1]/h4/button').click()

    print("print row ---> ",index,zoom_button,lotnumber,makeyear,makebrand,model,location,sale_date,odometer,doc_type,damage,est_retail_value,current_bid,photos)

    #write row to csv
    f.write(lotnumber+","+makeyear+","+makebrand+","+model+","+location+","+sale_date+","+odometer+","+doc_type+","+damage+","+est_retail_value+","+current_bid+","+photos+"\n")


driver.close()
f.close()       

Does anyone know how/why the first row ends up with empty data?

【Comments】:

    Tags: selenium selenium-webdriver web-scraping beautifulsoup


    【Solution 1】:

    Try replacing this code:

    photo_html = driver.execute_script("return document.documentElement.outerHTML")
    photo_soup = soup(photo_html, "html.parser")
    # print("photo_soup ---> ",photo_soup)
    photos_list = photo_soup.findAll("img",{"class":"zoomImg"})
    photos = [index]
    for photo in photos_list:
        src = photo["src"]
        photos.append(src)
        print("print photo ---> ",index, src)
    photos = str(photos).replace(","," |")
    

    with:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    #...
    
    driver.find_element_by_xpath('//*[@id="serverSideDataTable"]/tbody/tr[' + zoom_button + ']/td[2]/div[1]/span').click()
    WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".zoomImg")))
    photos_list = driver.execute_script("return [...document.querySelectorAll('.zoomImg')].map(e=>e.getAttribute('src'))")
    for photo in photos_list:
      print("print photo ---> ", photo)
    

    There is also an issue with the zoom button index zoom_button = str(index + 1); the zoom button should be zoom_button = str(index)

    Working Java code:
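
On the index suggestion, it may be worth checking how the loop index maps to the XPath. XPath positional predicates are 1-based, so when `enumerate(rows[1:])` starts at 0, `str(index)` would produce `tr[0]` for the first data row, which matches nothing. A minimal sketch of the mapping, assuming the header row sits outside `<tbody>` (the helper name `zoom_button_xpath` is hypothetical and not part of the original code):

```python
def zoom_button_xpath(index):
    """Build the XPath of the zoom button for the data row at 0-based `index`.

    XPath positions start at 1, so tr[1] is the first row inside <tbody>.
    Assumption: the header row is not a child of <tbody>.
    """
    if index < 0:
        raise ValueError("index must be non-negative")
    row_position = index + 1  # 0-based enumerate() index -> 1-based XPath position
    return (
        '//*[@id="serverSideDataTable"]/tbody/tr[%d]/td[2]/div[1]/span'
        % row_position
    )


# The first three iterations of enumerate(rows[1:]) map to tr[1], tr[2], tr[3].
for index in range(3):
    print(index, zoom_button_xpath(index))
```

Under that assumption, `index + 1` in the question already points at the intended row, which is consistent with the asker's later comment that switching to `str(index)` did not help.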

    WebDriverWait wait = new WebDriverWait(driver, 20);
    
    driver.get("https://www.copart.com/vehicleFinderSearch/?displayStr=BMW,%5B2014%20TO%202019%5D&from=%2FvehicleFinder%2F%3Fintcmp%3Dweb_homepage_hero_vehiclefinder_en&searchStr=%7B%22MISC%22:%5B%22%23MakeCode:BMW%20OR%20%23MakeDesc:BMW%22,%22%23VehicleTypeCode:VEHTYPE_V%22,%22%23LotYear:%5B2014%20TO%202019%5D%22%5D,%22sortByZip%22:false,%22buyerEnteredZip%22:null%7D");
    
    List<WebElement> rows = wait.until(ExpectedConditions.numberOfElementsToBe(By.cssSelector("tbody tr[role=row]"), 21));
    for (WebElement row:rows) {
        row.findElement(By.cssSelector("span.searchiconbtn")).click();
        ArrayList<String> photos = (ArrayList)((JavascriptExecutor) driver).executeScript("return [...document.querySelectorAll('.zoomImg')].map(e=>e.getAttribute('src'))");
    }
    

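    【Comments】:

    • Thank you very much for the reply! I just tried this, but unfortunately it did not improve anything. It does help shorten my code, but the first photos_list is still empty. It looks like, after driver.execute_script(...), the inner for loop is skipped and execution jumps straight to the print-row line.
    • @LongNguyen You can wait until the element is visible; check my updated answer.
    • Thanks for the prompt reply. I tried it but got an error: `selenium.common.exceptions.TimeoutException: Message:`. Even after driver.execute_script(), it looks like the first photos_list has no ".zoomImg" in the HTML.
    【Solution 2】: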

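    @sers Finally, I found a workaround. Before getting any data for the first row, I have to open and close the zoom button once. I don't know why. Still, thanks, I have learned about WebDriverWait and EC. Here is what I have: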

    zoom_button = str(index + 1)
    
    Open and close for the first time:
    
    driver.find_element_by_xpath('//*[@id="serverSideDataTable"]/tbody/tr[' + zoom_button + ']/td[2]/div[1]/span').click()
    photos_list = driver.execute_script("return [...document.querySelectorAll('.zoomImg')].map(e=>e.getAttribute('src'))")
    driver.implicitly_wait(10)
    driver.find_element_by_xpath('//*[@id="lotImage"]/div/div/div[1]/h4/button').click()
    
    Open it again and get data:
    
    driver.find_element_by_xpath('//*[@id="serverSideDataTable"]/tbody/tr[' + zoom_button + ']/td[2]/div[1]/span').click()
    photos_list = driver.execute_script("return [...document.querySelectorAll('.zoomImg')].map(e=>e.getAttribute('src'))")
    photos = []
    for photo in photos_list:
        photos.append(photo)
        print("print photo ---> ", photo)
    photos = str(photos)
    driver.implicitly_wait(10)
    driver.find_element_by_xpath('//*[@id="lotImage"]/div/div/div[1]/h4/button').click()
    print("print row ---> ",index,zoom_button,lotnumber,makeyear,makebrand,model,location,sale_date,odometer,doc_type,damage,est_retail_value,current_bid,photos)
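
The workaround hints at a timing problem rather than an index problem. One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that the `.zoomImg` elements are added asynchronously after the click, so the DOM is read before the first carousel has rendered, and each later read picks up images left over from the previous row. Note also that `driver.implicitly_wait(10)` only sets the timeout used for element lookups; it does not pause the script. The sketch below polls until the list of sources is non-empty and differs from the previous row's list. The names `wait_for_fresh_srcs`, `fetch_srcs` and `last_photos` are hypothetical.

```python
import time


def wait_for_fresh_srcs(fetch_srcs, previous=(), timeout=10.0, poll=0.25):
    """Poll fetch_srcs() until it returns a non-empty list that differs from `previous`.

    Returns the fresh list, or an empty list if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    stale = list(previous)
    while True:
        srcs = list(fetch_srcs() or [])
        if srcs and srcs != stale:
            return srcs  # new images have appeared
        if time.monotonic() >= deadline:
            return []  # give up; the caller decides how to handle a missing carousel
        time.sleep(poll)


# Hypothetical usage inside the row loop, right after clicking the zoom button:
# script = "return [...document.querySelectorAll('.zoomImg')].map(e => e.getAttribute('src'))"
# photos_list = wait_for_fresh_srcs(lambda: driver.execute_script(script), previous=last_photos)
# last_photos = photos_list
```

Comparing against the previous list assumes that two consecutive lots never share exactly the same image URLs; if they could, a different freshness signal (for example the lot number shown in the carousel) would be needed.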
    

    【Comments】:

    • Try zoom_button = str(index)
    • @Sers That was the first thing I tried, thinking the index was off by 1. But it did not fix it.