[Posted at]: 2018-10-29 07:04:58
[Problem description]:
I am trying to scrape the Copart website with Selenium. The data is displayed in rows, including a header row. I first grab the HTML of the whole page with this piece of code:
from bs4 import BeautifulSoup as soup
import requests
from selenium import webdriver
filename = "coparttest.csv"
f = open(filename, "w", encoding="utf-8")
headers = "lotnumber,makeyear,makebrand,model,location,sale_date,odometer,doc_type,damage,est_retail_value,current_bid,photos\n"
f.write(headers)
chrome_driver = "/Users/nguyenquanghung/Desktop/webscrape/silenium/chromedriver"
driver = webdriver.Chrome(chrome_driver)
url = "https://www.copart.com/vehicleFinderSearch/?displayStr=BMW,%5B2014%20TO%202019%5D&from=%2FvehicleFinder%2F%3Fintcmp%3Dweb_homepage_hero_vehiclefinder_en&searchStr=%7B%22MISC%22:%5B%22%23MakeCode:BMW%20OR%20%23MakeDesc:BMW%22,%22%23VehicleTypeCode:VEHTYPE_V%22,%22%23LotYear:%5B2014%20TO%202019%5D%22%5D,%22sortByZip%22:false,%22buyerEnteredZip%22:null%7D"
driver.get(url)
page = driver.execute_script("return document.documentElement.outerHTML")
page_soup = soup(page, "html.parser")
rows = page_soup.findAll("tr",{"role":"row"})
Then I run a for loop to collect all the data I need from each row, including the photos, which only appear after clicking the zoom button. So I use
driver.find_element_by_xpath(...).click()
to click the corresponding button to open the photo carousel, and then fetch the HTML again with:
driver.execute_script("return document.documentElement.outerHTML")
and finally get the photos. Note that I also skip the first row because it is the header. The code works fine, except that the first row gets no photos, the first row's photos end up attached to the second row, and so on… It looks as if the inner for loop skips its first iteration. Here is the rest of the code:
for index, row in enumerate(rows[1:]):
    lotnumber = row.find("div",{"class":""}).a.text
    makeyear = row.find("span",{"data-uname":"lotsearchLotcenturyyear"}).text
    makebrand = row.find("span",{"data-uname":"lotsearchLotmake"}).text
    model = row.find("span",{"data-uname":"lotsearchLotmodel"}).text
    location = row.find("span",{"data-uname":"lotsearchLotyardname"}).text
    sale_date = row.find("span",{"data-uname":"lotsearchLotauctiondate"}).text
    odometer = row.find("span",{"data-uname":"lotsearchLotodometerreading"}).text.replace(",","")
    doc_type = row.find("span",{"data-uname":"lotsearchSaletitletype"}).text
    damage = row.find("span",{"data-uname":"lotsearchLotdamagedescription"}).text
    est_retail_value = row.find("span",{"data-uname":"lotsearchLotestimatedretailvalue"}).text.replace(",","")
    bid = row.findAll("ul",{"class":"list-unstyled"})[0]
    bid_span = bid.li.ul.li.findAll("span")
    current_bid = bid_span[1].text.replace(",","")
    #Get photo
    #zoom photo
    zoom_button = str(index + 1)
    driver.find_element_by_xpath('//*[@id="serverSideDataTable"]/tbody/tr[' + zoom_button + ']/td[2]/div[1]/span').click()
    photo_html = driver.execute_script("return document.documentElement.outerHTML")
    photo_soup = soup(photo_html, "html.parser")
    # print("photo_soup ---> ",photo_soup)
    photos_list = photo_soup.findAll("img",{"class":"zoomImg"})
    photos = [index]
    for photo in photos_list:
        src = photo["src"]
        photos.append(src)
        print("print photo ---> ",index, src)
    photos = str(photos).replace(","," |")
    #close photo
    driver.find_element_by_xpath('//*[@id="lotImage"]/div/div/div[1]/h4/button').click()
    print("print row ---> ",index,zoom_button,lotnumber,makeyear,makebrand,model,location,sale_date,odometer,doc_type,damage,est_retail_value,current_bid,photos)
    #write row to csv
    f.write(lotnumber+","+makeyear+","+makebrand+","+model+","+location+","+sale_date+","+odometer+","+doc_type+","+damage+","+est_retail_value+","+current_bid+","+photos+"\n")
driver.close()
f.close()
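One thing that can be checked without a browser is the index arithmetic, since the off-by-one look of the output makes the loop itself a suspect. A minimal sketch (the list contents are made up) showing that `enumerate(rows[1:])` paired with the 1-based XPath index `tr[index + 1]` does line up with the first data row, assuming the header row lives outside `<tbody>` (e.g. in `<thead>`):

```python
# Offline check of the indexing used above: enumerate over rows[1:]
# starts at index 0, and XPath tr[] is 1-based, so tr[index + 1]
# targets the first data row when index == 0.
rows = ["header", "car-A", "car-B", "car-C"]
targets = []
for index, row in enumerate(rows[1:]):
    targets.append((row, "tr[" + str(index + 1) + "]"))
print(targets)
# → [('car-A', 'tr[1]'), ('car-B', 'tr[2]'), ('car-C', 'tr[3]')]
```

So the pairing of data row and clicked `tr` is consistent on paper; if the header row were the first `<tr>` inside `<tbody>`, however, `tr[1]` would click the header instead of the first car.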
Does anyone know how/why the first row ends up with empty photo data?
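As a side note unrelated to the photo offset: joining fields with `","` by hand produces a broken CSV as soon as any field value contains a comma. A minimal sketch (field values are made up) using the standard-library `csv` module, which quotes such fields automatically:

```python
import csv
import io

# Let csv.writer handle quoting instead of concatenating with ",".
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["lotnumber", "makeyear", "photos"])
# A field that itself contains commas gets quoted automatically:
writer.writerow(["46776558", "2016", "a.jpg, b.jpg"])
print(buf.getvalue())
```

With this, the `.replace(",", "")` and `.replace(",", " |")` workarounds in the loop become unnecessary.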
[Discussion]:
Tags: selenium selenium-webdriver web-scraping beautifulsoup