[Posted]: 2016-10-13 06:02:24
[Problem description]:
I am crawling TripAdvisor to save both the original (untranslated) comments and their translated versions (Portuguese to English). The crawler first selects "Portuguese first" so that Portuguese comments are shown, then translates them one by one into English, saving the translated comments in com_ and the expanded untranslated comments in expanded_comments.
The code works fine on the first page, but from the second page onward it fails to save the translated comments. Strangely, it translates only the first comment of each page, and does not even save those.
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

com_ = []
expanded_comments = []
date_ = []

driver = webdriver.Chrome(r"C:\Users\shalini\Downloads\chromedriver_win32\chromedriver.exe")
driver.maximize_window()

def expand_reviews(driver):
    # TRYING TO EXPAND REVIEWS (& CLOSE A POPUP)
    try:
        driver.find_element_by_class_name("moreLink").click()
    except:
        print "err"
    try:
        driver.find_element_by_class_name("ui_close_x").click()
    except:
        print "err2"
    try:
        driver.find_element_by_class_name("moreLink").click()
    except:
        print "err3"

def save_comments(driver):
    expand_reviews(driver)
    # SELECTING ALL EXPANDED COMMENTS
    time.sleep(3)
    spi = driver.page_source
    sp = BeautifulSoup(spi)
    for t in sp.findAll("div", {"class": "entry"}):
        if not t.findAll("p", {"class": "partial_entry"}):
            expanded_comments.append(t.getText())
    # SAVING REVIEW DATES
    for d in sp.findAll("span", {"class": "recommend-titleInline"}):
        date = d.text
        date_.append(date)
    # SELECTING ALL GOOGLE-TRANSLATE LINKS
    gt = driver.find_elements(By.CSS_SELECTOR, ".googleTranslation>.link")
    # NOW SAVING TRANSLATED COMMENTS
    for i in gt:
        try:
            driver.execute_script("arguments[0].click()", i)
            com = driver.find_element_by_xpath(".//span[@class = 'ui_overlay ui_modal ']//div[@class='entry']")
            com_.append(com.text)
            time.sleep(5)
            driver.find_element_by_class_name("ui_close_x").click()
            time.sleep(5)
        except Exception as e:
            pass

# ITERATING THROUGH ALL 200 TRIPADVISOR PAGES, SAVING COMMENTS & TRANSLATED COMMENTS
for i in range(200):
    page = i * 10
    url = "https://www.tripadvisor.com/Airline_Review-d8729164-Reviews-Cheap-Flights-or" + str(page) + "-TAP-Portugal#REVIEWS"
    driver.get(url)
    wait = WebDriverWait(driver, 10)
    if i == 0:
        # SELECTING PORTUGUESE COMMENTS ONLY (run once, then iterate over pages)
        try:
            langselction = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "span.sprite-date_picker-triangle")))
            langselction.click()
            driver.find_element_by_xpath("//div[@class='languageList']//li[normalize-space(.)='Portuguese first']").click()
            time.sleep(5)
        except Exception as e:
            print e
    save_comments(driver)
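For reference, the loop above builds each page's URL by stepping the review offset in multiples of 10, encoded as "-orNN-" in the URL. A minimal sketch of that scheme in isolation (written in Python 3 here, while the question's code is Python 2):

```python
# The review-list offset grows by 10 per page and is embedded in the URL.
BASE = ("https://www.tripadvisor.com/Airline_Review-d8729164-Reviews-"
        "Cheap-Flights-or{offset}-TAP-Portugal#REVIEWS")

def page_urls(n_pages):
    """Build the review-page URLs for the first n_pages pages."""
    return [BASE.format(offset=i * 10) for i in range(n_pages)]

urls = page_urls(3)
print(urls[1])  # second page starts at review offset 10
```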
[Discussion]:
-
Since there are too many lines of code to analyze, could you pinpoint where the problem is?
-
@Andersson The problem is that for the first page (the first iteration of the for loop) all original and translated comments are saved in their respective lists (com_ and expanded_comments), but on every page after that only the first comment is translated, and then it jumps to the next page without translating the remaining comments. Just run this code and inspect the lists com_ and expanded_comments after the 3rd/4th iteration; that will give you an idea.
-
@Andersson Could you also tell me how to skip the comments that are already in English (and therefore have no "Google Translate" widget under them)?
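One way to skip comments that are already in English is to filter per review container rather than collecting all .googleTranslation links globally: a review needs translating only if its container still holds the widget. A minimal BeautifulSoup sketch on toy markup (the "review" wrapper class is hypothetical; only "entry", "partial_entry", and "googleTranslation" come from the question's own selectors, and the real TripAdvisor markup may differ):

```python
from bs4 import BeautifulSoup

# Toy fragment mimicking the structure the question's selectors target.
html = """
<div class="review"><div class="entry"><p class="partial_entry">Otimo voo!</p></div>
  <div class="googleTranslation"><span class="link">Google Translation</span></div></div>
<div class="review"><div class="entry"><p class="partial_entry">Great flight!</p></div></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only reviews whose container holds the translate widget,
# i.e. the ones that are not already in English.
to_translate = [r.find("div", {"class": "entry"}).get_text(strip=True)
                for r in soup.find_all("div", {"class": "review"})
                if r.find("div", {"class": "googleTranslation"})]

print(to_translate)
```

In Selenium, the same idea would be to locate each review container first and call find_elements (plural) for the widget inside it, skipping the review when that list comes back empty.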
Tags: python selenium web-scraping