Problems with scraping Tripadvisor with Selenium: answers

【Question title】: Problems with scraping Tripadvisor with Selenium
【Posted】: 2020-01-18 21:16:58
【Question description】:

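I have this Python code for scraping TripAdvisor. It worked well until last year (18 days ago), but now it no longer runs correctly and I get the result shown below.

I have been trying changes such as container = driver.find_elements_by_xpath("//div[@class='data-test-target']"), but they don't work. I also noticed that the site no longer has the element taLnk ulBlueLinks or the element review-container.

Please, it would be great if you could help me with the code.

PS: I am also trying Beautiful Soup, but that code doesn't work either.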

import csv
import time
from selenium import webdriver
import datetime
from selenium.common.exceptions import NoSuchElementException

#Common
now = datetime.datetime.now()
driver = webdriver.Chrome('chromedriver.exe')
italia = "https://www.tripadvisor.it/Attraction_Review-g657290-d2213040-Reviews-Ex_Stabilimento_Florio_delle_Tonnare_di_Favignana_e_Formica-Isola_di_Favig.html"
driver.get(italia)

place = 'Ex_Stabilimento_Florio_delle_Tonnare_di_Favignana'
lang = 'it'


def check_exists_by_xpath(xpath):
    try:
        driver.find_element_by_xpath(xpath)
    except NoSuchElementException:
        return False
    return True





for i in range(0, 2):
    try:
        if (check_exists_by_xpath("//span[@class='taLnk ulBlueLinks']")):
            driver.find_element_by_xpath("//span[@class='taLnk ulBlueLinks']").click()
            time.sleep(5)
        container = driver.find_elements_by_xpath("//div[@class='review-container']")

        num_page_items = len(container)
        for j in range(num_page_items):

            csvFile = open(r'Italia_en.csv', 'a')
            csvWriter = csv.writer(csvFile)

            time.sleep(10)
            rating_a = container[j].find_element_by_xpath(
                ".//span[contains(@class, 'ui_bubble_rating bubble_')]").get_attribute("class")
            rating_b = rating_a.split("_")
            rating = rating_b[3]

            review = container[j].find_element_by_xpath(".//p[@class='partial_entry']").text.replace("\n", "")
            title = container[j].find_element_by_class_name('quote').find_element_by_tag_name(
                'a').find_element_by_class_name('noQuotes').text

            print(review)
            rating_date = container[j].find_element_by_class_name('ratingDate').get_attribute('title')
            print(rating, review, title, "--", sep='\n')
            link_list = []
            for link in container[j].find_elements_by_tag_name('a'):
                link_previous = (link.get_attribute('href'))
                link_list.append(link_previous)

            print(link_list[1], "--", sep='\n')

            csvWriter.writerow([place, rating, title, review, rating_date, link_list[1], now, lang])

        driver.find_element_by_xpath('//a[@class="nav next taLnk ui_button primary"]').click()

        time.sleep(5)

    except:

        driver.find_element_by_xpath('//a[@class="nav next taLnk ui_button primary"]').click()
        time.sleep(5)

The result is:

Traceback (most recent call last):
  File "gh_code2.py", line 63, in <module>
    driver.find_element_by_xpath('//a[@class="nav next taLnk ui_button primary"]').click()
  File ".\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 394, in find_element_by_xpath
    return self.find_element(by=By.XPATH, value=xpath)
  File ".\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 978, in find_element
    'value': value})['value']
  File ".\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File ".\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//a[@class="nav next taLnk ui_button primary"]"}
  (Session info: chrome=79.0.3945.117)


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "gh_code2.py", line 69, in <module>
    driver.find_element_by_xpath('//a[@class="nav next taLnk ui_button primary"]').click()
  File ".\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 394, in find_element_by_xpath
    return self.find_element(by=By.XPATH, value=xpath)
  File ".\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 978, in find_element
    'value': value})['value']
  File ".\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File ".\Programs\Python\Python37\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//a[@class="nav next taLnk ui_button primary"]"}
  (Session info: chrome=79.0.3945.117)

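【Comments】:

The button you are searching for with the XPath for nav next taLnk ui_button primary no longer seems to exist. You will have to update the code to match the new page.
@IgnacioAguirre This sounds like an X-Y problem. Rather than asking for help with your attempted solution, edit your question and ask about the actual problem. What are you trying to do?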
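
The traceback in the question shows two exceptions because the bare except: block repeats the very lookup that just failed, so a missing button raises a second time inside the handler. A minimal sketch of a guard for that case follows. The helper name click_next and its pause parameter are hypothetical, and the XPath is the old one from the question, which no longer matches the current page. The string "xpath" is the value of Selenium's By.XPATH constant.

```python
import time

# Selector copied from the question; it no longer matches the live page
# and would need replacing with whatever the page uses now.
NEXT_XPATH = '//a[@class="nav next taLnk ui_button primary"]'


def click_next(driver, xpath=NEXT_XPATH, pause=0):
    """Click the "next page" link if it exists.

    Returns True after clicking, or False when no element matches,
    so the caller can stop paginating instead of crashing.
    """
    # find_elements returns an empty list when nothing matches,
    # whereas find_element raises NoSuchElementException.
    buttons = driver.find_elements("xpath", xpath)
    if not buttons:
        return False
    buttons[0].click()
    time.sleep(pause)  # crude wait for the next page to load
    return True
```

In the page loop, something like "if not click_next(driver, pause=5): break" would then replace both click calls, and the bare except: would no longer be needed for pagination.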

Tags: python selenium web-scraping tripadvisor

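【Solution 1】:

Unfortunately, in my experience scraping websites, if the site has been updated and those XPath references are gone, you have to update your script. In this case, it seems that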

class="nav next taLnk ui_button primary"

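is no longer a viable selector. If this is a common reference that keeps a similar structure over time, I would try using iteration rather than the exact class name, i.e. navigation button [0] or something like that (I don't have the exact syntax in front of me).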
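
One way to read the "iteration rather than the exact class name" suggestion is sketched below, as an illustration rather than the answerer's exact method. The first helper builds an XPath that matches on individual class tokens instead of the full class string, and the second picks a match by index. Both helper names are hypothetical, and the tokens nav and next come from the old selector, so whether they still appear on the current page is an assumption to verify in the browser's inspector.

```python
def xpath_with_classes(tag, *class_tokens):
    """Build an XPath for <tag> elements whose class attribute contains
    every given token, in any order and alongside any other classes."""
    conditions = [
        "contains(concat(' ', normalize-space(@class), ' '), ' {} ')".format(token)
        for token in class_tokens
    ]
    return "//{}[{}]".format(tag, " and ".join(conditions))


def nth_match(driver, xpath, index=0):
    """Return the index-th element matching xpath, or None if there is none.

    "xpath" is the string value of Selenium's By.XPATH constant.
    """
    matches = driver.find_elements("xpath", xpath)
    try:
        return matches[index]
    except IndexError:
        return None


# Hypothetical usage with a live driver:
#   button = nth_match(driver, xpath_with_classes("a", "nav", "next"))
#   if button is not None:
#       button.click()
```

Matching on tokens tolerates reordered or added classes, but it still breaks if the site renames the tokens themselves, which appears to be what happened here.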

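Otherwise, take a look at the Selenium IDE extension for Firefox. It can help you find other ways of referencing items as you click through the site.

Hope this is some help!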
