【问题标题】:Feed dataframe with webscraping带有网页抓取的提要数据框
【发布时间】:2021-07-13 20:22:03
【问题描述】:

我没有尝试将一些抓取的值附加到数据框中。我有这个代码:

import time
import requests
import pandas

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import json

# Grab content from URL
url = "https://www.remax.pt/comprar?searchQueryState={%22regionName%22:%22%22,%22businessType%22:1,%22listingClass%22:1,%22page%22:1,%22sort%22:{%22fieldToSort%22:%22ContractDate%22,%22order%22:1},%22mapIsOpen%22:false,%22listingTypes%22:[],%22prn%22:%22%22}"

PATH = 'C:\DRIVERS\chromedriver.exe'
driver = webdriver.Chrome(PATH)
option = Options()
option.headless = False
#chromedriver = 
#driver = webdriver.Chrome(chromedriver)
#driver = webdriver.Firefox() #(options=option)
#driver.get(url)
#driver.implicitly_wait(10)  # in seconds

time.sleep(1)
wait = WebDriverWait(driver, 10)
driver.get(url)

rows = driver.find_elements_by_xpath("//div[@class='row results-list ']/div")
data=[]
for row in rows:
    price=row.find_element_by_xpath(".//p[@class='listing-price']").text
    print(price)
    address=row.find_element_by_xpath(".//p[@class='listing-address']").text
    print(address)
    Tipo=row.find_element_by_xpath(".//p[@class='listing-type']").text
    print(Tipo)
    Area=row.find_element_by_xpath(".//p[@class='listing-area']").text
    print(Area)
    Quartos=row.find_element_by_xpath(".//p[@class='icon-bedroom-full']").text
    print(Quartos)
    data.append([price],[address],[Tipo],[Area],[Quartos])



#driver.quit()

问题是它返回以下错误:

NoSuchElementException                    Traceback (most recent call last)
<ipython-input-16-9e4d01985cda> in <module>
     49     price=row.find_element_by_xpath(".//p[@class='listing-price']").text
     50     print(price)
---> 51     address=row.find_element_by_xpath(".//p[@class='listing-address']").text
     52     print(address)
     53     Tipo=row.find_element_by_xpath(".//p[@class='listing-type']").text

~\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py in find_element_by_xpath(self, xpath)
    349             element = element.find_element_by_xpath('//div/td[1]')
    350         """
--> 351         return self.find_element(by=By.XPATH, value=xpath)
    352 
    353     def find_elements_by_xpath(self, xpath):

~\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py in find_element(self, by, value)
    656                 value = '[name="%s"]' % value
    657 
--> 658         return self._execute(Command.FIND_CHILD_ELEMENT,
    659                              {"using": by, "value": value})['value']
    660 

~\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py in _execute(self, command, params)
    631             params = {}
    632         params['id'] = self._id
--> 633         return self._parent.execute(command, params)
    634 
    635     def find_element(self, by=By.ID, value=None):

~\anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py in execute(self, driver_command, params)
    319         response = self.command_executor.execute(driver_command, params)
    320         if response:
--> 321             self.error_handler.check_response(response)
    322             response['value'] = self._unwrap_value(
    323                 response.get('value', None))

~\anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py in check_response(self, response)
    240                 alert_text = value['alert'].get('text')
    241             raise exception_class(message, screen, stacktrace, alert_text)
--> 242         raise exception_class(message, screen, stacktrace)
    243 
    244     def _value_or_default(self, obj, key, default):

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":".//p[@class='listing-address']"}
  (Session info: chrome=90.0.4430.72)

但是当我只尝试使用第一个元素时,它会返回一个价格列表。如果我给它在数据框中的不同位置并且我使用相同类型的路径有什么区别?

【问题讨论】:

  • 请发布被测页面的 HTML 以诊断此错误。

标签: python pandas selenium


【解决方案1】:

您遇到的主要问题是定位器。 1 首先,比较我使用的定位器和您代码中的定位器。 2 二、添加显式等待from selenium.webdriver.support import expected_conditions as EC 3 第三,去掉不必要的代码。

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
url = "https://www.remax.pt/comprar?searchQueryState={%22regionName%22:%22%22,%22businessType%22:1,%22listingClass%22:1,%22page%22:1,%22sort%22:{%22fieldToSort%22:%22ContractDate%22,%22order%22:1},%22mapIsOpen%22:false,%22listingTypes%22:[],%22prn%22:%22%22}"
driver.get(url)
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='row results-list ']/div")))
rows = driver.find_elements_by_xpath("//div[@class='row results-list ']/div")
data = []
for row in rows:
    price_p = row.find_element_by_xpath(".//p[@class='listing-price']").text
    address = row.find_element_by_xpath(".//h2[@class='listing-address']").text
    type = row.find_element_by_xpath(".//li[@class='listing-type']").text
    area = row.find_element_by_xpath(".//li[@class='listing-area']").text
    quartos = row.find_element_by_xpath(".//li[@class='listing-bedroom']").text
    data.append([price, address, price_p, area, quartos])
driver.close()
driver.quit()

请注意,我是在 Linux 上进行的。您的 Chrome 驱动程序位置不同。 此外,要打印列表,请使用:

for p in data:
    print(p.text, sep='\n')

您可以随意修改它。 我收到以下输出:

['240 000 €', 'Lisboa -  Lisboa, Carnide', 'Apartamento', '54 m\n2', '1']
['280 000 €', 'Lisboa -  Lisboa, Beato', 'Apartamento', '80 m\n2', '1']
['285 000 €', 'Lisboa -  Lisboa, Beato', 'Apartamento', '83 m\n2', '1']
['290 000 €', 'Lisboa -  Lisboa, Beato', 'Apartamento', '85 m\n2', '1']
['280 000 €', 'Lisboa -  Lisboa, Beato', 'Apartamento', '80 m\n2', '1']
['290 000 €', 'Lisboa -  Lisboa, Beato', 'Apartamento', '85 m\n2', '1']
['285 000 €', 'Lisboa -  Lisboa, Beato', 'Apartamento', '83 m\n2', '1']
['80 000 €', 'Santarém -  Cartaxo, Ereira e Lapa', 'Terreno', '12440 m\n2', '1']
['260 000 €', 'Lisboa -  Sintra, Queluz e Belas', 'Prédio', '454 m\n2', '1']
['37 500 €', 'Santarém -  Torres Novas, Torres Novas (Santa Maria, Salvador e Santiago)', 'Prédio', '92 m\n2', '1']
['505 000 €', 'Lisboa -  Sintra, Algueirão-Mem Martins', 'Duplex', '357 m\n2', '1']
['135 700 €', 'Lisboa -  Mafra, Milharado', 'Terreno', '310 m\n2', '1']
['132 800 €', 'Lisboa -  Mafra, Milharado', 'Terreno', '310 m\n2', '1']
['133 440 €', 'Lisboa -  Mafra, Milharado', 'Terreno', '310 m\n2', '1']
['179 000 €', 'Lisboa -  Mafra, Milharado', 'Terreno', '310 m\n2', '1']
['75 000 €', 'Lisboa -  Vila Franca de Xira, Vila Franca de Xira', 'Apartamento', '52 m\n2', '1']
['575 000 €', 'Porto -  Matosinhos, Matosinhos e Leça da Palmeira', 'Apartamento', '140 m\n2', '1']
['35 000 €', 'Setúbal -  Almada, Caparica e Trafaria', 'Outros - Habitação', '93 m\n2', '1']
['550 000 €', 'Leiria -  Alcobaça, Évora de Alcobaça', 'Moradia', '160 m\n2', '1']
['550 000 €', 'Lisboa -  Loures, Santa Iria de Azoia, São João da Talha e Bobadela', 'Moradia', '476 m\n2', '1']

【讨论】:

  • data.append([price,address,type,area,quartos]) 您也可以像这样将其附加到列表中。
  • 来自元素的 Xpathing 需要一个 .否则,您将始终从第一行检索。
  • 我认为它有效,但我错了,因为后来,当我声明“数据”时,结果给了我这样的东西:[,我期待一个包含所有对应字段的数据框。
  • data 基本上是一个元素列表。你将如何修改它是一个不同的问题。
  • 剩下的就是将列表转换为 pandas 数据框。 Stackoverflow 已经回答了很多关于如何实现这一点的问题。我也更新了一个定位器,之前不正确,最后一个
猜你喜欢
  • 1970-01-01
  • 2013-11-05
  • 2021-05-18
  • 2021-10-15
  • 2021-12-28
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 2021-08-29
相关资源
最近更新 更多