[Posted at]: 2021-08-26 17:52:18
[Problem description]:
With the help of Selenium and Scrapy, I am only getting 12 out of 487 items. How can I scrape all of them? I cannot figure out what I am doing wrong here. Any help is appreciated.
My code:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from shutil import which
from scrapy.selector import Selector
from selenium_stealth import stealth
from time import sleep


class CpcuSpider(CrawlSpider):
    name = 'cp'
    allowed_domains = ['www.arp.fr']
    start_urls = [
        'https://www.arp.fr/produits-portables-tablettes-ordinateurs-portables/?queryString=JTdCJTIyYXJlYUlkJTIyJTNBJTIyMkVEODhGMjctOTNFOS00NzQzLUI3NDYtRUNFQUJENUZFRDA4JTIyJTJDJTIyaXNRdWVyeSUyMiUzQWZhbHNlJTJDJTIyc29ydEF0dHJpYnV0ZSUyMiUzQW51bGwlMkMlMjJzb3J0RGlyZWN0aW9uJTIyJTNBbnVsbCUyQyUyMnBhZ2VubyUyMiUzQSUyMjElMjIlMkMlMjJwZXJQYWdlJTIyJTNBJTIyMTIlMjIlMkMlMjJ2YWx1ZXMlMjIlM0ElNUIlNUQlMkMlMjJwcm9kdWN0SWRzJTIyJTNBJTVCJTVEJTJDJTIycGFydG5lcklkJTIyJTNBbnVsbCUyQyUyMm9wdGlvbnMlMjIlM0ElNUJudWxsJTJDbnVsbCUyQ251bGwlNUQlN0Q=&page=' + str(x) + '&productfilter=&sort=null'
        for x in range(1, 6)
    ]

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a[@class="rasEpicTitle rasElementReaction"]'),
             callback='parse_item', follow=False),
        # Rule(LinkExtractor(restrict_xpaths='//*[@class="fielddata"]/a'),
        #      callback='parse_item', follow=True),
    )

    def __init__(self):
        # this page loads
        CrawlSpider.__init__(self)
        chrome_path = which("chromedriver")
        self.driver = webdriver.Chrome(executable_path=chrome_path)
        print(dir(self.driver))
        self.driver.maximize_window()
        # self.driver.quit()

    def parse_item(self, response):
        self.driver.get(response.url)
        sleep(5)
        title = Selector(text=self.driver.page_source)
        # for list_node in lists.xpath('//*[@class="rasEpicBoxContainer"]'):
        yield {
            'Title': title.xpath('//*[@title="028001007"]/text()').get()
        }
        # self.driver.close()
[Discussion]:
- Since the site loads its content after the page opens, you first have to open it with Selenium's Chrome webdriver, then click the buttons one by one to continue; after that you can pull all the data.
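The "click the buttons one by one" idea above can be sketched as a generic loop that keeps clicking a load-more-style element until it disappears. The selector for the button is site-specific and not given in the question, so `find_button` is left as a hypothetical callable that the caller supplies (with Selenium it might wrap `driver.find_elements(By.CSS_SELECTOR, ...)` and return the first match or `None`):

```python
from time import sleep

def click_until_exhausted(driver, find_button, max_clicks=50, delay=1.0):
    """Repeatedly click a 'load more'-style button until none is found.

    find_button: callable taking the driver and returning a clickable
    element or None. With Selenium this would locate the button by a
    site-specific selector (an assumption; the real selector is unknown).
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        button = find_button(driver)
        if button is None:
            break  # no more button -> all items should be loaded
        button.click()
        sleep(delay)  # give the page time to append the next batch
        clicks += 1
    return clicks
```

After the loop finishes, `driver.page_source` would contain the fully loaded list, which can then be parsed with a Scrapy `Selector` as in the spider above.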
Tags: python selenium xpath scrapy