【Question Title】: Unable to scrape all of the items
【Posted】: 2021-08-26 17:52:18
【Description】:

With the help of Selenium and Scrapy I am only getting 12 out of 487 items. How can I scrape all of the items? I can't figure out what is wrong here. Any help is appreciated.

URL

My code:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from shutil import which
from scrapy.selector import Selector
from selenium_stealth import stealth
from time import sleep


class CpcuSpider(CrawlSpider):
    name = 'cp'
    allowed_domains = ['www.arp.fr']
    start_urls = [
        'https://www.arp.fr/produits-portables-tablettes-ordinateurs-portables/?queryString=JTdCJTIyYXJlYUlkJTIyJTNBJTIyMkVEODhGMjctOTNFOS00NzQzLUI3NDYtRUNFQUJENUZFRDA4JTIyJTJDJTIyaXNRdWVyeSUyMiUzQWZhbHNlJTJDJTIyc29ydEF0dHJpYnV0ZSUyMiUzQW51bGwlMkMlMjJzb3J0RGlyZWN0aW9uJTIyJTNBbnVsbCUyQyUyMnBhZ2VubyUyMiUzQSUyMjElMjIlMkMlMjJwZXJQYWdlJTIyJTNBJTIyMTIlMjIlMkMlMjJ2YWx1ZXMlMjIlM0ElNUIlNUQlMkMlMjJwcm9kdWN0SWRzJTIyJTNBJTVCJTVEJTJDJTIycGFydG5lcklkJTIyJTNBbnVsbCUyQyUyMm9wdGlvbnMlMjIlM0ElNUJudWxsJTJDbnVsbCUyQ251bGwlNUQlN0Q=&page='+str(x)+'&productfilter=&sort=null' for x in range(1,6)]

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a[@class="rasEpicTitle rasElementReaction"]'), callback='parse_item', follow=False),
        #Rule(LinkExtractor(restrict_xpaths='//*[@class="fielddata"]/a'), callback='parse_item', follow=True),
    )

    def __init__(self):
        # this page loads
        CrawlSpider.__init__(self)
        chrome_path = which("chromedriver")
        self.driver = webdriver.Chrome(executable_path=chrome_path)
        
        print(dir(self.driver))
        self.driver.maximize_window()
        # self.driver.quit()

    def parse_item(self, response):
        self.driver.get(response.url)
        sleep(5)

        title = Selector(text=self.driver.page_source)
        #for list_node in lists.xpath('//*[@class="rasEpicBoxContainer"]'):
            
        yield{
            'Title': title.xpath('//*[@title="028001007"]/text()').get()
        }
        #self.driver.close()

【Comments】:

  • Since the site loads its content after opening, you first have to open it with Selenium's Chrome web driver and then click the button repeatedly to continue; after that you can pull all the data.

Tags: python selenium xpath scrapy


【Solution 1】:

Starting with start_urls, there are many errors in your code. If you inspect the website, you will find that pagination does not work via the URL. For example, you cannot load the third page using https://www.arp.fr/produits-portables-tablettes-ordinateurs-portables/?queryString=JTdCJTIyYXJlYUlkJTIyJTNBJTIyMkVEODhGMjctOTNFOS00NzQzLUI3NDYtRUNFQUJENUZFRDA4JTIyJTJDJTIyaXNRdWVyeSUyMiUzQWZhbHNlJTJDJTIyc29ydEF0dHJpYnV0ZSUyMiUzQW51bGwlMkMlMjJzb3J0RGlyZWN0aW9uJTIyJTNBbnVsbCUyQyUyMnBhZ2VubyUyMiUzQSUyMjElMjIlMkMlMjJwZXJQYWdlJTIyJTNBJTIyMTIlMjIlMkMlMjJ2YWx1ZXMlMjIlM0ElNUIlNUQlMkMlMjJwcm9kdWN0SWRzJTIyJTNBJTVCJTVEJTJDJTIycGFydG5lcklkJTIyJTNBbnVsbCUyQyUyMm9wdGlvbnMlMjIlM0ElNUJudWxsJTJDbnVsbCUyQ251bGwlNUQlN0Q=&page=3&productfilter=&sort=null — you will get the FIRST page instead.
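To see what that queryString parameter actually carries, you can reverse the site's encoding yourself. This is a minimal sketch; the encoding order (JSON, then URL-encoding, then Base64) is inferred from the generate_query_string helper in the answer's code, and the sample payload is made up:

```python
import base64
import json
import urllib.parse


def decode_query_string(query_string):
    # Reverse the site's encoding: Base64-decode, URL-decode, then parse JSON
    urlencoded = base64.b64decode(query_string).decode('ascii')
    raw = urllib.parse.unquote_plus(urlencoded)
    return json.loads(raw)


# Round-trip check with a hypothetical payload
params = {"pageno": "3", "perPage": "12"}
encoded = base64.b64encode(
    urllib.parse.quote_plus(json.dumps(params)).encode('ascii')
).decode('ascii')
print(decode_query_string(encoded))  # {'pageno': '3', 'perPage': '12'}
```

Running the queryString value from the original URL through decode_query_string is how you can recover the query parameters without an online decoder.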

I suggest a different approach: simulate the JavaScript call inside a Scrapy spider. Internally, the website calls a special URL to receive JSON and then renders it for you. We can try to do the same:

import scrapy
import json
import base64
import urllib
from scrapy.http import HtmlResponse # to update response from a string
import chompjs # to parse Javascript object


def generate_query_string(query):
    # The website sends pagination and query parameters in a special HTTP header
    # The header value is URL-encoded and then Base64-encoded
    query_string_raw = json.dumps(query)
    query_string_urlencoded = urllib.parse.quote_plus(query_string_raw)
    query_string = base64.b64encode(query_string_urlencoded.encode('ascii')).decode('ascii')
    return query_string

class ArpSpider(scrapy.Spider):
    name = '68943284'
    # I got the query params above from your URL using an online Base64 decoder and then an online URL decoder
    # Best of all, we can request 500 results per page and get everything in a SINGLE call!
    query = {
        "areaId": "2ED88F27-93E9-4743-B746-ECEABD5FED08", 
        "isQuery": False, 
        "sortAttribute": None, 
        "sortDirection": None, 
        "pageno": "1", 
        "perPage": "500", 
        "values": [], 
        "productIds": ["5267337-05", "5393345-05", "5400545-05", "5400812-05", "5404575-05", "5409557-05", "5410466-05", "5412282-05", "5412314-05", "5412318-05", "5412323-05", "5421276-05"],
        "partnerId": None, 
        "options": [None, None, None]
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.arp.fr/filter/page.json',
            headers={
                'queryString': generate_query_string(self.query),
            },
            callback=self.parse
        )
    
    def parse(self, response):
        # with open('Samples/Arp.json', 'wb') as f:
        #    f.write(response.body)
        # We need to parse JSON response and get HTML code from it
        data = json.loads(response.text)
        # print(data['products'])
        response = HtmlResponse(url="My URL", body=data['products'], encoding='utf-8')
        # Now we need to parse HTML and get Javascript object with all data we need
        javascript = response.xpath('//script[contains(., "dataLayer.push")]/text()').re_first(r'dataLayer\.push\(([\s\S]+?)\);')
        if javascript:
            data = chompjs.parse_js_object(javascript)
            for item in data['ecommerce']['impressions']:
                name = item['name']
                price = item['price']
                yield {'name': name, 'price': price}

【Discussion】:

  • Thank you. Everything works, but I don't understand the use of chompjs. How do I install it? I tried pip install in my Scrapy virtual environment but it shows an error. Can you suggest an alternative to chompjs, or how to install it with pip? Its community is very small; I searched and found nothing useful.
  • @Mohna I used chompjs because the website uses single quotes in the JavaScript object (and json.loads() cannot parse that). You can use another library for this or write your own parsing logic.
  • Thank you. It produces the output nicely. But I don't understand what "My URL" means, or what the `?` in the regex part does. Could you explain?
  • Also, what does re_first mean and why is it used?
  • @Mohna "My URL" is just a string that will be used as the base URL for the loaded HTML.
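To illustrate the point about single quotes, here is a small sketch with a made-up payload. For simple object literals, the standard-library ast.literal_eval can stand in for chompjs, because a single-quoted JS object happens to also be a valid Python dict literal; it will, however, break on JS-only tokens such as true/false/null, which is exactly what chompjs handles:

```python
import ast
import json

# A single-quoted JS object literal, like the site's dataLayer payload (hypothetical sample)
js_object = "{'name': 'ThinkPad X1', 'price': 1299.0}"

# json.loads() requires double quotes, so it rejects this string
try:
    json.loads(js_object)
except json.JSONDecodeError:
    print("json.loads failed")

# ast.literal_eval parses the same string as a Python dict literal
data = ast.literal_eval(js_object)
print(data['price'])  # 1299.0
```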
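On the other question in the thread: re_first runs the regex against the selector's extracted text and returns the first capture group (or None if there is no match). In the pattern, `[\s\S]` matches any character including newlines, and the trailing `?` makes the quantifier non-greedy, so matching stops at the first `);`. A plain-`re` sketch of the same extraction, with a made-up script string:

```python
import re

# Hypothetical page script containing the dataLayer call
script = 'var x = 1; dataLayer.push({"ecommerce": {"impressions": []}}); var y = 2;'

# [\s\S] matches any character (including newlines); the trailing ? makes the
# quantifier non-greedy, so the capture stops at the first ');'
match = re.search(r'dataLayer\.push\(([\s\S]+?)\);', script)
print(match.group(1))  # {"ecommerce": {"impressions": []}}
```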