Running a Scrapy spider as a script doesn't get all the code, but the Scrapy spider from the project does

[Question title]: Running Scrapy spider as script doesn't get all the code, but scrapy spider from project does
[Posted]: 2021-08-07 20:42:46
[Question]:

I have a simple spider that scrapes some data from a script tag on a page.

This is how I grab the script:

jsData = json.loads(response.xpath('//script[@type="application/ld+json"]//text()').extract_first())

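When I run it as a spider inside my project, I get all the data. When I run it from a standalone script instead of the project, it doesn't get all the data from the script tag. Why is that?

Here is my script spider: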

import scrapy
import json
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    name = "target"
    start_urls = ['https://www.target.com/p/madden-nfl-22-xbox-one-series-x/-/A-83744898#lnk=sametab']

    def parse(self, response):
        jsData = json.loads(response.xpath('//script[@type="application/ld+json"]//text()').extract_first())
        NAME_SELECTOR = jsData['@graph'][0]

        yield {
            'name': NAME_SELECTOR,
        }


process = CrawlerProcess()

process.crawl(MySpider)
process.start()

It gives me:

...'offers': {'@type': 'Offer', 'priceCurrency': 'USD', 'availability': 'InStock', 'availableDeliveryMethod': 'ParcelService', 'potentialAction': {'@type': 'BuyAction'}, 'url': 'https://www.target.com/p/madden-nfl-22-xbox-one-series-x/-/A-83744898'}}}

My project spider code is:

import scrapy
import json

class targetSpider(scrapy.Spider):
    name = "target"
    start_urls = ['https://www.target.com/p/madden-nfl-22-xbox-one-series-x/-/A-83744898#lnk=sametab']

    def parse(self, response):
        jsData = json.loads(response.xpath('//script[@type="application/ld+json"]//text()').extract_first())
        test = jsData['@graph'][0]

        yield {
            'test': test
        }

It gives me:

...'offers': {'@type': 'Offer', 'price': '59.99', 'priceCurrency': 'USD', 'availability': 'PreOrder', 'availableDeliveryMethod': 'ParcelService', 'potentialAction': {'@type': 'BuyAction'}, 'url': 'https://www.target.com/p/madden-nfl-22-xbox-one-series-x/-/A-83744898'}}}

[Comments]:

Tags: javascript python scrapy

[Solution 1]:

This is about JavaScript. Content such as 'price': '59.99' is loaded by JavaScript, and Scrapy's Downloader does not run JavaScript by default.
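One quick way to see which run is missing the field is to look up the offer price in the JSON-LD text directly. The following is only a minimal sketch: `offer_price` is a hypothetical helper, and the two payloads are hand-trimmed stand-ins modeled on the outputs shown in the question, not real responses.

```python
import json


def offer_price(json_ld_text):
    """Return the offer price from a JSON-LD string, or None if it is absent."""
    data = json.loads(json_ld_text)
    product = data["@graph"][0]
    # .get() avoids a KeyError when 'offers' or 'price' is missing.
    return product.get("offers", {}).get("price")


# Hypothetical, trimmed payloads modeled on the two outputs in the question.
script_run = '{"@graph": [{"offers": {"@type": "Offer", "priceCurrency": "USD"}}]}'
project_run = (
    '{"@graph": [{"offers": {"@type": "Offer", "price": "59.99", '
    '"priceCurrency": "USD"}}]}'
)

print(offer_price(script_run))   # None
print(offer_price(project_run))  # 59.99
```

Logging this value inside `parse` in both setups should show which run receives the page without the price.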

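Possible causes of the problem

The settings.py of one of your spiders enables an external downloader middleware (such as Selenium, Splash, or Playwright), and the other one does not.
The script that starts the spider with CrawlerProcess() is not run from the project root, so settings.py fails to load.

Update: sorry, I forgot that you need to load the settings manually when using CrawlerProcess(). See "Run Scrapy from a script" in the Scrapy documentation.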

[Comments]:

Importing get_project_settings (from scrapy.utils.project import get_project_settings) and passing get_project_settings() to my CrawlerProcess() solved the problem. Thanks!