【Question Title】: Run scrapy program from within python script
【Posted】: 2018-09-25 20:34:10
【Question Description】:

I am trying to run scrapy from a python script. I have almost (I think) managed to do it, but something is not working. In my code I have the line run_spider(quotes5), where quotes5 is the name of the spider I previously ran from cmd like this: scrapy crawl quotes5. Any help, please? The error is that quotes5 is not defined.

Here is my code:

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
import json
import csv
import re
from crochet import setup
from importlib import import_module
from scrapy.utils.project import get_project_settings
setup()


def run_spider(spiderName):
    module_name="WS_Vardata.spiders.{}".format(spiderName)
    scrapy_var = import_module(module_name)   #do some dynamic import of selected spider   
    spiderObj= scrapy_var.QuotesSpider()           #get mySpider-object from spider module
    crawler = CrawlerRunner(get_project_settings())   #from Scrapy docs
    crawler.crawl(spiderObj)  

run_spider(quotes5)

The Scrapy code (quotes_spider.py):

import scrapy
import json
import csv
import re

class QuotesSpider(scrapy.Spider):
    name = "quotes5"

    def start_requests(self):
        with open('input.csv','r') as csvf:
            urlreader = csv.reader(csvf, delimiter=',',quotechar='"')
            for url in urlreader:
                if url[0]=="y":
                    yield scrapy.Request(url[1])
        #with open('so_52069753_out.csv', 'w') as csvfile:
            #fieldnames = ['Category', 'Type', 'Model', 'SK']
            #writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            #writer.writeheader()

    def parse(self, response):

        regex = re.compile(r'"product"\s*:\s*(.+?\})', re.DOTALL)
        regex1 = re.compile(r'"pathIndicator"\s*:\s*(.+?\})', re.DOTALL)
        source_json1 = response.xpath("//script[contains(., 'var digitalData')]/text()").re_first(regex)
        source_json2 = response.xpath("//script[contains(., 'var digitalData')]/text()").re_first(regex1)
        model_code = response.xpath('//script').re_first('modelCode.*?"(.*)"')

        if source_json1 and source_json2:
            source_json1 = re.sub(r'//[^\n]+', "", source_json1)
            source_json2 = re.sub(r'//[^\n]+', "", source_json2)
            product = json.loads(source_json1)
            path = json.loads(source_json2)
            product_category = product["pvi_type_name"]
            product_type = product["pvi_subtype_name"]
            product_model = path["depth_5"]
            product_name = product["model_name"]

        if source_json1 and source_json2:
            source1 = source_json1[0]
            source2 = source_json2[0]
            with open('output.csv','a',newline='') as csvfile:
                fieldnames = ['Category','Type','Model','Name','SK']
                writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
                if product_category:
                    writer.writerow({'Category': product_category, 'Type': product_type, 'Model': product_model, 'Name': product_name, 'SK': model_code})
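The regex-plus-json.loads extraction in parse() can be exercised outside Scrapy; a minimal sketch with made-up script text (the digitalData content below is hypothetical, not taken from the real page):

```python
import json
import re

# Hypothetical script text mimicking the page's "var digitalData" block
script_text = '''
var digitalData = {
    "product" : {"pvi_type_name": "TV", "pvi_subtype_name": "QLED", "model_name": "Q80"},
    "pathIndicator" : {"depth_5": "55Q80"}
};
'''

# Same patterns as in the spider: lazily capture everything up to the first "}"
regex = re.compile(r'"product"\s*:\s*(.+?\})', re.DOTALL)
regex1 = re.compile(r'"pathIndicator"\s*:\s*(.+?\})', re.DOTALL)

product = json.loads(regex.search(script_text).group(1))
path = json.loads(regex1.search(script_text).group(1))
print(product["pvi_type_name"], path["depth_5"])  # TV 55Q80
```

Note the lazy `.+?\}` only works because these objects contain no nested braces; a nested object would be cut off at the first `}`.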


【Question Comments】:

    Tags: python scrapy


    【Solution 1】:

    As the error says, quotes5 is not defined; you need to define quotes5 before passing it to the method. Or try it like this:

    run_spider("quotes5")
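The difference between the two calls can be seen with a plain function; a minimal sketch (run_spider here is a stand-in that just echoes its argument, not the real Scrapy helper):

```python
def run_spider(spider_name):
    # Stand-in for the real function; just echoes the name it received
    return "scrapy crawl " + spider_name

# A bare name is looked up as a Python variable, so it raises NameError
# unless something called quotes5 was defined first:
try:
    run_spider(quotes5)
except NameError as exc:
    print(exc)  # name 'quotes5' is not defined

# Quoting it passes the spider's name as a string, which is what
# the function expects:
print(run_spider("quotes5"))  # scrapy crawl quotes5
```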
    

    Edited:

    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings

    import WS_Vardata.spiders.quotes_spider as quote_spider_module

    def run_spider(spiderName):
        #get the class from within the module
        spiderClass = getattr(quote_spider_module, spiderName)
        #create the object and you're good to go
        spiderObj = spiderClass()
        crawler = CrawlerRunner(get_project_settings())   #from Scrapy docs
        crawler.crawl(spiderObj)
    
    run_spider("QuotesSpider")
    

    This script should be run from the same directory that contains WS_Vardata

    So in your case:

    - TEST
    | the_code.py
    | WS_Vardata
       | spiders
         | quotes_spider <= containing QuotesSpider class 
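The getattr lookup used above is plain Python and works with any module; a stdlib sketch using the json module instead of the spider module:

```python
import importlib

# Import a module by its dotted name, then fetch a class from it by name
module = importlib.import_module("json")
cls = getattr(module, "JSONDecoder")

# Instantiate the dynamically looked-up class and use it
decoder = cls()
print(decoder.decode('{"a": 1}'))  # {'a': 1}
```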
    

    【Discussion】:

    • run_spider("quotes5") works! Thank you, sir! Now I have a new error: "ModuleNotFoundError: No module named 'WS_Vardata.spiders'". The location of my scrapy program is "C:\Users\raresb\Desktop\TEST\WS_Vardata\spiders\quotes_spider". "quotes_spider" is the scrapy program.
    • Is there an "__init__.py" in both the WS_Vardata and spiders folders?
    • Yes, sir. "__init__.py" is in both of them.
    • And the code posted above? It is in a completely different location. On the desktop.
    • Put your code in the same directory as WS_Vardata and run it.
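The ModuleNotFoundError discussed in these comments comes down to Python's import search path: a package is only importable when its parent directory is on sys.path, which is why the script must sit next to WS_Vardata. A minimal sketch with a throwaway package (all names and paths here are hypothetical):

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package layout: <root>/pkg/spiders/__init__.py
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "pkg", "spiders"))
for sub in ("pkg", os.path.join("pkg", "spiders")):
    open(os.path.join(root, sub, "__init__.py"), "w").close()

# pkg.spiders becomes importable only once its parent directory is on sys.path
sys.path.insert(0, root)
module = importlib.import_module("pkg.spiders")
print(module.__name__)  # pkg.spiders
```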