【Question Title】: Increase items count web-scraping
【Posted】: 2021-11-20 21:32:27
【Question Description】:

I'm a beginner with the Scrapy framework, and I have 2 questions/problems:

  1. I wrote a "scrapy.Spider" for a website, but it stops after retrieving 960 items. How can I increase this value? I need to retrieve about ~1600 items... :/
  2. Is it possible to run Scrapy indefinitely by adding a wait time between runs of each "scrapy.Spider"?
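(Editor's note on question 2: Scrapy's Twisted reactor cannot be restarted within a single process, so a common pattern is to re-launch `scrapy crawl` from a small wrapper script with a pause between runs. This is a sketch, not from the original post; the output file name and the delay value are placeholder assumptions.)

```python
import subprocess
import time

def run_spider_repeatedly(delay_seconds, max_runs=None):
    """Re-launch the spider in a fresh process each time.

    Scrapy's Twisted reactor can't be restarted in-process, so looping
    over `scrapy crawl` as a subprocess is a simple way to crawl
    repeatedly. max_runs=None means run forever.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        # Spider name is from the post; "spells.json" is a placeholder.
        subprocess.run(["scrapy", "crawl", "Pathfinder2", "-O", "spells.json"])
        time.sleep(delay_seconds)  # wait before the next crawl
        runs += 1
```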

UPDATE

import scrapy

class Spell(scrapy.Item):
    name = scrapy.Field()
    level = scrapy.Field()
    components = scrapy.Field()
    resistance = scrapy.Field()

class Pathfinder2Spider(scrapy.Spider):
    name = "Pathfinder2"
    allowed_domains = ["d20pfsrd.com"]
    start_urls = ["https://www.d20pfsrd.com/magic/spell-lists-and-domains/spell-lists-sorcerer-and-wizard/"]

    def parse(self, response):
        # Recovering all wizard's spell links
        spells_links = response.xpath('//div/table/tbody/tr/td/a[has-class("spell")]')
        print("len(spells_links) : ", len(spells_links))
        for spell_link in spells_links:
            url = spell_link.xpath('@href').get().strip()
            # Recovering all spell information
            yield response.follow(url, self.parse_spell)
        
    def parse_spell(self, response):
        # Getting all content from spell
        article = response.xpath('//article[has-class("magic")]')
        contents = article.xpath('//div[has-class("article-content")]')
        # Extract useful information
        all_names = article.xpath("h1/text()").getall()
        all_contents = contents.get()
        all_levels = RE_LEVEL.findall(all_contents)
        all_components = RE_COMPONENTS.findall(all_contents)
        all_resistances = RE_RESISTANCE.findall(all_contents)

        # The loop variables must match what is yielded below,
        # otherwise spell_name etc. are undefined (NameError).
        for spell_name, spell_level, spell_components, spell_resistance in zip(
            all_names, all_levels, all_components, all_resistances
        ):

            # Treatment here ...

            yield Spell(
                name=spell_name,
                level=spell_level,
                components=spell_components,
                resistance=spell_resistance,
            )

There are 1600 links in total:

len(spells_links) : 1565

But only 953 were scraped:

 'httperror/response_ignored_count': 2,
 'httperror/response_ignored_status_count/404': 2,
 'item_scraped_count': 953,

I run the spider with this command: scrapy crawl Pathfinder2 -O XXX.json

CLI information

Thanks in advance!

【Question Discussion】:

  • Can you also mention your start_urls?
  • Yes, I've updated my post.

Tags: python web-scraping scrapy web-crawler


【Solution 1】:

First, check the number of urls:

In [3]: len(response.xpath("//span[@id='ctl00_MainContent_DataListTypes_ctl00_LabelName']/b/a"))
Out[3]: 1073

So you have 1073 urls, and each one is a "spell" page, so you have 1073 spells in total, not 2000.

After running your code, I got this:

'downloader/request_count': 1074,
 'downloader/request_method_count/GET': 1074,
 'downloader/response_bytes': 11368517,
 'downloader/response_count': 1074,
 'downloader/response_status_count/200': 1074,
 'elapsed_time_seconds': 31.657692,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 9, 29, 7, 17, 2, 877042),
 'httpcompression/response_bytes': 31520000,
 'httpcompression/response_count': 1074,
 'item_scraped_count': 1073,

It scraped 1073 items, so the spider is fine.

But I removed this part:

all_levels = RE_LEVEL.findall(all_contents)
all_components = RE_COMPONENTS.findall(all_contents)
all_resistances = RE_RESISTANCE.findall(all_contents)

If you get an error, check this part again.

regex in python
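(The question never shows how RE_LEVEL, RE_COMPONENTS, and RE_RESISTANCE are defined. A minimal sketch of what such patterns might look like follows; all three patterns and the sample markup are hypothetical guesses for illustration, not the asker's actual code or the site's exact HTML.)

```python
import re

# Hypothetical patterns -- illustrative guesses, not the asker's code.
RE_LEVEL = re.compile(r"<b>Level</b>\s*([^;<]+)")
RE_COMPONENTS = re.compile(r"<b>Components</b>\s*([^;<]+)")
RE_RESISTANCE = re.compile(r"<b>Spell Resistance</b>\s*([^;<]+)")

# Made-up fragment in the style of a spell page.
sample = ("<b>Level</b> sorcerer/wizard 3; "
          "<b>Components</b> V, S; "
          "<b>Spell Resistance</b> yes")

print(RE_LEVEL.findall(sample))       # ['sorcerer/wizard 3']
print(RE_COMPONENTS.findall(sample))  # ['V, S']
print(RE_RESISTANCE.findall(sample))  # ['yes']
```

If any of the three `findall` calls returns an empty list for a given page, the corresponding field will be missing and `zip()` will silently drop that spell, which would also shrink the item count.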

EDIT:

Some links appear more than once:

So the number of links is greater than the number of items.
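(This matches Scrapy's default behavior: the built-in duplicate-request filter schedules each URL only once, so repeated links on the listing page produce fewer requests, and therefore fewer items, than links. A minimal illustration with made-up URLs:)

```python
# Made-up links for illustration; Scrapy's dupefilter keeps only unique URLs.
links = [
    "https://www.d20pfsrd.com/magic/all-spells/f/fireball/",
    "https://www.d20pfsrd.com/magic/all-spells/h/haste/",
    "https://www.d20pfsrd.com/magic/all-spells/f/fireball/",  # duplicate
]
unique_links = set(links)
print(len(links))         # 3 links found on the page
print(len(unique_links))  # 2 requests actually scheduled
```

Setting `DUPEFILTER_DEBUG = True` in settings.py makes Scrapy log every request it drops as a duplicate, which is an easy way to confirm this.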

【Discussion】:
