【Question Title】: scrapy.core.engine DEBUG: Crawled (200) Scrapy Framework
【Posted】: 2025-12-21 19:00:16
【Description】:

I recently started using the Scrapy framework. I am trying to extract content from this page: libgen.io, and I ran into an error when running the command:

scrapy crawl libgen -t csv

and I don't understand what is causing the error.

I would really appreciate any help :c

The files in my project folder are:

libGenFolder
|
|_ __pycache__
|_ spiders
|    |
|    |_ __pycache__
|    |_ spider.py
|_ items.py
|_ middlewares.py
|_ pipelines.py
|_ settings.py

This is my spider.py:

import scrapy
from scrapy import Selector
from scrapy.spiders import CrawlSpider
from scrapy.linkextractors import LinkExtractor
from scrapy.exceptions import CloseSpider
from getMeMore.items import GetmemoreItem

class libgenSpider(CrawlSpider):
    name = 'libgen'
    item_count = 0
    allowed_domains = ['libgen.io']
    start_urls = ['http://libgen.io/search.php?req=ciencia&lg_topic=libgen&open=0&view=detailed&res=25&phrase=1&column=def']
    
    # for url in start_urls:
    #     yield scrapy.Request(url=url, callback=self.parse_item)

    def parse_item(self, response):
        ml_item = GetmemoreItem()

        # link info
        ml_item['titulo'] = response.xpath('//td[@colspan="2"]/b/a/text()').extract()
        ml_item['autor'] = response.xpath('//td[@colspan="3"]/b/a/text()').extract()
        ml_item['img'] = response.xpath('//td[@rowspan="20"]/a/img[@width="240"]/@src').extract()
        ml_item['language'] = response.xpath('//tr[7]/td[2]/text()').extract()
        ml_item['link'] = response.xpath('//tr[11]/td[2]/a/@href').extract()
        self.item_count += 1
        if self.item_count > 5:
            raise CloseSpider('item_exceeded')
        yield ml_item
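
For completeness: spider.py imports GetmemoreItem from getMeMore.items, but items.py itself is not shown. A minimal sketch of what it presumably contains, assuming the field names match the ones assigned in parse_item:

import scrapy

class GetmemoreItem(scrapy.Item):
    # Field names inferred from parse_item above (assumption)
    titulo = scrapy.Field()    # title
    autor = scrapy.Field()     # author
    img = scrapy.Field()       # cover image URL
    language = scrapy.Field()  # language
    link = scrapy.Field()      # download link

Note, separately, that CrawlSpider only dispatches responses to a callback like parse_item through a rules attribute; with no rules defined, parse_item is never called. Defining rules (or subclassing scrapy.Spider and renaming parse_item to parse) would fix that.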

This is my pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import scrapy
from scrapy import signals
from scrapy.exporters import CsvItemExporter
# from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy import Request
import csv

class GetmemorePipeline(object):
    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('%s_items.csv' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = ['titulo', 'autor', 'img', 'language', 'link']
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

# class GetmemorePipeline(ImagesPipeline):

#     def get_media_requests(self, item, info):
#         return [Request(x, meta={'image_name': item["image_name"]})
#                 for x in item.get('image_urls', [])]

#     def file_path(self, request, response=None, info=None):
#         return '%s.jpg' % request.meta['image_name']
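
A side note on the exporter: Scrapy's exporters also accept fields_to_export as a constructor keyword, so spider_opened could be written slightly more compactly; a sketch of that variant:

def spider_opened(self, spider):
    file = open('%s_items.csv' % spider.name, 'w+b')
    self.files[spider] = file
    # fields_to_export passed at construction instead of set afterwards
    self.exporter = CsvItemExporter(
        file, fields_to_export=['titulo', 'autor', 'img', 'language', 'link'])
    self.exporter.start_exporting()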

This is my settings.py:

BOT_NAME = 'getMeMore'

SPIDER_MODULES = ['getMeMore.spiders']
NEWSPIDER_MODULE = 'getMeMore.spiders'

# CSV export
ITEM_PIPELINES = {'getMeMore.pipelines.GetmemorePipeline': 300}

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
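
Incidentally, the -t csv flag only takes effect together with an -o output option on the command line; with the custom pipeline above, the CSV file is written regardless. If you preferred Scrapy's built-in feed export over the pipeline, a sketch of the equivalent settings (assuming a pre-2.1 Scrapy that uses FEED_FORMAT/FEED_URI; newer versions use the FEEDS dict instead):

# Built-in CSV feed export -- an alternative to the custom pipeline (sketch)
FEED_FORMAT = 'csv'
FEED_URI = 'libgen_items.csv'
FEED_EXPORT_FIELDS = ['titulo', 'autor', 'img', 'language', 'link']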

【Question Discussion】:

    Tags: python web-scraping scrapy scrapy-spider


    【Solution 1】:

    The error clearly indicates that the URL you are trying to crawl is disallowed by the site's robots.txt.

    To crawl it anyway, change the following variable in settings.py:

    ROBOTSTXT_OBEY = False
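
    If you would rather not disable robots.txt handling project-wide, Scrapy also lets a single spider override settings through the custom_settings class attribute; a minimal sketch:

    class libgenSpider(CrawlSpider):
        name = 'libgen'
        # Overrides ROBOTSTXT_OBEY for this spider only
        custom_settings = {'ROBOTSTXT_OBEY': False}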
    

    【Discussion】: