【Question Title】: scrapy start_urls from txt file
【Posted】: 2022-01-13 09:46:33
【Question】:

I have about 100K URLs to scrape, so I want to read them from a txt file. Here is the code:

import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess

class ConadstoresSpider(scrapy.Spider):
    name = 'conadstores'
    headers = {'user_agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
    allowed_domains = ['conad.it']
    #start_urls = ['http://www.conad.it/ricerca-negozi/negozio.002781.html','https://www.conad.it/ricerca-negozi/negozio.006804.html']
    #start_urls = [l.strip() for l in open("/Users/macbook/PycharmProjects/conad/conad/conadlinks.txt").readlines()]
    #f = open("/Users/macbook/PycharmProjects/conad/conad/conadlinks.txt")
    #start_urls = [url.strip() for url in f.readlines()]
    #f.close()

    with open('/Users/macbook/PycharmProjects/conad/conad/conadlinks.txt') as file:
        start_urls = [line.strip() for line in file]


    def start_request(self):
        request = Request(url = self.start_urls, callback=self.parse)
        yield request

    def parse(self, response):
        yield {
            'address' : response.css('.address-oswald::text').extract(),
            'phone' : response.css('span.phone::text').extract(),

        }

But I keep getting this error:

2021-12-08 13:27:48 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/Users/macbook/PycharmProjects/conad/venv/lib/python3.9/site-packages/scrapy/core/engine.py", line 127, in _next_request
    request = next(slot.start_requests)
  File "/Users/macbook/PycharmProjects/conad/conad/conad/middlewares.py", line 52, in process_start_requests
    for r in start_requests:
  File "/Users/macbook/PycharmProjects/conad/venv/lib/python3.9/site-packages/scrapy/spiders/__init__.py", line 83, in start_requests
    yield Request(url, dont_filter=True)
  File "/Users/macbook/PycharmProjects/conad/venv/lib/python3.9/site-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/Users/macbook/PycharmProjects/conad/venv/lib/python3.9/site-packages/scrapy/http/request/__init__.py", line 62, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: %7B%5Crtf1%5Cansi%5Cansicpg1252%5Ccocoartf2580
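The "url" in the final ValueError is percent-encoded; decoding it with the standard library shows what Scrapy actually read from the first line of the file (a diagnostic sketch, not part of the original question):

```python
from urllib.parse import unquote

# The percent-encoded value Scrapy rejected as a URL.
bad_url = "%7B%5Crtf1%5Cansi%5Cansicpg1252%5Ccocoartf2580"
print(unquote(bad_url))  # -> {\rtf1\ansi\ansicpg1252\cocoartf2580
```

The decoded string is an RTF file header, not a URL, which suggests the ".txt" file was actually saved in RTF format (e.g. by a rich-text editor) rather than as plain text.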

Any ideas? Thanks!

【Comments】:

    Tags: python scrapy


    【Solution 1】:

    We can override the start_urls logic in the spider's start_requests() method.

    Here is a simple way to extract the data:

    import scrapy
    
    
    class ConadstoresSpider(scrapy.Spider):
        name = 'conadstores'
    
        def start_requests(self):
            # Read the file; a with-block ensures the file handle is closed.
            # (You can use different logic to extract URLs from text files.)
            with open("/Users/macbook/PycharmProjects/conad/conad/conadlinks.txt") as a_file:
                contents_split = a_file.read().splitlines()
            # Send a request to each URL extracted from the text file.
            for url in contents_split:
                yield scrapy.Request(url)
    
        def parse(self, response, **kwargs):
            yield {
                'address': response.css('.address-oswald::text').extract(),
                'phone': response.css('span.phone::text').extract(),
            }
    
    

    You can use different file-reading logic, but make sure it returns a list of URLs.
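    To guard against malformed lines (blank lines, stray whitespace, or a file that is not actually plain text), you could also validate each line before yielding a request. A minimal sketch; the helper name `iter_valid_urls` is illustrative and not from the original answer:

```python
from urllib.parse import urlparse

def iter_valid_urls(path):
    # Yield only the lines that parse as absolute http(s) URLs,
    # silently skipping blanks and non-URL garbage.
    with open(path) as f:
        for line in f:
            url = line.strip()
            parsed = urlparse(url)
            if parsed.scheme in ("http", "https") and parsed.netloc:
                yield url
```

    In the spider, `for url in self.iter_valid_urls(path): yield scrapy.Request(url)` would then never hit the "Missing scheme in request url" ValueError, though it would also hide a corrupted input file rather than fail loudly.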

    【Discussion】:

    • Hi, thanks for the reply, but I still get an error ending with File "/Users/macbook/PycharmProjects/conad/venv/lib/python3.9/site-packages/scrapy/http/request/__init__.py", line 62, in _set_url raise ValueError('Missing scheme in request url: %s' % self._url)
    • OK, it looks like the txt file was somehow corrupted! I created a new csv file and it works now, thank you very much!
    • Great, happy coding.