【Question title】: Passing real URL through Scrapy-Splash to dictionary
【Posted】: 2023-03-23 12:12:01
【Question description】:

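When I try to save the URL in a dictionary via 'url': response.request.url, Scrapy saves the URL from Scrapy-Splash instead, so every item gets the same value (http://localhost:8050/render.html).

I tried adding extra parameters to pass the real URL through, but to no avail.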

import scrapy
from scrapy.http import FormRequest
from scrapy_splash import SplashRequest

class QuotesJSSpider(scrapy.Spider):
    name = 'quotesjs'
    start_urls = ('https://www.facebook.com/login',)
    custom_settings = {
        'SPLASH_URL': 'http://localhost:8050',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
    }

    def parse(self, response):
        token = response.xpath('//*[@id="u_0_a"]').extract_first()
        return FormRequest.from_response(response,
                                         formdata={'lgndim' : token,
                                                   'pass': 'xxx',
                                                   'email': 'xxxx'},
                                         callback=self.load_sites)

    def load_sites(self, response):
        urls = [
            'https://www.facebook.com/page1/about',
            'https://www.facebook.com/page2/about',
        ]
        for url in urls:
            yield SplashRequest(url=url, callback=self.scrape_pages)

    def scrape_pages(self, response):
        shops = {
            'company_name' : response.css('title::text').extract(),
            'url' : response.request.url,

        }

        yield shops

The result should look like this: 'url': 'https://www.facebook.com/page1/about'

Instead I get: 'url': 'http://localhost:8050/render.html'

【Discussion】:

    Tags: python scrapy scrapy-splash


    【Solution 1】:

    The URL of the original request is available at: response.request._original_url

    To avoid accessing an internal attribute, you can also try:

    • Pass the URL in meta:
        def load_sites(self, response):
            urls = [
                'https://www.facebook.com/page1/about',
                'https://www.facebook.com/page2/about',
            ]
            for url in urls:
                # Carry the page URL along with the request so the callback can read it back.
                yield SplashRequest(url=url, callback=self.scrape_pages, meta={'original_url': url})
    
        def scrape_pages(self, response):
            shops = {
                'company_name': response.css('title::text').extract(),
                'url': response.meta['original_url'],
            }
            yield shops
    
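    • Use the URL from the response:
        def scrape_pages(self, response):
            shops = {
                'company_name': response.css('title::text').extract(),
                # Unlike response.request.url, response.url is the page URL, not the Splash endpoint.
                'url': response.url,
            }
            yield shops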
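
The two options above can be combined into one small helper. This is only a sketch: `page_url` and the `meta_key` parameter are hypothetical names, not part of Scrapy or scrapy-splash, and the `SimpleNamespace` objects are stand-ins for real responses so that the snippet runs without Scrapy or a Splash instance.

```python
from types import SimpleNamespace


def page_url(response, meta_key="original_url"):
    """Return the URL of the rendered page rather than the Splash endpoint.

    Preference order:
    1. a URL stored explicitly in meta under `meta_key` (the key used in the
       first option above; the helper name and parameter are hypothetical),
    2. response.url, as in the second option above.
    """
    # getattr with a default also covers objects whose .meta raises AttributeError.
    meta = getattr(response, "meta", None) or {}
    return meta.get(meta_key) or response.url


# Stand-in objects (assumption: they only mimic the .url and .meta attributes).
with_meta = SimpleNamespace(
    url="http://localhost:8050/render.html",
    meta={"original_url": "https://www.facebook.com/page1/about"},
)
without_meta = SimpleNamespace(url="https://www.facebook.com/page2/about", meta={})

print(page_url(with_meta))
print(page_url(without_meta))
```

In a spider, the callback would then build the item with 'url': page_url(response), which keeps working whether or not the request carried the meta key.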
    

    【Comments】:

    • Thanks. It worked!
    • You're welcome! Don't forget to accept the answer ;-)