【Question title】: Why is scrapy not returning any links?
【Posted】: 2021-11-13 17:54:11
【Question description】:

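Recently I tried to build a small tool to simplify my own apartment search and get the relevant information as quickly as possible (the website is not very user-friendly), but I ran into a problem. Maybe I'm just blind at the moment... or just stupid, since this is not my area of expertise.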

So, anyway. I have a link with filtered results:

import scrapy


class BostadSpider(scrapy.Spider):
    name = "bostadformedlingen"
    start_urls = ['https://bostad.stockholm.se/Lista/?s=58.66266&n=59.99899&w=17.07550&e=19.23431&sort=annonserad-fran-desc']

    def parse(self, response):
        for ad in response.css(
            "div.apartment-search-hits > ul.apartment-search-ad-list > li.ad-list__item > a::attr('href')"):
            print(ad.get())

This is the structure from the website:

<main class="display-flex flex-column search-wrapper u-m-a-0 u-p-a-0" id="main-content">
    <div class="row no-gutters search-wrapper__inner">
        <div id="apartment-search-hits" class="apartment-search-hits" aria-hidden="false">
            <ul id="apartment-search-ad-list" class="ad-list" aria-hidden="false">
                <li class="ad-list__item"> <a href="/Lista/Details?aid=190412" class="ad-list__link">

Should I go "one level up" and include "main"?

【Comments】:

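  • I tried including the parents one step at a time, but... no luck!
  • I changed the variable "url" to "start_urls" (beginner's mistake), but the response is still [protego] DEBUG: Rule at line 24 without any user agent to enforce it on. (also at lines 26, 27, 31, 38...)
  • DEBUG: Crawled (404)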

Tags: python html css scrapy web-crawler


【Solution 1】:

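The data is actually generated from the JSON response of an API call. If you disable JavaScript, the page goes blank, which means the content is loaded dynamically and the links are not in the HTML that scrapy downloads. That is why the data cannot be fetched this way. Here is a working solution: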

Code:

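import json

import scrapy


class BostSpider(scrapy.Spider):
    name = 'bost'

    def start_requests(self):
        # Request the JSON endpoint instead of the JavaScript-rendered page.
        yield scrapy.Request(
            url='https://bostad.stockholm.se/Lista/AllaAnnonser',
            method='GET',
            callback=self.parse)

    def parse(self, response):
        # The body is a JSON array; each entry carries a relative 'Url'.
        resp = json.loads(response.body)

        for h in resp:
            url = h['Url']
            # Resolve the relative path against the response URL.
            abs_url = response.urljoin(url)
            yield {
                'URL': abs_url
            }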

Output:

{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190400'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190401'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190360'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190325'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190413'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190412'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190383'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190229'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190230'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190414'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190407'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190432'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190377'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190424'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190291'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190382'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190384'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190356'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190349'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190287'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190399'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190428'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190404'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190368'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190371'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190373'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190390'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190385'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190416'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190396'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190394'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190402'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190359'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190358'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190357'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190265'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190264'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190422'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190420'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190410'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190398'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190429'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190403'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190423'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190417'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190362'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190361'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190387'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190376'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190386'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190391'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190369'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190363'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190409'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190427'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190364'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190378'}
2021-09-20 05:43:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bostad.stockholm.se/Lista/AllaAnnonser>  
{'URL': 'https://bostad.stockholm.se/Lista/Details?aid=190375'}
        

...and so on
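
For anyone who wants to check the parsing step without hitting the network, here is a minimal offline sketch of what the parse callback above does. It is not part of the original answer: the sample payload and the helper name extract_ad_urls are made up for illustration, and the only thing taken from the answer is that each entry has a 'Url' key holding a relative path.

```python
import json
from urllib.parse import urljoin

# URL of the JSON endpoint used in the answer above.
BASE_URL = "https://bostad.stockholm.se/Lista/AllaAnnonser"

# Hypothetical sample body: only the "Url" key is confirmed by the answer;
# a real response likely carries more keys per ad.
SAMPLE_BODY = (
    b'[{"Url": "/Lista/Details?aid=190400"},'
    b' {"Url": "/Lista/Details?aid=190401"}]'
)


def extract_ad_urls(body, base_url=BASE_URL):
    """Return absolute ad URLs from a JSON array of objects with a "Url" key."""
    ads = json.loads(body)
    # urljoin resolves the relative path against the endpoint URL,
    # which is what response.urljoin does for a non-HTML response.
    return [urljoin(base_url, ad["Url"]) for ad in ads]


if __name__ == "__main__":
    for url in extract_ad_urls(SAMPLE_BODY):
        print(url)
```

Saving a real response body to a file and feeding it to the same function should make it easy to see how many ads the endpoint actually returns.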

【Comments】:

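  • Ah... the solution is great, but the damn JSON response from "bostad.stockholm.se/Lista/AllaAnnonser" does not actually return all the available ads. I just realized it is missing 40% of them. And for the ads it does return, 50% of the information is missing, such as rent and number of rooms (that information is certainly available). I didn't expect that :(
  • Maybe I can find the links used to generate the dynamic page and then use those as start_urls?
  • How can AllaAnnonser not hold all the links?
  • Is it possible to generate the links instead?
  • To scrape all the links, you have to handle pagination. I assume that, as a scrapy user, you are familiar with pagination.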
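
Regarding the missing rent and room information mentioned above, one way to find out which ads are incomplete is to read the optional fields with dict.get and sort the ads into two groups. This is only a rough sketch under assumptions: the key names "Hyra" and "AntalRum" and the sample payload are hypothetical and were not confirmed anywhere in this thread, so they need to be replaced with whatever keys the real response uses.

```python
import json

# Hypothetical payload: "Url" is the only key confirmed in the answer.
# "Hyra" (rent) and "AntalRum" (rooms) are invented names for illustration.
SAMPLE_BODY = """[
  {"Url": "/Lista/Details?aid=1", "Hyra": 8500, "AntalRum": 2},
  {"Url": "/Lista/Details?aid=2"}
]"""

OPTIONAL_KEYS = ("Hyra", "AntalRum")


def split_by_completeness(ads, optional_keys=OPTIONAL_KEYS):
    """Split ad URLs into (complete, incomplete) by presence of optional keys."""
    complete, incomplete = [], []
    for ad in ads:
        # dict.get returns None for an absent key instead of raising KeyError.
        missing = [key for key in optional_keys if ad.get(key) is None]
        if missing:
            incomplete.append(ad["Url"])
        else:
            complete.append(ad["Url"])
    return complete, incomplete


if __name__ == "__main__":
    complete, incomplete = split_by_completeness(json.loads(SAMPLE_BODY))
    print("complete:", complete)
    print("incomplete:", incomplete)
```

The incomplete URLs could then be requested one by one to look for the missing details, although whether the detail pages expose them in plain HTML would have to be checked first.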