【Title】: extracting a json response in scrapy
【Posted】: 2018-01-27 00:52:47
【Question】:

How can I use Scrapy to scrape an API that returns JSON? The JSON looks like this:

  "records": [
    {
      "uri": "https://www.example.com",
      "access": {
        "update": false
      },
      "id": 17059,
      "vid": 37614,
      "name": "MyLibery",
      "claim": null,
      "claimedBy": null,
      "authorUid": "3",
      "lifecycle": "L",
      "companyType": "S",
      "ugcState": 10,
      "companyLogo": {
        "fileName": "mylibery-logo.png",
        "filePath": "sites/default/files/imagecache/company_logo_70/mylibery-logo.png"
      }

I tried this code:

import scrapy
import json


class ApiItem(scrapy.Item):
    url = scrapy.Field()
    Name = scrapy.Field()


class ExampleSpider(scrapy.Spider):
    name = 'API'
    allowed_domains = ["site.com"]
    start_urls = [l.strip() for l in open('pages.txt').readlines()]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
        jsonresponse = json.loads(response.body_as_unicode())
        item = ApiItem()
        item["url"] = jsonresponse["uri"]
        item["Name"] = jsonresponse["name"]
        return item

"pages.txt" is a list of the API pages I want to scrape. I only want to extract "uri" and "name" and save them to a CSV.

But it throws an error:

2017-08-18 13:23:02 [scrapy] ERROR: Spider error processing <GET https://www.investiere.ch/proxy/api2/v1/companies?extra%5Bimagecache%5D=company_logo_70&fields=companyType,lifecycle&page=8&parameters%5Binclude_skipped%5D=yes> (referer: None)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/twisted/internet/defer.py", line 651, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/habenn/Projects/inapi/inapi/spiders/example.py", line 22, in parse
    item["url"] = jsonresponse["uri"]
KeyError: 'uri'
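The traceback follows from the JSON shown above: "uri" is not a top-level key of the response, it lives inside the "records" list. The failure can be reproduced in plain Python with a trimmed, hypothetical sample of the body:

```python
import json

# Trimmed, hypothetical version of the response body shown above.
body = '{"records": [{"uri": "https://www.example.com", "name": "MyLibery"}]}'
data = json.loads(body)

try:
    data["uri"]  # top-level lookup, as the spider does
except KeyError:
    print('no top-level "uri" key')

# The fields actually live one level down, inside the "records" list:
print(data["records"][0]["uri"])
```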

【Comments】:

    Tags: python json api web-scraping scrapy


    【Answer 1】:

    Judging from the example given, it should be:

    item["url"] = jsonresponse["records"][0]["uri"]
    item["Name"] = jsonresponse["records"][0]["name"]
    

    EDIT:

    To get all the uris and names from the response, use the following:

    def parse(self, response):
        ...
        for record in jsonresponse["records"]:
            item = ApiItem()
            item["url"] = record["uri"]
            item["Name"] = record["name"]
            yield item
    

    Note in particular the replacement of return with yield.
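The swap matters because return would end parse after the first record, whereas yield turns parse into a generator that Scrapy drains, so every record becomes an item. A toy illustration outside Scrapy (hypothetical function names):

```python
def parse_with_return(records):
    for record in records:
        return record["name"]  # exits on the first iteration

def parse_with_yield(records):
    for record in records:
        yield record["name"]   # generator: produces one value per record

records = [{"name": "A"}, {"name": "B"}, {"name": "C"}]
print(parse_with_return(records))        # A
print(list(parse_with_yield(records)))   # ['A', 'B', 'C']
```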

    【Comments】:

    • Works like a charm. Thanks @Tomáš
    • One more thing: there are about 20 "uri" and "name" entries in that JSON. Is there a way to loop that code?