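Scrapy: How can I parse a JSON response?

[Question title]: Scrapy: How can I parse a JSON response?
[Posted]: 2015-05-25 11:14:11
[Question]:

I have a spider (shown below) that works well for scraping regular HTML pages. However, I want to add one more feature: I want to parse a JSON page.

This is what I want to do (done manually here, without Scrapy):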

import datetime

import requests

def main():
    user_agent = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'
    }

    # This is the URL that outputs JSON:
    externalj = 'http://www.thestudentroom.co.uk/externaljson.php?&s='

    # The end of the URL is a Unix timestamp (here: 15 minutes ago).
    # timestamp() needs Python 3.3+ but is portable, unlike strftime('%s').
    past = datetime.datetime.now() - datetime.timedelta(minutes=15)
    since = str(int(past.timestamp()))

    # This is the full URL:
    url = externalj + since

    # Make the HTTP GET request and decode the JSON body:
    tsr_data = requests.get(url, headers=user_agent).json()

    # The JSON data contains no URLs at all, so they must be formed manually
    # by concatenating the canonical thread link with each thread ID:
    link = 'http://www.thestudentroom.co.uk/showthread.php?t='
    return [link + str(post['threadid'])
            for post in tsr_data['discussions-recent']]

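This function returns correct links to the HTML pages I want to scrape (links to forum threads). It looks like I need to create my own request objects and send them to parse_link in the spider?

My question is: where should I put this code? I am confused about how to incorporate it into Scrapy. Do I need to create another spider?

Ideally, I would like it to work with the spider that I already have, but I am not sure whether that is possible.

I am very confused about how to implement this in Scrapy. Any advice would be appreciated!

My current spider looks like this: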

import scrapy
from tutorial.items import TsrItem
# Scrapy 1.0+ import paths (formerly scrapy.contrib.spiders / scrapy.contrib.linkextractors).
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class TsrSpider(CrawlSpider):
    name = 'tsr'
    allowed_domains = ['thestudentroom.co.uk']

    start_urls = ['http://www.thestudentroom.co.uk/forumdisplay.php?f=89']

    download_delay = 2
    user_agent = 'youruseragenthere'

    thread_xpaths = ("//tr[@class='thread  unread    ']",
            "//*[@id='discussions-recent']/li/a",
            "//*[@id='discussions-popular']/li/a")

    rules = [
        # Raw string so the regex backslashes are not treated as string escapes.
        Rule(LinkExtractor(allow=(r'showthread\.php\?t=\d+',),
                           restrict_xpaths=thread_xpaths),
             callback='parse_link', follow=True),
    ]

    def parse_link(self, response):
        for sel in response.xpath("//li[@class='post threadpost old   ']"):
            item = TsrItem()
            item['id'] = sel.xpath(
                "div[@class='post-header']//li[@class='post-number museo']/a/span/text()").extract()
            item['rating'] = sel.xpath(
                "div[@class='post-footer']//span[@class='score']/text()").extract()
            item['post'] = sel.xpath(
                "div[@class='post-content']/blockquote[@class='postcontent restore']/text()").extract()
            item['link'] = response.url
            item['topic'] = response.xpath(
                "//div[@class='forum-header section-header']/h1/span/text()").extract()
            yield item

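[Comments]:

Have you seen this previous SO post? It may answer your question.
Yes, I have seen it. It just cannot be merged with my current spider. According to the documentation, the parse method of a CrawlSpider should not be overridden.

Tags: python json scrapy

[Solution 1]:

It seems I have found a way to make it work. Perhaps my original post was unclear.

I want to parse a JSON response and then send out requests for Scrapy to process further.

I added the following to my spider: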

# json is needed to decode the response body.
import json

# A Request object is required.
from scrapy.http import Request

And:

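def parse_start_url(self, response):
    # Only the JSON start URL needs special handling here.
    if 'externaljson.php' in response.url:
        return self.make_json_links(response)
    # Other start URLs produce nothing extra from this hook.
    return []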

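parse_start_url seems to do what its name says: it parses the responses for the initial (start) URLs. Only the JSON page should be handled here.

So I need to add my special JSON URL alongside my HTML URLs: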

start_urls = ['http://tsr.com/externaljson.php', 'http://tsr.com/thread.html']
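
The start URLs above are static, while the JSON URL in the question ends in a Unix timestamp 15 minutes in the past. A minimal sketch of building that URL when the spider starts is given here; the helper name json_start_url and the constant JSON_ENDPOINT are hypothetical, and the result could be placed in start_urls or used wherever the spider creates its first requests (for example start_requests in Scrapy versions of that era).

```python
import datetime

# Endpoint taken from the question; the constant name is an assumption.
JSON_ENDPOINT = 'http://www.thestudentroom.co.uk/externaljson.php?&s='

def json_start_url(now, minutes=15):
    """Return the JSON URL whose `s` parameter is the Unix time `minutes` before `now`."""
    since = now - datetime.timedelta(minutes=minutes)
    # Whole seconds since the Unix epoch (Python 3.3+).
    return JSON_ENDPOINT + str(int(since.timestamp()))

if __name__ == '__main__':
    # Naive datetimes are interpreted as local time by timestamp().
    print(json_start_url(datetime.datetime.now()))
```

Passing the current time in as an argument, rather than calling now() inside the helper, keeps the function easy to check with a fixed date.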

Now I need to generate requests for the thread URLs from the JSON page's response:

def make_json_links(self, response):
    ''' Creates requests from the JSON page. '''
    # response.text (Scrapy 1.1+) supersedes body_as_unicode().
    data = json.loads(response.text)
    link = 'http://www.tsr.co.uk/showthread.php?t='
    for post in data['discussions-recent']:
        # Yield one request per thread; a return here would stop after the first.
        yield Request(url=link + str(post['threadid']))

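It seems to work now. However, I am sure this is a clumsy and inelegant way to achieve it. It feels a bit wrong.

It does seem to work, and it follows all the links I create from the JSON page. I am also not sure whether I should be using yield somewhere instead of return...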
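
On the yield versus return question, the difference can be seen without Scrapy at all. The sketch below uses plain URL strings as stand-ins for Request objects and a made-up payload; only the key names discussions-recent and threadid come from the post, and the function names are hypothetical.

```python
import json

THREAD_URL = 'http://www.thestudentroom.co.uk/showthread.php?t='

def first_link_only(body):
    """`return` inside the loop: the function exits at the first thread."""
    for post in json.loads(body)['discussions-recent']:
        return THREAD_URL + str(post['threadid'])

def all_links(body):
    """`yield` inside the loop: a generator producing one URL per thread."""
    for post in json.loads(body)['discussions-recent']:
        yield THREAD_URL + str(post['threadid'])

# Hypothetical payload; the thread IDs are invented for illustration.
sample_body = json.dumps({
    'discussions-recent': [{'threadid': 101}, {'threadid': 102}, {'threadid': 103}]
})

if __name__ == '__main__':
    print(first_link_only(sample_body))   # a single URL
    print(list(all_links(sample_body)))   # three URLs
```

In the spider, each yielded URL would be wrapped in Request(url=...). Scrapy callbacks may return any iterable of requests, so a generator fits there, and parse_start_url can keep returning the generator that make_json_links produces.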

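[Comments]:

[Solution 2]:

Do the links always follow the same format? Couldn't you create a new rule for the JSON links and use a separate parse_json function as the callback?

[Comments]:

The link format is the same. But the JSON page itself contains no links.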
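
The point about the JSON page having no links can be illustrated with a small sketch. A plain regex search stands in for the rule's allow pattern here (a real LinkExtractor works on links found in HTML, so this is only an approximation), and the payload is hypothetical apart from the key names discussions-recent and threadid.

```python
import json
import re

# Hypothetical JSON body; the thread ID and title are invented.
body = json.dumps({
    'discussions-recent': [{'threadid': 3351234, 'title': 'Example thread'}]
})

# Same pattern as the CrawlSpider rule in the question.
pattern = re.compile(r'showthread\.php\?t=\d+')

# Nothing matches: the payload holds bare thread IDs, not URLs.
found = pattern.findall(body)

# The links have to be assembled from the IDs instead.
built = ['http://www.thestudentroom.co.uk/showthread.php?t=' + str(post['threadid'])
         for post in json.loads(body)['discussions-recent']]

if __name__ == '__main__':
    print(found)
    print(built)
```

This is why a rule alone is unlikely to be enough for the JSON page: some callback has to decode the body and construct the thread URLs, as Solution 1 does.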