Posted: 2015-05-25 11:14:11
Question:
I have a spider (click to view the source) that works well for regular HTML page scraping. However, I want to add one more feature: I want to parse a JSON page.
Here is what I want to do (done manually here, without scrapy):
import requests, json
import datetime

def main():
    user_agent = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'
    }
    # This is the URL that outputs JSON:
    externalj = 'http://www.thestudentroom.co.uk/externaljson.php?&s='
    # Form the end of the URL; it is based on the time (unixtime):
    past = datetime.datetime.now() - datetime.timedelta(minutes=15)
    time = past.strftime('%s')
    # This is the full URL:
    url = externalj + time
    # Make the HTTP GET request:
    tsr_data = requests.get(url, headers=user_agent).json()
    # Iterate over the JSON data and form the URLs
    # (there are no URLs at all in the JSON data, they must be formed manually).
    # A URL is formed simply by concatenating the canonical link with a thread id:
    for post in tsr_data['discussions-recent']:
        link = 'www.thestudentroom.co.uk/showthread.php?t='
        return link + post['threadid']
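As an aside, the strftime('%s') used above is a platform-specific extension (it works via the C library on Linux/macOS but is not guaranteed elsewhere). A portable sketch of the same URL-forming step, reusing the endpoint from the code above:

```python
import datetime

def feed_url(minutes_ago=15):
    """Build the JSON feed URL from a unix timestamp a few minutes in the past.
    Portable alternative to past.strftime('%s'), which is platform-specific."""
    past = datetime.datetime.now() - datetime.timedelta(minutes=minutes_ago)
    unixtime = int(past.timestamp())  # works on any platform, Python 3.3+
    return 'http://www.thestudentroom.co.uk/externaljson.php?&s=%d' % unixtime
```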
This function returns the correct links (links to forum threads) to the HTML pages I want to scrape. It seems I need to create my own Request objects to send to parse_link in the spider?
My question is: where should I put this code? I am confused about how to incorporate it into scrapy. Do I need to create another spider?
Ideally I would like this to work alongside the spider that I already have, but I am not sure whether that is possible.
I am quite lost on how to implement this in scrapy. Any advice would be appreciated!
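One way to keep the JSON handling easy to test on its own is to isolate the URL-forming step in a plain function that a spider callback could later call. This is only a sketch; the key names ('discussions-recent', 'threadid') and the showthread base URL are taken from the manual code above:

```python
import json

def thread_urls(json_text):
    """Form one showthread URL per entry in the JSON feed
    (the feed itself contains no URLs)."""
    data = json.loads(json_text)
    base = 'http://www.thestudentroom.co.uk/showthread.php?t='
    # str() guards against the feed returning thread ids as numbers.
    return [base + str(post['threadid']) for post in data['discussions-recent']]
```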
My current spider looks like this:
import scrapy
from tutorial.items import TsrItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class TsrSpider(CrawlSpider):
    name = 'tsr'
    allowed_domains = ['thestudentroom.co.uk']
    start_urls = ['http://www.thestudentroom.co.uk/forumdisplay.php?f=89']

    download_delay = 2
    user_agent = 'youruseragenthere'

    thread_xpaths = ("//tr[@class='thread unread ']",
                     "//*[@id='discussions-recent']/li/a",
                     "//*[@id='discussions-popular']/li/a")

    rules = [
        Rule(LinkExtractor(allow=('showthread\.php\?t=\d+',),
                           restrict_xpaths=thread_xpaths),
             callback='parse_link', follow=True), ]

    def parse_link(self, response):
        for sel in response.xpath("//li[@class='post threadpost old ']"):
            item = TsrItem()
            item['id'] = sel.xpath(
                "div[@class='post-header']//li[@class='post-number museo']/a/span/text()").extract()
            item['rating'] = sel.xpath(
                "div[@class='post-footer']//span[@class='score']/text()").extract()
            item['post'] = sel.xpath(
                "div[@class='post-content']/blockquote[@class='postcontent restore']/text()").extract()
            item['link'] = response.url
            item['topic'] = response.xpath(
                "//div[@class='forum-header section-header']/h1/span/text()").extract()
            yield item
Comments:
-
Have you seen this previous SO post? Maybe it answers your question.
-
Yes, I saw it. It just cannot be merged with my current spider. According to the docs, CrawlSpider's parse method should not be changed.