[Posted]: 2017-03-12 14:05:48
[Question]:
I'm trying to use ItemLoader to combine three fields into an array of objects, like this:
[
    {
        site_title: "Some Site Title",
        anchor_text: "Click Here",
        link: "http://example.com/page"
    }
]
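As you can see in the JSON output below, it instead groups all values of each field together into a single item.
How should I handle this so that the JSON output contains the array I'm looking for?
Spider file: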
import scrapy
from linkfinder.items import LinkfinderItem
from scrapy.loader import ItemLoader


class LinksSpider(scrapy.Spider):
    name = "links"
    allowed_domains = ["wpseotest.com"]
    start_urls = ["https://wpseotest.com"]

    def parse(self, response):
        # One loader for the whole page, so every field collects all matches
        l = ItemLoader(item=LinkfinderItem(), response=response)
        l.add_xpath('site_title', '//title/text()')
        l.add_xpath('anchor_text', '//a//text()')
        l.add_xpath('link', '//a/@href')
        return l.load_item()
Items.py
import scrapy
from scrapy import item, Field


class LinkfinderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    site_title = Field()
    anchor_text = Field()
    link = Field()
JSON output:
[
{"anchor_text": ["Globex Corporation", "Skip to content", "Home", "About", "Globex News", "Events", "Contact Us", "3999 Mission Boulevard,\r", "San Diego, CA 92109", "This is a test scheduled\u00a0post.", "Test Title", "Globex Subsidiary Ice Cream Inc. Creates Chicken Wing\u00a0Flavor", "Globex Inc.", "\r\n", "Blog at WordPress.com."], "link": ["https://wpseotest.com/", "#content", "https://wpseotest.com/", "https://wpseotest.com/about/", "https://wpseotest.com/globex-news/", "https://wpseotest.com/events/", "https://wpseotest.com/contact-us/", "http://maps.google.com/maps?z=16&q=3999+mission+boulevard,+san+diego,+ca+92109", "https://wpseotest.com/2016/08/19/this-is-a-test-scheduled-post/", "https://wpseotest.com/2016/06/28/test-title/", "https://wpseotest.com/2015/10/18/globex-subsidiary-ice-cream-inc-creates-chicken-wing-flavor/", "https://wpseotest.wordpress.com", "https://wordpress.com/?ref=footer_blog"], "site_title": ["Globex Corporation \u2013 We make things better, or, sometimes, worse."]}
]
[Comments]:
- You can use a pipeline to build the output you want.
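The pipeline suggestion above could be sketched roughly as follows. This is only a minimal illustration, not a confirmed solution: the class name `PairLinksPipeline` and the `links` output key are hypothetical, and the pairing is only meaningful if `anchor_text` and `link` line up one-to-one. In the JSON output shown in the question they do not (15 text nodes against 13 hrefs, because `//a//text()` can return several text nodes per anchor), so the sketch leaves such items untouched rather than mispairing them. Note also that a pipeline handles one item at a time, so the records end up nested inside that single item.

```python
class PairLinksPipeline:
    """Hypothetical pipeline: turn grouped field lists into per-link records."""

    def process_item(self, item, spider):
        anchors = item.get("anchor_text") or []
        links = item.get("link") or []

        # Assumption: the lists are only safe to pair when they have the
        # same length; otherwise return the item unchanged.
        if len(anchors) != len(links):
            return item

        titles = item.get("site_title") or [""]
        site_title = titles[0].strip()

        # Build one record per (anchor text, href) pair.
        records = [
            {"site_title": site_title, "anchor_text": text.strip(), "link": href}
            for text, href in zip(anchors, links)
        ]
        # "links" is an illustrative key, not one defined in the question.
        return {"links": records}
```

To try it, the class would need to be registered under `ITEM_PIPELINES` in the project's settings (for example under a module path such as `linkfinder.pipelines.PairLinksPipeline`, which is assumed here). If one top-level JSON object per link is required, it is probably simpler to loop over the `<a>` elements in the spider and yield one item per element.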
Tags: python json web-scraping scrapy scrapy-spider