Scrapy ItemLoader item combining: answers

【Question title】: Scrapy ItemLoader item combining
【Posted】: 2017-03-12 14:05:48
【Question】:

I'm trying to use ItemLoader to combine three fields into an array of objects, one object per link, like this:

[
    {
        "site_title": "Some Site Title",
        "anchor_text": "Click Here",
        "link": "http://example.com/page"
    }
]

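As you can see in the JSON output below, the loader instead groups all the values of each field together into a single item.

How should I approach this so that the JSON output contains the array I'm looking for?

Spider file: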

import scrapy
from linkfinder.items import LinkfinderItem
from scrapy.loader import ItemLoader

class LinksSpider(scrapy.Spider):
    name = "links"
    allowed_domains = ["wpseotest.com"]
    start_urls = ["https://wpseotest.com"]

    def parse(self, response):

        l = ItemLoader(item=LinkfinderItem(), response=response)
        l.add_xpath('site_title', '//title/text()')
        l.add_xpath('anchor_text', '//a//text()')
        l.add_xpath('link', '//a/@href')
        return l.load_item()

Items.py

import scrapy
from scrapy import Field

class LinkfinderItem(scrapy.Item):
    # One field per value collected for a link.
    site_title = Field()
    anchor_text = Field()
    link = Field()

JSON output

[
{"anchor_text": ["Globex Corporation", "Skip to content", "Home", "About", "Globex News", "Events", "Contact Us", "3999 Mission Boulevard,\r", "San Diego, CA 92109", "This is a test scheduled\u00a0post.", "Test Title", "Globex Subsidiary Ice Cream Inc. Creates Chicken Wing\u00a0Flavor", "Globex Inc.", "\r\n", "Blog at WordPress.com."], "link": ["https://wpseotest.com/", "#content", "https://wpseotest.com/", "https://wpseotest.com/about/", "https://wpseotest.com/globex-news/", "https://wpseotest.com/events/", "https://wpseotest.com/contact-us/", "http://maps.google.com/maps?z=16&q=3999+mission+boulevard,+san+diego,+ca+92109", "https://wpseotest.com/2016/08/19/this-is-a-test-scheduled-post/", "https://wpseotest.com/2016/06/28/test-title/", "https://wpseotest.com/2015/10/18/globex-subsidiary-ice-cream-inc-creates-chicken-wing-flavor/", "https://wpseotest.wordpress.com", "https://wordpress.com/?ref=footer_blog"], "site_title": ["Globex Corporation \u2013 We make things better, or, sometimes, worse."]}
]

【Question comments】:

You can use a pipeline to build the output you want.

Tags: python json web-scraping scrapy scrapy-spider

【Solution 1】:

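Do you want to produce one item per link here?
To do that, find the nodes you care about (here, the `<a>` elements), iterate over them, and extract each node's fields relative to that node; you then combine those fields into a dict or scrapy.Item.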

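from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst  # older Scrapy: scrapy.loader.processors
from linkfinder.items import LinkfinderItem

def parse(self, response):
    # The page title is the same for every link, so extract it once.
    site_title = response.xpath("//title/text()").extract_first()
    # Build one item per <a> element instead of one item for the whole page.
    for link in response.xpath("//a"):
        # Name the item class explicitly and scope the loader to this <a> node.
        l = ItemLoader(item=LinkfinderItem(), selector=link)
        l.default_output_processor = TakeFirst()  # single values, not lists
        l.add_value('site_title', site_title)
        l.add_xpath('anchor_text', 'text()')
        l.add_xpath('link', '@href')
        yield l.load_item()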

Now you can run scrapy crawl links -o output.json (the spider above is named "links"), and you should get something like:

[
    {"site_title": "title",
     "anchor_text": "foo",
     "link": "http://foo.com"},
    {"site_title": "title",
     "anchor_text": "bar",
     "link": "http://bar.com"},
    ...
]
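
To see why scoping each loader to a single node matters, note that in the question's output `anchor_text` has 15 entries while `link` has only 13, so the two lists cannot simply be paired up by position. The sketch below reproduces that effect on a tiny, hypothetical page. It uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy selectors (an assumption made so the example runs without Scrapy installed); the markup and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the second link contains a <br/>, so it has two text nodes.
PAGE = """
<html><head><title>Some Site Title</title></head>
<body>
  <a href="/about/">About</a>
  <a href="/map">3999 Mission Boulevard,<br/>San Diego, CA 92109</a>
</body></html>
"""

root = ET.fromstring(PAGE)
anchors = root.findall(".//a")

# Page-wide extraction: one flat list per field, with nothing tying them together.
all_texts = [text for a in anchors for text in a.itertext()]
all_hrefs = [a.get("href") for a in anchors]
print(len(all_texts), len(all_hrefs))  # the lengths differ, so zip() would misalign

# Per-node extraction: every record is built from exactly one <a> element.
title = root.findtext(".//title")
records = [
    {
        "site_title": title,
        "anchor_text": " ".join(a.itertext()),  # merge this link's text nodes
        "link": a.get("href"),
    }
    for a in anchors
]
print(records)
```

The same reasoning applies to the loader-based version: passing `selector=link` keeps each XPath relative to one `<a>` element, so the fields of a record always come from the same link.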

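【Comments】:

I ended up using a pipeline; I could never get your example to work. I'm not sure whether something else is keeping it from working.
I've uploaded the pipeline version to a repo here: github.com/chrisfromthelc/scrapy-linkfinder, in case you spot anything obvious causing the problem that isn't in your suggested code.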

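【Solution 2】:

@Granitosaurus, that is what I did originally, before I even used items/ItemLoader: I simply built a dict with this approach.

Then I discovered item loaders and told myself I should use them (thinking they would perform better). Well, that got me the same result as the OP, and I ended up trying to figure out how to put the data back together the way I had it before, when I was just building my own dicts.

Now I'm leaning toward doing it the way I did before (the same as your approach), just with items and item loaders added.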

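This seems the easiest to understand. In my example, I have a set of products found by:

product_items = response.xpath('//div[contains(@class,"item-div")]')

I iterate over them and extract the product details, then simply put them into a dict.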

products = {}  # maps product name -> details
for item in product_items:
    # product_name, product_supplier, etc. are XPath strings defined elsewhere.
    name = item.xpath(product_name).extract_first()
    print(name)
    if name not in products:
        products[name] = {
            'product_supplier': item.xpath(product_supplier).extract_first(),
            'product_weight': item.xpath(product_weight).extract_first(),
            'product_image': item.xpath(product_image).extract_first(),
            'a_level': item.xpath(a_level).extract_first(),
            'b_level': item.xpath(b_level).extract_first(),
            # default='' avoids calling .strip() on None when no price is found
            'price_tag': item.xpath(price_tag).extract_first(default='').strip(),
        }

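Now, with items/ItemLoader, I would pass selector=link, and it should be an approach similar to what I did before. I'm wondering whether I wasted my time trying to get this working through items and item loaders. Have a great day, Grimmy.

I think I'll end up using a pipeline or feed exports. However, apart from making the code look clean, I haven't really found their benefits explained online.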
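
For readers weighing the pipeline option mentioned above, here is a minimal sketch of an item pipeline that collects every item and writes them out as one JSON array when the spider closes. It relies only on the `open_spider` / `process_item` / `close_spider` hooks that Scrapy calls on pipeline classes; the class name, the `links.json` file name, and the module path in the settings comment are assumptions for illustration, not taken from the repo linked earlier.

```python
import json


class JsonArrayWriterPipeline:
    """Collect all scraped items and write them as a single JSON array."""

    def __init__(self, path="links.json"):  # hypothetical output file name
        self.path = path
        self.items = []

    def open_spider(self, spider):
        # Start with an empty buffer for each crawl.
        self.items = []

    def process_item(self, item, spider):
        # dict(item) works for both scrapy.Item instances and plain dicts.
        self.items.append(dict(item))
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.items, f, indent=4, ensure_ascii=False)


# To enable it, settings.py would contain something like (module path assumed):
# ITEM_PIPELINES = {"linkfinder.pipelines.JsonArrayWriterPipeline": 300}
```

For plain JSON output this does roughly what the `-o output.json` feed export already provides, so a pipeline like this is mainly worth it when items need to be cleaned, filtered, or reshaped before they are written.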
