Scrapy spider code check: answers

【Question title】: scrapy spider code check
【Posted】: 2016-01-08 17:23:30
【Question】:

So I'm trying to use scrapy to crawl the pages of a website that match the SgmlLinkExtractor parameters below. This is what my spider looks like:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from desidime_sample.items import DesidimeItem
import string

class DesidimeSpider(CrawlSpider):
    name = "desidime"
    allowed_domains = ["desidime.com"]
    start_urls = ["http://www.desidime.com/forums/hot-deals-online"]
    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('''//td[not(@*)]/div
        [not(@*)]/a[not(@class)]/@href''')), callback="parse_items", follow=True),
)
    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        deals = hxs.select('''//div[@class='user-comment-text'][1]''')
        items = []
        for deals in deals:
            item = DesidimeItem()
            item["deal"]  = deals.select("//div[@class='user-comment-text'][1]/p/text()").extract()
            item["link"] = deals.select("//div[@class='user-comment-text'][1]/p[1]/a[1]/@href").extract()
            items.append(item)
        return items

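What I'm trying to do should be fairly obvious, but for some reason, when I tell the spider to crawl and export the text and links to a CSV file, I end up with: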

link,deal http://wwww.facebook.com/desidime, http://wwww.facebook.com/desidime, (the same for many more rows, then:) "," , "same url" , (the same for many more rows, then:) "link,deal"

So, can anyone tell me what the problem is? If you open scrapy shell "//corresponingcrawlruleurl" and run each of my XPaths above as response.xpath("xpath").extract(), you get the correct results.

【Comments】:

Tags: python web-scraping web-crawler scrapy

【Solution 1】:

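The problem is in the parse_items callback. When you iterate over the deals, the locators evaluated in the context of each deal have to be relative to it. In other words, start the XPath expressions inside the loop with a dot: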

def parse_items(self, response):
    for deal in response.xpath("//div[@class='user-comment-text'][1]"):
        item = DesidimeItem()

        item["deal"]  = deal.xpath(".//p/text()").extract()
        item["link"] = deal.xpath(".//p[1]/a[1]/@href").extract()

        yield item

(Note that I've also simplified the code.)

Here is the complete spider that I'm executing (it scrapes the text and the links, though I don't know what your desired output is):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class DesidimeItem(scrapy.Item):
    deal = scrapy.Field()
    link = scrapy.Field()


class DesidimeSpider(CrawlSpider):
    name = "desidime"
    allowed_domains = ["desidime.com"]
    start_urls = ["http://www.desidime.com/forums/hot-deals-online"]

    rules = [
        Rule(LinkExtractor(restrict_xpaths="//td[not(@*)]/div[not(@*)]/a[not(@class)]"),
             callback="parse_items",
             follow=True),
    ]

    def parse_items(self, response):
        for deal in response.xpath("//div[@class='user-comment-text'][1]"):
            item = DesidimeItem()

            item["deal"] = deal.xpath(".//p/text()").extract()
            item["link"] = deal.xpath(".//p[1]/a[1]/@href").extract()

            yield item

【Discussion】:

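I thought "//" was relative to the path in response.xpath()? Do I still need './/div[@class='user-comment-text'][1]' in the XPaths for item["deal"] and item["link"]?
@user3108815 You definitely need the dot. And yes, I also removed //div[@class='user-comment-text'][1]. Note that I haven't tested the provided code; it's just from experience. Hope it helps.
Quick question: what if, for example, I only want to get the text that comes before "PC" on this page link? Whatever comes after "PC" is useless on any deal page I scrape. How would I write the XPath for item["deal"]?
@user3108815 I think I need to run and test the spider. I hope to take a look later. Thanks.
@user3108815 I've updated the code and included the spider I'm executing. The extracted data may not be exactly what you want (you may need to adjust the spider or the locators), but it shows none of the problem described in the question. Hope it helps.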