Recursive Scrapy crawling issues: answers

【Question title】: Recursive Scrapy crawling issues
【Posted】: 2014-08-12 04:14:20
【Question description】:

I am trying to use a recursive spider to extract content from a site (e.g. web.com) whose links follow a particular structure. For example:

http://web.com/location/profile/12345678?qid=1403226397.5971&source=location&rank=21

http://web.com/location/profile/98765432?qid=1403366850.3991&source=locaton&rank=1

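As you can see, only the numeric parts of the URL change. I need to crawl all links that follow this URL structure and extract itemX, itemY, and itemZ.

I translated the link structure into the following regular expression: '\d+?qid=\d+.\d+&source=location&rank=\d+'. My Python Scrapy code is below, but when I run the spider it extracts nothing: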

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from web.items import webItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy import log
import re
import urllib

class web_RecursiveSpider(CrawlSpider):
    name = "web_RecursiveSpider"
    allowed_domains = ["web.com"]
    start_urls = ["http://web.com/location/profile",]

    rules = (Rule (SgmlLinkExtractor(allow=('\d+?qid=\d+.\d+&source=location&rank=\d+', ),) 
    , callback="parse_item", follow= True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//*')
        items = []

        for site in sites:
            item = webItem()
            item["itemX"] = site.select("//span[@itemprop='X']/text()").extract()
            item["itemY"] = site.select("//span[@itemprop='Y']/text()").extract()
            item["itemZ"] = site.select("//span[@itemprop='Z']/text()").extract()
            items.append(item)
        return items

【Question comments】:

Tags: python web-scraping scrapy web-crawler scrapy-spider

【Solution 1】:

You need to escape the ? character in the regular expression:

'\d+\?qid=\d+.\d+&source=location&rank=\d+'
    ^

Demo:

>>> import re
>>> url = "http://web.com/location/profile/12345678?qid=1403226397.5971&source=location&rank=21"
>>> print re.search('\d+?qid=\d+.\d+&source=location&rank=\d+', url)
None
>>> print re.search('\d+\?qid=\d+.\d+&source=location&rank=\d+', url)
<_sre.SRE_Match object at 0x10be538b8>

Note that you should also escape the dot, although leaving it unescaped does not affect the example you provided:

'\d+\?qid=\d+\.\d+&source=location&rank=\d+'
             ^
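
The reason the unescaped pattern fails is that `+?` is read as a lazy quantifier, so the regex expects "qid" immediately after the digits and never sees the literal question mark. A minimal Python 3 sketch of the same check follows, assuming raw strings and precompiled patterns; the variable names `unescaped` and `escaped` are illustrative only and do not come from the original post:

```python
import re

url = ("http://web.com/location/profile/12345678"
       "?qid=1403226397.5971&source=location&rank=21")

# Unescaped: `\d+?` means "one or more digits, lazily", so "qid" must
# directly follow a digit. The URL has "?" in between, so nothing matches.
unescaped = re.compile(r"\d+?qid=\d+.\d+&source=location&rank=\d+")

# Escaped: `\?` matches the literal question mark, `\.` the literal dot.
escaped = re.compile(r"\d+\?qid=\d+\.\d+&source=location&rank=\d+")

print(unescaped.search(url))   # None
match = escaped.search(url)
print(match.group(0))          # 12345678?qid=1403226397.5971&source=location&rank=21
```

Writing the pattern as a raw string (`r"..."`) also avoids having Python interpret the backslashes before the regex engine sees them.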

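【Comments】:

Thanks, that appears to be one of the problems. I tried your demo in the Scrapy shell, and it confirms that the regular expression should now work. However, when I run the whole code with "scrapy crawl web_RecursiveSpider -o items.csv -t csv", the CSV file still contains nothing. The cmd output shows "... ScrapyDeprecationWarning... .Myspider inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider... INFO: Crawled 0 pages"
I'm not sure whether the deprecation warning is the problem, since I have other working crawlers that show the same warning. Any suggestions or hints would be greatly appreciated.
@KubiK888 No, the deprecation warning is unrelated to the problem. Can you check whether the links are at least being followed? Put a print statement in parse_item and inside the loop. Thanks.
I tried your suggestion, and this is what I got. 1) Print statement after the last import and before 'class FourOneOne_RecursiveSpider(CrawlSpider)': OK 2) After 'rules = (Rule (SgmlLinkExtractor... follow= True),)' and before 'def parse_item': OK 3) After 'def parse_item(self, response): ... items = []' and before 'for site in sites:': NO 4) Inside the 'for site in sites:' loop, before 'return items': NO.
Hi @alecxe. No, I was waiting for further help after following your instructions; sorry, I'm new to Python. I did look through other similar threads on this site but couldn't find an answer. I also tried removing the regular expression entirely, leaving an empty () after 'allow', and then ran the spider. It appears to run and go through the general links on that site. So I guess there is still a problem with the regular expression? I tried different things with the directory and the regular expression, but nothing seemed to fix it.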
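
One way to narrow this down outside Scrapy is to check whether the start page contains any links that the allow pattern would accept. The following is a rough standard-library sketch, not Scrapy's own link extraction: it assumes the pattern is searched against the absolute URL, and `HrefCollector`, `matching_links`, and `SAMPLE_HTML` are hypothetical names and data made up for illustration. If the real start page yields an empty list here, the callback would never be called regardless of the regex fix.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# The escaped pattern from the answer above.
ALLOW = re.compile(r"\d+\?qid=\d+\.\d+&source=location&rank=\d+")


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def matching_links(html, base_url):
    """Return absolute URLs found in `html` that the ALLOW pattern accepts."""
    collector = HrefCollector()
    collector.feed(html)
    absolute = [urljoin(base_url, href) for href in collector.hrefs]
    return [u for u in absolute if ALLOW.search(u)]


# Made-up page content; replace with the real HTML of the start URL.
SAMPLE_HTML = """
<a href="/location/profile/12345678?qid=1403226397.5971&amp;source=location&amp;rank=21">A</a>
<a href="/about">About</a>
"""

print(matching_links(SAMPLE_HTML, "http://web.com/location/profile"))
```

With the sample markup, only the profile link passes the filter; the "/about" link is dropped.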