[Question title]: Scrapy SitemapSpider not working
[Posted]: 2017-09-02 13:34:46
[Question description]:

I'm trying to crawl the website of a well-known UK retailer and I get the following AttributeError:

    File "nl_env/lib/python3.6/site-packages/scrapy/spiders/sitemap.py", line 52, in _parse_sitemap
        for r, c in self._cbs:
    AttributeError: 'NlSMCrawlerSpider' object has no attribute '_cbs'

It may well be that I don't fully understand how SitemapSpider works - see my code below:

    import time
    from datetime import date

    from scrapy.spiders import SitemapSpider
    from selenium import webdriver

    # NlScrapeItem is the project's own Item subclass; its import path
    # is not shown in the question.


    class NlSMCrawlerSpider(SitemapSpider):
        name = 'nl_smcrawler'
        allowed_domains = ['newlook.com']
        sitemap_urls = ['http://www.newlook.com/uk/sitemap/maps/sitemap_uk_product_en_1.xml']
        sitemap_follow = ['/uk/womens/clothing/']

        # sitemap_rules = [
        #     ('/uk/womens/clothing/', 'parse_product'),
        # ]

        def __init__(self):
            self.driver = webdriver.Safari()
            self.driver.set_window_size(800, 600)
            time.sleep(2)

        def parse_product(self, response):
            driver = self.driver
            driver.get(response.url)
            time.sleep(1)

            # Collect products
            itemDetails = driver.find_elements_by_class_name('product-details-page content')

            # Pull features
            desc = itemDetails[0].find_element_by_class_name('product-description__name').text
            href = driver.current_url

            # Generate a product identifier
            identifier = href.split('/p/')[1].split('?comp')[0]
            identifier = int(identifier)

            # datetime
            dt = date.today()
            dt = dt.isoformat()

            # Price symbol removal and float conversion
            try:
                priceString = itemDetails[0].find_element_by_class_name('price product-description__price').text
            except:
                priceString = itemDetails[0].find_element_by_class_name('price--previous-price product-description__price--previous-price ng-scope').text
            priceInt = priceString.split('£')[1]
            originalPrice = float(priceInt)

            # discountedPrice logic
            try:
                discountedPriceString = itemDetails[0].find_element_by_class_name('price price--marked-down product-description__price').text
                discountedPriceInt = discountedPriceString.split('£')[1]
                discountedPrice = float(discountedPriceInt)
            except:
                discountedPrice = 'N/A'

            # NlScrapeItem
            item = NlScrapeItem()

            # Append product to NlScrapeItem
            item['identifier'] = identifier
            item['href'] = href
            item['description'] = desc
            item['originalPrice'] = originalPrice
            item['discountedPrice'] = discountedPrice
            item['firstSighted'] = dt
            item['lastSighted'] = dt

            yield item
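The identifier extraction in parse_product can be illustrated in isolation; the URL below is hypothetical, shaped the way the spider assumes ("/p/<digits>" followed by a "?comp..." query string):

```python
# Hypothetical product URL matching the format the spider expects.
href = "http://www.newlook.com/uk/womens/clothing/jeans/p/512345?comp=grid"

# Take everything between "/p/" and "?comp", then convert to int.
identifier = int(href.split('/p/')[1].split('?comp')[0])
print(identifier)  # → 512345
```

Note this raises IndexError if a URL lacks the "/p/" segment, so it only works on product pages.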

Also, please don't hesitate to ask for more details - see the link to the sitemap above and a link to the actual file in the Scrapy package that raises the error (link - github). Any help is sincerely appreciated.

Edit: one thought - looking at the 2nd link (from the Scrapy package), I can see that _cbs is initialised in the def __init__(self, *a, **kw): function. Is it my own init logic that throws it away?
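That suspicion can be checked with a minimal, Scrapy-free sketch of the same failure mode (class and attribute names here are illustrative, not from Scrapy itself):

```python
class Base:
    def __init__(self):
        self._cbs = []  # attribute created by the base-class initialiser


class Broken(Base):
    def __init__(self):            # overrides Base.__init__ without calling it,
        self.driver = 'safari'     # so self._cbs is never created


class Fixed(Base):
    def __init__(self):
        super().__init__()         # run Base.__init__ first,
        self.driver = 'safari'     # then add subclass state


print(hasattr(Broken(), '_cbs'))  # → False (AttributeError on any access)
print(hasattr(Fixed(), '_cbs'))   # → True
```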

[Question discussion]:

    Tags: python selenium-webdriver scrapy scrapy-spider


    [Solution 1]:

    There are two problems in your scraper. The first is the __init__ method:

    def __init__(self):
        self.driver = webdriver.Safari()
        self.driver.set_window_size(800, 600)
        time.sleep(2)
    

    By defining a new __init__ you have overridden the base class's __init__. Your init never calls it, so _cbs is never initialised. You can easily fix this by changing the init method as follows:

    def __init__(self, *a, **kw):
        super(NlSMCrawlerSpider, self).__init__(*a, **kw)
    
        self.driver = webdriver.Safari()
        self.driver.set_window_size(800, 600)
        time.sleep(2)
    

    Next, the SitemapSpider will by default send every response to the parse method, and you haven't defined a parse method at all. So I've added a simple one that prints the URL:

    def parse(self, response):
        print(response.url)
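    For reference, if you uncomment sitemap_rules, Scrapy compiles each pattern and routes a sitemap URL to the first rule whose regex matches (via re.search); URLs matching no rule fall through to parse. The routing can be sketched without Scrapy:

    ```python
    import re

    # Mirrors the commented-out sitemap_rules from the question:
    # (pattern, callback-name) pairs, tried in order.
    rules = [(re.compile(r'/uk/womens/clothing/'), 'parse_product')]

    def route(url, default='parse'):
        for pattern, callback in rules:
            if pattern.search(url):
                return callback
        return default  # no rule matched

    print(route('http://www.newlook.com/uk/womens/clothing/dress/p/1'))  # → parse_product
    print(route('http://www.newlook.com/uk/mens/shoes/p/2'))             # → parse
    ```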
    

    [Discussion]:
