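【Posted】: 2017-09-02 13:34:46
【Question】:
I am trying to crawl the website of a well-known UK retailer and get the following AttributeError:
nl_env/lib/python3.6/site-packages/scrapy/spiders/sitemap.py", line 52, in _parse_sitemap
    for r, c in self._cbs:
AttributeError: 'NlSMCrawlerSpider' object has no attribute '_cbs'
I have probably not fully understood how SitemapSpider works - please see my code below: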
class NlSMCrawlerSpider(SitemapSpider):
    name = 'nl_smcrawler'
    allowed_domains = ['newlook.com']
    sitemap_urls = ['http://www.newlook.com/uk/sitemap/maps/sitemap_uk_product_en_1.xml']
    sitemap_follow = ['/uk/womens/clothing/']

    # sitemap_rules = [
    #     ('/uk/womens/clothing/', 'parse_product'),
    # ]

    def __init__(self):
        self.driver = webdriver.Safari()
        self.driver.set_window_size(800,600)
        time.sleep(2)

    def parse_product(self, response):
        driver = self.driver
        driver.get(response.url)
        time.sleep(1)

        # Collect products
        itemDetails = driver.find_elements_by_class_name('product-details-page content')

        # Pull features
        desc = itemDetails[0].find_element_by_class_name('product-description__name').text
        href = driver.current_url

        # Generate a product identifier
        identifier = href.split('/p/')[1].split('?comp')[0]
        identifier = int(identifier)

        # datetime
        dt = date.today()
        dt = dt.isoformat()

        # Price Symbol removal and integer conversion
        try:
            priceString = itemDetails[0].find_element_by_class_name('price product-description__price').text
        except:
            priceString = itemDetails[0].find_element_by_class_name('price--previous-price product-description__price--previous-price ng-scope').text
        priceInt = priceString.split('£')[1]
        originalPrice = float(priceInt)

        # discountedPrice Logic
        try:
            discountedPriceString = itemDetails[0].find_element_by_class_name('price price--marked-down product-description__price').text
            discountedPriceInt = discountedPriceString.split('£')[1]
            discountedPrice = float(discountedPriceInt)
        except:
            discountedPrice = 'N/A'

        # NlScrapeItem
        item = NlScrapeItem()

        # Append product to NlScrapeItem
        item['identifier'] = identifier
        item['href'] = href
        item['description'] = desc
        item['originalPrice'] = originalPrice
        item['discountedPrice'] = discountedPrice
        item['firstSighted'] = dt
        item['lastSighted'] = dt

        yield item
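Also, please don't hesitate to ask for more details. See the link to the sitemap and the link to the actual file in the Scrapy package that raises the error (link - github). Many thanks for your help.
Edit: a thought
Looking at the 2nd link (from the Scrapy package), I can see that _cbs is initialized in the def __init__(self, *a, **kw): function - is having my own init logic throwing it off?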
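To make the suspicion above concrete, here is a minimal, framework-free sketch. The classes BaseSpider, BrokenSpider and FixedSpider and the helper try_parse are hypothetical stand-ins invented for illustration; they are not Scrapy's code and only mimic the pattern of a base class that creates _cbs in its own __init__.

```python
class BaseSpider:
    """Hypothetical stand-in for a base class; NOT Scrapy's actual code."""

    def __init__(self, *a, **kw):
        # The base class creates the attribute its other methods rely on.
        self._cbs = [("", "parse")]

    def parse_sitemap(self):
        # Fails with AttributeError if __init__ above never ran.
        return [callback for _, callback in self._cbs]


class BrokenSpider(BaseSpider):
    def __init__(self):
        # Overrides __init__ without calling the parent, so _cbs is never set.
        self.driver = "driver placeholder"


class FixedSpider(BaseSpider):
    def __init__(self, *a, **kw):
        # Run the parent's setup first, then add the subclass's own state.
        super().__init__(*a, **kw)
        self.driver = "driver placeholder"


def try_parse(spider):
    """Return the callbacks, or the error message if the attribute is missing."""
    try:
        return spider.parse_sitemap()
    except AttributeError as exc:
        return str(exc)


broken_result = try_parse(BrokenSpider())
fixed_result = try_parse(FixedSpider())
print(broken_result)
print(fixed_result)
```

If the same thing is happening in my spider, the equivalent change would presumably be to let __init__ accept *a, **kw and call super().__init__(*a, **kw) before creating the driver.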
【Comments】:
Tags: python selenium-webdriver scrapy scrapy-spider