[Posted]: 2016-01-08 17:23:30
[Problem description]:
So I'm trying to use Scrapy to crawl a site with the SgmlLinkExtractor parameters below. This is what my spider looks like:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from desidime_sample.items import DesidimeItem
import string

class DesidimeSpider(CrawlSpider):
    name = "desidime"
    allowed_domains = ["desidime.com"]
    start_urls = ["http://www.desidime.com/forums/hot-deals-online"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(),
             restrict_xpaths=('''//td[not(@*)]/div[not(@*)]/a[not(@class)]/@href''')),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        deals = hxs.select('''//div[@class='user-comment-text'][1]''')
        items = []
        for deals in deals:
            item = DesidimeItem()
            item["deal"] = deals.select("//div[@class='user-comment-text'][1]/p/text()").extract()
            item["link"] = deals.select("//div[@class='user-comment-text'][1]/p[1]/a[1]/@href").extract()
            items.append(item)
        return items
What I'm trying to do should be obvious, but for some reason, when I tell the spider to crawl and export the text and links to a CSV file, I end up with:

link,deal
http://wwww.facebook.com/desidime, http://wwww.facebook.com/desidime
(the same thing for many more rows, then:)
"","","same URL"
(the same thing for many more rows, then:)
"link,deal"
So, can anyone tell me what the problem is? If you run each of my XPaths above as response.xpath("xpath").extract() after scrapy shell "//correspondingcrawlruleurl", you get the correct results.
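(For context on one likely pitfall: inside a per-node loop, an XPath starting with // is evaluated against the whole document, not the current node, so every iteration returns the same result. A minimal sketch of the difference, using plain lxml for illustration since Scrapy's selectors wrap it; the sample HTML and class name here are just stand-ins:)

```python
from lxml import html

# Two comment blocks like the ones the spider loops over (illustrative data).
doc = html.fromstring(
    "<body>"
    "<div class='user-comment-text'><p>deal one</p></div>"
    "<div class='user-comment-text'><p>deal two</p></div>"
    "</body>"
)

divs = doc.xpath("//div[@class='user-comment-text']")

# Absolute path inside the loop: // restarts from the document root,
# so each iteration matches the same first node again.
absolute = [d.xpath("//div[@class='user-comment-text'][1]/p/text()") for d in divs]

# Relative path: ./ is anchored to the current node of the iteration.
relative = [d.xpath("./p/text()") for d in divs]

print(absolute)  # [['deal one'], ['deal one']]
print(relative)  # [['deal one'], ['deal two']]
```

(In Scrapy the equivalent relative form would be something like deals.select(".//p/text()") inside the loop, but that is a guess at the fix, not a verified answer to this question.)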
[Discussion]:
Tags: python web-scraping web-crawler scrapy