Scrapy Spider Xpath Image URL: Answers

【Question title】: Scrapy Spider Xpath Image Url
【Posted】: 2016-05-20 16:52:55
【Question description】:

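I have a Scrapy spider that takes a keyword as input and builds a search-results URL from it. It then crawls that URL and scrapes the desired values for each car result into an "item". I am trying to add to each yielded item the URL of the full-size image that accompanies that car in the vehicle results list.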

When I enter "honda" as the keyword, the specific URL being scraped is: Honda search results example

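I have not been able to work out the right way to write the XPath and then include whatever list of image URLs I get in the spider "item" that I yield in the last part of the code. Right now, when I run the lkq.py spider below with the command "scrapy crawl lkq -o items.csv -t csv" to save the items to a .csv file, the Picture column of items.csv is all zeros instead of image URLs.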

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import scrapy
from scrapy.shell import inspect_response
from scrapy.utils.response import open_in_browser

keyword = raw_input('Keyword: ')
url = 'http://www.lkqpickyourpart.com/DesktopModules/pyp_vehicleInventory/getVehicleInventory.aspx?store=224&page=0&filter=%s&sp=&cl=&carbuyYardCode=1224&pageSize=1000&language=en-US' % (keyword,)


class Cars(scrapy.Item):
    Make = scrapy.Field()
    Model = scrapy.Field()
    Year = scrapy.Field()
    Entered_Yard = scrapy.Field()
    Section = scrapy.Field()
    Color = scrapy.Field()
    Picture = scrapy.Field()


class LkqSpider(scrapy.Spider):
    name = "lkq"
    allowed_domains = ["lkqpickyourpart.com"]
    start_urls = (
        url,
    )

    def parse(self, response):
        picture = response.xpath(
            '//href=/text()').extract()
        section_color = response.xpath(
            '//div[@class="pypvi_notes"]/p/text()').extract()
        info = response.xpath('//td["pypvi_make"]/text()').extract()
        for element in range(0, len(info), 4):
            item = Cars()
            item["Make"] = info[element]
            item["Model"] = info[element + 1]
            item["Year"] = info[element + 2]
            item["Entered_Yard"] = info[element + 3]
            item["Section"] = section_color.pop(
                0).replace("Section:", "").strip()
            item["Color"] = section_color.pop(0).replace("Color:", "").strip()
            item["Picture"] = picture.pop(0).strip()
            yield item

【Question comments】:

Tags: python csv xpath scrapy scrapy-spider

【Solution 1】:

I don't quite understand why you are using an XPath like '//href=/text()'. I suggest reading an XPath tutorial first; here is a very good one.

If you want the URLs of all the images, I think this is what you are after:

pictures = response.xpath('//img/@src').extract()
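
As a rough illustration of what that expression selects, the sketch below uses the standard library's ElementTree so that it runs without Scrapy. ElementTree supports only a subset of XPath, so attributes are read with .get() rather than with /@src. The HTML fragment, the file paths, and the idea that each thumbnail is wrapped in a link to the full-size file are all invented for the example; the real page markup is not shown in the question.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed fragment standing in for the search-results page.
html = """
<table>
  <tr><td class="pypvi_make">HONDA</td>
      <td><a href="/images/full/1.jpg"><img src="/images/thumb/1.jpg"/></a></td></tr>
  <tr><td class="pypvi_make">HONDA</td>
      <td><a href="/images/full/2.jpg"><img src="/images/thumb/2.jpg"/></a></td></tr>
</table>
"""
root = ET.fromstring(html)

# Rough equivalent of //img/@src: every <img> element, then its src attribute.
thumbs = [img.get("src") for img in root.iterfind(".//img")]

# Rough equivalent of //a[img]/@href: links that directly wrap an image.
# Treating these as the full-size files is an assumption about the markup.
full_size = [a.get("href") for a in root.iterfind(".//a[img]")]

print(thumbs)
print(full_size)
```

In the spider itself the corresponding calls would be response.xpath('//img/@src').extract() and, only if the page is laid out like this assumed fragment, response.xpath('//a[img]/@href').extract().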

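Note that picture.pop(0).strip() only gives you one URL per call (the first one remaining in the list) and strips it. Remember that .extract() returns a list, so pictures now contains all the image links; just pick the ones you need from it.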
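
To make the list behaviour concrete, here is a minimal plain-Python sketch of pairing each extracted image URL with its row of car data. The values in info and pictures are hypothetical stand-ins for what .extract() might return, and the sketch assumes there is exactly one image per car, in the same order as the rows. On a real page //img/@src can also match logos and icons, which would need filtering out first.

```python
# Hypothetical extracted values; the real page's data is not shown in the question.
info = ["HONDA", "ACCORD", "1998", "5/1/2016",
        "HONDA", "CIVIC", "2001", "5/3/2016"]
pictures = ["http://example.com/img/1.jpg ", "http://example.com/img/2.jpg"]

# list.pop(0) removes and returns the FIRST element, shrinking the list.
queue = list(pictures)
first = queue.pop(0)

# Pairing by index leaves the lists untouched: car i owns info[4*i : 4*i + 4].
items = []
for i, picture_url in enumerate(pictures):
    make, model, year, entered = info[4 * i: 4 * i + 4]
    items.append({
        "Make": make,
        "Model": model,
        "Year": year,
        "Entered_Yard": entered,
        "Picture": picture_url.strip(),  # strip() drops stray whitespace
    })

for item in items:
    print(item)
```

Inside parse() the same loop would fill a Cars item and yield it instead of appending a dict to a list.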

【Comments】: