What is the correct way to handle multiple XPath selectors?

【Question title】: What would be the correct way to handle multiple XPath selectors?
【Posted】: 2015-02-07 18:34:31
【Question】:

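I've had only 3 hours of sleep and have been awake for over 20 hours, so please forgive any mistakes.

I'm trying to use multiple XPath selectors but can't get it to work. The code below is clearly flawed: the description is repeated, because it ends up taking the last item's description and assigning it to every item. Screenshot and code follow.

A visual representation of what I mean: http://puu.sh/fBjA9/da85290fc2.png

Code (Scrapy web crawler, Python): Spider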

 def parse(self, response):
     item = DmozItem()
     for sel in response.xpath("//td[@class='nblu tabcontent']"):
         item['title'] = sel.xpath("a/big/text()").extract()
         item['link'] = sel.xpath("a/@href").extract()
         for sel in response.xpath("//td[contains(@class,'framed')]"):
             item['description'] = sel.xpath("b/text()").extract()    
         yield item

Pipeline

 def process_item(self, item, spider):
        self.cursor.execute("SELECT * FROM data WHERE title= %s", item['title'])
        result = self.cursor.fetchall()
        if result:

            log.msg("Item already in database: %s" % item, level=log.DEBUG)
        else:
            self.cursor.execute(
               "INSERT INTO data(title, url, description) VALUES (%s, %s, %s)",
                    (item['title'][0], item['link'][0], item['description'][0]))
            self.connection.commit()

            log.msg("Item stored : %s" % item, level=log.DEBUG)
        return item

    def handle_error(self, e):
            log.err(e)

Thank you for reading and for any help.

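【Comments】:

Scrapy code doesn't mean much without seeing the HTML; is there a URL?
@HughBothwell Here it is, thanks. phpclasses.org/browse/class/130.html
@HughBothwell I'm going to sleep now and will be up in 6 hours. I've been awake for nearly 24 hours.

Tags: python mysql for-loop xpath scrapy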

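【Solution 1】:

The problem is that "//td[@class='nblu tabcontent']" and "//td[contains(@class,'framed')]" correspond one-to-one; you can't iterate over one inside the other, or, as you discovered, you only ever get the last item of the inner list.

Instead, try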

def parse(self, response):
    title_links  = response.xpath("//td[@class='nblu tabcontent']")
    descriptions = response.xpath("//td[contains(@class,'framed')]")
    for tl,d in zip(title_links, descriptions):
        item = DmozItem()
        item['title']       = tl.xpath("a/big/text()").extract()
        item['link']        = tl.xpath("a/@href").extract()
        item['description'] = d.xpath("b/text()").extract()    
        yield item
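
To see why pairing works where nesting does not, here is a minimal sketch in plain Python. The two lists are hypothetical stand-ins for the two selector results (the real page content is not reproduced here), and plain dicts stand in for DmozItem.

```python
# Hypothetical stand-ins for the two selector lists on the page.
title_links = ["Class A", "Class B", "Class C"]
descriptions = ["About A", "About B", "About C"]


def nested(titles, descs):
    """Inner loop nested in the outer one, as in the question."""
    for t in titles:
        item = {"title": t}
        for d in descs:
            # Overwritten on every pass; only the last description survives.
            item["description"] = d
        yield item


def paired(titles, descs):
    """Walk both lists in step, as in the zip-based parse() above."""
    for t, d in zip(titles, descs):
        yield {"title": t, "description": d}


nested_items = list(nested(title_links, descriptions))
paired_items = list(paired(title_links, descriptions))

print([i["description"] for i in nested_items])
print([i["description"] for i in paired_items])
```

Note that zip stops at the shorter input, so this approach relies on every title cell having exactly one matching description cell, in the same order. If a cell were missing, later pairs would be misaligned or dropped without any error.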

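【Comments】:

Wow, thank you, that really works. I guess the loop was just in the wrong place. Thanks a lot.

【Solution 2】:

I think you just need to move the item instantiation inside the for loop: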

def parse(self, response):
    for sel in response.xpath("//td[@class='nblu tabcontent']"):
        item = DmozItem()  # new item for each title/link cell
        item['title'] = sel.xpath("a/big/text()").extract()
        item['link'] = sel.xpath("a/@href").extract()
        for sel in response.xpath("//td[contains(@class,'framed')]"):
            item['description'] = sel.xpath("b/text()").extract()
        yield item
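
A rough sketch of what this change does and does not fix, using plain dicts as hypothetical stand-ins for DmozItem and made-up lists in place of the selector results. Creating the item inside the loop stops every yield from returning the same object, but the nested inner loop still overwrites the description until only the last one is left.

```python
titles = ["Class A", "Class B", "Class C"]   # hypothetical data
descs = ["About A", "About B", "About C"]    # hypothetical data


def shared_instance(titles, descs):
    item = {}  # one object reused for every yield, as in the question
    for t in titles:
        item["title"] = t
        for d in descs:
            item["description"] = d
        yield item


def fresh_instance(titles, descs):
    for t in titles:
        item = {}  # new object per title, as in this answer
        item["title"] = t
        for d in descs:
            item["description"] = d  # still ends on the last description
        yield item


# list() collects everything first, which makes the aliasing visible.
shared = list(shared_instance(titles, descs))
fresh = list(fresh_instance(titles, descs))

print([i["title"] for i in shared])        # every entry is the same object
print([i["title"] for i in fresh])         # titles are now distinct
print([i["description"] for i in fresh])   # descriptions are not
```

Whether the shared object causes visible trouble in a real crawl depends on when each yielded item is consumed; the repeated description comes from the inner loop either way.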

【Comments】:

Hmm, still not working.
I tried using //html as the main response.xpath. Code: hastebin.com/tinaduwezu.coffee Screenshot (this is what I get): puu.sh/fCyVD/6707bc2d82.png But it causes a MySQL error: ProgrammingError: Not all parameters were used in the SQL statement
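
One possible cause of that error, assuming the pipeline from the question is unchanged: the SELECT passes the whole item['title'] list as the parameter sequence, so once a broader selector such as //html makes extract() return several titles, there are more parameters than placeholders. The sketch below reproduces the same kind of mismatch with the standard-library sqlite3 module as a stand-in for MySQL. Note that sqlite3 uses ? placeholders rather than %s and words its error differently, and the table contents and titles here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE data (title TEXT, url TEXT, description TEXT)")

# Hypothetical: what extract() might return when the selector matches several nodes.
titles = ["Class A", "Class B"]

try:
    # One placeholder, two parameters: the driver rejects the call.
    cur.execute("SELECT * FROM data WHERE title = ?", titles)
    mismatch_error = None
except sqlite3.ProgrammingError as exc:
    mismatch_error = exc

# Passing exactly one value, wrapped in a one-element tuple, matches the placeholder.
cur.execute("SELECT * FROM data WHERE title = ?", (titles[0],))
rows = cur.fetchall()

print(type(mismatch_error).__name__)
print(rows)
```

If this is indeed the cause, the equivalent change in the pipeline would be to pass (item['title'][0],) to the SELECT, mirroring how the INSERT already indexes each field with [0].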