Scrapy yields items to DB with /u

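【Question title】: Scrapy yields items to DB with /u
【Posted】: 2017-05-18 04:37:48
【Question description】:

I am running a spider that saves data to DynamoDB. I searched StackOverflow for an answer but could not find one. The spider saves stamp and title to DynamoDB with all sorts of extra characters, such as /u and brackets. The url is saved correctly, without any extra characters. How can I save stamp and title without them?

My spider: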

def parse(self, response):

    for item in response.xpath("//li[contains(@class, 'river-block')]"):
        url = item.xpath(".//h2[@class='post-title']/a/@href").extract()[0]
        stamp = item.xpath(".//time/@datetime").extract()
        yield scrapy.Request(url, callback=self.get_details, meta={'stamp': stamp})

def get_details(self, response):
        article = ArticleItem()
        article['title'] = response.xpath("//h1/text()").extract()
        article['url'] = format(shortener.short(response.url))
        article['stamp'] = response.meta['stamp']
        yield article

My pipeline file:

import boto3


class DynamoDBStorePipeline(object):

    def process_item(self, item, spider):
        dynamodb = boto3.resource('dynamodb', region_name="us-west-2")

        table = dynamodb.Table('TechCrunch')

        table.put_item(
            Item={
                'url': str(item['url']),
                'title': str(item['title']),
                'stamp': str(item['stamp']),
            }
        )
        return item

Sample output:
url: link (this one is fine)
stamp: [u'2017-05-17 08:06:47']
title: [u'title']

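【Comments】:

Please provide sample values for title, stamp, and url together with the expected output, and the URL of the site you are scraping.
Use extract_first() instead of extract(). If that does not solve your problem, please update the post with the link you are scraping.
@JkShaw Works like a charm. Thank you, sir. Could you write it up as an answer so I can upvote and accept it?
You're welcome.

Tags: python scrapy amazon-dynamodb nosql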

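【Solution 1】:

In Scrapy, extract() returns the text of every match as a list. If you only want the first matched element, call extract_first() on the selector instead; it returns that single string rather than a list.

In your case, the stamp and title selectors need to use extract_first(), as shown here: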
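To see where the brackets come from: the pipeline calls str() on whatever the spider stored, and str() of a list is the list's repr. The following is a minimal sketch that uses a plain Python list as a stand-in for real selector output (the value is taken from the sample output in the question). Under Python 2, the repr additionally carries the u prefix on each string.

```python
# A plain list standing in for what extract() returns: one string per match.
extracted = ['2017-05-17 08:06:47']

# str() of a list is its repr, so the brackets and quotes become part of the text.
as_saved_before = str(extracted)

# extract_first() hands back the first match itself (or None when nothing matched);
# taking element 0 of a non-empty list mimics that here.
first = extracted[0] if extracted else None
as_saved_after = str(first)

print(as_saved_before)  # ['2017-05-17 08:06:47'] on Python 3
print(as_saved_after)   # 2017-05-17 08:06:47
```

This is also why url was already clean: it was taken with extract()[0], so a single string, not a list, reached the pipeline.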

def parse(self, response):
    for item in response.xpath("//li[contains(@class, 'river-block')]"):
        # extract_first() returns a single string (or None), not a list
        url = item.xpath(".//h2[@class='post-title']/a/@href").extract_first()
        stamp = item.xpath(".//time/@datetime").extract_first()
        yield scrapy.Request(url, callback=self.get_details, meta={'stamp': stamp})

def get_details(self, response):
    article = ArticleItem()
    # A plain string here means str() in the pipeline adds no brackets
    article['title'] = response.xpath("//h1/text()").extract_first()
    article['url'] = format(shortener.short(response.url))
    article['stamp'] = response.meta['stamp']
    yield article
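If some fields still arrive as lists, a similar cleanup could be done on the pipeline side before calling put_item. The sketch below is only an illustration: first_value is a hypothetical helper, not part of Scrapy or boto3, and the item is a plain dict carrying the values from the question's sample output.

```python
def first_value(value, default=None):
    """Return the first element of a list or tuple, or the value itself otherwise."""
    if isinstance(value, (list, tuple)):
        return value[0] if value else default
    return value


# Stand-in for an item whose title and stamp were filled with extract().
item = {'title': ['title'], 'stamp': ['2017-05-17 08:06:47'], 'url': 'link'}

# Unwrap every field so only plain strings would be passed on to the database.
cleaned = {key: first_value(val) for key, val in item.items()}
print(cleaned)
```

An empty list yields the default (None here), so the pipeline would still need to decide how to handle a field that matched nothing.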
