【Question title】: Scrapy: How to clean response?
【Posted】: 2016-12-24 12:12:11
【Question description】:

Here is my code snippet. I am trying to scrape a website with Scrapy and then store the data in Elasticsearch for indexing.

def parse(self, response):
    for news in response.xpath('head'):
        yield {
            'pagetype': news.xpath('//meta[@name="pagetype"]/@content').extract(),
            'description': news.xpath('//div[@class="module__content"]/*/node()/text()').extract(),
        }

My problem is with the value that gets saved in the 'description' field.

    [u'\n              \n              ', u'"For\n              many of us what we eat on Christmas day isn\'t what we would usually consume and\n              that\u2019s perfectly ok," Dr said.', u'"However\n              it is not uncommon for festive season celebrations to begin in November and\n              continue well in to the New Year.', u'"So\n              if health is on the agenda, being mindful about what we put into our bodies\n              with a balanced approach, throughout the whole festive season, is important."', u"Dr\n              , a lecturer at School\n              Sciences, said balancing fresh, healthy food with being physically active was a\n              good start.", u'"Whatever\n              the celebration, try to limit processed foods, often high in fat, sugar and\n              salt," she said.', u'"Taking\n              time during holidays to prepare food and make the most of fresh ingredients is\n              often a much healthier option than relying on convenience foods and take away.', u'"Being\n              mindful about going back for seconds is important too.\xa0 We don\u2019t need to eat until we feel\n              uncomfortable and eating the foods we enjoy doesn\'t necessarily mean we need to\n              eat copious amounts."', u"Dr\n             own healthy tips and substitutes for the Christmas season\n              include:", u'But\n              just because Dr  is a dietitian, doesn\u2019t mean she doesn\u2019t enjoy a\n              Christmas treat or two.', u'"I\n              would have to say my sister in law\'s homemade rocky road is my favourite\n              festive treat. She makes it every Christmas day and it gets better each year," she\n              said.', u'"I\n              also enjoy a summer cocktail every so often during the festive season and a\n              mojito would be one of my favourites on Christmas day. 
We make it with extra\n              mint from the garden which is a nice, fresh addition.', u'"Rather\n              than focusing on food avoidance, moderation is the best approach.', u'"There\n              are definitely some more healthy choices and some less healthy options when it\n              comes to the typical Christmas day menu, but it\'s more important to be mindful\n              of a healthy, balanced diet throughout the festive period, rather than avoiding\n              specific foods on one day of the year."', u'\n                ', u'\n              \n                ', u'\n                ', u'\n              \n                ', u'\n              ', u'\n                ', u'\n                        ', u'\n                        ', u'\n                        ', u'\n                    ', u'\n            ', u'Related News', u'\n          ', u'\n        ', u'\n          ', u'\n        ', u'\n          ', u'\n        ', u'Search for related news']

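There are a lot of spaces, newlines, and "u" letters...

How can I process this further so that it contains only plain text, without the extra spaces, newline (\n) codes, and "u" letters?

I read that BeautifulSoup works well with Scrapy, but I couldn't find any example of how to integrate Scrapy with BeautifulSoup. I'm open to any other approach as well. Any help is much appreciated.

Thanks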

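【Comments】:

  • The u is only information that the list holds unicode text. If you print a single element from the list, you will see the text without the u.
  • To be clear, you just want to remove the newlines and whitespace from these strings?
  • Hi glS, yes, that's right.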

Tags: python scrapy scrapy-pipeline


【Solution 1】:

You can strip the extra whitespace and newlines from the strings in the list with the approach shown in this answer:

[' '.join(item.split()) for item in list_of_strings]

where list_of_strings is the list of strings you gave as an example.
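
As a minimal sketch of how that line behaves, here it is applied to a shortened sample of the list from the question. The three sample strings are trimmed for illustration and are not the full output.

```python
# Shortened sample standing in for the 'description' list from the question.
list_of_strings = [
    u'\n              \n              ',
    u'"Whatever\n              the celebration, try to limit processed foods," she said.',
    u'Related News',
]

# str.split() with no arguments splits on any run of whitespace
# (spaces, newlines, tabs), so re-joining with a single space
# collapses each run and trims both ends.
cleaned = [' '.join(item.split()) for item in list_of_strings]

for item in cleaned:
    print(repr(item))
```

Strings that contained only whitespace come out as empty strings, so they still occupy a slot in the list.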

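As for the "u letters", you don't need to worry about them. They only indicate that the strings are unicode strings; Python 2 shows this prefix when it prints the list, and it is not part of the text. See e.g. this question on the matter.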
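
A small sketch of the difference between printing the list and printing one element. The sample values are taken from the question's output; the variable names are made up for illustration.

```python
items = [u'Related News', u'Search for related news']

# Printing the whole list shows the repr of each element: quotes,
# escape sequences and, on Python 2, the u prefix.
print(items)

# Printing a single element shows only the text itself.
print(items[0])

# Escapes such as \u2019 are also just a display form: the string
# holds one character (a right single quotation mark), not six.
quote = u'that\u2019s perfectly ok'
print(len(quote))
```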

【Comments】:

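  • Thanks, how do I use this? I ran ' '.join(myString.split()) in the Scrapy shell and got this error: AttributeError: 'list' object has no attribute 'split'
  • If you save the list of strings from your question in a variable list_of_strings, you can just run the line above and get the same list, with whitespace and newlines stripped from its elements.
  • Great, that worked. Just one small tweak... it now shows '', '', '', '', '', '', '', '', '', '', '', in some places. How do I get rid of these empty quotes? Also, the apostrophes have changed to \u2019.
  • To filter out the empty values, you can use the built-in filter function (docs.python.org/2/library/functions.html#filter), passing bool as the first argument.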
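
Putting the whitespace cleanup and the filter suggestion together, one possible sketch looks like this. The helper name clean_strings and the sample list raw are hypothetical, not from the thread.

```python
def clean_strings(strings):
    """Collapse whitespace in each string and drop strings that end up empty."""
    collapsed = (' '.join(s.split()) for s in strings)
    # filter(bool, ...) keeps only truthy items; '' is falsy.
    # list() is needed on Python 3, where filter returns an iterator.
    return list(filter(bool, collapsed))


# Hypothetical sample in the shape of the question's output.
raw = [u'\n   \n   ', u'"Being\n     mindful', u'\n  ', u'Related News']
print(clean_strings(raw))
```

In the spider from the question, such a helper could presumably wrap the result of .extract() for the 'description' field before the item is yielded.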