【Question Title】: Crawler returning crawled results with \n's, how to get rid of these
【Posted】: 2022-01-06 15:33:58
【Description】:

The purpose of this crawler is to return all the text on a page along with the links. We are trying to store the scraped data in a JSON file, but the output of the JSON file contains a lot of noise, such as \n characters.

Here is the Scrapy spider:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
#from crawl.items import SpideyItem



class crawler(CrawlSpider):
    name = 'spidey'
    start_urls = ['https://quotes.toscrape.com/page/']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )
    custom_settings = {
        'DEPTH_LIMIT': 1,
        'DEPTH_PRIORITY': 1,
    }

    def parse_item(self, response):

        item = dict()
        item['url'] = response.url.strip()
        item['title'] = response.meta['link_text'].strip()
        # extracting basic body
        item['body'] = '\n'.join(response.xpath(
            '//text()').extract())
        # or better just save whole source
        #item['source'] = response.body

        yield item

Sample output in the JSON file:

{"url": "https://quotes.toscrape.com/tag/miracles/page/1/", "title": "miracles", "body": "\n\n\n\t\n\n\t\nQuotes to Scrape\n\n    \n\n    \n\n\n\n\n\n    \n\n        \n\n            \n\n                \n\n                    \nQuotes to Scrape\n\n                \n\n            \n\n            \n\n                \n\n                \n                    \nLogin\n\n                \n                \n\n            \n\n        \n\n    \n\n\nViewing tag: \nmiracles\n\n\n\n\n    \n\n\n    \n\n        \n\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d\n\n        \nby \nAlbert Einstein\n\n        \n(about)\n\n        \n\n        \n\n            Tags:\n            \n \n            \n            \ninspirational\n\n            \n            \nlife\n\n            \n            \nlive\n\n            \n            \nmiracle\n\n            \n            \nmiracles\n\n            \n        \n\n    \n\n\n    \n\n        \n\n            \n            \n        \n\n    \n\n    \n\n    \n\n        \n            \nTop Ten tags\n\n            \n            \n\n            \nlove\n\n            \n\n            \n            \n\n            \ninspirational\n\n            \n\n            \n            \n\n            \nlife\n\n            \n\n            \n            \n\n            \nhumor\n\n            \n\n            \n            \n\n            \nbooks\n\n            \n\n            \n            \n\n            \nreading\n\n            \n\n            \n            \n\n            \nfriendship\n\n            \n\n            \n            \n\n            \nfriends\n\n            \n\n            \n            \n\n            \ntruth\n\n            \n\n            \n            \n\n            \nsimile\n\n            \n\n            \n        \n    \n\n\n\n\n    \n\n    \n\n        \n\n            \n\n                Quotes by: \nGoodReads.com\n\n            \n\n            \n\n            
    Made with \n\u2764\n by \nScrapinghub\n\n            \n\n        \n\n    \n\n\n\n"},

How can this be fixed?

【Comments】:

    Tags: python json scrapy


    【Solution 1】:

    One possible answer to your question is to use replace:

    >>> "A lot of newline\n\n\n    characters\n\n\n\n\n\n\n\n\n\n\n\n\n".replace("\n", "")
    'A lot of newline    characters'
    

    Cleaning scraped content, however, usually involves more than that. You generally don't want to remove all newlines unconditionally, and another issue can be excessive whitespace (as in your example). For those cases you may want to use a regular expression instead. A very simple example:

    >>> import re
    >>> s = "A lot of newline\n\n\n  \t\t  characters\n\n\n\n\n\n\n\n\n\n\n\n\n"
    >>> re.sub(r"(\s)+", r"\1", s)
    'A lot of newline characters\n'
    

    The expression above is simple, but regular expressions can become very sophisticated: they can encode many rules and replace many lines of code when cleaning, searching, or validating text data.
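    If the goal is readable page text rather than preserved layout, a line-based cleanup also works well. Here is a minimal sketch (clean_text is a hypothetical helper, not part of Scrapy):

```python
def clean_text(raw: str) -> str:
    """Strip each line of scraped text and drop the blank ones."""
    lines = (line.strip() for line in raw.splitlines())
    # Re-join only the non-empty lines with single newlines.
    return "\n".join(line for line in lines if line)

sample = "\n\n\tQuotes to Scrape\n\n    \n\nViewing tag: \nmiracles\n\n"
print(clean_text(sample))
```

    Unlike the regex above, this discards blank lines entirely, which tends to suit full-page //text() dumps like the one in the question.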

    【Discussion】:

    • Sorry, I'm a bit lost. Could you clarify exactly where and what I need to change to clean the whole JSON file? Because of this I also can't load the data into Elasticsearch.
    • Since you want to clean item["body"], which is a string, you can do re.sub(r"(\s)+", r"\1", item["body"]). Note that it does not replace in place, so you need to (re)assign the return value. By the way, you probably want return instead of yield in your example — your function only returns one item.
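    Putting the comment above into practice, the substitution could be wrapped in a small helper and its return value reassigned in the callback (a sketch; clean_body is a hypothetical name):

```python
import re

def clean_body(raw: str) -> str:
    # Collapse each run of whitespace to its last character.
    # re.sub returns a new string, so the result must be reassigned.
    return re.sub(r"(\s)+", r"\1", raw)

# In the spider's parse_item, the reassignment would look like:
#     raw = '\n'.join(response.xpath('//text()').extract())
#     item['body'] = clean_body(raw)
```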