【Posted】: 2022-01-06 15:33:58
【Problem description】:
The purpose of this crawler is to return all the text and links on a page. We are trying to store the scraped data in a JSON file, but the output contains redundancy, such as runs of \n characters.
Here is the Scrapy spider:
import itemloaders
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
#from crawl.items import SpideyItem

class crawler(CrawlSpider):
    name = 'spidey'
    start_urls = ['https://quotes.toscrape.com/page/']
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )
    custom_settings = {
        'DEPTH_LIMIT': 1,
        'DEPTH_PRIORITY': 1,
    }

    def parse_item(self, response):
        item = dict()
        item['url'] = response.url.strip()
        item['title'] = response.meta['link_text'].strip()
        # extracting basic body
        item['body'] = '\n'.join(response.xpath('//text()').extract())
        # or better just save whole source
        #item['source'] = response.body
        yield item
Sample output in the JSON file:
{"url": "https://quotes.toscrape.com/tag/miracles/page/1/", "title": "miracles", "body": "\n\n\n\t\n\n\t\nQuotes to Scrape\n\n \n\n \n\n\n\n\n\n \n\n \n\n \n\n \n\n \nQuotes to Scrape\n\n \n\n \n\n \n\n \n\n \n \nLogin\n\n \n \n\n \n\n \n\n \n\n\nViewing tag: \nmiracles\n\n\n\n\n \n\n\n \n\n \n\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d\n\n \nby \nAlbert Einstein\n\n \n(about)\n\n \n\n \n\n Tags:\n \n \n \n \ninspirational\n\n \n \nlife\n\n \n \nlive\n\n \n \nmiracle\n\n \n \nmiracles\n\n \n \n\n \n\n\n \n\n \n\n \n \n \n\n \n\n \n\n \n\n \n \nTop Ten tags\n\n \n \n\n \nlove\n\n \n\n \n \n\n \ninspirational\n\n \n\n \n \n\n \nlife\n\n \n\n \n \n\n \nhumor\n\n \n\n \n \n\n \nbooks\n\n \n\n \n \n\n \nreading\n\n \n\n \n \n\n \nfriendship\n\n \n\n \n \n\n \nfriends\n\n \n\n \n \n\n \ntruth\n\n \n\n \n \n\n \nsimile\n\n \n\n \n \n \n\n\n\n\n \n\n \n\n \n\n \n\n Quotes by: \nGoodReads.com\n\n \n\n \n\n Made with \n\u2764\n by \nScrapinghub\n\n \n\n \n\n \n\n\n\n"},
How can this be fixed?
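One possible direction (a sketch, not a confirmed answer from the thread): the redundancy comes from `//text()` returning every text node, including whitespace-only nodes between tags. Normalizing each node and dropping the empty ones before joining removes most of the noise. The helper name `clean_body` below is hypothetical, not from the original spider:

```python
import re

def clean_body(text_nodes):
    """Collapse internal whitespace in each node, drop empty nodes,
    then join the remainder with single newlines."""
    cleaned = [re.sub(r'\s+', ' ', t).strip() for t in text_nodes]
    return '\n'.join(t for t in cleaned if t)

# Example with the kind of nodes //text() yields on this page:
nodes = ['\n\n\t', 'Quotes to Scrape', ' \n ', 'Login']
print(clean_body(nodes))  # → Quotes to Scrape\nLogin
```

In the spider this would replace the body assignment with `item['body'] = clean_body(response.xpath('//text()').extract())`; restricting the XPath to `//body//text()` would additionally skip content in the `<head>`.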
【Discussion】: