【Question title】: Remove the unicode from the output of JSON using scrapy
【Posted】: 2015-03-07 10:54:12
【Question】:
import scrapy

from ex.items import ExItem


class reddit(scrapy.Spider):
    """Spider that scrapes post fields from the reddit front page."""

    name = "dmoz"
    allowed_domains = ["reddit.com"]
    start_urls = ["http://www.reddit.com/"]

    def parse(self, response):
        # Each field holds a list of every match found anywhere on the page.
        item = ExItem()
        item["title"] = response.xpath('//p[contains(@class,"title")]/a/text()').extract()
        item["rank"] = response.xpath('//span[contains(@class,"rank")]/text()').extract()
        item["votes_dislike"] = response.xpath('//div[contains(@class,"score dislikes")]/text()').extract()
        item["votes_unvoted"] = response.xpath('//div[contains(@class,"score unvoted")]/text()').extract()
        item["votes_likes"] = response.xpath('//div[contains(@class,"score likes")]/text()').extract()
        item["video_reference"] = response.xpath('//a[contains(@class,"thumbnail may-blank")]/@href').extract()
        item["image"] = response.xpath('//a[contains(@class,"thumbnail may-blank")]/img/@src').extract()
        return item

I am able to convert this to JSON, but the output contains a bullet character. How can I remove it and still keep the JSON format?

【Comments】:

  • I want to remove it from my JSON output completely.

Tags: python json python-2.7 web-scraping scrapy


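【Solution 1】:

The page contains hidden elements that you cannot see in the browser, but Scrapy sees them, so XPath expressions run against the whole response match them too.

You only need to search for the data in the relevant part of the page (the div with id="siteTable"):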

def parse(self, response):
    # make a selector and search the fields inside it
    sel = response.xpath('//div[@id="siteTable"]')

    item = ExItem()
    item["title"] = sel.xpath('.//p[contains(@class,"title")]/a/text()').extract()
    item["rank"] = sel.xpath('.//span[contains(@class,"rank")]/text()').extract()
    item["votes_dislike"] = sel.xpath('.//div[contains(@class,"score dislikes")]/text()').extract()
    item["votes_unvoted"] = sel.xpath('.//div[contains(@class,"score unvoted")]/text()').extract()
    item["votes_likes"] = sel.xpath('.//div[contains(@class,"score likes")]/text()').extract()
    item["video_reference"] = sel.xpath('.//a[contains(@class,"thumbnail may-blank")]/@href').extract()
    item["image"] = sel.xpath('.//a[contains(@class,"thumbnail may-blank")]/img/@src').extract()
    return item

After testing, this is what I get, for example for votes_likes:

 'votes_likes': [u'5340',
                 u'4041',
                 u'4080',
                 u'5055',
                 u'4385',
                 u'4784',
                 u'3842',
                 u'3734',
                 u'4081',
                 u'3731',
                 u'4580',
                 u'5279',
                 u'2540',
                 u'4345',
                 u'2068',
                 u'3715',
                 u'3249',
                 u'4232',
                 u'4025',
                 u'522',
                 u'2993',
                 u'2789',
                 u'3529',
                 u'3450',
                 u'3533'],
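
Note that the u'...' prefix in the listing above is only how Python 2 prints unicode strings; it does not appear in serialized JSON. If a stray bullet still ends up in a field after narrowing the search, it could be filtered out before export. The following is a minimal sketch, not part of the original answer: the helper name clean_values and the sample data are hypothetical, and it assumes the unwanted symbol is the "•" character (U+2022).

```python
import json

# Assumption: the stray symbol is the bullet character "•" (U+2022).
BULLET = u"\u2022"


def clean_values(values):
    """Remove bullets and surrounding whitespace; drop strings left empty."""
    cleaned = []
    for value in values:
        text = value.replace(BULLET, u"").strip()
        if text:
            cleaned.append(text)
    return cleaned


# Hypothetical extracted data, shaped like the item fields above.
raw_item = {
    "rank": [u"1", BULLET, u"2"],
    "votes_likes": [u"5340", u"4041"],
}

clean_item = {key: clean_values(values) for key, values in raw_item.items()}

# The serialized JSON has no u'' prefixes and no bullet.
output = json.dumps(clean_item, sort_keys=True)
print(output)
```

In a real project this kind of cleanup would more naturally live in an item pipeline than in the spider, but the filtering logic would be the same.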
